Quantcast
Does anyone use Python to sort data? - Beyond.ca - Car Forums
Results 1 to 14 of 14

Thread: Does anyone use Python to sort data?

  1. #1
    Join Date
    Sep 2016
    Location
    Calgary, Ab
    My Ride
    2021 Zonda CRV
    Posts
    1,015
    Rep Power
    19

    Default Does anyone use Python to sort data?

    Is anyone on here that uses Python to sort large data?
    Ive asked chat GPT to code some things. Simple things like sort my data smallest to largest. Highlight duplicates and move them. It struggles.
    Can anyone look at my code and give pointers? Other than forums, can anyone recommend a place that can help and give direction?

  2. #2
    Join Date
    Jul 2004
    Location
    YYC
    Posts
    4,320
    Rep Power
    86

    Default

    Personally I think SQL might be easier. But not a big Python user.

    https://learnsql.com/blog/sql-basics-cheat-sheet/

  3. #3
    Join Date
    Apr 2006
    Location
    Cowtown
    My Ride
    10' 4Runner SR5
    Posts
    6,373
    Rep Power
    60

    Default

    I liked using DataCamp, but lots of other online classes.

    Surprised ChatGPT didn't give you answers, this is pretty basic functioning. I assume you're wanting to work outside of Excel?
    Ultracrepidarian

  4. #4
    Join Date
    Apr 2004
    Location
    Calgary
    Posts
    2,093
    Rep Power
    44

    Default

    Quote Originally Posted by tonytiger55 View Post
    This quote is hidden because you are ignoring this member. Show Quote
    Is anyone on here that uses Python to sort large data?
    Ive asked chat GPT to code some things. Simple things like sort my data smallest to largest. Highlight duplicates and move them. It struggles.
    Can anyone look at my code and give pointers? Other than forums, can anyone recommend a place that can help and give direction?
    How large is your dataset? Under a million rows?

  5. #5
    Join Date
    Sep 2016
    Location
    Calgary, Ab
    My Ride
    2021 Zonda CRV
    Posts
    1,015
    Rep Power
    19

    Default

    Quote Originally Posted by sabad66 View Post
    This quote is hidden because you are ignoring this member. Show Quote
    How large is your dataset? Under a million rows?
    Yeah, 2.5 to 5 thousand rows.

    Its mostly data cleaning and sorting. I want to automate it. The data has many headers.
    Column A has the policy number as a example. But this is a mixture of letters and numbers. I want the data descending by policy number. Then if they are duplicates I want the policy numbers and the data associated with it pulled into a separate sheet.

    There could be 2-5 of the same policy numbers. So A2-A6. Then if the policy numbers match. As a example check column CD2 -CD6. If one of them has a value and the others of that policy are null. Then pull the most recent one (column D) row into a separate sheet.

    Or if there are 2 rows that have a value, one is red(negative) and other is positive. Both equal to 0. Then I need that the recent version (by date Colum D) usually the top row pulled into a separate sheet.
    There are many more layers to this. Its a lot of if this then do that. if not then do this. I can write out the logic tree to have it all automated as some of the data sorting logic can start to be very complicated. This can take 2-4 days as there are business rules to the logic.
    I can write out each of the processes broken down. Then compile them so the data is cleaned up in one big dump. Anything outside of that I we can do manually.
    But VBA crashes just copying duplicates. So VBA is out.

    I tried chat gpt. Python gets confused a lot as the data is a mixture of letters, numbers and numbers with decimals. It can't seem to associate the rows with the Policy number, headers etc. So I can't use Openpxl. So its Pandas. But I can't even get it to sort column A descending. Pandas does not use headers. It uses indexing. If I tell it to use headers. Then i have to write them all out. Ok fine. But then the rest of my data is mixture (letters, numbers and numbers with decimals). Then it can't seem to take the data, clean it in memory and paste it.
    For example. On a basic script, It cant take the first row (A2) sort descending with the rest of the data (It leaves the rest of the data) and it cant group the duplicates together. Ok fine. So I converted my data to CSV file. Told the code to go to column name. Now the stupid code can't read the headers. The header name matches the code. But I got a feeling I may have to re-write it in excel as it gets weird like that. The debugging is driving me nutts.

    Its like dealing with 5 personalities that don't talk to each other. After debugging the code. It then recommends what it did at the start. like wtf? Its been awesome for Excel formulas.

    We have SQL, but it would mean uploading to the SQL database. We use ACCESS, but that would be writing script. But I don't want to touch that. We have Azure I think, but I would need to get access and I have no idea how that works.

    I thought Python would be simple. Just amend the code per logical task(s)
    I ask GPT and then copy and modify the code to suite my data (I cant use it direct for security reasons). Its not so simple.
    Im trying to figure out the most efficient way to clean this up and I thought it best to talk to someone to guide me.

    Edit. Breakthrough! I actually managed to get Python to sort and cut the duplicates to a separate file. Now another 100 steps to go.
    Last edited by tonytiger55; 12-05-2023 at 03:29 PM.

  6. #6
    Join Date
    Mar 2010
    Location
    Calgary
    My Ride
    Ioniq 5
    Posts
    1,810
    Rep Power
    46

    Default

    Quote Originally Posted by tonytiger55 View Post
    This quote is hidden because you are ignoring this member. Show Quote
    Python gets confused a lot as the data is a mixture of letters, numbers and numbers with decimals. It can't seem to associate the rows with the Policy number, headers etc.
    It might be worth learning how to use regular expressions as your search criteria instead. It may make your life a lot easier.

    For example
    100-G-01
    101-G02
    102.G.3
    103-G.1

    Find:
    [0-9]+[.-]?[A-Z][.-]?[0-9]+

    That string will 'find' all of those lines regardless of them using a dash, period, or none.

    You can also add saved portions as part of a replace using () and $
    Find:
    ([0-9]+)[.-]?([A-Z])[.-]?([0-9]+)
    Replace:
    $1-$2-$3

    Will output:
    100-G-01
    101-G-02
    102-G-3
    103-G-1
    Which can help clean all your data in an intelligent way.



    Or in your example
    100-A1
    100.A2
    100A3
    101-A1
    101.A2
    102A1

    Find:
    100[.-]?A[0-9]+

    Would only select the first 3 policies under 100 regardless of how the remainder is punctuated or which A1-3 it is.

    https://cheatography.com/davechild/c...r-expressions/

    A quick explaination:
    [A-Z] Look for any letter character
    [0-9] Looks for any numerical character
    + means for as many digits that fall within that criteria. So 10 or 1011100 are all still numerical
    [A-Z0-9]+ would mean grab any combination of letters & numbers until you hit a different character.
    [.-] means find a . or a -
    ? means optional so look for a . or - but also okay if there isn't one
    ( ) means save the result from this section
    $1 means replace with the result from the first set of ( ) in the search criteria

    If you want to get complex you can also include return characters to span multiple rows in your search criteria but that's a bit more advanced.
    Last edited by pheoxs; 12-06-2023 at 11:52 AM.

  7. #7
    Join Date
    Jul 2004
    Location
    YYC
    Posts
    4,320
    Rep Power
    86

    Default

    I hate regex. But did find some good sites to help build the queries. But lost my best regex search site in bookmarks when my work pc cratered.
    Anyone have some good ones?
    Regex builder and regex 101.

  8. #8
    Join Date
    Sep 2012
    Location
    Calgary, AB
    Posts
    1,654
    Rep Power
    87

    Default

    The old saying goes:

    “Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.”

  9. #9
    Join Date
    May 2002
    Location
    Calgary, Alberta
    My Ride
    (maah raahde)
    Posts
    5,799
    Rep Power
    44

    Default

    If OP can't figure out python, he's sure in the hell not going to figure out regex.

  10. #10
    Join Date
    Sep 2016
    Location
    Calgary, Ab
    My Ride
    2021 Zonda CRV
    Posts
    1,015
    Rep Power
    19

    Default

    Quote Originally Posted by pheoxs View Post
    This quote is hidden because you are ignoring this member. Show Quote
    It might be worth learning how to use regular expressions as your search criteria instead. It may make your life a lot easier.

    For example
    100-G-01
    101-G02
    102.G.3
    103-G.1

    Find:
    [0-9]+[.-]?[A-Z][.-]?[0-9]+

    That string will 'find' all of those lines regardless of them using a dash, period, or none.

    You can also add saved portions as part of a replace using () and $
    Find:
    ([0-9]+)[.-]?([A-Z])[.-]?([0-9]+)
    Replace:
    $1-$2-$3

    Will output:
    100-G-01
    101-G-02
    102-G-3
    103-G-1
    Which can help clean all your data in an intelligent way.



    Or in your example
    100-A1
    100.A2
    100A3
    101-A1
    101.A2
    102A1

    Find:
    100[.-]?A[0-9]+

    Would only select the first 3 policies under 100 regardless of how the remainder is punctuated or which A1-3 it is.

    https://cheatography.com/davechild/c...r-expressions/

    A quick explaination:
    [A-Z] Look for any letter character
    [0-9] Looks for any numerical character
    + means for as many digits that fall within that criteria. So 10 or 1011100 are all still numerical
    [A-Z0-9]+ would mean grab any combination of letters & numbers until you hit a different character.
    [.-] means find a . or a -
    ? means optional so look for a . or - but also okay if there isn't one
    ( ) means save the result from this section
    $1 means replace with the result from the first set of ( ) in the search criteria

    If you want to get complex you can also include return characters to span multiple rows in your search criteria but that's a bit more advanced.
    Thanks for this. I appreciate you writing this out.
    So its just breaking up the problem in the Column into specific components and taking on each one.

    Sorry I should have been more specific. The policy column does not have dashes or decimals. Just letters and numbers. The other columns have addresses. Specific columns have numbers or numbers with decimals. So one policy number may be listed 3 or 7 times. Three of them had different addresses listed in another column. Another column defines this by location (i.e 1, 2, 3).

    So the first thing is to remove duplicate Polices from the main data and put it into a separate sheet. Organize the data by Policy number, Date (most recent) and location. and a particular variable order in another column (i.e new, reissue, cancellation etc) So the data is grouped. In that order so the data displays in the most recent format in terms of transactions on our system
    Example:
    12345 1/1/2004 10 Downing Street London. Endorsement 100 100 True
    12345 1/1/2004 10 Downing Street London. CXL -100 -100 False
    12345 1/1/2004 Buckingham Palace London, New False
    12345 1/1/2004 Buckingham Palace London, Endorsement False
    12345 1/1/2004 Buckingham Palace London, Endorsement
    ABC77 2/2/2004 Silverstone GP, Nothampton New
    ABC77 2/2/2004 Silverstone GP, Nothampton Cancelation

    The first two have the same Policy number and address. So I need to check five other columns. If the numbers in these columns match (many will not). Top is a postive number, bottom row is negative. Then I need the rows removed(or put into another sheet or file for manual checking).
    In Excel i can get a formula to give me a true or false value by summing the numbers as they would equate to 0. So if the Policy numbers and address match. Then sum the top and bottom cell of X Y Z columns. The problem I have is the formula counts down and not up. So the second row checks against the bottom and gives a false value. Both of rows 1 and 2 should be True. Then the next part of the code would be to remove them to different sheet. Ideally i don't want to do this in excel.

    That is one of the problems within the data. There are many many many....I am trying to automate some of them as the volume of data is large.

    I converted the data to a CVS file. Then I used Pandas. I managed to get GPT to write a Python script to look at two specific headers for duplicates and put the headers and rows into a output file. I even got it to sort from ascending. With this done then its easy. Then I realised. My data has negative numbers (premium). CVS file does not save negative numbers. I need that data. So im back to square one. No pandas. OpenPyxl is going to give me headaches.

  11. #11
    Join Date
    Mar 2010
    Location
    Calgary
    My Ride
    Ioniq 5
    Posts
    1,810
    Rep Power
    46

    Default

    Quote Originally Posted by tonytiger55 View Post
    This quote is hidden because you are ignoring this member. Show Quote
    In Excel i can get a formula to give me a true or false value by summing the numbers as they would equate to 0. So if the Policy numbers and address match. Then sum the top and bottom cell of X Y Z columns. The problem I have is the formula counts down and not up. So the second row checks against the bottom and gives a false value. Both of rows 1 and 2 should be True. Then the next part of the code would be to remove them to different sheet. Ideally i don't want to do this in excel.
    Are you trying to use MATCH and then compare to the row below it? If so instead try to change it to COUNTIF() > 1. Here's an example:
    How to identify duplicates in Excel: find, highlight, count, filter (ablebits.com)

    Then you can have a flag for duplicate lines on both rows.

  12. #12
    Join Date
    Sep 2016
    Location
    Calgary, Ab
    My Ride
    2021 Zonda CRV
    Posts
    1,015
    Rep Power
    19

    Default

    Quote Originally Posted by pheoxs View Post
    This quote is hidden because you are ignoring this member. Show Quote
    Are you trying to use MATCH and then compare to the row below it? If so instead try to change it to COUNTIF() > 1. Here's an example:
    How to identify duplicates in Excel: find, highlight, count, filter (ablebits.com)

    Then you can have a flag for duplicate lines on both rows.
    I used Countif. It does not fully work. I used the formula to count the occurrences on the polices. Those that have 2. Then check on the 2 if the addresses match on those duplicates (important). If they do, then check the specific numerical columns. This did not work and drove me nutts. Then I realized 100 and -100 are not duplicates. Duh..
    So I got it to sum instead. If the value equals 0. Then Flag as True. This worked. But it does not work for the second row. As the formula references the row below it.
    That being said I don't want to do this in excel.

    I tried again in Python and had a breakthrough. I can move the duplicate data and preserve the negative numbers (not a CVS file). So I think I can build on this with ease now. But the code does some weird buggy stuff. The output file messes up the date format. Not a big thing, just annoying.

  13. #13
    Join Date
    Mar 2010
    Location
    Calgary
    My Ride
    Ioniq 5
    Posts
    1,810
    Rep Power
    46

    Default

    Is it always a positive and negative? You could just use ABS() to remove the negative sign and then do the compare instead of a sum.

  14. #14
    Join Date
    Sep 2016
    Location
    Calgary, Ab
    My Ride
    2021 Zonda CRV
    Posts
    1,015
    Rep Power
    19

    Default

    Thats not a bad idea. But I need the negative sign in there to check the data and its gets uploaded after to the main system. Some polices have negative values. Those are ok. I don't touch those rows.
    Its only the negative numbers that are exactly the same as row above it and share the same policy number. These are what we call in/out transactions. These need to be highlighted and taken out.

Similar Threads

  1. Data usage while travelling. Roaming off. Cell Data off. Still uses data. iPhone 6+

    By benyl in forum Computers, Consoles, and other Electronics
    Replies: 41
    Latest Threads: 02-04-2015, 05:01 PM
  2. Anyone know how to use there data on the Android?

    By SoSimpo in forum Computers, Consoles, and other Electronics
    Replies: 4
    Latest Threads: 08-22-2010, 05:49 PM
  3. FS: Jungle Carpet Python

    By andrewhamill in forum Miscellaneous Buy/Sell/Trade
    Replies: 4
    Latest Threads: 02-09-2004, 10:35 PM
  4. Jungle Carpet Python

    By andrewhamill in forum General
    Replies: 2
    Latest Threads: 02-04-2004, 11:03 PM
  5. Good Ole Monty Python

    By RickDaTuner in forum General
    Replies: 5
    Latest Threads: 01-30-2003, 07:49 AM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •