The Archive Base

Editorial Team
Asked: May 27, 2026

I have to compare massive database dumps in xls format to parse for changes

I have to compare massive database dumps in xls format to find day-to-day changes (gross, right?). I’m currently doing this in the most backwards way possible: using xlrd to turn the xls files into csv files, and then running diffs to compare them.

Since it’s a database, and I have no way of knowing whether the data stays in the same order after something like an item deletion, I can’t compare line x to line x between the files, so lists of tuples or something similar don’t seem to make much sense to me.

I basically need to find every single change that could have happened on any row REGARDLESS of that row’s position in the actual dump, and the only real “lookup” I could think of is SKU as a unique ID (it’s a product table from an ancient DB system), but I need to know a lot more than just products being deleted or added, because they could modify pricing or anything else in that item.

Should I be using sets? And once I’ve loaded 75+ thousand lines of this database file into a “set”, is my ram usage going to be hysterical?

I thought about loading in each row of the xls as a big concatenated string to add to a set. Is that an efficient idea? I could basically get a list of rows that differ between sets and then go back after those rows in the original db file to find my actual differences.

I’ve never worked with data parsing on a scale like this. I’m mostly just looking for any advice to not make this process any more ridiculous than it has to be, and I came here after not really finding something that seemed specific enough to my case to feel like good advice. Thanks in advance.

1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 7:24 pm

    I use sets for exactly that purpose, but try to keep the number of items down to several million at a time. As S.Lott said, 75,000 is nothing. I use a similar system for populating database tables from imported data while issuing only the minimum number of INSERTs and DELETEs required to “patch” the table from the results of the last import. The basic algorithm is along the lines of:

    lastset = set()  # Populate with one tuple per row from the last run
    thisset = set()  # Populate with one tuple per row from the current run
    
    # Remove rows that aren't in the current result set.
    # A modified row is a different tuple, so it shows up here and below.
    for row in lastset - thisset:
        deleteentry(row[0])  # Where row[0] is the unique key for the table
    
    # Add rows that weren't in the last result set
    for row in thisset - lastset:
        insertentry(row)
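
The same set arithmetic can also separate added, deleted, and modified rows when each row is keyed by SKU. What follows is only a minimal sketch, assuming the dumps have already been converted to CSV with a header row and a column named `SKU`, and that both dumps share the same columns. The column names, the helper names `load_rows` and `diff_dumps`, and the sample data are all hypothetical.

```python
import csv
import io


def load_rows(csv_text, key="SKU"):
    """Return {key value: {column: value}} for one dump (assumes unique keys)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: dict(row) for row in reader}


def diff_dumps(old, new):
    """Classify keys as added, deleted, or changed, regardless of row order."""
    old_keys, new_keys = set(old), set(new)
    added = new_keys - old_keys
    deleted = old_keys - new_keys
    changed = {}
    for sku in old_keys & new_keys:
        # Collect only the columns whose values differ for this key.
        fields = {
            name: (old[sku][name], new[sku].get(name))
            for name in old[sku]
            if old[sku][name] != new[sku].get(name)
        }
        if fields:
            changed[sku] = fields
    return added, deleted, changed


# Hypothetical sample dumps; note the rows are in a different order.
yesterday = "SKU,name,price\nA1,Widget,9.99\nB2,Gadget,4.50\nC3,Gizmo,1.25\n"
today = "SKU,name,price\nC3,Gizmo,1.25\nA1,Widget,10.49\nD4,Doohickey,7.00\n"

added, deleted, changed = diff_dumps(load_rows(yesterday), load_rows(today))
print("added:", added)
print("deleted:", deleted)
print("changed:", changed)
```

Keying a dict by SKU keeps one entry per product, so the per-column comparison only runs for SKUs present in both dumps.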
    

    To convince yourself that set operations are quick and sufficiently RAM efficient, try this:

    >>> a = set(range(10000000))
    >>> b = set(range(100, 10000100))
    >>> len(a - b)
    100
    >>> len(b - a)
    100
    

    That takes about 1.25GB on my Mac. That’s a lot of RAM, true, but it is for two sets of 10 million integers each, which is over 100 times the number of entries you’re working with. The set operations run in well under a second here.
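
To check the cost at roughly the scale in the question, the same experiment can be repeated with row tuples instead of integers. This is a sketch under stated assumptions: the three-column row layout and the `make_rows` helper are made up for illustration, and the timing and memory figures will vary by machine and Python version.

```python
import time
import tracemalloc

ROWS = 75_000  # roughly the row count mentioned in the question


def make_rows(start, stop):
    """Build a set of hypothetical (sku, name, price) tuples."""
    return {(f"SKU{i:06d}", f"item {i}", f"{i % 1000}.99") for i in range(start, stop)}


tracemalloc.start()
lastset = make_rows(0, ROWS)
thisset = make_rows(100, ROWS + 100)  # 100 rows dropped, 100 new rows

t0 = time.perf_counter()
removed = lastset - thisset
added = thisset - lastset
elapsed = time.perf_counter() - t0

# get_traced_memory() returns (current, peak) in bytes.
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"removed={len(removed)} added={len(added)}")
print(f"set differences took {elapsed:.4f}s, peak traced memory {peak / 1e6:.1f} MB")
```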
