I have to compare massive database dumps in xls format to find changes day to day (gross, right?). I’m currently doing this in the most backwards way possible: using xlrd to turn the xls files into csv files, then running diffs to compare them.
Since it’s a database, and I don’t have a means of knowing if the data ever stays in the same order after something like an item deletion, I can’t do a compare x line to x line between the files, so doing lists of tuples or something wouldn’t make the most sense to me.
I basically need to find every single change that could have happened on any row REGARDLESS of that row’s position in the actual dump, and the only real “lookup” I could think of is SKU as a unique ID (it’s a product table from an ancient DB system), but I need to know a lot more than just products being deleted or added, because they could modify pricing or anything else in that item.
Should I be using sets? And once I’ve loaded 75+ thousand lines of this database file into a “set”, is my RAM usage going to be hysterical?
I thought about loading in each row of the xls as a big concatenated string to add to a set. Is that an efficient idea? I could basically get a list of rows that differ between sets and then go back after those rows in the original db file to find my actual differences.
I’ve never worked with data parsing on a scale like this. I’m mostly just looking for any advice to not make this process any more ridiculous than it has to be, and I came here after not really finding something that seemed specific enough to my case to feel like good advice. Thanks in advance.
I use sets for exactly that purpose, but try to keep the number of items down to several million at a time. As S.Lott said, 75,000 is nothing. I use a similar system for populating database tables from imported data while issuing only the minimum number of INSERTs and DELETEs required to “patch” the table from the results of the last import. The basic algorithm is along the lines of:
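The original listing is not shown here. As a minimal sketch of that idea, assuming each row is loaded as a tuple with the SKU in the first column (the function name `diff_rows`, the `key_index` parameter, and the sample rows are all hypothetical), the patch can be computed with two set differences and then split by key into deletions, insertions, and changes:

```python
def diff_rows(old_rows, new_rows, key_index=0):
    """Compare two dumps regardless of row order.

    Rows are assumed to be hashable tuples whose element at key_index
    (the SKU in the question) is unique within each dump.
    """
    old_set = set(old_rows)
    new_set = set(new_rows)

    # Rows present in only one dump; identical rows drop out here.
    only_old = old_set - new_set
    only_new = new_set - old_set

    # Index the leftover rows by key so modified rows can be paired up.
    old_by_key = {row[key_index]: row for row in only_old}
    new_by_key = {row[key_index]: row for row in only_new}

    deleted = {k: old_by_key[k] for k in old_by_key.keys() - new_by_key.keys()}
    inserted = {k: new_by_key[k] for k in new_by_key.keys() - old_by_key.keys()}
    changed = {
        k: (old_by_key[k], new_by_key[k])
        for k in old_by_key.keys() & new_by_key.keys()
    }
    return deleted, inserted, changed


if __name__ == "__main__":
    # Made-up sample data: (SKU, name, price)
    yesterday = [("A1", "Widget", 9.99), ("B2", "Gadget", 4.50), ("C3", "Gizmo", 1.25)]
    today = [("C3", "Gizmo", 1.25), ("A1", "Widget", 10.49), ("D4", "Doohickey", 7.00)]

    deleted, inserted, changed = diff_rows(yesterday, today)
    print("deleted:", deleted)
    print("inserted:", inserted)
    print("changed:", changed)
```

For a database patch, `deleted` and `inserted` map onto DELETEs and INSERTs; a `changed` row could be handled as a DELETE followed by an INSERT, or as an UPDATE if that suits the table better.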
To convince yourself that set operations are quick and sufficiently RAM efficient, try this:
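The snippet that originally followed is not shown here. A minimal sketch of that kind of experiment, assuming plain integers stand in for rows and a default size of one million (the function name `time_set_difference` and both of those choices are assumptions, so the memory use and timing will not match the figures quoted in this answer), might look like this:

```python
import time


def time_set_difference(n):
    """Build two large, mostly overlapping sets and time the differences.

    Integers stand in for rows here; real rows (tuples or strings)
    will use more memory per entry.
    """
    shift = n // 100  # 1% of entries differ on each side
    old = set(range(n))
    new = set(range(shift, n + shift))

    start = time.perf_counter()
    removed = old - new
    added = new - old
    elapsed = time.perf_counter() - start

    return len(removed), len(added), elapsed


if __name__ == "__main__":
    n_removed, n_added, seconds = time_set_difference(1_000_000)
    print(f"removed={n_removed} added={n_added} time={seconds:.3f}s")
```

Raising the argument and watching the process in a system monitor gives a rough feel for how memory scales with the number of entries.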
That takes about 1.25GB on my Mac. That’s a lot of RAM, true, but it is probably for over 100 times the number of entries you’re working with. The set operations run in well under a second here.