I’ve got a data source that provides a list of objects and their properties (a CSV file, but that doesn’t matter). Each time my program runs, it needs to pull a new copy of the list of objects, compare it to the list of objects (and their properties) stored in the database, and update the database as needed.
Dealing with new objects is easy: the data source gives each object a sequential ID number, so you just check the highest ID in the new data against the database, and you’re done. I’m looking for suggestions for the other cases – when some of an object’s properties have changed, or when an object has been deleted.
A naive solution would be to pull all the objects from the database, compute the set differences between old and new (and compare properties on the overlap), and then examine those results, but that seems like it wouldn’t be very efficient once the sets get large. Any ideas?
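For what it’s worth, here is a minimal sketch of that naive approach, assuming each object can be reduced to an ID mapped to its properties (the dicts below are illustrative placeholders, not my real data):

```python
# Old state from the database and new state from the data source,
# keyed by each object's sequential ID.
old = {1: "a", 2: "b", 3: "c"}   # database copy
new = {1: "a", 2: "x", 4: "d"}   # fresh copy from the data source

# Dict key views support set operations directly.
added   = new.keys() - old.keys()                      # IDs only in the new data
deleted = old.keys() - new.keys()                      # IDs gone from the new data
changed = {k for k in old.keys() & new.keys()
           if old[k] != new[k]}                        # same ID, different properties
```

This works, but it holds both full copies in memory, which is the inefficiency I’m worried about.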
The standard approach for huge piles of data amounts to this.
We’ll assume that list_1 is the “master” (without duplicates) and list_2 is the “updates” which may have duplicates.
Yes, it involves a potential sort. If list_1 is kept in sorted order when you write it back to the file system, that saves considerable time; likewise if list_2 can be accumulated in a structure that keeps it sorted.
Sorry about the wordiness, but you need to know which iterator raised the StopIteration, so you can’t (trivially) wrap the whole while loop in one big try block.