I have a large amount of data (in excess of 500,000 records) that I am pulling from an XML file, and the records all need to be validated against each other. It is location data, with fields such as county, street prefix, street suffix, street name, starting house number, and ending house number. There are duplicates, house number overlaps, etc., and I need to report on all of these issues. Also, the data is not ordered within the XML file, so each record needs to be matched against all the others.
Right now I’m creating a dictionary of the location based on the street name info, and then storing a list of the house number starting and ending locations. After all this is done, I’m iterating through the massive data structure that was created to find duplicates and overlaps within each list. I am running into problems with the size of the data structure and how many errors are coming up.
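To make that structure concrete, here is a rough sketch, not my actual code. The tuple layout, the function name `find_issues`, and the treatment of house number ranges as inclusive are all assumptions for illustration. It groups ranges by street, sorts each group by starting number, and compares each range with the furthest-reaching range seen so far, so it does not have to compare every pair.

```python
from collections import defaultdict

def find_issues(records):
    """Hypothetical helper: records are (county, prefix, name, suffix, start, end) tuples."""
    by_street = defaultdict(list)
    for county, prefix, name, suffix, start, end in records:
        by_street[(county, prefix, name, suffix)].append((start, end))

    duplicates, overlaps = [], []
    for street, ranges in by_street.items():
        ranges.sort()  # identical ranges become adjacent after sorting
        last = widest = ranges[0]
        for cur in ranges[1:]:
            if cur == last:
                duplicates.append((street, cur))
            elif cur[0] <= widest[1]:
                # Assumes inclusive ranges; reports the overlap against the
                # furthest-reaching earlier range only, not every overlapping pair.
                overlaps.append((street, widest, cur))
            if cur[1] > widest[1]:
                widest = cur
            last = cur
    return duplicates, overlaps
```

Sorting each street's list costs O(n log n) for that list, so the total work is dominated by the sorts rather than by pairwise comparison.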
One solution that was suggested to me was to create a temporary SQLite DB to hold all data as it is read from the file, then run through the DB to find all issues with the data, report them out, and then destroy the DB. Is there a better/more efficient way to do this? And any suggestions on a better way to approach this problem?
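For reference, the SQLite idea could look roughly like the sketch below. The table name `ranges`, the column names `lo` and `hi`, and the function name are made up for illustration, ranges are assumed to be inclusive, and missing prefixes or suffixes are assumed to be stored as empty strings rather than NULL (NULLs would not compare equal in the join).

```python
import sqlite3

def report_with_sqlite(records, db_path=":memory:"):
    """Hypothetical helper: records are (county, prefix, name, suffix, lo, hi) tuples."""
    # ":memory:" disappears on close; pass a file path for data that does not fit in RAM.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE ranges (id INTEGER PRIMARY KEY, county TEXT, prefix TEXT, "
        "name TEXT, suffix TEXT, lo INTEGER, hi INTEGER)"
    )
    conn.executemany(
        "INSERT INTO ranges (county, prefix, name, suffix, lo, hi) VALUES (?, ?, ?, ?, ?, ?)",
        records,
    )
    # Index on the street key so the self-join below does not scan the whole table per row.
    conn.execute("CREATE INDEX idx_street ON ranges (county, prefix, name, suffix, lo)")

    duplicates = conn.execute(
        "SELECT county, prefix, name, suffix, lo, hi, COUNT(*) FROM ranges "
        "GROUP BY county, prefix, name, suffix, lo, hi HAVING COUNT(*) > 1"
    ).fetchall()

    # Pairs of row ids on the same street whose ranges intersect but are not identical.
    overlaps = conn.execute(
        "SELECT a.id, b.id FROM ranges a JOIN ranges b "
        "ON a.county = b.county AND a.prefix = b.prefix "
        "AND a.name = b.name AND a.suffix = b.suffix "
        "AND a.id < b.id AND a.lo <= b.hi AND b.lo <= a.hi "
        "AND NOT (a.lo = b.lo AND a.hi = b.hi) "
        "ORDER BY a.id, b.id"
    ).fetchall()
    conn.close()
    return duplicates, overlaps
```

With the default in-memory database there is nothing to clean up afterwards; with a file path, the file would have to be deleted once the report is done.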
As an FYI, the XML file I’m reading is over 500 MB (it stores other data besides this street information, although that is the bulk of it). Parsing the file is not where I’m running into problems; the problems only arise when processing the data obtained from it.
EDIT: I could go into more detail, but the poster who mentioned that there was plenty of room in memory for the data was correct. In one case, though, I had to run this against 3.5 million records, and in that instance I did need to create a temporary database.
500,000 is not a large number. Why can’t you just go through all the records, create a dict out of the relevant entries, and check whatever you need to check? E.g.:
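A minimal sketch of that idea with synthetic data is shown here. The record layout `(street, start, end)`, the generated street names, and the number ranges are all made up for illustration; real records would come from the XML file.

```python
import random
import time
from collections import defaultdict

def make_records(n, seed=0):
    """Generate n synthetic (street, start, end) records; purely illustrative data."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        start = rng.randrange(1, 10_000)
        end = start + rng.randrange(1, 100)
        records.append((f"street{rng.randrange(50_000)}", start, end))
    return records

def find_duplicates(records):
    """Count each record in a dict and keep only those seen more than once."""
    seen = defaultdict(int)
    for rec in records:
        seen[rec] += 1
    return {rec: count for rec, count in seen.items() if count > 1}

if __name__ == "__main__":
    t0 = time.perf_counter()
    records = make_records(500_000)
    dups = find_duplicates(records)
    elapsed = time.perf_counter() - t0
    print(f"{len(records)} records, {len(dups)} duplicated entries, {elapsed:.2f}s")
```

The timing and counts printed will depend on the machine and the seed, so treat them only as a rough indication of whether the data fits comfortably in memory.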
output:
So how different is your scenario from this case? Can it be done in a similar way?