I have a large amount of data that I am pulling from an xml

Question


Asked: May 25, 2026 (2026-05-25T13:00:30+00:00)

I have a large amount of data (in excess of 500,000 records) that I am pulling from an XML file, and every record needs to be validated against the others. It is location data, so each record has fields such as county, street prefix, street suffix, street name, starting house number, and ending house number. There are duplicates, house number overlaps, and so on, and I need to report on all of these issues. The data is also unordered within the XML file, so each record has to be matched against all the others.

Right now I’m creating a dictionary keyed on the street name info, where each value is a list of starting and ending house numbers. After all this is done, I’m iterating through the massive data structure that was created to find duplicates and overlaps within each list. I am running into problems with the size of the data structure and with how many errors are coming up.

One solution that was suggested to me was to create a temporary SQLite DB to hold all data as it is read from the file, then run through the DB to find all issues with the data, report them out, and then destroy the DB. Is there a better/more efficient way to do this? And any suggestions on a better way to approach this problem?
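For reference, the temporary-database idea might look roughly like the sketch below. It assumes Python's built-in `sqlite3` module and an in-memory database; the table name `ranges`, its column names, and the function name `find_issues_with_sqlite` are illustrative and not taken from the actual project. It also assumes missing prefixes or suffixes are stored as empty strings rather than NULL, and that house number ranges are inclusive.

```python
import sqlite3


def find_issues_with_sqlite(records):
    """records: iterable of (county, prefix, name, suffix, start_no, end_no).

    Returns (duplicates, overlaps) as lists of result rows.
    """
    conn = sqlite3.connect(":memory:")  # temporary DB, discarded on close
    try:
        conn.execute(
            """CREATE TABLE ranges (
                   county TEXT, prefix TEXT, name TEXT, suffix TEXT,
                   start_no INTEGER, end_no INTEGER)"""
        )
        conn.executemany("INSERT INTO ranges VALUES (?, ?, ?, ?, ?, ?)", records)
        # Index on the street key so the self-join below can look up by street.
        conn.execute(
            "CREATE INDEX idx_street ON ranges (county, prefix, name, suffix, start_no)"
        )

        # Exact duplicates: identical street and identical range, seen more than once.
        duplicates = conn.execute(
            """SELECT county, prefix, name, suffix, start_no, end_no, COUNT(*)
               FROM ranges
               GROUP BY county, prefix, name, suffix, start_no, end_no
               HAVING COUNT(*) > 1"""
        ).fetchall()

        # Overlaps: two different rows on the same street whose ranges intersect.
        # a.rowid < b.rowid reports each pair once; identical ranges are left
        # to the duplicate query above.
        overlaps = conn.execute(
            """SELECT a.county, a.prefix, a.name, a.suffix,
                      a.start_no, a.end_no, b.start_no, b.end_no
               FROM ranges a JOIN ranges b
                 ON a.county = b.county AND a.prefix = b.prefix
                AND a.name = b.name AND a.suffix = b.suffix
                AND a.rowid < b.rowid
                AND a.start_no <= b.end_no AND b.start_no <= a.end_no
                AND NOT (a.start_no = b.start_no AND a.end_no = b.end_no)"""
        ).fetchall()
        return duplicates, overlaps
    finally:
        conn.close()
```

Passing a file path instead of `":memory:"` would keep the working set on disk, which may matter for the multi-million-record case.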

As an FYI, the XML file I’m reading in is over 500MB (it stores other data than just this street information, although that is the bulk of it). Parsing the file is not where I’m running into problems; the problems only appear when processing the data obtained from the file.

EDIT: I could go into more detail, but the poster who mentioned that there was plenty of room in memory for the data was correct. In one case, though, I had to run this against 3.5 million records, and in that instance I did need to create a temporary database.

1 Answer

Editorial Team · Answer 1 · 2026-05-25T13:00:30+00:00

500,000 is not a large number. Why can’t you just go through all the records, create a dict out of the relevant entries, and check whatever you need to check? For example:

import random
import time

class Data(object):
    ID = 0  # class-level counter used to give each record a unique id

    def __init__(self, data):
        Data.ID += 1
        self.id = Data.ID
        self.data = data
        self.duplicates = None

def fill_data(N):
    data_list = []
    # create a list of random data
    sample = list("anuraguniyal")
    for i in range(N):
        random.shuffle(sample)
        data_list.append(Data("".join(sample)))
    return data_list

def find_duplicate(data_list):
    # map each value to the list of records that share it
    data_map = {}
    for data in data_list:
        if data.data in data_map:
            data_map[data.data].append(data)
        else:
            data_map[data.data] = [data]

        # every record in a group points at the same shared list
        data.duplicates = data_map[data.data]

st = time.time()
data_list = fill_data(500000)
print("fill_data time:", time.time() - st)
st = time.time()
find_duplicate(data_list)
print("find_duplicate time:", time.time() - st)

total_duplicates = 0
max_duplicates = 0
for data in data_list:
    total_duplicates += (len(data.duplicates) - 1)
    max_duplicates = max(len(data.duplicates), max_duplicates)
print("total_duplicates count:", total_duplicates)
print("max_duplicates count:", max_duplicates)

output:

fill_data time: 7.83853507042
find_duplicate time: 2.55058097839
total_duplicates count: 12348
max_duplicates count: 3

So how different is your scenario from this case? Can it be done in a similar way?
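The same dict-based idea could plausibly be stretched to cover house number overlaps as well. The following is only a sketch under assumptions: each record is reduced to a hypothetical `(street_key, start, end)` tuple, where `street_key` is any hashable combination of county, prefix, name, and suffix, and ranges are treated as inclusive. The function name `find_range_issues` is made up for illustration. Sorting each street's ranges means every range is compared only against its neighbours rather than against every other record.

```python
from collections import defaultdict


def find_range_issues(records):
    """records: iterable of (street_key, start, end) tuples.

    Returns (duplicates, overlaps):
      duplicates -> list of (street_key, range) for each repeated range
      overlaps   -> list of (street_key, earlier_range, later_range)
    """
    # Group the house number ranges by street.
    by_street = defaultdict(list)
    for street_key, start, end in records:
        by_street[street_key].append((start, end))

    duplicates = []
    overlaps = []
    for street_key, ranges in by_street.items():
        ranges.sort()      # order by start, then end
        previous = None    # range seen immediately before the current one
        reach = None       # range with the highest end number seen so far
        for current in ranges:
            if previous is not None and current == previous:
                duplicates.append((street_key, current))
            elif reach is not None and current[0] <= reach[1]:
                # Starts before an earlier range has finished.
                overlaps.append((street_key, reach, current))
            if reach is None or current[1] > reach[1]:
                reach = current
            previous = current
    return duplicates, overlaps
```

Note that this reports at least one conflict for every range that collides with an earlier one, not every conflicting pair; whether that is enough depends on what the report needs to show.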
