The Archive Base

Editorial Team
Asked: May 25, 2026 (2026-05-25T13:00:30+00:00)

I have a large amount of data that I am pulling from an xml

I have a large amount of data that I am pulling from an XML file, and all of it needs to be validated against each other (in excess of 500,000 records). It is location data, so it has information such as county, street prefix, street suffix, street name, starting house number, and ending house number. There are duplicates, house number overlaps, etc., and I need to report on all of this data (such as where there are issues). Also, there is no ordering of the data within the XML file, so each record needs to be matched up against all others.

Right now I’m creating a dictionary of the location based on the street name info, and then storing a list of the house number starting and ending locations. After all this is done, I’m iterating through the massive data structure that was created to find duplicates and overlaps within each list. I am running into problems with the size of the data structure and how many errors are coming up.
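As a rough illustration of the dictionary-of-lists approach described above, the sketch below groups ranges by street and sorts each group by starting house number, so a range only has to be compared with the ones that start before it ends. The `Segment` layout, its field names, and `find_issues` are hypothetical, and the bounds are assumed to be inclusive integers; the real records may need normalising first.

```python
from collections import defaultdict, namedtuple

# Hypothetical record layout; the field names are assumptions, not the real schema.
Segment = namedtuple("Segment", "rec_id county prefix name suffix start end")

def find_issues(segments):
    """Return (duplicates, overlaps) as lists of (rec_id, rec_id) pairs."""
    # Group the house-number ranges by street identity.
    by_street = defaultdict(list)
    for seg in segments:
        by_street[(seg.county, seg.prefix, seg.name, seg.suffix)].append(seg)

    duplicates, overlaps = [], []
    for ranges in by_street.values():
        # Sort by start so the inner scan can stop at the first range
        # that begins after the current one ends.
        ranges.sort(key=lambda s: (s.start, s.end))
        for i, a in enumerate(ranges):
            for j in range(i + 1, len(ranges)):
                b = ranges[j]
                if b.start > a.end:
                    break  # every later range starts even further right
                if (a.start, a.end) == (b.start, b.end):
                    duplicates.append((a.rec_id, b.rec_id))
                else:
                    overlaps.append((a.rec_id, b.rec_id))
    return duplicates, overlaps

sample = [
    Segment(1, "Adams", "N", "Main", "St", 100, 198),
    Segment(2, "Adams", "N", "Main", "St", 150, 250),
    Segment(3, "Adams", "N", "Main", "St", 100, 198),
    Segment(4, "Adams", "N", "Main", "St", 300, 398),
    Segment(5, "Adams", "S", "Main", "St", 100, 198),
]
duplicates, overlaps = find_issues(sample)
print(duplicates)  # [(1, 3)]
print(overlaps)    # [(1, 2), (3, 2)]
```

Sorting keeps the work per street close to the number of conflicts actually reported, instead of comparing every pair of ranges in the group.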

One solution that was suggested to me was to create a temporary SQLite DB to hold all data as it is read from the file, then run through the DB to find all issues with the data, report them out, and then destroy the DB. Is there a better/more efficient way to do this? And any suggestions on a better way to approach this problem?
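For comparison, here is a minimal sketch of the temporary-database idea using Python's standard `sqlite3` module. The table name, column names, the combined `street_key` string, and the function `find_overlaps_sqlite` are all assumptions made for illustration. An in-memory database (`":memory:"`) disappears when the connection is closed, which covers the "destroy the DB" step; a file path could be passed instead if the data does not fit in memory.

```python
import sqlite3

def find_overlaps_sqlite(rows, db_path=":memory:"):
    """rows: iterable of (rec_id, street_key, start_no, end_no) tuples."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE segment ("
            " rec_id INTEGER PRIMARY KEY, street_key TEXT,"
            " start_no INTEGER, end_no INTEGER)"
        )
        conn.executemany("INSERT INTO segment VALUES (?, ?, ?, ?)", rows)
        # Index the join columns so the self-join does not scan every pair.
        conn.execute("CREATE INDEX idx_street ON segment (street_key, start_no)")
        conn.commit()
        # Two inclusive ranges overlap when each starts before the other ends.
        # Exact duplicates satisfy this condition too and are reported as well.
        query = (
            "SELECT a.rec_id, b.rec_id FROM segment a"
            " JOIN segment b ON a.street_key = b.street_key"
            " AND a.rec_id < b.rec_id"
            " AND a.start_no <= b.end_no AND b.start_no <= a.end_no"
            " ORDER BY a.rec_id, b.rec_id"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()

rows = [
    (1, "Adams|N|Main|St", 100, 198),
    (2, "Adams|N|Main|St", 150, 250),
    (3, "Adams|N|Main|St", 100, 198),
    (4, "Adams|N|Main|St", 300, 398),
    (5, "Adams|S|Main|St", 100, 198),
]
pairs = find_overlaps_sqlite(rows)
print(pairs)  # [(1, 2), (1, 3), (2, 3)]
```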

As an FYI, the XML file I’m reading in is over 500MB (it stores other data besides this street information, although that is the bulk of it). Processing the file is not where I’m running into problems; the problems only appear when processing the data obtained from the file.

EDIT: I could go into more detail, but the poster who mentioned that there was plenty of room in memory for the data was actually correct. In one case, though, I had to run this against 3.5 million records, and in that instance I did need to create a temporary database.

1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 1:00 pm

    500,000 is not a large number. Why can’t you just go through all the records, create a dict out of the relevant entries, and check whatever you need to check? For example:

    import random
    import time
    
    class Data(object):
        ID = 0  # class-level counter used to hand out sequential ids
        def __init__(self, data):
            Data.ID += 1
            self.id = Data.ID
            self.data = data
            self.duplicates = None
    
    def fill_data(N):
        data_list = []
        # create a list of random data
        sample = list("anuraguniyal")
        for i in range(N):
            random.shuffle(sample)
            data_list.append(Data("".join(sample)))
        return data_list
    
    def find_duplicate(data_list):
        # map each value to the list of records that carry it
        data_map = {}
        for data in data_list:
            if data.data in data_map:
                data_map[data.data].append(data)
            else:
                data_map[data.data] = [data]
    
            # records with the same value all share one list
            data.duplicates = data_map[data.data]
    
    st = time.time()
    data_list = fill_data(500000)
    print("fill_data time:", time.time() - st)
    st = time.time()
    find_duplicate(data_list)
    print("find_duplicate time:", time.time() - st)
    
    total_duplicates = 0
    max_duplicates = 0
    for data in data_list:
        # each record adds the number of other records sharing its value
        total_duplicates += (len(data.duplicates) - 1)
        max_duplicates = max(len(data.duplicates), max_duplicates)
    print("total_duplicates count:", total_duplicates)
    print("max_duplicates count:", max_duplicates)
    

    output:

    fill_data time: 7.83853507042
    find_duplicate time: 2.55058097839
    total_duplicates count: 12348
    max_duplicates count: 3
    

    So how different is your scenario from this case? Can it be done in a similar way?

© 2021 The Archive Base. All Rights Reserved