The Archive Base

Editorial Team
Asked: June 4, 2026

I have a large amount of data which is received in separated XML files


I have a large amount of data which is received in separated XML files each morning. I need to combine the objects within the XML and generate a report from them. I am looking to use an optimal solution for this problem.

To demonstrate I have fabricated the following example:

There are 2 XML files:

The first is a list of languages and the countries they are spoken in. The second is a list of products and the countries they are sold in. The report I generate is the product name followed by the languages the packaging has to be in.

XML1:

<?xml version="1.0" encoding="utf-8"?>
<languages>
  <language>
    <name>English</name>
    <country>8</country>
    <country>9</country>
    <country>3</country>
    <country>11</country>
    <country>12</country>
  </language>
  <language>
    <name>French</name>
    <country>3</country>
    <country>6</country>
    <country>7</country>
    <country>13</country>
  </language>
  <language>
    <name>Spanish</name>
    <country>1</country>
    <country>2</country>
    <country>3</country>
  </language>
</languages>

XML2:

<?xml version="1.0" encoding="utf-8"?>
<products>
  <product>
    <name>Screws</name>
    <country>3</country>
    <country>12</country>
    <country>29</country>
  </product>
  <product>
    <name>Hammers</name>
    <country>1</country>
    <country>13</country>
  </product>
  <product>
    <name>Ladders</name>
    <country>12</country>
    <country>39</country>
    <country>56</country>
  </product>
  <product>
    <name>Wrenches</name>
    <country>8</country>
    <country>13</country>
    <country>456</country>
  </product>
  <product>
    <name>Levels</name>
    <country>19</country>
    <country>18</country>
    <country>17</country>
  </product>
</products>

Sample Program Output:

 Screws -> English, French, Spanish
 Wrenches -> English, French
 Hammers -> French, Spanish
 Ladders -> English

Currently I deserialise each file into a DataSet and then use LINQ to join across the DataSets to generate the required report strings (shown below, with the names of the files passed in as command-line arguments).

public static List<String> XMLCombine(String[] args)
{
    var output = new List<String>();
    var dataSets = new List<DataSet>();
    //Load each of the Documents specified in the args
    foreach (var s in args)
    {
        var path = Environment.CurrentDirectory + "\\" + s;
        var tempDS = new DataSet();
        try
        {
            tempDS.ReadXml(path);
        }
        catch (Exception ex)
        {
            //Custom Logging + Error Reporting
            return null;
        }
        dataSets.Add(tempDS);
    }
    //determine order of files submitted
    var productIndex = dataSets[0].DataSetName == "products" ? 0:1;
    var languageIndex = dataSets[0].DataSetName == "products" ? 1:0;
    var joined = from tProducts in dataSets[productIndex].Tables["product"].AsEnumerable()
                 join tProductCountries in dataSets[productIndex].Tables["country"].AsEnumerable() on (int)tProducts["product_id"] equals (int)tProductCountries["product_id"]
                 join tLanguageCountries in dataSets[languageIndex].Tables["country"].AsEnumerable() on (String)tProductCountries["country_text"] equals (String)tLanguageCountries["country_text"]
                 join tLanguages in dataSets[languageIndex].Tables["language"].AsEnumerable() on (int)tLanguageCountries["language_Id"] equals (int)tLanguages["language_Id"]
                  select new
                  {
                      Language = tLanguages["name"].ToString(),
                      Product = tProducts["name"].ToString()
                  };

    var listOfProducts = joined.OrderByDescending(_ => _.Product).Select(_ => _.Product).Distinct().ToList();

    foreach (var e in listOfProducts)
    {
        var e1 = e;
        var languages = joined.Where(_ => _.Product == e1).Select(_ => _.Language).Distinct().ToList();
        languages.Sort();
        //Custom simple Array to text method
        output.Add(String.Format("{0} {1}", e, ArrayToText(languages)));
    }
    return output;
}

This works fine, but I know there must be more optimal solutions to this problem (particularly when the XML files are huge in real life). Does anyone have experience with alternative approaches (other than LINQ), or advice on optimising the current approach, that would bring me closer to the best solution?

Many thanks in advance.

Solution
Implementation of suggested solutions:
Casperah’s approach using Dictionaries processed the data set in 312 ms.
yamen’s approach using a LINQ Lookup processed the data set in 452 ms.
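
For readers who have not used a lookup before, here is a minimal sketch of what a Lookup-based report could look like with LINQ to XML. It is not yamen's actual code, which is not reproduced here; the class name LookupReport and the method name Build are made up for illustration, and the sketch assumes both files fit in memory as XDocument instances. Products with no matching language are skipped, as in the sample output above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, named for illustration only.
public static class LookupReport
{
    // Builds "Product -> Lang1, Lang2" lines from the two XML documents.
    public static List<string> Build(XDocument languagesDoc, XDocument productsDoc)
    {
        // Index: country code -> names of the languages spoken there.
        ILookup<string, string> languagesByCountry = languagesDoc
            .Descendants("language")
            .SelectMany(l => l.Elements("country"),
                        (l, c) => new { Country = c.Value.Trim(), Language = (string)l.Element("name") })
            .ToLookup(x => x.Country, x => x.Language);

        var report = new List<string>();
        foreach (XElement product in productsDoc.Descendants("product"))
        {
            // A lookup returns an empty sequence for an unknown key,
            // so countries without any language need no special handling.
            List<string> languages = product.Elements("country")
                .SelectMany(c => languagesByCountry[c.Value.Trim()])
                .Distinct()
                .OrderBy(name => name, StringComparer.Ordinal)
                .ToList();

            // Skip products that are sold only in countries with no known language.
            if (languages.Count > 0)
                report.Add((string)product.Element("name") + " -> " + string.Join(", ", languages));
        }
        return report;
    }
}
```

Because XDocument loads each file completely, this version trades memory for brevity; the index itself is built once and every product country is then a single keyed lookup.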

1 Answer

  1. Editorial Team
    Added an answer on June 4, 2026 at 8:19 pm

    You have two problems: memory usage and CPU usage.

    To limit memory usage you can use XmlReader, which streams through the file node by node instead of loading the whole document into memory.
    To limit CPU usage you should build an index keyed on the country code, so that each product country is resolved with a single lookup rather than a join.

    I would do it like this:
    1. Read all languages and insert them into a dictionary like this:
    // The key is the country, the value is a list of languages.
    Dictionary<int, List<string>> countries = new Dictionary<int, List<string>>();
    2. Read the products one at a time using XmlReader.
    3. Look up each product's countries and write out the languages, perhaps using a HashSet to avoid duplicate languages.

    That would be my approach. Good luck.

    I have created this example:

    using System.Collections.Generic;
    using System.IO;
    using System.Xml;

    class Program
    {
        static void Main()
        {
            // Index: country code -> languages spoken in that country.
            Dictionary<int, List<string>> countries = new Dictionary<int, List<string>>();

            string language = null;
            string elementName = null;
            using (XmlReader xml = XmlReader.Create("file://D:/Development/Test/StackOverflowQuestion/StackOverflowQuestion/Countries.xml"))
            {
                while (xml.Read())
                {
                    switch (xml.NodeType)
                    {
                        case XmlNodeType.Element:
                            // Remember which element the next text node belongs to.
                            elementName = xml.Name;
                            break;

                        case XmlNodeType.Text:
                            if (elementName == "name") language = xml.Value;
                            if (elementName == "country")
                            {
                                int country;
                                if (int.TryParse(xml.Value, out country))
                                {
                                    List<string> languages;
                                    if (!countries.TryGetValue(country, out languages))
                                    {
                                        languages = new List<string>();
                                        countries.Add(country, languages);
                                    }
                                    languages.Add(language);
                                }
                            }
                            break;
                    }
                }
            }

            using (StreamWriter result = new StreamWriter(@"D:\Development\Test\StackOverflowQuestion\StackOverflowQuestion\Output.txt"))
            using (XmlReader xml = XmlReader.Create("file://D:/Development/Test/StackOverflowQuestion/StackOverflowQuestion/Products.xml"))
            {
                string product = null;
                elementName = null;
                // A HashSet removes languages that several countries share.
                HashSet<string> languages = new HashSet<string>();
                while (xml.Read())
                {
                    switch (xml.NodeType)
                    {
                        case XmlNodeType.Element:
                            elementName = xml.Name;
                            break;

                        case XmlNodeType.Text:
                            if (elementName == "name")
                            {
                                // A new product starts: write out the previous one first.
                                WriteProduct(result, product, languages);
                                languages.Clear();
                                product = xml.Value;
                            }
                            if (elementName == "country")
                            {
                                int country;
                                if (int.TryParse(xml.Value, out country))
                                {
                                    List<string> countryLanguages;
                                    if (countries.TryGetValue(country, out countryLanguages))
                                        foreach (string countryLanguage in countryLanguages) languages.Add(countryLanguage);
                                }
                            }
                            break;
                    }
                }
                // Flush the last product; no further <name> follows it to trigger the write.
                WriteProduct(result, product, languages);
            }
        }

        // Writes one report line; products without any matching language are skipped.
        static void WriteProduct(TextWriter result, string product, HashSet<string> languages)
        {
            if (product == null || languages.Count == 0) return;
            result.Write(product);
            result.Write(" -> ");
            result.WriteLine(string.Join(", ", languages));
        }
    }
    

    It produces this output:

    Screws -> English, French, Spanish
    Hammers -> Spanish, French
    Ladders -> English
    Wrenches -> English, French
    

    XmlReader.Create takes a URI, so you could also pass something like “http://www.mysite.com/countries.xml”.
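
    If tracking the current element name by hand feels fragile, one variation (a swapped-in technique, not the code above) is to let XmlReader find each element and hand just that element to XNode.ReadFrom, which materialises one small XElement at a time. The sketch below assumes this hybrid is acceptable for your data; the names XmlStreaming and StreamElements are hypothetical, and it uses the XmlReader.Create overload that takes a TextReader instead of a URI.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, named for illustration only.
public static class XmlStreaming
{
    // Yields each element called elementName as its own XElement, one at a time,
    // so only a single element's subtree is held in memory at any moment.
    public static IEnumerable<XElement> StreamElements(TextReader input, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    // ReadFrom consumes the whole element and leaves the reader on the
                    // node that follows it, so Read() must not be called in this branch.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}
```

    It could be used as foreach (XElement product in XmlStreaming.StreamElements(File.OpenText("Products.xml"), "product")), where Products.xml is a placeholder path, reading product.Element("name") and product.Elements("country") inside the loop and looking each country up in the dictionary as before.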

© 2021 The Archive Base. All Rights Reserved