I have a large amount of data which is received in separated XML files

Question

Asked: June 4, 2026 (2026-06-04T20:19:48+00:00)

I receive a large amount of data each morning in separate XML files. I need to combine the objects within the XML and generate a report from them, and I am looking for an efficient way to do this.

To demonstrate I have fabricated the following example:

There are 2 XML files:

The first is a list of languages and the countries they are spoken in. The second is a list of products and the countries they are sold in. The report I generate is the product name followed by the languages the packaging has to be in.

XML1:

<?xml version="1.0" encoding="utf-8"?>
<languages>
  <language>
    <name>English</name>
    <country>8</country>
    <country>9</country>
    <country>3</country>
    <country>11</country>
    <country>12</country>
  </language>
  <language>
    <name>French</name>
    <country>3</country>
    <country>6</country>
    <country>7</country>
    <country>13</country>
  </language>
  <language>
    <name>Spanish</name>
    <country>1</country>
    <country>2</country>
    <country>3</country>
  </language>
</languages>

XML2:

<?xml version="1.0" encoding="utf-8"?>
<products>
  <product>
    <name>Screws</name>
    <country>3</country>
    <country>12</country>
    <country>29</country>
  </product>
  <product>
    <name>Hammers</name>
    <country>1</country>
    <country>13</country>
  </product>
  <product>
    <name>Ladders</name>
    <country>12</country>
    <country>39</country>
    <country>56</country>
  </product>
  <product>
    <name>Wrenches</name>
    <country>8</country>
    <country>13</country>
    <country>456</country>
  </product>
  <product>
    <name>Levels</name>
    <country>19</country>
    <country>18</country>
    <country>17</country>
  </product>
</products>

Sample Program Output:

 Screws -> English, French, Spanish
 Wrenches -> English, French
 Hammers -> French, Spanish
 Ladders -> English

Currently I deserialise each file into a DataSet and then use LINQ to join across the DataSets to generate the required report strings (shown below; the file names are passed in as command-line arguments).

public static List<String> XMLCombine(String[] args)
{
    var output = new List<String>();
    var dataSets = new List<DataSet>();
    //Load each of the Documents specified in the args
    foreach (var s in args)
    {
        var path = Environment.CurrentDirectory + "\\" + s;
        var tempDS = new DataSet();
        try
        {
            tempDS.ReadXml(path);
        }
        catch (Exception ex)
        {
            //Custom Logging + Error Reporting
            return null;
        }
        dataSets.Add(tempDS);
    }
    //determine order of files submitted
    var productIndex = dataSets[0].DataSetName == "products" ? 0:1;
    var languageIndex = dataSets[0].DataSetName == "products" ? 1:0;
    var joined = from tProducts in dataSets[productIndex].Tables["product"].AsEnumerable()
                 join tProductCountries in dataSets[productIndex].Tables["country"].AsEnumerable() on (int)tProducts["product_id"] equals (int)tProductCountries["product_id"]
                 join tLanguageCountries in dataSets[languageIndex].Tables["country"].AsEnumerable() on (String)tProductCountries["country_text"] equals (String)tLanguageCountries["country_text"]
                 join tLanguages in dataSets[languageIndex].Tables["language"].AsEnumerable() on (int)tLanguageCountries["language_Id"] equals (int)tLanguages["language_Id"]
                  select new
                  {
                      Language = tLanguages["name"].ToString(),
                      Product = tProducts["name"].ToString()
                  };

    var listOfProducts = joined.OrderByDescending(_ => _.Product).Select(_ => _.Product).Distinct().ToList();

    foreach (var e in listOfProducts)
    {
        var e1 = e;
        var languages = joined.Where(_ => _.Product == e1).Select(_ => _.Language).Distinct().ToList();
        languages.Sort();
        //Custom simple Array to text method
        output.Add(String.Format("{0} {1}", e, ArrayToText(languages)));
    }
    return output;
}

This works fine, but I know there must be more efficient solutions to this problem, particularly as the XML files are huge in real life. Does anyone have experience with alternative approaches (other than LINQ), or advice on optimising the current approach, that would bring me closer to the best solution?

Many thanks in advance.

Solution
Implementation of the suggested solutions:
Casperah’s approach using dictionaries processed the data set in 312 ms.
yamen’s approach using a LINQ Lookup processed the data set in 452 ms.
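
yamen's answer is not reproduced on this page, so the following is only a minimal sketch of what a Lookup-based approach might look like, not that answer's actual code. It assumes LINQ to XML (XDocument), which loads each file fully into memory; the names LookupReport and Build are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class LookupReport
{
    // Builds "Product -> Lang1, Lang2" lines from the two XML documents.
    public static List<string> Build(XDocument languagesDoc, XDocument productsDoc)
    {
        // Country code -> language names, built once from the languages file.
        ILookup<string, string> languagesByCountry = languagesDoc
            .Descendants("language")
            .SelectMany(l => l.Elements("country"),
                        (l, c) => new { Country = c.Value.Trim(), Language = (string)l.Element("name") })
            .ToLookup(x => x.Country, x => x.Language);

        var lines = new List<string>();
        foreach (XElement product in productsDoc.Descendants("product"))
        {
            // A lookup returns an empty sequence for an unknown key,
            // so countries without any language need no special handling.
            List<string> languages = product.Elements("country")
                .SelectMany(c => languagesByCountry[c.Value.Trim()])
                .Distinct()
                .OrderBy(name => name, StringComparer.Ordinal)
                .ToList();

            // Products sold only in countries with no known language are skipped.
            if (languages.Count > 0)
                lines.Add((string)product.Element("name") + " -> " + string.Join(", ", languages));
        }
        return lines;
    }
}
```

It could be called as LookupReport.Build(XDocument.Load(languagesPath), XDocument.Load(productsPath)), where both paths are placeholders. The trade-off against the XmlReader approach is memory: the query is shorter, but both documents are held in memory at once.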

1 Answer

Editorial Team · Answer 1 · 2026-06-04T20:19:50+00:00

You have two problems: memory usage and CPU usage.

To limit memory usage you can use XmlReader, which streams through the file and only holds a small chunk of it in memory at a time.
To limit CPU usage you should build an index keyed on the country code.

I would do it like this:
1. Read all the languages and insert them into a dictionary like this:
// The key is the country, the value is a list of languages.
Dictionary<int, List<string>> countries = new Dictionary<int, List<string>>();
2. Read the products one at a time using XmlReader.
3. Look up each product's countries and write out the languages, using a HashSet to avoid duplicate languages.

That would be my approach. Good luck!

I have created this example:

        Dictionary<int, List<string>> countries = new Dictionary<int, List<string>>();

        XmlReader xml = XmlReader.Create("file://D:/Development/Test/StackOverflowQuestion/StackOverflowQuestion/Countries.xml");
        string language = null;
        string elementName = null;
        while (xml.Read())
        {
            switch (xml.NodeType)
            {
                case XmlNodeType.Element:
                    elementName = xml.Name;
                    break;

                case XmlNodeType.Text:
                    if (elementName == "name") language = xml.Value;
                    if (elementName == "country")
                    {
                        int country;
                        if (int.TryParse(xml.Value, out country))
                        {
                            List<string> languages;
                            if (!countries.TryGetValue(country, out languages))
                            {
                                languages = new List<string>();
                                countries.Add(country, languages);
                            }
                            languages.Add(language);
                        }
                    }
                    break;
            }
        }
        using (StreamWriter result = new StreamWriter(@"D:\Development\Test\StackOverflowQuestion\StackOverflowQuestion\Output.txt"))
        using (XmlReader productsXml = XmlReader.Create("file://D:/Development/Test/StackOverflowQuestion/StackOverflowQuestion/Products.xml"))
        {
            string product = null;
            elementName = null;
            // Distinct languages collected for the product currently being read.
            HashSet<string> languages = new HashSet<string>();
            while (productsXml.Read())
            {
                switch (productsXml.NodeType)
                {
                    case XmlNodeType.Element:
                        elementName = productsXml.Name;
                        break;

                    case XmlNodeType.Text:
                        if (elementName == "name")
                        {
                            // A new <name> starts the next product, so write out the previous one.
                            if (product != null && languages.Count > 0)
                            {
                                result.WriteLine(product + " -> " + string.Join(", ", languages.ToArray()));
                            }
                            languages.Clear();
                            product = productsXml.Value;
                        }
                        if (elementName == "country")
                        {
                            int country;
                            if (int.TryParse(productsXml.Value, out country))
                            {
                                List<string> countryLanguages;
                                if (countries.TryGetValue(country, out countryLanguages))
                                    languages.UnionWith(countryLanguages);
                            }
                        }
                        break;
                }
            }
            // The loop only writes a product when the next <name> arrives,
            // so the last product has to be flushed here.
            if (product != null && languages.Count > 0)
            {
                result.WriteLine(product + " -> " + string.Join(", ", languages.ToArray()));
            }
        }

It produces this output:

Screws -> English, French, Spanish
Hammers -> Spanish, French
Ladders -> English
Wrenches -> English, French

XmlReader.Create takes a URI, so you could also pass something like "http://www.mysite.com/countries.xml".
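
Besides a URI string, XmlReader.Create also has overloads that accept a Stream or a TextReader, which makes the indexing step easy to try without a file on disk. The following is a minimal sketch of that idea using the same element-tracking pattern as the example above; the names CountryIndex and Load are hypothetical and not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class CountryIndex
{
    // Reads <language> entries from any TextReader and
    // indexes the language names by country code.
    public static Dictionary<int, List<string>> Load(TextReader input)
    {
        var countries = new Dictionary<int, List<string>>();

        // The using block disposes the reader once the input has been consumed.
        using (XmlReader xml = XmlReader.Create(input))
        {
            string language = null;
            string elementName = null;
            while (xml.Read())
            {
                if (xml.NodeType == XmlNodeType.Element)
                {
                    elementName = xml.Name;
                }
                else if (xml.NodeType == XmlNodeType.Text)
                {
                    if (elementName == "name")
                    {
                        language = xml.Value;
                    }
                    else if (elementName == "country")
                    {
                        int country;
                        if (int.TryParse(xml.Value, out country))
                        {
                            List<string> languages;
                            if (!countries.TryGetValue(country, out languages))
                            {
                                languages = new List<string>();
                                countries.Add(country, languages);
                            }
                            languages.Add(language);
                        }
                    }
                }
            }
        }
        return countries;
    }
}
```

For a file on disk the same method could be fed a StreamReader, for example CountryIndex.Load(new StreamReader(path)), where path is a placeholder.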
