I receive a large amount of data in separate XML files each morning. I need to combine the objects in those files and generate a report from them, and I am looking for an efficient way to do it.
To demonstrate, I have fabricated the following example:
There are 2 XML files:
The first is a list of languages and the countries they are spoken in. The second is a list of products and the countries they are sold in. The report I generate is the product name followed by the languages the packaging has to be in.
XML1:
<?xml version="1.0" encoding="utf-8"?>
<languages>
<language>
<name>English</name>
<country>8</country>
<country>9</country>
<country>3</country>
<country>11</country>
<country>12</country>
</language>
<language>
<name>French</name>
<country>3</country>
<country>6</country>
<country>7</country>
<country>13</country>
</language>
<language>
<name>Spanish</name>
<country>1</country>
<country>2</country>
<country>3</country>
</language>
</languages>
XML2:
<?xml version="1.0" encoding="utf-8"?>
<products>
<product>
<name>Screws</name>
<country>3</country>
<country>12</country>
<country>29</country>
</product>
<product>
<name>Hammers</name>
<country>1</country>
<country>13</country>
</product>
<product>
<name>Ladders</name>
<country>12</country>
<country>39</country>
<country>56</country>
</product>
<product>
<name>Wrenches</name>
<country>8</country>
<country>13</country>
<country>456</country>
</product>
<product>
<name>Levels</name>
<country>19</country>
<country>18</country>
<country>17</country>
</product>
</products>
Sample Program Output:
Screws -> English, French, Spanish
Wrenches -> English, French
Hammers -> French, Spanish
Ladders -> English
Currently I deserialise each file into a DataSet and then use LINQ to join across the DataSets to generate the required report strings (shown below; the file names are passed in as command-line arguments).
public static List<String> XMLCombine(String[] args)
{
    var output = new List<String>();
    var dataSets = new List<DataSet>();
    // Load each of the documents specified in the args
    foreach (var s in args)
    {
        var path = Environment.CurrentDirectory + "\\" + s;
        var tempDS = new DataSet();
        try
        {
            tempDS.ReadXml(path);
        }
        catch (Exception ex)
        {
            // Custom logging + error reporting
            return null;
        }
        dataSets.Add(tempDS);
    }
    // Determine the order of the files submitted
    var productIndex = dataSets[0].DataSetName == "products" ? 0 : 1;
    var languageIndex = dataSets[0].DataSetName == "products" ? 1 : 0;
    var joined = from tProducts in dataSets[productIndex].Tables["product"].AsEnumerable()
                 join tProductCountries in dataSets[productIndex].Tables["country"].AsEnumerable()
                     on (int)tProducts["product_id"] equals (int)tProductCountries["product_id"]
                 join tLanguageCountries in dataSets[languageIndex].Tables["country"].AsEnumerable()
                     on (String)tProductCountries["country_text"] equals (String)tLanguageCountries["country_text"]
                 join tLanguages in dataSets[languageIndex].Tables["language"].AsEnumerable()
                     on (int)tLanguageCountries["language_Id"] equals (int)tLanguages["language_Id"]
                 select new
                 {
                     Language = tLanguages["name"].ToString(),
                     Product = tProducts["name"].ToString()
                 };
    var listOfProducts = joined.OrderByDescending(_ => _.Product).Select(_ => _.Product).Distinct().ToList();
    foreach (var e in listOfProducts)
    {
        var e1 = e;
        var languages = joined.Where(_ => _.Product == e1).Select(_ => _.Language).Distinct().ToList();
        languages.Sort();
        // Custom simple array-to-text method
        output.Add(String.Format("{0} {1}", e, ArrayToText(languages)));
    }
    return output;
}
This works fine, but I know there must be more efficient solutions to this problem (particularly since the XML files are huge in real life). Does anyone have experience with alternative approaches (other than LINQ), or advice on optimising the current approach, that would bring me closer to the best solution?
Many thanks in advance.
Solution
Implementation of suggested solutions:
Casperah’s approach using Dictionaries processed the data set in 312 ms.
yamen’s approach using a LINQ Lookup processed the data set in 452 ms.
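For readers who have not seen a lookup-based version, here is a rough sketch of what one could look like. It is not yamen's actual code: the class and method names (`LookupReport`, `Build`) are made up for illustration, the inline XML is abbreviated from the sample in the question, and it assumes both documents fit in memory because it loads them with `XDocument`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not taken from the original answers.
static class LookupReport
{
    public static List<string> Build(XDocument languages, XDocument products)
    {
        // Index: country code -> names of the languages spoken there.
        ILookup<string, string> languagesByCountry = languages.Root
            .Elements("language")
            .SelectMany(l => l.Elements("country"),
                        (l, c) => new { Country = c.Value, Language = (string)l.Element("name") })
            .ToLookup(x => x.Country, x => x.Language);

        return products.Root.Elements("product")
            .Select(p => new
            {
                Name = (string)p.Element("name"),
                // A lookup returns an empty sequence for unknown country codes.
                Languages = p.Elements("country")
                    .SelectMany(c => languagesByCountry[c.Value])
                    .Distinct()
                    .OrderBy(l => l, StringComparer.Ordinal)
                    .ToList()
            })
            .Where(p => p.Languages.Count > 0) // skip products with no known language
            .Select(p => p.Name + " -> " + string.Join(", ", p.Languages))
            .ToList();
    }
}

static class LookupDemo
{
    static void Main()
    {
        // Abbreviated sample data (assumption: same shape as the files in the question).
        var languages = XDocument.Parse(
            "<languages>" +
            "<language><name>English</name><country>3</country><country>12</country></language>" +
            "<language><name>French</name><country>3</country><country>13</country></language>" +
            "<language><name>Spanish</name><country>1</country><country>3</country></language>" +
            "</languages>");
        var products = XDocument.Parse(
            "<products>" +
            "<product><name>Screws</name><country>3</country><country>12</country><country>29</country></product>" +
            "<product><name>Hammers</name><country>1</country><country>13</country></product>" +
            "<product><name>Levels</name><country>19</country></product>" +
            "</products>");

        foreach (var line in LookupReport.Build(languages, products))
            Console.WriteLine(line);
        // Screws -> English, French, Spanish
        // Hammers -> French, Spanish
    }
}
```

The lookup is built once, so each product only pays for its own country codes rather than for a join across every table.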
You have two problems: memory usage and CPU usage.
To limit memory usage you can use XmlReader, which reads only a small chunk of a huge XML file at a time.
To limit CPU usage you should have an index on the country code.
I would do it like this:
1. Read all languages and insert them into a dictionary like this:
// The key is country, the value is a list of languages.
Dictionary<string, List<string>> countries = new Dictionary<string, List<string>>();
2. Read products one at a time using XmlReader
3. Look up each product's countries in the dictionary and write out the languages, perhaps using a HashSet to avoid duplicate languages.
That would be my approach. Good luck!
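The three steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the names `PackagingReport`, `LoadCountryIndex` and `BuildReport` are hypothetical, the inline XML is abbreviated from the sample in the question, country codes are treated as strings, and a `SortedSet` stands in for the suggested `HashSet` so that the languages come out de-duplicated and sorted in one step.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper illustrating the dictionary + XmlReader idea.
static class PackagingReport
{
    // Step 1: build the index country -> languages from the languages file.
    public static Dictionary<string, List<string>> LoadCountryIndex(XmlReader reader)
    {
        var countries = new Dictionary<string, List<string>>();
        string current = null, language = null;
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element) current = reader.Name;
            else if (reader.NodeType == XmlNodeType.EndElement) current = null;
            else if (reader.NodeType == XmlNodeType.Text)
            {
                if (current == "name") language = reader.Value;
                else if (current == "country")
                {
                    if (!countries.TryGetValue(reader.Value, out var spoken))
                        countries[reader.Value] = spoken = new List<string>();
                    spoken.Add(language);
                }
            }
        }
        return countries;
    }

    // Steps 2 and 3: stream products one at a time and look up their countries.
    public static IEnumerable<string> BuildReport(
        XmlReader reader, Dictionary<string, List<string>> countries)
    {
        string current = null, product = null;
        var languages = new SortedSet<string>(StringComparer.Ordinal);
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    current = reader.Name;
                    break;
                case XmlNodeType.Text:
                    if (current == "name") product = reader.Value;
                    else if (current == "country" &&
                             countries.TryGetValue(reader.Value, out var spoken))
                        languages.UnionWith(spoken);
                    break;
                case XmlNodeType.EndElement:
                    current = null;
                    if (reader.Name == "product")
                    {
                        // Products with no known language are skipped.
                        if (languages.Count > 0)
                            yield return product + " -> " + string.Join(", ", languages);
                        languages.Clear();
                    }
                    break;
            }
        }
    }
}

static class Program
{
    static void Main()
    {
        // Abbreviated sample data; real code would pass file paths to XmlReader.Create.
        const string languagesXml =
            "<languages>" +
            "<language><name>English</name><country>3</country><country>12</country></language>" +
            "<language><name>French</name><country>3</country><country>13</country></language>" +
            "<language><name>Spanish</name><country>1</country><country>3</country></language>" +
            "</languages>";
        const string productsXml =
            "<products>" +
            "<product><name>Screws</name><country>3</country><country>12</country><country>29</country></product>" +
            "<product><name>Hammers</name><country>1</country><country>13</country></product>" +
            "<product><name>Levels</name><country>19</country></product>" +
            "</products>";

        Dictionary<string, List<string>> index;
        using (var reader = XmlReader.Create(new StringReader(languagesXml)))
            index = PackagingReport.LoadCountryIndex(reader);

        using (var reader = XmlReader.Create(new StringReader(productsXml)))
            foreach (var line in PackagingReport.BuildReport(reader, index))
                Console.WriteLine(line);
        // Screws -> English, French, Spanish
        // Hammers -> French, Spanish
    }
}
```

Only the language index is held in memory; the products file is consumed node by node, so its size should not affect memory use much.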
I have created this example:
It produces this output:
XmlReader.Create takes a URI, so you could also pass something like "http://www.mysite.com/countries.xml".