I have written a program to clean some financial data I have collected over the past months. It is about 100GB in total and growing every day, and each file is about 1-2GB. It is currently stored in text file format.
The code below is used to clean the data:
static void Main()
{
    string inputString;
    string outputString;
    // other variables omitted
    string[] lineSplit;
    foreach (string fullPath in Directory.GetFiles(inputDirectory))
    {
        using (StreamReader reader = new StreamReader(fullPath)) // read from input file
        {
            while ((line = reader.ReadLine()) != null)
            {
                // logic to clean data
                ...
                ///////////////////////////////////////////////////////////
                using (StreamWriter writer = File.AppendText(outputFile))
                {
                    writer.WriteLine(outputString);
                }
            }
        }
    }
}
It is very slow; I estimate that 100GB of data will take about 3-4 days to process. I know the problem is my IO, as I have no buffering. I am still relatively new to C# and couldn't find any relevant example of building a proper buffer for IO. Most examples I found are for downloading and aren't applicable to reading text files. I also can't load a whole file into memory to process it, as it is too big. How can I do this? Can anyone give me a snippet of code I can use? Thanks.
You're reopening the output file on every single line. Move the loop inside the block that starts by calling File.AppendText, so the writer is opened once per input file instead of once per line.
Of course, this assumes you've got one output file per input file. If that's not the case, and each line can go to a different file within a small collection, you may want to keep all the output files open, and just keep a dictionary (or something similar) so you can quickly write to whichever you want.
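The two approaches described above could be sketched roughly as follows. This is only an illustration under stated assumptions: CleanLine is a hypothetical stand-in for the cleaning logic elided in the question, and the class name, method names, the chooseOutput callback and the command-line arguments are all invented for the example. CleanFile covers the one-output-per-input case, and CleanToManyFiles keeps a dictionary of open writers for the case where lines are routed to a small set of output files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class Cleaner
{
    // Hypothetical placeholder: the real cleaning logic is elided in the question.
    public static string CleanLine(string line)
    {
        return line.Trim();
    }

    // One output file per input file: the writer is opened once,
    // and the read loop runs inside its using block.
    public static void CleanFile(string inputPath, string outputPath)
    {
        using (StreamReader reader = new StreamReader(inputPath))
        using (StreamWriter writer = File.AppendText(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(CleanLine(line));
            }
        }
    }

    // Several output files: keep each writer open in a dictionary keyed by path.
    // chooseOutput is an assumed callback that maps a line to its output path.
    public static void CleanToManyFiles(string inputPath, Func<string, string> chooseOutput)
    {
        var writers = new Dictionary<string, StreamWriter>();
        try
        {
            foreach (string line in File.ReadLines(inputPath))
            {
                string outputPath = chooseOutput(line);
                StreamWriter target;
                if (!writers.TryGetValue(outputPath, out target))
                {
                    target = File.AppendText(outputPath);
                    writers.Add(outputPath, target);
                }
                target.WriteLine(CleanLine(line));
            }
        }
        finally
        {
            // Dispose every writer so buffered output is flushed to disk.
            foreach (StreamWriter open in writers.Values)
            {
                open.Dispose();
            }
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("usage: Cleaner <inputDirectory> <outputDirectory>");
            return;
        }
        // Assumes the output directory is different from the input directory.
        Directory.CreateDirectory(args[1]);
        foreach (string fullPath in Directory.GetFiles(args[0]))
        {
            CleanFile(fullPath, Path.Combine(args[1], Path.GetFileName(fullPath)));
        }
    }
}
```

Because File.AppendText appends, running the sketch twice against the same output directory would duplicate the cleaned lines; delete or rotate the old output first if that matters.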