I have multiple 1.5 GB CSV files that contain billing information on multiple accounts for a service provider's clients. I am trying to split each large CSV file into smaller chunks so I can process and format the data inside it.
I do not want to roll my own CSV parser, but this is something I haven't seen before, so please correct me if I am wrong. Each 1.5 GB file contains information in the following order: account information, account number, bill date, transactions, ex GST, inc GST, type, and other lines.
Note that bill date here means the date the invoice was made, so occasionally we have more than two bill dates in the same CSV.
Bills are grouped by: Account Number > Bill Date > Transactions.
Some accounts have 10 lines of transaction details; some have over 300,000. A 1.5 GB CSV file contains around 8 million lines of data. I used UltraEdit before to cut and paste the data into smaller chunks, but this has become very inefficient and time-consuming.
I just want to load a large CSV file in my WinForm and click a button that splits it into chunks of, say, no more than 250,000 lines. Some bills are actually bigger than 250,000 lines, in which case they should be kept in one piece. Accounts should not be split across multiple files, since they are ordered anyway. Also, I don't want accounts with multiple bill dates in the same CSV; in that case the splitter can create an additional split.
I already have a WinForm application, written in VS C# 2010, that automatically formats the CSV data in smaller files.
Is it actually possible to process CSV files this large? I have been trying to load them, but I get an OutOfMemoryException that crashes the application every time, and I don't know how to fix it. I am open to suggestions.
Here is what I think I should be doing:
- Load the large CSV file (this currently fails with an OutOfMemoryException). How do I solve this?
- Group the data by account name and bill date, and count the number of lines in each group.
- Then create an array of integers from those counts.
- Pass this array of integers to a file splitter process, which uses it to write the blocks of data.
Any suggestions will be greatly appreciated.
Thanks.
Yeah, about that: running out of memory is going to happen with files this huge if you try to load them whole. You need to take your situation seriously.
As with most problems, break everything into steps.
I have had a similar type of situation before (large data file in CSV format, need to process, etc).
What I did:
Make step 1 of your program suite something that merely cuts your huge file into many smaller files. I have broken 5 GB zipped, PGP-encrypted files (after decryption, which is another headache) into many smaller pieces. You can do something simple like numbering them sequentially (e.g. 001, 002, 003…).
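A minimal sketch of that first step, in case it helps. It is not the code I used; it assumes .NET 4 (for `File.ReadLines`, which streams the file one line at a time instead of loading it whole), and it guesses that the account number and bill date sit in columns 1 and 2, based on the field order in the question. The names `CsvSplitter`, `GroupKey`, and the `part_001.csv` naming are made up for illustration. The naive `Split(',')` breaks on quoted fields that contain commas; `Microsoft.VisualBasic.FileIO.TextFieldParser` is one existing parser you could swap in for that.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class CsvSplitter
{
    // Assumed column positions (zero-based); adjust to the real layout.
    const int AccountColumn = 1;
    const int BillDateColumn = 2;

    // Returns "account|billdate", or null for lines too short to carry a key
    // (those are treated as belonging to the current group).
    static string GroupKey(string line)
    {
        string[] fields = line.Split(',');
        if (fields.Length <= BillDateColumn) return null;
        return fields[AccountColumn] + "|" + fields[BillDateColumn];
    }

    // Splits inputPath into numbered part files and returns how many were written.
    public static int Split(string inputPath, string outputDir, int maxLinesPerFile)
    {
        Directory.CreateDirectory(outputDir);
        int fileIndex = 0;
        int linesInFile = 0;
        StreamWriter writer = null;
        var group = new List<string>();   // lines of the bill currently being read
        string currentKey = null;

        // Writes the buffered bill, opening a new part file first if the bill
        // would push the current part past the limit. A bill larger than the
        // limit still goes into a single file of its own.
        Action flush = () =>
        {
            if (group.Count == 0) return;
            if (writer == null ||
                (linesInFile > 0 && linesInFile + group.Count > maxLinesPerFile))
            {
                if (writer != null) writer.Dispose();
                fileIndex++;
                string name = string.Format("part_{0:000}.csv", fileIndex);
                writer = new StreamWriter(Path.Combine(outputDir, name));
                linesInFile = 0;
            }
            foreach (string l in group) writer.WriteLine(l);
            linesInFile += group.Count;
            group.Clear();
        };

        try
        {
            foreach (string line in File.ReadLines(inputPath))
            {
                string key = GroupKey(line);
                // A change of key means the previous bill is complete.
                if (key != null && currentKey != null && key != currentKey) flush();
                if (key != null) currentKey = key;
                group.Add(line);
            }
            flush();   // write the final bill
        }
        finally
        {
            if (writer != null) writer.Dispose();
        }
        return fileIndex;
    }
}
```

Called as `CsvSplitter.Split(@"C:\data\big.csv", @"C:\data\parts", 250000)` (paths are placeholders). Only one bill is held in memory at a time, so the peak is the largest single bill rather than the whole file. If you need whole accounts kept together rather than individual bills, change `GroupKey` to return only the account column. A header row, if the files have one, would be treated as its own group and end up only in the first part.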
Then make an app to do the input processing. No real business logic here. I hate file I/O with a passion when it comes to business logic, and I love the warm fuzzy feeling of data being in a nice SQL Server DB. That's just me. I created a thread pool with N threads (say 5; you decide how many your machine can handle) to read the .csv part files you created.
Each thread reads one file: a one-to-one relationship. Because it is file I/O, make sure you don't have too many running at the same time. Each thread does the same basic operation: it reads in data, puts it in a basic structure for the DB (table format), does lots of inserts, then ends. I used LINQ to SQL because everything is strongly typed, but to each their own. The better the DB design, the easier it is to do your logic later.
After all threads have finished executing, you have all the data from the original CSV in the database. Now you can do all your business logic from there. Not the prettiest solution, but I was forced into it given my situation, data flow, size, and requirements. You might go with something completely different. Just sharing.
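A rough sketch of the one-file-per-worker idea with a cap on concurrency. This uses `Parallel.ForEach` with `MaxDegreeOfParallelism` from .NET 4 rather than a hand-built thread pool like mine, so treat it as a substitute technique, not what I actually wrote. `PartLoader`, the `searchPattern` argument, and the `insertRow` callback are hypothetical; `insertRow` stands in for whatever does your database inserts (LINQ to SQL or otherwise) and must be safe to call from several threads at once.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class PartLoader
{
    // Reads every part file matching searchPattern, at most maxThreads at a time,
    // and hands each row's fields to insertRow.
    public static void LoadAll(string partsDir, string searchPattern,
                               int maxThreads, Action<string[]> insertRow)
    {
        string[] parts = Directory.GetFiles(partsDir, searchPattern);
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };

        // One part file per worker; each file is streamed line by line.
        Parallel.ForEach(parts, options, path =>
        {
            foreach (string line in File.ReadLines(path))
            {
                // Naive split: assumes no quoted fields containing commas.
                insertRow(line.Split(','));
            }
        });
    }
}
```

`LoadAll` returns only after every part file has been processed, which is the point where all the data is in the database. Inserting row by row is the simplest thing to show here; batching the inserts per file will usually be much faster.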