I need to parse a large CSV file in real-time, while it’s being modified (appended) by a different process. By large I mean ~20 GB at this point, and slowly growing. The application only needs to detect and report certain anomalies in the data stream, for which it only needs to store small state info (O(1) space).
I was thinking about polling the file’s attributes (size) every couple of seconds, opening a read-only stream, seeking to the previous position, and then continuing to parse from where I last stopped. But since this is a text (CSV) file, I obviously need to keep track of new-line characters somehow when resuming, to ensure I always parse whole lines.
If I am not mistaken, this shouldn’t be such a problem to implement, but I wanted to know if there is a common way/library which solves some of these problems already?
Note: I don’t need a CSV parser. I need info about a library which simplifies reading lines from a file which is being modified on the fly.
There is a small problem here:
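First thought: keep it open. If both the producer and the analyzer open the file in non-exclusive (shared) mode, it should be possible to ReadLine-until-null, pause, ReadLine-until-null, etc.
That also makes it feasible to track the file position yourself (pos += line.Length + 2, which assumes CRLF line endings). You have to count it yourself because the StreamReader buffers ahead, so the underlying stream's Position won't match what you have actually consumed. Do make sure you open it with Encoding.ASCII, so that one character is exactly one byte. You can then re-open the file as a plain binary Stream, Seek to the last position, and only then attach a StreamReader to that stream.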
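For illustration, here is a rough sketch of the re-open-and-Seek variant. AppendedLineReader and ReadNewLines are made-up names, not part of any library, and the sketch assumes the file is plain ASCII. It differs from the pos += line.Length + 2 bookkeeping in one respect: it reads character by character and only advances the position past a newline it has actually seen. That way it does not depend on CRLF endings, and a line the producer has only half written is left for the next poll.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only.
public static class AppendedLineReader
{
    // Returns the complete lines found after 'position' and advances
    // 'position' past the last newline that was consumed.
    public static List<string> ReadNewLines(string path, ref long position)
    {
        var lines = new List<string>();

        // FileShare.ReadWrite lets the producer keep appending while we read.
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            if (stream.Length <= position)
                return lines; // nothing new (a truncated file is not handled here)

            stream.Seek(position, SeekOrigin.Begin);

            // Assumption: ASCII content, so one char read == one byte in the file.
            // 'false' disables BOM detection so the reader never switches encoding.
            using (var reader = new StreamReader(stream, Encoding.ASCII, false))
            {
                var current = new StringBuilder();
                long pending = 0; // bytes read since the last complete line
                int c;
                while ((c = reader.Read()) != -1)
                {
                    pending++;
                    if (c == '\n')
                    {
                        // Drop the '\r' of a CRLF pair, if there is one.
                        if (current.Length > 0 && current[current.Length - 1] == '\r')
                            current.Length--;

                        lines.Add(current.ToString());
                        current.Clear();
                        position += pending;
                        pending = 0;
                    }
                    else
                    {
                        current.Append((char)c);
                    }
                }
                // Whatever is left in 'current' is an unfinished line; it is
                // not counted in 'position', so the next call re-reads it.
            }
        }
        return lines;
    }
}
```

You would call this from whatever polling loop you already have (a timer, or a sleep of a couple of seconds), keep the long position between calls, and feed the returned lines to your anomaly check. The state stays O(1) apart from the lines returned by a single poll.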