I have a situation in my code where a huge function parses records line-by-line, validates them, and writes them to another file.
If a record contains errors, the function calls another function that rejects the record and writes out the reject reason.
Due to a memory leak, the program crashes with SIGSEGV. One solution to kind of 'restart' processing from where it crashed was to write the number of the last processed record to a simple file.
To achieve this, the current record number in the processing loop needs to be written to a file on each iteration. How do I make sure that the data in that file is overwritten within the loop?
Does using fseek to the first position / rewind within the loop degrade performance?
The number of records can be large at times (up to 500K).
Thanks.
EDIT: The memory leak has already been fixed. The restart solution was suggested as an additional safety measure and a means to provide a restart mechanism along with a SKIP n records solution. Sorry for not mentioning it earlier.
When faced with this kind of problem, you can adopt one of two methods:
**Method #1** – after every record, write the current record number (or the `ftell` position in the input file) to a separate bookmark file. To ensure that you resume exactly where you left off, so as not to introduce duplicate records, you must `fflush` after every write (to both the bookmark and output/reject files.) This, and unbuffered write operations in general, slow down the typical (no-failure) scenario significantly. For completeness' sake, note that you have three ways of writing to your bookmark file:

- `fopen(..., "w")` / `fwrite` / `fclose` – extremely slow
- `rewind` / `truncate` / `fwrite` / `fflush` – marginally faster
- `rewind` / `fwrite` / `fflush` – somewhat faster; you may skip `truncate` since the record number (or `ftell` position) will always be as long as or longer than the previous record number (or `ftell` position), and will completely overwrite it, provided you truncate the file once at startup (this answers your original question)

**Method #2** – don't `fflush` the files, or at least not so often. You still need to `fflush` the main output file before switching to writing to the rejects file, and `fflush` the rejects file before switching back to writing to the main output file (probably a few hundred or thousand times for a 500k-record input.) On resume, simply remove the last unterminated line from the output/reject files; everything up to that line will be consistent.

I strongly recommend method #2. The writing entailed by method #1 (whichever of the three possibilities) is extremely expensive compared to any additional (buffered) reads required by method #2 (`fflush` can take several milliseconds; multiply that by 500k and you get minutes – whereas counting the number of lines in a 500k-record file takes mere seconds and, what's more, the filesystem cache is working with, not against, you on that.)

**EDIT** Just wanted to clarify the exact steps you need to implement method 2:
**normal operation**

- when writing to the output and rejects files respectively, you only need to flush when switching from writing to one file to writing to the other. Consider the following scenario as an illustration of the necessity of these flushes-on-file-switch: suppose record #100 goes into the output file's buffer, then record #101 is rejected and the rejects file happens to be flushed to disk while record #100 is still sitting in the output buffer; if the program crashes at that point, record #101 is on disk but record #100 is not, and the resume scan would count past the lost record.
- if you are not happy with the interval between the runtime's automatic flushes, you may also do manual flushes every 100 or every 1000 records. This depends on whether processing a record is more expensive than flushing or not (if processing is more expensive, flush often, maybe after each record; otherwise only flush when switching between output/rejects.)
**resuming from failure**

- open each output/rejects file and count its well-formed records, incrementing a counter (say `records_resume_counter`), until you reach the end of file
- remember the position where the last well-formed record ends (using `ftell`); let's call it `last_valid_record_ends_here`
- if the last record is not terminated by a line terminator (`\n` or `\r`), `fseek` back to `last_valid_record_ends_here` and stop reading from this output/rejects file; do not increment the counter; proceed to the next output or rejects file, unless you've gone through all of them
- open the input file and skip `records_resume_counter` records from it
- resume processing, appending to the output/rejects files (at `last_valid_record_ends_here`) – you will have no duplicate, garbage or missing records.