I need to process an input file and copy its content (line by line) to an output file. However, the input file contains some unimportant (stray) data that I need to skip.
The main problem that I am trying to solve is actually more complicated than this, but I am just gonna simplify the problem:
So, I have an input file containing hundreds of thousands of lines.
If the following sequence of 3 lines occurs inside the input file:
A
B
C
then I need to skip these 3 lines and proceed with the next line in the input file. I can only skip these 3 lines if these 3 lines occur as a sequence of consecutive lines.
For example:
Input File:
A
A
B
C
B
P
A
B
C
A
B
A
A
B
C
A
Output file:
A
B
P
A
B
A
A
Clarification:
A
A (skipped)
B (skipped)
C (skipped)
B
P
A (skipped)
B (skipped)
C (skipped)
A
B
A
A (skipped)
B (skipped)
C (skipped)
A
Notice that I can skip the sequence of lines (A, B, C) only if they occur sequentially. All the other lines which are not skipped have to be copied to the output file.
If I use BufferedReader.readLine(), I cannot backtrack to the previous lines if the next line doesn’t match the input pattern. For example, if I have already encountered an A, and the next line is another A (not B), I then have to copy the first A to the output file, restart the filtering from the second A, which I have not processed yet, check the line after it, and so on.
One way that I can think of is to first load the entire input file into memory, so that I can easily backtrack while traversing its contents whenever they don’t match the pattern I am looking for. However, this is not a memory-efficient solution. Is there any clever algorithm to solve this, preferably in a single pass, i.e. O(N) complexity? Or, if this is not possible, what would be the best solution that is still memory-efficient? Some example C / Java code would be really helpful.
You could do this with a 3-element array.
Whenever you encounter an A, check if the first element of the array is empty. If it is not, flush the array to the output file (write out its contents and clear it). Then store the new A in the first element of the array.
Whenever you encounter a B, check if the first element of the array is full and the second is empty. If so, store the new B as the 2nd element of the array. If not, flush the array to the output file and write the new B after it.
For C, repeat the logic for B, shifted by one slot: whenever you encounter a C, check if the 2nd element of the array is full and the 3rd is empty. If so, store the new C as the 3rd element; the array now holds the complete A, B, C sequence, so clear it without writing anything to the output file. If not, flush the array to the output file and write the new C after it.
When you encounter a line that is neither A nor B nor C, flush any existing array elements to the output file, then write the new line directly to the output file. At the end of the input, flush whatever is still in the array.
The main trick here is that you’re defining explicit rules for filling each slot of the buffer array, and using them to avoid re-checking any line matches, while flushing the buffer to the output and resetting the sequence whenever the pattern is broken.
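As a rough illustration, the buffer rules above could be sketched in Java along the following lines. The class name SequenceFilter and the methods filter, flush and writeLine are made up for this example, and the sketch assumes that lines are compared with plain equals() and that the first line of the pattern does not appear again later in the pattern (true for A, B, C). A pattern that overlaps with itself would need a more careful fallback, which this sketch does not attempt.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical class and method names, used only for illustration.
class SequenceFilter {

    // Copies lines from 'in' to 'out', dropping every run of consecutive
    // lines that equals 'pattern' (assumed non-empty).
    // Assumption: pattern[0] does not occur again later in the pattern.
    static void filter(BufferedReader in, Writer out, String[] pattern) throws IOException {
        String[] held = new String[pattern.length]; // the small buffer array
        int count = 0;                              // number of filled slots
        String line;
        while ((line = in.readLine()) != null) {
            if (line.equals(pattern[count])) {
                // The line extends the partial match: hold it back for now.
                held[count++] = line;
                if (count == pattern.length) {
                    count = 0; // complete sequence: discard it, write nothing
                }
                continue;
            }
            // The pattern is broken: the held lines are ordinary data after all.
            flush(held, count, out);
            count = 0;
            if (line.equals(pattern[0])) {
                held[count++] = line; // this line may start a new match
            } else {
                writeLine(out, line);
            }
        }
        flush(held, count, out); // an unfinished match at end of input is kept
    }

    // Writes the first 'count' buffered lines to the output.
    private static void flush(String[] held, int count, Writer out) throws IOException {
        for (int i = 0; i < count; i++) {
            writeLine(out, held[i]);
        }
    }

    // Uses '\n' as the line terminator to keep the example deterministic.
    private static void writeLine(Writer out, String line) throws IOException {
        out.write(line);
        out.write('\n');
    }

    public static void main(String[] args) throws IOException {
        // The example input from the question.
        String input = String.join("\n",
                "A", "A", "B", "C", "B", "P", "A", "B", "C",
                "A", "B", "A", "A", "B", "C", "A");
        StringWriter out = new StringWriter();
        filter(new BufferedReader(new StringReader(input)), out,
                new String[] {"A", "B", "C"});
        System.out.print(out); // prints A, B, P, A, B, A, A, one per line
    }
}
```

Each input line is read once and at most three lines are held at any time, so this stays a single pass with constant extra memory. For real files, the StringReader and StringWriter would be replaced by a FileReader and a BufferedWriter.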
Granted, you mention that your actual ruleset is somewhat more complicated, but the same type of approach should work.