Let's say the input file is:
Hi my name NONE
Hi my name is ABC
Hi my name is ABC
Hi my name is DEF
Hi my name is DEF
Hi my name is XYZ
I have to create the following output:
Hi my name NONE 1
Hi my name is ABC 2
Hi my name is DEF 2
Hi my name is XYZ 1
The number of words in a single line can vary from 2 to 10. The file size will be more than 1 GB.
How can I get the required output in the minimum possible time? My current implementation uses a C++ program to read a line from the file and then compare it with the next line. The running time of this implementation will always be O(n), where n is the number of characters in the file.
To improve the running time, the next option is to use mmap. But before implementing it, I wanted to confirm: is there a faster way to do it, perhaps using another language or a scripting tool?
The perl step is only to take the output of uniq (which looks like “2 Hi my name is ABC”) and re-order it into “Hi my name is ABC 2”. You can use a different language for it, or else leave it off entirely.
As for your question about runtime, big-O seems misplaced here; surely there isn’t any chance of scanning the whole file in less than O(n). mmap and strchr seem like possibilities for constant-factor speedups, but a stdio-based approach is probably good enough unless your stdio sucks.
The code for BSD uniq could be illustrative here. It does a very simple job with fgets, strcmp, and a very few variables.