I need to read lines from a text file where the ‘end of line’ character is not always \n or \x or a combination of them; it may be any combination of characters, like ‘xyz’ or ‘|’. However, the ‘end of line’ is always the same and known for each type of file.
As the text file may be large and I have to keep performance and memory usage in mind, what seems to be the best solution?
Today I use a combination of string.read(1000) and split(myendofline) or partition(myendofline), but I would like to know whether a more elegant and standard solution exists.
Here’s a generator function that acts as an iterator over a file, cutting the lines according to an exotic newline that is identical throughout the file.
It reads the file in chunks of lenchunk characters and produces the lines found in each current chunk, chunk after chunk. Since the newline is 3 characters long in my example (‘:;:’), a chunk may end in the middle of a newline: this generator function takes care of that possibility and still produces the correct lines.
If the newline is only one character, the function could be simplified; I wrote it only for the most delicate case.
Using this function lets you read a file one line at a time, without reading the entire file into memory.
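A minimal sketch of a generator along these lines, not necessarily the original code: the name liner, the parameter eol and the default chunk size are assumptions made for illustration. Incomplete text is carried over from one chunk to the next, so a separator cut by a chunk boundary is recognised once the next chunk arrives.

```python
def liner(filename, eol, lenchunk=4096):
    """Yield the lines of a text file delimited by the arbitrary string `eol`.

    `liner`, `eol` and the default `lenchunk` are illustrative names/values.
    """
    buf = ''
    # newline='' disables universal-newline translation, so the data is
    # seen exactly as stored in the file.
    with open(filename, newline='') as f:
        while True:
            chunk = f.read(lenchunk)
            if not chunk:
                break
            buf += chunk
            parts = buf.split(eol)
            # The last piece may be an incomplete line, possibly ending with
            # the first characters of a cut separator: keep it for later.
            buf = parts.pop()
            for line in parts:
                yield line
    # Whatever remains after the last separator is the final line.
    if buf:
        yield buf
```

Something like `for line in liner('fofo.txt', ':;:'):` would then iterate over the file while holding only about one chunk plus one pending line in memory.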
Here’s the same function, with print statements here and there to make the algorithm easier to follow.
EDIT
I compared the execution times of my code and of chmullig’s code.
With a file ‘fofo.txt’ of about 10 MB, created with
and measuring the times like this:
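A setup of this kind could be sketched as follows. This is only an assumed reconstruction: the helper names make_test_file and min_time, the random file content and the line lengths are all hypothetical. The timing helper keeps the minimum over several runs, which is the figure least disturbed by other activity on the machine.

```python
import random
import string
import time


def make_test_file(path, n_lines, eol=':;:', seed=0):
    """Write `n_lines` random lowercase lines, each terminated by `eol`.

    Hypothetical helper: with lines of 20 to 80 characters, about
    200000 lines give a file of roughly 10 MB.
    """
    rng = random.Random(seed)
    with open(path, 'w', newline='') as f:
        for _ in range(n_lines):
            length = rng.randint(20, 80)
            f.write(''.join(rng.choices(string.ascii_lowercase, k=length)))
            f.write(eol)


def min_time(func, repeat=5):
    """Call `func` `repeat` times and return the shortest duration in seconds."""
    best = float('inf')
    for _ in range(repeat):
        t0 = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - t0)
    return best
```

Each reader would then be measured with a call such as `min_time(lambda: sum(1 for _ in reader('fofo.txt')))`, where reader stands for whichever line-reading function is being timed.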
I obtained the following minimum times, observed over several trials:
EDIT 2
I changed my generator function:
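A minimal sketch of what a handle-based variant might look like, not the original code: the signature and the default chunk size are assumptions, and the handle is assumed to be opened in text mode (ideally with newline='' so that no newline translation occurs). This version scans the buffer with str.find instead of building a list of pieces.

```python
def liner2(fh, eol, lenchunk=4096):
    """Yield lines delimited by `eol` from an already opened text-mode handle.

    Illustrative sketch: signature and default chunk size are assumptions.
    """
    buf = ''
    # iter() with a sentinel stops when read() returns '' at end of file.
    for chunk in iter(lambda: fh.read(lenchunk), ''):
        buf += chunk
        start = 0
        while True:
            end = buf.find(eol, start)
            if end == -1:
                break              # no complete separator left in the buffer
            yield buf[start:end]
            start = end + len(eol)
        # Keep the unfinished tail; it may end with part of a cut separator.
        buf = buf[start:]
    if buf:
        yield buf
```

Because the caller opens the file, the open() call can sit outside the timed region, and any object with a compatible read() method, such as io.StringIO, can be passed in for quick checks.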
liner2() takes a file handle instead of the file’s name, so the opening of the file can be kept out of the timed section, as it is when timing chmullig’s code. The results, after numerous trials to find the minimum times, are
In fact, opening the file accounts for an infinitesimal part of the total time.