Say I have an absurdly large text file. I would not expect my file to grow larger than ~500 MB, but for the sake of scalability and my own curiosity, let’s say it is on the order of a few gigabytes.
My end goal is to map it to an array of sentences (separated by ‘?’ ‘!’ ‘.’ and for all intents and purposes ‘;’) and each sentence to an array of words. I was then going to use numpy for some statistical analysis.
What would be the most scalable way to go about doing this?
PS: I thought of rewriting the file to have one sentence per line, but I ran into problems trying to load the file into memory. I know of the solution where you read chunks of data from one file, manipulate them, and write them to another, but that seems wasteful of disk space. I know, most people would not worry about using 10 GB of scratch space nowadays, but it does seem like there ought to be a way of directly editing chunks of the file.
My first thought would be to use a stream parser: basically you read in the file a piece at a time and do the statistical analysis as you go. This is typically done with markup languages like HTML and XML, so you’ll find a lot of parsers for those languages out there, including in the Python standard library. A simple sentence parser is something you can write yourself, though; for example:
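A minimal sketch of such a sentence parser, assuming the input is plain text and that every '.', '?', '!' or ';' ends a sentence (so abbreviations and ellipses are not handled). The names `sentences` and `block_size` are mine, chosen for illustration:

```python
import io
import re

# Sentence terminators named in the question: '.', '?', '!' and ';'.
TERMINATORS = re.compile(r"[.?!;]")

def sentences(stream, block_size=512):
    """Yield sentences from a text stream, reading block_size characters at a time."""
    buffer = ""
    while True:
        block = stream.read(block_size)
        if not block:
            break  # end of file
        buffer += block
        # Emit every complete sentence currently sitting in the buffer.
        while True:
            match = TERMINATORS.search(buffer)
            if match is None:
                break  # need more data to finish the current sentence
            sentence = buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Anything left over has no terminator; emit it as a final sentence.
    tail = buffer.strip()
    if tail:
        yield tail

# Usage: an in-memory stream stands in for open("big.txt") here.
demo = io.StringIO("Is this one? Yes! This is two; and three.")
words_per_sentence = [s.split() for s in sentences(demo, block_size=8)]
```

Because `sentences` is a generator, you can update running statistics inside the loop instead of collecting everything into a list first.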
This will only read data from the file as needed to complete a sentence. It reads in 512-byte blocks so you’ll be holding less than a kilobyte of file contents in memory at any one time, no matter how large the actual file is.
After a stream parser, my second thought would be to memory map the file. That way you could go through and replace the space that (presumably) follows each sentence terminator with a newline; after that, each sentence would start on a new line, and you’d be able to open the file and use readline() or a for loop to go through it line by line. But you’d still have to worry about multi-line sentences; plus, if any sentence terminator is not followed by a whitespace character, you would have to insert a newline (instead of replacing something else with it), and that could be horribly inefficient for a large file.
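A rough sketch of the memory-mapping idea, assuming a non-empty file in an ASCII-compatible encoding and handling only the easy case where a terminator is followed by a single space. The function name `split_sentences_in_place` is hypothetical, and the file is modified in place, so try it on a copy first:

```python
import mmap

def split_sentences_in_place(path):
    """Overwrite the space after each sentence terminator with a newline."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the OS pages it in as needed.
        with mmap.mmap(f.fileno(), 0) as mapped:
            # One pass per terminator keeps the sketch simple.
            for pattern in (b". ", b"? ", b"! ", b"; "):
                pos = mapped.find(pattern)
                while pos != -1:
                    # Same-size replacement, so nothing has to be shifted.
                    mapped[pos + 1:pos + 2] = b"\n"
                    pos = mapped.find(pattern, pos + 2)
            mapped.flush()
```

Terminators that are not followed by a space are left untouched here, which is exactly the case that would require inserting bytes rather than overwriting them.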