I have several ~50 GB text files that I need to parse for specific contents. My files' contents are organized in 4-line blocks. To perform this analysis I read in subsections of the file using file.read(chunk_size), split them into blocks of 4 lines, and then analyze them.
Because I run this script often, I've been optimizing it and have tried varying the chunk size. I run 64-bit Python 2.7.1 on OS X Lion on a computer with 16 GB RAM, and I noticed that when I load chunks >= 2^31, instead of the expected text I get large amounts of \x00 repeated. As far as my testing has shown, this continues all the way up to and including 2^32, after which I once again get text. However, it seems to return only as many characters of text as the number of bytes by which the chunk size exceeds 4 GB.
My test code:
for i in range((2**31)-3, (2**31)+3)+range((2**32)-3, (2**32)+10):
    with open('mybigtextfile.txt', 'rU') as inf:
        print '%s\t%r'%(i, inf.read(i)[0:10])
My output:
2147483645 '@HWI-ST550'
2147483646 '@HWI-ST550'
2147483647 '@HWI-ST550'
2147483648 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
2147483649 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
2147483650 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
4294967293 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
4294967294 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
4294967295 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
4294967296 '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
4294967297 '@\x00\x00\x00\x00\x00\x00\x00\x00\x00'
4294967298 '@H\x00\x00\x00\x00\x00\x00\x00\x00'
4294967299 '@HW\x00\x00\x00\x00\x00\x00\x00'
4294967300 '@HWI\x00\x00\x00\x00\x00\x00'
4294967301 '@HWI-\x00\x00\x00\x00\x00'
4294967302 '@HWI-S\x00\x00\x00\x00'
4294967303 '@HWI-ST\x00\x00\x00'
4294967304 '@HWI-ST5\x00\x00'
4294967305 '@HWI-ST55\x00'
What exactly is going on?
Yes, this is a known issue, according to a comment in CPython's source code; you can check it in Modules/_io/fileio.c. The code adds a workaround for 64-bit Microsoft Windows only.
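One possible way to sidestep the problem is to keep every individual read() call below 2^31 and join the pieces yourself. The following is only a minimal sketch, assuming that reads smaller than 2^31 are unaffected (as the test output in the question suggests) and that the file is opened in text mode. The names read_large, MAX_SAFE_READ and piece are hypothetical helpers made up for this illustration, not part of Python.

```python
# Assumption: the largest request size that still returned real text
# in the question's test output.
MAX_SAFE_READ = 2**31 - 1


def read_large(fileobj, size, piece=2**30):
    """Read up to `size` characters using several smaller read() calls.

    `piece` is the size of each individual request; it is kept below
    2**31 so that no single read() hits the problematic range.
    """
    if piece <= 0 or piece > MAX_SAFE_READ:
        raise ValueError("piece must be between 1 and 2**31 - 1")
    parts = []
    remaining = size
    while remaining > 0:
        data = fileobj.read(min(piece, remaining))
        if not data:  # end of file reached
            break
        parts.append(data)
        remaining -= len(data)
    # Assumes a text-mode file, so the pieces are ordinary strings.
    return "".join(parts)
```

With this, something like read_large(inf, 2**32) would issue four 1 GiB requests instead of one 4 GiB request. Note that the joined result still has to fit in memory, so for the 4-line-block parsing described in the question a smaller total size may be the more practical choice.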