I have several large (30+ million lines) text databases which I am cleaning up with the code below. I need to split each file into parts of 1 million lines or fewer and retain the header line in each part. I have looked at chunk and itertools but can’t find a clear solution. It is for use in an ArcGIS model.
== updated code as per response from icyrock.com
import arcpy, os
#fc = arcpy.GetParameter(0)
#chunk_size = arcpy.GetParameter(1) # number of records in each dataset
fc='input.txt'
Name = fc[:fc.rfind('.')]
fl = Name+'_db.txt'
with open(fc) as f:
    lines = f.readlines()
    lines[:] = lines[3:]
    lines[0] = lines[0].replace('Rx(db)', 'Rx_'+Name)
    lines[0] = lines[0].replace('Best Unit', 'Best_Unit')
    records = len(lines)
with open(fl, 'w') as f: #where N is the chunk number
    f.write('\n'.join(lines))
with open(fl) as file:
    lines = file.readlines()
    headers = lines[0:1]
    rest = lines[1:]
chunk_size = 1000000
def chunks(lst, chunk_size):
    for i in xrange(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]
def write_rows(rows, file):
    for row in rows:
        file.write('%s' % row)
part = 1
for chunk in chunks(rest, chunk_size):
    with open(Name+'_%d' % part+'.txt', 'w') as file:
        write_rows(headers, file)
        write_rows(chunk, file)
    part += 1
See “Remove specific lines from a large text file in python” and “split a large text (xyz) database into x equal parts” for background. I don’t want a Cygwin-based solution any longer as it overcomplicates the model. I need a Pythonic way. We can use “records” to iterate through, but what is not clear is how to put lines 1 to 999,999 in db #1, lines 1,000,000 to 1,999,999 in db #2, etc. It’s fine if the last dataset has fewer than 1m records.
Error with a 500 MB file (I have 16 GB RAM).
Traceback (most recent call last):
  File "P:\2012\Job_044_DM_Radio_Propogation\Working\test\clean_file.py", line 10, in
    lines = f.readlines()
MemoryError
records 2249878
The record count above is not the total record count; it is just where it ran out of memory (I think).
=== With the new code from Icyrock.
The chunking seems to work OK but gives errors when used in ArcGIS.
Start Time: Fri Mar 09 17:20:04 2012
WARNING 000594: Input feature 1945882430: falls outside of output geometry domains.
WARNING 000595: d:\Temp\cb_vhn007_1.txt_Features1.fid contains the full list of features not able to be copied.
I know it is an issue with the chunking, as the “Make Event Layer” process works fine with the full pre-chunk dataset.
Any ideas?
You can do something like this:
Here’s a test run:
Obviously, change the value of chunk_size and how you fetch headers depending on their count.
Credits:
Edit – to do this line-by-line to avoid memory issues, you can do something like this:
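A minimal sketch of the line-by-line approach, assuming the header occupies the first line(s) of the input and that each part should be named after the input with a part number appended. The function name `split_file` and its parameters `chunk_size` and `header_lines` are illustrative assumptions, not part of any library. Only the header lines and the current line are held in memory at any time:

```python
import os


def split_file(path, chunk_size=1000000, header_lines=1):
    """Split `path` into parts of at most `chunk_size` data lines each,
    repeating the first `header_lines` lines at the top of every part.
    Returns the list of part file names that were written."""
    base, ext = os.path.splitext(path)
    parts = []
    out = None
    try:
        with open(path) as src:
            # Read only the header lines up front; the rest is streamed.
            headers = [src.readline() for _ in range(header_lines)]
            for i, line in enumerate(src):
                if i % chunk_size == 0:
                    # Start a new part: close the previous one, if any,
                    # then open the next file and write the headers to it.
                    if out is not None:
                        out.close()
                    name = '%s_%d%s' % (base, len(parts) + 1, ext)
                    out = open(name, 'w')
                    out.writelines(headers)
                    parts.append(name)
                out.write(line)
    finally:
        # Make sure the last part is closed even if an error occurs.
        if out is not None:
            out.close()
    return parts
```

Calling `split_file('input_db.txt')` would write `input_db_1.txt`, `input_db_2.txt` and so on, with the last part holding whatever lines remain. Because the input is iterated rather than loaded with `readlines()`, memory use should stay roughly constant regardless of file size.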
Credits:
Test case (put the above in a file called mkt2.py):
Make a file containing a 5-line header and 1234567 lines in it:
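One way to build such a test file in Python, as a rough sketch. The function name `make_test_file` and the row contents are illustrative assumptions; any text with the right number of lines would do:

```python
def make_test_file(path, header_lines=5, data_lines=1234567):
    """Write `header_lines` header rows followed by `data_lines`
    numbered data rows, and return the total number of lines written."""
    with open(path, 'w') as f:
        for i in range(header_lines):
            f.write('header %d\n' % (i + 1))
        for i in range(data_lines):
            f.write('line %d\n' % (i + 1))
    return header_lines + data_lines
```

Calling `make_test_file('test.txt')` with the defaults writes 1234572 lines in total (5 header lines plus 1234567 data lines).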
Shell script to test (put in a file called rt.sh):
Sample output:
Timing of the above:
Hope this helps.