Editorial Team
Asked: May 31, 2026

I have several large (30+ million lines) text databases which I am cleaning up

I have several large (30+ million lines) text databases which I am cleaning up with the following code. I need to split each file into parts of 1 million lines or fewer and retain the header line in every part. I have looked at chunk and itertools but can’t find a clear solution. It is for use in an ArcGIS model.

== updated code as per response from icyrock.com

import arcpy, os
#fc = arcpy.GetParameter(0)
#chunk_size = arcpy.GetParameter(1) # number of records in each dataset

fc='input.txt'
Name = fc[:fc.rfind('.')]
fl = Name+'_db.txt'

with open(fc) as f:
  lines = f.readlines()
lines[:] = lines[3:]
lines[0] = lines[0].replace('Rx(db)', 'Rx_'+Name)
lines[0] = lines[0].replace('Best Unit', 'Best_Unit')
records = len(lines)
with open(fl, 'w') as f: #where N is the chunk number
  f.write('\n'.join(lines))

with open(fl) as file:
  lines = file.readlines()

headers = lines[0:1]
rest = lines[1:]
chunk_size = 1000000

def chunks(lst, chunk_size):
  for i in xrange(0, len(lst), chunk_size):
    yield lst[i:i + chunk_size]

def write_rows(rows, file):
  for row in rows:
    file.write('%s' % row)

part = 1
for chunk in chunks(rest, chunk_size):
  with open(Name+'_%d' % part+'.txt', 'w') as file:
    write_rows(headers, file)
    write_rows(chunk, file)
  part += 1

See “Remove specific lines from a large text file in python” and “split a large text (xyz) database into x equal parts” for background. I no longer want a Cygwin-based solution, as it overcomplicates the model; I need a Pythonic way. We can use “records” to iterate through, but what is not clear is how to specify lines 1 to 999,999 in db #1, lines 1,000,000 to 1,999,999 in db #2, etc. It’s fine if the last dataset has fewer than 1m records.

Error with a 500 MB file (I have 16 GB of RAM):

Traceback (most recent call last):
  File "P:\2012\Job_044_DM_Radio_Propogation\Working\test\clean_file.py", line 10, in <module>
    lines = f.readlines()
MemoryError

records 2249878

The record count above is not the total record count; it is just where the script ran out of memory (I think).

=== With the new code from Icyrock.

The chunking seems to work OK, but the output gives errors when used in ArcGIS.

Start Time: Fri Mar 09 17:20:04 2012
WARNING 000594: Input feature 1945882430: falls outside of output geometry domains.
WARNING 000595: d:\Temp\cb_vhn007_1.txt_Features1.fid contains the full list of features not able to be copied.

I know it is an issue with the chunking, as the “Make Event Layer” process works fine with the full, pre-chunk dataset.

Any ideas?


1 Answer

  1. Editorial Team
    Added an answer on May 31, 2026 at 1:31 am

    You can do something like this:

    with open('file') as fin:
      lines = fin.readlines()  # loads the whole file into memory
    
    headers = lines[0:1]  # the first line is the header
    rest = lines[1:]
    chunk_size = 4
    
    def chunks(lst, chunk_size):
      # Yield successive slices of at most chunk_size items
      for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]
    
    def write_rows(rows, fout):
      # Rows keep their trailing newline, so write them unchanged
      for row in rows:
        fout.write(row)
    
    part = 1
    for chunk in chunks(rest, chunk_size):
      with open('part%d' % part, 'w') as fout:
        write_rows(headers, fout)
        write_rows(chunk, fout)
      part += 1
    

    Here’s a test run:

    $ cat file && python mkt.py && for p in part*; do echo ---- $p; cat $p; done
    header
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    ---- part1
    header
    1
    2
    3
    4
    ---- part2
    header
    5
    6
    7
    8
    ---- part3
    header
    9
    10
    11
    12
    ---- part4
    header
    13
    14
    

    Change chunk_size to suit your data, and adjust the headers slice if the file has more than one header line.

    Credits:

    • https://stackoverflow.com/a/312464/438544
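
As a sketch of the header adjustment mentioned above, the in-memory version can take the header count as a parameter. This is an illustration only, not part of the original answer; the function name chunk_with_headers and the parameter headers_count are hypothetical, and it still assumes the whole file fits in memory.

```python
def chunk_with_headers(lines, chunk_size, headers_count=1):
    """Yield lists made of the header lines plus up to chunk_size data lines."""
    headers = lines[:headers_count]
    rest = lines[headers_count:]
    for start in range(0, len(rest), chunk_size):
        # Part k holds data lines k*chunk_size .. (k+1)*chunk_size - 1
        yield headers + rest[start:start + chunk_size]


# Same shape as the test run above: one header line and 14 data lines
lines = ['header\n'] + ['%d\n' % i for i in range(1, 15)]
parts = list(chunk_with_headers(lines, chunk_size=4))
```

Each element of parts can then be written to its own file; the last one is simply shorter when the data does not divide evenly.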

    Edit – to do this line-by-line to avoid memory issues, you can do something like this:

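    from itertools import islice
    
    headers_count = 5
    chunk_size = 250000
    
    with open('file') as fin:
      # Read the header lines once; they are repeated in every part
      headers = list(islice(fin, headers_count))
    
      part = 1
      while True:
        # Lazy view of the next chunk_size lines; nothing is read yet
        line_iter = islice(fin, chunk_size)
        try:
          # next() works on Python 2.6+ and 3, unlike line_iter.next()
          first_line = next(line_iter)
        except StopIteration:
          break  # input exhausted, so no empty part is created
        with open('part%d' % part, 'w') as fout:
          for line in headers:
            fout.write(line)
          fout.write(first_line)
          for line in line_iter:
            fout.write(line)
        part += 1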
    

    Credits:

    • Python how to read N number of lines at a time
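
The line-by-line approach can also be wrapped in a function for reuse. The following is a minimal sketch, not part of the original answer; the name split_with_header, its parameters, and the output naming pattern are all hypothetical choices made for illustration.

```python
from itertools import islice


def split_with_header(src_path, out_pattern, chunk_size, header_lines=1):
    """Split src_path into parts of at most chunk_size data lines.

    Every part starts with a copy of the first header_lines lines.
    out_pattern is a format string such as 'part{}.txt' (hypothetical naming).
    Returns the list of output paths that were written.
    """
    written = []
    with open(src_path) as fin:
        headers = list(islice(fin, header_lines))
        part = 1
        while True:
            chunk = islice(fin, chunk_size)
            first = next(chunk, None)  # None means the input is exhausted
            if first is None:
                break
            out_path = out_pattern.format(part)
            with open(out_path, 'w') as fout:
                fout.writelines(headers)
                fout.write(first)
                fout.writelines(chunk)  # streams the rest of this chunk
            written.append(out_path)
            part += 1
    return written
```

Only one line is held in memory at a time, apart from the headers, so the file size should not matter. A call such as split_with_header('input_db.txt', 'input_{}.txt', 1000000) would produce input_1.txt, input_2.txt, and so on; those file names are placeholders.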

    Test case (put the above in the file called mkt2.py):

    Make a file containing a 5-line header followed by 1234567 data lines:

    with open('file', 'w') as fout:
      for i in range(5):
        fout.write(10 * ('header %d ' % i) + '\n')
      for i in range(1234567):
        fout.write(10 * ('line %d ' % i) + '\n')
    

    Shell script to test (put in file called rt.sh):

    rm -f part*
    echo ---- file
    head -n7 file
    tail -n2 file
    
    python mkt2.py
    
    for i in part*; do
      echo ---- $i
      head -n7 $i
      tail -n2 $i
    done
    

    Sample output:

    $ sh rt.sh 
    ---- file
    header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
    header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
    header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
    header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
    header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
    line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 
    line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 
    line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 
    line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 
    ---- part1
    header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
    header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
    header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
    header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
    header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
    line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 
    line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 
    line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 
    line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 
    ---- part2
    header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
    header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
    header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
    header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
    header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
    line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 
    line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 
    line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 
    line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 
    ---- part3
    header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
    header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
    header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
    header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
    header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
    line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 
    line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 
    line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 
    line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 
    ---- part4
    header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
    header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
    header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
    header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
    header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
    line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 
    line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 
    line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 
    line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 
    ---- part5
    header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
    header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
    header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
    header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
    header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
    line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 
    line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 
    line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 
    line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 
    

    Timing of the above:

    real    0m0.935s
    user    0m0.708s
    sys     0m0.200s
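
The MemoryError in the question is raised by the readlines() call in the cleaning step, which runs before any splitting, so that step may benefit from streaming as well. Below is a minimal sketch under stated assumptions: the function name clean_file and its parameters are hypothetical, and the three skipped lines and the two header replacements are copied from the question's code. Note also that readlines() keeps each line's trailing newline, so '\n'.join(lines) in the question's code writes a blank line after every record; writing lines unchanged avoids that. Whether those blank lines are related to the ArcGIS warnings is not established here.

```python
def clean_file(src_path, dst_path, name, skip_lines=3):
    """Copy src_path to dst_path line by line.

    Drops the first skip_lines lines, renames two columns in the header
    line that follows them, and returns the number of data lines copied
    (the header itself is not counted).
    """
    with open(src_path) as fin, open(dst_path, 'w') as fout:
        for _ in range(skip_lines):
            next(fin, None)  # discard the leading lines
        header = next(fin, '')
        header = header.replace('Rx(db)', 'Rx_' + name)
        header = header.replace('Best Unit', 'Best_Unit')
        fout.write(header)
        records = 0
        for line in fin:
            fout.write(line)  # the line already ends with '\n'
            records += 1
    return records
```

The cleaned file can then be passed to the line-by-line splitter with a header count of 1.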
    

    Hope this helps.
