Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 840051
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 15, 20262026-05-15T05:33:48+00:00 2026-05-15T05:33:48+00:00

Let’s say that I routinely have to work with files with an unknown, but

  • 0

Let’s say that I routinely have to work with files with an unknown, but large, number of lines. Each line contains a set of integers (space, comma, semicolon, or some non-numeric character is the delimiter) in the closed interval [0, R], where R can be arbitrarily large. The number of integers on each line can be variable. Often times I get the same number of integers on each line, but occasionally I have lines with unequal sets of numbers.

Suppose I want to go to Nth line in the file and retrieve the Kth number on that line (and assume that the inputs N and K are valid — that is, I am not worried about bad inputs). How do I go about doing this efficiently in Python 3.1.2 for Windows?

I do not want to traverse the file line by line.

I tried using mmap, but while poking around here on SO, I learned that that’s probably not the best solution on a 32-bit build because of the 4GB limit. And in truth, I couldn’t really figure out how to simply move N lines away from my current position. If I can at least just “jump” to the Nth line then I can use .split() and grab the Kth integer that way.

The nuance here is that I don’t just need to grab one line from the file. I will need to grab several lines: they are not necessarily all near each other, the order in which I get them matters, and the order is not always based on some deterministic function.

Any ideas? I hope this is enough information.

Thanks!

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-15T05:33:48+00:00Added an answer on May 15, 2026 at 5:33 am

    Python’s seek goes to a byte offset in a file, not to a line offset, simply because that’s the way modern operating systems and their filesystems work — the OS/FS just don’t record or remember “line offsets” in any way whatsoever, and there’s no way for Python (or any other language) to just magically guess them. Any operation purporting to “go to a line” will inevitably need to “walk through the file” (under the covers) to make the association between line numbers and byte offsets.

    If you’re OK with that and just want it hidden from your sight, then the solution is the standard library module linecache — but performance won’t be any better than that of code you could write yourself.

    If you need to read from the same large file multiple times, a large optimization would be to run once on that large file a script that builds and saves to disk the line number – to – byte offset correspondence (technically an “index” auxiliary file); then, all your successive runs (until the large file changes) could very speedily use the index file to navigate with very high performance through the large file. Is this your use case…?

    Edit: since apparently this may apply — here’s the general idea (net of careful testing, error checking, or optimization;-). To make the index, use makeindex.py, as follows:

    import array
    import sys
    
    BLOCKSIZE = 1024 * 1024
    
    def reader(f):
      blockstart = 0
      while True:
        block = f.read(BLOCKSIZE)
        if not block: break
        inblock = 0
        while True:
          nextnl = block.find(b'\n', inblock)
          if nextnl < 0:
            blockstart += len(block)
            break
          yield nextnl + blockstart
          inblock = nextnl + 1
    
    def doindex(fn):
      with open(fn, 'rb') as f:
        # result format: x[0] is tot # of lines,
        # x[N] is byte offset of END of line N (1+)
        result = array.array('L', [0])
        result.extend(reader(f))
        result[0] = len(result) - 1
        return result
    
    def main():
      for fn in sys.argv[1:]:
        index = doindex(fn)
        with open(fn + '.indx', 'wb') as p:
          print('File', fn, 'has', index[0], 'lines')
          index.tofile(p)
    
    main()
    

    and then to use it, for example, the following useindex.py:

    import array
    import sys
    
    def readline(n, f, findex):
      f.seek(findex[n] + 1)
      bytes = f.read(findex[n+1] - findex[n])
      return bytes.decode('utf8')
    
    def main():
      fn = sys.argv[1]
      with open(fn + '.indx', 'rb') as f:
        findex = array.array('l')
        findex.fromfile(f, 1)
        findex.fromfile(f, findex[0])
        findex[0] = -1
      with open(fn, 'rb') as f:
        for n in sys.argv[2:]:
          print(n, repr(readline(int(n), f, findex)))
    
    main()
    

    Here’s an example (on my slow laptop):

    $ time py3 makeindex.py kjv10.txt 
    File kjv10.txt has 100117 lines
    
    real    0m0.235s
    user    0m0.184s
    sys 0m0.035s
    $ time py3 useindex.py kjv10.txt 12345 98765 33448
    12345 '\r\n'
    98765 '2:6 But this thou hast, that thou hatest the deeds of the\r\n'
    33448 'the priest appointed officers over the house of the LORD.\r\n'
    
    real    0m0.049s
    user    0m0.028s
    sys 0m0.020s
    $ 
    

    The sample file is a plain text file of King James’ Bible:

    $ wc kjv10.txt
    100117  823156 4445260 kjv10.txt
    

    100K lines, 4.4 MB, as you can see; this takes about a quarter second to index and 50 milliseconds to read and print out three random-y lines (no doubt this can be vastly accelerated with more careful optimization and a better machine). The index in memory (and on disk too) takes 4 bytes per line of the textfile being indexed, and performance should scale in a perfectly linear way, so if you had about 100 million lines, 4.4 GB, I would expect about 4-5 minutes to build the index, a minute to extract and print out three arbitrary lines (and the 400 MB of RAM taken for the index should not inconvenience even a small machine — even my tiny slow laptop has 2GB after all;-).

    You can also see that (for speed and convenience) I treat the file as binary (and assume utf8 encoding — works with any subset like ASCII too of course, eg that KJ text file is ASCII) and don’t bother collapsing \r\n into a single character if that’s what the file has as line terminator (it’s pretty trivial to do that after reading each line if you want).

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

No related questions found

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.