I am looking for a way to speed up file loading like this

Asked: June 10, 2026

I am looking for a way to speed up loading a file with the code below.

The data contains about 1 million lines, tab-separated (the "\t" character) and UTF-8 encoded. Parsing the full file with the code below takes about 9 seconds, but I would like to get it down to about one second.

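import codecs
import sys

def load(filename):
    features = []
    with codecs.open(filename, 'rb', 'utf-8') as f:
        # None marks "no previous line yet"; comparing "" with a tuple
        # only works on Python 2.
        previous = None
        for n, s in enumerate(f):
            splitted = tuple(s.rstrip().split("\t"))
            # Every line must hold exactly two tab-separated fields.
            if len(splitted) != 2:
                sys.exit("wrong format!")
            # Lines must be in strictly increasing order.
            if previous is not None and previous >= splitted:
                sys.exit("unordered feature")
            previous = splitted
            features.append(splitted)
    return features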

I am wondering whether a binary data format could speed things up, or whether I could benefit from NumPy or some other library to get faster loading.

Maybe you could also point out another speed bottleneck?

EDIT: I tried some of your ideas, thanks! By the way, I really need the (string, string) tuples inside the huge list. Here are the results: I'm gaining about 50% of the time 🙂 Next I am going to look into NumPy binary data, as I noticed that another huge file loaded really quickly.

import codecs
import datetime

def load0(filename):
    # Baseline: only read and decode the lines.
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return f.readlines()

def load1(filename):
    # Read everything, then split in a list comprehension.
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return [tuple(x.rstrip().split("\t")) for x in f.readlines()]

def load3(filename):
    # Split line by line while iterating over the file.
    features = []
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for n, s in enumerate(f):
            splitted = tuple(s.rstrip().split("\t"))
            features.append(splitted)
    return features

def load4(filename):
    # Generator variant of load3.
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for s in f:
            yield tuple(s.rstrip().split("\t"))

a = datetime.datetime.now()
r0 = load0(myfile)
b = datetime.datetime.now()
print("f.readlines(): %s" % (b-a))

a = datetime.datetime.now()
r1 = load1(myfile)
b = datetime.datetime.now()
print("""[tuple(x.rstrip().split("\\t")) for x in f.readlines()]: %s""" % (b-a))

a = datetime.datetime.now()
r3 = load3(myfile)
b = datetime.datetime.now()
print("""load3: %s""" % (b-a))
if r1 == r3: print("OK: speeded and similars!")

a = datetime.datetime.now()
r4 = [x for x in load4(myfile)]
b = datetime.datetime.now()
print("""load4: %s""" % (b-a))
if r4 == r3: print("OK: speeded and similars!")

Results:

f.readlines(): 0:00:00.208000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:02.310000
load3: 0:00:07.883000
OK: speeded and similars!
load4: 0:00:07.943000
OK: speeded and similars!

Something very strange: the time can almost double between two consecutive runs (but not every time):

>>> ================================ RESTART ================================
>>> 
f.readlines(): 0:00:00.220000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:02.479000
load3: 0:00:08.288000
OK: speeded and similars!
>>> ================================ RESTART ================================
>>> 
f.readlines(): 0:00:00.279000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:04.983000
load3: 0:00:10.404000
OK: speeded and similars!

EDIT LATEST: I tried switching to numpy.load, and the result is very strange to me. I started from a “normal” file containing my 1022860 strings, at 10 KB.
After doing a numpy.save(numpy.array(load1(myfile))) the file grew to 895 MB! Reloading it with numpy.load() then gives this kind of timing on consecutive runs:

  >>> ================================ RESTART ================================
  loading: 0:00:11.422000 done.
  >>> ================================ RESTART ================================
  loading: 0:00:00.759000 done.

Maybe NumPy does some kind of memory caching to avoid future reloads?
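On the binary-format question raised above, here is a minimal sketch, assuming the standard-library pickle module is an acceptable format and that Python 3 is in use. The helper names save_features and load_features and the cache path are hypothetical. The idea is to parse the text file once, store the resulting list of (string, string) tuples, and reload that on later runs. Whether this beats re-parsing the text has to be measured on the real data.

```python
import pickle

def save_features(features, cache_path):
    """Write the parsed list of tuples to a binary cache file."""
    with open(cache_path, "wb") as f:
        pickle.dump(features, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_features(cache_path):
    """Read the list of tuples back from the binary cache file."""
    with open(cache_path, "rb") as f:
        return pickle.load(f)
```

Unlike a NumPy string array, the pickled list keeps the tuples as they are, so the calling code does not need to change. Only unpickle files you created yourself, since loading an untrusted pickle can execute arbitrary code.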


1 Answer

Editorial Team · Answer 1 · 2026-06-10T20:53:44+00:00

Check how many seconds it takes just to read the lines of the file, like this:

import codecs

def load(filename):
    # Only read and decode the lines; no splitting or checking.
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return f.readlines()
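To see where the time actually goes, the same idea can be extended to time each stage separately. This is a rough sketch for Python 3, not a rigorous benchmark; the function name time_stages and the three-stage split are assumptions made for illustration.

```python
import io
import time

def time_stages(filename):
    """Time reading, splitting, and validating as separate steps."""
    timings = {}

    # Stage 1: read and decode all lines.
    start = time.perf_counter()
    with io.open(filename, encoding="utf-8") as f:
        lines = f.readlines()
    timings["read"] = time.perf_counter() - start

    # Stage 2: split every line into a tuple of fields.
    start = time.perf_counter()
    features = [tuple(line.rstrip().split("\t")) for line in lines]
    timings["split"] = time.perf_counter() - start

    # Stage 3: check the field count and the strict ordering.
    start = time.perf_counter()
    well_formed = all(len(t) == 2 for t in features)
    ordered = all(a < b for a, b in zip(features, features[1:]))
    timings["validate"] = time.perf_counter() - start

    return features, well_formed, ordered, timings
```

The timings dictionary then shows which stage is worth optimizing first.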

If reading alone takes significantly less than 9 seconds, then try using multiprocessing to split the work of checking lines between CPU cores, and/or use a faster interpreter such as PyPy, and see whether either of these speeds things up.
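The multiprocessing suggestion could look roughly like the sketch below, written for Python 3. The names parse_chunk, chunked, and load_parallel, the worker count, and the chunk size are all assumptions for illustration. Sending lines to the workers and tuples back costs pickling time, so this may turn out no faster than a single-process list comprehension; it is worth timing both.

```python
import io
from multiprocessing import Pool

def parse_chunk(lines):
    """Split each line on tabs; this runs inside a worker process."""
    return [tuple(line.rstrip().split("\t")) for line in lines]

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def load_parallel(filename, workers=4, chunk_size=100000):
    """Read the file, parse chunks in parallel, then validate in the parent."""
    with io.open(filename, encoding="utf-8") as f:
        lines = f.readlines()

    # pool.map preserves chunk order, so the original line order is kept.
    with Pool(workers) as pool:
        parts = pool.map(parse_chunk, chunked(lines, chunk_size))
    features = [t for part in parts for t in part]

    # Same checks as the original loader, done once over the full list.
    if any(len(t) != 2 for t in features):
        raise ValueError("wrong format!")
    if any(a >= b for a, b in zip(features, features[1:])):
        raise ValueError("unordered feature")
    return features
```

When running this as a script, call load_parallel from under an `if __name__ == "__main__":` guard, which multiprocessing requires on platforms that spawn fresh worker processes.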
