I am looking for a way to speed up loading a file like this:
The data contains about 1 million lines, tab-separated (“\t”) and UTF-8 encoded. It takes about 9 seconds to parse the full file with the code below. However, I would like to get it down to around one second!
import codecs
import sys

def load(filename):
    features = []
    with codecs.open(filename, 'rb', 'utf-8') as f:
        previous = ""
        for n, s in enumerate(f):
            # each line must hold exactly two tab-separated fields
            splitted = tuple(s.rstrip().split("\t"))
            if len(splitted) != 2:
                sys.exit("wrong format!")
            # lines must be in strictly increasing order
            if previous >= splitted:
                sys.exit("unordered feature")
            previous = splitted
            features.append(splitted)
    return features
I am wondering whether a binary data format could speed things up, or whether I could benefit from NumPy or another library to get faster loading.
Maybe you could also point out another speed bottleneck?
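One way to try the binary-format idea with only the standard library is to parse the text file once and cache the resulting list with pickle. The sketch below is written for Python 3 (an assumption; the code in this post is Python 2), and the names `parse_pairs`, `load_cached` and the cache path are hypothetical. Whether unpickling is actually faster than re-parsing is not guaranteed and would have to be measured on the real file.

```python
import io
import os
import pickle


def parse_pairs(text_path):
    """Parse a tab-separated UTF-8 file into a list of tuples of strings."""
    with io.open(text_path, encoding="utf-8") as f:
        return [tuple(line.rstrip().split("\t")) for line in f]


def load_cached(text_path, cache_path):
    """Return the parsed pairs, reusing a pickle cache when it is up to date."""
    # Reuse the cache only if it is at least as recent as the text file.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(text_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Otherwise parse the text file and refresh the cache.
    pairs = parse_pairs(text_path)
    with open(cache_path, "wb") as f:
        pickle.dump(pairs, f, protocol=pickle.HIGHEST_PROTOCOL)
    return pairs
```

A call such as `load_cached(myfile, myfile + ".pickle")` would parse on the first run and unpickle on later runs. Note that a pickle file should only be loaded if it comes from a trusted source.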
EDIT: So I tried some of your ideas, thanks! BTW, I really need the tuples (string, string) inside the huge list… Here are the results; I’m gaining 50% of the time 🙂 Next I am going to look into NumPy binary data, as I have noticed that another huge file was really quick to load…
import codecs
import datetime

# myfile: path to the data file (defined elsewhere)

def load0(filename):
    # raw read only, no parsing
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return f.readlines()

def load1(filename):
    # read all lines, then parse in a list comprehension
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return [tuple(x.rstrip().split("\t")) for x in f.readlines()]

def load3(filename):
    # iterate over the file and append line by line
    features = []
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for n, s in enumerate(f):
            splitted = tuple(s.rstrip().split("\t"))
            features.append(splitted)
    return features

def load4(filename):
    # generator variant
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for s in f:
            yield tuple(s.rstrip().split("\t"))

a = datetime.datetime.now()
r0 = load0(myfile)
b = datetime.datetime.now()
print "f.readlines(): %s" % (b-a)

a = datetime.datetime.now()
r1 = load1(myfile)
b = datetime.datetime.now()
print """[tuple(x.rstrip().split("\\t")) for x in f.readlines()]: %s""" % (b-a)

a = datetime.datetime.now()
r3 = load3(myfile)
b = datetime.datetime.now()
print """load3: %s""" % (b-a)
if r1 == r3: print "OK: speeded and similars!"

a = datetime.datetime.now()
r4 = [x for x in load4(myfile)]
b = datetime.datetime.now()
print """load4: %s""" % (b-a)
if r4 == r3: print "OK: speeded and similars!"
Results:
f.readlines(): 0:00:00.208000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:02.310000
load3: 0:00:07.883000
OK: speeded and similars!
load4: 0:00:07.943000
OK: speeded and similars!
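Since the list comprehension in load1 was the fastest parsing variant in these timings, one option is to keep that shape and run the two checks from the original load as separate passes afterwards. This is only a sketch for Python 3 (an assumption; the code in this post is Python 2). The name `load_checked` is hypothetical, `io.open` replaces `codecs.open` as an untested guess that it is no slower, and `ValueError` replaces `sys.exit` so the caller can decide what to do. Whether the extra passes cost less than the original loop has to be measured.

```python
import io


def load_checked(filename):
    """Load (string, string) pairs, verifying the format and strict ordering."""
    with io.open(filename, encoding="utf-8") as f:
        features = [tuple(line.rstrip().split("\t")) for line in f]
    # Format check: every line must hold exactly two tab-separated fields.
    for number, pair in enumerate(features, 1):
        if len(pair) != 2:
            raise ValueError("wrong format on line %d" % number)
    # Order check: each pair must be strictly greater than the previous one.
    for index in range(1, len(features)):
        if features[index - 1] >= features[index]:
            raise ValueError("unordered feature on line %d" % (index + 1))
    return features
```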
Something very strange: I notice that the time can almost double between two consecutive runs (but not every time):
>>> ================================ RESTART ================================
>>>
f.readlines(): 0:00:00.220000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:02.479000
load3: 0:00:08.288000
OK: speeded and similars!
>>> ================================ RESTART ================================
>>>
f.readlines(): 0:00:00.279000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:04.983000
load3: 0:00:10.404000
OK: speeded and similars!
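Single measurements like these are sensitive to whatever else the machine is doing, so repeating each measurement and keeping the fastest run is a common way to get more stable numbers. Here is a minimal sketch using the standard `timeit` module, written for Python 3 (an assumption; the code in this post is Python 2). The helper name `best_of` is hypothetical.

```python
import timeit


def best_of(func, repeat=5):
    """Call func() `repeat` times and return the fastest time in seconds."""
    # number=1 times a single call per repetition; min() discards runs
    # that were slowed down by other activity on the machine.
    return min(timeit.repeat(func, number=1, repeat=repeat))
```

For example, `best_of(lambda: load1(myfile))` would time load1 five times and report the quickest run.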
EDIT LATEST: Well, I tried to modify it to use numpy.load… and the result is very strange to me. I started from a “normal” file with my 1022860 strings, and 10 KB.
After doing a numpy.save(numpy.array(load1(myfile))) it grew to 895 MB! Then, reloading this with numpy.load(), I get this kind of timing on consecutive runs:
>>> ================================ RESTART ================================
loading: 0:00:11.422000 done.
>>> ================================ RESTART ================================
loading: 0:00:00.759000 done.
Maybe NumPy does some memory caching to avoid future reloads?
Check how many seconds it takes just to read the lines of the file, like
If it is significantly less than 9 seconds, then
and see if any of these speed things up.
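The first check suggested above, timing the raw read separately from the parsing, could be sketched as follows. This is written for Python 3 (an assumption, since the question's code is Python 2, where `time.perf_counter` is not available), and the function name `time_read_and_parse` is hypothetical.

```python
import io
import time


def time_read_and_parse(filename):
    """Return (read_seconds, parse_seconds, pairs) for a tab-separated file."""
    # Step 1: only read and decode the lines.
    start = time.perf_counter()
    with io.open(filename, encoding="utf-8") as f:
        lines = f.readlines()
    read_seconds = time.perf_counter() - start

    # Step 2: only split the already-loaded lines into tuples.
    start = time.perf_counter()
    pairs = [tuple(line.rstrip().split("\t")) for line in lines]
    parse_seconds = time.perf_counter() - start
    return read_seconds, parse_seconds, pairs
```

If `read_seconds` is small compared to the total, the time is going into splitting and tuple creation rather than into I/O.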