I have found a few similar questions here on Stack Overflow, but I believe I could benefit from advice specific to my case.
I must store around 80 thousand lists of real-valued numbers in a file and read them back later.
First, I tried cPickle, but the reading time wasn’t appealing:
>>> stmt = """
with open('pickled-data.dat') as f:
    data = cPickle.load(f)
"""
>>> timeit.timeit(stmt, 'import cPickle', number=1)
3.8195440769195557
Then I found out that storing the numbers as plain text allows faster reading (makes sense, since cPickle must worry about a lot of things):
>>> stmt = """
data = []
with open('text-data.dat') as f:
    for line in f:
        data.append([float(x) for x in line.split()])
"""
>>> timeit.timeit(stmt, number=1)
1.712096929550171
This is a good improvement, but I think I could still optimize it somehow, since programs written in other languages can read similar data from files considerably faster.
Any ideas?
If numpy arrays are workable, numpy.fromfile will likely be the fastest option to read the files (here’s a somewhat related question I asked just a couple days ago).

Alternatively, it seems like you could do a little better with struct, though I haven’t tested it. This assumes that storing the data as 4-byte floats is good enough. If you want a real double precision number, change the format statements from f to d and change nelem*4 to nelem*8. There might be some minor portability issues here (endianness and sizeof datatypes for example).
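A minimal sketch of what the struct approach could look like, under some assumptions that are not in the original post: each list is stored as a 4-byte integer count (nelem) followed by that many 4-byte floats, and the helper names write_lists and read_lists are hypothetical. The '<' prefix in the format strings forces little-endian byte order and standard sizes, which should sidestep the endianness and sizeof concerns mentioned above.

```python
import struct


def write_lists(path, lists):
    """Write each list as a 4-byte count followed by that many 4-byte floats.

    The record layout is an assumption made for this sketch.
    """
    with open(path, 'wb') as f:
        for values in lists:
            nelem = len(values)
            # '<i' = little-endian 4-byte int, '<Nf' = N little-endian 4-byte floats
            f.write(struct.pack('<i', nelem))
            f.write(struct.pack('<%df' % nelem, *values))


def read_lists(path):
    """Read back the lists written by write_lists."""
    data = []
    with open(path, 'rb') as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break  # end of file
            nelem = struct.unpack('<i', header)[0]
            payload = f.read(nelem * 4)  # 4 bytes per float
            data.append(list(struct.unpack('<%df' % nelem, payload)))
    return data
```

For double precision, the same sketch would use 'd' instead of 'f' in both format strings and nelem * 8 instead of nelem * 4. Values that are not exactly representable as 4-byte floats will come back slightly rounded.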