I have about 50 million lists of strings in Python like this one:
["1", "1.0", "", "foobar", "3.0", ...]
And I need to turn these into a list of floats and Nones like this one:
[1.0, 1.0, None, None, 3.0, ...]
Currently I use some code like:
    def to_float_or_None(x):
        try:
            return float(x)
        except ValueError:
            return None

    result = []
    for record in database:
        result.append(map(to_float_or_None, record))
The to_float_or_None function is taking in total about 750 seconds (according to cProfile)… Is there a faster way to perform this conversion from a list of strings to a list of floats/Nones?
Update
I had identified the to_float_or_None function as the main bottleneck. I cannot find a significant difference in speed between using map and using a list comprehension.
I applied Paulo Scardine’s tip to check the input, and it already saves 1/4 of the time.
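    def to_float_or_None(x):
        # Cheap pre-check: reject empty strings and anything that does not
        # start with a digit or a dot before paying for the exception.
        if not (x and x[0] in "0123456789."):
            return None
        try:
            return float(x)
        except ValueError:
            return None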
The use of generators was new to me, so thank you for the tip Cpfohl and Lattyware! This indeed speeds up the reading of the file even more, but I was hoping to save some memory by converting the strings to floats/Nones.
The answers given thus far don’t really fully answer the question.
try/except vs. a validating if/then can result in different performance (see: https://stackoverflow.com/a/5591737/456188). To summarize that answer: it depends on the ratio of failures to successes and on the MEASURED time of a failure and of a success in both cases. Basically we can’t answer this for you, but we can tell you how to find out: write an if/then check that tests the same thing as the try/except, optimize it, and then measure how long both versions of to_float_or_None take to fail 100 times and how long both versions take to succeed 100 times.
Side note about the list comprehension issue:
Depending on whether you want to be able to index the results, or just want to iterate over them, a generator expression may actually be even better than a list comprehension (just replace the [] characters with () characters). It takes essentially no time to create, and the actual execution of to_float_or_None (which is the expensive part) can be delayed until the result is needed.
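As a minimal sketch of the difference (the sample record below is made up for illustration, and the code assumes Python 3), the two forms differ only in their brackets, but the generator version does no conversion work until something consumes it:

```python
def to_float_or_None(x):
    """Convert a string to a float, or return None if it cannot be parsed."""
    try:
        return float(x)
    except ValueError:
        return None

# Hypothetical stand-in for one record from the database.
record = ["1", "1.0", "", "foobar", "3.0"]

# List comprehension: every string is converted immediately.
as_list = [to_float_or_None(x) for x in record]

# Generator expression: nothing is converted until values are requested.
as_gen = (to_float_or_None(x) for x in record)

# Consuming the generator yields the same values, one at a time.
consumed = list(as_gen)
```

Note that a generator can only be iterated once; after the `list(as_gen)` call above it is exhausted.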
This is useful for many reasons, but won’t work if you’re going to need to index it. It will, however, allow you to zip the original collection with the generator, so you can still have access to the original string along with its to_float_or_None result.
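Coming back to the measurement advice above, here is one rough way it could be set up with the standard `timeit` module. The names `try_only`, `check_first`, `time_calls` and `expected_cost` are made up for this sketch, the sample inputs are arbitrary, and the weighted average in `expected_cost` is just one simple way to combine the timings with your own failure ratio. Treat the numbers it prints as specific to your machine and data.

```python
import timeit

def try_only(x):
    """Original version: rely on the exception alone."""
    try:
        return float(x)
    except ValueError:
        return None

def check_first(x):
    """Version from the question's update: cheap pre-check, then float()."""
    # Note: this pre-check also rejects strings such as "-1.0",
    # which float() itself would accept.
    if not (x and x[0] in "0123456789."):
        return None
    try:
        return float(x)
    except ValueError:
        return None

def time_calls(func, value, number=100_000):
    """Return the total seconds taken to call func(value) `number` times."""
    return timeit.timeit(lambda: func(value), number=number)

def expected_cost(t_success, t_failure, failure_ratio):
    """Weighted average time per call for a given share of failing inputs."""
    return failure_ratio * t_failure + (1 - failure_ratio) * t_success

if __name__ == "__main__":
    for label, value in [("success", "3.0"), ("failure", "foobar")]:
        for func in (try_only, check_first):
            seconds = time_calls(func, value)
            print(f"{func.__name__:12s} {label}: {seconds:.4f} s")
```

Plugging the measured success and failure times for each version into `expected_cost`, together with the failure ratio you actually see in your records, gives a like-for-like comparison of the two approaches.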