The numpy documentation shows an example of masking existing values with ma.masked a posteriori (after array creation), or creating a masked array from a list of what seem to be valid data types (integers if dtype=int). I am trying to read in data from a file (which requires some text manipulation), but at some point I will have a list of lists (or tuples) containing strings, from which I want to make a numeric (float) array.
An example of the data might be textdata='1\t2\t3\n4\t\t6' (typical flat text format after cleaning).
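For concreteness, here is a rough sketch of how that sample text might end up as the list of tuples described above. The name rows is only illustrative, and real files would likely need more cleaning than this:

```python
# Sample text from above: tab-separated fields, newline-separated rows.
textdata = '1\t2\t3\n4\t\t6'

# Split into rows, then into fields; the empty field survives as ''.
rows = [tuple(line.split('\t')) for line in textdata.split('\n')]

print(rows)
```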
One problem I have is that missing values may be encoded as '' (an empty string), which, when I try to convert to float using the dtype argument, gives me
ValueError: setting an array element with a sequence.
So I’ve created this function
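import numpy.ma as ma

def makemaskedarray(X, missing='', fillvalue='-999.', dtype=float):
    # Substitute a parseable placeholder for each missing marker.
    arr = lambda x: fillvalue if x == missing else x
    # Flag the positions that held the missing marker.
    mask = lambda x: x == missing
    rows = [(list(map(arr, x)), list(map(mask, x))) for x in X]
    triple = dict(zip(('data', 'mask', 'dtype'),
                      list(zip(*rows)) + [dtype]))
    return ma.array(**triple)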
which seems to do the trick:
>>> makemaskedarray([('1','2','3'),('4','','6')])
masked_array(data =
 [[1.0 2.0 3.0]
 [4.0 -- 6.0]],
             mask =
 [[False False False]
 [False  True False]],
       fill_value = 1e+20)
Is this the way to do it? Or is there a built-in function?
The way you’re doing it is fine (though, in my opinion, you could definitely make it a bit more readable by not building the temporary "triple" dict just to expand it a step later).
The built-in way is to use numpy.genfromtxt. Depending on the amount of pre-processing you need to do to your text file, it may or may not do what you need. However, here is a basic example (using StringIO to simulate a file…) and what it yields:
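A minimal sketch of such a call, assuming the tab-delimited sample from the question and that a masked array is wanted (hence usemask=True). The variable names are illustrative, and this is not necessarily the exact example originally posted:

```python
from io import StringIO

import numpy as np

# Sample from the question: tab-delimited, empty field = missing value.
textdata = "1\t2\t3\n4\t\t6"

# StringIO stands in for an open file; usemask=True asks genfromtxt
# to return a masked array rather than filling missing floats with nan.
arr = np.genfromtxt(StringIO(textdata), delimiter="\t", usemask=True)

print(arr)
print(arr.mask)
```

If a plain ndarray with a placeholder is preferred, arr.filled(-999.) should produce one from the masked result.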
One word of caution: if you do use tabs as your delimiter and an empty string as your missing value marker, you’ll have issues with missing values at the start of a line (genfromtxt essentially calls line.strip().split(delimiter), so a leading tab is stripped away before the split). You’d be better off using something like "xxx" as a marker for missing values, if you can.
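A sketch of the marker-based approach, assuming the data can be written with "xxx" in place of empty fields. The sample text here is made up for illustration; missing_values and usemask are genfromtxt parameters:

```python
from io import StringIO

import numpy as np

# Made-up sample: "xxx" marks missing values, including one at line start.
textdata = "xxx\t2\t3\n4\txxx\t6"

# Treat "xxx" as missing in every column and return a masked array.
arr = np.genfromtxt(StringIO(textdata), delimiter="\t",
                    missing_values="xxx", usemask=True)

print(arr)
print(arr.mask)
```

Because the marker is an ordinary non-whitespace token, stripping the line cannot swallow a missing first field.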