I want to read a DataFrame from a fixed-width flat file. This is a somewhat performance-sensitive operation.
I would like all whitespace to be stripped from each column value. After the whitespace is stripped, I want blank strings to be converted to NaN or None values. Here are the two ideas I had:
    pd.read_fwf(path, colspecs=markers, names=columns,
                converters=create_convert_dict(columns))

    def create_convert_dict(columns):
        convert_dict = {}
        for col in columns:
            convert_dict[col] = null_convert
        return convert_dict

    def null_convert(value):
        value = value.strip()
        if value == "":
            return None
        else:
            return value

or:

    pd.read_fwf(path, colspecs=markers, names=columns, na_values='',
                converters=create_convert_dict(columns))

    def create_convert_dict(columns):
        convert_dict = {}
        for col in columns:
            convert_dict[col] = col_strip
        return convert_dict

    def col_strip(value):
        return value.strip()

The second option depends on the converter (which strips whitespace) being evaluated before na_values is applied.
I was wondering whether the second one would work. I am curious because it seems better to keep NaN as the null value, as opposed to None.
I am also open to any other suggestions for how I might perform this operation (stripping whitespace and then converting blank strings to NaN).
I do not have access to a computer with pandas installed at the moment, which is why I cannot test this myself.
For a fixed-width file, there is no need to do anything special to strip whitespace or handle missing fields. Below is a small example of a fixed-width file with three columns, each of width 5. It contains leading and trailing whitespace as well as missing data.
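As a minimal sketch of that claim: the sample rows, the column names `A`, `B`, `C` and the `colspecs` below are made up for illustration and are not taken from the original file. It reads the data from an in-memory buffer with plain `pd.read_fwf`, with no converters and no explicit `na_values`.

```python
import io

import pandas as pd

# Hypothetical sample data: three fields per row, each padded to width 5.
# Centered, right-aligned and left-aligned padding gives a mix of
# leading and trailing whitespace; empty strings become all-blank fields.
rows = [("a1", "b1", "c1"), ("a2", "", "c2"), ("", "b3", "")]
text = "\n".join(f"{a:^5}{b:>5}{c:<5}" for a, b, c in rows)

colspecs = [(0, 5), (5, 10), (10, 15)]  # half-open [start, end) offsets
names = ["A", "B", "C"]                 # assumed column names

# No converters, no na_values: rely on read_fwf's defaults.
df = pd.read_fwf(io.StringIO(text), colspecs=colspecs, names=names, header=None)

print(df)
print(df.isna())  # all-blank fields are expected to show up as missing
```

If the defaults behave as described, the padded values come back without surrounding spaces and the blank fields come back as NaN, so neither of the converter approaches should be needed for this case.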