I am trying to parse tab separated data files generated by our services using Amazon’s Elastic Map Reduce via a Pig program. Things are going well except that all of our data files contain a header row that defines the purpose of each column. Obviously, the (string) headers can’t be cast to numeric data values, so I get warnings from Pig like the following:
2011-03-17 22:49:55,378 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.PigStorage: Unable to interpret value [<snip>] in field being converted to double, caught NumberFormatException <For input string: "headerName"> field discarded
I’ve got a filter after the load statement that tries to ensure that I don’t later operate on any header lines (by filtering out header terms), but I’d like to get rid of the warning noise to avoid masking any potential problems (like actual data fields that don’t cast properly).
Is this possible?
You can do it before submitting Pig job (if possible), or try writing UDF that would emit null values if certain conditions are met, so later You could filter this out.