I process several server log files (around 40) and collect a bunch of metrics using Apache Hadoop. If one or more of those files turn out to be inconsistent or corrupted, I would like to exclude all metrics collected from those files but keep the metrics from the other files.
What do you think would be the smartest way to do this?
When loading each file, enrich every row with an identifier for the file it came from (a hash of the filename would do). Once you know which files are corrupt or inconsistent, you know which identifiers to reject. If you need to keep the bad data around (and just avoid processing it), filter out rows with those identifiers in the jobs that compute the metrics. Otherwise, run a second-pass ‘scrubbing’ map/reduce that removes those rows from the data set altogether.