I process several server logfiles (around 40) and collect a bunch of metrics using

Question

Asked: May 26, 2026

I process several server log files (around 40) and collect a bunch of metrics using Apache Hadoop. If one or more of those files is inconsistent or corrupted, I would like to exclude all metrics collected from those files but keep the metrics from the other files.

What do you think would be the smartest way to do this?

1 Answer

Editorial Team · Answer 1 · 2026-05-26T07:23:58+00:00

When loading each file, tag every row with an identifier for the file it came from (for example, a hash of the filename). If you need to keep the corrupt or inconsistent data and simply avoid processing it, you can exclude rows by that identifier in later steps. Otherwise, run a second-pass ‘scrubbing’ map/reduce job that removes every row belonging to a bad file.
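
As a rough illustration of the idea, here is a minimal sketch in plain Python that mimics the two phases without a Hadoop cluster. Everything in it is an assumption made for the example: the three-field log format, the `is_valid` check, the function names, and the sample data are hypothetical. In a real job the file path would come from the framework (for instance from the input split in a Java mapper) rather than being passed in by hand.

```python
import hashlib
from collections import defaultdict


def file_id(path):
    """Short, stable identifier for a source file (hash of its name)."""
    return hashlib.md5(path.encode("utf-8")).hexdigest()[:8]


def is_valid(line):
    """Hypothetical validity check: expects 'timestamp status bytes'."""
    parts = line.split()
    return len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit()


def map_lines(path, lines):
    """Map phase: emit (file_id, kind, value) for every input line."""
    fid = file_id(path)
    for line in lines:
        if is_valid(line):
            # Metric record: bytes served, tagged with its source file.
            yield fid, "bytes", int(line.split()[2])
        else:
            # Marker record: this file contained at least one bad line.
            yield fid, "corrupt", 1


def reduce_records(records):
    """Reduce phase: sum metrics per file, then drop files with any bad line."""
    totals = defaultdict(int)
    bad = set()
    for fid, kind, value in records:
        if kind == "corrupt":
            bad.add(fid)
        else:
            totals[fid] += value
    # The 'scrubbing' step: keep only metrics from files never flagged.
    return {fid: total for fid, total in totals.items() if fid not in bad}


# Made-up sample input: the second file has one garbled line.
logs = {
    "web01.log": ["t1 200 512", "t2 404 128"],
    "web02.log": ["t1 200 256", "###garbled###"],
}
records = [r for path, lines in logs.items() for r in map_lines(path, lines)]
clean = reduce_records(records)
```

Because every record carries its file identifier, one corrupt line is enough to discard all metrics from that file while the totals for the other files are left untouched. If you would rather keep the raw rows, store the set of bad identifiers separately and filter on it when reading.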
