I have a hadoop application that -depending on a parameter- only needs certain (few!)

Question


Asked: June 14, 2026 (2026-06-14T15:03:46+00:00)


I have a Hadoop application that, depending on a parameter, needs only a few of the files in the input directory. My question is: where is the best place (read: as early as possible) to skip the files it does not need? Right now I have customized a RecordReader to take care of that, but I was wondering whether I could skip those files sooner. In my current implementation Hadoop still incurs a huge overhead from the irrelevant files.

Maybe I should add that it is very easy to tell whether I need a given input file: if the filename starts with the parameter, it is needed. Structuring my input directory hierarchically might be a solution, but it is not a likely one for my project, since every file would end up alone in its own directory.


1 Answer

Editorial Team · Answer 1 · 2026-06-14T15:03:47+00:00


I’d suggest filtering the input files by applying an appropriate pattern to the input paths, as described here: https://stackoverflow.com/a/13454344/1050422
Note that this solution doesn’t consider subdirectories. You would need to alter it so that it recursively visits all subdirectories under the base path.
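The selection rule from the question (keep a file only if its name starts with the parameter), together with the recursive descent mentioned above, might look roughly like the sketch below. This is not the code from the linked answer. It is a minimal illustration that uses only `java.nio.file` on a local directory as a stand-in for an HDFS listing, and the names `InputFileSelector`, `isNeeded` and `select` are hypothetical. In a real driver the selected paths could then be handed to the job as its input paths, for example with `FileInputFormat.addInputPath`, so that the unwanted files never become input splits.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: picks the needed input files before the job is configured.
public class InputFileSelector {

    /** A file is needed only if its name starts with the given prefix. */
    static boolean isNeeded(Path file, String prefix) {
        return file.getFileName().toString().startsWith(prefix);
    }

    /** Recursively collects the needed regular files under baseDir, sorted by path. */
    static List<Path> select(Path baseDir, String prefix) {
        try (Stream<Path> walk = Files.walk(baseDir)) {
            return walk.filter(Files::isRegularFile)       // skip directories
                       .filter(p -> isNeeded(p, prefix))   // apply the name rule
                       .sorted()
                       .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: InputFileSelector <baseDir> <prefix>");
            return;
        }
        // Each printed path is a candidate job input path.
        select(Paths.get(args[0]), args[1]).forEach(System.out::println);
    }
}
```

Doing the selection in the driver, before the job is submitted, is the earliest point available: files that are never added as input produce no splits and no map tasks, whereas a RecordReader can only skip a file after a task has already been scheduled for it.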
