I have a Hadoop application that, depending on a parameter, needs only a few of the files in the input directory. My question is: where is the best place (read: as early as possible) to skip the files that are not needed? Right now I use a customized RecordReader to take care of that, but I was wondering whether I could skip those files sooner. In my current implementation, Hadoop still has a huge overhead due to the irrelevant files.
Maybe I should add that it is very easy to tell whether a certain input file is needed: if the filename starts with the parameter, it is needed. Structuring my input directory hierarchically might be a solution, but it is not a likely one for my project, since every file would end up alone in its own directory.
I’d suggest filtering out the input files by applying an appropriate pattern to the input paths, as described here: https://stackoverflow.com/a/13454344/1050422
Note that this solution doesn’t consider subdirectories. You would need to alter it so that it recursively visits all subdirectories within the base path.
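To illustrate the recursive part, here is a minimal sketch of the selection logic only. It assumes a plain local directory and uses `java.nio` rather than Hadoop's own `FileSystem` API, and the class and method names (`PrefixInputSelector`, `selectInputs`) are made up for this example. In a real job the same prefix check would run in the driver against the Hadoop file system, and each selected path could then be registered with `FileInputFormat.addInputPath`, so the unneeded files never become input splits.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: picks the input files before the job is configured.
class PrefixInputSelector {

    /**
     * Walks baseDir recursively and returns every regular file whose
     * file name starts with the given prefix (the job parameter).
     */
    static List<Path> selectInputs(Path baseDir, String prefix) throws IOException {
        try (Stream<Path> paths = Files.walk(baseDir)) {
            return paths
                    .filter(Files::isRegularFile) // directories are traversed, never returned
                    .filter(p -> p.getFileName().toString().startsWith(prefix))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PrefixInputSelector.java <baseDir> <prefix>
        for (Path p : selectInputs(Paths.get(args[0]), args[1])) {
            System.out.println(p);
        }
    }
}
```

The prefix is tested against the file name only, not the full path, so a directory whose name happens to start with the parameter does not pull in all of its contents.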