I am trying to run a MapReduce job on my cluster that only runs on a specific file extension. We have a bunch of heterogeneous data that sits on the cluster, and for this particular job I only want to execute on .jpg files. Is there a way this can be done without restricting it in the mapper? It seems like this should be something easy to do when you execute the job. I’m thinking something like hadoop fs JobName /users/myuser/data/*.jpg /users/myuser/output.
Your example should work as written, but you’ll want to check that your driver passes the input path to the input format through the setInputPaths(Job, String) method, since that is what resolves the glob string “/users/myuser/data/*.jpg” into the individual jpg files in /users/myuser/data.
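To make concrete what "resolving the glob" means, here is a small local sketch in plain Java. It does not use Hadoop at all: it swaps in the JDK's own `PathMatcher` as a stand-in, so treat it only as an illustration of which names a pattern like `*.jpg` selects. Hadoop does its own expansion against the cluster file system, and its glob syntax may differ in details. The class name `GlobDemo`, the `select` helper, and the file names in the listing are all made up for the example.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class: a local analogue of glob resolution, not Hadoop code.
class GlobDemo {

    // Returns the entries of `names` whose last path component matches `glob`
    // (for example "*.jpg"), in their original order.
    static List<String> select(String glob, List<String> names) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> hits = new ArrayList<>();
        for (String name : names) {
            Path fileName = Paths.get(name).getFileName();
            if (fileName != null && matcher.matches(fileName)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up directory listing standing in for /users/myuser/data.
        List<String> listing = Arrays.asList(
                "/users/myuser/data/a.jpg",
                "/users/myuser/data/b.png",
                "/users/myuser/data/notes.txt",
                "/users/myuser/data/c.jpg");

        // Only the two .jpg entries are kept; the mapper never sees the rest.
        System.out.println(select("*.jpg", listing));
    }
}
```

In the real job the filtering happens before any mapper runs, which is why no extension check is needed in the mapper itself. One thing worth checking on your side: if you type the glob unquoted on the command line, your local shell may try to expand it against the local file system first, so quoting the path is the safer habit.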