Would it make a difference to the number of map tasks spawned by a job if I have a lot of small files (~HDFS block size) vs a few large files?
It depends on which `InputFormat` you use, because the `InputFormat` determines how input splits are computed, and therefore the number of map tasks.

If you use the default `TextInputFormat`, each file gets at least one split, so there is at least one mapper per file, even if the files are only a few kB. Each mapper then does very little work, which adds a lot of overhead for the Map/Reduce framework. That said, if you have a guarantee that these "small" files will be close to the block size, it probably doesn't matter too much.

If you have no control over your files and they might get really small, I would advise using a different `InputFormat` called `CombineFileInputFormat`, which combines several input files into the same split. The number of maps in this case depends only on the overall amount of data, regardless of the number of files. An implementation can be found here.
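
To make the difference concrete, here is a rough back-of-the-envelope sketch in plain Java (no Hadoop dependency). It is not Hadoop's actual split logic: the class and method names (`SplitEstimate`, `perFileSplits`, `combinedSplits`) are made up for illustration, the per-file estimate ignores the small slack Hadoop allows on the last split, and the combined estimate ignores node and rack locality, which can raise the real count. It only assumes that per-file splitting yields at least one split per non-empty file, and that combining packs data up to a maximum split size.

```java
import java.util.Arrays;

// Illustrative estimate only; names and formulas are assumptions, not Hadoop APIs.
final class SplitEstimate {

    // Per-file splitting: each non-empty file yields ceil(size / splitSize)
    // splits, so there is always at least one split (one mapper) per file.
    static long perFileSplits(long[] fileSizes, long splitSize) {
        long splits = 0;
        for (long size : fileSizes) {
            if (size > 0) {
                splits += (size + splitSize - 1) / splitSize;
            }
        }
        return splits;
    }

    // Combined splitting: files are packed together, so the count depends
    // only on the total bytes, roughly ceil(total / maxSplitSize).
    static long combinedSplits(long[] fileSizes, long maxSplitSize) {
        long total = 0;
        for (long size : fileSizes) {
            total += size;
        }
        return (total + maxSplitSize - 1) / maxSplitSize;
    }

    public static void main(String[] args) {
        long block = 128L * 1024 * 1024; // assumed 128 MiB block / split size

        // 10,000 files of 4 KiB each: the "really small files" case.
        long[] tiny = new long[10_000];
        Arrays.fill(tiny, 4L * 1024);

        // 100 files of 120 MiB each: the "close to block size" case.
        long[] nearBlock = new long[100];
        Arrays.fill(nearBlock, 120L * 1024 * 1024);

        System.out.println("tiny, per file:       " + perFileSplits(tiny, block));
        System.out.println("tiny, combined:       " + combinedSplits(tiny, block));
        System.out.println("near-block, per file: " + perFileSplits(nearBlock, block));
        System.out.println("near-block, combined: " + combinedSplits(nearBlock, block));
    }
}
```

Under these assumptions the tiny files go from 10,000 map tasks down to 1 when combined, while the near-block-size files only go from 100 to 94, which is why combining matters mostly when the files are much smaller than a block.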