When should I use the MultithreadedMapper?
Will my job run faster with MultithreadedMapper if my application is pure computation (that is, the mappers do not wait on any high-latency operations)?
It depends, but I would avoid using MultithreadedMapper as a first solution.
It is usually better to scale with single-threaded mappers by launching more of them simultaneously, so that they work on multiple inputs in parallel. The more cores you have, the higher you can set your mapred.tasktracker.map.tasks.maximum value. Of course, you will need beefier machines for this.
My understanding is that MultithreadedMapper is useful when you are I/O bound, for example when fetching pages from the web, which has much higher latency than local I/O. In that case MultithreadedMapper helps because you are not blocked on a single network I/O call and can continue processing as data becomes available.
But if you have large data in HDFS to process, it is fetched quickly because the data is local to the task, and if the computation is CPU bound then a multi-core, multi-process solution is more helpful.
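To illustrate why threads help with latency-bound work, here is a minimal sketch in plain Java. It does not use Hadoop at all. `fetch` is a hypothetical stand-in for a blocking network call, simulated with `Thread.sleep`, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class LatencyBoundDemo {
    // Hypothetical stand-in for a blocking network call: the thread
    // just waits, it does no CPU work.
    static int fetch(int recordId, long latencyMillis) throws InterruptedException {
        Thread.sleep(latencyMillis);
        return recordId * 2;
    }

    // Process records one after another; returns elapsed milliseconds.
    static long runSequential(int records, long latencyMillis) throws Exception {
        long start = System.nanoTime();
        for (int i = 0; i < records; i++) {
            fetch(i, latencyMillis);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Process records on a fixed pool of threads so the waits overlap;
    // returns elapsed milliseconds.
    static long runThreaded(int records, long latencyMillis, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long start = System.nanoTime();
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < records; i++) {
                final int id = i;
                futures.add(pool.submit(() -> fetch(id, latencyMillis)));
            }
            for (Future<Integer> f : futures) {
                f.get(); // wait for every simulated fetch to finish
            }
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("sequential ms: " + runSequential(8, 50));
        System.out.println("threaded ms:   " + runThreaded(8, 50, 8));
    }
}
```

The threaded run is faster only because the threads spend their time waiting. If `fetch` were pure computation, the threads would compete for the same cores and the gain would be limited by the number of cores available to the task.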
Also, you will have to ensure that your mappers are thread safe.
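As a rough sketch of what thread safety means here, the plain-Java example below (again without Hadoop) shows state that is touched by several threads at once. `CountingMapLogic` is a hypothetical stand-in for map logic with a shared counter; whether your own state is actually shared between threads depends on how your mapper is written (static fields and shared helper objects are typical cases).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class SharedStateDemo {
    // Hypothetical map logic whose counter is shared by all worker threads.
    // AtomicLong makes the increment safe; a plain long with ++ could lose
    // updates when several threads call map() at the same time.
    static final class CountingMapLogic {
        private final AtomicLong recordsSeen = new AtomicLong();

        void map(String value) {
            recordsSeen.incrementAndGet();
        }

        long recordsSeen() {
            return recordsSeen.get();
        }
    }

    // Run the same logic object from several threads and return the final count.
    static long process(int threads, int recordsPerThread) throws InterruptedException {
        CountingMapLogic logic = new CountingMapLogic();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < recordsPerThread; i++) {
                    logic.map("record");
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        return logic.recordsSeen();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(process(8, 100_000));
    }
}
```

The same idea applies to anything else that is shared: either keep it local to each thread, make it immutable, or guard it with atomics or synchronization.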