I have a 32 core system. When I run a MapReduce job using Hadoop I never see the java process use more than 150% CPU (according to top) and it usually stays around the 100% mark. It should be closer to 3200%.
Which property do I need to change (and in which file) to enable more workers?
There could be two issues, which I outline below. I’d also like to point out that this is a very common question and you should look at the previously asked Hadoop questions.
Your mapred.tasktracker.map.tasks.maximum could be set low in conf/mapred-site.xml. This is the issue if, when you check the JobTracker, you see several pending tasks but only a few running tasks. Each task is a single thread, so you would hypothetically need a maximum of 32 slots on that node.

Otherwise, your data is likely not being split into enough chunks. Are you running over a small amount of data? It could be that your MapReduce job is running over only a few input splits and thus does not require more mappers. Try running your job over hundreds of MB of data instead and see if you still have the same issue.
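For the first issue, raising the slot limit might look like the fragment below. This is only a sketch: the value 32 is an assumption taken from the core count in the question, and in practice you may want to leave some cores free for reduce tasks and the daemons themselves.

```xml
<?xml version="1.0"?>
<!-- conf/mapred-site.xml (sketch; the value is an assumption, tune it for your node) -->
<configuration>
  <property>
    <!-- Maximum number of map tasks a single TaskTracker runs at the same time -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <!-- 32 matches the 32 cores mentioned in the question -->
    <value>32</value>
  </property>
</configuration>
```

The TaskTracker reads this setting when it starts, so expect to restart it on that node before the new limit shows up in the JobTracker.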
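Hadoop automatically splits your files. The number of blocks a file is split up into is the total size of the file divided by the block size, rounded up. By default, one map task will be assigned to each block (not each file). For example, a 1 GB file with a 64 MB block size is stored as 16 blocks, so it gets 16 map tasks.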
In your conf/hdfs-site.xml configuration file, there is a dfs.block.size parameter. Most people set this to 64 or 128 MB. However, if you are trying to do something tiny, you could lower it to split the work into more pieces.

You can also manually split your file into 32 chunks.
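As a rough sketch of the block size change, the fragment below lowers dfs.block.size to 16 MB. The 16 MB figure is purely an illustrative assumption; pick a value so that your input divides into roughly as many blocks as you want map tasks.

```xml
<?xml version="1.0"?>
<!-- conf/hdfs-site.xml (sketch; 16 MB is an illustrative assumption) -->
<configuration>
  <property>
    <!-- Block size in bytes for newly written files -->
    <name>dfs.block.size</name>
    <!-- 16 * 1024 * 1024 = 16777216 bytes -->
    <value>16777216</value>
  </property>
</configuration>
```

Keep in mind that the block size is fixed when a file is written, so files already in HDFS keep their old block size. You would need to copy the input into HDFS again after changing the setting for it to make a difference.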