I am managing a Hadoop cluster that is shared between a number of users.

Question

Asked: May 26, 2026 (2026-05-26T21:27:46+00:00)

I am managing a Hadoop cluster that is shared between a number of users. We frequently run jobs with extremely slow mappers. For example, we might have a 32 GB file of sentences (one sentence per line) that we want to run through an NLP parser, which takes, say, 100 ms per sentence. With a block size of 128 MB, this comes to about 250 mappers. That fills our rather small cluster (9 nodes times 12 mappers per node is 108 map slots), and each mapper takes a very long time to complete (hours).
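To make the arithmetic above concrete, here is a rough sizing sketch in plain Java. It assumes binary units (1 GB = 1024 MB) and exactly one map task per block-sized split, so it prints 256 tasks rather than the roughly 250 quoted above; the class and method names are made up for illustration.

```java
// Back-of-the-envelope sizing for the job described in the question.
// Assumptions: binary units, one map task per block-sized split.
class MapWaveEstimate {
    // Integer division rounded up.
    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    // Number of map tasks when every split is splitBytes long.
    static long mapTasks(long fileBytes, long splitBytes) {
        return ceilDiv(fileBytes, splitBytes);
    }

    // Number of "waves" needed to run all tasks on a fixed pool of map slots.
    static long waves(long tasks, long slots) {
        return ceilDiv(tasks, slots);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileBytes = 32L * 1024L * mb;   // 32 GB input file
        long blockBytes = 128L * mb;         // 128 MB block size
        long slots = 9L * 12L;               // 9 nodes, 12 map slots each

        long tasks = mapTasks(fileBytes, blockBytes);
        System.out.println("map slots: " + slots);                // 108
        System.out.println("map tasks: " + tasks);                // 256
        System.out.println("waves:     " + waves(tasks, slots));  // 3
    }
}
```

Because the first wave alone occupies every slot, and each task in it runs for hours, any job submitted afterwards has to wait for a slot to free up.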

The problem is that if the cluster is empty when such a job is started, it takes every map slot on the cluster. Then, if anyone else wants to run a short job, it is blocked for hours. I know that newer versions of Hadoop support preemption in the Fair Scheduler (we are using the Capacity Scheduler), but the newer versions are also not yet stable (I’m anxiously awaiting the next release).

There used to be the option of specifying the number of mappers, but now JobConf is deprecated (strangely, it is not deprecated in 0.20.205). Being able to request more mappers would alleviate the problem because, with more mappers, each map task would work on a smaller data set and thus finish sooner.

Is there any way around this problem in 0.20.203? Do I need to subclass my InputFormat (in this case TextInputFormat)? If so, what exactly do I need to specify?


1 Answer

Editorial Team · Answer 1 · 2026-05-26T21:27:47+00:00

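I believe that you should be able to control this through the block size for these files. By default there is one map task per block, so increasing the block size naturally gives far fewer mappers, while writing the files with a smaller block size gives more mappers, each with less input, which is the direction you want here.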

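Remember also that the job configuration has parameters that affect split size. As far as I know, map.input.length itself is set by the framework for each task to report the length of its split rather than being a tuning knob; the split size is governed by settings such as mapred.min.split.size and mapred.max.split.size. Raising the minimum increases the split size, so that you have, effectively, fewer mappers with larger inputs. Lowering the maximum below the block size does the opposite and gives you more mappers with smaller inputs.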
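The effect of those settings can be sketched without a cluster. The plain-Java example below mirrors, as I understand them, the split-size rules used by the two FileInputFormat variants in the 0.20 line: the older API derives a goal size from the requested number of maps, and the newer API clamps the block size between a minimum and a maximum. The class and method names are made up for illustration, and the property names in the comments are assumptions to verify against your installed version.

```java
// Sketch of how split size is derived (no Hadoop dependency).
// Property names in comments are assumptions for 0.20.x; check your release.
class SplitSizeSketch {
    // Older-API style: requestedMaps plays the role of mapred.map.tasks,
    // minSize the role of mapred.min.split.size.
    static long oldApiSplitSize(long totalSize, int requestedMaps, long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(1, requestedMaps);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Newer-API style: maxSize plays the role of mapred.max.split.size.
    static long newApiSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate split count (ignores the small slack real code allows on the last split).
    static long approxSplits(long totalSize, long splitSize) {
        return (totalSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long total = 32L * 1024L * mb;  // 32 GB file
        long block = 128L * mb;         // 128 MB blocks

        // Capping the split at 16 MB yields many small map tasks.
        long capped = newApiSplitSize(block, 1L, 16L * mb);
        System.out.println(approxSplits(total, capped));   // 2048

        // Requesting 2048 maps through the older API has the same effect.
        long hinted = oldApiSplitSize(total, 2048, 1L, block);
        System.out.println(approxSplits(total, hinted));   // 2048

        // Requesting fewer maps than there are blocks changes nothing:
        // the goal size is capped by the block size.
        long fewer = oldApiSplitSize(total, 100, 1L, block);
        System.out.println(approxSplits(total, fewer));    // 256
    }
}
```

Under these assumptions, a job that wants shorter map tasks can either request more maps (older API) or lower the maximum split size (newer API), without subclassing TextInputFormat. Smaller splits also mean slots are released more often, which gives other users' short jobs a chance to get scheduled.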
