I have a single node instance of Apache Hadoop 1.1.1 with default parameter values (see e.g. [1] and [2]) on the machine with a lot of RAM and very limited free disk space size. Then, I notice that this Hadoop instance wastes a lot of disk space during map tasks. What configuration parameters should I pay attention to in order to take advantage of high RAM capacity and decrease disk space usage?
I have a single node instance of Apache Hadoop 1.1.1 with default parameter values
Share
You can use several of the mapred.* params to compress map output, which will greatly reduce the amount of disk space needed to store mapper output. See this question for some good pointers.
Note that different compression codecs will have different issues (i.e. GZip needs more CPU than LZO, but you have to install LZO yourself). This page has a good discussion of compression issues in Hadoop, although it is a bit dated.
The amount of RAM you need depends upon what you are doing in your map-reduce jobs, although you can increase your heap-size in:
See cluster setup for more details on this.