We have a Hadoop + HBase cluster on Amazon EMR with the default configuration, so both mapred.child.tmp and hbase.tmp.dir point to /tmp. The cluster has been running for a while, and /tmp is now 500 GB, compared to 70 GB for the actual /hbase data.
A difference this large seems excessive. Are we supposed to periodically delete some of the /tmp data?
After some investigation I found that the largest part of our /tmp data was created by failed MapReduce tasks during Amazon’s automatic backup of HBase to S3. Our successful MapReduce tasks don’t leave much data in /tmp. We have decided to disable Amazon’s automatic backup and implement our own backup script using the HBase tools for importing and exporting tables.
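
For anyone who wants to do the same, here is a minimal sketch of what such a backup script could look like. It is not our exact script. It assumes the `hbase` launcher is on the PATH and uses the stock `org.apache.hadoop.hbase.mapreduce.Export` and `Import` tools. The table names, the bucket `s3://my-backup-bucket/hbase`, and the dated directory layout are made up for illustration.

```python
import subprocess
from datetime import date

# MapReduce-based tools that ship with HBase.
EXPORT_CLASS = "org.apache.hadoop.hbase.mapreduce.Export"
IMPORT_CLASS = "org.apache.hadoop.hbase.mapreduce.Import"


def backup_dir(table, backup_root, day):
    """Return the dated directory for one table (layout is an assumption)."""
    return f"{backup_root.rstrip('/')}/{day.isoformat()}/{table}"


def export_command(table, backup_root, day):
    """Build the command that exports one table to its dated directory."""
    return ["hbase", EXPORT_CLASS, table, backup_dir(table, backup_root, day)]


def import_command(table, backup_root, day):
    """Build the command that loads a dated export back into a table."""
    return ["hbase", IMPORT_CLASS, table, backup_dir(table, backup_root, day)]


def backup_tables(tables, backup_root, day, dry_run=True):
    """Export each table in turn and return the commands that were built."""
    commands = [export_command(t, backup_root, day) for t in tables]
    for cmd in commands:
        if dry_run:
            # Only show what would be executed.
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError when the export job fails,
            # so a failed backup does not go unnoticed.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical table names and bucket; replace with your own.
    backup_tables(["users", "events"], "s3://my-backup-bucket/hbase", date.today())
```

The script defaults to a dry run so the commands can be checked before anything is launched. Note that Import expects the target table to already exist, and that Export is itself a MapReduce job, so it is still worth keeping an eye on /tmp after failed runs.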