I want to run a Hadoop unit test using the local filesystem mode. Ideally I would like to see several part-m-* files written out to disk (rather than just one). However, since it is just a test, I don't want to process 64 MB of data (the default size is ~64 MB per block, I believe).
In distributed mode we can set this using
dfs.block.size
I am wondering whether there is a way to get my local file system to write small part-m files out, i.e. so that my unit test will mimic the contents of large-scale data with several (albeit very small) files.
Assuming your input format can handle splittable files (see the
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.isSplitable(JobContext, Path)
method), you can amend the input split size so that a smaller file is processed by multiple mappers (I'm going to assume you're using the new API mapreduce package).
For example, if you're using TextInputFormat (or most input formats that extend FileInputFormat), you can call the static utility methods:
FileInputFormat.setMaxInputSplitSize(Job, long)
FileInputFormat.setMinInputSplitSize(Job, long)
The long argument is the size of the split in bytes, so just set it to your desired size.
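To get a feel for how a value passed to setMaxInputSplitSize turns one small file into several map tasks (and so several part-m-* files in a map-only job), here is a rough plain-Java sketch of the arithmetic. It does not call Hadoop at all. The class and method names are made up for illustration, the 32 MB local block size is an assumption, and the max/min formula and the 1.1 slop factor are my understanding of what the new-API FileInputFormat does, so check them against the version you are running.

```java
// Hypothetical illustration only: none of these names come from Hadoop itself.
public class SplitSizeSketch {

    // Assumed rule: the split size is the block size, capped by the max
    // split size and raised to at least the min split size.
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Counts the splits for one non-empty, splittable file. A final chunk up
    // to 10% larger than splitSize is kept as a single split (assumed slop).
    public static int countSplits(long fileLength, long splitSize) {
        final double splitSlop = 1.1;
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > splitSlop) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the tail of the file becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long blockSize = 32L * 1024 * 1024; // assumed local filesystem block size
        long minSize = 1;                   // smallest allowed split, in bytes
        long maxSize = 1024;                // value you would pass to setMaxInputSplitSize
        long fileLength = 10 * 1024;        // a 10 KB test input file

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        int splits = countSplits(fileLength, splitSize);
        System.out.println("splitSize=" + splitSize + " splits=" + splits);
    }
}
```

Under these assumptions, capping the split size at 1 KB makes a 10 KB test file look like a multi-split input, so each split should get its own mapper without you having to generate a block-sized file.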
Under the hood, these methods set the following job configuration properties:
mapred.min.split.size
mapred.max.split.size
Final note: some input formats may override the
FileInputFormat.getFormatMinSplitSize()
method (which defaults to 1 byte for FileInputFormat), so be wary if you set a value and Hadoop appears to ignore it.
A final point: have you considered MRUnit http://incubator.apache.org/mrunit/ for actual 'unit' testing of your MR code?