I’ve been tasked with processing multiple terabytes worth of SCM data for my company. I set up a hadoop cluster and have a script to pull data from our SCM servers.
Since I’m processing data with batches through the streaming interface, I came across an issue with the block sizes that O’Reilly’s Hadoop book doesn’t seem to address: what happens to data straddling two blocks? How does the wordcount example get around this? To get around the issue so far, we’ve resorted to making our input files smaller than 64mb each.
The issue came up again when thinking about the reducer script; how is aggregated data from the maps stored? And would the issue come up when reducing?
This should not be an issue providing that each block can cleanly break a part the data for the splits (like by line break). If your data is not a line by line data set then yes this could be a problem. You can also increase the size of your blocks on your cluster too (dfs.block.size).
You can also customize in your streaming how the inputs are going into your mapper
http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+the+Way+to+Split+Lines+into+Key%2FValue+Pairs
Data from the map step gets sorted together based on a partioner class against the key of the map.
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29
The data is then shuffled together to make all the map keys get together and then transferred to the reducer. Sometimes before the reducer step happens a combiner comes in if you like.
Most likely you can create your own custom -inputreader (here is example of how to stream XML documents http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html)