I’ve been tasked with processing multiple terabytes worth of SCM data for my company.


Asked: May 16, 2026 (2026-05-16T05:56:23+00:00)


I’ve been tasked with processing multiple terabytes worth of SCM data for my company. I set up a Hadoop cluster and have a script to pull data from our SCM servers.

Since I’m processing the data in batches through the streaming interface, I came across an issue with block sizes that O’Reilly’s Hadoop book doesn’t seem to address: what happens to data straddling two blocks? How does the wordcount example get around this? To work around the issue so far, we’ve resorted to making each of our input files smaller than 64 MB.

The issue came up again when I was thinking about the reducer script: how is the aggregated data from the maps stored? And would the same issue come up when reducing?


1 Answer

Editorial Team · Answer 1 · 2026-05-16T05:56:24+00:00

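This should not be an issue, provided that the data can be broken apart cleanly into records at the split boundaries (for example, at line breaks). With line-oriented input, a line that straddles two blocks is still delivered whole to a single mapper: the reader for one split reads past the end of its block to finish the last line, and the reader for the next split skips that partial line. If your data is not a line-by-line data set, then yes, this could be a problem. You can also increase the block size on your cluster (dfs.block.size).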
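To make the line-break case concrete, here is a minimal Python sketch of how a line-oriented reader can hand each line to exactly one split even when the line crosses a block boundary. It only simulates the idea on an in-memory byte string; `read_split` and `split_ranges` are hypothetical names for this illustration, not part of Hadoop's API.

```python
import io


def split_ranges(total, block):
    """Yield (start, end) byte ranges that cut `total` bytes into blocks."""
    for start in range(0, total, block):
        yield start, min(start + block, total)


def read_split(data, start, end):
    """Yield the lines owned by the byte range [start, end) of `data`.

    Rule (an assumption modelled on line-oriented readers):
    - a split that does not begin at byte 0 discards everything up to and
      including the first newline, because the previous split reads that line;
    - a split keeps reading while the next line begins at or before `end`,
      so it finishes a line that runs past its own boundary.
    """
    stream = io.BytesIO(data)
    stream.seek(start)
    if start != 0:
        stream.readline()  # partial (or boundary) line belongs to the previous split
    while stream.tell() <= end:
        line = stream.readline()
        if not line:  # end of data
            break
        yield line.rstrip(b"\n")


if __name__ == "__main__":
    data = b"alpha\nbravo\ncharlie\ndelta\n"
    for start, end in split_ranges(len(data), 8):
        print((start, end), list(read_split(data, start, end)))
```

With 8-byte blocks, "bravo" and "charlie" both cross a boundary, yet every line is yielded exactly once: the first split returns alpha and bravo, the second returns charlie, the third returns delta, and the last returns nothing.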

You can also customize, in your streaming job, how lines are split into key/value pairs:

http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+the+Way+to+Split+Lines+into+Key%2FValue+Pairs

Data from the map step is partitioned and sorted by the map output key; a partitioner class decides which reducer each key goes to:

http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29

The data is then shuffled so that all values for the same map key end up together, and is transferred to the reducer. If you like, a combiner can also run on the map output before the reduce step happens.
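The map, partition, sort and reduce flow can be sketched in plain Python. This is only an in-memory simulation under stated assumptions: `mapper`, `partition`, `reducer` and `run_job` are hypothetical names, and the CRC32-based partition function is a stand-in for a hash partitioner, not Hadoop's own implementation. The point it illustrates is that a reducer works on whole key groups rather than byte ranges.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit (word, 1) for every word in a line, as in a word count."""
    for word in line.split():
        yield word, 1


def partition(key, num_reducers):
    """Pick a reducer for a key; equal keys always map to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer(sorted_pairs):
    """Sum the values of each key; input must already be sorted by key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def run_job(lines, num_reducers=2):
    """Simulate map -> partition -> sort -> reduce in memory."""
    buckets = [[] for _ in range(num_reducers)]
    for line in lines:
        for key, value in mapper(line):
            buckets[partition(key, num_reducers)].append((key, value))
    results = {}
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))  # the "sort" part of shuffle and sort
        results.update(reducer(bucket))
    return results


if __name__ == "__main__":
    print(run_job(["a b a", "b c"]))
```

Because every pair with a given key lands in one bucket and is sorted before reduction, the totals do not depend on how many reducers are used or on where the input was split.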

If your records are not line-based, you can most likely create your own custom -inputreader (here is an example of how to stream XML documents: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html).
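The idea behind a marker-based reader for such records can be sketched in Python: delimit records by a begin and an end string instead of by newlines, so a record that arrives in several pieces is still emitted whole. This is a hypothetical stand-in written for illustration, not the Java StreamXmlRecordReader; `read_records` and its parameters are assumed names.

```python
def read_records(chunks, begin, end):
    """Yield each span from `begin` through `end` found in a stream of text chunks.

    `chunks` is any iterable of strings; a record may be spread over
    several chunks, much as a record may be spread over several blocks.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            start = buffer.find(begin)
            if start == -1:
                # Keep only a tail that could be the start of a begin marker.
                keep = len(begin) - 1
                buffer = buffer[-keep:] if keep > 0 else ""
                break
            stop = buffer.find(end, start + len(begin))
            if stop == -1:
                # Record is incomplete: wait for more data.
                buffer = buffer[start:]
                break
            stop += len(end)
            yield buffer[start:stop]
            buffer = buffer[stop:]


if __name__ == "__main__":
    pieces = ["<doc><rec>a</r", "ec>junk<re", "c>b</rec></doc>"]
    for record in read_records(pieces, "<rec>", "</rec>"):
        print(record)
```

In this example both markers are cut in half by the chunk boundaries, and the two records still come out complete.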
