When we write a map/reduce program in Java, the mapper emits key-value pairs and the reducer receives the list of values for each key, like
Map(k, v) -> k1, v1
then shuffle and sort happens
then reducer gets it
reduce(k1, List<values>)
to work on. But is it possible to do the same with Python using streaming? I used this as a reference, and it seems the reducer gets its data one line at a time, as supplied on the command line.
In Hadoop Streaming, the mapper writes key-value pairs to sys.stdout. Hadoop does the shuffle and sort and directs the results to the reducer on sys.stdin. How you actually handle the map and the reduce is entirely up to you, so long as you follow that model (map to stdout, reduce from stdin). This is why it can be tested independently of Hadoop via cat data | map | sort | reduce on the command line.

The input to the reducer is the same key-value pairs that were mapped, but it arrives sorted by key, one pair per line. You can iterate through the results and accumulate totals as the example demonstrates, or you can take it further and pass the input to itertools.groupby(), which gives you the equivalent of the k1, List<values> input that you are used to and works well with the reduce() builtin (functools.reduce() in Python 3).

The point is that it's up to you to implement the reduce.
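To make the groupby() idea concrete, here is a minimal sketch of a streaming reducer. It assumes each input line is a key and an integer value separated by a tab, and that the lines are already sorted by key. The names parse and reduce_stream are made up for this illustration, and summing the values is just a stand-in for whatever your reduce actually does.

```python
import sys
from itertools import groupby
from operator import itemgetter


def parse(lines, sep="\t"):
    """Yield (key, value) string pairs from separator-delimited lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition(sep)
        yield key, value


def reduce_stream(lines, sep="\t"):
    """Yield (key, total) for each run of identical keys.

    Assumes the input is sorted by key, which is what the shuffle and
    sort phase (or a plain `sort` on the command line) provides.
    """
    for key, group in groupby(parse(lines, sep), key=itemgetter(0)):
        # `group` plays the role of List<values> for this key.
        values = [int(value) for _, value in group]
        yield key, sum(values)


if __name__ == "__main__":
    for key, total in reduce_stream(sys.stdin):
        print(f"{key}\t{total}")
```

Saved as, say, reducer.py (a hypothetical file name), this can be tried locally with something like cat data | python mapper.py | sort | python reducer.py. Note that groupby() only groups adjacent lines, so the sort step is what makes each key show up exactly once.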