I was just going through the word count example in MapReduce. The map function is very straightforward. Is there a higher-level function that decides which part of the file goes to which mapper?
Suppose you are relying on a function (such as SHA1) that needs the entire file as input. Is there any way to tell the framework not to split files?
When a map slot is free on a node, the scheduler picks the split nearest to that node, to avoid data transfer as much as possible. If an unprocessed input split is stored on the same node as the free map slot, that split is processed. If not, a split in the same rack is chosen, and only if none is available there is a split from outside the rack chosen.
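To make the node-local, then rack-local, then off-rack preference concrete, here is a minimal sketch in plain Java. It is not Hadoop's scheduler code: the `Split` class, the `distance` scoring, and the host and rack names are all hypothetical, and each split is assumed to live on a single host, whereas real HDFS blocks usually have several replicas.

```java
import java.util.List;
import java.util.Optional;

public class LocalityPicker {
    // Hypothetical model of a pending input split and where its data is stored.
    static final class Split {
        final String id, host, rack;
        Split(String id, String host, String rack) {
            this.id = id;
            this.host = host;
            this.rack = rack;
        }
    }

    // 0 = same node, 1 = same rack, 2 = different rack (assumed scoring).
    static int distance(Split s, String host, String rack) {
        if (s.host.equals(host)) return 0;
        if (s.rack.equals(rack)) return 1;
        return 2;
    }

    // Return the pending split closest to the node that has a free map slot.
    // Ties go to the split that appears first in the list.
    static Optional<Split> pick(List<Split> pending, String host, String rack) {
        Split best = null;
        for (Split s : pending) {
            if (best == null || distance(s, host, rack) < distance(best, host, rack)) {
                best = s;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        List<Split> pending = List.of(
            new Split("s1", "nodeC", "rack2"),
            new Split("s2", "nodeB", "rack1"),
            new Split("s3", "nodeA", "rack1"));

        // Free slot on nodeA: s3 is node-local.
        System.out.println(pick(pending, "nodeA", "rack1").get().id); // s3
        // Free slot on nodeD in rack1: nothing node-local, s2 is rack-local.
        System.out.println(pick(pending, "nodeD", "rack1").get().id); // s2
        // Free slot in an unrelated rack: any split will do, the first is taken.
        System.out.println(pick(pending, "nodeZ", "rack9").get().id); // s1
    }
}
```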
Override FileInputFormat#isSplitable() so that it returns false. The input files are then not split, and each file is processed whole by a single mapper.
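A minimal sketch of such an input format, assuming the newer `org.apache.hadoop.mapreduce` API and plain text input. The class name `NonSplittableTextInputFormat` is made up for illustration, and the snippet needs the Hadoop client libraries on the classpath:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical class name; extends TextInputFormat so records are still lines.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: each input file becomes exactly one split, so one mapper.
        return false;
    }
}
```

It would be registered in the driver with `job.setInputFormatClass(NonSplittableTextInputFormat.class)`. Note that the mapper still receives the file line by line, with line terminators stripped, so for something like SHA1 over the exact file bytes you would probably also need a record reader that hands over the raw contents.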