Is there a way to have a whole file sent to a mapper without being split?
I have read this, but I am wondering whether there is another way of doing the same thing without having to generate an intermediate file. Ideally, I would like to use an existing Hadoop command-line option.
I am using the streaming facility with Python scripts on Amazon EMR.
Just set the configuration property mapred.min.split.size to something huge (10G).
Or compress the input file using a codec that isn’t splittable (Gzip). With the .gz extension, TextInputFormat will return false from the isSplitable(FileSystem, Path) method, so the file is never split.
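Since the question uses Python with streaming, here is a rough sketch of both options in Python. The helper names (streaming_command, gzip_file), the jar path, and the input/output locations are placeholders I made up for illustration; only the mapred.min.split.size property and the generic -D option come from Hadoop itself. The 10G value is written out in bytes.

```python
import gzip
import shutil

# 10 GiB in bytes; any value larger than the biggest input file should do.
HUGE_SPLIT_BYTES = 10 * 1024 ** 3


def streaming_command(streaming_jar, input_path, output_path, mapper):
    """Build the argument list for a streaming job with a huge minimum split size.

    All four arguments are placeholders supplied by the caller.
    """
    return [
        "hadoop", "jar", streaming_jar,
        # Generic options such as -D go before the streaming-specific options.
        "-D", f"mapred.min.split.size={HUGE_SPLIT_BYTES}",
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
    ]


def gzip_file(src, dst=None):
    """Write a gzip-compressed copy of src (default name: src + '.gz') and return its path."""
    dst = dst or src + ".gz"
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

The list returned by streaming_command can be passed to subprocess.run on a machine with the hadoop binary on its PATH. For the second option, upload the file produced by gzip_file and use it as the job input in place of the uncompressed file.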