Is there any way to set and (later) get a custom configuration object in Hadoop, during Map/Reduce?
For example, assume an application that preprocesses a large file and dynamically determines some characteristics of that file. Furthermore, assume that those characteristics are saved in a custom Java object (e.g., a Properties object, but not exclusively, since some may not be strings) and are subsequently needed by each of the map and reduce tasks.
How could the application “propagate” this configuration, so that each mapper and reducer function can access it, when needed?
One approach could be to use the set(String, String) method of the JobConf class and, for instance, pass the configuration object serialized as a JSON string via the second parameter. However, this may be too much of a hack, and each Mapper and Reducer would still need to access the appropriate JobConf instance anyway (e.g., following an approach like the one suggested in an earlier question).
Unless I’m missing something, if you have a Properties object containing every property you need in your M/R job, you simply need to write the content of the Properties object to the Hadoop Configuration object, for example by calling set(String, String) on it for each property.

Then inside your M/R job, you can use the Context object (via its getConfiguration() method) to get back your Configuration in both the mapper (the map function) and the reducer (the reduce function).

Note that when using the Configuration object, you can also access the Context in the setup and cleanup methods, which is useful for doing some initialization if needed.

Also it’s worth mentioning you could probably directly call the addResource method from the Configuration object to add your properties directly as an InputStream or a file, but I believe this has to be an XML configuration like the regular Hadoop XML configs, so that might just be overkill.

EDIT: In case of non-String objects, I would advise using serialization: you can serialize your objects and then convert them to Strings (probably encoding them, for example with Base64, as I’m not sure what would happen if you have unusual characters), and then on the mapper/reducer side deserialize the objects from the Strings you get from the properties inside Configuration.

Another approach would be to use the same serialization technique, but instead write to HDFS, and then add these files to the DistributedCache. Sounds a bit overkill, but this would probably work.
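To make the two string-based ideas from this answer concrete, here is a minimal sketch using only the Java standard library. The class name ConfigCodec and its methods copyInto, encode and decode are hypothetical names made up for illustration; they are not part of Hadoop. The BiConsumer parameter is an assumption standing in for Hadoop's Configuration.set(String, String), so the sketch can be compiled and tried without a cluster. It also assumes the objects to transfer implement java.io.Serializable and that their classes are available to the tasks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import java.util.Properties;
import java.util.function.BiConsumer;

// Hypothetical helper, not a Hadoop class.
public class ConfigCodec {

    // Copies every String-valued property into a (key, value) sink.
    // Assumption: in a real driver the sink would be something like conf::set.
    public static void copyInto(Properties props, BiConsumer<String, String> sink) {
        for (String name : props.stringPropertyNames()) {
            sink.accept(name, props.getProperty(name));
        }
    }

    // Serializes a non-String object and Base64-encodes the bytes,
    // so the result is plain ASCII and safe to store as a String value.
    public static String encode(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverses encode(): Base64-decodes the String and deserializes the object.
    // The caller casts the result back to the expected type.
    public static Object decode(String encoded) throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        }
    }
}
```

In a driver this would presumably be used as ConfigCodec.copyInto(props, conf::set) before submitting the job, and on the task side as ConfigCodec.decode(context.getConfiguration().get("my.key")), where "my.key" is a placeholder property name. Keep in mind that everything placed in the Configuration is shipped to every task, so this is probably only sensible for small objects; for larger ones the HDFS/DistributedCache route may be the better fit.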