Is there any way to give constructor args to a Mapper in Hadoop? Possibly through some library that wraps the Job creation?
Here’s my scenario:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HadoopTest {

    // Extractor turns a line into a "feature"
    public static interface Extractor {
        public String extract(String s);
    }

    // A concrete Extractor, configurable with a constructor parameter
    public static class PrefixExtractor implements Extractor {
        private int endIndex;
        public PrefixExtractor(int endIndex) { this.endIndex = endIndex; }
        public String extract(String s) { return s.substring(0, this.endIndex); }
    }

    public static class Map extends Mapper<Object, Text, Text, Text> {
        private Extractor extractor;

        // Constructor configures the extractor
        public Map(Extractor extractor) { this.extractor = extractor; }

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String feature = extractor.extract(value.toString());
            context.write(new Text(feature), new Text(value.toString()));
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text val : values) context.write(key, val);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "test");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
As should be clear, since the Mapper is handed to the Job only as a class reference (Map.class), Hadoop instantiates it reflectively through a no-argument constructor, so there is no way to pass a constructor argument and configure a specific Extractor.
There are Hadoop-wrapping frameworks out there like Scoobi, Crunch, Scrunch (and probably many more I don’t know about) that seem to have this capability, but I don’t know how they accomplish it.
EDIT: After working with Scoobi some more, I discovered I was partially wrong about this. If you use an externally defined object in the “mapper”, Scoobi requires that it be serializable, and will complain at runtime if it isn’t. So maybe the right way is just to make my Extractor serializable and deserialize it in the Mapper’s setup method…
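The serialization idea from the EDIT can be sketched with the standard library alone. This is only a minimal sketch under stated assumptions: ExtractorCodec and the configuration key "extractor.serialized" are hypothetical names, and the Hadoop calls appear only in comments so that the snippet runs without a cluster. The driver would store the encoded string with conf.set(...), and the Mapper's setup method would read it back via context.getConfiguration().get(...).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

// Same Extractor contract as in the question, but Serializable.
interface Extractor extends Serializable {
    String extract(String s);
}

class PrefixExtractor implements Extractor {
    private static final long serialVersionUID = 1L;
    private final int endIndex;
    public PrefixExtractor(int endIndex) { this.endIndex = endIndex; }
    @Override public String extract(String s) { return s.substring(0, endIndex); }
}

// Hypothetical helper (not part of Hadoop): converts an object to and from a Base64 string.
class ExtractorCodec {
    static String encode(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    static Object decode(String encoded) throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Driver side (assumed key name): conf.set("extractor.serialized", encoded);
        String encoded = encode(new PrefixExtractor(3));
        // Mapper.setup side: decode(context.getConfiguration().get("extractor.serialized"))
        Extractor restored = (Extractor) decode(encoded);
        System.out.println(restored.extract("hadoop")); // prints "had"
    }
}
```

With this approach the Mapper keeps its no-argument constructor and the extractor field is assigned in setup rather than in the constructor. The Extractor class must also be on the classpath of the task JVMs for deserialization to succeed.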
Also, I actually work in Scala, so Scala-based solutions are definitely welcome (if not encouraged!)
The best solution I’ve come up with so far is to pass a serialized version of the object I want to the Mapper, and to use reflection to construct the object at runtime.
So, the main method would say something like:
Then, in the Mapper, we use a helper function construct (defined below) and can say:
Definition of construct, which uses reflection to recursively construct an object at runtime from a String:
(This may not be the most robust parser, but it could easily be extended to cover more types of objects.)
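A rough Java sketch of what such a construct helper could look like. This is not the answer's original code. It assumes a made-up spec grammar: an integer literal, a double-quoted string, or ClassName(arg, ...) with arguments parsed recursively. ReflectiveFactory, splitTopLevel and accepts are hypothetical names. The driver could pass the spec string through the job Configuration under a key of its choosing, and the Mapper's setup method could call construct on the value it reads back.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;

interface Extractor { String extract(String s); }

class PrefixExtractor implements Extractor {
    private final int endIndex;
    public PrefixExtractor(int endIndex) { this.endIndex = endIndex; }
    public String extract(String s) { return s.substring(0, endIndex); }
}

// Hypothetical helper: builds objects from specs such as "PrefixExtractor(3)".
class ReflectiveFactory {
    static Object construct(String spec) throws ReflectiveOperationException {
        String s = spec.trim();
        if (s.matches("-?\\d+")) return Integer.valueOf(s);            // integer literal
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\""))
            return s.substring(1, s.length() - 1);                       // quoted string
        int open = s.indexOf('(');
        if (open < 0 || !s.endsWith(")"))
            throw new IllegalArgumentException("Cannot parse: " + spec);
        // The class name must be resolvable by Class.forName (fully qualified if packaged).
        Class<?> cls = Class.forName(s.substring(0, open).trim());
        List<Object> args = new ArrayList<>();
        for (String part : splitTopLevel(s.substring(open + 1, s.length() - 1))) {
            args.add(construct(part)); // recursion handles nested constructor specs
        }
        for (Constructor<?> c : cls.getDeclaredConstructors()) {
            if (accepts(c.getParameterTypes(), args)) {
                c.setAccessible(true);
                return c.newInstance(args.toArray());
            }
        }
        throw new NoSuchMethodException("No matching constructor for: " + spec);
    }

    // Splits an argument list on commas that are outside parentheses and quotes.
    static List<String> splitTopLevel(String argList) {
        List<String> parts = new ArrayList<>();
        int depth = 0, start = 0;
        boolean inQuotes = false;
        for (int i = 0; i < argList.length(); i++) {
            char ch = argList.charAt(i);
            if (ch == '"') inQuotes = !inQuotes;
            else if (!inQuotes && ch == '(') depth++;
            else if (!inQuotes && ch == ')') depth--;
            else if (!inQuotes && ch == ',' && depth == 0) {
                parts.add(argList.substring(start, i));
                start = i + 1;
            }
        }
        String last = argList.substring(start);
        if (!last.trim().isEmpty()) parts.add(last);
        return parts;
    }

    // True if the constructor parameters can take the parsed arguments (only int is unboxed).
    static boolean accepts(Class<?>[] params, List<Object> args) {
        if (params.length != args.size()) return false;
        for (int i = 0; i < params.length; i++) {
            Class<?> p = params[i] == int.class ? Integer.class : params[i];
            if (!p.isInstance(args.get(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        Extractor e = (Extractor) construct(PrefixExtractor.class.getName() + "(3)");
        System.out.println(e.extract("hadoop")); // prints "had"
    }
}
```

Compared with Java serialization, a textual spec stays human-readable in the job configuration, at the cost of maintaining a small parser and supporting only the argument types it knows about.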