I have an application that runs on Hadoop. How can I pass objects to the mappers and reducers so that they can use them while processing the data? For example, I declare a FieldFilter object that filters the rows processed in the mappers. The filter contains many rules, which are specified by users. So I am wondering how I can pass the filter and its rules to the mappers and reducers.
My idea is to serialize the object into a String, pass the string around through the job configuration, and then reconstruct the object from the string in each task. But this does not seem like a good approach to me. Are there any other approaches?
Thanks!
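    import java.util.ArrayList;

    public class FieldFilter {
        // Rules supplied by the user; each rule keeps a reference back to this filter.
        private final ArrayList<FieldFilterRule> rules = new ArrayList<FieldFilterRule>();

        public FieldFilter addRule(FieldFilterRule... rules) {
            for (int i = 0; i < rules.length; i++) {
                this.rules.add(rules[i]);
                rules[i].setFieldFilter(this);
            }
            return this; // allows chained addRule(...) calls
        }
    }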
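For reference, the string-based idea described above could look roughly like the sketch below. It uses only the JDK (Java 8 or later for `java.util.Base64`). `ConfStringCodec` is a hypothetical helper name, not something Hadoop provides, and the sketch assumes FieldFilter and its rules implement `java.io.Serializable`, which the class shown above does not yet do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

// Hypothetical helper (not part of Hadoop): turns a Serializable object
// into a Base64 string that fits in a configuration value, and back.
final class ConfStringCodec {
    private ConfStringCodec() {}

    // Serialize with standard Java serialization, then Base64-encode the bytes.
    static String encode(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new IllegalStateException("could not serialize value", e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverse of encode(): Base64-decode, then deserialize. The caller casts the result.
    static Object decode(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not deserialize value", e);
        }
    }
}
```

In the driver, the encoded string would be stored with `Configuration.set(key, value)` and read back in the task with `Configuration.get(key)`; the key name is up to you. Since the whole job configuration is shipped to every task, this is probably only reasonable while the rule set stays small.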
Serialize the FieldFilter object, put the resulting file in HDFS, and read it back in the mapper/reducer tasks using the HDFS API. If you have a large cluster, you might want to increase the replication factor (which defaults to 3) for the serialized FieldFilter file, since a large number of mapper and reducer tasks will be reading it.
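As a rough sketch of the serialization step, the code below writes a filter to an arbitrary `OutputStream` and restores it from an `InputStream` using plain Java serialization. `SimpleFieldFilter` and `SimpleRule` are simplified, hypothetical stand-ins for the classes in the question. On a cluster the streams would presumably come from `FileSystem.create(Path)` and `FileSystem.open(Path)`; here they are left generic so the example runs with the JDK alone.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for FieldFilterRule.
class SimpleRule implements Serializable {
    private static final long serialVersionUID = 1L;
    final String expression;
    SimpleFieldFilter owner; // back-reference, mirrors setFieldFilter(this)

    SimpleRule(String expression) {
        this.expression = expression;
    }
}

// Hypothetical stand-in for FieldFilter.
class SimpleFieldFilter implements Serializable {
    private static final long serialVersionUID = 1L;
    final List<SimpleRule> rules = new ArrayList<>();

    SimpleFieldFilter addRule(SimpleRule... added) {
        for (SimpleRule rule : added) {
            rules.add(rule);
            rule.owner = this;
        }
        return this;
    }

    // Writes this filter (and its rules) to the stream, then closes the stream.
    void writeTo(OutputStream target) {
        try (ObjectOutputStream out = new ObjectOutputStream(target)) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a filter previously written with writeTo(), then closes the stream.
    static SimpleFieldFilter readFrom(InputStream source) {
        try (ObjectInputStream in = new ObjectInputStream(source)) {
            return (SimpleFieldFilter) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("filter classes missing on the task classpath", e);
        }
    }
}
```

Java serialization preserves the object graph, so the rule-to-filter back-references point at the restored filter after reading. Each task would call `readFrom` once during initialization rather than once per record.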
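If you use the new MapReduce API (org.apache.hadoop.mapreduce), the serialized FieldFilter file can be read in the Mapper.setup() method, which is called once when the map task is initialized; Reducer has a matching setup() method. With the old API (org.apache.hadoop.mapred), the equivalent hook is configure(JobConf), which MapReduceBase provides as an empty method that you can override.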
You can also consider using DistributedCache to distribute the serialized FieldFilter file to the nodes, so that each task reads a local copy instead of going to HDFS.