I am trying to use Hadoop to train multiple models. My data are small enough to fit in memory, so I want to train one model in every map task.
My problem is that once a model has been trained, I need to send it to the reducer. I am using Weka to train the model. I don't want to implement the Writable interface for the Weka classes, because that would take a lot of effort. I am looking for a simple way to do this.
The Classifier class in Weka implements the Serializable interface. How can I send this object to the reducer?
Edit:
Here is a link that describes serialization of Weka objects: http://weka.wikispaces.com/Serialization
Here is what my code looks like:
Configuring the job (only part of the configuration is shown):
conf.set("io.serializations", "org.apache.hadoop.io.serializer.JavaSerialization," + "org.apache.hadoop.io.serializer.WritableSerialization");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Classifier.class);
Map function:
// load the dataset into the data variable (a Weka Instances object)
Classifier tree = new J48();
tree.buildClassifier(data);
context.write(new Text("whatever"), tree);
My Map class extends Mapper<Object, Text, Text, Classifier>.
But I am getting this error:
java.lang.NullPointerException
at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:964)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:673)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:755)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
What am I doing wrong?
You can define your own serialization mechanism.
I think it revolves around implementing the Serialization interface and registering your implementation in the io.serializations configuration property.
In your case, if you just want to use Java serialization, set this property to:
org.apache.hadoop.io.serializer.JavaSerialization
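If you would rather not touch io.serializations at all, another option is to turn the Serializable classifier into a byte array with plain Java serialization and send those bytes as the map output value. What follows is only a minimal sketch under some assumptions: ModelBytes, toBytes and fromBytes are hypothetical names (not part of Hadoop or Weka), and an ArrayList stands in for the trained classifier so that the example runs without Hadoop or Weka on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;

// Hypothetical helper, not part of Hadoop or Weka.
final class ModelBytes {

    // Map side: serialize any Serializable object (for example a trained
    // Weka classifier) into a byte array.
    static byte[] toBytes(Serializable model) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(model);
        }
        return buffer.toByteArray();
    }

    // Reduce side: rebuild the object from the first `length` bytes.
    // The explicit length matters when the array is a padded buffer.
    static Object fromBytes(byte[] bytes, int length)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes, 0, length))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a trained classifier (assumption for this demo only).
        ArrayList<String> standIn = new ArrayList<>();
        standIn.add("model");

        byte[] payload = toBytes(standIn);
        Object restored = fromBytes(payload, payload.length);
        System.out.println("round trip ok: " + standIn.equals(restored));
    }
}
```

In the job itself, the mapper would wrap the array in org.apache.hadoop.io.BytesWritable (with the map output value class set to BytesWritable.class), and the reducer would call fromBytes(value.getBytes(), value.getLength()) and cast the result to Classifier. The length argument is there because getBytes() returns the backing buffer, which is only valid up to getLength().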