I am trying to read a file which has lines in the following format.
100,1:2:3
200,10:20:30
Assuming that the inputs will always be numbers, I am trying to read the file by setting the input key and value as IntWritable and Text respectively. But when I run it, I get the following error:
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
Now, though I understand what it means, I am unable to figure out how to read the key as an integer. The code runs fine if I read the key as a Text as well. I have checked everywhere in the code if I have missed the configuration, but it seems fine to me.
conf.set("mapred.textoutputformat.separator", "|");
conf.setInputFormatClass(KeyValueTextInputFormat.class);
conf.setOutputFormatClass(TextOutputFormat.class);
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(Text.class);
I have also checked the mapper class and methods (There is no reducer). Is it that the KeyValueTextInputFormat can read the key as only Text? I am unable to understand what I am doing wrong. Any help would be deeply appreciated.
Thanks,
EG
Looking at the source of KeyValueTextInputFormat, it extends FileInputFormat<Text, Text>. That means both the key and the value of your input are expected to be Text, which is why the cast to IntWritable fails.
You could fix that by implementing your own RecordReader, which you could model after KeyValueLineRecordReader, but extend RecordReader<IntWritable, Text> instead and modify the code accordingly.
Once you have your RecordReader, you can create your own InputFormat that uses it, and then in your main code you just need to set your new InputFormat with setInputFormatClass, the same way you currently set KeyValueTextInputFormat.
Another approach, which I would recommend if you're really worried about performance, is to use SequenceFileInputFormat. This involves storing your input as SequenceFiles, so the data is already in binary format. That avoids the overhead of parsing every line, as you have to do in your case. You would again select this format with setInputFormatClass.
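For the custom RecordReader approach, the main change from a Text-keyed reader is the step that splits each line and parses the key as an integer. Below is a minimal sketch of just that step in plain Java, with no Hadoop dependency. The class name IntKeyLineParser and the parse method are hypothetical, and the comma separator is an assumption taken from the sample lines in the question. In a real RecordReader<IntWritable, Text> you would copy the two parts into an IntWritable and a Text instead of returning a Map.Entry.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical helper: illustrates the line-parsing step only, not a full RecordReader.
public class IntKeyLineParser {

    // Splits the line at the first separator and parses the left part as an int.
    // If no separator is found, the whole line is treated as the key and the value is empty.
    // Throws NumberFormatException if the key part is not a valid integer.
    public static Map.Entry<Integer, String> parse(String line, char separator) {
        int pos = line.indexOf(separator);
        String keyPart = (pos < 0) ? line : line.substring(0, pos);
        String valuePart = (pos < 0) ? "" : line.substring(pos + 1);
        int key = Integer.parseInt(keyPart.trim());
        return new SimpleImmutableEntry<>(key, valuePart);
    }

    public static void main(String[] args) {
        // Sample lines from the question; comma is assumed to be the separator.
        for (String line : new String[] {"100,1:2:3", "200,10:20:30"}) {
            Map.Entry<Integer, String> kv = parse(line, ',');
            System.out.println(kv.getKey() + "|" + kv.getValue());
        }
    }
}
```

Since the question assumes the keys are always numbers, the sketch lets Integer.parseInt throw on bad input. A production reader would probably want to decide whether to skip or fail on such lines.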