I have a directory OUTPUT that holds the output files from a MapReduce job. The output files are text files written with TextOutputFormat.
Now I want to read the key/value pairs back from those output files. How can I do this using existing classes in Hadoop? One way I could do it is as follows:
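import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// conf is the job's Configuration; OUTPUT is the job's output directory.
FileSystem fs = FileSystem.get(conf);
FileStatus[] files = fs.globStatus(new Path(OUTPUT + "/part-*"));
for (FileStatus file : files) {
    if (file.getLen() > 0) {
        // try-with-resources closes the reader and the underlying stream.
        try (BufferedReader bin = new BufferedReader(new InputStreamReader(
                fs.open(file.getPath()), StandardCharsets.UTF_8))) {
            String s;
            while ((s = bin.readLine()) != null) {
                System.out.println(s);
            }
        }
    }
}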
This approach works, but it adds a lot of work because I now have to parse the key/value pair out of each line by hand. I am looking for something handier that lets me read the key and the value directly into variables.
Are you forced to use TextOutputFormat as your output format in the previous job?
If not, consider using SequenceFileOutputFormat; you can then use a SequenceFile.Reader to read the file back as key/value pairs. You can also still 'view' the file using
hadoop fs -text path/to/output/part-r-00000
EDIT: You can also use the KeyValueLineRecordReader class; you'll just need to pass in a FileSplit to the constructor.
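If you stay with TextOutputFormat, it may help to see what splitting a line into key and value amounts to. The following is only a rough plain-Java sketch, not Hadoop's own code: it assumes the job used TextOutputFormat's default separator (a single tab) and, like KeyValueLineRecordReader, treats everything before the first separator as the key and the rest as the value, with an empty value when the line has no separator. The class and method names (KeyValueLines, splitLine, readPairs) are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class KeyValueLines {
    // Assumption: the job kept TextOutputFormat's default separator, a tab.
    static final char SEPARATOR = '\t';

    // Split one line at the FIRST separator; later separators stay in the value.
    static Map.Entry<String, String> splitLine(String line) {
        int pos = line.indexOf(SEPARATOR);
        if (pos < 0) {
            // No separator: the whole line is the key, the value is empty.
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    // Read every line from the reader and return the parsed pairs in order.
    static List<Map.Entry<String, String>> readPairs(BufferedReader reader)
            throws IOException {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            pairs.add(splitLine(line));
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a part-* file; in the question's loop you
        // would pass the BufferedReader opened on fs.open(file.getPath()).
        BufferedReader reader =
                new BufferedReader(new StringReader("apple\t3\nbanana\t7\n"));
        for (Map.Entry<String, String> pair : readPairs(reader)) {
            System.out.println(pair.getKey() + " => " + pair.getValue());
        }
    }
}
```

Both key and value come back as strings here, so anything that was not a Text in the job (an IntWritable count, say) still has to be converted by the caller. That is the main reason a SequenceFile is the more convenient format when the output is meant to be read back by code.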