I have developed code that runs a MapReduce job to read files from an FTP server and write them into HDFS. The job writes the FTP data into the specified output directory in HDFS as a file named part-0000. When there are multiple files on the FTP server, all of them end up in that single part-0000 file.
To avoid this, I plan to pass the file name as the key and the file data as the value, so that the reducer writes each file's data to an output file named after its key.
I understand that I have to use an output format that extends MultipleTextOutputFormat. I have written it as follows:
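// At the top of the enclosing source file:
// import org.apache.hadoop.io.Text;
// import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
static class MultiFileOutput extends MultipleTextOutputFormat<Text, Text> {
    // Use the record key (expected to be the input file name) as the output file name.
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        System.out.println("key is :" + key.toString());
        System.out.println("value is :" + value.toString());
        System.out.println("name is :" + name);
        return key.toString();
    }
}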
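To make the intent of the override above concrete, here is a minimal sketch in plain Java, with no Hadoop dependency, of the routing it is meant to achieve: every (key, value) record goes to the output named by its key. The class KeyRoutingSketch and its methods fileNameFor and route are hypothetical names made up for this illustration, and the in-memory map only stands in for real output files.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeyRoutingSketch {
    // Hypothetical stand-in for generateFileNameForKeyValue:
    // ignore the default name and use the key as the file name.
    public static String fileNameFor(String key, String value, String defaultName) {
        return key;
    }

    // Groups record values by the file name chosen for each record.
    // Each record is a two-element array: {key, value}.
    public static Map<String, List<String>> route(List<String[]> records) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        for (String[] record : records) {
            String target = fileNameFor(record[0], record[1], "part-0000");
            files.computeIfAbsent(target, k -> new ArrayList<>()).add(record[1]);
        }
        return files;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"a.txt", "first line of a"});
        records.add(new String[] {"b.txt", "only line of b"});
        records.add(new String[] {"a.txt", "second line of a"});
        // One entry per distinct key, instead of everything in one part file.
        System.out.println(route(records));
    }
}
```

The sketch only works if the key really carries the input file name, which is exactly the missing piece in the job.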
But I cannot work out how to pass the name of the input file being processed. How do I get the name of the input file?
map.input.file
and
FileSystem fs = file.getFileSystem(conf);
String fileName = fs.getName();
do not return the name of the input file.
Any pointers?
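For reference, what I am after is the last component of the input location, not the file system it lives on. A minimal sketch in plain Java, with no Hadoop dependency, assuming the full location is already available as a URI-like string (the host and path in main are made-up examples, and InputFileNameSketch and baseName are hypothetical names):

```java
import java.net.URI;

public class InputFileNameSketch {
    // Returns the last path component of a URI-like string,
    // or an empty string if the location has no path component.
    // Assumes the string is a syntactically valid URI (no raw spaces).
    public static String baseName(String location) {
        String path = URI.create(location).getPath();
        if (path == null || path.isEmpty()) {
            return "";
        }
        // Ignore any trailing slashes before looking for the last separator.
        int end = path.length();
        while (end > 0 && path.charAt(end - 1) == '/') {
            end--;
        }
        int slash = path.lastIndexOf('/', end - 1);
        return path.substring(slash + 1, end);
    }

    public static void main(String[] args) {
        // Made-up location, for illustration only.
        System.out.println(baseName("ftp://example.com/incoming/report.txt"));
    }
}
```

A string like this could then be emitted as the map output key, so that the output format can name the file after it.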
I used a FileStatus object in the following code, as my customised input format does not split the input file. It worked fine for me.