I have a directory of text-based, compressed log files, each containing many records. In older versions of Hadoop I would extend MultiFileInputFormat to return a custom RecordReader that decompressed the log files, and continue from there. Now I’m trying to use Hadoop 0.20.2.
In the Hadoop 0.20.2 documentation, I notice MultiFileInputFormat is deprecated in favor of CombineFileInputFormat. But to extend CombineFileInputFormat, I have to use the deprecated classes JobConf and InputSplit. What is the modern equivalent of MultiFileInputFormat, or the modern way of getting records from a directory of files?
org.apache.hadoop.mapred.* is the old API, while org.apache.hadoop.mapreduce.* is the new API. Some of the input/output formats have not been migrated to the new API yet; in particular, MultiFileInputFormat and CombineFileInputFormat have not been migrated in 0.20.2. I remember a JIRA being opened to migrate the missing formats, but I don’t remember the JIRA number.
For now it should be OK to use the old API. Check this response in the Apache forums. I am not sure of the exact plans for dropping support for the old API. I don’t think many people have started using the new API, so I expect the old one to be supported for the foreseeable future.
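Whichever API you end up on, the core of the custom RecordReader (decompress a file, then hand back one record at a time) is the same. Here is a minimal sketch of that logic in plain Java, with no Hadoop classes involved. It assumes the logs are gzip-compressed, end in .gz, and hold one record per line; the question states none of this, so adjust accordingly. The class and method names are hypothetical, made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of Hadoop: shows only the decompress-and-split
// logic that a custom RecordReader would wrap.
class GzipLogRecords {

    // Reads one gzip-compressed file and returns each line as a record.
    // Assumption: gzip compression, UTF-8 text, one record per line.
    static List<String> readFile(Path file) throws IOException {
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(line);
            }
        }
        return records;
    }

    // Collects the records of every *.gz file directly inside dir,
    // visiting files in sorted path order so the result is deterministic.
    static List<String> readDirectory(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.gz")) {
            for (Path p : stream) {
                files.add(p);
            }
        }
        Collections.sort(files);

        List<String> records = new ArrayList<>();
        for (Path file : files) {
            records.addAll(readFile(file));
        }
        return records;
    }

    // Usage: java GzipLogRecords /path/to/logs
    public static void main(String[] args) throws IOException {
        for (String record : readDirectory(Paths.get(args[0]))) {
            System.out.println(record);
        }
    }
}
```

In a real job the per-file loop would live inside the RecordReader returned by your input format, reading from the Hadoop FileSystem instead of the local disk, and it would stream records rather than collect them into a list.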