We are trying to grab the total number of input paths our MapReduce program is iterating through in our mapper. We are going to use this along with a counter to format our value depending on the index. Is there an easy way to pull the total input path count from the mapper? Thanks in advance.
You could look through the source for FileInputFormat.getSplits(). It reads the configuration property mapred.input.dir and resolves that comma-separated value to an array of Paths.

These paths can still represent folders and glob patterns, so the next thing getSplits() does is pass the array to a protected method, org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(JobContext). This goes through the listed directories and patterns and lists the files they match (also invoking a PathFilter if one is configured).

Since this method is protected, you could create a simple 'dummy' extension of FileInputFormat that has a listStatus method accepting the Mapper.Context as its argument, which in turn wraps a call to the FileInputFormat.listStatus method.
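A minimal sketch of such a wrapper, assuming the new (org.apache.hadoop.mapreduce) API. The class name InputFileLister and the method name listInputFiles are made up for illustration. The method takes a JobContext rather than Mapper.Context directly, which should work because Mapper.Context is itself a JobContext:

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical helper: exists only to expose the protected listStatus method.
// It is never meant to be set as the job's actual input format.
public class InputFileLister extends FileInputFormat<Object, Object> {

    // Required by the abstract parent class; not used by this helper.
    @Override
    public RecordReader<Object, Object> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        throw new UnsupportedOperationException("InputFileLister only lists input files");
    }

    // Public wrapper around the protected FileInputFormat.listStatus(JobContext).
    public List<FileStatus> listInputFiles(JobContext context) throws IOException {
        return super.listStatus(context);
    }
}
```

From a mapper's setup(Context context) method, the count would then be something like new InputFileLister().listInputFiles(context).size(). Note that this lists the input directories again in every task, so it is probably best done once in setup() rather than per record.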
EDIT: In fact, it looks like FileInputFormat already does this for you, setting a job property, mapreduce.input.num.files, at the end of the getSplits() method (at least in 1.0.2, probably introduced in 0.20.203). Here's the JIRA ticket
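Reading that property from the mapper could look roughly like this. This is a sketch that assumes the property name above applies to your Hadoop version and that the job uses a FileInputFormat-based input format. The class name, the key/value types and the -1 fallback are illustrative choices, not anything prescribed by Hadoop:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper; the key/value types assume plain text input.
public class FileCountAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Total number of input files, or -1 if the property was not set.
    private long totalInputFiles;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // getSplits() records the number of files it listed under this property.
        totalInputFiles = context.getConfiguration()
                .getLong("mapreduce.input.num.files", -1L);
    }
}
```

Keep in mind that this is the number of files found after directories and patterns were expanded, which is not necessarily the same as the number of input paths originally configured on the job.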