Here is the use case:
I have a Nutch crawldb (it's a Hadoop map file) containing data about URLs, including each URL's status as visited or not visited. I want to split it into two crawldbs (map files) based on the status of the URLs.
So far I have tried using MultipleOutputFormat, but I read that it works only for sequence files or text files and NOT map files.
(FYI: I am using Hadoop v0.20.2)
Look at MultipleOutputs instead. You’ll have to write a custom reducer that calls the MultipleOutputs.getCollector() method for each named output (one per URL status); there’s example usage in the Javadocs.
In your job configuration: