I’ve recently started looking into the MapReduce/Hadoop framework and am wondering whether my problem truly lends itself to the framework.
Consider an example where I have a large set of input text files and, as additional input, a large set of keywords (say, contained in a single file). For each keyword, I want to search each text file and report the number of instances of that keyword in that file. I would repeat this for every keyword and every text file.
This scenario differs a bit from the examples I’ve seen online, in that I would like to take as input not only the text documents to search, but also the keywords to search for. This means that each Map task might process the same input text file multiple times (once per keyword).
Could a problem like this be suitable for a MapReduce framework?
The scenario mentioned is definitely suitable for the MapReduce framework.
The keywords to search need not be an input parameter to the map function. There are two options.
The file containing the keywords can be put in HDFS and read in the map function using the HDFS API.
DistributedCache can also be considered for sharing the same file across mappers.
All the initialization, such as reading the keyword file from HDFS, can be done in the org.apache.hadoop.mapreduce.Mapper#setup() function.
Once the list of keywords is obtained in the mapper, they can be searched in the input files and the count emitted.
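To make the search-and-count step concrete, here is a minimal sketch of the map and reduce logic in plain Java. It deliberately uses no Hadoop classes, so it only illustrates the data flow rather than being a real job. The class name KeywordCountSketch, the method names, the tab-separated "file, keyword" key, and the choice of case-sensitive substring matching are all assumptions made for illustration. In an actual job, the keyword list would be loaded once in setup(), and the map and reduce bodies would live in Mapper and Reducer subclasses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the map and reduce logic; no Hadoop classes are used.
// All names here are hypothetical and chosen for the example only.
class KeywordCountSketch {

    // Counts non-overlapping, case-sensitive substring occurrences of keyword in line.
    // Note: substring matching means "map" is also counted inside "remap".
    static int countOccurrences(String line, String keyword) {
        if (keyword.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = line.indexOf(keyword);
        while (from >= 0) {
            count++;
            from = line.indexOf(keyword, from + keyword.length());
        }
        return count;
    }

    // "Map" step: for one line of one file, emit (fileName TAB keyword, count)
    // for every keyword that occurs at least once in that line.
    static List<Map.Entry<String, Integer>> map(String fileName, String line,
                                                List<String> keywords) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String keyword : keywords) {
            int n = countOccurrences(line, keyword);
            if (n > 0) {
                out.add(Map.entry(fileName + "\t" + keyword, n));
            }
        }
        return out;
    }

    // "Reduce" step: sum the emitted counts for each (file, keyword) key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Stands in for the keyword list that setup() would read from HDFS.
        List<String> keywords = List.of("hadoop", "map");

        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        emitted.addAll(map("a.txt", "hadoop runs map tasks", keywords));
        emitted.addAll(map("a.txt", "each map task reads a split", keywords));
        emitted.addAll(map("b.txt", "no match here", keywords));

        System.out.println(reduce(emitted));
    }
}
```

Because every keyword is checked against a line in a single pass of the mapper, each input file is read only once, not once per keyword.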
There might be better algorithms for text processing. The book Data-Intensive Text Processing with MapReduce is a good place to look for them.
One thing to consider: if the data is small, Hadoop adds more overhead than a simple shell script would. For large data, using Hadoop is an advantage.