Scenario:
I have one subset of database and one dataware house. I have bring this both things on HDFS.
I want to analyse the result based on subset and datawarehouse.
(In short, for one record in subset I have to scan each and every record in dataware house)
Question:
I want to do this task using Map-Reduce algo. I am not getting that how to take both files as a input in mapper and also how to handle both files in map phase of map-reduce.
Pls suggest me some idea so that I can able to perform it?
Check the Section 3.5 (Relations Joins) in Data-Intensive Text Processing with MapReduce for Map-Side Joins, Reduce-Side Joins and Memory-Backed Joins. In any case MultipleInput class is used to have multiple mappers process different files in a single job.
FYI, you could use Apache Sqoop to import DB into HDFS.