I have a mapreduce Mapper. This Mapper should use some set of read-only parameters.
Let’s imagine that I want to count occurrences of some substrings (the title of something) in input lines.
I have a list of pairs: “some title” => “a regular expression to extract this title from an input line”.
These pairs are stored in an ordinary text file.
What is the best way to pass this file to the Mapper?
I have only this idea:
- Upload the file with the pairs to HDFS.
- Pass the path to the file using -Dpath.to.file.with.properties
- In the static{} section of the Mapper, read the file and populate a map of pairs “some title” => “regular expr for the title”.
Is it good or bad? Please advise.
You’re on track, but I would recommend using the distributed cache. Its purpose is for exactly this – passing read-only files to task nodes.
Your Mapper can then load the file in its configure or setup method, depending on which version of the API you are using. In that method it can read from the distributed cache and store everything in memory.
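As a rough illustration of the "store everything in memory" part, here is a minimal Java sketch of the loading and matching logic that could be called from configure or setup. It deliberately leaves out the Hadoop classes so that it runs on its own. The class name TitlePatterns, the method names, and the tab-separated "title<TAB>regex" file format are all assumptions made for this example; the question does not say how the pairs are laid out in the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: holds the read-only "title => regex" pairs in memory.
class TitlePatterns {

    // Parses lines of the assumed form "title<TAB>regex"; blank lines are skipped.
    // Each regex is compiled once here, not once per input record.
    static Map<String, Pattern> load(Reader source) throws IOException {
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue;
            }
            int tab = line.indexOf('\t');
            if (tab < 0) {
                throw new IOException("Expected title<TAB>regex, got: " + line);
            }
            patterns.put(line.substring(0, tab), Pattern.compile(line.substring(tab + 1)));
        }
        return patterns;
    }

    // Returns the titles whose pattern occurs somewhere in the input line.
    // A map method would emit (title, 1) for each returned title.
    static List<String> matchingTitles(Map<String, Pattern> patterns, String inputLine) {
        List<String> titles = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            if (entry.getValue().matcher(inputLine).find()) {
                titles.add(entry.getKey());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the cached file; in a real job the Reader would wrap the local copy.
        Reader pairs = new StringReader("Hamlet\t(?i)hamlet\nMacbeth\t(?i)macbeth\n");
        Map<String, Pattern> patterns = load(pairs);
        System.out.println(matchingTitles(patterns, "I saw HAMLET last night"));
    }
}
```

In the Mapper itself, load would be called once from configure or setup with a Reader over the locally cached file, and the resulting map kept in an instance field rather than populated in a static block. That way the path comes from the job configuration at task start instead of at class-loading time.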