I have a file containing some data that every mapper needs for its calculations.
I know how to read the data from the file, and I could do that inside the map function. However, the data is the same for every mapper, so I would like to load it into a variable once, before the mapping process begins, and then use its contents in the mappers.
If I read the file in the map function and my input is, for example, a file with 10 lines, then the map function will be called 10 times, correct? In that case I would read the file contents 10 times, which is unnecessary.
Thanks in advance.
Because your Mappers run in separate JVMs (and possibly on different machines), you cannot read the data into your application once, prior to submitting the job to Hadoop, and have every Mapper see it. However, you can use the Distributed Cache to “Distribute application-specific large, read-only files efficiently.”
As per that link: “Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.”
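To avoid re-reading the file on every map call, the usual pattern is to load it once per task and keep it in a field. In the `org.apache.hadoop.mapreduce` API that hook is `Mapper.setup(Context)`, and the file is typically registered in the driver with `Job.addCacheFile(URI)` and looked up in the task via `context.getCacheFiles()`. The sketch below is only a plain-Java stand-in for that lifecycle with no Hadoop dependency: the class `SideDataMapperSketch`, its methods, and the temp file are hypothetical names chosen for illustration, not part of Hadoop.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Hadoop mapper; it only mimics the setup/map lifecycle.
public class SideDataMapperSketch {
    private final Path sideFile;
    private List<String> sideData; // loaded once, reused by every map() call
    private int fileReads = 0;     // counts how often the side file is read

    public SideDataMapperSketch(Path sideFile) {
        this.sideFile = sideFile;
    }

    // Plays the role of Mapper.setup(Context): runs once per task, before any map() call.
    public void setup() throws IOException {
        sideData = Files.readAllLines(sideFile);
        fileReads++;
    }

    // Plays the role of map(): runs once per input record and only touches the in-memory copy.
    public String map(String line) {
        return line + "\t" + sideData.size();
    }

    public int getFileReads() {
        return fileReads;
    }

    public static void main(String[] args) throws IOException {
        // Create a small side-data file so the example is self-contained.
        Path tmp = Files.createTempFile("side-data", ".txt");
        Files.write(tmp, Arrays.asList("a", "b", "c"));

        SideDataMapperSketch mapper = new SideDataMapperSketch(tmp);
        mapper.setup();                      // one read of the side file
        for (int i = 0; i < 10; i++) {       // ten input records, no further reads
            System.out.println(mapper.map("line" + i));
        }
        System.out.println("file reads: " + mapper.getFileReads());
        Files.delete(tmp);
    }
}
```

In a real job the body of `setup()` would open the locally cached copy of the file instead of a temp file. Note that the load still happens once per map task rather than once per job, since each task runs in its own JVM.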