I have a list of files that I want to add using the distributed cache facility. Different files are needed by different reduce tasks. For example, file A is needed by reduce 1, file B is needed by reduce 2, and so on.
In the JobConf, both files are added using the DistributedCache.addCacheFile() method.
In the reducer class's configure() method, I use DistributedCache.getCacheFiles() to get the files.
Is it possible to have only file A in the memory of reduce 1 and only file B in the memory of reduce 2? Or are both files loaded into memory before the reduce task starts?
If I understand this, I can use the distributed cache for my program. My concern is scalability. The files are big, so a reduce task cannot hold both files in memory, but it can hold one of them.
Please help!
Thanks
The method that returns the cache files gives you an array of all the names of the files you cached, in the order you added them. So it is possible to tell reducer 1 to use the array[0] file and reducer 2 to use the array[1] file. Note also that putting very large files in this cache is not recommended.
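A minimal sketch of that index-based selection, in plain Java so it runs without Hadoop. The class name CacheFileSelector, the method selectForPartition, and the example URIs are hypothetical. In a real job the array would come from DistributedCache.getCacheFiles(job) inside configure(), and I believe the reducer's own index can be read there with job.getInt("mapred.task.partition", -1), though that is worth checking against your Hadoop version. Also keep in mind that the cache copies files to each node's local disk, so a file should only occupy memory once your code opens and reads it.

```java
import java.net.URI;

public class CacheFileSelector {

    // Returns the cache file that belongs to the given reducer index.
    // Assumes files were added in reducer order: index 0 for the first
    // reducer, index 1 for the second, and so on (this mapping is an
    // assumption of the sketch, not something Hadoop enforces).
    static URI selectForPartition(URI[] cacheFiles, int partition) {
        if (cacheFiles == null || cacheFiles.length == 0) {
            throw new IllegalStateException("no files in the distributed cache");
        }
        if (partition < 0 || partition >= cacheFiles.length) {
            throw new IllegalArgumentException(
                "no cache file for partition " + partition);
        }
        return cacheFiles[partition];
    }

    public static void main(String[] args) {
        // Stand-in for the array returned by DistributedCache.getCacheFiles(job);
        // these paths are made up for illustration.
        URI[] cached = {
            URI.create("hdfs:///cache/fileA"),
            URI.create("hdfs:///cache/fileB")
        };

        // Each reducer would pass its own index and open only the returned file.
        System.out.println("reducer 0 reads " + selectForPartition(cached, 0));
        System.out.println("reducer 1 reads " + selectForPartition(cached, 1));
    }
}
```

This only helps if the partitioner sends the keys that need file A to the first reducer and the keys that need file B to the second, so the file order and the partitioning logic have to agree.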