What is the maximum number of paths a map-side join can actually join?
I have n folders in HDFS: path/to/folder1, path/to/folder2, path/to/folder3, … and so on up to path/to/foldern.
path/to/folder1 contains 3 files, say part-1, part-2 and part-3. Similarly, each of the remaining folders contains 3 files with the same names as in folder1.
I want to join these folders using a map-side join, like below:
pathsToJoin <- path/to/folder1, path/to/folder2, path/to/folder3, … and so on up to path/to/foldern
String joinStmt = CompositeInputFormat.compose("outer", TextInputFormat.class, pathsToJoin);
conf.set("mapred.join.expr", joinStmt);
Since there are 3 files in each folder, the job will spawn 3 map tasks: the contents of all part-1 files are joined in the first mapper, all part-2 files in the second, and all part-3 files in the third. What I would like to know is the maximum value of n here.
There doesn't appear to be a hard limit in the source code for CompositeInputFormat: the paths are appended to a String expression describing the join, which is then parsed into splits. You're probably limited by memory, but I imagine you can list hundreds, if not thousands, of paths without any problem.
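To get a feel for how that String expression grows with n, here is a rough, Hadoop-free sketch. It assumes the expression has the shape op(tbl(format,"path1"),tbl(format,"path2"),...), which is how I recall CompositeInputFormat.compose building it; check the source for your Hadoop version. The class JoinExprSketch and the helpers buildJoinExpr and folderPaths are made up for illustration and are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics the assumed shape of the join expression
// without depending on Hadoop. Not the library's actual implementation.
class JoinExprSketch {

    // Hypothetical helper: builds op(tbl(format,"p1"),tbl(format,"p2"),...).
    static String buildJoinExpr(String op, String inputFormatClassName, List<String> paths) {
        if (paths.isEmpty()) {
            throw new IllegalArgumentException("at least one path is required");
        }
        StringBuilder sb = new StringBuilder(op).append('(');
        for (int i = 0; i < paths.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            // One tbl(...) term per input path, so the length grows linearly in n.
            sb.append("tbl(")
              .append(inputFormatClassName)
              .append(",\"")
              .append(paths.get(i))
              .append("\")");
        }
        return sb.append(')').toString();
    }

    // Hypothetical helper: generates path/to/folder1 .. path/to/folder<n>.
    static List<String> folderPaths(int n) {
        List<String> paths = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            paths.add("path/to/folder" + i);
        }
        return paths;
    }

    public static void main(String[] args) {
        String fmt = "org.apache.hadoop.mapred.TextInputFormat";
        // Show the expression for a small n, then only its size for larger n.
        System.out.println(buildJoinExpr("outer", fmt, folderPaths(3)));
        for (int n : new int[] {10, 100, 1000}) {
            int length = buildJoinExpr("outer", fmt, folderPaths(n)).length();
            System.out.println(n + " paths -> " + length + " chars");
        }
    }
}
```

Each extra path adds only one short tbl(...) term, so the expression itself stays small even for thousands of folders. The practical cost is more likely on the task side, since each map task has to open one reader per joined path for its split.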