I was going through the paper related to the HIPI image processing API for Hadoop at:
http://cs.ucsb.edu/~cmsweeney/papers/undergrad_thesis.pdf
While explaining the covariance example, the paper says “Because HIPI allocates one image per map task, it is simple to randomly sample an image for 100 patches and perform this calculation”.
But the very first figure in the paper depicts an architecture with multiple images being input to one map task!
It is also surprising that they say one image is processed by one map task: that would spawn far too many map tasks, and the paper is supposed to address the small files problem as well.
If this is true, then a SequenceFile with MultithreadedMapper would be a better alternative. Am I right or wrong?
Thanks in advance.
While I’m not able to explain what the author is saying in the paper, looking at the API for HIPI, I can only see a single InputFormat:
This works on an ImageBundle, which is what it sounds like: a collection (bundle) of images in a single file.
I guess what the author is probably trying to say is:
Looking through the code for the related Covariance example supports this theory.
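To make the difference between a map task and a map() call concrete, here is a minimal plain-Java sketch with no Hadoop or HIPI dependency. It assumes (this is my reading, not something the paper confirms) that a bundle is cut into splits, each split goes to one map task, and that task calls map() once per image. Every class and method name below is hypothetical and is not part of the HIPI API.

```java
import java.util.ArrayList;
import java.util.List;

public class BundleSketch {

    // Hypothetical stand-in for one map task: it receives a split holding
    // several images and calls map() once for each image in that split.
    static int runMapTask(List<String> split) {
        int mapCalls = 0;
        for (String image : split) {
            map(image);
            mapCalls++;
        }
        return mapCalls;
    }

    // Hypothetical per-image map(). Real code would sample patches here;
    // this sketch does nothing with the image.
    static void map(String image) {
    }

    // Builds a bundle of imageCount placeholder images, cuts it into splits
    // of at most imagesPerSplit images, runs one task per split, and returns
    // the number of map() calls each task made.
    static List<Integer> mapCallsPerTask(int imageCount, int imagesPerSplit) {
        if (imagesPerSplit <= 0) {
            throw new IllegalArgumentException("imagesPerSplit must be positive");
        }
        List<String> bundle = new ArrayList<>();
        for (int i = 0; i < imageCount; i++) {
            bundle.add("image-" + i);
        }
        List<Integer> callsPerTask = new ArrayList<>();
        for (int start = 0; start < bundle.size(); start += imagesPerSplit) {
            int end = Math.min(start + imagesPerSplit, bundle.size());
            callsPerTask.add(runMapTask(bundle.subList(start, end)));
        }
        return callsPerTask;
    }

    public static void main(String[] args) {
        // 10 images, at most 4 per split: 3 tasks, 10 map() calls in total.
        System.out.println(mapCallsPerTask(10, 4)); // prints [4, 4, 2]
    }
}
```

Under that assumption, the number of map tasks follows the number of splits rather than the number of images, which would fit both the figure in the paper and the small files concern raised in the question.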