I want to write a MapReduce job (possibly with multiple rounds) to:
1. Sample N records from a large data set for each of, say, X random trees
2. Train each tree (X in total)
3. Then test records on all of these trees
Sequentially, this would be:
for i = 0 to 199 (X = 200 trees):
- sample N records from the large data set
- train tree i
- test tree i on all test records
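For reference, the sequential loop above could be sketched in plain Python as follows. This is only an illustration: `train_tree` is a hypothetical placeholder that predicts the majority label of its sample (it is not a real RandomTree learner), and the function and parameter names are assumptions, not part of the assignment.

```python
import random
from collections import Counter

def train_tree(sample):
    """Hypothetical stand-in for a tree learner: always predicts the
    majority label of its training sample."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda features: majority

def sequential_forest(data, test_records, num_trees=200, n=100, seed=0):
    """data is a list of (features, label) pairs; returns one voted label
    per test record."""
    rng = random.Random(seed)
    votes = [Counter() for _ in test_records]
    for _ in range(num_trees):
        sample = rng.sample(data, n)      # sample N records without replacement
        tree = train_tree(sample)         # train this tree
        for i, features in enumerate(test_records):
            votes[i][tree(features)] += 1  # test every record on this tree
    # Majority vote over all trees for each test record.
    return [v.most_common(1)[0][0] for v in votes]
```

The MapReduce version has to distribute exactly these three steps: the sampling, the per-tree training, and the per-record voting.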
This is a homework problem, so I just need the idea.
I’m not sure about two things:
- In the mapper, can I sample exactly N records and generate 200 small
training data files?
- How to test each record on all 200 trees. The first option I thought
of is that each reducer runs a small test (part of the test file)
against ALL trees. The second option, which I’m not sure how to
implement, is to run the 200 trees independently with the test file in
the distributed cache, and predict for each test record.
It depends on how formally you work.
A formal mapper cannot sample exactly N records, because it cannot keep a counter and it doesn’t know the total data size. A practical mapper in Hadoop certainly could keep a counter, but it probably won’t know how many records it is going to receive in total.
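To illustrate what a mapper that is allowed to keep a counter could do, here is a minimal sketch of reservoir sampling (Algorithm R), which draws exactly N records from a stream of unknown length. This is a standard technique brought in for illustration, not something the question or this answer prescribes, and `reservoir_sample` is a hypothetical helper name. Note that in Hadoop each mapper only sees its own input split, so this gives exactly N per mapper, not exactly N overall.

```python
import random

def reservoir_sample(stream, n, rng=None):
    """Return n records chosen uniformly from stream without knowing its
    length in advance (or all records if the stream is shorter than n)."""
    rng = rng or random.Random()
    reservoir = []
    for count, record in enumerate(stream, start=1):
        if count <= n:
            # Fill the reservoir with the first n records.
            reservoir.append(record)
        else:
            # Keep the new record with probability n / count,
            # replacing a uniformly chosen slot.
            j = rng.randrange(count)
            if j < n:
                reservoir[j] = record
    return reservoir
```

The counter (`count`) is exactly the piece of state that a formal, stateless mapper is not allowed to have.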
But since you said this is homework, I don’t think you need to ensure it is exactly N.
In particular, as you are sampling, what is the benefit of having exactly N records?
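If approximately N is good enough, a stateless mapper can decide per record and per tree whether to emit it, with probability p = N / (approximate total). A rough sketch, simulated in plain Python rather than Hadoop; `approx_total`, `map_record`, `shuffle` and `sample_for_trees` are all assumed names for illustration, and the total record count is assumed to be roughly known:

```python
import random
from collections import defaultdict

def map_record(record, num_trees, p, rng):
    """Mapper: for each tree, independently emit (tree_id, record)
    with probability p. No counter is needed."""
    for tree_id in range(num_trees):
        if rng.random() < p:
            yield tree_id, record

def shuffle(pairs):
    """Simulates the shuffle phase: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def sample_for_trees(records, num_trees, n, approx_total, seed=0):
    rng = random.Random(seed)
    p = min(1.0, n / approx_total)  # expected sample size per tree is about n
    pairs = (kv for r in records for kv in map_record(r, num_trees, p, rng))
    return shuffle(pairs)
```

Each group then corresponds to the input of one reducer call, which would train the tree with that `tree_id` on roughly N records.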
Try the following: