I would like to use the App Engine mapper to iterate over a range of dates (from-date and to-date passed as properties to the configuration). For each date in the range, I would retrieve the entities that have this date as a property and operate on that set.
For example, if I have the following set of entities:
Key  Date        Value
a    2011/09/09  323
b    2011/09/09  132
c    2011/09/08  354
d    2011/09/08  432
e    2011/09/08  234
f    2011/09/07  423
g    2011/09/07  543
I would like to specify a date range of 2011/09/09 – 2011/09/07 which would create three mapper instances, for 2011/09/09, 2011/09/08 and 2011/09/07. In turn these would query for entities a+b, c+d+e and f+g respectively, and perform some operations on the values. (Each of the mappers would also make other datastore queries for additional data, hence the ‘bonus question’ below)
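To make the splitting step concrete, here is a minimal sketch in plain Java of how the range could be expanded into one date per mapper instance. It does not use the App Engine mapper API at all; the class name DateRangeSplitter and the method daysBetween are hypothetical, and it assumes java.time (Java 8 or later) is available.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: expands a from/to date pair into one entry per day.
class DateRangeSplitter {
    // Matches the yyyy/MM/dd style used in the example data.
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("uuuu/MM/dd");

    // Returns every day in the range, newest first, both ends inclusive.
    // The two arguments may be given in either order.
    static List<LocalDate> daysBetween(String from, String to) {
        LocalDate a = LocalDate.parse(from, FORMAT);
        LocalDate b = LocalDate.parse(to, FORMAT);
        LocalDate oldest = a.isBefore(b) ? a : b;
        LocalDate newest = a.isBefore(b) ? b : a;

        List<LocalDate> days = new ArrayList<>();
        for (LocalDate d = newest; !d.isBefore(oldest); d = d.minusDays(1)) {
            days.add(d);
        }
        return days;
    }
}
```

Each element of the returned list would then correspond to one unit of work, i.e. one query for the entities with that date.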
Presumably I need to create a custom InputFormat class. However, I’m quite new to MapReduce/Hadoop; does anyone have an example?
Bonus question: is it “bad form” to use a DAO to load data in a mapper? Other distributed computing platforms I have worked with (e.g. DataSynapse) require that you parcel up all the inputs and supply them with the task, to avoid too much contention on a data server. With the App Engine HR datastore, I presume this isn’t a concern?
It’s not currently possible to iterate over a subset of the entities of a given kind in App Engine’s mapreduce implementation. If the entities you want make up a large proportion of the data, you can simply iterate over everything and ignore the unwanted entities; if they make up only a small proportion, you will have to roll your own update procedure using the task queue.
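As a rough illustration of the first option (iterate over everything and ignore what you don’t want), here is a sketch in plain Java. It deliberately avoids the real mapper API: DateFilteringMapper, Row and map are hypothetical names, the summing of values is only a stand-in for whatever operation you actually perform, and java.time (Java 8 or later) is assumed.

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a mapper: it is handed every entity of the kind
// and skips the ones whose date falls outside the configured range.
class DateFilteringMapper {
    // Simplified stand-in for a datastore entity (Key, Date, Value).
    static class Row {
        final String key;
        final LocalDate date;
        final long value;

        Row(String key, LocalDate date, long value) {
            this.key = key;
            this.date = date;
            this.value = value;
        }
    }

    private final LocalDate from; // inclusive lower bound
    private final LocalDate to;   // inclusive upper bound
    private final Map<LocalDate, Long> totals = new TreeMap<>();

    DateFilteringMapper(LocalDate from, LocalDate to) {
        this.from = from;
        this.to = to;
    }

    // Called once per entity; unwanted entities are ignored.
    void map(Row row) {
        if (row.date.isBefore(from) || row.date.isAfter(to)) {
            return;
        }
        // Placeholder operation: accumulate the values per date.
        totals.merge(row.date, row.value, Long::sum);
    }

    Map<LocalDate, Long> totals() {
        return totals;
    }
}
```

The trade-off is as described above: every entity is still read, so this only pays off when most of them fall inside the range.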