We would like to retrieve all data of an entity type from Google App Engine’s datastore using several simultaneous processes. The processes run asynchronously on back-end Python servers. The idea is for each process to retrieve one “chunk” of the whole data set, so that the load is distributed nearly evenly across all of them, like this:
|_____|_____|_____|_____|_____|_____|_____|.....|_____|_____|
  p1    p2    p3    p4    p5    p6    p7         pk-1   pk
Here each pn is a process, and together the processes retrieve all the entities.
I think the way to do this is something like the following (in Python):
    chunk_size = num_entities // num_chunks
    base_query = 'select * from entity offset %d limit %d'
    # One iteration per chunk; each chunk would be handled by its own process.
    for chunk in range(0, num_entities, chunk_size):
        cursor = get_cursor(base_query, offset=chunk, limit=chunk_size)
        while is_ready(cursor):
            do_task_with_data(cursor.next())
Here get_cursor would obtain a cursor from App Engine that scrolls through the results starting at the given offset. I include the limit argument only in case it helps; it could also be enforced inside the while loop. In any case, we would like to avoid queries that are O(n) in the offset, i.e. where the later queries have to scroll through nearly all the data before fetching anything.
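To make the cost concern concrete, here is a minimal sketch in plain Python (no App Engine calls). It computes near-even (offset, limit) pairs and counts the rows touched under the assumption stated above, namely that a query with an offset has to scroll past the skipped rows first. The helper names chunk_bounds and rows_scanned are hypothetical, made up for this illustration.

```python
def chunk_bounds(num_entities, num_chunks):
    """Return (offset, limit) pairs covering all entities with near-even sizes."""
    base, extra = divmod(num_entities, num_chunks)
    bounds = []
    offset = 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional entity each,
        # so no entity is dropped when the division is not exact.
        limit = base + (1 if i < extra else 0)
        bounds.append((offset, limit))
        offset += limit
    return bounds


def rows_scanned(bounds):
    """Rows touched if every query skips `offset` rows before reading `limit` rows."""
    # Assumption: skipping is not free, as described in the question.
    return sum(offset + limit for offset, limit in bounds)
```

With 10 entities and 3 chunks, the three queries together would touch 21 rows to return 10, and the gap widens as the number of chunks grows.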
Another option might be to distribute entities by a random value stored on each one (which we do have), dividing the range 0 -> 1 into num_chunks sub-ranges.
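A minimal sketch of that partitioning, again in plain Python: it only computes the sub-ranges and does not issue any datastore query. The function names are hypothetical, and it assumes the stored random value lies in [0, 1). Each process would then presumably filter on its own range, lo <= value < hi.

```python
def random_ranges(num_chunks):
    """Split [0, 1) into num_chunks half-open ranges [lo, hi)."""
    return [(i / num_chunks, (i + 1) / num_chunks) for i in range(num_chunks)]


def chunk_for(value, num_chunks):
    """Index of the chunk that a random value in [0, 1) falls into."""
    # min() guards the top edge so a value of exactly 1.0 still maps
    # to the last chunk instead of an out-of-range index.
    return min(int(value * num_chunks), num_chunks - 1)
```

Because the ranges are defined on a property value rather than a position in the result set, no process has to skip rows to reach its share. The chunks are only as even as the random values are uniform.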
It might even be possible to somehow get a data dump out of App Engine and then work on that (although due to size it would not be our first choice).
What would be a good way to achieve this? Is there a better way to solve this problem? Any ideas on this would be really appreciated.
I think you’re pretty much describing what the mapreduce framework does.