After talking with a friend of mine from Google, I’d like to implement some kind of Job/Worker model for updating my dataset.
This dataset mirrors a 3rd party service’s data, so, to do the update, I need to make several remote calls to their API. I think a lot of time will be spent waiting for responses from this 3rd party service. I’d like to speed things up, and make better use of my compute hours, by parallelizing these requests and keeping many of them open at once, as they wait for their individual responses.
Before I explain my specific dataset and get into the problem, I’d like to clarify what answers I’m looking for:
- Is this a flow that would be well suited to parallelizing with MapReduce?
- If yes, would this be cost effective to run on Amazon’s MapReduce module, which bills by the hour and rounds hours up when the job is complete? (I’m not sure exactly what counts as a “Job”, so I don’t know exactly how I’ll be billed.)
- If no, is there another system/pattern I should use, and is there a library that will help me do this in Python (on AWS, using EC2 + EBS)?
- Are there any problems you see with the way I’ve designed this job flow?
Ok, now onto the details:
The dataset consists of users who have favorite items and who follow other users. The aim is to be able to update each user’s queue — the list of items the user will see when they load the page, based on the favorite items of the users she follows. But, before I can crunch the data and update a user’s queue, I need to make sure I have the most up-to-date data, which is where the API calls come in.
There are two calls I can make:
- Get Followed Users — Which returns all the users being followed by the requested user, and
- Get Favorite Items — Which returns all the favorite items of the requested user.
After I call get followed users for the user being updated, I need to update the favorite items for each user being followed. Only when all of the favorites are returned for all the users being followed can I start processing the queue for that original user. This flow looks like:

Jobs in this flow include:
- Start Updating Queue for user — kicks off the process by fetching the users followed by the user being updated, storing them, and then creating Get Favorites jobs for each user.
- Get Favorites for user — Requests, and stores, a list of favorites for the specified user, from the 3rd party service.
- Calculate New Queue for user — Processes a new queue, now that all the data has been fetched, and then stores the results in a cache which is used by the application layer.
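To make the flow above concrete, here is a minimal sketch of the fan-out/fan-in shape in Python using `asyncio`. Everything in it is an assumption for illustration: `get_followed_users` and `get_favorite_items` are hypothetical stand-ins for the two 3rd party calls (they sleep briefly instead of making HTTP requests), the sample data is made up, and `calculate_new_queue` is a placeholder for the real queue logic.

```python
import asyncio

# Made-up sample data standing in for the 3rd party service's responses.
FAKE_FOLLOWS = {"userX": ["alice", "bob"]}
FAKE_FAVORITES = {"alice": ["item1", "item2"], "bob": ["item2", "item3"]}


async def get_followed_users(user):
    """Hypothetical stand-in for the Get Followed Users call."""
    await asyncio.sleep(0.01)  # simulated network latency
    return FAKE_FOLLOWS.get(user, [])


async def get_favorite_items(user):
    """Hypothetical stand-in for the Get Favorite Items call."""
    await asyncio.sleep(0.01)  # simulated network latency
    return FAKE_FAVORITES.get(user, [])


def calculate_new_queue(favorites_by_user):
    """Placeholder queue logic: de-duplicated items, in followed-user order."""
    seen = set()
    new_queue = []
    for user in sorted(favorites_by_user):
        for item in favorites_by_user[user]:
            if item not in seen:
                seen.add(item)
                new_queue.append(item)
    return new_queue


async def update_queue(user):
    # Step 1: fetch the users followed by the user being updated.
    followed = await get_followed_users(user)
    # Step 2: fan out one Get Favorites request per followed user;
    # gather() keeps them all open at once and waits for every response.
    results = await asyncio.gather(*(get_favorite_items(u) for u in followed))
    # Step 3: only now, with all favorites in hand, compute the new queue.
    return calculate_new_queue(dict(zip(followed, results)))


if __name__ == "__main__":
    print(asyncio.run(update_queue("userX")))
```

The point of the sketch is that the "wait until all favorites are back" step is just a join on the outstanding requests, so it does not need a MapReduce-style reduce phase to express it.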
So, again, my questions are:
- Is this a flow that would be well suited to parallelizing with MapReduce? I don’t know if it would let me start the process for UserX, fetch all the related data, and come back to processing UserX’s queue only after that’s all done.
- If yes, would this be cost effective to run on Amazon’s MapReduce module, which bills by the hour and rounds hours up when the job is complete? Is there a limit on how many “threads” I can have waiting on open API requests if I use their module?
- If no, is there another system/pattern I should use, and is there a library that will help me do this in Python (on AWS, using EC2 + EBS)?
- Are there any problems you see with the way I’ve designed this job flow?
Thanks for reading, I’m looking forward to some discussion with you all.
Edit, in response to JimR:
Thanks for a solid reply. In my reading since I wrote the original question, I’ve leaned away from using MapReduce. I haven’t decided for sure yet how I want to build this, but I’m beginning to feel MapReduce is better suited to distributing / parallelizing computing load, whereas I’m really just looking to parallelize HTTP requests.
What would have been my “reduce” task, the part that takes all the fetched data and crunches it into results, isn’t that computationally intensive. I’m pretty sure it’s going to wind up being one big SQL query that executes for a second or two per user.
So, what I’m leaning towards is:
- A non-MapReduce Job/Worker model, written in Python. A friend of mine at Google turned me on to learning Python for this, since it’s low overhead and scales well.
- Using Amazon EC2 as a compute layer. I think this means I also need an EBS volume to store my database.
- Possibly using Amazon’s Simple Queue Service (SQS). It sounds like this third Amazon service is designed to keep track of job queues, move results from one task into the inputs of another, and gracefully handle failed tasks. It’s very cheap. It may be worth using instead of a custom job-queue system.
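As a rough illustration of the Job/Worker idea in the list above, here is a small in-process sketch using only the Python standard library. It is not SQS: `queue.Queue` stands in for the hosted message queue, and `run_workers` and `handler` are hypothetical names I made up for the example. A failed job is recorded with its exception rather than crashing the worker.

```python
import queue
import threading


def run_workers(jobs, handler, num_workers=4):
    """Run handler(job) for every job on a pool of worker threads.

    Returns a list of (job, outcome) pairs; outcome is the handler's
    return value, or the exception if the handler raised.
    """
    job_queue = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            job = job_queue.get()
            if job is None:  # sentinel: no more work for this worker
                job_queue.task_done()
                break
            try:
                outcome = handler(job)
            except Exception as exc:  # record the failure, keep working
                outcome = exc
            with results_lock:
                results.append((job, outcome))
            job_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for job in jobs:
        job_queue.put(job)
    for _ in threads:
        job_queue.put(None)  # one sentinel per worker
    job_queue.join()  # block until every job has been processed
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Toy handler; a real one would call the 3rd party API for each user.
    print(sorted(run_workers(["alice", "bob"], lambda user: user.upper())))
```

Because the workers spend nearly all their time blocked on I/O, threads are enough here; the number of workers mostly controls how many requests are open at once.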
Seems that we’re going with Node.js and the Seq flow control library. It was very easy to move from my map/flowchart of the process to a stub of the code, and now it’s just a matter of filling out the code to hook into the right APIs.
Thanks for the answers, they were a lot of help finding the solution I was looking for.