I have an app…
The app does a market comparison for a financial product – for a given quote request, it contacts several other sites for their quotes. It then gives the user the results – several quotes for their details.
To manage these requests, they get saved to MySQL. My app then kicks in, picks up the pending quotes, and farms them out to threads (all on the same Linux box) to process each site lookup.
I am using JRuby, as I had thread/DB-related issues, with Java thread pools to control the number of threads. On the current hardware (a VPS) it can handle around 200 threads. Many of the limitations seem to relate to each thread grabbing its own MySQL connection to fetch the quote details and save back the results. We want to handle more concurrent threads, so I am looking for ways to scale up.
Wondering which way to go …
- Bigger hardware…
- More machines, with some kind of queueing mechanism (with priorities) to share the load across them. The threads would not touch the DB; all the details and responses would go via the queue, so the DB gets hit less. But then maybe I am just pushing the problem into the queue. I am thinking of using something like MongoDB for the queue, but I am open to suggestions: something easy to use with Ruby 🙂
- Some kind of remote/RPC mechanism, e.g. dRb. In theory this seems like a good option, but I have not done anything with it yet, so I do not know how complex it would make things.
- Something else…?
Going by the question "Reasons for NOT scaling-up vs. -out?", it would seem this problem is suited to running more machines.
So, any thoughts on which way to go…
Cheers,
Chris
My usual approach to problems like this is to pay very close attention to the database queries you’re making and tune them aggressively. Retrieve only what you need, skipping columns that aren’t explicitly used, and be very careful about eager loading things you don’t need in their entirety.
You’ll often find you can get significant speed gains by adding indexes, or by strategically de-normalizing certain attributes in your database to avoid ugly, time-consuming JOIN operations.
Further, think about caching: the fastest database call is the one that’s never made. It’s not hard to bring in something like Memcached to save the results of a moderately time-consuming record retrieval, and if done carefully it’s even easy to invalidate and expire those entries, provided you channel your updates through a few methods.
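To make the "channel your updates through a few methods" idea concrete, here is a minimal sketch of a read-through cache with invalidation on write. It is only an illustration under stated assumptions: `QuoteCache` is an in-process Hash standing in for a cache server such as Memcached, and `QuoteRequestStore`, its fake `@rows` table, and the `quote_request:<id>` key format are all hypothetical names, not anything from your app.

```ruby
# In-process stand-in for a cache server (assumption: a real setup would
# replace the Hash with a Memcached client exposing similar operations).
class QuoteCache
  def initialize
    @store = {}
    @lock  = Mutex.new
  end

  # Return the cached value for key, or run the block once and remember it.
  def fetch(key)
    @lock.synchronize do
      @store[key] = yield unless @store.key?(key)
      @store[key]
    end
  end

  def invalidate(key)
    @lock.synchronize { @store.delete(key) }
  end
end

# Hypothetical data-access class: every read and write goes through these
# two methods, so the cache can never silently go stale.
class QuoteRequestStore
  attr_reader :db_reads

  def initialize(cache)
    @cache    = cache
    @rows     = { 1 => { id: 1, product: "loan", amount: 5000 } } # fake table
    @db_reads = 0
  end

  def find(id)
    @cache.fetch("quote_request:#{id}") { load_from_db(id) }
  end

  def update(id, attrs)
    @rows[id] = @rows.fetch(id).merge(attrs)
    @cache.invalidate("quote_request:#{id}") # next find reloads the row
  end

  private

  # Stands in for the real (slow) database query.
  def load_from_db(id)
    @db_reads += 1
    @rows.fetch(id).dup
  end
end

store = QuoteRequestStore.new(QuoteCache.new)
store.find(1)                 # goes to the "database"
store.find(1)                 # served from the cache
store.update(1, amount: 7500) # write invalidates the cached entry
store.find(1)                 # goes to the "database" again
```

Because `update` is the only write path, invalidation lives in exactly one place. Note that this sketch holds the lock while the block runs, which keeps it simple but would serialize slow loads.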
For scheduling workers, a simple first-in, first-out queue can be implemented in Redis to off-load a lot of the processing overhead from MySQL itself. This is usually very simple to add if you follow an example.
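As a rough sketch of the first-in, first-out worker pattern, the example below uses Ruby's built-in thread-safe `Queue` as an in-process stand-in for Redis, so it runs without a server. With Redis the usual shape would be `LPUSH` to enqueue and a blocking `BRPOP` to dequeue on a list. The `fetch_quote` helper, the site names, and `POOL_SIZE` are made up for illustration.

```ruby
POOL_SIZE = 4 # hypothetical; tune to what the box can handle

# Hypothetical stand-in for contacting one remote site for a quote.
def fetch_quote(site, request_id)
  { site: site, request_id: request_id, price: site.length * 10 + request_id }
end

# Start pool_size workers that pop jobs in FIFO order until they see :stop,
# then collect everything they produced.
def run_workers(jobs, pool_size)
  results = Queue.new
  workers = Array.new(pool_size) do
    Thread.new do
      while (job = jobs.pop) != :stop # pop blocks until a job is available
        results << fetch_quote(*job)
      end
    end
  end
  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

jobs = Queue.new
%w[site_a site_b site_c].each { |site| jobs << [site, 42] } # one job per site
POOL_SIZE.times { jobs << :stop }                           # one sentinel per worker

quotes = run_workers(jobs, POOL_SIZE)
puts quotes.size
```

The point of the design is that workers only ever talk to the queue, so the number of threads is decoupled from the number of MySQL connections. Results could be written back by a single writer instead of by every thread.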
A cache like Memcached can handle an extremely high volume of traffic, so whenever possible, read from it first to avoid hitting your database for every last thing.
If you’ve exhausted these options, it’s time for more front-end servers and even more database capacity, but only then.