Paginating very large datasets: I’m working on a problem that requires caching paginated ‘search’ results. The search works as follows: given an item_id, I find the matching item_ids and their ranks.
I’m willing to concede not showing my users any results past, say, 500. After 500, I’m going to assume they’re not going to find what they’re looking for… the results are sorted in order of match anyway. So I want to cache these 500 results so I only have to do the heavy lifting of the query once, and users can still page the results (up to 500).
Now, suppose I use an intermediate MySQL table as my cache… that is, I store the top 500 results for each item in a ‘matches’ table, like so: ‘item_id (INTEGER), matched_item_id (INTEGER), match_rank (REAL)’. The search then becomes this extremely fast query:
SELECT item.* FROM matches JOIN item ON item.id = matches.matched_item_id WHERE matches.item_id = <item in question> ORDER BY matches.match_rank DESC LIMIT x, y
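To keep that query fast at this scale, the cache table wants a composite index covering both the filter (item_id) and the sort (match_rank), so MySQL can satisfy the ORDER BY straight from the index instead of filesorting 500 rows per request. A minimal schema sketch (column names and types are from the question above; the index name and ENGINE choice are my own assumptions):

```sql
-- Sketch of the proposed cache table; idx_item_rank is a hypothetical name.
CREATE TABLE matches (
  item_id         INTEGER NOT NULL,
  matched_item_id INTEGER NOT NULL,
  match_rank      REAL    NOT NULL,
  -- one row per (item, match) pair; also serves lookups by item_id
  PRIMARY KEY (item_id, matched_item_id),
  -- lets WHERE item_id = ? ORDER BY match_rank DESC read rows in index order
  KEY idx_item_rank (item_id, match_rank)
) ENGINE=InnoDB;
```

With at most 500 rows per item_id, even a large LIMIT offset only ever scans a short index range, so deep pagination stays cheap.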
I’d have no problem reindexing items and their matches into this table as they are requested by clients, if the results are older than, say, 24 hours. Problem is, storing 500 results for each of N items (where N is ~100,000 to 1,000,000), this table becomes rather large: 50,000,000 – 500,000,000 rows.
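One way to drive the 24-hour refresh is a small companion table recording when each item’s matches were last cached, checked before serving results. This is only a sketch of that idea (the match_cache_meta table and its columns are hypothetical, not from the question):

```sql
-- Hypothetical bookkeeping table: one row per cached item.
CREATE TABLE match_cache_meta (
  item_id   INTEGER  NOT NULL PRIMARY KEY,
  cached_at DATETIME NOT NULL
) ENGINE=InnoDB;

-- Is the cache for item 123 still fresh (less than 24 hours old)?
SELECT item_id
FROM match_cache_meta
WHERE item_id = 123
  AND cached_at > NOW() - INTERVAL 24 HOUR;
```

If the freshness check returns no row, the client-triggered reindex would delete the item’s 500 rows from ‘matches’, rerun the heavy query, and upsert the timestamp.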
Can MySQL handle this? What should I look out for?
MySQL can handle this many rows, and there are several techniques for scaling once you start to hit the wall. Partitioning and replication are the main solutions for this scenario.
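For instance, since every lookup is by a single item_id, the cache table is a natural fit for hash partitioning on that column, which splits the row set across smaller physical pieces while keeping each lookup confined to one partition. A sketch (the partition count of 16 is arbitrary; note that MySQL requires the partitioning column to appear in every unique key, which item_id does here):

```sql
-- Same cache table as in the question, hash-partitioned on the lookup key.
CREATE TABLE matches (
  item_id         INTEGER NOT NULL,
  matched_item_id INTEGER NOT NULL,
  match_rank      REAL    NOT NULL,
  PRIMARY KEY (item_id, matched_item_id)
) ENGINE=InnoDB
PARTITION BY HASH(item_id)
PARTITIONS 16;
```

Replication complements this by spreading the read load across replicas, while writes (the periodic reindexing) go to the primary.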
You can also check additional scaling techniques for MySQL in a question I previously asked here on Stack Overflow.