I have a data model like the following:
username | product1 | product2
-------------------------------
harold abc qrs
harold abc def
harold def abc
kim abc def
kim lmn qrs
...
username | friend_username
---------------------------
john harold
john kim
...
I want to build a histogram of the most frequent product1 to product2 records there are, restricted to a given product1 id, and restricted only to friends of john. So something like:
What do friends of john link to for product1, when product1=’abc’:
Select all of john’s friends from the friends table. For each friend, count and group the number of records where product1 = ‘abc’, sort results in desc order:
Results:
abc -> def (2 instances)
abc -> qrs (1 instance)
I know we can do the following in a relational database, but there will be some threshold where this kind of query will start utilizing a lot of resources. Users might have a large number of friend records (500+). If this query is running 5 times every time a user loads a page, I’m worried I’ll run out of resources quickly.
Is there some other table I can introduce to my model to relieve the overhead of doing the above query everytime users want to see the histogram break down? All I can think of is to precompute the histograms when possible so that reads optimized.
Thanks for any ideas
Here’s your query:
To handle 5 products, use:
It’s pretty simple, and the more you can filter the records down, the faster it will run because of being a smaller dataset.
My first question is why you’d run this query more than once per page. If it’s to cover more than one friend, the query I posted can be updated to expose counts for products on a per friend or user basis.
After that, I’d wonder if the query can be cached at all. How fresh do you really need the data to be – is 2 hours acceptable? How about 6 or 12… We’d all like the data to be instantaneous, but you need to weigh that against performance and make a decision.