I have a problem that has been driving me nuts for the last 2 days. I basically have 5 tables with inheritance in the following order:
            users
              |
categories  blogs
    |        |  |
    ----- pages visits
So a user has many blogs, each of which has many pages and visits. Each page also belongs to a category.
All I want is to extract all users with the following counts associated:
- total number of blogs each user has
- total number of pages each user has
- total number of categories each user has blogs in
- total number of visits each user has
- total number of visitors each user has (like visits, but counted by distinct ip_address)
My query is as follows:
SELECT
u.id,
u.username,
COUNT(b.id) as blogs_count,
COUNT(p.id) as pages_count,
COUNT(v.id) as visits_count,
COUNT(distinct ip_address) as visitors_count,
COUNT(c.id) as categories_count
FROM
users u
LEFT JOIN
blogs b ON(b.user_id=u.id)
LEFT JOIN
pages p ON(p.blog_id=b.id)
LEFT JOIN
visits v ON(v.blog_id=b.id)
LEFT JOIN
categories c ON(v.category_id=c.id)
GROUP BY u.id, blogs_count, pages_count, visits_count,
visitors_count, categories_count
I should get 24 users with their counts but, given that I have almost 300,000 visits, the query hangs forever, probably because the database is trying to pull millions of rows.
I’m not a db guru and it’s obvious. Can someone point me in the right direction so I can write a query that performs well even on millions of records (with the right hardware of course)?
Try this:
Breakdown
This should also work across different DBMSs like PGSQL, SQL-Server, etc.
The challenge is that you have this sort of hierarchy of 1:M relationships in which joining them all together can easily throw off the different types of counts (as you want distinct counts in some places, but total counts in others).
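To make the fan-out concrete, here is a small sketch. It assumes a hypothetical blog with id 1 that has 2 pages and 3 visits; the table and column names come from the question, the data is made up for illustration.

```sql
-- Hypothetical data: blog 1 has 2 rows in pages and 3 rows in visits.
-- Joining both child tables to the same parent pairs every page with
-- every visit, so the intermediate result has 2 * 3 = 6 rows.
SELECT
    COUNT(p.id)          AS pages_count,      -- 6, not 2
    COUNT(v.id)          AS visits_count,     -- 6, not 3
    COUNT(DISTINCT p.id) AS distinct_pages,   -- 2
    COUNT(DISTINCT v.id) AS distinct_visits   -- 3
FROM blogs b
LEFT JOIN pages  p ON p.blog_id = b.id
LEFT JOIN visits v ON v.blog_id = b.id
WHERE b.id = 1;
```

COUNT(DISTINCT ...) would repair the numbers, but the database still has to build every multiplied row first, which is likely why the original query struggles with 300,000 visits.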
What I’ve decided to do is first subselect the counts of pages and of visits / distinct visitors, grouping by blog_id. This ensures that we get only one row per blog_id, even after joining the subselects to the blogs table.
For the category count, you want a count of distinct categories per user, but the challenge is that categories is linked deep within the relationship hierarchy (to the pages table), so you have to make a separate subselect that joins on user_id instead of blog_id.
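A minimal sketch of a query with that shape is below. It assumes the column names used in the question (blogs.user_id, pages.blog_id, visits.blog_id, visits.ip_address) plus a pages.category_id column, which is an assumption based on "each page also belongs to a category". The subselect aliases and the COALESCE defaults are illustrative choices.

```sql
SELECT
    u.id,
    u.username,
    COUNT(b.id)                        AS blogs_count,
    COALESCE(SUM(p.pages_count), 0)    AS pages_count,
    COALESCE(SUM(v.visits_count), 0)   AS visits_count,
    COALESCE(SUM(v.visitors_count), 0) AS visitors_count,
    COALESCE(c.categories_count, 0)    AS categories_count
FROM users u
LEFT JOIN blogs b ON b.user_id = u.id
-- One row per blog: number of pages in that blog.
LEFT JOIN (
    SELECT blog_id, COUNT(*) AS pages_count
    FROM pages
    GROUP BY blog_id
) p ON p.blog_id = b.id
-- One row per blog: total visits and distinct IPs for that blog.
LEFT JOIN (
    SELECT blog_id,
           COUNT(*)                   AS visits_count,
           COUNT(DISTINCT ip_address) AS visitors_count
    FROM visits
    GROUP BY blog_id
) v ON v.blog_id = b.id
-- One row per user: distinct categories across all of the user's pages.
-- Assumes pages.category_id exists.
LEFT JOIN (
    SELECT b2.user_id, COUNT(DISTINCT p2.category_id) AS categories_count
    FROM blogs b2
    JOIN pages p2 ON p2.blog_id = b2.id
    GROUP BY b2.user_id
) c ON c.user_id = u.id
GROUP BY u.id, u.username, c.categories_count;
```

Note that visitors_count here sums the distinct IPs of each blog, so an IP that visits two blogs of the same user is counted twice. If you need visitors that are unique per user, that count would have to move into its own subselect grouped by user_id, the same way the category count is handled.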
Even with as many subselects as this query contains, it should still be quite fast, as no two subselects are joined against each other. As long as there is an indexed table on at least one side of each join (subselects are effectively unindexed temporary tables), you should be fine.
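If the join columns are not indexed yet, something along these lines may help. The index names are hypothetical, the columns are the ones from the question plus the assumed pages.category_id, and whether the optimizer actually uses them depends on your DBMS, so it is worth checking the query plan (EXPLAIN in MySQL and PostgreSQL).

```sql
-- Hypothetical index names; adjust to your own naming scheme.
-- Supports the join from users to blogs.
CREATE INDEX idx_blogs_user_id ON blogs (user_id);
-- Supports grouping pages by blog and reading category_id (assumed column).
CREATE INDEX idx_pages_blog_category ON pages (blog_id, category_id);
-- Supports grouping visits by blog and counting distinct ip_address values.
CREATE INDEX idx_visits_blog_ip ON visits (blog_id, ip_address);
```

If these columns are already declared as foreign keys on InnoDB tables in MySQL, an index on the leading column exists already, so only the composite ones would be new.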