Using SQL Server, I’m trying to query a kind of averaged count from a table I didn’t design. Basically I want a list, grouped by one column, with the number of distinct values of another column matching a given criterion and, of those, the number of rows matching another criterion (which I’ll use to create the averaged count or whatever it is). This can’t be hard, but I’m having a bad set theory day and any pointers will be gratefully received.
Here’s the simplified and genericized scenario (schema and sample data below). Say we have three columns:
- objid (has a clustered index)
- userid (no index, I might be able to add one)
- actiontype (no index, I might be able to add one)
None of these is unique, and none can be null. We want to completely ignore any rows where actiontype is 'none'. We want to know, per userid, how many actiontype = 'flag' rows there are on average per object that user has interacted with.
So if we have “ahmed”, “joe”, and “maria”, and joe interacted with 3 objects and raised 5 flags, the number there is 5 / 3 = 1.6666 (repeating); if “ahmed” interacted with 3 objects and didn’t raise any flags, his number would be 0; if “maria” interacted with 5 objects and raised 4 flags, her number would be 4 / 5 = 0.8:
+--------+------------------+
| userid | flags_per_object |
+--------+------------------+
| ahmed  | 0                |
| joe    | 1.66666667       |
| maria  | 0.8              |
+--------+------------------+
I won’t be remotely surprised if this is closed as a duplicate; I’m just not finding it.
Here’s the simplified table setup and sample data:
create table tmp
(
objid varchar(254) not null,
userid varchar(254) not null,
actiontype varchar(254) not null
)
create clustered index tmp_objid on tmp(objid)
insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'none')
insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'none')
insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'update')
insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'close')
insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'flag')
insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'flag')
insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'flag')
insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'flag')
insert into tmp (objid, userid, actiontype) values ('beta', 'joe', 'none')
insert into tmp (objid, userid, actiontype) values ('beta', 'joe', 'none')
insert into tmp (objid, userid, actiontype) values ('beta', 'joe', 'close')
insert into tmp (objid, userid, actiontype) values ('beta', 'joe', 'flag')
insert into tmp (objid, userid, actiontype) values ('gamma', 'joe', 'none')
insert into tmp (objid, userid, actiontype) values ('delta', 'joe', 'update')
insert into tmp (objid, userid, actiontype) values ('alpha', 'maria', 'update')
insert into tmp (objid, userid, actiontype) values ('beta', 'maria', 'flag')
insert into tmp (objid, userid, actiontype) values ('beta', 'maria', 'flag')
insert into tmp (objid, userid, actiontype) values ('gamma', 'maria', 'flag')
insert into tmp (objid, userid, actiontype) values ('gamma', 'maria', 'flag')
insert into tmp (objid, userid, actiontype) values ('gamma', 'maria', 'update')
insert into tmp (objid, userid, actiontype) values ('gamma', 'maria', 'close')
insert into tmp (objid, userid, actiontype) values ('delta', 'maria', 'update')
insert into tmp (objid, userid, actiontype) values ('epsilon', 'maria', 'update')
insert into tmp (objid, userid, actiontype) values ('alpha', 'ahmed', 'none')
insert into tmp (objid, userid, actiontype) values ('beta', 'ahmed', 'none')
insert into tmp (objid, userid, actiontype) values ('gamma', 'ahmed', 'none')
insert into tmp (objid, userid, actiontype) values ('gamma', 'ahmed', 'update')
insert into tmp (objid, userid, actiontype) values ('delta', 'ahmed', 'update')
insert into tmp (objid, userid, actiontype) values ('delta', 'ahmed', 'close')
insert into tmp (objid, userid, actiontype) values ('epsilon', 'ahmed', 'update')
insert into tmp (objid, userid, actiontype) values ('epsilon', 'ahmed', 'close')
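The column list above notes that userid and actiontype have no index but that one might be added. As a rough sketch of what that could look like (the index name tmp_userid_actiontype is made up for illustration, and whether it actually helps would need to be checked against the real data):

```sql
-- Hypothetical nonclustered index on the two columns the query filters and groups on.
-- Because the table's clustered index is on objid, SQL Server stores the clustering
-- key (objid) in the nonclustered index rows as well.
create nonclustered index tmp_userid_actiontype
    on tmp (userid, actiontype);
```

This is an assumption about a plausible index, not something that has been tested here.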
The answer is: It depends.
In my testing, my solution is the slowest of the bunch, regardless of what test data I use. With real-life data, it’s about half the speed of the fastest solution.
Mikael’s solution is faster for the test data quoted in my question, and faster for a larger-but-still-small data set (our testing system, about 2k rows) in my real-life tables.
But a1ex07’s solution is faster for my full-size real-life tables (our live system, about 700k rows). There’s not a lot of distance between a1ex07’s and Mikael’s, but a1ex07’s definitely has the edge.
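For anyone wanting to repeat this kind of comparison on their own data, one common approach in SQL Server is to turn on the session statistics options and run each candidate query in turn. This is only a sketch of a possible method, not a record of how the numbers above were measured; the select below is a stand-in for whichever query is being tested.

```sql
-- Report CPU/elapsed time and logical reads for each statement in this session.
set statistics time on;
set statistics io on;

-- Stand-in query: replace with each candidate solution and compare the output
-- shown in the Messages tab.
select count(*) as non_none_rows
from tmp
where actiontype <> 'none';

set statistics io off;
set statistics time off;
```

Running each candidate a few times helps, since the first run may include time spent reading data from disk.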
I ended up actually using Mikael’s solution, though, because it’s easier to conceptualize if you’re not a l33t DB person (and the people doing maintenance on this code, of which the SQL is only a small part, won’t be) and easier to adapt to various other scenarios.
Thus this community wiki meta-answer, which I’ll accept when the time limit passes, rather than accepting either of their excellent answers. If you found this helpful, please do vote up both Mikael’s answer and a1ex07’s answer, as I have done.
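Neither Mikael’s nor a1ex07’s query is reproduced here. Purely as an illustration of the calculation described in the question (and not necessarily the approach taken in either answer, or in my own slower one), a minimal sketch against the tmp table might look like this:

```sql
-- Sketch only: flags per distinct object, per user, ignoring 'none' rows.
select
    userid,
    -- multiply by 1.0 so the division is decimal rather than integer
    sum(case when actiontype = 'flag' then 1 else 0 end) * 1.0
        / count(distinct objid) as flags_per_object
from tmp
where actiontype <> 'none'   -- rows with actiontype 'none' are ignored entirely
group by userid
order by userid;
```

With the sample data this should work out to 0 / 3 for ahmed, 5 / 3 for joe, and 4 / 5 for maria. Note that a user whose rows are all 'none' would not appear in the result at all, which may or may not be what you want.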