I have documents which can belong to several classes and can contain several tokens (words):
create table Tokens (
Id INT not null,
Text NVARCHAR(255) null,
primary key (Id)
)
create table DocumentClassTokens (
Id INT not null,
DocumentFk INT null,
ClassFk INT null,
TokenFk INT null,
primary key (Id)
)
I would like to determine these stats (for all tokens given the class):
- A = number of distinct documents which contain token and belong to class
- B = number of distinct documents which contain token and do not belong to class
- C = number of distinct documents which do not contain token and belong to class
- D = number of distinct documents which do not contain token and do not belong to class
I am using this at the moment but it does not look right (I am pretty sure that the computation of A and B is correct):
declare @class int;
select @class = id from dbo.Classes where text = 'bla'
;with A as
(
select
a.text as token,
count(distinct DocumentFk) as A
from dbo.Tokens as a
inner join dbo.DocumentClassTokens as b on a.id = b.TokenFk and b.ClassFk = @class
group by a.text
)
,B as
(
select
a.text as token,
count(distinct DocumentFk) as B
from dbo.Tokens as a
inner join dbo.DocumentClassTokens as b on a.id = b.TokenFk and b.ClassFk != @class
group by a.text
)
,C as
(
select
a.text as token,
count(distinct DocumentFk) as C
from dbo.Tokens as a
inner join dbo.DocumentClassTokens as b on a.id != b.TokenFk and b.ClassFk = @class
group by a.text
)
,D as
(
select
a.text as token,
count(distinct DocumentFk) as D
from dbo.Tokens as a
inner join dbo.DocumentClassTokens as b on a.id != b.TokenFk and b.ClassFk != @class
group by a.text
)
select
case when A is null then 0 else A end as A,
case when B is null then 0 else B end as B,
case when C is null then 0 else C end as C,
case when D is null then 0 else D end as D,
t.Text,
t.id
from dbo.Tokens as t
left outer join A as a on t.text = a.token
left outer join B as b on t.text = b.token
left outer join C as c on t.text = c.token
left outer join D as d on t.text = d.token
order by t.text
Any feedback would be very much appreciated. Many thanks!
Best wishes,
Christian
PS:
Some test data:
use play;
drop table tokens
create table Tokens
(
Id INT not null,
Text NVARCHAR(255) null,
primary key (Id)
)
insert into Tokens (id, text) values (1,'1')
insert into Tokens (id, text) values (2,'2')
drop table DocumentClassTokens
create table DocumentClassTokens (
Id INT not null,
DocumentFk INT null,
ClassFk INT null,
TokenFk INT null,
primary key (Id)
)
insert into DocumentClassTokens (Id,documentfk,ClassFk,TokenFk) values (1,1,1,1)
insert into DocumentClassTokens (Id,documentfk,ClassFk,TokenFk) values (2,1,1,2)
insert into DocumentClassTokens (Id,documentfk,ClassFk,TokenFk) values (3,2,1,1)
insert into DocumentClassTokens (Id,documentfk,ClassFk,TokenFk) values (4,2,2,1)
insert into DocumentClassTokens (Id,documentfk,ClassFk,TokenFk) values (5,3,2,1)
insert into DocumentClassTokens (Id,documentfk,ClassFk,TokenFk) values (6,3,2,3)
Your question seems now much clearer, and if I haven’t overlooked anything, then here’s a query you might try to run against your data.
Please let us know if it’s all right.
EDIT
The above solution is wrong. Here’s the fixed one, and it certainly seems correct only I do not like it as much as the wrong version (what an irony…).
Note: I think I can see now how you can check if the figures are wrong: the sum of A, B, C, D in every row (i.e. for every token) must be equal to the total document count, which should not be surprising, because every document can satisfy 1 and only 1 of the 4 cases being explored. If the row sum is different from the total document count then some figures in the row are certainly wrong.