I have documents which can belong to several classes and can contain several tokens

Asked: May 19, 2026 (2026-05-19T22:38:18+00:00)

I have documents which can belong to several classes and can contain several tokens (words):

create table Tokens (
    Id INT not null,
    Text NVARCHAR(255) null,
    primary key (Id)
)

create table DocumentClassTokens (
    Id INT not null,
    DocumentFk INT null,
    ClassFk INT null,
    TokenFk INT null,
    primary key (Id)
)

I would like to determine these stats (for all tokens given the class):

A = number of distinct documents which contain token and belong to class
B = number of distinct documents which contain token and do not belong to class
C = number of distinct documents which do not contain token and belong to class
D = number of distinct documents which do not contain token and do not belong to class

I am using this at the moment but it does not look right (I am pretty sure that the computation of A and B is correct):

declare @class int;

select @class = id from dbo.Classes where text = 'bla'

;with A as
(
    select
        a.text as token,
        count(distinct DocumentFk) as A
    from dbo.Tokens as a
    inner join dbo.DocumentClassTokens as b on a.id = b.TokenFk and b.ClassFk = @class
    group by a.text
)
,B as
(
    select
        a.text as token,
        count(distinct DocumentFk) as B
    from dbo.Tokens as a
    inner join dbo.DocumentClassTokens as b on a.id = b.TokenFk and b.ClassFk != @class
    group by a.text
)
,C as
(
    select
        a.text as token,
        count(distinct DocumentFk) as C
    from dbo.Tokens as a
    inner join dbo.DocumentClassTokens as b on a.id != b.TokenFk and b.ClassFk = @class
    group by a.text
)
,D as
(
    select
        a.text as token,
        count(distinct DocumentFk) as D
    from dbo.Tokens as a
    inner join dbo.DocumentClassTokens as b on a.id != b.TokenFk and b.ClassFk != @class
    group by a.text
)
select 
    case when A is null then 0 else A end as A,
    case when B is null then 0 else B end as B,
    case when C is null then 0 else C end as C,
    case when D is null then 0 else D end as D,
    t.Text,
    t.id
from dbo.Tokens as t
left outer join A as a on t.text = a.token
left outer join B as b on t.text = b.token
left outer join C as c on t.text = c.token
left outer join D as d on t.text = d.token
order by t.text

Any feedback would be very much appreciated. Many thanks!

Best wishes,

Christian

PS:

Some test data:

use play;

drop table tokens
create table Tokens 
(
   Id INT not null,
   Text NVARCHAR(255) null,
   primary key (Id)
)

insert into Tokens (id, text) values (1,'1')
insert into Tokens (id, text) values (2,'2')

drop table DocumentClassTokens
create table DocumentClassTokens
(
   Id INT not null,
   DocumentFk INT null,
   ClassFk INT null,
   TokenFk INT null,
   primary key (Id)
)

insert into DocumentClassTokens (Id,documentfk,ClassFk,TokenFk) values (1,1,1,1) 
insert into DocumentClassTokens (Id,documentfk,ClassFk,TokenFk) values (2,1,1,2) 
insert into DocumentClassTokens (Id,documentfk,ClassFk,TokenFk) values (3,2,1,1) 
insert into DocumentClassTokens (Id,documentfk,ClassFk,TokenFk) values (4,2,2,1) 
insert into DocumentClassTokens (Id,documentfk,ClassFk,TokenFk) values (5,3,2,1) 
insert into DocumentClassTokens (Id,documentfk,ClassFk,TokenFk) values (6,3,2,3)

1 Answer

Editorial Team · Answer 1 · 2026-05-19T22:38:19+00:00

Your question is much clearer now. Unless I have overlooked something, here is a query you could run against your data.

DECLARE @class int;
SET @class = 1;

SELECT
  TokenFk,
  TokenClassDocs                        AS A,
  TokenNonClassDocs                     AS B,
  TotalClassDocs    - TokenClassDocs    AS C,
  TotalNonClassDocs - TokenNonClassDocs AS D
FROM (
  SELECT
    TokenFk,
    COUNT(DISTINCT CASE ClassFk WHEN @class THEN DocumentFk ELSE NULL END) AS TokenClassDocs,
    COUNT(DISTINCT CASE ClassFk WHEN @class THEN NULL ELSE DocumentFk END) AS TokenNonClassDocs
  FROM DocumentClassTokens dct
  GROUP BY dct.TokenFk
) AS bytoken
  CROSS JOIN (
    SELECT
      COUNT(DISTINCT CASE ClassFk WHEN @class THEN DocumentFk ELSE NULL END) AS TotalClassDocs,
      COUNT(DISTINCT CASE ClassFk WHEN @class THEN NULL ELSE DocumentFk END) AS TotalNonClassDocs
    FROM DocumentClassTokens
  ) AS totals

Please let us know if it’s all right.

EDIT

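The query above is wrong. As far as I can tell, it treats a document as a non-class document whenever it has at least one row with a different class, so a document that belongs to both @class and another class is counted on both sides. Here is the fixed version. It appears to be correct, although I do not like it as much as the wrong one (how ironic).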

DECLARE @class int;
SET @class = 1;

SELECT
  TokenFk,
  TokenClassDocs                        AS A,
  TokenNonClassDocs                     AS B,
  TotalClassDocs    - TokenClassDocs    AS C,
  TotalNonClassDocs - TokenNonClassDocs AS D
FROM (
  SELECT
    TokenFk,
    COUNT(DISTINCT cls.DocumentFk) AS TokenClassDocs,
    COUNT(DISTINCT CASE WHEN cls.DocumentFk IS NULL THEN dct.DocumentFk END) AS TokenNonClassDocs
  FROM DocumentClassTokens dct
    LEFT JOIN (
      SELECT DISTINCT DocumentFk
      FROM DocumentClassTokens
      WHERE ClassFk = @class
    ) cls ON dct.DocumentFk = cls.DocumentFk
  GROUP BY dct.TokenFk
) AS bytoken
  CROSS JOIN (
    SELECT
      COUNT(DISTINCT cls.DocumentFk) AS TotalClassDocs,
      COUNT(DISTINCT CASE WHEN cls.DocumentFk IS NULL THEN dct.DocumentFk END) AS TotalNonClassDocs
    FROM DocumentClassTokens dct
      LEFT JOIN (
        SELECT DISTINCT DocumentFk
        FROM DocumentClassTokens
        WHERE ClassFk = @class
      ) cls ON dct.DocumentFk = cls.DocumentFk
  ) AS totals

Note: there is a simple way to check whether the figures are wrong. For every token (i.e. in every row), A + B + C + D must equal the total number of distinct documents, because each document falls into exactly one of the four cases. If a row's sum differs from the total document count, at least one figure in that row is wrong.
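
The row-sum check can be sketched as a query of its own. Since C and D are obtained by subtraction in both versions, A + B + C + D always reduces to TotalClassDocs + TotalNonClassDocs, so it is enough to compare that sum with the distinct document count. The column names TotalDocs, FirstQueryRowSum and FixedQueryRowSum are made up for this illustration, and the query assumes the same DocumentClassTokens table as above.

```sql
DECLARE @class int;
SET @class = 1;

SELECT
  -- Independent reference value: every document counted exactly once.
  COUNT(DISTINCT dct.DocumentFk) AS TotalDocs,

  -- Row sum implied by the first query: a document with rows in @class
  -- and in another class lands in both counts.
  COUNT(DISTINCT CASE WHEN dct.ClassFk = @class THEN dct.DocumentFk END)
    + COUNT(DISTINCT CASE WHEN dct.ClassFk = @class THEN NULL ELSE dct.DocumentFk END)
    AS FirstQueryRowSum,

  -- Row sum implied by the fixed query: a document is a class document
  -- if it has at least one row in @class, otherwise a non-class document.
  COUNT(DISTINCT cls.DocumentFk)
    + COUNT(DISTINCT CASE WHEN cls.DocumentFk IS NULL THEN dct.DocumentFk END)
    AS FixedQueryRowSum
FROM DocumentClassTokens AS dct
  LEFT JOIN (
    SELECT DISTINCT DocumentFk
    FROM DocumentClassTokens
    WHERE ClassFk = @class
  ) AS cls ON dct.DocumentFk = cls.DocumentFk;
```

With the test data from the question and @class = 1, this should return TotalDocs = 3, FirstQueryRowSum = 4 and FixedQueryRowSum = 3. Document 2 has rows in both class 1 and class 2, which is what pushes the first query's sum above the document count.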
