I have this joining table:
CREATE TABLE [TagMap](
[intItemId] [bigint] NOT NULL,
[intTagId] [bigint] NOT NULL,
CONSTRAINT [PK_TagMap_intItemId] PRIMARY KEY CLUSTERED
(
[intItemId] ASC,
[intTagId] ASC
))
My sproc discovers the tags that are associated with a seedItemId and then selects other intItemIds that are associated with those tags, like this:
declare @baseTags table (intTagId bigint primary key clustered);
INSERT INTO @baseTags (intTagId)
SELECT TOP 20 t1.intTagId
FROM TagMap t1
WHERE t1.intItemId = 776
SELECT TOP 500 t1.intItemId
FROM TagMap t1
JOIN @baseTags t2
ON t1.intTagId = t2.intTagId
GROUP BY t1.intItemId
ORDER BY Count(*) DESC
What I need to do is specify a minimum number of tag associations (say 10) between two intItemIds, below which a value is not returned. In other words, if ten or more intTagIds that two intItemIds have in common can be found in the TagMap table, it's good and we select it; otherwise we ignore it.
So, for example given this data:
CREATE TABLE #TagMap(
[intItemId] [bigint] NOT NULL,
[intTagId] [bigint] NOT NULL,
CONSTRAINT [PK_TagMap_intItemId] PRIMARY KEY CLUSTERED
(
[intItemId] ASC,
[intTagId] ASC
))
insert into #TagMap
(intItemId, intTagId)
values
(1, 100),(1, 200),(1, 300),
(2, 100),(2, 200), (2, 500),(2, 600),
(3, 100), (3, 500),(3, 600)
Assume the match threshold is two.
If the seed is intItemId 1, then only intItemId 2 should be returned (it has two matching tag Ids: 100 and 200, while intItemId 3 has one: 100, which is below the threshold).
If the seed is intItemId 2, then both intItemId 1 and 3 should be returned (intItemId 1 matches the tag Ids 100 and 200, while intItemId 3 matches the tag Ids 100, 500 and 600).
If the seed is intItemId 3, then only intItemId 2 should be returned (it has three matching tag Ids: 100, 500 and 600, while intItemId 1 has one: 100, which is below the threshold).
Any ideas how to do this, please?
Cheers,
Matt
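One way to apply the threshold is to self-join the tag map on intTagId and filter the grouped counts with a HAVING clause. What follows is only a minimal sketch against the #TagMap sample data above. The variable names @seedItemId and @minMatches are made up for illustration, and unlike the original sproc it excludes the seed item itself and does not cap the seed at 20 tags.

```sql
-- Hypothetical parameters, not part of the original sproc.
DECLARE @seedItemId bigint = 1;
DECLARE @minMatches int = 2;

SELECT TOP 500
       other.intItemId,
       COUNT(*) AS intMatchCount          -- tags shared with the seed item
FROM #TagMap AS seed
JOIN #TagMap AS other
  ON other.intTagId = seed.intTagId       -- same tag...
 AND other.intItemId <> seed.intItemId    -- ...on a different item
WHERE seed.intItemId = @seedItemId
GROUP BY other.intItemId
HAVING COUNT(*) >= @minMatches            -- drop items below the threshold
ORDER BY COUNT(*) DESC;
```

With the sample rows and a threshold of two, seed 1 should return only intItemId 2, while seed 2 should return both intItemId 3 and intItemId 1.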
Note, however, that this scales pretty badly, because the JOIN searches for "has any of these tags", and only in the HAVING clause can you specify "has all of these tags".
From my experience you can make minor optimisations, but the most rewarding approach I found was to cache the results in a more standard mapping table, updating them as and when necessary. (In my case, the contents of the tags data changed very infrequently.)
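As a rough illustration of the caching idea, here is one possible reading of it, not necessarily the layout meant above: precompute the shared-tag count for each pair of items into a table, then read from that table at query time. The table name #ItemTagOverlap and its columns are hypothetical, and the sketch assumes the #TagMap sample data from the question.

```sql
-- Hypothetical cache table: one row per ordered pair of items that share tags.
CREATE TABLE #ItemTagOverlap (
    intItemId      bigint NOT NULL,
    intOtherItemId bigint NOT NULL,
    intMatchCount  int    NOT NULL,
    PRIMARY KEY CLUSTERED (intItemId, intOtherItemId)
);

DECLARE @minMatches int = 2;  -- assumed threshold; pairs below it are not cached

-- Rebuild step: run whenever the tag data changes.
INSERT INTO #ItemTagOverlap (intItemId, intOtherItemId, intMatchCount)
SELECT a.intItemId, b.intItemId, COUNT(*)
FROM #TagMap AS a
JOIN #TagMap AS b
  ON b.intTagId = a.intTagId
 AND b.intItemId <> a.intItemId
GROUP BY a.intItemId, b.intItemId
HAVING COUNT(*) >= @minMatches;

-- Lookup step: a plain seek on the clustered key, with no grouping at query time.
SELECT TOP 500 intOtherItemId
FROM #ItemTagOverlap
WHERE intItemId = 2
ORDER BY intMatchCount DESC;
```

The trade-off is that the cache can grow with the number of item pairs and has to be refreshed when tags change, so it mainly pays off when the tag data is updated rarely.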