I would like to do something similar to Gmail’s “consider including” suggestions on my blog, but with tags.
I was thinking of storing tag sets like this:

and I thought of the following algorithm:
// a blog post is published
// it has the tags "A", "B" & "C":
if the tag set "A,B,C" doesn't exist
    create it
else
    add 1 to "number of times used"
and, to suggest tags:
// a blog post is being written
// the author includes the tags "A" and "C"
// which tags should I suggest?
find all the tag sets that contain "A" and "C"
among them, find the one with the highest "number of times used"
suggest the tags of that set that are not already picked (i.e. everything except A & C)
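For reference, here is a minimal Python sketch of the two steps above. It keeps everything in memory instead of a database, and the names `tag_set_counts`, `record_post` and `suggest` are made up for the illustration. One small deviation, flagged in the comments: it skips the set that is exactly the picked tags, since that set would suggest nothing.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the "tag sets" table:
# maps a frozenset of tags to its "number of times used".
tag_set_counts = Counter()


def record_post(tags):
    """A post is published: create its tag set, or add 1 to its count."""
    tag_set_counts[frozenset(tags)] += 1


def suggest(picked):
    """Return the unpicked tags of the most used tag set containing `picked`."""
    picked = frozenset(picked)
    # Assumption: only strict supersets are considered, because the set equal
    # to `picked` has no additional tags to offer.
    candidates = {s: n for s, n in tag_set_counts.items() if picked < s}
    if not candidates:
        return set()
    best = max(candidates, key=candidates.get)
    return set(best - picked)
```

Note that finding the candidates here is a scan over every stored tag set, which is exactly the part that may become slow as the number of sets grows.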
Is there a better/smarter way of accomplishing this task? What about the database model? Can I optimize it so that searches like “sets that contain A & C” won’t be too slow?
Search model issues:
Your model seems a bit too simple to me: very frequent tags are likely to always be the ones suggested, even when other tags are more closely related to the pair A, C.
You should probably consider the tf-idf model, which gives a boost to rare terms if they are also connected to the “query” [here the query is A and C]: if a rare term is commonly used with A and C, it is probably closely related to them.
The idea is simple: if a tag is frequently used with A and C, give it a boost [tf].
Also, if a term is rare [low total number of uses of this tag], give it a boost [idf].
The “score” for each tag will be the combined tf-idf score (typically the product of the two).
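A rough sketch of that scoring in Python, under some assumptions of mine: posts are given as plain collections of tags, tf is the number of posts where the candidate appears together with all picked tags, idf is `log(N / total uses of the tag)`, and the two are multiplied. The function name `suggest_tags` and the sample data are hypothetical.

```python
import math
from collections import Counter


def suggest_tags(posts, picked, limit=3):
    """Rank candidate tags for `picked` by a simple tf-idf style score."""
    picked = set(picked)
    n_posts = len(posts)

    # Total number of posts using each tag (the basis for idf).
    usage = Counter(tag for tags in posts for tag in set(tags))

    # Number of posts where a tag appears together with all picked tags (tf).
    together = Counter()
    for tags in posts:
        tags = set(tags)
        if picked <= tags:
            together.update(tags - picked)

    # A tag used in every post gets idf = log(1) = 0, so it scores 0.
    scores = {
        tag: together[tag] * math.log(n_posts / usage[tag])
        for tag in together
    }
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda tag: (-scores[tag], tag))
    return ranked[:limit]


# Made-up data: "news" is used everywhere, "rare" only together with A and C.
posts = (
    [{"A", "C", "news"}] * 3
    + [{"A", "C", "rare"}] * 2
    + [{"B", "news"}] * 5
)
print(suggest_tags(posts, ["A", "C"]))
```

With this sample data "news" co-occurs with A and C more often (3 times versus 2), but "rare" is ranked first because it is almost never used elsewhere.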
Performance issues:
You might also consider creating an inverted index for this task, to speed up searches.
If you are using Java, Apache Lucene is a mature library that can help you with it.
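To illustrate the inverted index idea independently of Lucene, here is a small in-memory sketch in Python. The class `TagIndex` and its methods are hypothetical names for this example, not part of any library. Each tag maps to the set of post ids that use it, so “posts that contain A and C” becomes a set intersection instead of a scan over every tag set.

```python
from collections import defaultdict


class TagIndex:
    """Toy inverted index: tag -> ids of the posts that carry that tag."""

    def __init__(self):
        self.posts_by_tag = defaultdict(set)

    def add_post(self, post_id, tags):
        for tag in tags:
            self.posts_by_tag[tag].add(post_id)

    def posts_with_all(self, tags):
        """Return the ids of posts that contain every tag in `tags`."""
        # .get avoids creating empty entries for unknown tags.
        posting_lists = [self.posts_by_tag.get(tag, set()) for tag in tags]
        if not posting_lists:
            return set()
        # Start from the shortest list so the intersection stays small.
        posting_lists.sort(key=len)
        result = set(posting_lists[0])
        for ids in posting_lists[1:]:
            result &= ids
        return result


index = TagIndex()
index.add_post(1, ["A", "B", "C"])
index.add_post(2, ["A", "C"])
index.add_post(3, ["A", "B"])
print(index.posts_with_all(["A", "C"]))
```

In a relational database the same structure is roughly a (tag, post_id) table with an index on the tag column.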