I would like information on algorithms that can help identify commonality and differences between sets of overlapping data.
Using stackoverflow’s tag system as an example:
Let’s say this question has been given 5 tags. Let’s say there are 1000 other questions that have at least one of these tags. Of these 1000 questions, how many of these questions have tags in common that my original post does not have?
Another more simple way of describing this is an auto-suggest tagging system :
‘You tagged your question with [5 tags I selected]. Other similiar questions were tagged with [list of tags that might be of interest]. where [list of tags that might be of interest] are frequently occuring tags that aren’t in my orginal list.
Code examples in c# if possible 🙂
I don’t know of any specific algorithms or data structures, but I could suggest a basic way of handling this:
Assumption: each entry has five unique tags.
In (sloppy) pseudo code, use two loops (if possible):
This should produce a sparse array of concatenated tag names (ok, I didn’t include a separator, but hey it’s pseudo code :-). Keep the highest number, then iterate backward for the best suggestions.
(Cache to optimize, but watch for updates)
Paul.