I have a database of documents which are tagged with keywords. I am trying to find (and then count) the unique tags which are used alongside each other. So for any given tag, I want to know what tags have been used alongside that tag.
For example, if I had one document which had the tags [fruit, apple, plant] then when I query [apple] I should get [fruit, plant]. If another document has tags [apple, banana] then my query for [apple] would give me [fruit, plant, banana] instead.
This is my map function which emits all the tags and their neighbours:
function(doc) {
if(doc.tags) {
doc.tags.forEach(function(tag1) {
doc.tags.forEach(function(tag2) {
emit(tag1, tag2);
});
});
}
}
So in my example above, it would emit
apple -- fruit
apple -- plant
apple -- banana
fruit -- apple
fruit -- plant
...
My question is: what should my reduce function be? The reduce function should essentially filter out the duplicates and group them all together.
I have tried a number of different attempts, but my database server (CouchDB) keeps giving me a Error: reduce_overflow_error. Reduce output must shrink more rapidly.
EDIT: I’ve found something that seems to work, but I’m not sure why. I see there is an optional “rereduce” parameter to the reduce function call. If I ignore these special cases, then it stops throwing reduce_overflow_errors. Can anyone explain why? And also, should I just be ignoring these, or will this bite me in the ass later?
function(keys, values, rereduce) {
if(rereduce) return null; // Throws error without this.
var a = [];
values.forEach(function(tag) {
if(a.indexOf(tag) < 0) a.push(tag);
});
return a;
}
Your answer is nice, and as I said in the comments, if it works for you, that’s all you should care about. Here is an alternative implementation in case you ever bump into performance problems.
CouchDB likes tall lists, not fat lists. Instead of view rows keeping an array with every previous tag ever seen, this solution keeps the “sibling” tags in the key of the view rows, and then group them together to guarantee one unique sibling tag per row. Every row is just two tags, but there could be thousands or millions of rows: a tall list, which CouchDB prefers.
The main idea is to emit a 2-array of tag pairs. Suppose we have one doc, tagged
fruit, apple, plant.Next I tag something
apple, banana.Why is the value always
1? Because I can make a very simple built-in reduce function:_sumto tell me the count of all tag pairs. Next, query with?group_level=2and CouchDB will give you unique pairs, with a count of their total.A map function to produce this kind of view might look like this: