I’m experimenting with the new aggregation framework (AF) to migrate away from map/reduce. I have millions of objects like this:
{
  _id: ObjectID,
  owner: 1,
  tags: [
    {text: "dog", score: 5},
    {text: "cat", score: 3},
    {text: "hamster", score: 1}]
}
{
  _id: ObjectID,
  owner: 2,
  tags: [
    {text: "cat", score: 8},
    {text: "fish", score: 4}]
}
I want to produce a report with a count of all matches of “cat” and “fish” where the owner is X.
So far, assuming the input tags ["cat", "fish"], my pipeline looks like this:
{
$match: { owner: X, $in: {"tags.text": ["cat", "fish"]}}
}, {
$project: {text: "$tags.text"},
}, {
$unwind: "$text",
}, {
$match: {"text": {$in: {"tags": ["cat", "fish"]}}
}, {
$group: {"_id": "$text", "total: {"$sum": 1}}
}
The first $match is just there to narrow things down to a subset of these millions of objects, since I have an index on owner and "tags.text".
This pipeline works fine for small numbers of tags, but I need to be able to pass in 100-1000 tags and get a quick result. It seems to me that it must be inefficient to project out and unwind all the tags, only to filter away 90% of them in the next $match step.
Is there a more efficient way? Maybe reorder the pipeline steps?
This looks good to me, except for some typos and the usage of the $in operator in each $match pipeline operation: $in should probably sit inside the condition on the field, as in {"tags.text": {$in: ["cat", "fish"]}} for the first $match and {"text": {$in: ["cat", "fish"]}} for the second.
In essence, you want to use $match as early in the pipeline as possible to limit the number of documents being processed later in the pipeline. The match on owner and specific tags accomplishes this. You also need to make sure your $match, the equivalent of a .find(), uses the appropriate indexes.
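For reference, here is a minimal sketch of how the whole pipeline might look once the $in usage and the quoting typos are fixed. It is written as plain JavaScript so the same array can be passed to aggregate() from the shell or a driver. The helper name buildTagCountPipeline and the collection name items are hypothetical, not something from the question.

```javascript
// Sketch only: builds the aggregation pipeline for a given owner and tag list.
// buildTagCountPipeline is a hypothetical helper name used for illustration.
function buildTagCountPipeline(owner, tags) {
  return [
    // Narrow down early so the owner / "tags.text" index can be used.
    { $match: { owner: owner, "tags.text": { $in: tags } } },
    // Keep only the tag texts (an array of strings per document).
    { $project: { text: "$tags.text" } },
    // One output document per tag text.
    { $unwind: "$text" },
    // Drop the tags that were not asked for.
    { $match: { text: { $in: tags } } },
    // Count occurrences of each remaining tag.
    { $group: { _id: "$text", total: { $sum: 1 } } }
  ];
}

// Possible usage, assuming a collection named "items" (hypothetical):
// db.items.aggregate(buildTagCountPipeline(1, ["cat", "fish"]));
```

The second $match is still needed because the first one only selects whole documents; after $unwind, each matching document still contributes all of its tags.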