I have a Mongo document that looks like the one below. I want to be able to query on any two of the attributes and then order by a third.
Document:
{
"tags" => ["ads", "shopping", "web20", "newspaper", "others..."],
"reachable_via" => ["email", "twitter", "facebook", "contact_form", "phone"],
"keywords" => ["keyword1", "keyword2", "keyword3"],
"score" => 4, # scalar from 0 to 10
"read_in_project_ids" => [124, 433, 556]
}
Example query, using Mongoid syntax:
Document.any_in(:keywords => ["keyword1", "keyword2"]).where(:tags.in => ["ads", "shopping"], :reachable_via.in => ["email"]).order_by([:score, :desc]).limit(10)
This query works, but it doesn’t use indexes. I’ve also tried restructuring the data three different ways to make it work, without any luck.
Right now, I have 3.8 million documents, and this query can take 45-60 seconds to return.
So, how should I restructure the data to keep the flexibility of a set of array fields while gaining the benefits of indexing?
FYI, keywords could number in the hundreds (and are added by users), but the values for tags and reachable_via are controlled by the application’s code: reachable_via has 7 options and tags has about 20, and both lists will grow.
Thanks!
The problem is the $in combined with the sort.
If you can remove one or the other, it would speed up your query significantly.
Since a single compound index can’t include more than one array-valued key (multikeys, as they call them), you want to pick the most granular array from your query to index. In your example query, that would likely be keywords.
So, to make your query a bit faster, put an index on {keywords: 1, score: -1}. The query will scan the keywords index, filter on the remaining requirements for tags and reachable_via, then sort by score descending. I tested this with a collection of 5 million documents similar to yours, and it used the index on the values that actually did a good job of filtering.
Here’s an example query from the mongo shell (sorry, I’m not a mongoid expert):
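A minimal sketch of what the index and query described above could look like. The collection name `documents` is an assumption (it is not given in the question), and the guard on `db` is only there so the snippet also loads outside the mongo shell.

```javascript
// Index suggested above: the most granular array field first, then the sort field.
const indexSpec = { keywords: 1, score: -1 };

// Shell equivalent of the Mongoid query from the question.
const filter = {
  keywords: { $in: ["keyword1", "keyword2"] },
  tags: { $in: ["ads", "shopping"] },
  reachable_via: { $in: ["email"] }
};
const sortSpec = { score: -1 };

// `db` only exists inside the mongo shell; `documents` is an assumed collection name.
if (typeof db !== "undefined") {
  db.documents.createIndex(indexSpec);
  // explain() reports which index was chosen and how many entries were scanned.
  printjson(db.documents.find(filter).sort(sortSpec).limit(10).explain());
}
```

Checking the explain() output is the quickest way to confirm whether the keywords index is actually being used for a given combination of values.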
If you can change your query to match on only one keyword, it uses the index much more efficiently: getting the top 10 scores for a particular keyword took 0ms.
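As a rough sketch of the single-keyword variant, assuming the same hypothetical `documents` collection and an index on {keywords: 1, score: -1}: the only change is that keywords is matched by equality instead of $in.

```javascript
// Equality on one array element instead of $in over several keywords.
const singleKeywordFilter = {
  keywords: "keyword2",
  tags: { $in: ["ads", "shopping"] },
  reachable_via: { $in: ["email"] }
};

// `db` only exists inside the mongo shell; `documents` is an assumed collection name.
if (typeof db !== "undefined") {
  printjson(
    db.documents.find(singleKeywordFilter).sort({ score: -1 }).limit(10).explain()
  );
}
```

With a single keyword, the index entries for that keyword are already stored in score order, which is why the limit of 10 can be satisfied without a separate sort step.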
Here’s another example. I moved the score out of the sort, and into the query (querying on an exact score, without a limit). This does a good job of speeding up the query, if you’re only looking for the top score, or something like that.
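A sketch of that second variant, under the same assumptions (hypothetical `documents` collection, index on {keywords: 1, score: -1}). The value 10 is used as the exact score only because the question describes score as a scalar from 0 to 10.

```javascript
// Score moves from the sort into the filter; there is no sort and no limit.
const topScoreFilter = {
  keywords: { $in: ["keyword1", "keyword2"] },
  score: 10, // exact match on the highest possible score
  tags: { $in: ["ads", "shopping"] },
  reachable_via: { $in: ["email"] }
};

// `db` only exists inside the mongo shell; `documents` is an assumed collection name.
if (typeof db !== "undefined") {
  printjson(db.documents.find(topScoreFilter).explain());
}
```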
Rinse and repeat for other query combinations: pick the most granular array field in the query and index it along with the sorting field. If you can limit the query to not use $in on the indexed array, that’s ideal.
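The rule of thumb above can be sketched as a small helper. Everything here is illustrative: `indexFor` is a hypothetical function, and the distinct-value counts are rough placeholders based on the question (about 7 options for reachable_via, about 20 for tags, and an assumed 300 for keywords).

```javascript
// Rough number of distinct values per array field (placeholder figures).
const distinctValues = { keywords: 300, tags: 20, reachable_via: 7 };

// Hypothetical helper: pick the queried array field with the most distinct
// values and pair it with the sort field, descending.
function indexFor(queriedFields, sortField) {
  const best = queriedFields.reduce((a, b) =>
    distinctValues[b] > distinctValues[a] ? b : a
  );
  // Key order matters in a compound index: array field first, then sort field.
  return { [best]: 1, [sortField]: -1 };
}

// A query on tags and reachable_via, sorted by score.
const suggested = indexFor(["tags", "reachable_via"], "score");
console.log(JSON.stringify(suggested));
```

In the mongo shell, the resulting object could be passed straight to createIndex on the collection in question.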
My test script is located here:
https://gist.github.com/2091880
The test script has a few weaknesses. For example, almost every document contains keyword1, so even though keyword1 is indexed, a collection scan turns out to be faster for queries on it. I was just a little lazy about randomizing the selection of keywords, but in real life that wouldn’t be a problem.