I am having trouble understanding this issue: I have a sharded cluster in which one of the shards (Shard 2) seems to use the wrong index. I'm querying by the shard key, which is site id and first request time, { "site.id": 1, "frt": 1 }. I also have an index on site id and last request time.
In this query, I am also trying to limit returned documents by a couple booleans I have set in the document.
Given how the docs describe Mongo's Query Optimizer, the returned explains seem especially odd to me. Docs here: Query Optimizer
I also included an explain from Shard 1 where the query returns as expected. Lastly, if I use a site id which does not have chunks stored on Shard 2, it uses the correct index, though it has nothing to scan nor return. Added explain for this to the end for completeness.
Any ideas why this would happen and/or if this is a bug?
Basic query (bad index):
shard2:PRIMARY> db.visit.find({ "site.id": 128, "frt": { $gte: new Date(2012, 8, 24 ) }, "ue": false, "bot": false }).explain()
{
"cursor" : "BtreeCursor site.id_1_lrt_-1",
"isMultiKey" : false,
"n" : 198,
"nscannedObjects" : 61204,
"nscanned" : 61204,
"nscannedObjectsAllPlans" : 61537,
"nscannedAllPlans" : 61537,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 122,
"nChunkSkips" : 0,
"millis" : 727,
"indexBounds" : {
"site.id" : [
[
128,
128
]
],
"lrt" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
]
},
"server" : "ip-10-4-211-107:2200"
}
Supplying a Hint:
shard2:PRIMARY> db.visit.find({ "site.id": 128, "frt": { $gte: new Date(2012, 8, 24 ) }, "ue": false, "bot": false }).hint("site.id_1_frt_1").explain()
{
"cursor" : "BtreeCursor site.id_1_frt_1",
"isMultiKey" : false,
"n" : 198,
"nscannedObjects" : 486,
"nscanned" : 486,
"nscannedObjectsAllPlans" : 486,
"nscannedAllPlans" : 486,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 5,
"indexBounds" : {
"site.id" : [
[
128,
128
]
],
"frt" : [
[
ISODate("2012-09-24T07:00:00Z"),
ISODate("292278995-01--2147483647T07:12:56.808Z")
]
]
},
"server" : "ip-10-4-211-107:2200"
}
Same query WITHOUT additional boolean constraints (uses correct Index):
shard2:PRIMARY> db.visit.find({ "site.id": 128, "frt": { $gte: new Date(2012, 8, 24 ) } }).explain()
{
"cursor" : "BtreeCursor site.id_1_frt_1",
"isMultiKey" : false,
"n" : 486,
"nscannedObjects" : 486,
"nscanned" : 486,
"nscannedObjectsAllPlans" : 486,
"nscannedAllPlans" : 486,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 1,
"indexBounds" : {
"site.id" : [
[
128,
128
]
],
"frt" : [
[
ISODate("2012-09-24T07:00:00Z"),
ISODate("292278995-01--2147483647T07:12:56.808Z")
]
]
},
"server" : "ip-10-4-211-107:2200"
}
On Shard 1, Original Query uses expected index:
shard1:PRIMARY> db.visit.find({ "site.id": 253, "frt": { $gte: new Date(2012, 8, 24 ) }, "ue": false, "bot": false }).explain()
{
"cursor" : "BtreeCursor site.id_1_frt_1",
"isMultiKey" : false,
"n" : 15615,
"nscannedObjects" : 15950,
"nscanned" : 15950,
"nscannedObjectsAllPlans" : 16152,
"nscannedAllPlans" : 16152,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 125,
"nChunkSkips" : 0,
"millis" : 237,
"indexBounds" : {
"site.id" : [
[
253,
253
]
],
"frt" : [
[
ISODate("2012-09-24T07:00:00Z"),
ISODate("292278995-01--2147483647T07:12:56.808Z")
]
]
},
"server" : "ip-10-6-50-253:2100"
}
Query on Shard 2 for a site with no chunks here (uses correct index):
shard2:PRIMARY> db.visit.find({ "site.id": 253, "frt": { $gte: new Date(2012, 8, 24 ) }, "ue": false, "bot": false }).explain()
{
"cursor" : "BtreeCursor site.id_1_frt_1",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 0,
"nscanned" : 0,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 0,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"site.id" : [
[
253,
253
]
],
"frt" : [
[
ISODate("2012-09-24T07:00:00Z"),
ISODate("292278995-01--2147483647T07:12:56.808Z")
]
]
},
"server" : "ip-10-4-211-107:2200"
}
A couple of things from the docs you linked might explain this behavior. First:
So, if you don’t have enough volume of queries for it to be evaluated, it will stick with its first choice.
Second:
If the other index is already in memory (say, because another query is using it), or if something else slows down execution on the preferred index, or if the two are very close and occasionally swap in terms of speed, then the "bad" index will be chosen again.
The optimizer has been tweaked and improved in 2.2, so that may be worth a look if you continue to have problems (and are on 2.0 or below). Or, as you have already done in your testing, if you know the best index to use, just remove all doubt and use hint to specify it.
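To put numbers on how much the hint helps here, below is a small sketch in plain JavaScript (runnable in Node, no database needed) that compares the two Shard 2 explains from the question. The figures are copied from those outputs. The helper names scanRatio and mostSelective are made up for illustration and are not part of MongoDB.

```javascript
// Figures copied from the explain() outputs in the question (Shard 2, site.id 128).
const plans = [
  { index: "site.id_1_lrt_-1", n: 198, nscanned: 61204, millis: 727 },
  { index: "site.id_1_frt_1", n: 198, nscanned: 486, millis: 5 },
];

// Hypothetical helper: index entries scanned per document returned.
// The closer this is to 1, the more of the filtering the index bounds are doing.
// Math.max guards against dividing by zero when a plan returns nothing.
function scanRatio(plan) {
  return plan.nscanned / Math.max(plan.n, 1);
}

// Hypothetical helper: pick the plan with the lowest scan ratio.
function mostSelective(candidates) {
  return candidates.reduce((best, p) => (scanRatio(p) < scanRatio(best) ? p : best));
}

for (const p of plans) {
  console.log(`${p.index}: ${scanRatio(p).toFixed(1)} scanned per result, ${p.millis} ms`);
}
console.log("most selective:", mostSelective(plans).index);
```

With the lrt index, the only usable bound is on site.id, so every entry for the site is scanned and the frt, ue and bot conditions are applied afterwards. With the frt index, the date range is part of the index bounds, which is why the scan count drops so sharply.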