We ran into a problem with our production system yesterday that I
am unable to explain from the official documentation.
The Setup:
- MongoDB 2.0.1
- Replica set spanning 5 servers, with one preferred primary
- PHP application using version 1.2.6 of the PHP driver
- One collection with roughly 3 million entries
- SlaveOkay set to true for every connection
- The connection string for MongoDB includes all five servers from the replica set
The Problem:
Yesterday one of the secondaries suddenly died (hardware crash) and
became totally unavailable. From that point onwards, many read
operations issued through the PHP driver took over 30 seconds to
complete (previously they rarely took more than 0.1 seconds).
- rs.status() was clearly reporting that the failed secondary was
NOK and not accessible.
- The same queries, sent directly via the console to the primary or
any secondary, were processed in under 0.1 seconds (as expected).
At first I removed the failed secondary from the connection string
in PHP, but this made no difference to the overall performance.
Only when I explicitly ran rs.remove(hostname_port) to remove it
from the replica set did performance return to normal.
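For reference, the check and removal described above can be sketched roughly as follows. This is only an illustration, not the exact commands I ran: the helper name unreachableMembers, the set name and the hostnames are made up, and it assumes each member document returned by rs.status() carries a "name" field and a "health" field that is 0 when the member cannot be reached.

```javascript
// Sketch: list the replica set members that a status document reports as unreachable.
// Assumption: every member has "name" and "health" fields, with health 0 = unreachable.
function unreachableMembers(status) {
  return status.members
    .filter(function (m) { return m.health === 0; })
    .map(function (m) { return m.name; });
}

// Hypothetical sample shaped like rs.status() output; all hostnames are placeholders.
var sampleStatus = {
  set: "rs0",
  members: [
    { name: "db1.example.com:27017", health: 1, stateStr: "PRIMARY" },
    { name: "db2.example.com:27017", health: 1, stateStr: "SECONDARY" },
    { name: "db3.example.com:27017", health: 1, stateStr: "SECONDARY" },
    { name: "db4.example.com:27017", health: 1, stateStr: "SECONDARY" },
    { name: "db5.example.com:27017", health: 0, stateStr: "(not reachable/healthy)" }
  ]
};

// Names of the members the sample reports as down.
var down = unreachableMembers(sampleStatus);

// In the mongo shell, connected to the primary, the same helper could be fed
// the live status, and each reported host removed by hand after checking it:
//   unreachableMembers(rs.status())
//   rs.remove("db5.example.com:27017")
```

rs.remove() changes the replica set configuration, so the list is worth checking by eye before removing anything.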
I suppose this is not expected behavior? Can we shield ourselves from
anything like this in the future?
There’s at least one significant bug related to replica sets in version 1.2.6 of the PHP driver. Are you able to upgrade the driver to a newer version?