I recently had a replica set member fall a few days out of sync. Using the “Resyncing a Very Stale Replica Set Member” instructions, I stopped mongod on the secondary machine, wiped out the data directories, restarted the process, and let the machine re-sync to the primary.
Everything appeared to work. The logs suggested the sync went fine, and it eventually reported as complete, with the following rs.status() output on the secondary machine:
# The secondary machine's status for itself and its primary:
{
    "_id" : 0,
    "name" : "myprimary:myport",
    "health" : 1,
    "state" : 1,
    "stateStr" : "PRIMARY",
    "uptime" : 497,
    "optime" : {
        "t" : 1347562257000,
        "i" : 1
    },
    "optimeDate" : ISODate("2012-09-13T18:50:57Z"),
    "lastHeartbeat" : ISODate("2012-09-13T19:00:34Z"),
    "pingMs" : 3
},
{
    "_id" : 2,
    "name" : "mysecondary:myport",
    "health" : 1,
    "state" : 2,
    "stateStr" : "SECONDARY",
    "optime" : {
        "t" : 1347562257000,
        "i" : 1
    },
    "optimeDate" : ISODate("2012-09-13T18:50:57Z"),
    "self" : true
}
As expected, the secondary sees both machines in sync, sharing an optime value. The primary machine is a different story: it still reports the secondary as out of sync, with an optime from 2012-09-08 and a lastHeartbeat from 2012-09-11, even though the primary's own optime has advanced since the re-sync completed.
# The primary machine's status for itself and its secondary:
{
    "_id" : 0,
    "name" : "myprimary:myport",
    "health" : 1,
    "state" : 1,
    "stateStr" : "PRIMARY",
    "uptime" : 497,
    "optime" : {
        "t" : 1347562257000,
        "i" : 1
    },
    "optimeDate" : ISODate("2012-09-13T18:50:57Z"),
    "self" : true
},
{
    "_id" : 2,
    "name" : "mysecondary:myport",
    "health" : 1,
    "state" : 2,
    "stateStr" : "SECONDARY",
    "optime" : {
        "t" : 1347103757000,
        "i" : 1
    },
    "optimeDate" : ISODate("2012-09-08T11:29:17Z"),
    "lastHeartbeat" : ISODate("2012-09-11T17:27:06Z"),
    "pingMs" : 3
}
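To put a number on the gap in the primary's view, here is a minimal sketch that compares the optime values from an rs.status()-style members array. The function name optimeLagSeconds is made up for illustration, the "t" values are assumed to be milliseconds (as they appear in the output above), and the sample data is copied from that output. In a live shell the input would presumably be rs.status().members.

```javascript
// Sketch: how far each member's optime trails the newest optime in one
// rs.status()-style members array. Assumes optime.t is in milliseconds.
function optimeLagSeconds(members) {
  var newest = Math.max.apply(null, members.map(function (m) {
    return m.optime.t;
  }));
  return members.map(function (m) {
    return { name: m.name, lagSeconds: (newest - m.optime.t) / 1000 };
  });
}

// Values copied from the primary's view of the replica set.
var primaryView = [
  { name: "myprimary:myport", optime: { t: 1347562257000, i: 1 } },
  { name: "mysecondary:myport", optime: { t: 1347103757000, i: 1 } }
];

var lags = optimeLagSeconds(primaryView);
// lags[1].lagSeconds is 458500 seconds, roughly 5.3 days.
```

Running the same function on the secondary's view gives a lag of 0 for both members, which is the discrepancy in question.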
What am I missing? At first I thought I should just wait it out, but it’s been nearly an hour and the database has taken inserts in that time. Can I force the primary to heartbeat-check the secondary, or do I need to re-sync them again?
The only real oddity I can find on the primary is this:
PRIMARY> use local;
PRIMARY> db.slaves.find()
{ "_id" : ObjectId("4f675b909d8e143a90055864"), "host" : "<hostIP>", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1347395837000, "i" : 1 } }
{ "_id" : ObjectId("50522761212b77e9637ad541"), "host" : "<hostIP>", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1347562257000, "i" : 1 } }
Both entries are for the same host (the secondary machine in question). My understanding is that this should show only one entry, but I’m hesitant to touch it without a better understanding of what it tracks and how it updates.
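For anyone checking their own set, a small sketch of how the duplicate could be spotted programmatically: group the local.slaves documents by host and list any host that appears more than once. The helper names are hypothetical, and the sample documents are the two entries above with the _id fields omitted. In a live shell the input could come from db.getSiblingDB("local").slaves.find().toArray().

```javascript
// Sketch: group local.slaves-style documents by host, collecting the
// syncedTo timestamps recorded for each host.
function groupSlavesByHost(entries) {
  var byHost = {};
  entries.forEach(function (e) {
    if (!byHost[e.host]) {
      byHost[e.host] = [];
    }
    byHost[e.host].push(e.syncedTo.t);
  });
  return byHost;
}

// Hosts that have more than one tracking entry.
function duplicateSlaveHosts(entries) {
  var byHost = groupSlavesByHost(entries);
  return Object.keys(byHost).filter(function (host) {
    return byHost[host].length > 1;
  });
}

// The two entries from db.slaves.find() on the primary.
var slaveEntries = [
  { host: "<hostIP>", ns: "local.oplog.rs", syncedTo: { t: 1347395837000, i: 1 } },
  { host: "<hostIP>", ns: "local.oplog.rs", syncedTo: { t: 1347562257000, i: 1 } }
];

var duplicates = duplicateSlaveHosts(slaveEntries);
// duplicates contains "<hostIP>", with one stale and one current syncedTo.
```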
It might be worth bringing down the secondary, deleting both entries from the primary’s local.slaves collection (db.slaves after use local), and then restarting the secondary.
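A rough sketch of the deletion step, assuming the legacy mongo shell collection methods count() and remove(). The helper clearSlaveEntries is hypothetical, not part of MongoDB, and it only touches entries for the one host rather than the whole collection. It should be run on the primary, with the secondary's mongod already stopped.

```javascript
// Hypothetical helper: remove the local.slaves entries for a single host
// and report the counts before and after. `localDb` is expected to be the
// primary's "local" database handle.
function clearSlaveEntries(localDb, host) {
  var query = { host: host };
  var before = localDb.slaves.count(query);
  localDb.slaves.remove(query);
  var after = localDb.slaves.count(query);
  return { before: before, after: after };
}

// Intended use in the mongo shell on the primary (not executed here):
//   clearSlaveEntries(db.getSiblingDB("local"), "<hostIP>")
// Then restart the secondary and check rs.status() on the primary again.
```

If the result shows before as 2 and after as 0, both stale entries are gone. Whether the primary then reports the correct optime after the secondary restarts is something to verify rather than assume.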
Do the data files corroborate that the machines are in sync?