
High replication lag

Symptom

  • High replica lag on one replica set member after probable network issue.

Cause

  • Chained Replication.

Solutions

  • Use rs.syncFrom to change the replication source.
  • Disable Chained Replication.

Investigation

This is a rather short investigation, but it does highlight an interesting property of MongoDB that I was unaware of. We had received two incidents related to replication lag on production, and I was asked to look into them as we had not had any such alerts before, or at least none that I can recall. I took a look at the Ops Manager graphs and they did indeed show significant spikes in replication lag over the weekend, although the lag was back to normal by the time I looked.
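
As an aside, the lag can also be checked directly from the mongo shell rather than Ops Manager. A minimal sketch, assuming a shell connected to any member of the replica set; the helper is named rs.printSlaveReplicationInfo() on older releases and rs.printSecondaryReplicationInfo() on 4.4 and later:

// Print, for each secondary, its last synced optime and how far it is
// behind the primary. Use whichever helper your shell version provides.
rs.printSlaveReplicationInfo()
// rs.printSecondaryReplicationInfo()   // same helper, 4.4 and later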

The replication lag at its worst reached almost 1.5 hours for one member of the replica set. This was very strange given that the primary was relatively close geographically. I took a look at the logs, but there was nothing from around the time of the big spike. However, there were various messages of the following form sprinkled throughout the logs:

<date time> I REPL [replication-215] Scheduled new oplog
query Fetcher source: <far away replica set member>:<mongod port> database: local
query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp <timestamp>|63 }
}, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000,
term: 45 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm: {
$secondaryOk: true } } active: 1 timeout: 65000ms shutting down?: 0 first: 1
firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand
926199 -- target:<far away replica set member>:<mongod port> db:local cmd:{ find:
"oplog.rs", filter: { ts: { $gte: Timestamp <timestamp>|63 } }, tailable:
true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 45 }
active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1
retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms

This seemed rather strange, as it suggested that the sync source was the most geographically distant replica set member (Fetcher source: <far away replica set member>:<mongod port>). These messages went back all the way to two days earlier, and around that time there were a few other messages of interest:

<timestamp> I REPL [rsBackgroundSync] sync source
candidate: <close primary>:<mongod port>
<timestamp> I REPL [rsBackgroundSync] Replication
producer stopped after oplog fetcher finished returning a batch from our sync
source. Abandoning this batch of oplog entries and re-evaluating our sync
source.
<timestamp> I REPL [SyncSourceFeedback]
SyncSourceFeedback error sending update to
<close primary>:<mongod port>: InvalidSyncSource: Sync source was
cleared. Was <close primary>:<mongod port>
<timestamp> I REPL [rsBackgroundSync] sync source
candidate: <far away replica set member>:<mongod port>

This suggests that there was a problem with the geographically close primary and that the lagging secondary decided to start syncing from the geographically far secondary instead. The logs of the other replica set members confirmed that there was an election at that time, probably caused by network problems, and that the geographically far secondary briefly became the primary. But why did the lagging member not go back to syncing from the geographically close member once it became primary again? To validate that the log messages meant what I thought they meant, I took a look at the status of the replica set using rs.status():

{  
   "set":"<replica set>",
   "date":ISODate(<date>),
   "myState":1,
   "term":NumberLong(45),
   "heartbeatIntervalMillis":NumberLong(2000),
   "optimes":{  
      "lastCommittedOpTime":{  
         "ts":Timestamp(<timestamp>),
         "t":NumberLong(45)
      },
      "appliedOpTime":{  
         "ts":Timestamp(<timestamp>),
         "t":NumberLong(45)
      },
      "durableOpTime":{  
         "ts":Timestamp(<timestamp>),
         "t":NumberLong(45)
      }
   },
   "members":[  
      {  
         "_id":0,
         "name":"<replica set member with lag>",
         "health":1,
         "state":2,
         "stateStr":"SECONDARY",
         "uptime":90045,
         "optime":{  
            "ts":Timestamp(<timestamp>),
            "t":NumberLong(45)
         },
         "optimeDurable":{  
            "ts":Timestamp(<timestamp>),
            "t":NumberLong(45)
         },
         "optimeDate":ISODate(<date>),
         "optimeDurableDate":ISODate("<date>"),
         "lastHeartbeat":ISODate(<date>),
         "lastHeartbeatRecv":ISODate(<date>),
         "pingMs":NumberLong(8),
         "syncingTo":"<geographically far replica set member>", <------------------ 
         "configVersion":7
      },
      {  
         "_id":1,
         "name":"<geographically close replica set member>",
         "health":1,
         "state":1,
         "stateStr":"PRIMARY",
         "uptime":772743,
         "optime":{  
            "ts":Timestamp(<timestamp>),
            "t":NumberLong(45)
         },
         "optimeDate":ISODate(<date>),
         "electionTime":Timestamp(<timestamp>),
         "electionDate":ISODate(<date>),
         "configVersion":7,
         "self":true
      },
      {  
         "_id":2,
         "name":"<geographically far replica set member>"
         "health":1,
         "state":2,
         "stateStr":"SECONDARY",
         "uptime":90045,
         "optime":{  
            "ts":Timestamp(<timestamp>),
            "t":NumberLong(45)
         },
         "optimeDurable":{  
            "ts":Timestamp(<timestamp>),
            "t":NumberLong(45)
         },
         "optimeDate":ISODate(<date>),
         "optimeDurableDate":ISODate(<date>),
         "lastHeartbeat":ISODate(<date>),
         "lastHeartbeatRecv":ISODate(<date>),
         "pingMs":NumberLong(72),
         "syncingTo":"<geographically close replica set member>",
         "configVersion":7
      }
   ],
   "ok":1
}

The replica set member was indeed syncing from (or "syncingTo", as the document puts it) the geographically far replica set member, which was in turn syncing from the geographically close primary. My understanding had been that this should never happen and that replica set members should always sync from the primary. I therefore logged a ticket with MongoDB support. They quickly came back and said that this was "Chained Replication" (https://docs.mongodb.com/manual/tutorial/manage-chained-replication/index.html) and that it is enabled by default. The setting can be seen in the "settings" embedded document of the rs.conf() output:

{  
   "_id":"<replica set>",
   "version":7,
   "protocolVersion":NumberLong(1),
   "members":[
   ...
   ],
   "settings":{  
      "chainingAllowed":true,<--------------------- 
      "heartbeatIntervalMillis":2000,
      "heartbeatTimeoutSecs":10,
      "electionTimeoutMillis":10000,
      "catchUpTimeoutMillis":2000,
      "getLastErrorModes":{
      },
      "getLastErrorDefaults":{  
         "w":1,
         "wtimeout":0
      },
      "replicaSetId":ObjectId("598d8204858fcdc7d38ddeec")
   }
}
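
As a quick check, assuming the settings embedded document exists (see the note at the end of this post), the flag can be read directly:

// true means secondaries are allowed to sync from other secondaries (the default)
rs.conf().settings.chainingAllowed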

However, why did the lagging member continue to sync from the geographically far member even after the geographically close member had been elected primary again?

That is harder to tell. The most likely reason, and the only one the documentation offers, is that at the moment the sync source was chosen the ping time to the geographically far member was better than the ping time to the geographically close member. There are a few other criteria a member has to meet to be a sync source candidate, but if a brief network issue at the time of the election made the ping times suggest that the geographically far member was the best source, then it would have been chosen. MongoDB support did state the criteria under which the sync source is re-evaluated, and they are:

  • The mongod instance restarts.
  • The connection between the mongod and the sync target closes.
  • The sync target falls more than 30 seconds behind another member of the replica set.

If none of these criteria are met, as must have been the case, then the sync source would not be re-evaluated. To resolve the situation, the sync source can be set manually as shown below:

rs.syncFrom("<geographically close replica set member>:<mongod port>");

{
   "syncFromRequested" : "<geographically close replica set member>:<mongod port>",
   "prevSyncTarget" : "<geographically far replica set member>:<mongod port>",
   "ok" : 1
}
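
While waiting for the change to take effect, the current sync source of each member can be watched using the same syncingTo field shown in the rs.status() output above. A small sketch (on newer MongoDB versions the field is named syncSourceHost rather than syncingTo):

// Print each member's current sync source, skipping the primary, which has none
rs.status().members.forEach(function (m) {
   if (m.syncingTo) {
      print(m.name + " is syncing from " + m.syncingTo);
   }
});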

The value reported by rs.status() remained the same for a few minutes, but then changed to the newly specified source. Since then, no replication lag alerts have occurred. That of course only resolves the immediate problem. To stop chained replication from happening again, the replica set configuration needs to be changed:

cfg = rs.config()                     // fetch the current replica set configuration
cfg.settings.chainingAllowed = false  // stop secondaries syncing from other secondaries
rs.reconfig(cfg)                      // apply the new configuration

Do note that the documentation warns about the situation where the settings embedded document does not exist. In that case you should create it first (cfg.settings = {}) and then set the value. Creating it unconditionally would overwrite the existing document if it does exist, hence the warning.
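
Putting the two together, a more defensive version of the reconfiguration, which only creates the settings document when it is genuinely missing, would look something like this:

cfg = rs.config()
// Only create the settings document if it is missing; creating it
// unconditionally would wipe out any existing settings values
if (!cfg.settings) {
   cfg.settings = {}
}
cfg.settings.chainingAllowed = false
// Run against the primary; rs.reconfig() can in some situations
// cause the primary to step down
rs.reconfig(cfg)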