As you can see, all the waits are on obj# 0, but there is never any object with an object_id of 0.
This seemed rather strange, as it suggested that the sync source was the most geographically distant replica set member.
However, two days later, whilst still eating my breakfast, I was asked to join a conference call about a bigger problem.
As there was a chap from the product looking over my shoulder at the time and I had confidently told him that the problem was …
The situation with the second replica set member was a little more interesting …
However, the number of child cursors built back up rather quickly …
The errors in the alert log were far more frequent than the database hangs, so clearly there was another factor involved …
I therefore felt a bit sheepish that I had been investigating what was apparently not the real problem …
What was happening was that, when parallelism is used, the optimiser can decide to use a high level of dynamic sampling.
There were all kinds of errors in the log, but the one of most concern was …