Cluster unresponsive

Symptom

  • Cluster unresponsive to most queries via MongoS.
  • Queries on a single replica set primary hang.

Cause

  • Unknown.

Solution

  • Kill -9 the hanging replica set primary and restart.

Investigation

On the previous Friday I had forgotten to send out an update on the status of an earlier change to our QA cluster, so on the Monday morning I was keen to send one out as early as possible. I took a look at the cluster and the result of the change was fine; however, one of the hosts on which I was due to perform a subsequent change that morning was down and had been for 14 hours. I noted this in my update and, as soon as a sysadmin was in, got them to bring the host back up. With this done, everything was sweetness and light once more and I was ready for the morning’s change. At the go/no-go meeting we decided to give the sysadmins a chance to investigate why the host had crashed before going ahead with the change.

Shortly after that I was called by the product team and discovered, after an unfortunately timed offhand quip, that the entire cluster was unresponsive. This was the first I had heard of it, and there had been no alerts for anything other than the host being down. I was given various pieces of information about connections timing out, logins receiving authentication failures and commands hanging, so I set off to investigate the connectivity problems.

After doing some digging I was somewhat confused as to what was going on, so I consulted a colleague. What was confusing me (and I am no expert MongoDB admin) was that if you connect directly to a replica set member, the list of users is a subset of the one you get by querying via a MongoS:

MongoS:

MongoDB Enterprise mongos> use admin;
switched to db admin
MongoDB Enterprise mongos> db.system.users.find().count();
184

Replica set member:

MongoDB Enterprise <replica set>:PRIMARY> use admin;
switched to db admin
MongoDB Enterprise <replica set>:PRIMARY> db.system.users.find().count();
6

My confusion was not helped by the fact that firewalls caused connections to hang if you tried to contact, for instance, Dev from a QA server. Once my confusion was cleared up I actually started to investigate the real problem.
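
A quick way to tell a firewall silently dropping packets apart from a genuinely unresponsive MongoD is a connection attempt with a bounded timeout. A minimal sketch, assuming the default port 27017 and a hypothetical host name:

# Bounded TCP probe so a silently dropped packet cannot hang the shell
# (the host name is hypothetical; /dev/tcp is a bash feature).
timeout 5 bash -c '</dev/tcp/qa-shard01.example.com/27017' && echo "port reachable" || echo "no response"

# The same idea for a real connection attempt with the mongo shell.
timeout 10 mongo --host qa-shard01.example.com --port 27017 --eval 'db.runCommand({ping: 1})'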

As mentioned previously, commands were hanging, and indeed when I tried “show dbs” while connected to a MongoS the command hung. I didn’t know exactly what this command did under the covers, so I tailed (tail -f) the MongoS log whilst typing the command. The first time I did this it appeared that the MongoS opened connections to about 9 of the primaries. I waited for a while and then killed the command. The second time I tried it the MongoS only connected to one replica set primary, so I decided to have a look at the MongoD log on that server.

Digging around in the log I could not find any errors, so I tried to connect to the MongoD process from the server itself. This connection hung. Could it have been me getting in a mess with connecting to the server? I tried connecting to a different replica set primary and that was fine. It therefore appeared that connections to one primary replica set member were hanging. The mongodb.log file for it showed connections coming in, but I simply could not connect to it, nor could a colleague. I could happily connect to the MongoS processes, but not directly to the MongoD.
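
For reference, the watching amounted to nothing more than the following in two terminals; the log path is an assumption and will depend on how the MongoS was deployed:

# In one terminal, follow the MongoS log (the path is an assumption).
tail -f /var/log/mongodb/mongos.log

# In another, connect to the MongoS and type the hanging command,
# i.e. "show dbs" at the mongos> prompt.
mongo --host <mongos host> --port 27017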

I discussed the situation with a colleague and as there was no way we could log on to the replica set member we decided that we should bounce it. I attempted to do this through Ops Manager, but as we suspected, Ops Manager could not connect to it. It was at this point that I noticed that Ops Manager had stopped receiving metric information from the server. We discussed options again and decided to log a severity 2 call with MongoDB support before we issued a “kill -9”.

Shortly after logging the call I was phoned by someone from MongoDB Support who was clearly a techie. I detailed the situation and we discussed whether to issue a “kill -9”. She decided to consult her colleagues and get back to me via the ticket. Shortly afterwards the ticket was updated with the recommendation to issue a “kill” and, if that failed, a “kill -9”. I did this and, as expected, the “kill” failed but the “kill -9” worked. At this point Ops Manager detected that the replica set member was in the “Goal state”, so I started it back up with Ops Manager. With the replica set member bounced, it caught up, became primary again and the cluster was back to full operation.
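
For the record, the bounce itself was just the standard kill sequence against the wedged MongoD, with the pid looked up first; only when the polite kill failed was the forceful one used:

# Find the pid of the wedged mongod.
pgrep -l mongod

# Ask the process to shut down cleanly first; in this case it did not respond.
kill <pid>

# Only then force-kill it, accepting that there will be no clean shutdown.
kill -9 <pid>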

During the incident two of the MongoS processes for this cluster also crashed. This did not affect the situation much, but I uploaded the logs to MongoDB Support. Unfortunately this seems only to have confused matters, as it was rather close to the previous incident on Dev in which a MongoS process had crashed, and the two incidents became somewhat conflated.

At the time of writing, the cause of the hang is not known. Several avenues of investigation have so far been followed by MongoDB support, but to no avail. They theorised that it could have been a connection storm, as this cluster had a very large number of connections: more than 12,000. However, as can be seen from the graph below, the number of connections rose more or less steadily rather than spiking in a storm:

The data for this graph was obtained by grep’ing for the string “connection accepted from” in the mongodb log and then taking the number from the bracketed section at the end of each line, e.g. “(210 connections now open)”. The line rises for about 12 hours, starting at about midnight.
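
A sketch of that extraction, assuming the plain-text log line format shown elsewhere in this post and a hypothetical log path; the timestamp and the bracketed count are pulled out of each “connection accepted” line:

# Emit "<timestamp> <open connection count>" for every accepted connection;
# the log path is an assumption.
grep 'connection accepted from' /var/log/mongodb/mongodb.log \
  | sed -E 's/^([^ ]+) .*[(]([0-9]+) connections? now open[)].*/\1 \2/'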

They have also theorised that a large number of metadata updates could have been causing the problem, and there were indeed more metadata updates than usual:

The graph shows the load average over the week; during the problem period the load on the server went from almost nothing to between 7 and 9 and stayed there. From observations at the time this was entirely down to the MongoD process, so it was clearly doing something, but what?
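
The attribution of the load to the MongoD process was by straightforward observation at the time; something along these lines does the job (pidstat is part of the sysstat package and may need installing):

# Confirm the load average and see which processes are responsible.
uptime
top -b -n 1 | head -n 20

# Or sample just the mongod process: three 5-second CPU samples.
pidstat -u -p "$(pgrep -d, mongod)" 5 3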

After I had uploaded the various logs for all the primary replica set members, MongoDB support analysed them and noticed that there were no diagnostic logs for the problem period. They suggested that I might have moved them (as they had asked in a previous comment) and thus missed uploading them. However, looking at the timestamps and sizes of the files, it would appear that the MongoD processes simply stopped generating diagnostic data.
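
The timestamps and sizes in question are those of the FTDC files in the diagnostic.data directory under the dbpath; the path below is an assumption:

# List the diagnostic (FTDC) files, newest first, with sizes and
# modification times; the dbpath /data/db is an assumption.
ls -lht /data/db/diagnostic.data/ | head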

The only errors I have found were in the automation agent logs, which show the first problem at about 00:09 on the day of the incident: almost exactly the last timestamp in the relevant diagnostic data file, and the time at which the number of connections started to rise. It is also important to note that all the other diagnostic data files are approximately 1MB in size, but the relevant file is not. I put my “stopped logging” theory to MongoDB support and they seem to have accepted it. Their latest request was for the output of “dmesg”, which prints the kernel message buffer and “typically contains the messages produced by the device drivers”. On my original upload I simply gave them the output of “dmesg”; however, this gave rise to some confusion as there were a couple of I/O errors listed:

dmesg
...
[9890186.778442] Buffer I/O error on dev <dev>, logical block 0, async page read
...
[9976502.791764] Buffer I/O error on dev <dev>, logical block 0, async page read
...

However, by using the “--reltime” argument the confusion can be removed, as both errors occurred after the problem had been resolved.

dmesg --reltime | grep 'Buffer I/O error on dev'
[<day after> 21:33] Buffer I/O error on dev <dev>, logical block 0, async page read
[<two days after> 21:32] Buffer I/O error on dev <dev>, logical block 0, async page read

Do note that the date is only given once in the output and all subsequent lines for that date are an offset from it; it is therefore somewhat fortuitous that the two relevant lines included the date. Note also that the device was a system snapshot volume, so even if the times had coincided with the problem they would not have been relevant, although MongoDB support could not know that.

As of the time of writing, this is the current state of the investigation. The last comment from MongoDB support was to monitor the situation for any recurrence.

Lessons

  • It is entirely possible for the database to become unusable, for no alerts to be raised and for none of the main logs to show a problem. In this case the only relevant errors seen were in the monitoring agent logs.
  • The MongoDB logs may not contain enough information to diagnose situations such as this.
  • To save time, always upload as much information as possible to MongoDB support. Where relevant, this should include graphs of the load on the affected servers, which can be obtained from Ops Manager.