During the implementation of a change I noticed an alert in Ops Manager for a Development MongoS host that was down. After the host was brought back up I took a quick look at it and found that the MongoS process kept crashing and was using a rather large amount of memory. The memory graph below from Ops Manager shows the behaviour quite well:
The red lines indicate when the MongoS crashed and was restarted by the automation agent. The problem seemed to have started early on the previous day (the 23rd), as the next graph shows:
The gaps between the blocks of red are where the host had crashed. I took a look at the MongoS log, but it is rather sparse on information. After discussing with colleagues I attempted to install mtools, which required installing various other Python packages as well. I also created a case with MongoDB support and uploaded the mongod log files; they came back saying that there was no evidence to confirm that this was the result of the Linux Out Of Memory killer. I had only given them the mongod log files at that point, and there seemed to be very little of use in them, so there was indeed no evidence that the Linux Out Of Memory killer was killing the MongoS process. Given that MongoDB support had got nothing of use from those logs, I decided not to investigate them any further with the mtools utilities. I did upload further information to MongoDB support, namely the output of "top" and the above graphs, after which they concluded that it was a problem with the MongoS running out of memory.
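For anyone who does want to go down the mtools route, a rough sketch of what it looks like is below. This is only how I would expect to use it, not something I ended up running during this incident, and the log path is purely illustrative:

# Install mtools and its Python dependencies into a virtualenv
python -m venv mtools-env
. mtools-env/bin/activate
pip install mtools

# Summarise a mongod/mongos log: slow queries and connection counts
mloginfo --queries --connections /var/log/mongodb/mongod.log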
One of the more interesting pieces of evidence that I gathered was the output of mongostat, a snippet of which is shown below:
insert query update delete getmore command flushes mapped vsize   res faults qrw arw net_in net_out conn                time
   121    56    118     *0      62    74|0       0     0B 11.1G 10.7G      0 0|0 0|0   749k    575m  161 Jan 24 15:31:15.709
    32    34     84     *0      77    91|0       0     0B 11.1G 10.7G      0 0|0 0|0   382k    560m  161 Jan 24 15:31:16.708
    75    36    497     *0      62   479|0       0     0B 11.2G 10.7G      0 0|0 0|0  1.20m    538m  161 Jan 24 15:31:17.709
    41    35    329     *0      54   339|0       0     0B 11.4G 10.9G      0 0|0 0|0   784k    624m  161 Jan 24 15:31:18.712
    12    35    124     *0      53   147|0       0     0B 12.0G 11.5G      0 0|0 0|0   314k    667m  162 Jan 24 15:31:19.709
...
insert query update delete getmore command flushes mapped vsize   res faults qrw arw net_in net_out conn                time
    23    29     24     *0      41    61|0       0     0B 26.1G 14.9G      2 0|0 0|0  1.23m    307m  204 Jan 24 15:34:46.711
    16    17     16     *0      44    43|0       0     0B 26.1G 14.9G      3 0|0 0|0   188k    327m  205 Jan 24 15:34:47.709
    17    19     18     *0      53    51|0       0     0B 26.1G 14.9G      3 0|0 0|0   178k    421m  205 Jan 24 15:34:48.726
    18    32     18     *0      39    82|0       0     0B 26.1G 14.9G     31 0|0 0|0   218k    296m  205 Jan 24 15:34:49.717
    17    46     16     *0      44    39|0       0     0B 26.1G 14.9G      7 0|0 0|0   208k    325m  205 Jan 24 15:34:50.726
    19    54     20     *0      43    55|0       0     0B 26.2G 14.9G     10 0|0 0|0  1.15m    461m  205 Jan 24 15:34:51.718
    14    22     12     *0      34    62|0       0     0B 26.3G 14.9G     11 0|0 0|0   504k    420m  205 Jan 24 15:34:52.726
    24    40     26     *0      50    72|0       0     0B 26.4G 14.9G     11 0|0 0|0   362k    366m  205 Jan 24 15:34:53.723
     3    13      3     *0      39    59|0       0     0B 26.6G 14.9G     21 0|0 0|0  49.9k    257m  205 Jan 24 15:34:55.259
    12    61     12     *0      36    59|0       0     0B 26.6G 14.9G     28 0|0 0|0   656k    527m  205 Jan 24 15:34:55.896
Not only does this show the memory constantly increasing (vsize and res columns), but also that there is vastly more data going out than coming in (net_in and net_out columns). Given that the MongoS sits between the shards and the application, if it were streaming large amounts of data from the shards back to the application, you would expect the net_in value to be similar to the net_out. Likewise, if it were inserting large amounts of data you would also expect the values to be similar. However in this case it is getting relatively little data back from the shards and turning it into vast amounts of data to be sent back to the application. It is possible of course that it was receiving little data from the application and turning it into vast amounts of data to be inserted, but I can’t think how that would be the case. Could this be something like a Cartesian join? Given that this is MongoDB and the way the application is written, this is unlikely. Perhaps then it is some cached result set that is being accessed numerous times.
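For context, the snippet above came from pointing mongostat at the MongoS; a typical invocation looks something like the following (the one-second interval is simply how I would normally run it, not something significant to the diagnosis):

# Sample the MongoS once per second; vsize/res show process memory, net_in/net_out show traffic
mongostat --host <mongos_host>:<mongos_port> 1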
After I had uploaded various pieces of information, MongoDB support asked for the output of the mdiag script, but this did not help them either. At this point, given that MongoDB support had not found a smoking gun in the logs, I was relatively confident that the problem was not with the MongoS itself, but rather with what was being asked of it. The product team was also getting rather agitated, as the Dev cluster was effectively unusable: there was only one MongoS and, since it controls all access to the database, every crash terminated all connections. However, even if there were multiple MongoS processes, I suspect the problem would simply have transferred to whichever MongoS the application failed over to, resulting in a series of crashes and no actual benefit. I tried to determine where all of the outgoing data was going, but due to permissions this proved difficult and resulted in a red herring or two. I then asked the product team to try to determine where the data was going, giving them a list of all the IP addresses that were connecting to the MongoS (these can be seen in the MongoS log) and asking them to investigate. They decided to start shutting down applications on the hosts identified, and after they shut the first one down the MongoS went back to normal. They re-enabled it and the MongoS started crashing again.
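Pulling that list of client IP addresses out of the MongoS log can be done with something along these lines, assuming the plain-text log format this version produces ("connection accepted from <ip>:<port>" lines) and an illustrative log path:

# List the distinct client IP addresses that have connected to the MongoS
grep "connection accepted from" /var/log/mongodb/mongos.log \
  | sed 's/.*connection accepted from \([0-9.]*\):.*/\1/' \
  | sort -u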
Investigations by the product team determined that there was a bug in their code and that this was causing the MongoS to run out of memory, although exactly why is not known. I discussed with MongoDB support whether there was any way to limit the amount of memory used by the MongoS process. They suggested cgroups, which allow you to allocate a specific amount of memory to a specific process or group of processes. However, this would be of little help, as the MongoS would simply crash when it could not allocate more memory. That is a slightly better situation in that the server itself does not run out of memory, but as far as the application is concerned it is of little use. MongoDB support did suggest a way in which we could capture information on what the MongoS was doing, detailed below; however, as the bug had been identified, the product team decided not to re-introduce it, so I had no chance to test the commands.
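For the record, a minimal sketch of the cgroups approach using the cgroup v1 tools from libcgroup is shown below. The group name, memory limit and config path are purely illustrative, I never actually applied this, and in an Ops Manager managed deployment the automation agent rather than a manual cgexec would normally start the process:

# Create a memory cgroup and cap it at 8 GiB (illustrative limit)
sudo cgcreate -g memory:mongos_cap
echo 8G | sudo tee /sys/fs/cgroup/memory/mongos_cap/memory.limit_in_bytes

# Start the MongoS inside that cgroup (hypothetical config path)
sudo cgexec -g memory:mongos_cap mongos --config /etc/mongos.conf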
mongo --host <mongos_host>:<mongos_port> --eval "while(true) {print(JSON.stringify(db.serverStatus({'tcmalloc':1}))); sleep(60000)}"

mongo --host <mongos_host>:<mongos_port> --eval 'while(true) {print("===", Date()); printjson(db.currentOp(true)); sleep(60000)}'
MongoDB support advised that if this problem were to happen again, the two loops above should be run concurrently for "over at least a couple of hours". They also advised raising the logging level through Ops Manager: click "Modify" next to the MongoS and, under "Advanced Options", add "logLevel" with a value of 5.
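If I ever do need to run them, my plan would be something along the lines below. The output file names are my own choice, and the setParameter command is the shell equivalent of the Ops Manager change, on the assumption that connecting directly to the MongoS is acceptable:

# Run both capture loops in the background, writing to separate files
nohup mongo --host <mongos_host>:<mongos_port> --eval "while(true) {print(JSON.stringify(db.serverStatus({'tcmalloc':1}))); sleep(60000)}" > tcmalloc_stats.log 2>&1 &
nohup mongo --host <mongos_host>:<mongos_port> --eval 'while(true) {print("===", Date()); printjson(db.currentOp(true)); sleep(60000)}' > current_ops.log 2>&1 &

# Shell equivalent of raising the log level to 5
mongo --host <mongos_host>:<mongos_port> --eval 'db.adminCommand({setParameter: 1, logLevel: 5})'

Note that Ops Manager's automation may override a value set this way, which is presumably why support suggested making the change in Ops Manager itself.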
Lessons
- There is nothing you can do to prevent a MongoS process from attempting to allocate more memory than is available. The best you can do is use cgroups to stop it from destabilising the rest of the server.
- The MongoS log is very sparse on useful information.
- A bug in the code can effectively make your database unusable. Testing is your friend.
- Although a single MongoS crash may not be considered a problem given the architecture of MongoDB, a MongoS that crashes continually within a short space of time is a problem, and a very critical one.