During the implementation of a change I noticed an alert in Ops Manager for a Development MongoS host that was down. After the host was brought back up I took a quick look at it and found that the MongoS process kept crashing and was using a rather large amount of memory. The memory graph below from Ops Manager shows the behaviour quite well:
The red lines indicate when the MongoS crashed and was restarted by the automation agent. The problem seemed to have started early on the previous day (the 23rd), as the next graph shows:
The gaps between the blocks of red are where the host had crashed. I took a look at the MongoS log, but it is rather sparse on information. After discussing with colleagues I attempted to install mtools, which required installing various other Python packages as well. I also created a case with MongoDB support and uploaded the mongod log files; they came back saying that there was no evidence to confirm that this was the result of the Linux Out Of Memory killer. I had only given them the mongod log files at that point, and there seemed to be very little of use in them, so there was indeed no evidence that the Linux Out Of Memory killer was killing the MongoS process. Given that MongoDB support had got nothing of use from those logs, I decided not to investigate them any further with the mtools utilities. I did upload further information to MongoDB support, namely the output of "top" and the above graphs, after which they concluded that it was a problem with the MongoS running out of memory.
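For anyone who does want to go down the mtools route, a rough sketch of what it looks like is below. This is only how I would expect to use it, not something I ended up running during this incident, and the log path is purely illustrative:

# Install mtools and its Python dependencies into a virtualenv
python -m venv mtools-env
. mtools-env/bin/activate
pip install mtools

# Summarise a mongod/mongos log: slow queries and connection counts
mloginfo --queries --connections /var/log/mongodb/mongod.log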
One of the more interesting pieces of evidence that I gathered was the output of mongostat, a snippet of which is shown below:
insert query update delete getmore command flushes mapped vsize   res faults qrw arw net_in net_out conn                time
   121    56    118     *0      62    74|0       0     0B 11.1G 10.7G      0 0|0 0|0   749k    575m  161 Jan 24 15:31:15.709
    32    34     84     *0      77    91|0       0     0B 11.1G 10.7G      0 0|0 0|0   382k    560m  161 Jan 24 15:31:16.708
    75    36    497     *0      62   479|0       0     0B 11.2G 10.7G      0 0|0 0|0  1.20m    538m  161 Jan 24 15:31:17.709
    41    35    329     *0      54   339|0       0     0B 11.4G 10.9G      0 0|0 0|0   784k    624m  161 Jan 24 15:31:18.712
    12    35    124     *0      53   147|0       0     0B 12.0G 11.5G      0 0|0 0|0   314k    667m  162 Jan 24 15:31:19.709
...
insert query update delete getmore command flushes mapped vsize   res faults qrw arw net_in net_out conn                time
    23    29     24     *0      41    61|0       0     0B 26.1G 14.9G      2 0|0 0|0  1.23m    307m  204 Jan 24 15:34:46.711
    16    17     16     *0      44    43|0       0     0B 26.1G 14.9G      3 0|0 0|0   188k    327m  205 Jan 24 15:34:47.709
    17    19     18     *0      53    51|0       0     0B 26.1G 14.9G      3 0|0 0|0   178k    421m  205 Jan 24 15:34:48.726
    18    32     18     *0      39    82|0       0     0B 26.1G 14.9G     31 0|0 0|0   218k    296m  205 Jan 24 15:34:49.717
    17    46     16     *0      44    39|0       0     0B 26.1G 14.9G      7 0|0 0|0   208k    325m  205 Jan 24 15:34:50.726
    19    54     20     *0      43    55|0       0     0B 26.2G 14.9G     10 0|0 0|0  1.15m    461m  205 Jan 24 15:34:51.718
    14    22     12     *0      34    62|0       0     0B 26.3G 14.9G     11 0|0 0|0   504k    420m  205 Jan 24 15:34:52.726
    24    40     26     *0      50    72|0       0     0B 26.4G 14.9G     11 0|0 0|0   362k    366m  205 Jan 24 15:34:53.723
     3    13      3     *0      39    59|0       0     0B 26.6G 14.9G     21 0|0 0|0  49.9k    257m  205 Jan 24 15:34:55.259
    12    61     12     *0      36    59|0       0     0B 26.6G 14.9G     28 0|0 0|0   656k    527m  205 Jan 24 15:34:55.896
Not only does this show the memory constantly increasing (vsize and res columns), but also that there is vastly more data going out than coming in (net_in and net_out columns). Given that the MongoS sits between the shards and the application, if it were streaming large amounts of data from the shards back to the application, you would expect the net_in value to be similar to the net_out. Likewise, if it were inserting large amounts of data you would also expect the values to be similar. However in this case it is getting relatively little data back from the shards and turning it into vast amounts of data to be sent back to the application. It is possible of course that it was receiving little data from the application and turning it into vast amounts of data to be inserted, but I can’t think how that would be the case. Could this be something like a Cartesian join? Given that this is MongoDB and the way the application is written, this is unlikely. Perhaps then it is some cached result set that is being accessed numerous times.
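For context, the snippet above came from pointing mongostat at the MongoS; a typical invocation looks something like the following (the one-second interval is simply how I would normally run it, not something significant to the diagnosis):

# Sample the MongoS once per second; vsize/res show process memory, net_in/net_out show traffic
mongostat --host <mongos_host>:<mongos_port> 1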
After I had uploaded various pieces of information, MongoDB support asked for the output of the mdiag script, but this did not help them either. At this point, given that MongoDB support had not found a smoking gun in the logs, I was relatively confident that the problem was not with the MongoS itself, but rather with what was being asked of it. The product team was also getting rather agitated, as the Dev cluster was effectively unusable: there was only one MongoS and, since it controls all access to the database, every crash terminated all connections. However, even if there were multiple MongoS processes, I suspect the problem would simply have transferred to whichever MongoS the application failed over to, resulting in a series of crashes and no actual benefit. I tried to determine where all of the outgoing data was going, but due to permissions this proved difficult and resulted in a red herring or two. I then asked the product team to try to determine where the data was going, giving them a list of all the IP addresses that were connecting to the MongoS (these can be seen in the MongoS log) and asking them to investigate. They decided to start shutting down applications on the hosts identified, and after they shut the first one down the MongoS went back to normal. They re-enabled it and the MongoS started crashing again.
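Pulling that list of client IP addresses out of the MongoS log can be done with something along these lines, assuming the plain-text log format this version produces ("connection accepted from <ip>:<port>" lines) and an illustrative log path:

# List the distinct client IP addresses that have connected to the MongoS
grep "connection accepted from" /var/log/mongodb/mongos.log \
  | sed 's/.*connection accepted from \([0-9.]*\):.*/\1/' \
  | sort -u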
Investigations by the product team determined that there was a bug in their code and that this was causing the MongoS to run out of memory, although exactly why is not known. I discussed with MongoDB support whether there was any way to limit the amount of memory used by the MongoS process. They suggested cgroups, which allow you to allocate a specific amount of memory to a specific process or group of processes. However, this would be of little help, as the MongoS would simply crash when it could not allocate more memory. That is a slightly better situation in that the server itself does not run out of memory, but as far as the application is concerned it is of little use. MongoDB support did suggest a way in which we could capture information on what the MongoS was doing, detailed below; however, as the bug had been identified, the product team decided not to re-introduce it, so I had no chance to test the commands.
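For the record, a minimal sketch of the cgroups approach using the cgroup v1 tools from libcgroup is shown below. The group name, memory limit and config path are purely illustrative, I never actually applied this, and in an Ops Manager managed deployment the automation agent rather than a manual cgexec would normally start the process:

# Create a memory cgroup and cap it at 8 GiB (illustrative limit)
sudo cgcreate -g memory:mongos_cap
echo 8G | sudo tee /sys/fs/cgroup/memory/mongos_cap/memory.limit_in_bytes

# Start the MongoS inside that cgroup (hypothetical config path)
sudo cgexec -g memory:mongos_cap mongos --config /etc/mongos.conf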
mongo --host <mongos_host>:<mongos_port> --eval "while(true) {print(JSON.stringify(db.serverStatus({'tcmalloc':1}))); sleep(60000)}"

mongo --host <mongos_host>:<mongos_port> --eval 'while(true) {print("===", Date()); printjson(db.currentOp(true)); sleep(60000)}'
MongoDB support advised that if this problem were to happen again, the two loops above should be run concurrently for "over at least a couple of hours". They also advised raising the logging level through Ops Manager: click "Modify" next to the MongoS and, under "Advanced Options", add "logLevel" with a value of 5.
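If I ever do need to run them, my plan would be something along the lines below. The output file names are my own choice, and the setParameter command is the shell equivalent of the Ops Manager change, on the assumption that connecting directly to the MongoS is acceptable:

# Run both capture loops in the background, writing to separate files
nohup mongo --host <mongos_host>:<mongos_port> --eval "while(true) {print(JSON.stringify(db.serverStatus({'tcmalloc':1}))); sleep(60000)}" > tcmalloc_stats.log 2>&1 &
nohup mongo --host <mongos_host>:<mongos_port> --eval 'while(true) {print("===", Date()); printjson(db.currentOp(true)); sleep(60000)}' > current_ops.log 2>&1 &

# Shell equivalent of raising the log level to 5
mongo --host <mongos_host>:<mongos_port> --eval 'db.adminCommand({setParameter: 1, logLevel: 5})'

Note that Ops Manager's automation may override a value set this way, which is presumably why support suggested making the change in Ops Manager itself.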
Lessons
- There is nothing you can do to prevent a MongoS process from attempting to allocate more memory than is available. The best you can do is use cgroups to stop it from destabilising the rest of the server.
- The MongoS log is very sparse on useful information.
- A bug in the code can effectively make your database unusable. Testing is your friend.
- Although a single MongoS crash may not be considered a problem given the architecture of MongoDB, a MongoS that crashes continually within a short space of time is a problem, and a very critical one.