Initial sync running out of space

Symptom

  • Initial sync fails.
  • Initial sync fills file system.

Cause

  • Extremely high oplog generation rate

Solution

  • Restart the initial syncs once the oplog generation rate is reasonable.

Investigation

I was contacted by the team looking after a MongoDB cluster to say that they had two initial syncs which had failed, one of which had somehow run out of space. These were both syncs to new servers, as they were moving their AWS replica set members around. My immediate thought was that the problem was going to be the oplog generation rate, although I did not know how that would lead to running out of space.

I first looked at the node whose initial sync had simply stopped. This showed all the hallmarks of the oplog being the problem, with statements in the log about being too old and giving up. The graph below is the oplog generation rate over the last week for the relevant shard. This is extremely high and would mean that the oplog collection would be completely turned over in about an hour at the peak generation rate. However, at this point I was unsure exactly what scenario had caused the replica set member to fall behind.
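A quick way to see what that rate means in practice is rs.printReplicationInfo(), which reports the window the oplog currently covers. Run in the mongo shell against a member of the affected shard it produces output along the lines of the placeholder below (not captured from this cluster):

MongoDB Enterprise <replica set>:PRIMARY> rs.printReplicationInfo();
configured oplog size:   <size>MB
log length start to end: <seconds>secs (<hours>hrs)
oplog first event time:  <timestamp>
oplog last event time:   <timestamp>
now:                     <timestamp>

The "log length start to end" line is the oplog window; at the peak generation rate seen here it would have been down to roughly an hour.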

The situation with the second replica set member was a little more interesting. As previously mentioned, it had run out of space, but why would an initial sync cause a replica set member to run out of space? There was over a terabyte of space free on the file systems of all the other replica set members of that shard, but none free on this one. A quick listing of the files showed what was causing it:

ls -lSr
...
-rw-r-----. 1 <user group>   97167491072 <month> 12 12:27 collection-...wt
-rw-r-----. 1 <user group>   97217806336 <month> 13 06:45 collection-...wt
-rw-r-----. 1 <user group>  102059483136 <month> 13 06:47 collection-...wt
-rw-r-----. 1 <user group>  107327385600 <month> 12 12:21 collection-...wt
-rw-r-----. 1 <user group>  172389974016 <month> 13 06:43 collection-...wt
-rw-r-----. 1 <user group>  174145196032 <month> 13 06:45 collection-...wt
-rw-r-----. 1 <user group>  317303975936 <month> 13 06:45 collection-...wt
-rw-r-----. 1 <user group> 1601539600384 <month> 13 06:43 collection-...wt

It would appear that we had a ~1.5TB file, but what did it relate to? A hex dump of it clearly showed the name of a database and a collection, but did the file actually belong to that collection?

hexdump -c <largest file>

0000000   A 330 001  \0 001  \0  \0  \0 330  \b   # 267  \0  \0  \0  \0
0000020  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0010000  \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0
0010020   &   j  \0  \0  \n  \0  \0  \0  \a 005  \0  \0  \0 020  \0  \0
0010040 264 301   : 323 001  \0  \0  \0 005 201 200 324 235 035 025  \0
0010060  \0 003   _   i   d  \0 030  \0  \0  \0 021   t   s  \0 346 001
0010100   6 016  \0  \0  \0  \0  \0  \0 346 323 001      \0  \0   l 240
0010120 345   [ 020   s  \0 005 001   @ 003   e   n   t   r   y  \0 364
0010140 024  \0  \0 021   t   s  \0 346 001  \t 037  \f 022   t  \0 032
0010160 005       X  \0  \0 022   h  \0 224 211   /   d   9 215 254 361
0010200 020   v  \0 002  \0  \0  \0 002   o   p 005  \b 020   u  \0 002
0010220   n   s 005   (  <database name>
0010240   <collection name>                                      \0 003
0010260   o   2  \0 024 001   4   T   _   i   d  \0 006  \0  \0  \0   5
...

The collection existed and I could query it from a MongoS, but was this the collection that was causing the problem? Looking at the other replica set members, no file of that size existed, but files of approximately the same size as the others listed above did exist, although the file names might have differed: the long number at the end of each file name was certainly different, but the number after “collection-” did appear to be the same or similar in a number of cases. Comparing file names across replica set members, and no doubt across shards, would therefore seem to be of no use. I then looked for a way to determine which file belonged to which collection. Going from a collection to its file is easy, as the file appears in the output of “stats()”:

MongoDB Enterprise <replica set>:SECONDARY> db.system.roles.stats().wiredTiger.uri;
statistics:table:collection-...

From this I determined that the file relating to the collection mentioned in the hexdump was not the 1.5TB file. What a search of the docs and Google did not tell me was how to go the other way, from a file name to a collection, so I used a brute-force approach and wrote the following little bit of code:

// For every database, list each collection together with the WiredTiger file (uri) that backs it
db.adminCommand({listDatabases: 1}).databases.forEach(function listFiles(DBDoc) {
	db.getSiblingDB(DBDoc.name).getCollectionNames().forEach(function listDBFiles(collName) {
		// stats().wiredTiger.uri has the form "statistics:table:collection-..."
		print(DBDoc.name + " " + collName + ": " + db.getSiblingDB(DBDoc.name).getCollection(collName).stats().wiredTiger.uri);
	})
});

This gave me an output listing every collection and its associated file, so I could search for the relevant file name and see which collection it belonged to. The collection turned out to be local.temp_oplog_buffer, which I had never seen before. From the name I assumed that it was a temporary store for the oplog during the initial sync. It would therefore appear that both problems were caused by the oplog generation rate.
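If you only care about a single file, the same loop can be narrowed to print just the match. The sketch below is an untested variant along those lines, with the target ident (the file name without its ".wt" suffix) left as a placeholder:

var target = "collection-...";	// ident of the suspect file, i.e. the file name without ".wt"
db.adminCommand({listDatabases: 1}).databases.forEach(function (DBDoc) {
	db.getSiblingDB(DBDoc.name).getCollectionNames().forEach(function (collName) {
		var uri = db.getSiblingDB(DBDoc.name).getCollection(collName).stats().wiredTiger.uri;
		if (uri.indexOf(target) !== -1) {
			print(DBDoc.name + "." + collName + " -> " + uri);
		}
	});
});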

This left two questions: what was causing the extremely high oplog generation rate, and why did the two initial syncs behave differently? I have not found a definitive answer to the second question, but a member of the product team did come up with a sound theory. In version 3.4 the initial sync buffers oplog entries that are created during the initial sync (https://docs.mongodb.com/manual/core/replica-set-sync/#replica-set-initial-sync), and this buffer would appear to be the local.temp_oplog_buffer collection. Once all the collections have been copied, the member starts applying these buffered oplog entries, and any new oplog entries generated during that apply phase are presumably recorded in the member's normal oplog collection. You therefore have an unlimited buffer for oplog entries while the collections are being copied, but the normal, size-limited oplog collection while the buffered entries are being applied.

It would therefore seem probable that, as the first replica set member's initial sync was started before the oplog generation rate became extremely high, this member completed copying the collections before the buffered oplog entries filled the file system. However, by the time it had finished applying the buffered entries, the amount of oplog generated by the primary had exceeded the capacity of the normal oplog collection, so it could not catch up any further. The DB storage graph shows this happening:

This shows the replica set member growing in size, applying the buffered oplog entries, dropping the buffer and then remaining at a static size. The second replica set member started its initial sync over a day later, by which point the oplog generation rate was already extremely high. The DB storage graph for that replica set member also shows what happened:

In this case, the storage grows until the file system is full. The member appears never to have started applying the buffered oplog entries, as the file system filled before the copying of the collections was complete.
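For the first member, the point at which it could no longer catch up would have shown up as steadily growing replication lag. On the primary, rs.printSlaveReplicationInfo() reports each member's lag; the output below is a placeholder illustration rather than data captured at the time:

MongoDB Enterprise <replica set>:PRIMARY> rs.printSlaveReplicationInfo();
source: <hostname>:<port>
	syncedTo: <timestamp>
	<n> secs (<n> hrs) behind the primary

Once that lag exceeds the oplog window reported by rs.printReplicationInfo(), the member cannot recover without another initial sync.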

All that remained was to answer why the oplog generation rate was so high. Again the product team member provided the answer: it was a single collection that had been range sharded on the “_id” field rather than sharded with a hashed key.
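With a range shard key on a monotonically increasing “_id”, every insert targets the current top chunk, so a single shard takes the entire insert load and generates almost all of the oplog; a hashed shard key would have spread those inserts across the shards. The shard key of an existing collection cannot be changed in this version, but for a new collection the hashed variant would look something like this, reusing the namespace placeholder from the hexdump above:

MongoDB Enterprise mongos> sh.shardCollection("<database name>.<collection name>", { _id: "hashed" });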