Huge log file sync waits

A database that had been migrated to AWS went live and was hitting serious performance problems three times every weekday. I was asked to investigate what was going on. The server had been configured to allow for the CPU usage and peak data throughput seen on premises, but something clearly had been missed or gone awry.

Creating a few AWR reports showed very quickly what the problem was:

Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                           Total Wait       Avg   % DB Wait
Event                                Waits Time (sec)      Wait   time Class
------------------------------ ----------- ---------- --------- ------ --------
log file sync                      208,804        20K   95.59ms   40.5 Commit
DB CPU                                         7926.6             16.1
db file sequential read          8,568,722     6823.3  796.31us   13.8 User I/O
log buffer space                    31,430     5459.1  173.69ms   11.1 Configur
log file switch (checkpoint in         852       1776 2084.46ms    3.6 Configur
db file scattered read             751,174     1628.5    2.17ms    3.3 User I/O
log file switch (archiving nee         373     1532.6 4108.76ms    3.1 Configur
log file switch completion           2,925     1429.2  488.63ms    2.9 Configur
buffer busy waits                5,162,816     1233.9  239.01us    2.5 Concurre
control file sequential read       210,367      521.9    2.48ms    1.1 System I/

Clearly, we had a redo log related problem. The single block and multiblock reads times were also higher than usual, but they had not been as affected as the redo logs. I looked at the iostat information captured by OSWatcher. Below is a stacked line graph showing the sum of all read and write activity on all the database related disks on the server.

This suggests that we are hitting an instance wide limit at about 800MB/s. The peak that reaches close to 1600MB/s seems to be an artifact as the instance was indeed limited to about 800MB/s. If it wasn’t an artifact as a result of something like iostat counting the data from two periods in one, then AWS have a QoS problem. I have seen such spikes before and they have never matched data from other sources, and indeed the AWS Cloudwatch metrics did not show such a high spike.

If we then look at the redo log disk alone, we can see that most of the I/O is coming from these:

This can also be seen in the AWR reports section “IOStat by Filetype summary”:

                Reads:  Reqs    Data   Writes:  Reqs    Data      Small    Large
Filetype Name     Data per sec per sec    Data per sec per sec     Read     Read
-------------- ------- ------- ------- ------- ------- ------- -------- --------
Log File        392.1G    37.9 37.323M  242.8G    60.5 23.111M   6.46ms  41.13ms
Archive Log     232.6G    22.1 22.135M  122.7G    11.7 11.683M   2.33ms  36.14ms
Data File       226.7G  1244.0  21.58M   66.7G   489.6  6.347M 356.68us   9.71ms
Other            46.3G    34.9  4.403M   91.5G    19.3  8.714M   2.65ms   2.67ms
Flashback Log       2M     0.0      0M  103.5G    21.8  9.854M 471.43us
Control File       45G    41.5  4.287M    592M     3.5   .055M   3.98ms  27.56ms
Temp File         170M     0.1   .016M    171M     0.1   .016M   1.32ms   2.27ms
TOTAL:          942.8G  1380.6 89.744M  628.1G   606.4 59.781M 480.46us  26.27ms
                          ------------------------------------------------------

The database has two members for each redo log group, and the archived logs written to a single location on disk, hence the “Log File” write rate is approximately double that of the “Archive Log” rate. The database had six standby databases, three on premises and three in AWS as we had kept the on-premises standbys in case of rollback to on premises. The high “Log File” read rate reflects the number of standbys. Data transmission straight from the log buffer can’t keep up and so the log files are being accessed. The high “Archive Log” read rate suggests that the sending of the redo to the standbys can’t keep up even by reading the redo logs and so the archived logs need to be accessed. Had all the six standbys been in the same AWS availability zone, perhaps everything would have been fine, but the standbys were spread across AWS regions and non-AWS data centres.

A further interesting observation that can be made from the IOStat information was that reading of the redo logs is only happening on the first member of each redo log group. The first members are all on one ASM disk group and the second on another. This can be seen from the below two non-stacked line graphs, with the first showing the read (blue) and write (orange) rates for the first disk group and the second showing the information for the second:

The server type was then increased to allow for more I/O, but still we had a problem, although not as bad.

Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                           Total Wait       Avg   % DB Wait
Event                                Waits Time (sec)      Wait   time Class
------------------------------ ----------- ---------- --------- ------ --------
DB CPU                                         2001.8             46.5
log file sync                       69,234       1255   18.13ms   29.2 Commit
db file sequential read            903,604        435  481.36us   10.1 User I/O
db file scattered read             224,648      229.6    1.02ms    5.3 User I/O
log file switch completion             318        165  518.84ms    3.8 Configur
direct path read                   254,451       90.3  354.80us    2.1 User I/O
resmgr:cpu quantum                   1,709       73.3   42.88ms    1.7 Schedule
log file switch (checkpoint in          36       57.5 1597.14ms    1.3 Configur
buffer busy waits                  908,222       28.9   31.77us     .7 Concurre
control file sequential read        61,251       27.7  452.73us     .6 System I

This time, it only seems to be a redo log related problem as the single block and multiblock read times are much more normal. This is reflected in the IOStat information:

It is difficult to see from this graph, but the blue and grey lines below all the rest in the big block between about 12:15 and 12:29 hint at the redo problem. If we look at the first redo log disk group alone, we can see it more clearly:

It appears that the disk group is hitting a limit at about 500MB/s, primarily due to reads. However, the underlying io2 disk was specified at 5K IOPS and we were never getting near that:

The peak may be at just over 25K, but that is over the minute of the IOStat sample, so nothing like 5K per second. Research into AWS IOPS and throughput limits uncovered the following document: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/provisioned-iops.html. This has a very interesting graph in it:

It would therefore appear that specifying IOPS between about 2K and 32K will only get you a maximum throughput of 500MiB/s. As the relevant disk was specified as 5K, this would seem to be the situation. I didn’t know the peak throughput we would need as although we had data from when the database was on premises, it only had 2 standbys then and there was significantly less network latency to these when all the databases were on premises. We decided to increase the throughput to its maximum of 1000MiB/s as we could also reduce this once we knew the peak we needed. However, that would mean increasing the IOPS to 64K and that makes the disk significantly more expensive than at 5K.

With the IOPS increased to 64K the problem was resolved, if only just, as can be seen from the IOStat graph below:

The problem was solved, but we were left with an extremely expensive disk with an IOPS specification vastly more than was needed just to get the throughput. To address this, we decided on a two-pronged approach, firstly I would investigate cascading standbys to reduce the load on the redo log volumes and secondly, we would replace the one 64K IOPS disk in the disk group with two at 5K IOPS. Theoretically, the two 5K IOPS disks should give us 2*500MiB/s throughput which would give us the same throughput as the one 64K IOPS disk, whilst being significantly cheaper. However, the throughput-IOPS limitations were unexpected and so we didn’t want to add the two disks only to discover that there was some other unexpected limitation.

Lots of testing later, we now have the two 5K IOPS disks providing sufficient throughput. Following my investigation into cascading standbys, we now also have two of the on premises and one of the AWS standbys as cascading standbys.

Testing failure?

Why did we not detect the possibility of this problem in testing? There are a few reasons for this. Firstly, we only have standbys in production and so we could not detect the problem in any environment other than production.

Secondly, looking at the log file sync wait event whilst the database was on premises, we could see the spikes, but they were an order of magnitude lower probably due to their only being two standbys, the network latency being much lower, and we had a large all-flash array on premises. We could see the spikes in the redo generation rate, but these were not causing application issues. Therefore, there were no indicators that the redo log disk groups would cause a problem.

Thirdly, the configuration in production was different from any configuration used in testing, not only in terms of the number of standbys, but also their location and therefore network latency. Such a difference is not unusual for a migration as we had the three intended standbys in AWS plus the on-premises standbys in case we needed to rollback to on premises. Once the application had been running successfully in AWS for a period, the on-premises standbys would be removed.

Finally, this was one of our first migrations to AWS and so we had little experience of working in an environment where each disk has specific characteristics. Therefore, in some respects this was a testing failure, but not a big one. You can’t test for everything and there was certainly no experience or expectation of such a problem. Knowledge of this problem means that testing is now designed to detect the possibility that it will occur again.