A database that had been migrated to AWS went live and was hitting serious performance problems three times every weekday. I was asked to investigate what was going on. The server had been configured to allow for the CPU usage and peak data throughput seen on premises, but something clearly had been missed or gone awry.
Creating a few AWR reports showed very quickly what the problem was:
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Total Wait Avg % DB Wait
Event Waits Time (sec) Wait time Class
------------------------------ ----------- ---------- --------- ------ --------
log file sync 208,804 20K 95.59ms 40.5 Commit
DB CPU 7926.6 16.1
db file sequential read 8,568,722 6823.3 796.31us 13.8 User I/O
log buffer space 31,430 5459.1 173.69ms 11.1 Configur
log file switch (checkpoint in 852 1776 2084.46ms 3.6 Configur
db file scattered read 751,174 1628.5 2.17ms 3.3 User I/O
log file switch (archiving nee 373 1532.6 4108.76ms 3.1 Configur
log file switch completion 2,925 1429.2 488.63ms 2.9 Configur
buffer busy waits 5,162,816 1233.9 239.01us 2.5 Concurre
control file sequential read 210,367 521.9 2.48ms 1.1 System I/
Clearly, we had a redo log related problem. The single block and multiblock reads times were also higher than usual, but they had not been as affected as the redo logs. I looked at the iostat information captured by OSWatcher. Below is a stacked line graph showing the sum of all read and write activity on all the database related disks on the server.

This suggests that we are hitting an instance wide limit at about 800MB/s. The peak that reaches close to 1600MB/s seems to be an artifact as the instance was indeed limited to about 800MB/s. If it wasn’t an artifact as a result of something like iostat counting the data from two periods in one, then AWS have a QoS problem. I have seen such spikes before and they have never matched data from other sources, and indeed the AWS Cloudwatch metrics did not show such a high spike.
If we then look at the redo log disk alone, we can see that most of the I/O is coming from these:

This can also be seen in the AWR reports section “IOStat by Filetype summary”:
Reads: Reqs Data Writes: Reqs Data Small Large
Filetype Name Data per sec per sec Data per sec per sec Read Read
-------------- ------- ------- ------- ------- ------- ------- -------- --------
Log File 392.1G 37.9 37.323M 242.8G 60.5 23.111M 6.46ms 41.13ms
Archive Log 232.6G 22.1 22.135M 122.7G 11.7 11.683M 2.33ms 36.14ms
Data File 226.7G 1244.0 21.58M 66.7G 489.6 6.347M 356.68us 9.71ms
Other 46.3G 34.9 4.403M 91.5G 19.3 8.714M 2.65ms 2.67ms
Flashback Log 2M 0.0 0M 103.5G 21.8 9.854M 471.43us
Control File 45G 41.5 4.287M 592M 3.5 .055M 3.98ms 27.56ms
Temp File 170M 0.1 .016M 171M 0.1 .016M 1.32ms 2.27ms
TOTAL: 942.8G 1380.6 89.744M 628.1G 606.4 59.781M 480.46us 26.27ms
------------------------------------------------------
The database has two members for each redo log group, and the archived logs written to a single location on disk, hence the “Log File” write rate is approximately double that of the “Archive Log” rate. The database had six standby databases, three on premises and three in AWS as we had kept the on-premises standbys in case of rollback to on premises. The high “Log File” read rate reflects the number of standbys. Data transmission straight from the log buffer can’t keep up and so the log files are being accessed. The high “Archive Log” read rate suggests that the sending of the redo to the standbys can’t keep up even by reading the redo logs and so the archived logs need to be accessed. Had all the six standbys been in the same AWS availability zone, perhaps everything would have been fine, but the standbys were spread across AWS regions and non-AWS data centres.
A further interesting observation that can be made from the IOStat information was that reading of the redo logs is only happening on the first member of each redo log group. The first members are all on one ASM disk group and the second on another. This can be seen from the below two non-stacked line graphs, with the first showing the read (blue) and write (orange) rates for the first disk group and the second showing the information for the second:


The server type was then increased to allow for more I/O, but still we had a problem, although not as bad.
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Total Wait Avg % DB Wait
Event Waits Time (sec) Wait time Class
------------------------------ ----------- ---------- --------- ------ --------
DB CPU 2001.8 46.5
log file sync 69,234 1255 18.13ms 29.2 Commit
db file sequential read 903,604 435 481.36us 10.1 User I/O
db file scattered read 224,648 229.6 1.02ms 5.3 User I/O
log file switch completion 318 165 518.84ms 3.8 Configur
direct path read 254,451 90.3 354.80us 2.1 User I/O
resmgr:cpu quantum 1,709 73.3 42.88ms 1.7 Schedule
log file switch (checkpoint in 36 57.5 1597.14ms 1.3 Configur
buffer busy waits 908,222 28.9 31.77us .7 Concurre
control file sequential read 61,251 27.7 452.73us .6 System I
This time, it only seems to be a redo log related problem as the single block and multiblock read times are much more normal. This is reflected in the IOStat information:

It is difficult to see from this graph, but the blue and grey lines below all the rest in the big block between about 12:15 and 12:29 hint at the redo problem. If we look at the first redo log disk group alone, we can see it more clearly:

It appears that the disk group is hitting a limit at about 500MB/s, primarily due to reads. However, the underlying io2 disk was specified at 5K IOPS and we were never getting near that:

The peak may be at just over 25K, but that is over the minute of the IOStat sample, so nothing like 5K per second. Research into AWS IOPS and throughput limits uncovered the following document: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/provisioned-iops.html. This has a very interesting graph in it:

It would therefore appear that specifying IOPS between about 2K and 32K will only get you a maximum throughput of 500MiB/s. As the relevant disk was specified as 5K, this would seem to be the situation. I didn’t know the peak throughput we would need as although we had data from when the database was on premises, it only had 2 standbys then and there was significantly less network latency to these when all the databases were on premises. We decided to increase the throughput to its maximum of 1000MiB/s as we could also reduce this once we knew the peak we needed. However, that would mean increasing the IOPS to 64K and that makes the disk significantly more expensive than at 5K.
With the IOPS increased to 64K the problem was resolved, if only just, as can be seen from the IOStat graph below:

The problem was solved, but we were left with an extremely expensive disk with an IOPS specification vastly more than was needed just to get the throughput. To address this, we decided on a two-pronged approach, firstly I would investigate cascading standbys to reduce the load on the redo log volumes and secondly, we would replace the one 64K IOPS disk in the disk group with two at 5K IOPS. Theoretically, the two 5K IOPS disks should give us 2*500MiB/s throughput which would give us the same throughput as the one 64K IOPS disk, whilst being significantly cheaper. However, the throughput-IOPS limitations were unexpected and so we didn’t want to add the two disks only to discover that there was some other unexpected limitation.
Lots of testing later, we now have the two 5K IOPS disks providing sufficient throughput. Following my investigation into cascading standbys, we now also have two of the on premises and one of the AWS standbys as cascading standbys.
Testing failure?
Why did we not detect the possibility of this problem in testing? There are a few reasons for this. Firstly, we only have standbys in production and so we could not detect the problem in any environment other than production.
Secondly, looking at the log file sync wait event whilst the database was on premises, we could see the spikes, but they were an order of magnitude lower probably due to their only being two standbys, the network latency being much lower, and we had a large all-flash array on premises. We could see the spikes in the redo generation rate, but these were not causing application issues. Therefore, there were no indicators that the redo log disk groups would cause a problem.
Thirdly, the configuration in production was different from any configuration used in testing, not only in terms of the number of standbys, but also their location and therefore network latency. Such a difference is not unusual for a migration as we had the three intended standbys in AWS plus the on-premises standbys in case we needed to rollback to on premises. Once the application had been running successfully in AWS for a period, the on-premises standbys would be removed.
Finally, this was one of our first migrations to AWS and so we had little experience of working in an environment where each disk has specific characteristics. Therefore, in some respects this was a testing failure, but not a big one. You can’t test for everything and there was certainly no experience or expectation of such a problem. Knowledge of this problem means that testing is now designed to detect the possibility that it will occur again.