Lack of vendor communication

I was first asked to look at this problem as a database restore was getting an error when attempting to restore the archived logs:

archived log thread=1 sequence=577949
channel d14: reading from backup piece 
<piece name>
channel d1: ORA-19870: error while restoring backup piece 
<piece name>
ORA-19504: failed to create file "<FRA disk group>"
ORA-17502: ksfdcre:4 Failed to create file <FRA disk group>
ORA-15347: logical block size 512 of ASM file '<FRA disk group>' is too small for disk group sector size 4096

I had never seen this error before. The chap who asked me to investigate, also an Oracle DBA, had done some research and found https://flashdba.com/2014/02/27/oracle-asmlib-physical-and-logical-blocksize/ which suggested setting ORACLEASM_USE_LOGICAL_BLOCK_SIZE. He had also found and AWS blog, https://aws.amazon.com/blogs/database/using-amazon-ebs-elastic-volumes-with-oracle-databases-part-3-databases-using-oracle-asm/, which stated that the EBS advertises either 512 bytes or 4K bytes depending on a number of factors.

This database was migrating from VMware Cloud (VMC) to AWS and the Operating system was being changed from RH8 to RH9. Both of these changed were therefore focused on initially.

Looking at the VMC server and comparing it to the AWS server using “lsblk -o NAME,PHY-SEC, LOG-SEC” the physical sector size was indeed different, changing from 512 in VMC to 4096 in AWS. I did some digging in MoS and on the web. “ORA-15347: Logical Block Size 2048 Of ASM File ‘+DATA’ Is Too Small For Diskgroup Sector Size 4096 (Doc ID 2231261.1)” and “Using 4k Redo Logs on Flash, 4k-Disk and SSD-based Storage (Doc ID 1681266.1)” suggested the following:

create diskgroup DG_512 external redundancy disk '/dev/raw/raw812' attribute 'sector_size'='512';

ALTER SYSTEM SET "_DISK_SECTOR_SIZE_OVERRIDE"=TRUE;

It was decided to modify the pipeline used to build the RH9 server to include the ASM parameter ORACLEASM_USE_LOGICAL_BLOCK_SIZE. The server was then rebuilt, and I recommended that we try to restore some redo logs to the FRA as this had been the point the previous restore had failed. That worked and then the database was restored and recovered.

Problem solved; job done. However, some time after the restore, a load test was performed and during that test, not only was everything rather slow, including an inability to logon to the database server, but when a colleague managed to logon, he found that the load average was in excess of 8000. I know that the load average is something of a strange beast, but I had never seen it that high and there were less than 900 running processes, so I am unsure what such a high value means. For security reasons, I didn’t have access to the AWS console for the server, so I could not look at the Cloudwatch metrics. However, I was able to get hold of an AWR report. This showed very high values for I/O waits:

Top 10 Foreground Events by Total Wait Time

Event	Waits	Total Wait Time (sec)	Avg Wait	% DB time	Wait Class
direct path read	2,214,272	3.9M	1780.74ms	90.6	User I/O
direct path read temp	1,203,973	237.7K	197.47ms	5.5	User I/O
DB CPU		166.5K		3.8
direct path write temp	21,629	28.1K	1298.44ms	.6	User I/O
db file sequential read	5,456,882	18.4K	3.37ms	.4	User I/O
read by other session	12,030,658	16.7K	1.39ms	.4	User I/O
latch: cache buffers chains	1,291,288	7662.9	5.93ms	.2	Concurrency
ASM IO for non-blocking poll	9,840,084	1372.5	139.48us	.0	User I/O
latch free	3,836	1103.3	287.62ms	.0	Other
acknowledge over PGA limit	4,382	566.5	129.28ms	.0	Scheduler

Without access to Cloudwatch, I asked if I could have a screen share with someone who did have access as I suspected that they had exceeded their throughput limit. The database was using gp3 storage and when people are not being careful, the default 125MB/s throughput is chosen. This had happened quite a few times in the past as we previously had used io2 storage where you don’t directly specify the throughput, rather it is dependant on the IOPS. It doesn’t take many IOPS for io2 to have a much higher throughput than 125MB/s.

During the screen share, I talked the chap through generating a graph of the throughput maximums for one of the ASM disks, but to my disappointment, during the load test the disk got nowhere near its 125MB/s maximum. It had briefly got there during the restore, but not during the load test. I talked the chap through doing the same for IOPS, but again, it was nowhere near its limit, 5000 in this case, during the load test.

With my brilliant hypothesis destroyed, I had to go away and have a think about what could be going on. To make my life easier, I decided to request access, and this was quickly granted. I took a look at a number of AWR reports, but I couldn’t find any smoking gun. I looked at the various Cloudwatch metrics available for the chosen ASM volume, including the “exceeded” checks, but nothing seemed amiss. Volume response was perfectly normal and didn’t change significantly during the load test. The CPU was also fine.

So, everything was fine. The disk performance was fine, the server performance was fine, but the database performance was terrible. About the only thing I had successfully proved was that there was no problem with the log writer as it was not complaining about performance problems during the load test. What could be causing the problem?

As I had looked at about everything else, I decided to look at the instance level IOPS and it is at this point that previous work I had done in understanding the Cloudwatch metrics meant that I chose a fortuitous metric to plot. I plotted the Maximum of the EBSReadOps. This is insightful, because it is not actually the Maximum of the EBSReadOps within the plot period (5 minutes in this case). You wouldn’t guess this from the AWS documentation, but it is actually the Maximum of all the VolumeReadOps samples within the plot period. This showed a peak at 5000 IOPS for almost the entire duration of the load test. As it is a maximum of the VolumeReadOps, I knew that this meant that one disk was reaching 5000 IOPS, and given that this was the specified maximum for all the disks in the disk group with the data files, which almost always will be the highest of any disk, it almost certainly meant that one of the disks in that disk group was reaching 5000 IOPS. I therefore plotted the Maximum VolumeReadOps for all the volumes in the disk group and this showed that not only did one disk reach 5,000 IOPS whilst the other remained at a few hundred, but that the disk reaching 5,000 IOPS was different at different times.

As an aside, ASM stripes data across all the disks in a disk group. This means that, apart from small variations, the load on one disk in a disk group will be the same as the load on all the others, and if the number of disks in a disk group is N, then the total load on the disk group will be N times the load on one disk in the disk group. This is why I have always plotted the load on a single disk in the disk group as they should all be the same. I also don’t normally use the instance level disk performance metrics as these will be muddied by the inclusion of all the ASM disk groups and non-ASM disks. This investigation proves my load assumption to be wrong, at least in certain situations.

Now that I knew what was causing the poor performance, I gathered all the information I could and logged an SR with Oracle. We were asked to run a script which detected imbalances in the distribution of files amongst the disks. This showed no imbalance. We were also asked to run a script which queried v$asm_disk_stat to determine what ASM thought of disk performance. I had been trying, but failing, to reproduce the problem myself. Therefore, I asked the application team to run a similar load test to the one that had shown the problem whilst the script was running. During the load test, the script kept crashing, probably due to the extremely high load, and so I kept having to restart it. It gave output similar to the following:

DiskPath – DiskName      Gr Dsk    Reads   Writes AvRdTm AvWrTm     KBRd     KBWr  AvRdSz  AvWrSz RdEr WrEr
<disk 1>		 3   0    45874     1862    1.7    1.5   195312     9765    4359    5370    0    0
<disk 2>		 3   1    81227     3486 1011.8 2373.8   292968    19531    3693    5737    0    0
<disk 3>		 3   2    83885     2202    1.8    1.4   292968     9765    3576    4541    0    0
<disk 4>		 3   3    88511     1542    1.8    1.3   390625     9765    4519    6485    0    0
<disk 5>		 3   4    67430     2226    1.7    1.5   292968     9765    4449    4492    0    0
<disk 6>		 3   5    50086     2334    1.8    1.4   195312     9765    3993    4284    0    0
<disk 7>		 3   6    62815     1666    1.8    1.3   292968     9765    4775    6002    0    0

Whilst this did show that <disk 2> had orders of magnitude worse read and write times, it didn’t show the level of imbalance seen in the IOPS. “Reads” may have varied across the disks, but the highest was only twice the lowest and this range seemed to be broadly consistent across all the samples, with the highest and lowest disks constantly changing.

Fortunately, throughout the test, I was also running a script capturing the iostat information. This did show the IOPS problem:

Device	        r/s	rkB/s	rrqm/s	%rrqm	r_await	rareq-sz
nvme11n         129	31546.4	7757.6	98.36	1.43	244.55
nvme12n1	169.8	39241.6	9691	98.28	1.39	231.1
nvme1n1	        139.4	33780.8	8305.8	98.35	1.43	242.33
nvme3n1	        154.2	38496.8	9470	98.4	1.46	249.65
nvme4n1	        155.4	38460	9459.6	98.38	1.44	247.49
nvme8n1	        4969.9	35118.8	3809.8	43.39	24.44	7.07
nvme9n1	        141.8	35424	8714.2	98.4	1.45	249.82

The device nvme8n1 is <disk 2>. This shows the reads per second (r/s), 4969.9, are about the IOPS limit of 5000 for <disk 2>, but only about 150 for the other disks in the disk group. It also shows that the read size (rareq-sz), 7.07, is much smaller for <disk 2> than for the other disks. This smaller size appears to be because the percentage of read requests that are merged (%rrqm), 43.39, is much smaller for <disk 2> than the other disks. The percentage value may not be so different, but at 98.4% of read requests being merged, the resulting read requests are quite large, but at 43% they are quite small. You can calculate the size as 4*(100/100-Merge percentage).

The following graphs from the iostat data give a better picture of the situation:

This graph shows the IOPS during a test run. Initially, all the disks are well below their 5000 IOPS limit at about 700 IOPS. Then part way through one disk suddenly starts performing 5000 IOPS. The disk doing so changes during the run, but from about a third of the way through the test until the end, one disk is almost always hitting its IOPS limit.

This graph shows the read throughput and shows a pattern that is more or less the inverse of the IOPS. When the disks are performing 700 IOPS, the throughput is high, but when one disk is hitting it IOPS limit, the throughput is low.

This graph is of the average read request size, rareq-sz in the iostat data. Six of the seven disks are all issuing 256KB read requests throughout the test. But, when a disk hits the IOPS limit, its read request size drops dramatically to about 7KB. This is the result of merging only about 40% of read requests.

All of this data and my analysis was then updated to the support ticket.

I had been doing some reading of the documentation and so I knew that on RH9, we had to use ASMLIBv3. This used the new io_uring (https://en.wikipedia.org/wiki/Io_uring) to perform asynchronous I/O. Further digging showed that it could be disabled by setting kernel.io_uring_disabled to 2 in /etc/sysctl.conf. This had no effect on existing processes, as the documentation suggested, so I bounced the test database and very quickly we had a problem:

SQL*Plus> show parameter parallel
ORA-00604: error occurred at recursive SQL level 1
ORA-01115: IO error reading block from file  (block # )
ORA-01110: data file 1: '<file name>’
ORA-15081: failed to submit an I/O operation to a disk

We can disable io_uring, but then we disable database I/O. Not a great thing to do. The reason the database can’t perform I/O is that, as the Oracle documentation states, ASMLIB will always use io_uring if the kernel has it compiled into it. See here and here. Note that not all the foot notes for the table at the top of the first link existed when I was investigating the problem.

We also had a support ticket open with AWS as our O/S support was not directly with Red Hat, but through AWS. This meant that we could not talk directly to Red Hat but had to go through AWS. I asked them if there was some way I could investigate what was happening with io_uring and the asynchronous I/O and merging of read requests. They suggested that I use blktrace. I did some research and gave it a go. It did take quite a bit of research to understand the output.

Using the “blktrace” and “blkparse” trace commands, I got output of the form shown below when all the disks are performing well. The fourth column is time since the start of the blktrace.

259,1   15        5     0.010433430  7788  A   R 13000472584 + 8 <- (259,17) 13000470536
259,1   15        6     0.010433528  7788  Q   R 13000472584 + 8 [ora_p00c_brd01q]
259,1   15        7     0.010433931  7788  M   R 13000472584 + 8 [ora_p00c_brd01q]
259,1   15        8     0.010434691  7788  A   R 13000472592 + 8 <- (259,17) 13000470544
259,1   15        9     0.010434833  7788  Q   R 13000472592 + 8 [ora_p00c_brd01q]
259,1   15       10     0.010435017  7788  M   R 13000472592 + 8 [ora_p00c_brd01q]
259,1   15       11     0.010435662  7788  A   R 13000472600 + 8 <- (259,17) 13000470552
259,1   15       12     0.010435744  7788  Q   R 13000472600 + 8 [ora_p00c_brd01q]
259,1   15       13     0.010435875  7788  M   R 13000472600 + 8 [ora_p00c_brd01q]

A = remapped for stack devices. This can effectively be ignored.

Q = request is queued

M = request is back merged, i.e. merged with the previous request.

I believe that the “+ 8” means plus 8 disk sectors which are each 512B, meaning each request is for 4KB, the block size of the disk. The value in the eighth column (e.g. 13000472592 ) is being incremented by 8 every time. This indicates that the requests are sequential. Requests are then submitted to the disks as seen below.

259,1   23       13     0.011020457     0  C   R 13000470528 + 512 [0]
259,1   23       14     0.011314195     0  C   R 13000471040 + 512 [0]
259,1   23       15     0.012456310     0  C   R 13000471552 + 512 [0]
259,1   23       16     0.013103586     0  C   R 13000472064 + 512 [0]
259,1   31        3     0.013565178     0  C   R 13254008848 + 512 [0]

C = completed. The I/O request has been completed.

When the process placing the request on the queues is an Oracle process (e.g. [ora_p00c_brd01q]), then the requests are all being merged into approximately 512*512B = 256KB requests. 256KB is the largest size of request the AWS SSDs can accept.

259,1   30    15851     0.951299927  9128  A   R 6101038592 + 8 <- (259,17) 6101036544
259,1   30    15852     0.951299998  9128  Q   R 6101038592 + 8 [ora_p00y_brd01q]
259,1   30    15853     0.951300640  9128  C   R 6101038592 + 8 [65525]
259,1   30    15854     0.951306630  9128  A   R 6101038600 + 8 <- (259,17) 6101036552
259,1   30    15855     0.951306706  9128  Q   R 6101038600 + 8 [ora_p00y_brd01q]
259,1   30    15856     0.951307037  9128  C   R 6101038600 + 8 [65525]
259,1   30    15857     0.951308194  9128  A   R 6101038608 + 8 <- (259,17) 6101036560
259,1   30    15858     0.951308267  9128  Q   R 6101038608 + 8 [ora_p00y_brd01q]
259,1   30    15859     0.951308489  9128  C   R 6101038608 + 8 [65525]

Some time later, we start to see completed requests with [65525] noted for them. This could be some kind of return code and indicate an error as it was [0] before. After this, we start to see requests being made by processes which are named in the format iou-wrk-#. These are io_uring worker processes. Importantly, requests from these never appear to be merged as can be seen from the below examples:

259,1    5    13089     9.997843095 179045  A   R 13321051832 + 8 <- (259,17) 13321049784
259,1    5    13090     9.997843175 179045  Q   R 13321051832 + 8 [iou-wrk-7818]
259,1   27    12039     9.997844854 179633  G   R 13064067512 + 8 [iou-wrk-7793]
259,1   27    12040     9.997845255 179633  D   R 13064067512 + 8 [iou-wrk-7793]
259,1   31    60712     9.997845727 177710  G   R 7698751592 + 8 [iou-wrk-7821]
259,1   31    60713     9.997846382 177710  D   R 7698751592 + 8 [iou-wrk-7821]
259,1   23    54510     9.997846745 176371  A   R 8240029560 + 8 <- (259,17) 8240027512
259,1   23    54511     9.997847072 176371  Q   R 8240029560 + 8 [iou-wrk-7830]
259,1   27    12041     9.997848707 179633  A   R 13064067656 + 8 <- (259,17) 13064065608
259,1   27    12042     9.997848927 179633  Q   R 13064067656 + 8 [iou-wrk-7793]
259,1   15    26618     9.997848940 178796  G   R 7698752008 + 8 [iou-wrk-7821]
259,1   31    60714     9.997849178 177710  A   R 7698752808 + 8 <- (259,17) 7698750760
259,1   31    60715     9.997849436 177710  Q   R 7698752808 + 8 [iou-wrk-7821]
259,1   27    12043     9.997849589 179633  G   R 13064067656 + 8 [iou-wrk-7793]
259,1   15    26619     9.997849612 178796  D   R 7698752008 + 8 [iou-wrk-7821]
259,1   27    12044     9.997849769 179633  D   R 13064067656 + 8 [iou-wrk-7793]
259,1   31    60716     9.997849918 177710  G   R 7698752808 + 8 [iou-wrk-7821]
259,1   31    60717     9.997850081 177710  D   R 7698752808 + 8 [iou-wrk-7821]

G = “get request To send any type of request to a block device, a struct request container must be allocated first.”

D = “issued A request that previously resided on the block layer queue or in the i/o scheduler has been sent to the driver.”

When the requests are submitted to the disks, they are no longer in order and, as they are not merged, are all 8*512B = 4KB in size.

259,1   31    60728     9.998885279     0  C   R 13576461984 + 8 [0]
259,1   31    60729     9.999061195     0  C   R 13576461976 + 8 [0]
259,1   31    60730     9.999309815     0  C   R 13885272288 + 8 [0]
259,1   31    60731     9.999361933     0  C   R 13885272152 + 8 [0]
259,1   31    60732     9.999510416     0  C   R 13885272464 + 8 [0]

I uploaded these finding to AWS as things were clearly looking like an O/S problem and not an Oracle one. I also dug into all the documentation from Oracle and Red Hat. This seemed to tell an interesting story.

19c version 19.19 and upwards is supported on Red Hat 9 version 9.5 and upwards
For Red Hat 9, you needed to use version 3 of ASMLIB if you want to use ASMLIB.
ASMLIB version 3 will use only io_uring if the kernel supports it, which is the case from Red Hat 9.3 onwards.
io_uring was introduced as a technology preview to Red Hat (https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/9.3_release_notes/technology-previews)
Technology previews (https://access.redhat.com/support/offerings/techpreview) “are not intended for production use”.

Logically then, if you want to use ASMLIB, you are not fully supported. How could we get to a situation where the Oracle documentation said you were supported, but in fact you were not, at least not by Red Hat.

Some time later, AWS decided to involve Red Hat and we very quickly had a meeting. They invited Red Hat and I invited Oracle. The meeting was very useful and enlightening. Very quickly, Red Hat agreed that we should not be using io_uring in production as it was a technology preview and therefore very likely to have bugs, one of which we were probably hitting. ASMLIB should therefore not be used in production with 19c on Red Hat.

Very soon after the meeting, Red Hat updated their documentation to reflect this. Oracle were somewhat slower, but with the help of Mike Dietrich (https://mikedietrichde.com/) the documentation was updated, although not as explicitly, as it just states that you should check the support status with Red Hat.

Given that I had more or less worked out the situation, the most interesting part of the meeting for me was that it appeared that the venders don’t talk to each other very much. Red Hat were surprised that Oracle could run on Red Hat 9 using Oracle’s ASMLIB given that they had not produced an ASMLib kernel module for Red Hat 9. Prior to Red Hat 9 such a module was required, but ASMLIB version 3 no longer required it. Oracle seemed surprised that io_uring was in technology preview. Of course, when I say “Oracle” or “Red Hat” I am really only referring to the support engineers on the call.

Immediately after the meeting, I alerted every appropriate person in my company that I could get hold of about this situation, telling them not to use Red Hat 9 with ASMLIB until we had some resolution.

One important thing to note is that all the documentation does have disclaimers, especially when it comes to one product running on a third party’s product, such as Oracle running on Red Hat. Oracle certainly does not test everything on everything. However, every appropriate person I spoke to in my company thought that 19c was supported on Red Hat 9, so the documentation at least confused people. Hopefully, now, with the additional notes, no more people will be confused.

Conclusions

When there are major changes, it is worth reading all the documentation very carefully. Just because vendor X says that their product is supported on vendor Y’s product, does not necessarily mean that every aspect of vendor X’s product is supported on every aspect of vendor Y’s product. For instance, 19c using ASM with udev rules (https://en.wikipedia.org/wiki/Udev) works perfectly well on Red Hat 9 and that is what we are using. Understand what has changed and how it interacts with other components. Check for yourself that what vendor X’s product is doing is actually supported by vendor Y’s product. Read the documentation with a sceptical eye. Unless is states explicitly that A works with B, don’t assume it does, even if it is implied.

Having said all of that, the vast majority of the time, your scepticism will not be justified as I am sure that neither Oracle nor Red Hat were in any way attempting to mislead. However, occasionally, you will find something like this, meaning that you could have gone into production with an unsupported system as we almost did.