This is more by way of amusement than an actual investigation. I had been asked to monitor a test of a new piece of application functionality that utilised AWS. It spun up a large number of AWS servers, each of which then ran 36 instances of a particular process in parallel. The concern from the application developers was that this would cause a connection storm and the database would start dropping connections. My concern, when I heard about the test, was that the workload would bring the database and its hosts to their knees.
Let me just take a moment to explain why the application developers were afraid of a connection storm. In the past we had had occurrences of an application getting “ORA-12514: TNS:listener does not currently know of service requested in connect descriptor”. These would occur sporadically and only for a few minutes. I had investigated these and found no problems with the services and no errors in the alert logs. However, the listener logs did show the aforementioned error. From a quick bit of grep, sort and uniq -c on the listener log, I was able to determine that just before these errors started to appear, another application was issuing a large number of connections to the same database, of the order of several hundred a second. From various notes in MOS I concluded that the listener was deciding that the database was close to running out of processes and so was protecting it by dropping new connection attempts. The listener is periodically updated with information about the database, such as the number of processes in use. Between these updates, however, the listener only knows about new connections, not about connections that have been closed. It therefore makes assumptions about the likely number of processes in use and so can start to drop connections even though the database has not actually run out of processes. This is what the application developers were afraid of.
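As an aside, if you want to see how close a database actually is to the limit the listener is trying to second-guess, v$resource_limit shows current and high-water usage of processes and sessions against their configured maximums. A minimal sketch (nothing here is specific to the system in this story):

-- Current and peak usage of processes and sessions versus their limits
SELECT resource_name,
       current_utilization,
       max_utilization,
       limit_value
FROM   v$resource_limit
WHERE  resource_name IN ('processes', 'sessions');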
Anyway, back to the test. It started relatively slowly and then got very silly very fast. This can be seen from the below OEM active session graph:
As you can see, the server load was not too high, because the database had been capped at 16 cores using instance caging. From the active sessions, though, it is clear that there were hundreds more sessions wanting CPU than could get it. This clearly shows how instance caging can protect a server from CPU hogs. As I was watching the test I was initially not too concerned, at least for the server, as it was clearly not running out of CPU, and seeing a large number of sessions waiting on I/O is not normally a problem. However, when I took a look at the type of I/O being performed, it was unfortunately almost exclusively direct path reads. Hundreds of sessions performing direct path reads at once is a recipe for swamping your I/O capacity, at least at the server level, and this was indeed what was happening:
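As a brief aside before we get to the I/O numbers: instance caging is nothing more exotic than setting cpu_count and having a resource manager plan active. Something along these lines caps an instance at 16 cores (the plan name below is just the default plan, not necessarily what this system was using):

-- Cap the instance at 16 CPUs; the cap is only enforced while a
-- resource manager plan is active.
ALTER SYSTEM SET cpu_count = 16 SCOPE = BOTH;
ALTER SYSTEM SET resource_manager_plan = 'DEFAULT_PLAN' SCOPE = BOTH;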
The server was running out of puff at about 1.6GB/s. The instance caging had stopped the test from exhausting the CPU on the server, but it had not stopped the test from exhausting the I/O capacity, and so I asked for the test to be terminated. As the database, rather than the server, had run out of CPU, the normal background processes were not running as usual and the AWR snapshots were not captured during the height of the test. However, below are the top ten wait events from the period when the test was tailing off.
Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                           Total Wait       Wait   % DB Wait
Event                                Waits Time (sec)    Avg(ms)   time Class
------------------------------ ----------- ---------- ---------- ------ --------
resmgr:cpu quantum                 333,833     502.7K    1505.96   58.0 Schedule
direct path read                 1,180,511     332.4K     281.60   38.4 User I/O
buffer busy waits                  221,015      78.7K     356.04    9.1 Concurre
enq: TX - index contention         118,997      29.1K     244.55    3.4 Concurre
enq: TX - contention                   588      26.5K   45083.11    3.1 Other
db file sequential read            274,243      21.7K      78.97    2.5 User I/O
read by other session               39,963      15.2K     379.85    1.8 User I/O
DB CPU                                         4409.7                .5
gc buffer busy acquire              14,375     3456.5     240.45     .4 Cluster
log file sync                       17,781     1498.3      84.26     .2 Commit
This may not show the worst of it, but it does give a taste.
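Since the AWR snapshots missed the peak, the nearest substitute is ASH. A query along these lines (the one-hour window is purely illustrative) would show which statements were behind the direct path reads while the test was running:

-- Which SQL statements were waiting on direct path read during the test?
SELECT sql_id,
       COUNT(*) AS samples
FROM   v$active_session_history
WHERE  event = 'direct path read'
AND    sample_time > SYSTIMESTAMP - INTERVAL '60' MINUTE
GROUP  BY sql_id
ORDER  BY samples DESC;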
Analysis after the test showed that the direct path reads were all being caused by one SQL statement which was performing a full table scan instead of index access. There was a relevant index, but it appears that the object-relational mapping tool in use was helpfully applying a TRIM to several of the table columns, thus stopping the intended index from being used. Once this was fixed, a rerun of the test troubled neither the database nor the server.
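To illustrate the sort of thing the ORM was doing (the table and column names below are made up for the example), the generated predicate wrapped the indexed column in TRIM, which makes an ordinary index on that column unusable; the cure is either to fix the generated SQL or, failing that, to add a function-based index:

-- Hypothetical example: the ORM wraps the indexed column in TRIM(),
-- so the ordinary index on customer_ref cannot be used -> full table scan.
SELECT * FROM orders WHERE TRIM(customer_ref) = :b1;

-- Fixed predicate: compare the raw column and the existing index is usable again.
SELECT * FROM orders WHERE customer_ref = :b1;

-- Alternative if the TRIM cannot be removed: a function-based index.
CREATE INDEX orders_cust_ref_trim_ix ON orders (TRIM(customer_ref));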