I was asked to look at this problem as it had been going on for a number of hours and several other DBAs had already been involved. I took a look at both instances of the database:
Both graphs show times in GMT+1. The first instance appears to be active, but nothing stands out as a problem. The second instance appears to be almost idle, which could have been a problem but, due to the service configuration, was not.
About the only DBA activity mentioned in the problem ticket related to shared pool errors: shared_pool_size had been increased to no avail and the database had then been bounced. The restart happened at around 02:00 GMT+1; however, this had not resolved the problem either. Restarting a database introduces all kinds of variables and results in poor performance whilst the buffer cache is repopulated. The shared pool had grown to more than 6GB prior to the restart, so the increases of shared_pool_size from 576M to 640M and then to 1G would have had no effect, as this parameter simply sets a minimum. The first shared pool errors occurred at 22:25 GMT+1 and, as the problem ticket was created at 20:45 GMT+1, they were unlikely to be the cause; they were possibly a symptom or merely a coincidence.
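For reference, the actual size of the shared pool, as opposed to the shared_pool_size floor, can be checked with something like the following; this is a generic sketch rather than anything specific to this incident:

```sql
-- Sketch: compare the shared pool's current size with its configured minimum.
-- Under automatic SGA management, shared_pool_size only sets a floor.
SELECT component,
       ROUND(current_size / 1024 / 1024) AS current_mb,
       ROUND(min_size     / 1024 / 1024) AS min_mb
FROM   v$sga_dynamic_components
WHERE  component = 'shared pool';
```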
With no obvious smoking gun, I took a look at the alert logs of both instances, but saw nothing that I could identify as causing a performance problem. There were various “Immediate Kill” statements and some nasty looking Oracle errors, but nothing that stood out as performance related. There were, of course, the various shared pool errors from before the restart of the database. An AWR report also did not show any significant I/O problems.
There was mention in the problem ticket of a SQL ID relating to the batch; however, this did not appear to run for any significant time and had not run for a number of hours, or at least had not been captured by AWR. With no evidence that the database was having problems, I joined the problem conference call. There was already a DBA on the call and he was handling the situation. The tone of the call was that the issue had been resolved, so I remained silent and ate my breakfast, as I had not had a chance to do so until then.
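For reference, whether a given SQL ID has been captured recently by AWR can be checked along these lines. This is a sketch rather than the exact query used during the incident: :sql_id is a placeholder for the actual value from the ticket, and querying the dba_hist views assumes a Diagnostics Pack licence.

```sql
-- Sketch: when did AWR last capture this SQL ID, and how long did it run?
-- :sql_id is a placeholder for the SQL ID quoted in the problem ticket.
SELECT s.snap_id,
       sn.end_interval_time,
       s.executions_delta,
       ROUND(s.elapsed_time_delta / 1e6) AS elapsed_secs
FROM   dba_hist_sqlstat s
       JOIN dba_hist_snapshot sn
         ON  sn.snap_id         = s.snap_id
         AND sn.dbid            = s.dbid
         AND sn.instance_number = s.instance_number
WHERE  s.sql_id = :sql_id
ORDER  BY s.snap_id DESC;
```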
It was mentioned on the call that a backup copy had been killed and that this had produced an improvement in performance. However, after a while the tone of the call changed and it became clear that there were still performance problems, so I ventured that it did not look like there was a database problem, or at least not an obvious one. I asked for an example of processes that were running slowly and was pointed at sessions running under a specific user. At the time there were three such sessions running:
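Sessions for a given account can be checked with something like the following; this is a sketch, with '&app_user' standing in for the actual application username.

```sql
-- Sketch: what the application account's sessions are doing right now.
-- '&app_user' is a placeholder for the actual username.
SELECT sid,
       serial#,
       status,
       sql_id,
       event,
       wait_class,
       last_call_et AS secs_in_current_call
FROM   v$session
WHERE  username = UPPER('&app_user');
```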
These sessions were hardly doing anything in the database, so if there was any slowness it had to be due to something outside the database. There had been mention of network related problems, and the copying of a backup had been killed, so perhaps this should have pointed in the direction of the network; but the networking team had said there were no problems, at least inter-data centre.
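To put "hardly doing anything" in rough numbers, Active Session History can show how much database time such sessions have actually accumulated. Again a sketch, with the same placeholder, and ASH queries also assume a Diagnostics Pack licence:

```sql
-- Sketch: approximate DB time per session over the last hour.
-- Each ASH sample represents roughly one second of active database time.
SELECT session_id,
       session_serial#,
       COUNT(*) AS approx_active_seconds
FROM   v$active_session_history
WHERE  user_id = (SELECT user_id FROM dba_users
                  WHERE  username = UPPER('&app_user'))
AND    sample_time > SYSDATE - 1/24
GROUP  BY session_id, session_serial#
ORDER  BY approx_active_seconds DESC;
```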
Whilst performing these investigations I also noted that a particular SQL statement was using a large amount of CPU:
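Such statements typically stand out in a query of this sort against v$sqlstats; this is a generic sketch rather than the exact query from the incident:

```sql
-- Sketch: top 10 SQL statements by CPU since they were loaded into the shared pool.
SELECT *
FROM  (SELECT sql_id,
              ROUND(cpu_time / 1e6)     AS cpu_secs,
              ROUND(elapsed_time / 1e6) AS elapsed_secs,
              executions
       FROM   v$sqlstats
       ORDER  BY cpu_time DESC)
WHERE  ROWNUM <= 10;
```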
Clearly a statement that needed addressing but, given that the server had 16 cores with hyperthreading enabled and there was no other significant CPU load on the server, it would not be a problem unless it was part of the batch processing. As it was not, being part of a UI, I filed it for later review. Later, however, when about 25 processes were running the SQL statement and CPU usage on the server exceeded 90%, it did become a problem. From discussions after the incident was resolved, it appears that a lot of time during the incident had been spent investigating this SQL statement.
Focus then turned to the network and it was immediately clear that there was an issue. Note that times in the two graphs below are in GMT+0:
It appears that the backup copy was not the only process using network bandwidth. However, killing it at about 06:00 GMT+0 did coincide with a performance improvement and a change in the network profile indicating that some bandwidth had been freed up. An RMAN active duplicate using 5 channels was then killed at 08:45 GMT+0 and this led to the drop in load seen. The server hosting the database had only 1Gbit network cards, so 125MB/s (1Gbit/s divided by 8 bits per byte) is the maximum that can be expected. Both the backup copy and the RMAN duplicate were to the same server and its inbound traffic can be seen below; times are again in GMT+0:
By the time the second process was killed, most of the critical batches had completed and shortly afterwards the problem incident was placed into a monitoring state.
Lessons
- Heavy utilisation of a resource does not mean it is causing a performance problem. If no one cares about the time the process using the resource takes and the limits of the resource have not been reached, then it is not a problem and you should move on.
- Just because everyone else is saying the problem is with the database does not mean that the problem actually is with the database. Only the DBA has the metrics to determine if there is a problem with the database. If there are no obvious problems with the database, then ask for specifics of what is slow, ideally including SQL IDs and what good performance looks like. If these processes are not performing badly in the database, then it is time to start looking outside the database.
- If the database does not seem to be the cause, then ask specific questions of other teams. Apparently the networking team had concluded that there were no problems between the data centre hosting the application servers and the data centre hosting the database. No one, however, had looked at the load on the specific servers involved.
- Breakfast is not the most important meal of the day: https://www.theguardian.com/lifeandstyle/2016/nov/28/breakfast-health-america-kellog-food-lifestyle