
“Shocking” database performance due to poor I/O

Symptoms

  • Very bad I/O performance

Cause

  • FC switch issues

Solution

  • Reboot of the SAN fabric, which resulted in block corruption
  • Block corruption fixed by taking a level 0 backup on the standby, transferring it to the primary, cataloguing it and then performing block recovery.

Investigation

This was initially billed as a performance problem with rather bad performance on two dependent databases. However, it morphed into something a little more unusual. I was asked to take a look and, after an initial poke around, I found a few things of minor interest, but certainly no shocking performance. I joined the chat and asked for some details. I was told that a particular batch job that normally took 35-40 mins was now taking 2 hours and that it starts at 01:00 UTC. I was also given a SQL ID. I took a quick look and saw that the SQL statement was spending almost all of its time on I/O. By now a storage chap was on the chat and, prior to my involvement, another DBA who was looking at this had sent out details of the I/O performance. I therefore decided to generate a graph of my favourite database metric, “db file sequential read”, over the last week. This gave the below graph on one of the two databases:
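The data behind a graph like this comes from the AWR history; a query along these lines is a sketch of how to pull it (the DBA_HIST views require Diagnostics Pack licensing, and the seven-day window matches the graph):

-- Average single-block read time per AWR snapshot for the last 7 days (sketch)
select sn.end_interval_time,
       se.instance_number,
       round( (se.time_waited_micro
               - lag(se.time_waited_micro) over
                   (partition by se.instance_number order by se.snap_id))
            / nullif(se.total_waits
               - lag(se.total_waits) over
                   (partition by se.instance_number order by se.snap_id), 0)
            / 1000, 2) avg_wait_ms
from   dba_hist_system_event se
       join dba_hist_snapshot sn
         on  sn.snap_id         = se.snap_id
         and sn.dbid            = se.dbid
         and sn.instance_number = se.instance_number
where  se.event_name = 'db file sequential read'
and    sn.begin_interval_time > sysdate - 7
order  by se.instance_number, sn.end_interval_time;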

Clearly then there was an I/O problem. Just after generating this graph and attempting to put it into the chat, the chap who had asked me to look at the problem came over and said there was some issue with the FC switches and that they needed to reboot the SAN fabric, or something like that. Not being a storage expert, I nodded sagely and tried to give the impression I knew what he was talking about. He also asked me to monitor the databases on the host whilst this all happened.

I brought up a few PuTTY windows and started tailing the alert logs of the affected instances. In the chat, the chap who was going to perform the work asked what order to do it in and I suggested the least used node first. That way, if things did go wrong, it would have the least impact. Everything went fine on the first server and so the work was started on the second. As soon as the work started, messages of the following form started to appear in the alert log of one of the databases:

Error 3150 trapped in 2PC on transaction 494.30.292857. Cleaning up.
Error stack returned to user:
ORA-03150: end-of-file on communication channel for database link
ORA-02063: preceding line from <database link name>

Error 3150 trapped in 2PC on transaction 467.0.255498. Cleaning up.
Error stack returned to user:
ORA-03150: end-of-file on communication channel for database link
ORA-02063: preceding line from <database link name>

Queries using the database link were being terminated. This was entirely unexpected as the target database was on a separate cluster and should have been completely unaffected by the work. The exact reason for this is unknown at the time of writing. I quickly took a look at the relevant database and found that one instance had crashed when the work was performed. Its alert log showed that the instance had reached the mount state but was not getting any further. This was a bit concerning, but turned out to be an incorrect configuration in the clusterware, with “Start options” being “mount” rather than “open”. That mystery solved, I looked at the alert log of the other instance and it did not look healthy. There were all kinds of errors in the log, but the one of most concern was the following:

Errors in file
<trace file name> (incident=408047):
ORA-01578: ORACLE data block corrupted (file # 39, block # 2239719)
ORA-01110: data file 39:
'<data file name>'

A quick look at v$database_block_corruption confirmed that we had at least one block with an incorrect checksum. As every good DBA knows, if you have block corruption you use the RMAN block recovery command. I therefore attempted to run this on the least active instance. Unfortunately it had run out of processes, so I switched to the other node and started the command using “RECOVER CORRUPTION LIST;”. With this running, I turned to the least active instance, which was having problems due to the processes issue. After discussing things with the product team and my boss, we decided to double the processes parameter and bounce the instance. That done, all the errors apart from those relating to the block corruption disappeared.
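For reference, the check and the first recovery attempt were along these lines (a sketch; the processes value and instance name are placeholders):

-- Sketch of the check against v$database_block_corruption mentioned above
select file#, block#, blocks, corruption_change#, corruption_type
from   v$database_block_corruption;

RMAN> recover corruption list;

-- Doubling processes needs an spfile change and an instance bounce (value is illustrative)
alter system set processes=2000 scope=spfile sid='<instance>';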

Feeling that everything was now either fixed or about to be fixed, I sat back to wait for the block recovery command to finish reading the level 0 backup piece containing the block. It had taken about 3.5 hours to create the backup piece, so it seemed like it might be a long wait. However, after an hour it completed reading it and started restoring and applying the archived redo logs.
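Progress of this sort of long-running RMAN work can be followed in v$session_longops; a sketch:

-- Rough progress of in-flight RMAN operations (sketch)
select sid, serial#, opname, sofar, totalwork,
       round(sofar / totalwork * 100, 1) pct_done,
       time_remaining
from   v$session_longops
where  opname like 'RMAN%'
and    sofar < totalwork;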

Everything seemed to be going well, but after a while I realised the archived redo logs were being applied rather slowly. At this point we were faced with something of a dilemma. The product team wanted the second instance opened, but I was concerned that opening an instance whilst block recovery was underway could cause problems. What would happen if the block recovery was cancelled or crashed? Would it leave a block in a past state, resulting in what would effectively be a fuzzy datafile? It had been years since I had done block recovery and so I was unsure. They also stated that the object containing the corrupt block could be truncated. However, I thought that as recovery had started, the object could not be truncated, as it contained a block that was from an older point in time. I decided to speed things up by setting off a number of RMAN runs to restore the archived logs ready for the recovery process to apply them. Even though the block recovery session had automatically allocated 8 channels, only one was being used. The archived log restore would use 8 channels and I set three of them off, all using a “time between” or “from time” clause so that they would restore archived logs from both threads and would not clash with each other. I then sat back to monitor this in the knowledge that I had 24+1 channels all restoring or applying archived redo logs.
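Each of the extra restores would have been a small RMAN run of roughly this shape (a sketch; the timestamps are illustrative, with each run covering a different, non-overlapping window):

run {
  # restore one slice of the archived logs from both threads (times are illustrative)
  restore archivelog
    from time  "to_date('07-FEB-2018 00:00', 'DD-MON-YYYY HH24:MI')"
    until time "to_date('07-FEB-2018 04:00', 'DD-MON-YYYY HH24:MI')";
}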

After a while I realised that the extra sessions were not actually restoring archived logs very fast. After checking in OEM, I realised that they were all waiting on “control file sequential read” and were not restoring archived logs at all. This was not good. After doing some calculations, it was clear that we were looking at days to recover, and there was the possibility that archived logs were being generated faster than they were being applied. At this point we had a meeting with the product team, and a colleague was brought in to whom I would hand over when I left for the day. The product team was happy to wait for the restore, and after the meeting I discussed the situation with the colleague. He pointed out that there were lots of messages in the alert log of the A node about SMON incidents:

Dumping diagnostic data in directory=[cdmp_20180207125309], requested by
(instance=2, osid=18627 (SMON)), summary=[abnormal process termination].
Dumping diagnostic data in directory=[cdmp_20180207125539], requested by
(instance=2, osid=18627 (SMON)), summary=[abnormal process termination].
Dumping diagnostic data in directory=[cdmp_20180207125814], requested by
(instance=2, osid=18627 (SMON)), summary=[abnormal process termination].
Dumping diagnostic data in directory=[cdmp_20180207130049], requested by
(instance=2, osid=18627 (SMON)), summary=[abnormal process termination].
Dumping diagnostic data in directory=[cdmp_20180207130324], requested by
(instance=2, osid=18627 (SMON)), summary=[abnormal process termination].

This looked bad and so I decided to start killing the restore sessions, obviously not including the one performing the block recovery. Part way through this process the instance performing block recovery crashed, possibly because of killing the sessions, possibly for other reasons. Reviewing these error messages for this write-up, it would appear that they were actually the result of the block corruption errors being detected on the B node (instance=2), with SMON on that node requesting the diagnostic dumps. They were therefore not an issue and were not being caused by the extra restore sessions.
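Identifying the RMAN channel sessions before killing them can be done from gv$session; a sketch (sid and serial# are placeholders):

-- Find the RMAN channel sessions and what they are waiting on (sketch)
select inst_id, sid, serial#, program, event
from   gv$session
where  program like 'rman%';

-- Kill a chosen session, leaving the block-recovery session alone (placeholders)
alter system kill session '<sid>,<serial#>' immediate;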

Fortunately the crash of the instance did not result in the block recovery leaving a block in a past state. The corruption errors continued to appear in the alert log of the running instance, so it would appear that block recovery is an in-memory operation until it completes. Clearly, block recovery by this method was going to take too long, so we had to switch to another approach.

In similar situations in the past, the datafile has been restored from an image copy backup of the same datafile on the standby database. However, this would require the entire datafile to be offline, which would cause disruption, albeit not for very long. During the time I had been monitoring the restore, I had hit on the idea of copying across a backup of the datafile and then performing block recovery from that. As it was getting rather late, I discussed this idea with my colleague and we decided to try it. An “incremental level 0” backup of the appropriate datafile was taken on the standby. This was then transferred to the production host and catalogued. Block recovery was then tried again. This time it took only a few minutes and the block was recovered:

RMAN> catalog backuppiece '<backup piece>';
using target database control file instead of recovery catalog
cataloged backup piece
backup piece handle=<backup piece>
RECID=119204 STAMP=967484691
RMAN> recover corruption list ;
Starting recover at ...
allocated channel: ...
channel ORA_DISK_1: restoring block(s)
channel ORA_DISK_1: specifying block(s) to restore from backup set
restoring blocks of datafile 00039
channel ORA_DISK_1: reading from backup piece
<backup piece>
channel ORA_DISK_1: piece
handle=<backup piece>
tag=<tag>
channel ORA_DISK_1: restored block(s) from backup piece 1
channel ORA_DISK_1: block restore complete, elapsed time: 00:02:55
starting media recovery
archived log for thread 2 with sequence 635744 is already on disk as file
...
media recovery complete, elapsed time: 00:02:02
Finished recover ...
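For completeness, the standby-side backup that produces a piece like the one catalogued above would look something like this (a sketch; the datafile number is the one from the corruption error and the format string is a placeholder):

RMAN> backup incremental level 0 datafile 39
        format '<destination>/df39_for_blockrecover_%U';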

Lessons

  • Block recovery is crash safe. The corrupt block appears not to be touched until a fully restored version is created at the end of the recovery.
  • When faced with a corrupt block, don’t immediately issue the block recovery command. Look at how far back the level 0 backup is and how many archived redo logs would be needed to recover it (a couple of queries for this are sketched after this list). It is quite possible that taking a backup of the relevant data files on the standby and recovering from that will be far quicker.
  • Pay close attention to the messages in the alert logs. I was looking at the corrupt block messages appearing in the running instance’s alert log, but it did not click that this meant the block had not been touched and that we could therefore happily kill the recovery operation. I also started killing the archived log restore sessions in the mistaken belief that they were causing SMON core dumps. Again, this was not the case, as a closer examination of the error messages showed.
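A sketch of the sort of checks meant in the second point above (file 39 is the datafile from this incident):

-- How old is the most recent level 0 backup of the affected datafile? (sketch)
select max(checkpoint_time) last_level0
from   v$backup_datafile
where  file# = 39
and    incremental_level = 0;

-- Roughly how much archived redo has been generated since then, per thread?
select thread#, count(*) logs,
       round(sum(blocks * block_size) / 1024 / 1024 / 1024, 1) gb
from   v$archived_log
where  first_time > (select max(checkpoint_time)
                     from   v$backup_datafile
                     where  file# = 39
                     and    incremental_level = 0)
group  by thread#;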