
Database running slowly after data centre switch

Symptom

  • Lost block wait events.
  • Huge number of cluster waits.
  • Database effectively inoperable.

Cause

  • Network hardware problem.
  • Lost master of primary key index.

Solutions

  • Stopping the faulty hardware component sooner may well have prevented the second problem.
  • Bouncing an instance – bouncing any instance may well work, but possibly only one specific instance will.

Investigation

I was asked to take a look at a database as it had been getting slower since its data centre switch at the weekend. The OEM graphs didn’t show anything immediately odd, so I took a look at a global AWR report and my old friend, DB file sequential read. The AWR report showed something I had not seen before:


                               Wait                                   Event                      Wait Time                    Summary Avg Wait Time (ms)
       ---------------------------------------------------   ----------------------   ------------------------------   ----------------------------------------
  I#   Class      Event                                             Waits %Timeouts        Total(s) Avg(ms) %DB time        Avg      Min      Max  Std Dev  Cnt
----   ---------- ----------------------------------------   ------------ ---------   ------------- ------- --------   -------- -------- -------- -------- ----
   *              DB CPU                                              N/A       N/A       88,262.82     N/A    58.58                                          5
       User I/O   db file sequential read                      12,842,160       0.0       12,948.18     1.0     8.59       1.38     0.95     2.80     0.80    5
       Cluster    gc cr block lost                                 54,761       0.0        9,377.80   171.2     6.22     157.82   120.96   173.27    22.35    5
       Cluster    gc current block lost                            45,393       0.0        8,179.60   180.2     5.43     165.85   122.38   181.44    25.37    5
       Cluster    gc buffer busy acquire                        1,092,088       0.0        8,057.69     7.4     5.35      10.34     4.87    21.02     6.46    5
       User I/O   db file parallel read                         3,587,714       0.0        5,468.98     1.5     3.63       2.24     1.45     3.77     0.94    5
       Applicatio enq: TX - row lock contention                   295,614      93.5        4,309.05    14.6     2.86      14.68    11.95    16.96     2.44    5
       Cluster    gc cr multi block request                     3,880,981       0.0        3,973.71     1.0     2.64       3.10     0.67     5.40     2.28    5
       Cluster    gc cr grant 2-way                             8,007,630       0.0        2,958.13     0.4     1.96       0.38     0.36     0.43     0.03    5
       Cluster    gc current block busy                           162,712       0.0        1,219.86     7.5     0.81       7.45     6.74     9.35     1.07    5
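
(As an aside, the cluster-wide report above is the standard global AWR report; a minimal way to generate one from SQL*Plus is shown below, assuming the usual Diagnostics Pack licensing.)

-- Sketch: generate the RAC-wide (global) AWR report; awrrpt.sql produces the
-- equivalent single-instance report.
@?/rdbms/admin/awrgrpt.sql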

Lost blocks. What did that mean? The MOS note “Troubleshooting gc block lost and Poor Network Performance in a RAC Environment (Doc ID 563566.1)” provides answers. Given that there were no CPU issues, this was most likely a problem with the private interconnect network. Running “netstat -s” also showed the problem:

$ while [[ 1 ]]
> do
> netstat -s | grep -E 'fragments dropped after timeout|packet reassembles failed'
> sleep 60
> done
550621 fragments dropped after timeout
566501 packet reassembles failed
550648 fragments dropped after timeout
566528 packet reassembles failed
550655 fragments dropped after timeout
566535 packet reassembles failed
550661 fragments dropped after timeout
566541 packet reassembles failed
550670 fragments dropped after timeout
566550 packet reassembles failed

I explained to the network team what this all meant and they started looking for the problem. They asked me to run “ping” across the interconnect, but that showed no problems, so they continued investigating. Whilst the network team were investigating, I took a look at DB file sequential read over the last few days:

The I/O was not as good as it had been before the data centre switch on the 1st, but it was in fact slowly improving. It subsequently turned out that performance had not been getting worse; it had simply been generally worse ever since the data centre switch.
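
For reference, this sort of day-by-day trend can also be pulled straight from the AWR history rather than a graph; a minimal sketch, assuming the dba_hist views are available and ignoring counter resets across instance restarts:

-- Sketch: average "db file sequential read" time per AWR snapshot, per instance,
-- calculated from deltas of the cumulative counters in dba_hist_system_event.
select sn.instance_number,
       sn.begin_interval_time,
       round( ( ev.time_waited_micro
                - lag(ev.time_waited_micro) over
                    ( partition by ev.instance_number order by ev.snap_id ) )
              / nullif( ev.total_waits
                        - lag(ev.total_waits) over
                            ( partition by ev.instance_number order by ev.snap_id ), 0 )
              / 1000, 2 ) avg_ms
from   dba_hist_system_event ev
join   dba_hist_snapshot sn
on     sn.snap_id         = ev.snap_id
and    sn.dbid            = ev.dbid
and    sn.instance_number = ev.instance_number
where  ev.event_name = 'db file sequential read'
order  by ev.instance_number, ev.snap_id;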

We therefore had two problems, both outside the database, so whilst the network and storage teams investigated their end of things we moved on to looking at individual SQL statements to see how they were being affected. The database makes extensive use of PL/SQL packages, so one call can result in many SQL statements. The product tracks performance at the procedure-call level, so things are not quite as simple as looking at individual SQL statements by SQL ID. You can, however, use the ASH data to help in this situation. There are four columns of interest in v$active_session_history and dba_hist_active_sess_history:

desc v$active_session_history
...
PLSQL_ENTRY_OBJECT_ID          NUMBER
PLSQL_ENTRY_SUBPROGRAM_ID      NUMBER
PLSQL_OBJECT_ID                NUMBER
PLSQL_SUBPROGRAM_ID            NUMBER
...

These relate to the OBJECT_ID and SUBPROGRAM_ID columns in dba_procedures. The two “ENTRY” columns identify the top-level PL/SQL call made by the session, whilst the other two identify the lowest-level PL/SQL unit being executed at the time of the sample. You can therefore either use the ASH data directly, or work out the SQL ID of the top-level call and then use the AWR data. From this data it was clear that some PL/SQL calls were being affected by the I/O and some by the cluster waits, which were no doubt the lost blocks.
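
As an illustration of the first approach, here is a minimal sketch that attributes ASH samples to the top-level PL/SQL entry point and wait event; the use of the v$ view and the simple grouping are just for brevity:

-- Sketch: attribute ASH samples to the top-level PL/SQL call and wait event
-- via the PLSQL_ENTRY_* columns and dba_procedures.
select p.owner,
       p.object_name,
       p.procedure_name,
       nvl(ash.event, 'ON CPU') as event,
       count(*)                 as samples
from   v$active_session_history ash
join   dba_procedures p
on     p.object_id     = ash.plsql_entry_object_id
and    p.subprogram_id = ash.plsql_entry_subprogram_id
group  by p.owner, p.object_name, p.procedure_name, nvl(ash.event, 'ON CPU')
order  by samples desc;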

By this time the network team had determined that the data arriving at the switch from one of the private interconnect interfaces had invalid CRC values, which was no doubt the cause of the lost blocks. It was therefore decided to move load from the problem server to a relatively idle node that was historically only used to load and process data over the weekend. As it was generally idle, this did not represent a loss of resources. This did reduce the “lost block” waits, but not to zero, and not all sessions could be moved gracefully.

At this point the investigation effectively left the database and moved into the domain of other teams. However, two days later, whilst still eating my breakfast, I was asked to join a conference call about a bigger problem with the same database. I took a quick look at OEM and it was clear that something was very wrong; this was no longer just a performance problem.

The images above are for instances two to five, as instance one didn’t show a problem due to its lack of load. The cluster waits were almost exclusively on one table, as were the Concurrency waits, which were “enq: TX - index contention”. The Application waits were “enq: TX - row lock contention” and were on a second table. The fact that the problem seemed to be localised to one or two tables led to a round of discussions about new partitions and the performance problems that can result from them. I didn’t think this was the case, and so used the time to continue investigating. By this time I had already asked an experienced colleague to take a look, as I was struggling to come up with a reason for the problem.
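
For what it’s worth, this kind of per-object attribution can be confirmed directly from the ASH data; a rough sketch, where the wait-class filter and the 30-minute window are just illustrative:

-- Sketch: which segments the cluster / concurrency / application waits are
-- concentrated on, from recent ASH samples.
select o.owner,
       o.object_name,
       o.subobject_name,
       ash.event,
       count(*) as samples
from   v$active_session_history ash
join   dba_objects o
on     o.object_id = ash.current_obj#
where  ash.wait_class in ('Cluster', 'Concurrency', 'Application')
and    ash.sample_time > sysdate - interval '30' minute
group  by o.owner, o.object_name, o.subobject_name, ash.event
order  by samples desc;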

The AWR reports did not give me much help, but did point to the “gc buffer busy” waits primarily being against the current partition of the primary key index of the table with the cluster waits. I enabled tracing on my session and tried a simple count(*) from the relevant partition. This seemed to hang until I killed it:

PARSING IN CURSOR #139899315332360 len=76 dep=0 uid=2660 oct=3 lid=2660 tim=...
select count(*) from <table> partition ( <partition> )
END OF STMT
WAIT #139899315332360: nam='library cache lock' ela= 403 handle address=526890006832 lock address=531830361592 100*mode+namespace=3276802 obj#=<obj# 3> tim=...
WAIT #139899315332360: nam='library cache lock' ela= 281 handle address=537153227432 lock address=531830361592 100*mode+namespace=3211267 obj#=<obj# 4> tim=...
WAIT #139899315332360: nam='library cache pin' ela= 342 handle address=537153227432 pin address=531830565080 100*mode+namespace=3211267 obj#=<obj# 4> tim=...
WAIT #139899315332360: nam='enq: IV -  contention' ela= 1205 type|mode=1230372869 id1=1280262987 id2=38 obj#=<obj# 4> tim=...
WAIT #139899315332360: nam='enq: IV -  contention' ela= 616 type|mode=1230372869 id1=1398361667 id2=38 obj#=<obj# 4> tim=...
WAIT #139899315332360: nam='enq: TB - SQL Tuning Base Cache Update' ela= 322 name|mode=1413611523 1=1 2=2 obj#=<obj# 1> tim=...
WAIT #139899315332360: nam='DFS lock handle' ela= 637 type|mode=1128857605 id1=50 id2=5 obj#=<obj# 1> tim=...
WAIT #139899315332360: nam='DFS lock handle' ela= 532 type|mode=1128857605 id1=50 id2=1 obj#=<obj# 1> tim=...
WAIT #139899315332360: nam='DFS lock handle' ela= 811 type|mode=1128857605 id1=50 id2=3 obj#=<obj# 1> tim=...
WAIT #139899315332360: nam='DFS lock handle' ela= 2378 type|mode=1128857605 id1=50 id2=2 obj#=<obj# 1> tim=...
WAIT #139899315332360: nam='library cache lock' ela= 324 handle address=526890006832 lock address=531858567432 100*mode+namespace=18446744069417861123 obj#=<obj# 1> tim=...
WAIT #139899315332360: nam='enq: IV -  contention' ela= 247 type|mode=1230372869 id1=1398361667 id2=64 obj#=<obj# 1> tim=...
WAIT #139899315332360: nam='enq: IV -  contention' ela= 812 type|mode=1230372869 id1=1280262987 id2=64 obj#=<obj# 1> tim=...
WAIT #139899315332360: nam='enq: IV -  contention' ela= 365 type|mode=1230372869 id1=1398361667 id2=64 obj#=<obj# 1> tim=...
WAIT #139899315332360: nam='enq: IV -  contention' ela= 907 type|mode=1230372869 id1=1280262987 id2=64 obj#=<obj# 2> tim=...
WAIT #139899315332360: nam='enq: IV -  contention' ela= 321 type|mode=1230372869 id1=1398361667 id2=64 obj#=<obj# 2> tim=...
WAIT #139899315332360: nam='enq: IV -  contention' ela= 768 type|mode=1230372869 id1=1280262987 id2=38 obj#=<obj# 2> tim=...
WAIT #139899315332360: nam='enq: IV -  contention' ela= 589 type|mode=1230372869 id1=1398361667 id2=38 obj#=<obj# 2> tim=...
EXEC #139899315332360:c=58992,e=81958,p=10,cr=43,cu=109,mis=0,r=0,dep=0,og=1,plh=3483965,tim=...
WAIT #139899315332360: nam='SQL*Net message to client' ela= 5 driver id=1650815232 #bytes=1 p3=0 obj#=<obj# 2> tim=...
WAIT #139899315332360: nam='gc buffer busy acquire' ela= 177837541 file#=2173 block#=160331 class#=1 obj#=<obj# 5> tim=...

I ran it twice more and twice more it hung on the same wait event:

WAIT #139899315332360: nam='gc buffer busy acquire' ela= 57788368 file#=2173 block#=160331 class#=1 obj#=<obj# 5> tim=...

WAIT #139899315332360: nam='gc buffer busy acquire' ela= 27026307 file#=2173 block#=160331 class#=1 obj#=<obj# 5> tim=... 

Each time it is the same wait event and the same block, but what is this block?

select SEGMENT_NAME,PARTITION_NAME,SEGMENT_TYPE,EXTENT_ID,BLOCK_ID,
BLOCKS 
from dba_extents 
where FILE_ID = 2173 
and 160331 between BLOCK_ID and (BLOCK_ID + BLOCKS - 1);

SEGMENT_NAME                   PARTITION_NAME       SEGMENT_TYPE        EXTENT_ID   BLOCK_ID     BLOCKS
------------------------------ -------------------- ------------------ ---------- ---------- ----------
<primary key index>            <partition>          INDEX PARTITION             0     160328          8

It appears then that the problem block was the root block of the index, given that the preceding blocks in the extent are likely to be the ASSM bitmap blocks and the segment header.
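
This can be sanity-checked against the segment header: for a B-tree index the root block normally sits immediately after the segment header block, and in an ASSM tablespace the first blocks of the segment are space-management bitmap blocks. A sketch, using the placeholder names from above:

-- Sketch: locate the segment header; the index root block is normally the
-- block immediately after it.
select s.header_file,
       s.header_block,
       s.header_block + 1 as probable_root_block
from   dba_segments s
where  s.owner          = '<owner>'
and    s.segment_name   = '<primary key index>'
and    s.partition_name = '<partition>';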

It was shortly after this that discussions turned to bouncing instances, bouncing the database and switching over to DR. At some point prior to this, the NIC on the problematic network path had been brought down at the O/S level. This showed that the Oracle-level HA used in 12c for the private interconnect worked perfectly, but unfortunately it did not do anything for the situation, although it did stop the CRC errors. It was decided to shut down instance one first as this had the original problem.

Once instance one was shut down there was a huge spike in cluster waits, and then everything went back to normal. Problem solved.

What could have been done better?

Should the NIC on the problem path have been shut down as soon as the errors were identified?

There was some concern that similar actions had caused outages in the past, and so this option was considered too risky. However, it most likely would have prevented the subsequent, much more significant problem. Given that the HA had been tested when the cluster was created, perhaps a less risk-averse approach would have been better.

Could we have determined with more certainty which instance was the GCS master of the problem block, or even released the block by moving the master?

It is certainly possible to determine which instance is the master for a block using the view V$GCSPFMASTER_INFO. This should be queried using the DATA_OBJECT_ID of the table, or of the parent object in the case of indexes. You can also use ORADEBUG to move a GCS master, as is shown in the tkprof output below:

-- All commands run on the second instance of the cluster.
-- V$GCSPFMASTER_INFO starts counting instances from 0, so a CURRENT_MASTER of 0 means the first instance is the master.
-- Run as any user

create table master_test ( id number, padding char(1000) ) tablespace users;
insert into master_test select rownum, object_name from dba_objects;
insert into master_test select * from master_test;
<repeat a few times>
update master_test set id = rownum;
alter table master_test add constraint pk_master_test primary key ( id )
using index tablespace <some tablespace>;

-- as sysdba due to use of ORADEBUG

alter system flush buffer_cache;
select * 
from V$GCSPFMASTER_INFO 
where data_object_id in ( 
	select DATA_OBJECT_ID 
	from dba_objects 
	where object_name in ( 'MASTER_TEST','PK_MASTER_TEST' )
);

FILE_ID    DATA_OBJECT_ID GC_MASTERIN CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT     CON_ID
---------- -------------- ----------- -------------- --------------- ------------ ----------
         0          82907 Affinity                 0               1            3          0

-- Master is the remote instance.

exec dbms_monitor.session_trace_enable();
alter system flush buffer_cache;
select /*+ full(MASTER_TEST) */ /* remote */ count(*) from MASTER_TEST;

COUNT(*)
----------
   1262240

alter system flush buffer_cache;

-- The next command makes the current instance the master.
-- 5 is the tablespace number for the object, the tablespace being USERS in this case.
-- Specifying the wrong tablespace will not generate an error but will generate a new row
-- in V$GCSPFMASTER_INFO; however, it will not have any actual effect.


oradebug lkdebug -m pkey 82907 5

select * 
from V$GCSPFMASTER_INFO 
where data_object_id in ( 
	select DATA_OBJECT_ID 
	from dba_objects 
	where object_name in ( 'MASTER_TEST','PK_MASTER_TEST' )
);

FILE_ID    DATA_OBJECT_ID GC_MASTERIN CURRENT_MASTER PREVIOUS_MASTER REMASTER_CNT     CON_ID
---------- -------------- ----------- -------------- --------------- ------------ ----------
         0          82907 Affinity                 1               0            4          0

-- This instance is now the master

select /*+ full(MASTER_TEST) */ /* local */ count(*) from MASTER_TEST;

COUNT(*)
----------
   1262240

If we now look at the TKPROF output, we see the following:

select /*+ full(MASTER_TEST) */ /* remote */ count(*)
from
 MASTER_TEST
...
Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  row cache lock                                  1        0.00          0.00
  Disk file operations I/O                        3        0.00          0.00
  gc cr grant 2-way                               6        0.00          0.00
  KSV master wait                                 2        0.00          0.00
  ASM file metadata operation                     1        0.00          0.00
  db file sequential read                         6        0.00          0.00
  SQL*Net message to client                       2        0.00          0.00
  gc cr multi block request                    2877        0.00          0.74
  db file scattered read                       1465        0.00          3.33
  SQL*Net message from client                     2       11.50         11.50
********************************************************************************


select /*+ full(MASTER_TEST) */ /* local */ count(*)
from
 MASTER_TEST
...
Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  Disk file operations I/O                        1        0.00          0.00
  SQL*Net message to client                       2        0.00          0.00
  db file scattered read                       1465        0.01          3.34
  SQL*Net message from client                     2        6.00          6.00
********************************************************************************


This proves that the GCS master can be identified and moved. However, whether moving the master would have helped is unknown.

Should we have bounced the instance earlier?

Bouncing an instance will always cause disruption. Not only do all the connections get terminated, some perhaps in the middle of long transactions, but you also have the reconfiguration of the cluster, and when the instance is restarted its buffer cache will be cold. It therefore has a significant impact, and unless there is a statement on MOS that bouncing the instance will resolve your exact situation, you can never be sure it will help.

Bouncing an instance should never really work unless you are hitting a bug, as I suspected we were here. In a situation such as this, all the normal avenues of investigation can prove rather fruitless. You are therefore left weighing up the impact of bouncing, and the unknown chance that it will solve the problem, versus the risk that the investigation will simply never find a solution. It is therefore almost impossible to come up with the correct point at which to bounce an instance. However, there are a few things that should be considered:

  • Is the problem causing an effective outage to the database anyway?
  • Have senior DBAs taken a look and been similarly stumped?
  • Having investigated the problem, do you think there is a reasonable prospect that you can find the solution in the next hour or so?
  • Have you gathered all the data you can think of that would be useful for analysis after the database has been bounced?
  • Are you currently scouring Google for some inspiration?

In this situation, there was an effective outage and both I and an experienced colleague had looked and were stumped. Neither of us knew what was going on or had any real idea of what to investigate. We had done some trawling of MOS and Google for inspiration. Therefore bouncing the instance would seem to have been a reasonable choice. Perhaps we could have done this a little earlier, but it was only shortly after the generation of the traces that the idea of bouncing instances and the database was raised.

Lessons

  • Sometimes, bouncing the instance is the solution, even if as an Oracle DBA on Linux you hate the idea.
  • Sometimes running a small, obvious risk can mean avoiding a larger, less obvious one.