
Slow performance after “snap”

I was asked to look at a UAT database as it was performing slowly. I asked for some details of the slowness and was told that it had been slow since the “snap” (a duplication of the database at the storage level) about three weeks ago. I asked for a specific example and they pointed me at an incident ticket which mentioned a statement in the alert log about “excessive swapping observed”. This is usually a symptom rather than a cause, but warrants investigation if nothing else seems odd. I pushed further for an example of a specific SQL statement and they pointed me at one that had been running for 6.9 hours.
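
As an aside, one rough way to spot that sort of long-running statement yourself, offered here as a sketch rather than what was actually run on this system (the one-hour threshold is purely illustrative), is to look at active sessions and how long their current call has been going:

-- Sketch: active user sessions whose current call has been running for over an hour
-- (the 3600 second threshold is illustrative)
select   sid, serial#, sql_id, status, last_call_et
from     v$session
where    status = 'ACTIVE'
and      type = 'USER'
and      last_call_et > 3600
order by last_call_et desc;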

Whilst I was asking the above questions I was also looking at the main OEM performance graph. Although I don’t have a snapshot from the exact time, the following is the 24-hour historical graph covering the period:

I started my investigation at about 9:45am, and when I looked at the graph there were about 20 sessions performing I/O and not much CPU activity. Note that subsequent investigation showed that the database was capped at 2 CPUs (cpu_count=2), hence the large light green block.
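
The CPU cap itself is easy to confirm; a minimal sketch (the first form is SQL*Plus syntax):

-- Sketch: confirm the CPU count the instance is using
show parameter cpu_count

-- or, from any SQL client:
select name, value from v$parameter where name = 'cpu_count';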

I don’t know the usual profile of this database, but my first thought was that it was doing a lot of I/O compared to CPU.  This of course may be normal, but warranted investigation.  Drilling down into the I/O I saw the following:

Immediately I could see that there was far too much direct path reading happening and that, at least when I started my investigation, this was mainly down to one specific SQL statement, SQL ID c2hymggn2ab6j.
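
For anyone without the OEM drill-down to hand, a rough sketch of the same question asked of ASH (the one-hour window is illustrative, and ASH access requires the Diagnostics Pack):

-- Sketch: which SQL is responsible for the 'direct path read' activity?
select   sql_id, count(*) as ash_samples
from     v$active_session_history
where    event = 'direct path read'
and      sample_time > sysdate - 1/24
group by sql_id
order by count(*) desc;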

Looking at this SQL statement, it appeared to be an innocuous “select * from <view> where <id> = :1”. However the view was 412 lines of SQL (some of which was down to one column per line in the select clause). Looking at a SQL Monitor report for the SQL, it was spending almost all of its time performing two full table scans of a specific table which I will call XXX_V2. I took a look at the two execution plans captured in the AWR and both performed full table scans of that table.
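
The reports themselves can be pulled in the usual way; a sketch, using the SQL ID above and default options (SQL Monitor needs the Tuning Pack, AWR the Diagnostics Pack):

-- Sketch: SQL Monitor report for the long-running execution
select dbms_sqltune.report_sql_monitor(sql_id => 'c2hymggn2ab6j', type => 'TEXT') from dual;

-- Sketch: all execution plans captured in AWR for the same SQL ID
select * from table(dbms_xplan.display_awr('c2hymggn2ab6j'));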

Execution Plan
------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |                            |       |       | 52422 (100)|          |
...
|  21 |      TABLE ACCESS FULL                    | XXX_V2                     |  4611K|    65M| 26158   (2)| 00:00:02 |
...
|  27 |       TABLE ACCESS FULL                   | XXX_V2                     |    11M|   144M| 26158   (2)| 00:00:02 |

I then took a look at production to see what the execution plans were there and discovered that the SQL was not accessing XXX_V2. I next looked at DBA_OBJECTS on UAT and discovered that not only had the view been recreated shortly after the snap, but that the XXX_V2 table had also been created at about that time.
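
A sketch of the sort of check I mean (the object names are the anonymised ones used in this post):

-- Sketch: when were the view and the new table created / last changed?
select owner, object_name, object_type, created, last_ddl_time
from   dba_objects
where  object_name in ('XXX', 'XXX_V2')
order by created;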

I took a quick look at the view to see if I could see any simple way of tuning it, but to my surprise the XXX_V2 table did not appear in it, nor in any of the views it referenced. A quick look at DBA_OBJECTS confirmed that XXX, which the view did reference, was actually a synonym, and that this synonym pointed to the XXX_V2 table.
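
DBA_SYNONYMS makes that indirection explicit; a minimal sketch:

-- Sketch: confirm that XXX is a synonym and what it points at
select owner, synonym_name, table_owner, table_name
from   dba_synonyms
where  synonym_name = 'XXX';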

After a brief look at the view I could not see anything immediately wrong, so I reported all of my findings to the developers. As an example of how the change to this view had affected the database, over a 10 hour period the database averaged 940MB/s of I/O, rather than the more usual 50-100MB/s. This was going to cause problems.
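
That throughput figure came from OEM, but a comparable average can be estimated from AWR; a rough sketch, assuming a single instance, no restart within the window, and purely illustrative snapshot IDs:

-- Sketch: average physical read MB/s between two AWR snapshots
-- (snap_ids 100 and 110 are illustrative; Diagnostics Pack required)
select ( max(ss.value) - min(ss.value) ) / 1024 / 1024
       / ( ( cast(max(sn.end_interval_time) as date)
           - cast(min(sn.begin_interval_time) as date) ) * 86400 ) as avg_read_mb_per_sec
from   dba_hist_sysstat  ss
join   dba_hist_snapshot sn
  on   sn.snap_id         = ss.snap_id
 and   sn.dbid            = ss.dbid
 and   sn.instance_number = ss.instance_number
where  ss.stat_name = 'physical read total bytes'
and    ss.snap_id between 100 and 110;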

The product team determined that the relevant join conditions in the view were wrong, quickly changed them and recreated the view. This resulted in the relevant query performing an index range scan returning just a few rows, and the direct path reads disappeared.
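
Verifying the new access path once the change is in is simple enough; a sketch:

-- Sketch: check the plan now in the shared pool for the problem SQL
select * from table(dbms_xplan.display_cursor('c2hymggn2ab6j', null));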

Following the change to the view, a significant number of “cursor: pin S wait on X” waits were observed, as can be seen from the screenshot below:

Again, this screenshot was taken some time after the problem was resolved; the “Concurrency” waits (the red bits) were much more prominent immediately after the recreation of the view. This wait event is well described in the blog post Cursor pin S wait on X, and in this case it was caused by a lot of SQL referring to the recreated view being re-parsed.
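
The scale of the re-parsing shows up in ASH as well; a sketch (the window is illustrative), including which session held the mutex that everyone else was waiting on:

-- Sketch: mutex waits following the view recreation, and the blocking (parsing) session
select   sql_id, blocking_session, count(*) as ash_samples
from     v$active_session_history
where    event = 'cursor: pin S wait on X'
and      sample_time > sysdate - 1/24
group by sql_id, blocking_session
order by count(*) desc;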

Lessons

  • It is important to look for the root cause and not just address the symptoms. Keep asking “Why”. Memory usage has gone up. Why would that be? Perhaps because a lot of sessions were performing direct path reads and therefore using more PGA. Why are there more direct path reads? Because of SQL ID c2hymggn2ab6j. Why is SQL ID c2hymggn2ab6j performing so many direct path reads? etc, etc.
  • Perceptions of time are not always accurate. The database was snapped three weeks prior to the problem, but the new table was not created until several days later, so the problem cannot have started until then. Asking someone to think back several weeks to when a problem started is likely to produce an inaccurate answer.
  • The database really can be slow. However, in my experience this is almost always down to one or two SQL statements performing badly, using too many resources and starving other processes of those resources.