Investigation
I was not directly asked to look at this, but I was included in an email trail so I took a look. This initially only affected the QA database. The active sessions graph from OEM did not look good:

Note that the regular green spikes are an artefact of OEM. This can be seen when looking at the next graph. The problem was a specific SQL which had 1000 bind pairs and thus 2000 bind variables:

The problem was already being investigated by other DBAs and they had already done the normal things of killing the blocking sessions and had already asked for permission to bounce the database. Having recently looked at a cursor pin S wait on X situation I did a bit of digging. However, nothing seemed to be wrong with the database and the session that was blocking was just sitting on the CPU. It would remain on the CPU for several hours and then complete. The next session would then take over and have exactly the same behaviour. Killing the session on the CPU had no apparent effect as another session would just take over. The database has three instances, but the relevant service was only running on two. Both instances with the service running were exhibiting exactly the same behaviour. As a side effect of all the waiting sessions, the database was getting errors relating to running out of PGA and sessions being killed.
To investigate what the “active” session was doing I enabled tracing but no trace file was produced, except if the session was either killed or completed. This was very strange behaviour and suggested to me that there was something wrong with the way the statement was being handled by Oracle rather than a general performance issue. There wasn’t much that I could suggest at this point and I was rather confident that the restarting of the database would resolve the problem, even though I was not sure what the problem was. I therefore left the situation to the DBAs that were already working on it.
Unfortunately, the restarting of the database did not solve the problem and consequently an SR was raised that night. When I discovered this in the morning, I started to think that this was looking like a bug. From looking at the historic performance graphs on OEM it appeared that the problem started about two weeks ago and from ASH I pinpointed the first occurrence of “cursor: pin S wait on X”. I then looked for changes in the database at around that time and discovered that four objects had a LAST_DDL_TIMESTAMP of about that time, all advance queue exception tables.
This was quite suspicious, but I could not find or think of a way that a select statement would be affected by altering exception queues. This could therefore be a red herring, but at the very least warranted further investigation. I emailed the development team, and they were already purging advanced queues as the problem SQL in some way relates to the dequeuing procedure. However, no one knew why or how the objects had changed. I did some further digging around the advanced queues and the exception queues including starting and stopping them, but this had no effect. Note that you can’t enable an exception queue for enqueuing and so you need to specify “enqueue=>false” in the call to dbms_aqadm.start_queue.
As this avenue of investigation seemed to be going nowhere, I turned to MOS and the various articles and bugs relating to “cursor: pin S wait on X”. None of the various notes was any help with fixing the problem, but Troubleshooting ‘cursor: pin S wait on X’ waits. (Doc ID 1349387.1) had details of what information Oracle Support needed to investigate the problem, specifically:
sqlplus '/ as sysdba'
oradebug setmypid
oradebug unlimit
oradebug setinst all
oradebug -g all hanganalyze 4
oradebug -g all dump systemstate 267
quit
I therefore did this and, after filling the diag file system on one of the nodes, produced a file which was 2.7GB when compressed. This was obviously a bit large to upload to Oracle support and so I didn’t attempt to upload it immediately.
By this time Oracle Support had got back to us and it was clear that they had got the wrong end of the stick as no one had any idea why they had made the suggestions they did. I called Oracle support to clarify the situation and then uploaded the information about the changed objects and some ASH reports showing the start of the problem. I then handed to a colleague for the next shift.
Unfortunately, Oracle still had the wrong end of the stick and so when I got in the next working day, I called Oracle support to try to clear up the confusion. Oracle then asked me to upload a Hang analyze, state dump and STRACE of the active session that I mentioned I had taken. I uploaded the STRACE and tried uploading the 2.7GB dump file, but this was too big. I commented on the ticket as such.
I had been trying to create a version of the SQL I could execute myself but had not got very far due to time and I also needed to find some values for all 2000 bind variables and the SQL statement itself was over 14,000 characters long. Fortunately, the product had just sent out a version of the SQL with all the bind variables replaced with literals and so I could use this instead.
I executed this and it took about 5 minutes to run but then produced results. By executing the same SQL concurrently from another session I could reproduce the problem. I tried reducing the number of bind variables hoping that at some level the problem would go away. However, all that happened was that the SQL statement got proportionately faster as the number of bind variables was reduced, rather than at some point suddenly getting much faster. I then generated a 10046 trace of this SQL and discovered that there were significant gaps in the wait events, indicating CPU activity, whilst the following SQL was being executed:
select condition from cdef$ where rowid=:1
I did some digging around this statement, but I could not find anything helpful beyond that this was a query for check constraints during the parsing phase of the SQL. I then generated a further 10046 trace, this time with bind variables to see what the rowid was. Unfortunately, they were restricted rowids but it was possible to convert them to standard rowids as we know the relevant object:
select dbms_rowid.rowid_to_extended('<row id>','SYS','CDEF$',1) from dual;
This told me the full row ids which I could then use to find the constraints using:
select def.condition, o.owner, o.object_name
from sys.CDEF$ def
join dba_objects o
on o.object_id = def.obj#
where def.rowid = '<full row id>'';
The constraints were all on the table involved in the problem SQL. I then generated a 10053 trace which was over 1GB in size. All of this was pointing to a problem with the parsing of the SQL statement, but I could not find any solutions on MOS. I therefore uploaded all of this information to the SR in the hope that Oracle support would have some idea of what was going on.
I had a further go at uploading the hang analyze and state dump file, but to no avail. By this time the problem had started occurring on production and so the SR was raised from severity 2 to severity 1. I also handed over to my colleague once more.
Within a few hours of the raising of the severity the new Oracle support engineer had determined the problem was related to OR expansion of the IN list. He suggested hinting the SQL or using a logon trigger to set the “_NO_OR_EXPANSION” at the session level. At this point I picked up the SR again, created a SQL patch to insert the “NO_EXPAND” hint into the relevant SQL statement and this solved the problem.
Exactly why the problem started when it did and why for only one SQL statement, given that there are many other similar statements, is not clear.
Lessons
- No support organisation is perfect. Everyone gets things wrong sometimes and everyone gets the wrong end of the stick a times. Help them.
- It is important to have a good idea of what the problem is so that you can guide support and stop them pursuing fruitless avenues of investigation.