Cursor: mutex X and cursor: mutex S

Symptom

High Cursor: mutex X and cursor mutex S waits for a single SQL statement.
High number of child cursors for that SQL statement.

Cause

Hard parsing due to bind variable type mismatches.

Solutions

Keep flushing the cursor from the cursor cache. This is obviously only a temporary solution.
Set the parameter “_cursor_obsolete_threshold=200”. This limits the number of child cursors to 200 and can therefore have side effects.
Ensure all variables bound by JDBC are correctly type cast. This is the ideal solution.

Investigation

I was brought in to this issue as the DBA dealing with it asked for some assistance, which I am always happy to give when it comes to performance problems. The “messages” backlog for a production database was increasing and there were a lot of concurrency waits on the database affecting a single SQL statement. This was a PL/SQL anonymous block that did little more than an insert into a particular table. Importantly there were over one hundred bind variables for the values to be inserted.

The image below shows a taste of what the OEM graph looked like on the second occurrence, but this was during a reasonably quiet period during the problem and during the most active periods the entire graph was filled with cursor mutex X and cursor mutex S.

The effect on the database can be more clearly seen in the historic graph below, with the shaded period corresponding to the period above.

This SQL statement had a very high number of child cursors:

select count(*) from v$sql where sql_id = '<sql ID>';

COUNT(*)
----------
1400

The reason that each child cursor could not be used was due to bind variable mismatch e.g.

select reason from v$sql_shared_cursor where sql_id = '<sql ID>'
and rownum = 1
/
REASON
--------------------------------------------------------------------------------
<ChildNode><ChildNumber>0</ChildNumber><ID>40</ID><reason>Bind mismatch(8)</reas
on><size>4x4</size><bind_position>70</bind_position><original_oacflg>3</original
_oacflg><original_oacdty>2</original_oacdty><new_oacdty>1</new_oacdty></ChildNod
e>

The reason for the mismatch was always that either the original data type code (original_oacdty) was 1 and the new data type code (new_oacdty) was 2 or vice versa. Data type code 1 is a VARCHAR2 and data type code 2 is a NUMBER. Given that there are ten thousand possible combinations of two values for one hundred positions, the number of possible child cursors is very large.

A search of MOS brought up “Troubleshooting cursor: mutex X Due to High Version Count on the Insert Statements with Binds using JDBC (Doc ID 1469726.1)”. This was a good match and it was referring to NULLs, which are apparently converted to VARCHAR2(32) by JDBC, becoming CHAR(1) which would cause a similar data type mismatch. The note states that there is no limit to the number of child cursors and therefore suggests setting the following as a workaround:

_cursor_features_enabled=1026
event="106001 trace name context forever, level 200"

This limits the number of child cursors to 200. Whilst this mitigates the problem, it does not fix the underlying issue which is that the code should not be flipping between VARCHAR2 and NUMBER data types. However, subsequent testing on QA and a further search of MOS brought up “Troubleshooting: High Version Count Issues (Doc ID 296377.1)” which suggested that the above was valid up to and including 11.2.0.2.2, whilst the following was valid for 11.2.0.3 and above:

_cursor_obsolete_threshold=200

Testing on QA has proved that this is the case for 12c. I did not get a chance to implement the workaround on the first occurrence as the problem resolved itself, probably because the level of concurrency dropped. By the time of the second occurrence the proposed fix was in QA but had not made it into Production and the problem was not going away on its own. The product were bouncing their messaging servers willy-nilly and although this was having some effect, the backlog was still growing. I therefore decided to flush the cursor to get rid of all its children using dbms_shared_pool.purge. The result was an immediate drop in the cluster waits and more than a ten fold reduction in run time of the SQL. However, the number of child cursors built back up rather quickly and so I flushed it again. After a few times I decided that I should script it and so I can up with the following script:

#!/bin/ksh
export ORACLE_SID=<instance name>
ORAENV_ASK=NO
. oraenv
ORAENV_ASK=YES
while [[ 1 ]]
do
sqlplus -S -L "/ as sysdba" <<- EOF
select sum(executions) from v\$sql where sql_id = '<sql ID>';
select count(*) from v\$sql where sql_id = '<sql ID>';
exec dbms_shared_pool.purge('<address>, <hash value>','C');
quit
EOF
sleep 30
done

A combination of my initial manual flushing and this script allowed the backlog to clear and the problem was resolved. The code was later changed to correctly type cast the variables.