
QA on faster new hardware is slower than Production

Symptom

  • The same query on production runs much faster than on QA.
  • Bind variables are used.

Cause

  • Bind variable peeking.
  • Sub-optimal use of bind variables.

Possible solutions

  • Create a SQL Plan Baseline if one plan is always, or nearly always, better and the impact when it is not is acceptable (a sketch follows this list).
  • Use literals if the impact on the shared pool and hard parsing is acceptable. Note that you don’t need to replace all the bind variables in a SQL statement, only those that influence the choice of execution plan.
  • Disable bind variable peeking. Do not do this unless you understand all the implications, as it affects every SQL statement, not just the problem one. The relevant parameter is _OPTIM_PEEK_USER_BINDS which, being an underscore parameter, Oracle Support state should only be set in consultation with them.
  • Use the Adaptive features of 12c. However, in my experience these cause more problems than they are worth, at least in 12.1, and in that release I would not use them.
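As an illustration of the baseline option, a plan already in the cursor cache can be loaded as a SQL Plan Baseline with dbms_spm. This is only a minimal sketch: <sql_id> and <plan_hash_value> are placeholders for real values taken from v$sql, and whether to mark the plan as fixed is a judgement call.

declare
  l_plans pls_integer;
begin
  -- load the nominated plan from the cursor cache as a baseline;
  -- fixed => 'YES' makes this plan preferred over non-fixed plans
  l_plans := dbms_spm.load_plans_from_cursor_cache(
               sql_id          => '<sql_id>',
               plan_hash_value => <plan_hash_value>,
               fixed           => 'YES');
  dbms_output.put_line(l_plans || ' plan(s) loaded');
end;
/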

Investigation

A product team had just started using its new QA database on shiny new hardware, but a particular query was performing rather badly compared to production. I was asked to investigate why. The query looked something like the following:

SELECT <some fields>
FROM <some table>
WHERE <some date field> <= :1
AND ( <some other date field> IS NULL OR <some other date field> > :2 )
AND <some varchar2 field> in ( <some values> )

Because batch-related queries are aged out of the shared pool at a high rate, the SQL statement was not in production’s shared pool, so a historic execution plan had to be found. The most recent was from some months ago, and essentially iterated through the in list, using an index on <some varchar2 field> and <some other date field>.
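A historic plan can be retrieved from AWR with dbms_xplan, along these lines (a sketch; ‘<sql_id>’ is a placeholder and the statement must still be within the AWR retention window):

select * from table(dbms_xplan.display_awr('<sql_id>'));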

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name                       | Rows  | Bytes | Cost (%CPU)| Time
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                            |       |       |  2413 (100)|
|   1 |  CONCATENATION                               |                            |       |       |            |
|   2 |   INLIST ITERATOR                            |                            |       |       |            |
|   3 |    TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED| <some table>               |   784 |   107K|    55   (0)| 00:00:0
|   4 |     INDEX RANGE SCAN                         | <some index>               |  1865 |       |    11   (0)| 00:00:0
|   5 |   INLIST ITERATOR                            |                            |       |       |            |
|   6 |    TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED| <some table>               | 36442 |  4982K|  2358   (1)| 00:00:0
|   7 |     INDEX RANGE SCAN                         | <some index>               | 86669 |       |   346   (1)| 00:00:0
------------------------------------------------------------------------------------------------------------------------

I have seen situations where an index had become unusable, so that could have been the cause here. However, an explain plan of the query on QA produced the same plan as production’s historic execution plan, so the index could not be unusable. Since the statement uses bind variables, and an explain plan (which does not peek at binds) produced a different plan from the current cursor on QA, this suggested a bind variable peeking issue.
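The two plans can be compared along these lines (a sketch; the explain plan shows what the optimiser would do without peeking at the binds, while dbms_xplan.display_cursor shows the plan of the cached cursor, with ‘<sql_id>’ a placeholder):

explain plan for
SELECT <some fields>
FROM <some table>
WHERE <some date field> <= :1
AND ( <some other date field> IS NULL OR <some other date field> > :2 )
AND <some varchar2 field> in ( <some values> );

select * from table(dbms_xplan.display);

-- the plan of the cursor currently in the library cache
select * from table(dbms_xplan.display_cursor('<sql_id>', null));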

To prove it, I needed the bind variable values used when the statement was first parsed. Fortunately these are available from v$sql in the “BIND_DATA” column. However, this is a rather unintelligible long alphanumeric string; to get the actual values you need dbms_sqltune.extract_binds, and an example script for this is in the “Oracle Scripts” section. Both bind variables had the same value, a date from some years ago.

I therefore plugged the values in as literals and generated a new explain plan, which this time was the same as the execution plan on QA: a full table scan. This proves that, with the bind variable values used when the cursor was parsed, the optimal execution plan would have been a full table scan, at least according to the statistics. The problem was therefore down to bind variable peeking combined with binds that influence the optimal plan.

The immediate solution is to flush the cursor from the shared pool, which can be done using dbms_shared_pool.purge; an example is in the “Oracle Scripts” section. Obviously this is only a temporary solution, as it does not prevent the same situation from occurring again. To stop it recurring we would ideally replace the binds with literals or create a baseline, although a baseline would of course enforce one plan regardless of which plan would be best. There are other, more complicated solutions as well.
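For reference, the two lookups might look something like this (a sketch; ‘<sql_id>’, ‘<address>’ and ‘<hash_value>’ are placeholders):

-- decode the bind values captured in v$sql.bind_data
select b.name, b.position, b.datatype_string, b.value_string
from v$sql s,
     table(dbms_sqltune.extract_binds(s.bind_data)) b
where s.sql_id = '<sql_id>';

-- purge the cursor, identified by its address and hash value
-- ('C' indicates a cursor)
select address, hash_value from v$sqlarea where sql_id = '<sql_id>';
exec dbms_shared_pool.purge('<address>,<hash_value>', 'C')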

This does however leave the question of why such a date range should result in a full table scan when the query apparently returns only one row. If the optimiser produces a plan that apparently returns one row, and there is nothing in the SQL, such as “where rownum = 1”, to cause this, the optimiser has usually calculated that no rows will be returned. To determine the reason for the full table scan I looked at DBA_TAB_COL_STATISTICS for the statistics on the columns specified in the WHERE clause. As with the bind data, the high and low values in this view are difficult to interpret, but again Oracle provides something to help, in this case dbms_stats.convert_raw_value:

SQL> set serveroutput on
SQL> declare
  2    v_date date;
  3  begin
  4    dbms_stats.convert_raw_value('<low value>',v_date);
  5    dbms_output.put_line(v_date);
  6    dbms_stats.convert_raw_value('<high value>',v_date);
  7    dbms_output.put_line(v_date);
  8  end;
  9  /
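The raw values themselves come from DBA_TAB_COL_STATISTICS, along with other useful figures such as the number of distinct values and the number of nulls, for example (owner and table names are placeholders):

select column_name, low_value, high_value, num_distinct, num_nulls
from dba_tab_col_statistics
where owner = '<owner>'
and table_name = '<some table>';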

Obviously it would be possible to do the conversion more elegantly, but I simply wanted the values quickly. The statistics showed that both of the date fields in the query have a low value from a few years back, a date some years after the values used for the bind variables when the problem query was last parsed on QA. You can use the same technique to get the high and low values for the in list predicate, changing the type of the output variable to match the type of the column (a sketch follows). This showed that there were only two distinct values: the two specified in the in list.
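For the varchar2 column the block is almost identical, only the type of the output variable changes (again a sketch, with the raw values taken from LOW_VALUE and HIGH_VALUE in DBA_TAB_COL_STATISTICS):

SQL> declare
  2    v_value varchar2(4000);
  3  begin
  4    dbms_stats.convert_raw_value('<low value>',v_value);
  5    dbms_output.put_line(v_value);
  6    dbms_stats.convert_raw_value('<high value>',v_value);
  7    dbms_output.put_line(v_value);
  8  end;
  9  /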

According to these statistics then, the predicates would look something like:

<some date field> <= <date before any data>
(<some other date field> IS NULL OR <some other date field> > <date before any data>)
<some varchar2 field> in ( <both values for some field> )

The first part would match no data, while the second and third parts would match all of it (note that there were quite a few nulls for <some other date field>). As the first part matches no data and all the parts are AND’ed together, the query would return no rows.
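Which indexes actually cover these columns can be checked in dba_ind_columns, something like (a sketch; owner and table names are placeholders):

select index_name, column_name, column_position
from dba_ind_columns
where table_owner = '<owner>'
and table_name = '<some table>'
order by index_name, column_position;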

Clearly then, accessing the table using an index on <some other date field> and/or <some varchar2 field> would not result in good performance, as both of those predicates match all the rows. However, accessing the table using an index on <some date field> would, at least in this situation, as that predicate matches no rows. Looking at dba_ind_columns, although there were single column indexes on <some other date field> and <some varchar2 field>, and of course the multicolumn index on both, there was no index containing <some date field>. Therefore the most effective execution plan for the given predicates was to access the table via a full table scan. Another example of bind variable peeking causing a problem, or is it sub-optimal use of bind variables?