Sudden appearance of direct path reads

I was asked to investigate a problem with an application which was suddenly exhausting its connection pool. I looked at OEM:

That looked interesting.

Almost all the reads being performed were direct path reads. Not a good sign.

I looked at some of the top SQL statements performing the direct path reads and they all seemed to share a full table scan of one specific table. I checked an AWR report and this confirmed the situation. 278MB/s of reads, the top wait event was direct path read, and 91% of unoptimized reads and 99.89% of direct physical reads were for the one table.

Direct path reads were clearly the problem, but why were they being performed? I didn’t know much about the algorithm that decides whether direct path reads or cached reads are chosen, but I did know that it depended on the “_small_table_threshold” parameter and the amount of the table that was cached. I also knew that the decision was made at run time and therefore the SQL plan could remain the same whilst executions varied between cached and direct path reads. From v$bh I could tell that there were almost 20,000 blocks in the cache for the table, and from dba_segments, I knew that there were about 21,500 blocks in the object:

The difference wasn’t big, but if I could get more of the table cached, then I was sure I could stop the direct path reads. Using an “index” hint I forced the entire table to be read via an index full scan. I ran this a few times and although the number of blocks cached increased slightly, it never reached 20,000 and the direct path reads didn’t stop. Why could I not force the entire table into cache? I started searching MoS and Google for details on the direct path read decision. This brought up several things:

“_serial_direct_reads” determines whether Oracle makes the decision to use direct path reads (AUTO), whether it will always use direct path reads (TRUE/ALWAYS) or never use direct path reads (FALSE/NEVER).
“_direct_read_decision_statistics_driven” determines whether the size of the object is determined from the statistics (TRUE) or from the segment size (FALSE)

I therefore looked at the values in the database which were “auto” and “TRUE” respectively. I consequently concluded that the database was making the decision and that the size of the object was determined from the statistics. There was only one problem with that. The small table threshold was above the number of blocks according to the statistics as it was 19679 and DBA_TABLES showed 18257 blocks.

According to all that I had read, this table should not even be considered for direct path reads, yet direct path reads were occurring against it. How could that be? This left me in a rather uncomfortable position as I was involved in several conference calls that day where I had to explain that the problem was direct path reads against one table, but that I didn’t know why they were occurring as it was too small.

I did a little bit of testing to try and find out how big I needed to set “_small_table_threshold” to stop direct path reads. I used a binary search to minimise the number of values I needed to try. The results showing the crossover point are as shown below. The “stt” comment is the value set for the threshold.

select /* stt 21316 */ sum(total_legs)
from
 <table>
 
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        6      0.00       0.00          0          0          0           0
Execute      6      0.00       0.00          0          0          0           0
Fetch       12      1.44      12.21     115638     115746          0           6
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       24      1.44      12.21     115638     115746          0           6
 
 
select /* stt 21317  */ sum(total_legs)
from
 <table>
 
 
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        7      0.00       0.00          0          0          0           0
Execute      7      0.00       0.00          0          0          0           0
Fetch       14      0.78       0.78          0     135086          0           7
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       28      0.78       0.78          0     135086          0           7

So, it would seem that 21,317 blocks as a threshold stops the direct path reads, was this the size Oracle considered the object to be? It was shortly after determining this value that I discovered two blog articles on direct path reads that proved rather useful: When bloggers get it wrong part 1 and When bloggers get it wrong part 2. Although their names don’t seem very relevant, the contents are. Not only did these help with understanding some aspects of the direct path read decision, but part 2 provides two trace events that can help with understanding a specific decision:

alter session set events ‘10358 trace name context forever, level 2’;
alter session set events ‘trace [NSMTIO] disk highest’;

Note that as author Roger Macnicol, who works for Oracle, states: “but we don’t give any internal details of the cost model used”. Thus, any analysis can only go so far. With the events enabled I ran the same query shown immediately above and got the following interesting output:

kcbism: islarge 1 next 0 nblks 21317 type 2, bpid 3, kcbisdbfc 0 kcbnhl 65536 kcbstt 19679 keep_nb 0 kcbnbh 1025440 kcbnwp 2
kcbimd: nblks 21317 kcbstt 19679 kcbpnb 98397 bpid 3 kcbisdbfc 0 is_medium 1
kcbcmt1: hit age_diff adjts last_ts nbuf nblk has_val kcbisdbfc cache_it 680 94355161 94354481 1025440 21317 1 0 1
kcbivlo: nblks 21317 vlot 500 pnb 983970 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 12, objd: 72149, objn: 72149
ckpt: 1, nblks: 21317, ntcache: 19351, ntdist:1966
Direct Path for pdb 0 tsn 12  objd 72149 objn 72149
Direct Path 1 ckpt 1, nblks 21317 ntcache 19351 ntdist 1966
Direct Path mndb 0 tdiob 44 txiob 0 tciob 269
Direct path diomrc 128 dios 2 kcbisdbfc 0
NSMTIO: Additional Info: VLOT=4919850
Object# = 72149, Object_Size = 21317 blocks
SqlId = 0jpgbdcxagzz9, plan_hash_value = 207974762, Partition# = 0
KCBO: prepared ksr msg for cid=65597, tsn=12, obj=72149, pdb=0, 0

Not the easiest to read and I don’t know what all of it means, but doing some reading around and a spot of guessing leads me to interpret it as follows:

nblks 21317: Oracle has determined that there are 21,317 blocks in the object.
kcbstt 19679: The small table threshold is 19679 blocks – 2% of the buffer cache.
kcbpnb 98397: The medium table threshold is 98397 blocks – 10% of the buffer cache.
pnb 983970: This appears to be the size of the buffer cache from which the thresholds are derived. As the starting value was 8757706752KB according to v$memory_resize_ops, this seems to only be about 92% of the buffer cache. Note that this value does not change as the buffer cache dynamically expands and shrinks.
kcbdpc:DirectRead: Direct path read has been chosen.
ntcache: 19351: Oracle considered that 19,351 blocks were in cache. This is rather lower than the number I was getting as I believe the source for this value, as suggested by Tanel Poder, is X$KCBOQH.NUM_BUF. The nature of the difference I don’t know.
ntdist:1966: This appears to be the difference between nblks and ntcache, however this does not always appear to be the case as it can, for instance, be 0 even though there is a difference.
vlot 500: The percentage of the buffer cache that is the very large object threshold.
VLOT=4919850: The very large object threshold.
[MTT < OBJECT_SIZE < VLOT]: This appears to suggest that the object is between the medium table threshold and the very large object threshold. However, this seems to be a misleading statement as if you increase the small table threshold such that the object is now considered small, you get the line of output below. Therefore, this suggests that the object is actually considered of medium size. This also appears to be the observation of Frits Hoogland.

NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 21317 (blocks), Threshold: MTT(98397 blocks)

One thing to note about the decision is that it is cached in some way, probably in the cursor, and so executing the query a second or subsequent times provides no useful information unless the cursor is flushed.

What can we tell from this information? The value of 21317 blocks agrees with findings for the size of the object that Oracle is using, but where did this size come from? It neither agrees with the statistics nor dba_segments. Beyond that, it gives us a way to examine the decision more closely and a better idea of the variables involved, but no real help with understanding why direct path reads suddenly started.

On several conference calls throughout the day, along with saying I didn’t know why the direct path reads were happening, I did offer a few possible solutions:

Disable direct path reads all together using “alter session set events ‘10949 trace name context forever, level 1’;”.
Increase “_small_table_threshold”.
Compact the relevant table using “alter table … shrink space” as this was likely to reduce the number of blocks in the table to below the small table threshold.
Increase the SGA size, not because this would mean that more of the table would be cached, but because it would increase the buffer cache size which would implicitly increase the “_small_table_threshold” which defaults to 2% of the buffer cache.

Given that, as far as I am aware, we had never tested the first two possible solutions on any databases and the last two were only a temporary measures as the table would continue to grow and would therefore exceed the small table threshold at some point, the product decided to increase the connection pool size instead. This decision may of course have been coloured by my inability to give definitive answers on what was going on. Note that although you can change “_small_table_threshold” at the session level, changing it at the database level unfortunately requires a restart.

It was time for me to go off shift and so I handed over to a colleague and went home for the weekend.

On the Monday, the impact on the database was significantly lower and a combination of that and the increase in the connection pool size meant that the pool was no longer being exhausted. I also noted that the statistics on the relevant table had been updated over the weekend, with the number of blocks now being 21317. This was exactly the number that the direct path read decision was using before the weekend but before the statistics were calculated. What did that mean? Did Oracle calculate the statistics for the object for the purposes of direct path reads but nothing else? What would have caused Oracle to decide to do this?

I kept investigating the situation as we didn’t know what caused the problem and we didn’t know if it would occur again. My colleague had opened an SR with Oracle, but they had not offered much help. At this point, they had just stated how to turn of direct path reads. I updated the SR with my finding. I also uploaded a test case showing all rows of the table being repeatedly read via an index due to a hint. It then showed the same query without the hint performing a full table scan using direct path reads.

I then set about testing how much of the object needed to be in memory before direct path reads would stop. Whilst doing this I noted that specifying a rownum would mean that the direct path read decision would consider the object to be smaller. This meant two things. Firstly, by using rownum to reduce the considered size of the object to be just below the small table threshold, I could easily get a consistent amount of the object in cache. Secondly, I could increase the considered size of the object gradually above the small table threshold until the direct path reads stopped. I also needed to perform an “alter table … move;” between each run to get back to the same starting conditions. The script ran through a number of different values for the number of rows to be selected, each time performing the following, amongst other things:

-- Remove all blocks from the cache
alter table <copied table> move tablespace <tablespace>;
-- 300000 rows make the considered size of the table just below the small table threshold.  Thus, the reads 
-- are cached.
select count(*) from <copied table> where rownum < 300000;
-- Increase value for &1 until direct path reads occur
select count(*) from <copied table> where rownum < &1;

The <copied table> table was a copy of the relevant table but without any indexes. As a CTAS produced an object below the small table threshold, I needed to copy some rows twice.

Highlighting the relevant pieces of the trace file, we see:

select count(*) from <copied table> where rownum < 303500
NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 5, objd: 288385, objn: 287801
ckpt: 0, nblks: 19219, ntcache: 19080, ntdist:139
WAIT #140090625580816: nam='db file scattered read' ela= 1462 file#=5 block#=39812 blocks=128 obj#=287801 tim=8202685186830
 
select count(*) from <copied table> where rownum < 304000
NSMTIO: kcbdpc:DirectRead: tsn: 5, objd: 288386, objn: 287801
ckpt: 0, nblks: 19250, ntcache: 19080, ntdist:170
WAIT #140090625580816: nam='db file scattered read' ela= 1135 file#=5 block#=19332 blocks=128 obj#=287801 tim=8202686643082
 
select count(*) from <copied table> where rownum < 306000
NSMTIO: kcbdpc:DirectRead: tsn: 5, objd: 288410, objn: 287801
ckpt: 0, nblks: 19377, ntcache: 19080, ntdist:297
WAIT #140645831585568: nam='db file scattered read' ela= 1361 file#=5 block#=19332 blocks=128 obj#=287801 tim=8203755622239
 
select count(*) from <copied table> where rownum < 306500
NSMTIO: kcbdpc:DirectRead: tsn: 5, objd: 288391, objn: 287801
ckpt: 1, nblks: 19599, ntcache: 19080, ntdist:519
WAIT #140090625580816: nam='direct path read' ela= 610 file number=5 first dba=371 block cnt=13 obj#=287801

This appears to show that when (100*19080/19219) = 99.3% of the object is in cache, cached reads are chosen. Then something strange happens, at (100*19080/19250) = 99.1% of the object in cache direct path reads are chosen, but db file scattered reads, which are cached, are performed. Either the NSMTIO line does not indicate what it would appear to indicate, or there is some secondary decision being made. It is not until only (100*19080/19599) = 97.4% of the object is in cache that direct path reads are chosen and performed. This does seem a very high value and Frits Hoogland found it to be higher still at above 99%.

I put all of this together and uploaded a second test case to Oracle. I then continued my investigating by looking for the cause of the sudden switch to direct path reads. The statistics history (sys.WRI$_OPTSTAT_TAB_HISTORY) showed 18257 blocks for at least the last month. The current statistics which were gathered over the weekend showed 21317 blocks. This suggested that a recent unusually high rate of growth had taken it above the small table threshold, which I already knew. I looked at the allocated extents:

SQL> select extent_id, blocks from dba_extents where segment_name = ‘<table>’ order by extent_id;
 
 EXTENT_ID     BLOCKS
---------- ----------
         0          8
...
        15          8
        16        128
...
        78        128
        79       1024
...
        91       1024

For some time the ASSM extent allocation size had been 1024 blocks, so this was not just one extent being added. I had been playing around with DBMS_SPACE to see if that offered any insight into what was going on. Using a slightly modified version of the example from the documentation which didn’t print out the size in bytes, I got the following output:

       UNF
----------
      1136
 
 
       FS4
----------
      1753
 
 
       FS3
----------
        27
 
 
       FS2
----------
        28
 
 
       FS1
----------
        16
 
 
      FULL
----------
     18357

These values summed to 21317 blocks, the number that the statistics suggested were in the table. As I knew that the extents that had recently been allocated were all 1024 blocks, this raised two questions. What were unformatted blocks and how did we manage to get more than one complete extent worth of them? The first question was answered by a spot of Googling. They are the blocks above the high water mark and will therefore never be read in a query even if it performs a full table scan (or a fast full index scan if the segment were an index). Importantly though, unformatted blocks are included in the number of blocks calculated by DBMS_STATS and therefore are included in the direct path read decision. This is why the direct path read decision was working with 21317 blocks even though I couldn’t get more than 20,000 blocks in the cache, resulting in a direct path read always being chosen. The difference between the value for blocks in DBA_TABLES and DBA_SEGMENTS is because although unformatted blocks are counted by DBMS_STATS, the various header blocks are not. the value in DBA_SEGMENTS is the actual total number of blocks assigned to that object.

The second question was more difficult to answer. Did Oracle allocate a new extent when an object gets close to being full, rather than actually being full? More Googling and searching of MOS followed. I created a test table and a script that looped round numerous times inserting a few large rows into the table, running DBMS_STATS and DBMS_SPACE.SPACE_USAGE, logging the results to a second table.

This testing showed that extents were allocated only when the object was full, not before. It also showed some interesting behaviour with respect to unformatted blocks becoming formatted in large chunks rather than as needed. It is difficult to draw many more conclusions as my setup meant that two rows will always fill an 8KB block which is a rather unusual situation.

How then could we end up with more than one extent’s worth of unformatted blocks? More Googling and documentation reading on MOS followed. I then noted a remark in one of the documentation pages on ASSM which stated that extents had instance affinity. The database was RAC, so could this mean that each instance gets its own extent? My test database was RAC, but all the inserts had been performed in a single session on a single instance. I set up a quick test case using the previous extent allocation testing scripts. I used the scripts to insert enough rows that the existing extents were completely full, and the insertion of a new row would force the allocation of a new extent. I then manually inserted two rows in quick succession, one row from one instance and one row from another. After that I ran a modified version of the scripts to populate the log table for the state after the insertion of the two rows. This showed the following in the log table:

  NUM_ROWS NUM_BLOCKS DBMS_STATS_BLOCKS UNFORMATTED_BLOCKS FULL_BLOCKS_0_25 FULL_BLOCKS_25_50 FULL_BLOCKS_50_75 FULL_BLOCKS_75_100 FULL_BLOCKS
---------- ---------- ----------------- ------------------ ---------------- ----------------- ----------------- ------------------ -----------
        10          8                 5                  0                0                 1                 0                  0           4
...
     18154       9216              9077                  0             1807                 0                 0                  0        7270
     18156      11264             11117               1960             1807                 0                 2                 78        7270

Clearly then, in a RAC environment, if inserts are occurring across instances, you can get multiple new extents allocated in quick succession even though you are only inserting a few blocks worth of rows. This showed how you can get into a situation where you have so many unformatted blocks that no matter what you do about caching, you can’t stop direct path reads from occurring. However, whilst I was investigating the ASSM block allocation, at about 8am on the Friday the direct path reads stopped:

I checked the amount of the object in the cache, and it was not the 97.4% figure that my testing had predicted as there were still far too many unformatted blocks. I tried my simple “count(*)” query and that was still performing direct path reads, so this must be something to do with the queries the application was actually using. There were many different queries, but they were fundamentally of two forms which varied only in the number of bind variables in an in list. In both cases, there were two tables involved, both being accessed via a full table scan. The second table was much smaller and therefore well below the small table threshold.

I turned on the trace events and had a look at the output. This was a bit more complicated:

kcbism: islarge 0 next 0 nblks 3520 type 2, bpid 3, kcbisdbfc 0 kcbnhl 262144 kcbstt 19679 keep_nb 0 kcbnbh 3205350 kcbnwp 1
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 3520 (blocks), Threshold: MTT(320535 blocks),
_object_statistics: enabled, Sage: enabled,
Direct Read for serial qry: enabled(::::::), Ascending SCN table scan: FALSE
flashback_table_scan: FALSE, Row Versions Query: FALSE
SqlId: 71yxr2bkn6ab9, plan_hash_value: 2642064940, Object#: 72151, Parition#: 0 DW_scan: disabled
kcbism: islarge 1 next 0 nblks 21317 type 2, bpid 3, kcbisdbfc 0 kcbnhl 262144 kcbstt 19679 keep_nb 0 kcbnbh 3205350 kcbnwp 1
kcbimd: nblks 21317 kcbstt 19679 kcbpnb 320535 bpid 3 kcbisdbfc 0 is_medium 1
kcbcmt1: scann age_diff adjts last_ts nbuf nblk has_val kcbisdbfc 0 97902 0 3205350 21317 0 0
kcbivlo: nblks 21317 vlot 500 pnb 3205350 kcbisdbfc 0 is_large 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 12, objd: 72149, objn: 72149
ckpt: 0, nblks: 21317, ntcache: 20243, ntdist:0
Direct Path for pdb 0 tsn 12  objd 72149 objn 72149
Direct Path 0 ckpt 0, nblks 21317 ntcache 20243 ntdist 0
Direct Path mndb 18264 tdiob 3 txiob 0 tciob 738
Direct path diomrc 128 dios 2 kcbisdbfc 0
NSMTIO: Additional Info: VLOT=16026750
Object# = 72149, Object_Size = 21317 blocks
SqlId = 71yxr2bkn6ab9, plan_hash_value = 2642064940, Partition# = 0

This was showing that even at (100*20243/21317) = 95% cached, cached reads were chosen. Clearly then, the nature of the query, not just the relevant object, is considered by the costing algorithm for the direct path read decision. Given that I did not know what all the variables involved were, I didn’t think it was worth determining at what level of caching direct path reads where chosen.

In addition to the other aspects of this investigation, I had been looking at the effects of “alter table … shrink space”. This showed two interesting results:

Shrinking the object packs all the rows into as few blocks as possible. This would mean that as soon as a few more rows were added, a new extent would be added and potentially a second one in a two node RAC environment. Therefore, this approach may well be a very short-term fix.
As soon as the object was shrunk, a smaller size was used in the direct path read decision, even though statistics had not been recalculated. I therefore concluded that although the value for blocks in the object statistics is normally used, some operations will result in a different value being used, perhaps that value is the value that would be calculated if statistics had been regathered. It is possible that the allocation of an extent is one such operation which would answer the question of why 21,317 blocks was being used as the object size at the start of this investigation.

The last solution to the problem that I proposed was using an index or indexes to stop the full table scan being performed on the table. Due to the nature of the queries, two indexes were required. These have gone through testing and showed that not only did they stop the direct path reads, but they also improved performance. They are now in production and so the problem has been resolved. One of the indexes was a covering index which I would not normally use, but it was the only way to force an index to be used. This does have the downside that as a fast full index scan is used to access it, direct path reads could occur against it. However, as it is a tenth of the size of the table, we have a long time before it reaches the small table threshold.

Conclusions

This investigation raised several questions which we now have answers to or at least plausible answers to. The question at the heart of this investigation was why did direct path reads suddenly start occurring for all full table scans on a specific table? This has several causes:

Recently crossing the small table threshold.
Allocation of extents, even though existing extents were not full, due to insert activity across two instances.
Unformatted blocks, which can never be cached, counting towards the size of the object for the direct path read decision.
Direct path read decision requiring a very high proportion of an object’s blocks to be in cache before a cached read is chosen over a direct path read.
Operations, probably including extent allocation, resulting in the direct path read decision using a value for an object’s blocks other than that in the statistics.

This set of circumstances appears to highlight a flaw in the direct path read decision algorithm. But it is an important flaw or is it just an edge case that we were just very unlucky to hit? Given that the details of the direct path read decision algorithm are unknown, it is impossible to say definitively, however given that I have never investigated a problem like this before, I think the likelihood of the situation occurring again and being as significant is low. Due to the unknown algorithm, it is also very difficult to give any advice on reducing the likelihood of encountering the problem.

If you do encounter the problem, what can you do to stop it or reduce its impact? Beyond disabling direct path reads completely, there is no permanent and complete solution to the problem, only temporarily ones. However, as the size of an object increases there will be a reduction in the impact of unformatted blocks on the percentage of the object that can be cached. Therefore, temporary solutions may be enough if the object continues to grow. If one or more indexes can be added such that all important access to the object will be via range scans of those indexes, then this would probably be enough to address the impact of the issue. If, however, such indexes are accessed via fast full index scans, then the problem will simply be delayed until the index exceeds the small table threshold, although that may be a long time into the future.