CPU starvation after moving to server with fewer but faster CPUs

Symptoms

  • Batch with high level of concurrency running slower than on previous hardware

Causes

  • Sub-optimal hint
  • High degree of parallelism and dynamic sampling
  • Sub-optimal schema design
  • Parallel Queries in 12c Spending Time on Recursive Query With /* DS_SKEW */ (Doc ID 2104350.1)
  • Fundamentally the new server had less CPU power than the old one.

Solutions

  • Tune the various pieces of SQL as detailed in the investigation.
  • Migrate to a server with more CPU.

Investigation

I was asked to investigate a database that was running slowly and had been migrated to a new server. This server had faster but fewer CPUs. Given the migration, my first theory was that this was a CPU starvation issue. It was not a critical issue, as no SLAs had been breached, and it was close to the end of the day, so I got the names of some contacts and suggested that we might want to increase the memory: this was the only database on the server and it was not using even half of the available memory. More memory may not make CPU-bound tasks go any faster, but it will reduce the run time of I/O-bound tasks if they are performing buffered reads, and a logical I/O takes less CPU than a physical I/O. So while a memory increase might not reduce the run time of the CPU-bound tasks, it would hopefully reduce the run time of most other tasks, and thereby the overall run time of the batch processes.

The following day I took a look at the EM graph, and this did indeed show some problems:

The big spike between 2am and 5am was the period during which the slow performance was being reported. Although it is not clear from the image, the dark patch on the right-hand side of the spike consists of both Concurrency in dark red and Configuration in dark brown. The former is waits on “cursor: pin S wait on X” and the latter relates to Streams. From the light green Scheduler wait it is clear that there is CPU starvation of the database and, as the “cpu_count” parameter is set to the number of threads on the server, CPU starvation on the server itself. There is also a large amount of I/O, averaging about 470MB per second of reads between 3am and 4:30am.
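As a quick sanity check of that conclusion, the instance's CPU allowance can be compared with what the host reports. This is a sketch using standard dictionary views; the exact statistic rows in V$OSSTAT vary slightly by platform:

```sql
-- CPU_COUNT defaults to the number of CPU threads visible to the instance.
SELECT name, value
FROM   v$parameter
WHERE  name = 'cpu_count';

-- NUM_CPUS and LOAD from V$OSSTAT give the host-level picture;
-- a sustained LOAD well above NUM_CPUS indicates server-wide starvation.
SELECT stat_name, value
FROM   v$osstat
WHERE  stat_name IN ('NUM_CPUS', 'NUM_CPU_CORES', 'LOAD');
```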

Unusually, the ASH and AWR data did not show many big SQL statements. The biggest SQL statement by CPU accounted for only 3.4% of the total. However, there were a large number of similar-looking queries, each taking 2-4 minutes of CPU time, running under the OBIEE user. Although these statements were only taking a few minutes of CPU time each, they were taking about 30-40 minutes to complete due to the I/O involved. CPU, however, was the resource being starved and the resource that had changed, so I decided to concentrate on the few top CPU SQL statements first and look at the rest later.
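The ranking itself can be pulled straight from the AWR history. A sketch, with hypothetical snapshot IDs covering the 2am-5am window:

```sql
-- Top 10 statements by CPU across a snapshot range (snap IDs are
-- hypothetical; pick them from DBA_HIST_SNAPSHOT for the problem window).
SELECT *
FROM  (SELECT   sql_id,
                SUM(cpu_time_delta)     / 1e6 AS cpu_secs,
                SUM(elapsed_time_delta) / 1e6 AS elapsed_secs
       FROM     dba_hist_sqlstat
       WHERE    snap_id BETWEEN 100 AND 110
       GROUP BY sql_id
       ORDER BY cpu_secs DESC)
WHERE ROWNUM <= 10;
```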

The top SQL statement was a delete that removed rows based on the rows returned by a select. The select contained an INDEX hint for one of the tables, and because of that hint it was performing an index full scan rather than an index fast full scan. An index full scan reads from the root block down one side of the index to a leaf block, then reads all the other leaf blocks in order; all of these are single-block reads. An index fast full scan, by contrast, simply reads the entire index as it exists on disk, in the same way a full table scan reads a table: the blocks do not come back in sorted order, but multi-block reads are used. It will therefore almost always be quicker to perform an index fast full scan rather than an index full scan, at the cost of unsorted results.
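The difference between the two access paths comes down to the hint used. A minimal sketch with hypothetical table and index names:

```sql
-- INDEX forces an index (full) scan: sorted order, single-block reads.
SELECT /*+ INDEX(t big_tab_ix) */ id, created
FROM   big_tab t
WHERE  id > 0;

-- INDEX_FFS allows an index fast full scan: unsorted, multi-block reads,
-- usually far faster when the sort order is not needed.
SELECT /*+ INDEX_FFS(t big_tab_ix) */ COUNT(*)
FROM   big_tab t;
```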

When I removed the hint, an index fast full scan was performed and the statement ran about 5 times faster. It did use more CPU in total, but by running faster it could hopefully complete before the window in which the CPU starvation occurred. This would of course need to be tested. The second SQL statement I looked at was an insert append, and it was the statement incurring the “cursor: pin S wait on X” waits. It was running at parallel 12 due to table properties, and there were 2 executions of it running simultaneously, i.e. 24 sessions. Waits such as this can normally be difficult to track down when they are not happening at the current time, but an ASH report showed that the sessions were all waiting on a single blocking session, all except one, which was the blocking session of the others and was happily working away. A quick “explain” of the query revealed the problem: after 10 minutes it still had not returned, and so I cancelled it.
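When the waits are no longer live, the same blocker can be found in the ASH history. A sketch with a hypothetical time window:

```sql
-- Which session was everyone waiting on? The one SID that appears as
-- BLOCKING_SESSION for all the others is the session doing the parse.
SELECT   blocking_session, COUNT(*) AS samples
FROM     dba_hist_active_sess_history
WHERE    event = 'cursor: pin S wait on X'
AND      sample_time BETWEEN TIMESTAMP '2016-01-01 02:00:00'
                         AND TIMESTAMP '2016-01-01 05:00:00'
GROUP BY blocking_session
ORDER BY samples DESC;
```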

What was happening was that, when parallelism is used, the optimiser can decide to use a high level of dynamic sampling. It was this dynamic sampling that caused the one active session to take a long time to parse the query whilst the 23 other sessions waited for it; once parsing finished, all 24 sessions executed the query. I had seen such a situation before, and in that case adding the hint “dynamic_sampling(0)” directly to the statement was the solution. Of course, this could lead to a different plan being chosen, so the performance of whatever plan is used should be tested, and a baseline used if it performs worse than the plan generated with dynamic sampling.
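The fix is a one-hint change. A sketch with hypothetical table names (in an insert statement the hint must follow the INSERT keyword):

```sql
-- dynamic_sampling(0) suppresses the parse-time sampling queries that
-- held the 23 slave sessions on "cursor: pin S wait on X".
INSERT /*+ APPEND dynamic_sampling(0) */ INTO target_tab
SELECT *
FROM   source_tab;
```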

The third SQL statement I looked at was a strange one:

SELECT /* DS_SKEW */ /*+ opt_param('parallel_execution_enabled', 'false')
dynamic_sampling(0) no_sql_tune no_monitoring */ *
FROM
(SELECT SYS_OP_COMBINED_HASH("<column>"), COUNT(*) CNT
FROM "<table>" SAMPLE(11.562448) GROUP BY "<column>" ORDER BY CNT
DESC)
WHERE ROWNUM <= 4

Looking at the AWR history, there were a number of versions of this statement, differing only in the value specified in the SAMPLE clause, which was always between 11 and 12. There were also versions for two other tables in two different schemas, but all fundamentally the same. Given the hints and the use of SAMPLE and SYS_OP_COMBINED_HASH, this looked very much like system-generated SQL. I therefore did a quick Google search, which brought up a now removed blog post from Toad World titled “12c Adaptive Query Optimization – Parallel Distribution Methods”, and then a quick MOS search, which resulted in “Parallel Queries in 12c Spending Time on Recursive Query With /* DS_SKEW */ (Doc ID 2104350.1)”. In short, with a degree of parallelism over 6 you can get such SQL being executed, and the article suggested some ways to resolve it.
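Checking whether a system is affected is straightforward, since the recursive statements carry the distinctive /* DS_SKEW */ comment. A sketch against the AWR SQL text history:

```sql
-- Find any captured DS_SKEW recursive statements and the tables involved.
-- SQL_TEXT is a CLOB, hence the DBMS_LOB.SUBSTR to get a readable snippet.
SELECT sql_id,
       dbms_lob.substr(sql_text, 200, 1) AS sql_snippet
FROM   dba_hist_sqltext
WHERE  sql_text LIKE '%DS_SKEW%';
```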

With those three SQL statements addressed, I was left with the queries generated by OBIEE. I looked at the top few and they were all accessing a specific view. This was on a partitioned table with over a thousand partitions. Importantly, from what I could find in the AWR, the partition key column was not specified in any of the queries or in the view. As the relevant indexes were local, this meant that any index scan would have to probe all 1308 partitions. The only solutions to such a situation are to re-partition the table by a column that is specified in the queries, or to make the indexes global. I therefore fed all of this information back to the product team, and they set about acting on the recommendations.
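The second of those options is the simpler one to sketch. With hypothetical names, and assuming the queries filter on a non-partition-key column:

```sql
-- A local index has one index segment per partition, so without a
-- predicate on the partition key every one of the 1308 segments is probed.
-- A global index is a single structure spanning all partitions:
CREATE INDEX obiee_fact_gix
    ON obiee_fact_tab (lookup_col)
    GLOBAL;
```

The trade-off is maintenance: global indexes can be invalidated by partition DDL unless it is performed with UPDATE INDEXES, which is why re-partitioning by a column the queries actually use is often the cleaner long-term fix.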