
Batches running slowly due to “enq: TM contention”

Symptom

  • Large number of sessions waiting on “enq: TM contention”
  • Other sessions waiting on “cursor: pin S wait on X”

Cause

  • Idle session blocking a delete – exact locking chain unknown.

Solutions

  • Kill idle session

Investigation

I was asked to look at a production database as there were a large number of “enq: TM contention” waits, and queries on a specific table were running very slowly or not completing at all. I took a look at instance 1:

There was clearly a problem, although not a drastic-looking one, as only a few sessions were waiting on “Concurrency”. I drilled down into these sessions: they were all waiting on “cursor: pin S wait on X” and all coming from one select statement:
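
The same drill-down can also be done from the command line. A minimal sketch, assuming the standard GV$SESSION columns and filtering on the “Concurrency” wait class shown in the graph:

-- Active sessions currently in the Concurrency wait class, across all instances
SELECT inst_id, sid, serial#, sql_id, event, seconds_in_wait
FROM   gv$session
WHERE  status     = 'ACTIVE'
AND    wait_class = 'Concurrency'
ORDER  BY seconds_in_wait DESC;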

This was a select with a large number of bind variables, and I had seen it cause problems before, but not of this type. Looking at the statistics for the statement, it seemed to be almost exclusively waiting but never completing:
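
The “Details” style figures come from the cursor statistics, so a hedged sketch of the kind of query that shows executions against where the time went, with :sql_id as a placeholder bind, would be:

-- Executions versus elapsed, CPU and concurrency wait time for one statement
SELECT sql_id,
       executions,
       ROUND(elapsed_time / 1e6)          AS elapsed_secs,
       ROUND(cpu_time / 1e6)              AS cpu_secs,
       ROUND(concurrency_wait_time / 1e6) AS concurrency_secs
FROM   v$sqlstats
WHERE  sql_id = :sql_id;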

This, however, did not tally with the Activity graph for the SQL statement, as it seemed to spend a lot of time on CPU as well:

Why would it apparently spend so much time on the CPU according to the “Activity” section, but so little according to the “Details” section? Given that there was at least one point in the Activity graph where CPU dropped to zero, this would suggest that an execution had finished, yet there were no complete executions according to the statistics. This was all rather confusing, especially considering that the wait event usually relates to activity in the shared pool, such as large numbers of child cursors, large amounts of hard parsing, or changes to an object.

However, there clearly were not many sessions executing the query, and it was apparently completing, at least according to my interpretation of the Activity graph. I looked at the blocking sessions, but this was just as confusing. Both the blocker and the blocked sessions were executing the SQL statement I was investigating, and the blocker was apparently the one on the CPU according to the Activity graph. The reported wait event for the blocker was “SQL*Net more data to client”, but I suspected that this was the last wait event it had encountered, not a current wait. A query of V$SQL (or looking at the “General” tab of the session details) showed that it was not actually waiting. The session details for the blocking session agreed with the Activity graph for the SQL statement:
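
The “waiting or not” question can be settled from V$SESSION itself, since STATE records whether EVENT is a current wait or merely the last one. A sketch of the kind of check I mean, with :sid as a placeholder:

-- If STATE is not 'WAITING', then EVENT is the last wait, not a current one
SELECT sid, serial#, status, state, event,
       blocking_instance, blocking_session, sql_id
FROM   v$session
WHERE  sid = :sid;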

Tracing of the sessions produced no trace file, which would of course be the case if a session was purely using CPU and never encountered a wait event.
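
The tracing itself was the ordinary session-level kind; a minimal sketch using DBMS_MONITOR, with placeholder SID and SERIAL# values:

-- Placeholder sid/serial#; waits => TRUE writes WAIT lines to the trace file
EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 1234, serial_num => 56789, waits => TRUE, binds => FALSE);
-- ...and, once finished, switch it off again
EXEC DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => 1234, serial_num => 56789);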

It was at about this point that I was called by the chap who had asked me to investigate in the first place. He mentioned the “enq: TM contention” event again, which of course I had not seen, as the SQL I was looking at was not waiting on it. At the time I was looking at the aforementioned SQL statement on the second instance, where it was doing the same thing, so I took a look at the general performance graph for instance 2. Somewhat to my surprise, there were the few sessions I was already looking at, executing the SQL I was investigating, but more importantly there were about 30 sessions all waiting on “enq: TM contention”:
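
The equivalent cross-instance overview from the command line is just an aggregation over GV$SESSION; a sketch:

-- Non-idle active sessions per instance, grouped by wait event
SELECT inst_id, event, COUNT(*) AS sessions
FROM   gv$session
WHERE  status = 'ACTIVE'
AND    wait_class <> 'Idle'
GROUP  BY inst_id, event
ORDER  BY sessions DESC;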

I felt a bit sheepish that I had been investigating what was apparently not the real problem, although it clearly was a problem; fortunately this was only about 25 minutes into the investigation, so I didn’t feel too stupid. I shifted my investigation to this new problem, and specifically to the insert statement at the top of the list of sessions getting the “enq: TM contention” wait event. This was the statement that was actually causing all of the product impact.

A web search for “enq: TM contention” brought up the usual “unindexed foreign keys” answer. I did not think this was the case, as nothing was new about the code and I had already been told that no changes, such as the dropping of an index, had been made to the system. I did however take a quick look.
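
The quick look amounts to checking for foreign key columns with no supporting index. A simplified sketch of such a check, with <SCHEMA> as a placeholder (it only treats an index as supporting if the column is its leading column, so composite keys may need a closer look):

-- Foreign key columns that are not the leading column of any index
SELECT c.owner, c.table_name, c.constraint_name, cc.column_name
FROM   dba_constraints  c
JOIN   dba_cons_columns cc
       ON  cc.owner           = c.owner
       AND cc.constraint_name = c.constraint_name
WHERE  c.constraint_type = 'R'
AND    c.owner = '<SCHEMA>'
AND    NOT EXISTS (
         SELECT 1
         FROM   dba_ind_columns ic
         WHERE  ic.table_owner     = c.owner
         AND    ic.table_name      = c.table_name
         AND    ic.column_name     = cc.column_name
         AND    ic.column_position = 1 );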

On the assumption that an unindexed foreign key would result in a large number of full table scans on the parent table, especially given the number of sessions running the insert, I generated an AWR report and looked for segments with large numbers of physical reads against them in the “Segments by Physical Reads” section. This didn’t help, however, as the top segment was the table being inserted into, and in any case only 41.7% of the reads had been captured, so I ran one of the many scripts that list unindexed foreign keys, along the lines of the sketch above. This brought back 462 rows, but none relating to the relevant table, so it did not look like an unindexed foreign key was the cause. I then ran an ASH report to see if it gave me any information on what objects were involved.

Top DB Objects              DB/Inst: ...
-> With respect to Application, Cluster, User I/O, buffer busy waits and
   In-Memory DB events only.
-> Tablespace name is not available for reports generated from the root PDB
   of a consolidated database.

      Object ID % Activity Event                             % Event
--------------- ---------- ------------------------------ ----------
Object Name (Type)                                    Tablespace
----------------------------------------------------- --------------------------
          17239     100.00 enq: TM - contention               100.00
<table from original SQL> (TABLE)                     <tablespace>
          -------------------------------------------------------------

Bingo! I thought. The waits were all against one specific table that was not mentioned in the insert statement itself, and it was the table having all the problems that I was originally investigating. I ran an ASH report for the original select statement, but this did not give me any more useful information, although it did concur with the Activity graphs. No objects were listed under Top DB Objects, and the blocking SID was listed as one of the other sessions executing the same select.

Top User Events             DB/Inst: 

                                                               Avg Active
Event                               Event Class        % Event   Sessions
----------------------------------- --------------- ---------- ----------
cursor: pin S wait on X             Concurrency          87.50       6.98
CPU + Wait for CPU                  CPU                  12.50       1.00
          -------------------------------------------------------------

I enabled tracing on the select statement, and the sessions waiting on “cursor: pin S wait on X” gave trace content such as the following:

WAIT #140701459264208: nam='cursor: pin S wait on X' ela= 69843963
idn=2372086739 value=10411000725504 where=21474836480 obj#=<obj#>
tim=<tim>
*** <date> 08:51:41.940
WAIT #140701459264208: nam='cursor: pin S wait on X' ela= 69805319
idn=2372086739 value=10411000725504 where=21474836480 obj#=<obj#>
tim=<tim>
*** <date> 08:52:51.761
WAIT #140701459264208: nam='cursor: pin S wait on X' ela= 69820542
idn=2372086739 value=10411000725504 where=21474836480 obj#=<obj#>
tim=<tim>

Confusingly, a query of DBA_OBJECTS showed the “obj#” pointing to a completely unrelated object. I knew that although the object named in the query was a table, in various schemas there was a view of the same name, so a select against it might well be rather more complicated. This indeed proved to be the case, but the view still did not access the object identified by the “obj#”, and an explain plan also showed no access to it. I reported these findings back and the product confirmed that there should be no access to that object. I tried a “select *” from both the table and the view. The table returned rows immediately, but the view did not; however, I suspected that this might be due to the joins in the view, and with a specific set of bind variables the view did return quickly. It therefore seemed that there was no specific problem with the view or the underlying tables.
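
The obj# lookup mentioned above is just a dictionary query; a sketch, with :obj_no as the value from the trace file, checking both OBJECT_ID and DATA_OBJECT_ID since I would not swear to which one a given trace line reports:

-- Map an obj# from a trace file back to a named object
SELECT owner, object_name, object_type, object_id, data_object_id
FROM   dba_objects
WHERE  object_id = :obj_no
OR     data_object_id = :obj_no;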

At this point I was struggling to understand what was going on due to the conflicting information. Then the reporter from the product asked if I could see any problem with one application server of a specific type, as it was not getting errors. Each application server of this type connects as a different user, and I could not see the user for the server mentioned in any of the queries getting the “cursor: pin S wait on X” wait. Perhaps the application servers that were getting the error were doing something strange with their environment, and perhaps by stopping these problem queries we could resolve the problem.

Killing the selects would apparently just result in them being reissued, so I asked if the application servers generating the problem queries could be temporarily stopped. This was done. I then killed the selects, and as they refused to die I even had to kill the O/S processes of the sessions that were apparently on the CPU. Although this removed the “cursor: pin S wait on X” waits, disappointingly it made no real difference to the sessions getting the “enq: TM contention” waits. There was a slight drop, but this was down to the application servers being shut down.

I took another look at the blocking sessions on instance 2. Originally the list had contained all the select statements getting “cursor: pin S wait on X” waits, but now there were just two sessions. The blocker was a session which had been idle for about 16 hours and had been connected for about two days. The blocked session was executing a delete from an apparently unrelated table and was waiting on “enq: TM contention”.

I knew that there were lots of foreign keys in the schema, as there should be, so the fact that the table appeared to be unrelated did not mean it could not be involved. As the blocking session had been idle for so long, it was decided to kill it.
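
For completeness, the kill itself was the standard one, with placeholder SID, SERIAL# and (as this was RAC) instance number:

-- Values are placeholders; the trailing @2 names the instance in a RAC database
ALTER SYSTEM KILL SESSION '1234,56789,@2' IMMEDIATE;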

As soon as the session was killed, the delete completed and then all the inserts completed. Why would a delete from an apparently unrelated table block the inserts? This I do not have an answer for. I looked at the relationships between all the various tables identified in the investigation, but I found no smoking gun. Perhaps the idle session or the blocked session also held some form of lock that was causing the problem, but there is no easy way to tell at this point. The “enq: TM contention” locks were all in mode 3, which is apparently the mode seen with unindexed foreign keys, but I don’t think that at this stage I am going to be able to work out the exact locking chain.
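
For anyone wanting to look at the lock modes themselves, a sketch of the kind of query that lists TM enqueue holders and waiters (for TM enqueues, ID1 is the object_id of the table):

-- TM enqueues currently held (LMODE > 0) or requested (REQUEST > 0)
SELECT l.inst_id, l.sid, l.lmode, l.request, o.owner, o.object_name
FROM   gv$lock l
LEFT   JOIN dba_objects o ON o.object_id = l.id1
WHERE  l.type = 'TM'
AND   (l.lmode > 0 OR l.request > 0)
ORDER  BY o.object_name, l.request DESC;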

The whole situation was therefore caused by one session blocking what seemed like an unconnected session. It took me about 2.5 hours to get to the point of killing the offending session and resolving the situation. Should I have killed the session earlier? Had I investigated it at the time, perhaps I could have worked out what the connection was, but it was just one session being blocked; it initially looked unconnected and none of the times matched, so in another scenario I might have wasted time looking at an unrelated blocking session. Perhaps I should have just killed the blocking session earlier, as it had been idle for so long and was causing a problem for at least one session, but my general approach is to only do things that I have reason to believe are related to the problem I am trying to resolve. In the end I did kill it, as it was the only blocking session left.

Lessons

  • Even if the services are running across all instances, as was the case here, the problem may only be visible on one. Therefore check all instances before diving into any deeper analysis.
  • Superficially unconnected blocked sessions can be causing the problem. Therefore, in a blocking scenario such as this, before going too deeply into the investigation, take a quick look at all the blocking sessions; if any of them are causing a problem, even an apparently unrelated one, and can be killed without impact, it is probably worth killing them.