
OEM inaccessible due to inability to connect to an instance

Symptom

  • All instances of a database are inaccessible on the SCAN or VIP listeners.
  • SSH hangs to the same IP addresses.
  • OEM inaccessible.

Cause

  • Not known exactly, but certainly an O/S problem.

Solution

  • Bounce the server.

Investigation

One Friday, one of my team mentioned that OEM was down just as I was walking out the door. He was dealing with it, so I left it to him. On the Monday, OEM was still inaccessible, and as it ran our backups and alerting, this was potentially a big problem.

I was pinged by one of my team who was trying to run backups manually because they had not run over the weekend. This was getting serious. After some investigation, this proved to be an unrelated problem. I asked my team about the other backups and alerts and it seemed that these were all fine, so perhaps it was not as bad as I had thought.

However, I still decided that I needed to investigate, as the issue was going nowhere and sooner or later it would become a problem. Over the weekend we had logged a severity 1 SR with Oracle, but due to inactivity on our side this had made no progress. From discussions with the people who had been investigating the situation before me, it seemed that even though instance 1 was available, you could not connect to OEM because connections to instance 2 hung. Exactly why this is I am not sure, but it is probably because OEM needs to spawn a number of connections. This would also explain the occasional reports that OEM was accessible again.

My first port of call was the OEM logs, but as it had been quite some time since I had investigated OEM I needed to find where they were. Via a helpful blog post and Doc ID 1448308.1 I determined that they were in “<EM_INSTANCE_BASE>/em/<OMS_NAME>/sysman/log” and so I started looking at them.
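For reference, my first pass over those logs was nothing more sophisticated than the commands below; EM_INSTANCE_BASE and OMS_NAME are placeholders for your own installation (the OMS name is usually something like EMGC_OMS1).

cd $EM_INSTANCE_BASE/em/$OMS_NAME/sysman/log
ls -lrt                              # most recently written files appear last
grep -ci error emoms.log emoms.trc   # quick count of anything mentioning "error"
tail -200 emoms.trc                  # then read the recent activity properly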

The main log file appears to be emoms.log, with its trace counterpart emoms.trc. However, there seemed to be no errors in them; in fact I could find no relevant errors in any of the log files. This was rather confusing and certainly didn’t help the investigation. Other members of my team had already determined that the problem was with instance 2 and that nothing was able to connect to the other database instances on the same host via the SCAN listener. They had also determined that connecting via the local listener, and therefore using the host name in the connect string, did work, and had advised at least one of the teams with a database on the cluster to use that as a workaround.
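To make the workaround concrete, it amounts to pointing the connect string at the node’s own listener rather than the SCAN; the host names, port and service name below are purely illustrative.

# Normal route, via the SCAN name (illustrative names throughout)
sqlplus system@//prod-scan.example.com:1521/MYSVC

# Workaround: connect straight to node 2's local listener using the host name
sqlplus system@//dbnode2.example.com:1521/MYSVC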

My first test was to ping the host from the OMS server. This worked. I then tried using telnet to access port 22 (SSH) on the IP address for the SCAN listener on the problem host. This worked and I got back the SSH version string. I then configured my TNS environment to access the SCAN listener on the problem host and tried using SQL*Plus to connect, which did not work. Was there a problem with the traffic to the listener port? I then tried to SSH to the IP address for the SCAN listener on the second instance. This hung and after a while disconnected.
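In shell terms the sequence of checks was roughly the following, run from the OMS server; the addresses, port and service name are placeholders.

ping -c 3 192.0.2.10                     # the database host itself: fine
telnet 192.0.2.20 22                     # port 22 on the SCAN IP: banner comes back
sqlplus system@//192.0.2.20:1521/MYSVC   # the listener on the same IP: hangs
ssh oracle@192.0.2.20                    # a full SSH session to the SCAN IP: hangs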

At this point I had identified two problems, which may or may not have had the same cause. If I were to ask another team to investigate a problem we are having with TNS, they are likely to think it is a database problem. If, however, I talk to them about a problem with SSH, then I am likely to get a much better response. This is of course partly because SSH is far better understood by the other teams. But if I ask the other teams to investigate a problem with SSH and it turns out to be a different problem from the one with TNS, then we have not fixed the most important problem. It is therefore a judgement call as to whether the problems have the same cause or not. I decided in this case that they probably did, and so asked the other teams to investigate the SSH problem. I asked the network engineering team, rather than the network operations team, as the head of that team was sitting beside me at the time.

Whilst they were investigating, I carried on with my own investigation. I enabled TNS client tracing by adding the following to the sqlnet.ora:

# Use the traditional (non-ADR) trace locations so the directory below is honoured
DIAG_ADR_ENABLED=OFF
# Detailed client-side tracing (SUPPORT is the most verbose level)
TRACE_LEVEL_CLIENT=ADMIN
# Where the client trace files are written
TRACE_DIRECTORY_CLIENT=/tmp
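With that in place, each new connection attempt should drop a trace file into /tmp; the exact file name varies by client version, so the simplest check is just:

ls -lrt /tmp/*.trc | tail -5     # the most recently written client traces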

However the trace file did not show anything of use, nor did the listener logs on the server. I tried connecting to the VIP and this too did not work. I tried connecting from the A node to the B node and this did work on all the interfaces. I then tried SSH tracing and got the following:

$ ssh -vvv <IP address> # The SCAN listener IP address on instance 2.
OpenSSH_5.3p1, OpenSSL 1.0.1e-fips 11 Feb 2013
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to <IP address> [<IP address>] port 22.
debug1: Connection established.
debug3: Not a RSA1 key file <user home directory>/.ssh/id_rsa.
...
debug1: identity file <user home directory>/.ssh/id_rsa type 1
debug1: identity file <user home directory>/.ssh/id_rsa-cert type -1
debug1: identity file <user home directory>/.ssh/id_dsa type -1
debug1: identity file <user home directory>/.ssh/id_dsa-cert type -1
debug1: identity file <user home directory>/.ssh/id_ecdsa type -1
debug1: identity file <user home directory>/.ssh/id_ecdsa-cert type -1
debug1: Remote protocol version <version>, remote software version <version>
debug1: match: <ssh version string>
debug1: Enabling compatibility mode for protocol ...
debug1: Local version string <ssh version string>
debug2: fd 3 setting O_NONBLOCK
debug1: SSH2_MSG_KEXINIT sent
debug3: Wrote 864 bytes for a total of 885
Connection closed by <IP address>

Googling for “SSH2_MSG_KEXINIT” brought up many hits about problems with the host keys or the MTU size. However, the host keys had not changed in years, and how could they be the problem if connecting using the host name worked? When I changed the MTU size for the SQL*Plus connection, it still hung. I tried playing about with my user’s known_hosts file, but to no effect. By this time the network engineering team had concluded that although there was a firewall between the OMS host and the database servers, it was configured to allow all traffic, so this did not appear to be a networking problem. The sys admins had also concluded that there did not appear to be anything wrong with the server. TCPDUMPs had not shown anything beyond the initial connection.
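For completeness, the MTU and capture checks amount to something like the following; the interface name and address are placeholders, and the capture needs root.

ping -M do -s 1472 -c 3 192.0.2.20                   # 1472 bytes + 28 of headers = a 1500-byte MTU probe
tcpdump -i bond0 -nn host 192.0.2.20 and port 22     # watch the SSH handshake on the wire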

At this point I updated the team and left for the day. The following day I contacted both parties again to see if there had been any progress, but none had been made. I therefore decided to get the chaps on a conference call so that we could discuss what the problem might be and what we should do next. Prior to that, the network engineering chap asked for the routing tables from the A and B nodes. I thought I would not be able to generate these as I would not have the necessary permissions, but it turns out that you can, and that there are several ways to display the routing table:

/sbin/ip route        # the iproute2 tool
/bin/netstat -rn      # net-tools, numeric output
/sbin/route -n        # the older net-tools command

They all gave slightly different output and I passed this on along with the output of “/sbin/ifconfig -a”.
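Comparing the two nodes is then straightforward; the host names below are hypothetical and assume you can still reach both nodes by name (which, in this case, you could).

ssh oracle@dbnode1 '/sbin/ip route; /sbin/ifconfig -a' > node_a.txt
ssh oracle@dbnode2 '/sbin/ip route; /sbin/ifconfig -a' > node_b.txt
diff node_a.txt node_b.txt      # only the nodes' own addresses should differ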

On the conference call the sys admin chap explained that he had checked all kinds of things but found nothing wrong. The network engineering chap had compared the routing tables from the A and B nodes and, apart from the servers’ own IP addresses, they were identical. He then took us through the output of some tracing he had done on the firewall, which showed that SSH communication to the B node was decidedly more chatty than to the A node, and that the firewall was dropping a lot of this extra chat. It therefore looked like the B node was generating a lot of extraneous traffic and that this was probably causing the problem.

We concluded that bouncing the server was the way to go. It took a while to raise the change ticket, as we wanted to bounce a production server during the day. Before we had finished raising the change, it was suggested that we switch the OMS to its standby site, but I had already determined that that too would have the connectivity problem. We eventually got the change raised, and after the server was bounced, connectivity was restored.

Lessons

  • Don’t take significant action, such as moving to DR, unless you actually think it will resolve the problem. Just moving the OMS would not have done so; moving the database as well, however, would at least have restored OEM.
  • If you reasonably believe that the problem is outside Oracle, then try to show that something other than Oracle can see the problem. This does several things for you:
    • If nothing but Oracle can see the problem, then the problem is more likely to be with Oracle.
    • If you are talking to a SYS Admin, then saying that your DB File Sequential Read wait times are high is likely to get you a blank look. However, if you say that iostat’s await is high, they are much more likely to understand. In other words, talk their language (see the sketch after this list).
    • If you can show that an O/S tool can see the problem, such as SSH in this case, then you preempt any question about whether it is Oracle doing something silly. Before I started using OS Watcher iostat data to prove that an I/O problem was outside the database, I spent quite a lot of time explaining to people how Oracle did I/O, that this was all through standard system calls, and therefore that slow I/O would not be because Oracle was doing something silly. Of course, just because iostat shows that I/O is slow does not mean that Oracle is not causing the I/O to be slow. It could be placing too much load on the server, but at least you can prove that it is not a problem with how Oracle performs its I/O. You can also use the OS Watcher iostat data to see what load Oracle is placing on the server.
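As a sketch of the kind of figure I mean, the await column in iostat’s extended output is the one to quote; this assumes the sysstat iostat, and OS Watcher archives the same output over time.

# "await" is the average time (ms) an I/O request spends queued plus being serviced
iostat -x 5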
    • If you can show that an O/S tool can see the problem, such as SSH in this case, then you preempt any question about whether it is Oracle doing something silly. Before I started using OS Watcher iostat data to prove an I/O problem was outside the database, I spent quite a lot of time explaining to people how Oracle did I/O and that this was all through standard system calls and so slow I/O would not be because Oracle was doing something silly. Of course, just because iostat shows that I/O is slow, does not mean that Oracle is not causing the I/O to be slow. It could be placing too much load on the server, but at least you can prove that it is not a problem with how Oracle is performing its I/O. You can also use the OS Watcher iostat data to see what load Oracle is placing on the server.