Unable to connect to a new database in a new data centre

I was asked to look at why a product could not connect to its newly created standby database, which had been opened as a snapshot standby. Several people had already taken a look, but to no avail. I started by getting some details on the situation. This was a new standby in a data centre the product had never run out of before, and they were connecting from existing infrastructure in another data centre. The network team had concluded that it was unlikely to be a network issue, as they could ping the server and communicate on the relevant port that Oracle was listening on. I made a few connection attempts from various hosts and everything was fine. I talked to my trusty Sys Admin contact and he confirmed that everything looked fine on the server. I therefore started to trawl the Oracle logs.

The listener log showed successful connections from the relevant application servers, and there were no errors in the alert log that stood out. I tried enabling tracing database wide. This is not something I would normally do, but it was a snapshot standby and was just being used for testing. This, however, showed no evidence of the connections. I talked to my trusty Sys Admin contact again and he mentioned that the MTU (Maximum Transmission Unit) size for the relevant interface was set to 9000 on both servers. I discussed this with the network chaps, but they said that this should be negotiated down to 1500, as that is what would be used for inter-data centre communication.

I therefore gained access to the application server and tried enabling SQL*Net tracing to see what was going on. After entering the username and password, the connection seemed to hang. Looking at the generated trace, it was clear that after the password had been sent, SQL*Plus was waiting for a response which never came. Could this be MTU related? If so, why would the connection work at all?

I decided to test whether the MTU size was the issue and set “DEFAULT_SDU_SIZE” in the sqlnet.ora file to 1500. Now the connection worked as expected. I therefore added the “SDU” parameter to the relevant listener.ora file. Of course, this requires a reload of the listener which will lead to a brief failure of new connections as the databases re-register. With that in place, the application could connect successfully. It would appear that the packets early on in the communication never exceeded 1500, but the password packet did.

Discussions with various people established that 9000 was the standard MTU size within our organisation on all interfaces, and that something called “Path MTU Discovery” (PMTUD) should mean that the servers negotiate the maximum packet size that can go between them. What did this mean? Did PMTUD actually work?
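One way to check whether large packets really survive the path is to send pings of a known size with fragmentation forbidden. The following is a minimal sketch, not what was run during this investigation. It assumes a Linux host with the iputils `ping` (whose `-M do` option sets the Don't Fragment flag) and IPv4, where 28 bytes of each packet are taken up by the IP and ICMP headers. The function names are illustrative.

```python
import subprocess

# Assumption: IPv4 without options. 20-byte IP header + 8-byte ICMP header.
ICMP_OVERHEAD = 28


def ping_payload_for_mtu(mtu):
    """Largest ICMP payload that fits in a single packet of the given MTU."""
    return mtu - ICMP_OVERHEAD


def build_probe_command(host, mtu, count=3):
    """Build a ping command that forbids fragmentation (Linux iputils syntax)."""
    return [
        "ping",
        "-M", "do",            # set Don't Fragment, do not fragment locally
        "-c", str(count),      # number of probes to send
        "-s", str(ping_payload_for_mtu(mtu)),
        host,
    ]


def probe(host, mtu):
    """Return True if at least one full-sized, unfragmented ping got a reply."""
    result = subprocess.run(
        build_probe_command(host, mtu),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Calling `probe(host, 1500)` and `probe(host, 9000)` against the database server from an application server would show which sizes get through. If the 9000 probe gets no reply and no "fragmentation needed" error comes back either, that would be consistent with PMTUD not working on that path.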

We decided to set the MTU size on the relevant interface of the database server to 1500, as this would resolve the immediate situation, and then investigate any wider implications later. The settings were managed by our configuration management tool, and so things were somewhat more complicated. It meant waiting for the change in the configuration management tool to be made and then waiting for the change to be pushed out. It was past my going home time by this point, but I decided that I should stay to see it through rather than hand it over to someone else. It was a good thing I did: after the change was pushed out the interfaces were restarted, and this caused all kinds of weird problems with the listeners. I had thought they would recover when the interfaces came back, but instead they got into a bad state. I therefore had to bounce all the listeners, after which the problem was resolved.

The following day there were various discussions about what the impact of this was. It appeared that as long as one side of the communication had an MTU size of 1500, everything was fine, but when both had 9000 we had a problem. This was why I could connect successfully when the product could not. As it turned out, there was a known misconfiguration that meant Path MTU Discovery would not work in this situation, and that misconfiguration has now been resolved.
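A plausible reading of this behaviour, offered as an illustration rather than something confirmed during the investigation, is that each TCP endpoint advertises a maximum segment size (MSS) derived from its own interface MTU, and the connection effectively uses the smaller of the two. The sketch below assumes IPv4 and TCP headers of 20 bytes each with no options, and an inter-data centre path that carries at most 1500 bytes. The function names are hypothetical.

```python
# Assumptions: IPv4 and TCP headers without options, 20 bytes each.
IP_HEADER = 20
TCP_HEADER = 20


def advertised_mss(mtu):
    """MSS an endpoint advertises for an interface with the given MTU."""
    return mtu - IP_HEADER - TCP_HEADER


def effective_mss(client_mtu, server_mtu):
    """Segments are limited by the smaller of the two advertised values."""
    return min(advertised_mss(client_mtu), advertised_mss(server_mtu))


def largest_packet(client_mtu, server_mtu):
    """Largest IP packet the connection may put on the wire."""
    return effective_mss(client_mtu, server_mtu) + IP_HEADER + TCP_HEADER


def fits_path(client_mtu, server_mtu, path_mtu):
    """True if even a full-sized segment fits the path without PMTUD's help."""
    return largest_packet(client_mtu, server_mtu) <= path_mtu


# One side at 1500: full-sized packets already fit the 1500-byte path.
print(fits_path(1500, 9000, 1500))
# Both sides at 9000: full-sized packets depend on PMTUD working.
print(fits_path(9000, 9000, 1500))
```

Under these assumptions, a host with a 1500 MTU keeps every packet within what the path can carry, so PMTUD is never needed. With 9000 on both ends, short exchanges early in the connection still pass, but any packet larger than the path allows is lost unless PMTUD works.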