Database slow “due” to rebalance

Symptom

  • Slow performance reported during a rebalance.

Cause

  • Not the rebalance.
  • Actual cause is not known at this point. It is likely to be a problem with the storage layer.

Solution

  • Unknown at this point.

Investigation

I was asked to look at a production database as it had been rather slow early in the morning apparently due to a rebalance. I was rather sceptical about this as a rebalance had never caused a problem before, even though this one had been allowed to run into the online day.

I first confirmed the sequence of events, which appeared to be as follows:

13:45 Rebalance started at power 11 after the addition of storage
02:30 Poor I/O performance reported on the database
11:15 Rebalance power dropped to 1
12:00 Rebalance paused

As usual, my first port of call in such an investigation was the DB file sequential read graph:
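
For anyone wanting to produce this kind of series themselves, it can be pulled out of AWR history. The sketch below is one way of doing it, assuming access to the DBA_HIST views and the python-oracledb driver; the connection details are placeholders and counter resets across instance restarts are ignored.

    # Sketch: average "db file sequential read" wait per AWR snapshot, in ms.
    # Assumes the standard DBA_HIST views and the python-oracledb driver;
    # user/password/dsn below are placeholders.
    import oracledb

    SQL = """
    select s.end_interval_time,
           ( e.time_waited_micro
             - lag(e.time_waited_micro) over (partition by e.dbid, e.instance_number
                                              order by e.snap_id) )
           / nullif( e.total_waits
             - lag(e.total_waits) over (partition by e.dbid, e.instance_number
                                        order by e.snap_id), 0 )
           / 1000 as avg_wait_ms
    from   dba_hist_system_event e
    join   dba_hist_snapshot s
           on  s.snap_id = e.snap_id
           and s.dbid = e.dbid
           and s.instance_number = e.instance_number
    where  e.event_name = 'db file sequential read'
    order  by e.snap_id
    """

    with oracledb.connect(user="perfuser", password="...", dsn="proddb") as conn:
        with conn.cursor() as cur:
            for end_time, avg_wait_ms in cur.execute(SQL):
                if avg_wait_ms is not None:          # first row of each instance has no delta
                    print(f"{end_time}  {avg_wait_ms:6.1f} ms")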

Instantly I felt more confident in my original scepticism that the rebalance was the issue. Something appears to have happened around 6am on the 1st, the middle of the graph, which took the average single block read time from 4ms to 5ms, but it remained at that level throughout the period of the rebalance, apart from the spike, which peaked at 2:30am on the 3rd. There was no correlation between the rebalance and the poor I/O. As always in such a situation, my next step was to generate some iostat graphs from the OS Watcher data.
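
The processing behind those graphs is straightforward. A rough sketch along the following lines will do, assuming the usual OS Watcher oswiostat layout in which each sample starts with a "zzz ***<timestamp>" line followed by iostat -x output; the device-name filter, column positions and timestamp format below are assumptions and will need adjusting to the files in hand.

    # Sketch: hourly averages of sectors read/s and await across the data LUNs,
    # parsed from OS Watcher oswiostat files. Column layout assumed:
    #   Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
    import glob
    from collections import defaultdict
    from datetime import datetime

    DATA_LUN_PREFIX = "dm-"   # hypothetical filter for the database data LUNs

    def sample_hour(zzz_line):
        # e.g. "zzz ***Mon Jan  1 06:00:01 GMT 2024" -> datetime truncated to the hour
        p = zzz_line.replace("zzz ***", "").split()
        ts = datetime.strptime(f"{p[1]} {p[2]} {p[3]} {p[5]}", "%b %d %H:%M:%S %Y")
        return ts.replace(minute=0, second=0, microsecond=0)

    rsec_by_hour  = defaultdict(list)   # hour -> summed rsec/s across data LUNs, per sample
    await_by_hour = defaultdict(list)   # hour -> mean await (ms) across data LUNs, per sample

    for path in sorted(glob.glob("archive/oswiostat/*.dat")):
        hour, rsec_total, awaits = None, 0.0, []
        with open(path) as f:
            for line in f:
                if line.startswith("zzz ***"):
                    if hour is not None:             # flush the previous sample
                        rsec_by_hour[hour].append(rsec_total)
                        if awaits:
                            await_by_hour[hour].append(sum(awaits) / len(awaits))
                    hour, rsec_total, awaits = sample_hour(line), 0.0, []
                elif line.startswith(DATA_LUN_PREFIX):
                    fields = line.split()
                    rsec_total += float(fields[5])   # rsec/s column in this layout
                    awaits.append(float(fields[9]))  # await column (ms) in this layout
        if hour is not None:                         # flush the final sample
            rsec_by_hour[hour].append(rsec_total)
            if awaits:
                await_by_hour[hour].append(sum(awaits) / len(awaits))

    for hour in sorted(rsec_by_hour):
        rsec = sum(rsec_by_hour[hour]) / len(rsec_by_hour[hour])
        aw = await_by_hour[hour]
        avg_await = sum(aw) / len(aw) if aw else 0.0
        print(f"{hour:%d %b %H:00}  {rsec:>9.0f} sectors read/s  (~{rsec * 512 / 1e6:4.0f} MB/s)"
              f"   await {avg_await:5.1f} ms")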

This is a stacked line graph of the number of sectors read per second for all of the database data LUNs, averaged over each hour. A sector is 512 bytes, so the highest peak at just over 800,000 sectors per second equates to about 400MB/s. Clearly the rebalance has had an effect on the I/O load, but not one that should cause 30ms single block reads. You will note that the load sits at about 500,000 sectors per second until about midnight on the 2nd/3rd and then increases. This is because the application does more or less nothing over the weekend but is active 24 hours a day during the week, so the increase represents the application load. There is a correlation between the I/O load and the period during which the rebalance was active, and between the I/O load and the activity of the application; however, there is again no correlation between the load and the performance problem under investigation.

I then took a look at the await value, again averaged over each hour and in ms:

My first thought was that here was a correlation between the rebalance and poor performance, as the await clearly increases when the rebalance starts and decreases when the rebalance stops. As there was a chap from the product team looking over my shoulder at the time, and I had confidently told him that the problem was not the rebalance, a little bit of panic set in.

Fortunately, I remembered an investigation I had done a week or two earlier where the await in both the good and bad times had been consistently above the single block read time. That database had been doing large amounts of multi-block reads. The await metric is the average time for an I/O request issued to the device to be served, i.e. the time between the request being issued and it completing, including any time spent queued. It takes no account of how big each request is. The rebalance was no doubt reading large amounts of data in each call, so await is bound to increase.
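
A toy calculation (the numbers are made up, not taken from this system) shows how a modest proportion of large, slower-per-call rebalance reads pulls the blended await well above the single block read time, even though the small reads themselves are still fast:

    # Toy numbers only (assumptions, not measurements from this system): a mix of
    # small single-block reads and large rebalance reads, each with its own per-call latency.
    single_block = {"count": 900, "ms": 5.0}    # e.g. 8KB reads, still fast
    rebalance    = {"count": 100, "ms": 25.0}   # e.g. 1MB reads, slower per call

    total_calls = single_block["count"] + rebalance["count"]
    blended_await = (single_block["count"] * single_block["ms"] +
                     rebalance["count"]    * rebalance["ms"]) / total_calls
    print(f"blended await: {blended_await:.1f} ms")   # 7.0 ms, well above the 5 ms single-block time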

With this slight hiccup explained, I could then show again that there was no correlation between the rebalance and the performance problem. Confident that this problem was not with the database or the rebalance operation, I reported my findings and left the database side of the investigation there.

Lesson

Know what your metrics actually measure. That way you will not find yourself confused and panicking when doubt is cast on your conclusion by a misinterpretation.