
Database hangs on “log file switch (archiving needed)”

Symptom

  • One instance of a database briefly hangs once or twice a day waiting on log file switch (archiving needed).
  • Hangs seemed to resolve themselves.

Cause

  • High log switch frequency.
  • Probable bug causing archiver process to continuously fail to archive anything.

Solutions

  • Increase size of redo logs.
  • Reduce REOPEN time.
  • Kill the offending archiver process, or reduce and then restore log_archive_max_processes so that it is stopped and restarted gracefully.

Investigation

Several DBAs had been working on this and the problem had occurred on two previous occasions. The database, or rather one instance of the database, would appear to hang for a few minutes and then return to normal. This was happening once or twice a day. In the alert logs error messages of the following form were appearing:

Errors in file
<trace file name>:
ORA-19504: failed to create file
"<archive log file name>"
ORA-17502: ksfdcre:4 Failed to create file
<archive log file name>^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
ARC7: Error 19504 Creating archive log file to
'<archive log file name>'

Throughout this investigation the long string of “^@” characters caused me concern, but I suspect that this is a problem with the error message rather than with the file name itself. I did, however, manage to create a file with the same name manually, which suggests the name was valid.

The errors in the alert log were far more frequent than the database hangs, so clearly there was another factor involved. Looking at the associated trace file there were further error messages:

Unable to create archive log file
'<archive log file name>'
ORA-19504: failed to create file
"<archive log file name>"
ORA-17502: ksfdcre:4 Failed to create file
<archive log file name>^@
ORA-17500: ODM err:No such file or directory
*** <timestamp> 4929 krsh.c
ARC7: Error 19504 Creating archive log file to
'<archive log file name>'
*** <timestamp> 6944 krsi.c
krsi_dst_fail_caller: dest:1 err:19504 force:0 blast:1
DDE rules only execution for: ORA 312
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----
Successfully dispatched
----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (SUCCESS, 1 csec) -----
Executing ASYNC actions
----- END DDE Actions Dump (total 1 csec) -----
DDE: Problem Key 'ORA 312' was flood controlled (0x5) (no incident)
ORA-00312: online log 8 thread 2: '<redo log file name>'
ORA-00312: online log 8 thread 2: '<redo log file name>'

The ORA-17500 error message is the most significant here, stating the standard O/S error message “No such file or directory”. Note that this database uses NFS, hence the “ODM” prefix. This message suggested to me that there was a problem with the file system and so I tried to look at the /var/log/messages file. This file however was rather devoid of useful information. It consisted almost exclusively of messages of the following form:

<timestamp> <hostname> xinetd[4244]: EXIT: nrpe status=0
pid=58652 duration=0(sec)
<timestamp> <hostname> xinetd[4244]: START: nrpe pid=59472
from=<IP address>
<timestamp> <hostname> xinetd[4244]: EXIT: nrpe status=0
pid=59472 duration=0(sec)
<timestamp> <hostname> xinetd[4244]: START: nrpe pid=60325
from=<IP address>

There were none of the usual messages I would expect and certainly no error messages relating to the problem. I therefore did a spot of Googling and discovered the file that configured the syslog was, in this case, /etc/rsyslog.conf. This contained the following relevant lines:

...
*.info;mail.none;authpriv.none;cron.none;local0.*;local5.none          /var/log/messages
...
#--- Send messages to <old monitoring solution>
*.crit;kern.*;authpriv.info;syslog.info;cron.warn;mail.none;local1.*;local5.*  @<IP address>;RSYSLOG_ForwardFormat

I am no expert when it comes to this configuration, but it seemed to me that all the relevant messages would be going to the <old monitoring solution> at the IP address in the configuration file. I therefore tried to get a sys admin to help me with getting at the information that was going to the <old monitoring solution>. Whilst waiting for them I did some further investigation into why the error in the alert log occurred far more frequently than the performance problem in the database. It turned out to be down to the rather too frequent switching of redo logs. Archiver process 7 received the following error:

*** <date> 23:02:15.406225 4929 krsh.c
Unable to create archive log file
'<archived log file>'
ORA-19504: failed to create file
"<archived log file>"
ORA-17502: ksfdcre:4 Failed to create file
<archived log file>^@
ORA-17500: ODM err:No such file or directory

A few minutes later archiver process 8 retries the destination:

[arc8 process]
*** <date> 23:08:10.403
krsd_reopen_chk: Request to check destination, call krsi_dest_check
krsd_reopen_chk: Reactivate destination LOG_ARCHIVE_DEST_1, REOPEN has
elapsed
krsd_reopen_chk: failcnt = 1
krsd_reopen_chk: Request to check destination, call krsi_dest_check

A few minutes later there is a message in the alert log that the archiving has now succeeded:

<date> 23:10:15
Archived Log entry 1570821 added for thread 2 sequence 183451 ID 0x10d3af75
dest 1:
krse_arc_driver_core: Successful archiving of previously failed ORL

During this time, however, the database had hung.
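
Sessions stuck on the wait during such a window can be seen in the active session history. A query along the following lines (a sketch only, assuming Diagnostics Pack licensing; the columns are from the standard gv$active_session_history view) shows how many sessions were waiting at each sample:

select inst_id, sample_time, count(*) as sessions_waiting
  from gv$active_session_history
 where event = 'log file switch (archiving needed)'
   and sample_time > sysdate - 1    -- limit to the period of interest
 group by inst_id, sample_time
 order by inst_id, sample_time;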

Although this problem only lasted for a few minutes, it caused various connections to be dropped and caused problems for the application. There were no messages in the alert log about this hang. If we look at the switching of the redo logs around this period we see the following:

SQL> select first_time, next_time, sequence#
2 from v$archived_log
3 where FIRST_TIME between to_date('<date>','YYYYMMDDHH24') and
to_date('<date>+1','YYYYMMDDHH24MI')
4 and dest_id = 1
5 and thread# = 2
6 order by sequence#;

FIRST_TIME          NEXT_TIME           SEQUENCE#
------------------- ------------------- ----------
    <date> 22:00:15     <date> 22:00:18     183427
    <date> 22:00:18     <date> 22:07:58     183428
    <date> 22:07:58     <date> 22:16:23     183429
    <date> 22:16:23     <date> 22:24:29     183430
    <date> 22:24:29     <date> 22:31:57     183431
    <date> 22:31:57     <date> 22:38:51     183432
    <date> 22:38:51     <date> 22:39:45     183433 <-- switching every minute for the next approximately 10 redo logs
    <date> 22:39:45     <date> 22:40:36     183434
    <date> 22:40:36     <date> 22:41:21     183435
    <date> 22:41:21     <date> 22:42:13     183436
    <date> 22:42:13     <date> 22:43:16     183437
    <date> 22:43:16     <date> 22:44:22     183438
    <date> 22:44:22     <date> 22:45:28     183439
    <date> 22:45:28     <date> 22:46:22     183440
    <date> 22:46:22     <date> 22:47:16     183441
    <date> 22:47:16     <date> 22:48:13     183442
    <date> 22:48:13     <date> 22:52:05     183443 <-- end
    <date> 22:52:05     <date> 22:52:56     183444
    <date> 22:52:56     <date> 22:53:50     183445
    <date> 22:53:50     <date> 22:54:38     183446
    <date> 22:54:38     <date> 22:55:26     183447
    <date> 22:55:26     <date> 22:56:30     183448
    <date> 22:56:30     <date> 22:57:30     183449
    <date> 22:57:30     <date> 22:58:30     183450
    <date> 22:58:30     <date> 23:02:09     183451 <-- The redo log that archiver process 7 failed to archive
    <date> 23:02:09     <date> 23:03:14     183452
    <date> 23:03:14     <date> 23:04:14     183453
    <date> 23:04:14     <date> 23:05:26     183454
    <date> 23:05:26     <date> 23:06:32     183455
    <date> 23:06:32     <date> 23:10:15     183456 <-- The "NEXT_TIME" correlates with the point the database becomes operational again.
    <date> 23:10:15     <date> 23:11:51     183457
    <date> 23:11:51     <date> 23:13:30     183458
    <date> 23:13:30     <date> 23:15:21     183459
    <date> 23:15:21     <date> 23:17:16     183460
    <date> 23:17:16     <date> 23:19:16     183461
    <date> 23:19:16     <date> 23:21:25     183462
    <date> 23:21:25     <date> 23:23:47     183463
    <date> 23:23:47     <date> 23:26:17     183464
    <date> 23:26:17     <date> 23:29:05     183465
    <date> 23:29:05     <date> 23:32:02     183466

The timeline would therefore appear to be as below. Note that there are six redo log groups; a query to confirm the per-group state is sketched after the timeline.

  • 23:02:15 Redo log switch, but a failure to archive the finished redo log (sequence 183451)
  • 23:03:14 Switch of redo log, but no attempt to archive. Two redo logs are now unarchived.
  • 23:04:14 Switch of redo log, but no attempt to archive. Three redo logs are now unarchived.
  • 23:05:26 Switch of redo log, but no attempt to archive. Four redo logs are now unarchived.
  • 23:06:32 Switch of redo log, but no attempt to archive. Five redo logs are now unarchived.
  • 23:08:10 Archive destination is checked
  • 23:08:35 (approximately) Sessions start hanging waiting on “log file switch (archiving needed)”.
  • 23:10:15 Switch of redo log, archiving of all redo logs completes and sessions cease hanging
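
The per-group state behind this can be confirmed from v$log (a sketch; the columns are from the standard view). While the problem is in full swing every group for the thread shows ARCHIVED = NO, so there is nowhere left to switch to and sessions queue up on the wait:

select thread#, group#, sequence#, status, archived
  from v$log
 where thread# = 2
 order by sequence#;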

This suggests that at 23:08:35 the last redo log group that had already been archived was filled, causing the database to hang until the backlog of unarchived redo logs had been cleared, which completed at 23:10:15. Thus, when a high log switch rate coincides with the error, the database can run out of reusable (already archived) redo logs before the archive destination is retried and the logs are archived. This then leads to a brief hang of the database. I therefore suggested to the product team that increasing the size of the redo logs by an order of magnitude would resolve the problem. I later also suggested that changing the REOPEN time from the current default value of 300 seconds to 60 seconds would at least reduce the duration of the problem and possibly stop it from occurring.
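
For illustration, the two changes look something like the following. The group numbers, sizes and paths are hypothetical rather than the real values from this system; new, larger groups are added for the thread, the old groups are dropped once they are INACTIVE and archived, and the REOPEN attribute on the archive destination is shortened:

-- Add larger redo log groups for the affected thread, then retire the old ones.
alter database add logfile thread 2 group 13 ('<redo log file name>') size 8g;
alter database add logfile thread 2 group 14 ('<redo log file name>') size 8g;
-- ... repeat for as many groups as required, then, for each old INACTIVE, archived group:
alter database drop logfile group 8;

-- Shorten the retry interval on the archive destination from the default of 300 seconds.
alter system set log_archive_dest_1 = 'LOCATION=<archive directory> REOPEN=60'
  scope = both sid = '*';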

By this point an experienced SYSOPS hand had been tinkering with the relevant configuration file. It appeared that there were in fact no error messages of the kind I was expecting actually being generated. Although I was somewhat sceptical of this, I asked the chap to set strace running on archiver process 7, which was the one hitting the problem. I was expecting to see data being generated at every log switch, as I had assumed that archiver process 7 was responsible for archiving logs to the relevant destination. This did not appear to be happening. I therefore started running strace for each archiver process. This showed that there was some kind of round-robin assignment of work to the archiver processes. This of course is a good thing: if one process gets into trouble, the database does not hang, or at least not for long, as another archiver process will be along shortly to try again.
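
To point strace at the right processes, the operating system PIDs of the archivers can be pulled from v$process with something like the query below (a sketch; the exact format of the PROGRAM column varies by platform and version). Each SPID can then be passed to strace -p on the relevant host:

select spid, program
  from v$process
 where program like '%(ARC%'
 order by program;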

From the strace data it was clear that only archiver process 7 was having a problem and that it had had a problem ever since it started. In fact it had never successfully archived a redo log. All the other 9 processes were fine. This led me to think that we were dealing with an Oracle bug and I raised a new SR. As of the time of writing there has been no result from this.

The strace files did show that there was a difference in the system calls when each archive processes was archiving a log:

Archiver process 7:

...
49249 11:58:20 lstat("<mount point>/<parent directory>/<archive directory>/<archive file name>", 0x7fff2c5d0320) = -1 ENOENT (No such file or directory)
49249 11:58:20 lstat("<mount point>", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
49249 11:58:20 lstat("<mount point>/<parent directory>", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
49249 11:58:20 lstat("<mount point>/<parent directory>/<archive directory>", {st_mode=S_IFDIR|0770, st_size=507904, ...}) = 0
49249 11:58:20 stat("<mount point>/<parent directory>/<archive directory>", {st_mode=S_IFDIR|0770, st_size=507904, ...}) = 0
49249 11:58:20 open("<mount point>/<parent directory>/<archive directory>", O_RDONLY) = 16
49249 11:58:20 fstatfs(16, {f_type="NFS_SUPER_MAGIC", f_bsize=32768, f_blocks=173277184, f_bfree=155014506, f_bavail=155014506, f_files=31876689, f_ffree=31875690, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
49249 11:58:20 stat("<mount point>/<parent directory>", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
49249 11:58:20 close(16)                = 0
49249 11:58:20 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 16
49249 11:58:20 fstat(16, {st_mode=S_IFREG|0644, st_size=3496, ...}) = 0
49249 11:58:20 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2b925fe000
49249 11:58:20 read(16, "/dev/mapper/system-root / ext4 r"..., 4096) = 3496
49249 11:58:20 close(16)                = 0
49249 11:58:20 munmap(0x7f2b925fe000, 4096) = 0
49249 11:58:20 open("<oracle home>/dbs/oranfstab", O_RDONLY) = -1 ENOENT (No such file or directory)
49249 11:58:20 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 16
49249 11:58:20 fstat(16, {st_mode=S_IFREG|0644, st_size=3496, ...}) = 0
49249 11:58:20 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2b925fe000
49249 11:58:20 read(16, "/dev/mapper/system-root / ext4 r"..., 4096) = 3496
49249 11:58:20 close(16)                = 0
49249 11:58:20 munmap(0x7f2b925fe000, 4096) = 0
49249 11:58:20 open("/etc/oranfstab", O_RDONLY) = -1 ENOENT (No such file or directory)
49249 11:58:20 open("/etc/mtab", O_RDWR|O_CLOEXEC) = -1 EACCES (Permission denied) <-- Note that this is read-write
...

Archiver process 0:

...
49221 12:05:12 lstat("<mount point>/<parent directory>/<archive directory>/<archive file name>", 0x7ffe185ccca0) = -1 ENOENT (No such file or directory)
49221 12:05:12 lstat("<mount point>", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
49221 12:05:12 lstat("<mount point>/<parent directory>", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
49221 12:05:12 lstat("<mount point>/<parent directory>/<archive directory>", {st_mode=S_IFDIR|0770, st_size=507904, ...}) = 0
49221 12:05:12 stat("<mount point>/<parent directory>/<archive directory>", {st_mode=S_IFDIR|0770, st_size=507904, ...}) = 0
49221 12:05:12 open("<mount point>/<parent directory>/<archive directory>", O_RDONLY) = 14
49221 12:05:12 fstatfs(14, {f_type="NFS_SUPER_MAGIC", f_bsize=32768, f_blocks=173277184, f_bfree=154993977, f_bavail=154993977, f_files=31876689, f_ffree=31875689, f_fsid={0, 0}, f_namelen=255, f_frsize=32768}) = 0
49221 12:05:12 stat("<mount point>/<parent directory>", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
49221 12:05:12 close(14)                = 0
49221 12:05:12 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 14
49221 12:05:12 fstat(14, {st_mode=S_IFREG|0644, st_size=3496, ...}) = 0
49221 12:05:12 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fb0827000
49221 12:05:12 read(14, "/dev/mapper/system-root / ext4 r"..., 4096) = 3496
49221 12:05:12 close(14)                = 0
49221 12:05:12 munmap(0x7f7fb0827000, 4096) = 0
49221 12:05:12 open("<oracle home>/dbs/oranfstab", O_RDONLY) = -1 ENOENT (No such file or directory)
49221 12:05:12 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 14
49221 12:05:12 fstat(14, {st_mode=S_IFREG|0644, st_size=3496, ...}) = 0
49221 12:05:12 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7fb0827000
49221 12:05:12 read(14, "/dev/mapper/system-root / ext4 r"..., 4096) = 3496
49221 12:05:12 close(14)                = 0
49221 12:05:12 munmap(0x7f7fb0827000, 4096) = 0
49221 12:05:12 open("/etc/oranfstab", O_RDONLY) = -1 ENOENT (No such file or directory)
49221 12:05:12 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 14  <-- Note that this is read-only
...

As can be seen from the above, in one instance /etc/mtab is being opened for reading and writing, whilst in the other it is only being opened for reading. As /etc/mtab is a root-owned file and should not be written to by the Oracle process, it seems only reasonable that the attempt to open it read-write fails. Whether this is a symptom or a cause is at this point unknown.

Whilst waiting for some input from Oracle and for the redo log size or REOPEN changes to be implemented I looked back at the two previous occurrences of this problem. The trace files did not exist, but the alert log entries did. These showed the same pattern with one archiver process, in those cases processes 9 and 1. The errors started immediately after a restart and finished at the next restart. I therefore suggested that we should stop the relevant archiver process either by setting “log_archive_max_processes” to a value such that the relevant process stops gracefully or by simply issuing a kill command on the process. Before suggesting these options I did test them and both were quick and safe.
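
A sketch of the gentler option is below. The values are illustrative only: with ten archiver processes (ARC0 to ARC9) and ARC7 misbehaving, the parameter is dropped far enough that the surplus processes, including the offending one, stop gracefully, and then raised again so that fresh archiver processes are started. The parameter is dynamic, so no restart is required:

-- Stop the surplus archiver processes gracefully (values are illustrative).
alter system set log_archive_max_processes = 4 scope = memory sid = '<instance name>';

-- Restore the original value; new archiver processes are started as needed.
alter system set log_archive_max_processes = 10 scope = memory sid = '<instance name>';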

The redo log change was delayed as it turned out to be rather more complicated than I had thought due to Streams. The REOPEN change however has been implemented, and although it has improved the situation, it is still possible for the performance problem to recur. However, a scheduled restart of the database has resolved the problem for the moment. I am still pursuing the SR with Oracle, but I suspect that it will not go anywhere for now as the problem has been resolved. Hopefully if it happens again we will be in a much better place.

Lessons

  • Initially I thought that the problem was going to be with the O/S or storage and so I waited more than a day for an experienced SYSOPS resource to get the appropriate messages going to /var/log/messages. Perhaps in a similar situation I would now continue investigating.
  • I am not overly familiar with the database environment and so I was rather more gung-ho than perhaps I should have been with my suggestion of increasing the redo log size. I knew that the standby databases would complicate things and I raised this, but as I did not know that Streams was in use I did not flag that as an issue. Perhaps in future I should be more cautious when making suggestions about environments I am unfamiliar with.