Solaris Tictimed Catastrophic File Error

Solaris administrators may have seen the message “Catastrophic file error – zero length” in their system logs. Although it sounds serious, there is nothing “catastrophic” about it. This post explains how to stop the message from flooding your log files.

Here is an example of the error:

May 21 13:58:01 pluto tictimed[27759]: [ID 584656 user.error] [tictimed] Catastrophic file error - zero length
May 21 13:58:01 pluto tictimed[27762]: [ID 584656 user.error] [tictimed] Catastrophic file error - zero length
May 21 13:58:01 pluto tictimed[27762]: [ID 584656 user.error] [tictimed] Catastrophic file error - zero length
May 21 13:58:01 pluto tictimed[27765]: [ID 584656 user.error] [tictimed] Catastrophic file error - zero length

The messages repeat every 10 minutes or so, and they might be accompanied by this message on the system console:

INIT: Command is respawning too rapidly. Check for possible errors.
id: LT "/usr/sbin/tictimed >/dev/msglog 2/dev/msglog

Global and non-global zones are affected.
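To get a feel for how badly a log is being flooded, the matching lines can be counted. The following is a minimal sketch: the function name count_tictimed_errors is made up for illustration, and /var/adm/messages is assumed to be where syslog writes user.error messages on your system (check /etc/syslog.conf if unsure).

```shell
# Hypothetical helper: count tictimed "Catastrophic file error" lines in a log.
# The default path is an assumption; pass another log file as the first argument.
count_tictimed_errors() {
    local log=${1:-/var/adm/messages}
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'tictimed.*Catastrophic file error' "$log"
}
```

Run it in each zone where you suspect the problem, since every zone keeps its own log.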

Oracle’s Answer – Remove a File. But Which One?

Googling for an answer leads only to an Oracle website with the following less-than-helpful information:

LWACT is removing the zero byte file and starting afresh. Occurs when the availability datagram file turns to 0 bytes in size for an unknown reason.

Action: For pre-LWACT 3.2 installation, remove the zero byte file, tictimed will recreate it. For LWACT 3.2 or higher versions, no action is required. LWACT will automatically remove the zero byte file.

So we must remove the zero-byte file, but the note gives neither the file's name nor its location.

Identify that File

The file it is talking about is in fact $LOGDIR/hostid.lwact.xml, as referred to at the bottom of the tictimed man page:

$ man tictimed
...
FILES
     /etc/default/lwact -  Configuration  file  of  light  weight
     availability  collection  tool.   $LOGDIR/hostid.lwact.xml -
     Availability data file generated by light weight  availabil-
     ity  collection  tool.   $UPDATE/lwact.update  - Update file
     containing the cause codes  to  be  assigned  for  the  last
     outage.    LOGDIR  and  UPDATE  are  configurable  variables
     defined in the /etc/default/lwact. They hold  the  directory
     path to their corresponding files.

where $LOGDIR is defined in the file /etc/default/lwact:

bash-3.00# grep LOGDIR /etc/default/lwact
# LOGDIR: path to directory where log file will be written.
LOGDIR=/var/log

And hostid is the host identifier, as printed by the hostid command:

bash-3.00# hostid
03f1c06a

So in this case the file to delete is /var/log/03f1c06a.lwact.xml
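The lookup above can be wrapped in a small bash function. This is only a sketch: the name lwact_data_file is invented for illustration, and it assumes LOGDIR is set by a plain LOGDIR=... line, as in the grep output shown earlier.

```shell
# Hypothetical helper: print the path of the LWACT availability data file.
# $1 = config file (default /etc/default/lwact)
# $2 = host identifier (default: output of the hostid command)
lwact_data_file() {
    local conf=${1:-/etc/default/lwact} id=${2:-$(hostid)} line logdir=
    # Read the config line by line; the last uncommented LOGDIR= line wins.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            LOGDIR=*) logdir=${line#LOGDIR=} ;;
        esac
    done < "$conf"
    # Fail if the config file does not define LOGDIR at all.
    [ -n "$logdir" ] || return 1
    printf '%s\n' "$logdir/$id.lwact.xml"
}
```

On the example host, calling lwact_data_file with no arguments would print /var/log/03f1c06a.lwact.xml.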

Delete the File

Now check that the file exists, and is indeed zero bytes long:

bash-3.00# ls -l /var/log/03f1c06a.lwact.xml
-rw-r--r--   1 root     root           0 May 18  2011 /var/log/03f1c06a.lwact.xml

Remove the file:

bash-3.00# rm /var/log/03f1c06a.lwact.xml
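If you would rather not rely on reading the ls output by eye, the check and the removal can be combined so that a file holding real data is never deleted. A minimal sketch, with remove_if_empty being a made-up name rather than anything shipped with LWACT:

```shell
# Hypothetical helper: delete a file only if it is a regular, zero-length file.
remove_if_empty() {
    local f=$1
    # -f: exists and is a regular file; ! -s: size is not greater than zero.
    if [ -f "$f" ] && [ ! -s "$f" ]; then
        rm "$f" && echo "removed $f"
    else
        echo "left alone: $f" >&2
        return 1
    fi
}
```

For the example host this would be called as remove_if_empty /var/log/03f1c06a.lwact.xml.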

Once the file is deleted, init will restart the tictimed daemon within 10 minutes, after which the program will keep running, rather than continually failing and respawning.

Doing a ps (a few minutes after removing the file) should show tictimed running properly:

bash-3.00# ps -ef | grep tictimed
    root 19328  5461   0 15:01:21 pts/7       0:00 grep tictimed
    root 17891  1286   0 14:57:27 ?           0:00 /usr/sbin/tictimed

A Note about Zones

It is possible to check for this error on all non-global zones at once. If you are logged into the global zone, type something like ls -l /zones/*/root/var/log/*.lwact.xml. One file should be listed for each non-global zone. Any file of zero size corresponds to a zone with the “catastrophic file error” problem, and can be deleted from the global zone, which fixes the problem in that non-global zone.
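The same check can be scripted so that only the zero-length files are printed. This is a sketch under two assumptions: the zone roots live under /zones, as in the example path above (yours may differ, see zoneadm list -cv), and the function name find_empty_lwact is invented for illustration.

```shell
# Hypothetical helper: list zero-length LWACT data files under each zone root.
# $1 = directory that holds the zone paths (default /zones, an assumption).
find_empty_lwact() {
    local zoneroot=${1:-/zones} f
    for f in "$zoneroot"/*/root/var/log/*.lwact.xml; do
        # Skip the unexpanded glob and any file that still holds data.
        if [ -f "$f" ] && [ ! -s "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

It only prints candidates and deletes nothing, so it is safe to run before deciding what to remove.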

On production systems, however, it is probably safer to log in to each zone and check for the problem separately.

Footnote

I originally used a couple of clues to find the name of the file. First, just by using find:

bash-3.00# find /var -name '*lwact*'
/var/sadm/pkg/SUNWlwact
/var/sadm/pkg/SUNWlwact/save/pspool/SUNWlwact
/var/log/03f1c06a.lwact.xml

and secondly by running tictimed in the foreground – it fails immediately but with no error message. The file name is shown by truss:

bash-3.00# truss tictimed 2>&1 | grep xstat | grep lwact
xstat(2, "/var/log/03f1c06a.lwact.xml", 0x08047DF8) = 0

Conclusion

Tictimed gives a misleading and slightly alarming error message if its “availability data file” is empty. As the Oracle note says, the problem behaviour was fixed in LWACT version 3.2; from that version on, tictimed removes the zero-byte file automatically.
