Solaris administrators may have seen the message “Catastrophic file error - zero length” in their system logs. Although it sounds serious, there is nothing “catastrophic” about it. This post explains how to stop the message from flooding your log files.
Here is an example of the error (scroll right):
May 21 13:58:01 pluto tictimed[27759]: [ID 584656 user.error] [tictimed] Catastrophic file error - zero length
May 21 13:58:01 pluto tictimed[27762]: [ID 584656 user.error] [tictimed] Catastrophic file error - zero length
May 21 13:58:01 pluto tictimed[27762]: [ID 584656 user.error] [tictimed] Catastrophic file error - zero length
May 21 13:58:01 pluto tictimed[27765]: [ID 584656 user.error] [tictimed] Catastrophic file error - zero length
The messages repeat every 10 minutes or so, and they may be accompanied by this message on the system console:
INIT: Command is respawning too rapidly. Check for possible errors. id: LT "/usr/sbin/tictimed >/dev/msglog 2/dev/msglog
Global and non-global zones are affected.
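To get a feel for how badly a log is being flooded, the matching lines can be counted. This is only a sketch: the function name is made up for illustration, and the default path /var/adm/messages is an assumption about where syslog writes user.error messages on your system, so pass a different file if yours differs.

```shell
# Count log lines containing the tictimed "zero length" error.
# count_zero_length_errors is a hypothetical helper name.
# The default log path is an assumption; pass another file as $1 if needed.
count_zero_length_errors() {
    local logfile=${1:-/var/adm/messages}
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c 'Catastrophic file error - zero length' "$logfile"
}
```

Run it as count_zero_length_errors, or give it a specific log file as the first argument.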
Oracle’s Answer – Remove a File. But Which One?
Googling for an answer leads only to an Oracle website with the following less-than-helpful information:
LWACT is removing the zero byte file and starting afresh. Occurs when the availability datagram file turns to 0 bytes in size for an unknown reason.
Action: For pre-LWACT 3.2 installation, remove the zero byte file, tictimed will recreate it. For LWACT 3.2 or higher versions, no action is required. LWACT will automatically remove the zero byte file.
So we must remove the zero byte file. But Oracle’s note gives neither the file name nor its location.
Identify that File
The file it is talking about is in fact $LOGDIR/hostid.lwact.xml, as referred to at the bottom of the tictimed man page:
$ man tictimed
...
FILES
     /etc/default/lwact - Configuration file of light weight availability collection tool.
     $LOGDIR/hostid.lwact.xml - Availability data file generated by light weight availability collection tool.
     $UPDATE/lwact.update - Update file containing the cause codes to be assigned for the last outage.

     LOGDIR and UPDATE are configurable variables defined in the /etc/default/lwact. They hold the directory path to their corresponding files.
where $LOGDIR is defined in the file /etc/default/lwact:
bash-3.00# grep LOGDIR /etc/default/lwact
# LOGDIR: path to directory where log file will be written.
LOGDIR=/var/log
And hostid is the system identity, the output of the hostid command:
bash-3.00# hostid
03f1c06a
So in this case the file to delete is /var/log/03f1c06a.lwact.xml
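The two lookups above can be combined into one small bash function. This is a sketch rather than anything shipped with LWACT: the function name is hypothetical, and it assumes LOGDIR is set on a plain, unquoted LOGDIR=... line as in the grep output shown earlier.

```shell
# Print the path of the availability data file: $LOGDIR/<hostid>.lwact.xml
# lwact_data_file is a hypothetical helper name, not part of LWACT.
# $1: config file (defaults to /etc/default/lwact)
# $2: host ID (defaults to the output of the hostid command)
lwact_data_file() {
    local conf=${1:-/etc/default/lwact}
    local id=${2:-$(hostid)}
    local logdir
    # Take the value of the last uncommented LOGDIR= line;
    # assumes the value is not quoted.
    logdir=$(sed -n 's/^LOGDIR=//p' "$conf" | tail -1)
    printf '%s/%s.lwact.xml\n' "$logdir" "$id"
}
```

With the configuration and host ID from this post, lwact_data_file prints /var/log/03f1c06a.lwact.xml.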
Delete the File
Now check that the file exists, and is indeed zero bytes long:
bash-3.00# ls -l /var/log/03f1c06a.lwact.xml
-rw-r--r-- 1 root root 0 May 18 2011 /var/log/03f1c06a.lwact.xml
Remove the file:
bash-3.00# rm /var/log/03f1c06a.lwact.xml
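If you would rather not rely on reading the ls output by eye, the size check and the removal can be tied together so that a non-empty data file is never deleted by mistake. A minimal sketch, with a hypothetical function name:

```shell
# Remove a file only if it exists as a regular file and is zero bytes long.
# remove_if_empty is a hypothetical helper name.
remove_if_empty() {
    local f=$1
    # -f: regular file exists; ! -s: its size is not greater than zero
    if [ -f "$f" ] && [ ! -s "$f" ]; then
        rm "$f"
        echo "removed $f"
    else
        echo "left alone: $f (missing or not empty)" >&2
        return 1
    fi
}
```

For the file in this post that would be remove_if_empty /var/log/03f1c06a.lwact.xml.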
Once the file is deleted, init will restart the tictimed daemon within 10 minutes, after which the program will keep running, rather than continually failing and respawning.
Doing a ps (a few minutes after removing the file) should show tictimed running properly:
bash-3.00# ps -ef | grep tictimed
    root 19328  5461   0 15:01:21 pts/7       0:00 grep tictimed
    root 17891  1286   0 14:57:27 ?           0:00 /usr/sbin/tictimed
A Note about Zones
It is possible to check for this error on all non-global zones at once. If you are logged in to the global zone, type something like ls -l /zones/*/root/var/log/*.lwact.xml. One file should be listed for each non-global zone. Any file with zero size belongs to a zone that has the “catastrophic file error” problem. It can be deleted from the global zone, which fixes the problem in the non-global zone.
On production systems, however, it is probably safer to log in to each zone and check for the problem separately.
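For the global-zone check, a read-only listing of just the empty files may be easier to scan than the full ls output. This sketch only prints paths and deletes nothing. The function name is hypothetical, and the /zones default simply mirrors the layout used in this post, so pass your own zone root directory if it differs.

```shell
# List zero-length availability data files under each zone root.
# list_empty_lwact_files is a hypothetical helper name.
# $1: directory holding the zone roots (defaults to /zones, as in this post)
list_empty_lwact_files() {
    local zones_base=${1:-/zones}
    local f
    for f in "$zones_base"/*/root/var/log/*.lwact.xml; do
        # -f skips the literal pattern when nothing matches;
        # ! -s is true for a zero-length file
        if [ -f "$f" ] && [ ! -s "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Each path printed corresponds to a zone whose data file is empty.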
Footnote
I originally used a couple of clues to find the name of the file. First, just by using find:
bash-3.00# find /var -name '*lwact*'
/var/sadm/pkg/SUNWlwact
/var/sadm/pkg/SUNWlwact/save/pspool/SUNWlwact
/var/log/03f1c06a.lwact.xml
and secondly by running tictimed in the foreground – it fails immediately but with no error message. The file name is shown by truss:
bash-3.00# truss tictimed 2>&1 | grep xstat | grep lwact
xstat(2, "/var/log/03f1c06a.lwact.xml", 0x08047DF8) = 0
Conclusion
Tictimed gives a misleading and slightly alarming error message if its “availability data file” is empty. According to Oracle’s note, the problem was fixed in LWACT 3.2: from that version on, tictimed removes the zero byte file automatically.