ZFS Corruption Persists in Unlinked Files

This article explains how ZFS errors can sometimes persist even after the files containing the offending blocks have been deleted, and presents an easy way to clear the corruption.

Most Solaris sysadmins will be familiar with the following situation. A zpool reports checksum failures and enters a degraded state:


# zpool status -v
  pool: pool1
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 2h7m with 10 errors on Tue Jan 10 19:38:00 2012
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       DEGRADED     0     0   130
          c0d2      ONLINE       0     0     0
          c0d4      DEGRADED     0     0   260  too many errors

errors: Permanent errors have been detected in the following files:

        pool1/u02@Backup_u02:/oracle/some/directory/logfile.log

ZFS is saying that the file logfile.log, located within the snapshot Backup_u02, contains blocks that are failing checksums. The errors might be caused by a physical problem with disk c0d4; the large error count (260) certainly points that way. In this case, though, there was no physical disk problem: the system was a Solaris LDOM hosted on a SPARC blade, and c0d4 was a virtual disk backed by a file on the LDOM parent. I checked the parent and found no corresponding errors.

Scrubbing the zpool

The usual course of action is to delete the corrupted file and run a zpool scrub to surface any further errors. Files in error are deleted and the zpool is scrubbed again. By repeating this process a few times it is sometimes possible to return a zpool to full working order, so long as there is no underlying physical problem. If there is a physical problem, the zpool will likely become corrupt again.
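For a file in the live file system, the cycle looks something like this. The path shown is hypothetical, assuming the dataset pool1/u02 is mounted at /pool1/u02:

# rm /pool1/u02/oracle/some/directory/logfile.log    # delete the corrupted file
# zpool scrub pool1                                  # re-check every block in the pool
# zpool status -v pool1                              # see whether any errors remain

Repeat until no further errors are reported.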

Rather than just deleting logfile.log, let’s remove the whole snapshot:

# zfs destroy pool1/u02@Backup_u02
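To confirm the snapshot really is gone, list the remaining snapshots of the dataset:

# zfs list -t snapshot -r pool1/u02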

Now check the status of the zpool:

# zpool status -v
  pool: pool1
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 2h7m with 10 errors on Tue Jan 10 19:38:00 2012
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       DEGRADED     0     0   130
          c0d2      ONLINE       0     0     0
          c0d4      DEGRADED     0     0   260  too many errors

errors: Permanent errors have been detected in the following files:

        <0x398>:<0x40229c>

Hex codes like these often appear in the output of "zpool status -v" after a troublesome file has been deleted; the more files that are deleted, the more hex entries appear.

Errors in Blocks Belonging to Deleted Files

The code on the right, 0x40229c, is the inode (object) number of the deleted file; the code on the left, 0x398, identifies the dataset it belonged to. Though the file has been unlinked, its blocks still exist: the file is being held open by some running process. As with older file systems, ZFS will not fully delete a file until it is no longer held open by any process and is no longer linked from the file system.
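Converting the hex object number to a decimal inode number is a one-liner in any shell with C-style arithmetic expansion (bash or ksh93, for example):

# echo $((0x40229c))
4203164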

To remove the corruption, simply restart whichever program is holding inode 4203164 open. You can search for the appropriate process using lsof or pfiles. My Solaris system does not have lsof, so here is a quick pipeline using pfiles to print the PIDs of all processes that have inode 4203164 open:


# ps -ef | grep -v PID | awk '{print "pfiles "$2}' | sh | awk '/^[0-9]/ {pid=$1} /4203164/ {print pid}'

5606:
pfiles: cannot examine 995: no such process
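
For readability, here is the same pipeline spread over several lines with comments; it is functionally identical to the one-liner above:

ps -ef |                          # list every process on the system
  grep -v PID |                   # drop the ps header line
  awk '{print "pfiles "$2}' |     # build a "pfiles <PID>" command (PID is column 2)
  sh |                            # run all of those commands
  awk '/^[0-9]/ {pid=$1}          # pfiles headers begin with the PID
       /4203164/ {print pid}'     # print that PID when the inode number appears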

So the deleted file is being held open by process 5606 (the "cannot examine" message simply means process 995 exited while the pipeline was running). Check it:


# ps -ef | grep 5606
dwdev2 5606 5393 0 Jan 05 ? 26:57 /u01/oracle/DWDEV1/apps/tech_st/10.1.3/appsutil/jdk/bin/java -DCLIENT_PROCESSID=4

Unsurprisingly, it’s an Oracle application process that is holding open the deleted logfile and thus unwittingly preserving the corrupted data blocks on the system.

Pfiles Cautionary Note

"pfiles" briefly halts a process in order to examine its internals. Some administrators believe this can be dangerous on a live system, so exercise judgement before running it. In particular, the above pipeline runs pfiles against every running process, which may take a few minutes on a busy system.
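If you already have a suspect process, it is gentler to examine just that one instead of sweeping the entire process table:

# pfiles 5606 | grep 4203164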

Stopping the Process and Final Scrub

In this case the application process was restarted (by arrangement with the database administrator), releasing the file blocks that so displeased ZFS. zpool status still shows the corruption, however:


# zpool status -v
...
errors: Permanent errors have been detected in the following files:

        <0x398>:<0x40229c>

ZFS needs to take another look and update its records; a zpool scrub will do it:

# zpool scrub pool1

The scrub will take around two hours to complete, but it is not strictly necessary to wait that long. zpool seems to run a final check just before a scrub terminates, so if the scrub is stopped immediately it will still perform that check and mark the pool as clean:

# zpool scrub -s pool1
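To verify the result without reading the full output, zpool status -x reports only pools that have problems; once the error log is cleared it should report pool1 as healthy:

# zpool status -x pool1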

Now the pool is clean:

# zpool status -v
  pool: pool1
 state: ONLINE
 scrub: scrub in progress for 1h58m, 79.04% done, 0h31m to go
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          c0d2      ONLINE       0     0     0
          c0d4      ONLINE       0     0     0

errors: No known data errors

9 thoughts on "ZFS Corruption Persists in Unlinked Files"

  1. Great bit of information – thanks

    My ESXi-based OpenIndiana server seems to hit this problem constantly when running a scrub, yet it's fine when booted directly from a Live OI CD. I wonder if VMware Tools is interfering with normal operation at the ZFS level.

  2. Had a similar problem on a FreeBSD 9.1 system. The system is several years old and this is the first time it has occurred. I do see occasional "sector repaired" errors when I run a scrub.

    The file in question was not actually corrupt: it was a BitTorrent share, so it was easy enough to have BitTorrent re-hash the file. Nice to clear the error.

  3. In case anyone needs the "lsof" equivalent of the "pfiles"-dependent command line above, it is this:

    ps -ef | grep -v PID | awk '{print "lsof -a -p "$2}' | sh | awk '/176905/ {print $2}'

    where 176905 is the decimal equivalent of the hex inode number, and the PID appears in column 2 of both the ps -ef and the lsof output.
