My pool has errors despite redundant disks, but a scrub didn't repair them. Why?
  pool: nas
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0h0m with 0 errors on Mon Dec 9 10:19:57 2019
config:

        NAME        STATE     READ WRITE CKSUM
        nas         ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x39>
nas is running Ubuntu with ZFS version 0.7.5-1ubuntu16.6.
Did you read the Contribution guidelines?
https://github.com/zfsonlinux/zfs/blob/master/.github/CONTRIBUTING.md
This looks like a support request or a question, which should primarily be asked on the mailing list instead.
Support request? This is a bug; ZFS didn't protect my data with mirrored disks!?
@opili892 If it is a bug, we need logs (debug info) or a way to reproduce the issue at hand.
Considering you didn't confirm this actually is a bug, your post sounds more like a "help me figure out what happened" than a bug report.
To help you, some questions:
To be fair: I'm not sure this IS a bug.
The pool is corrupted with errors and can't be repaired. How many times does this need to occur before it's called a bug?
small database files
no
I didn't corrupt my own data. Are you asking if I can do it more than once?
Every month, but this pool is new. Today a scrub was run: scan: scrub repaired 0B in 0h0m with 0 errors on Mon Dec 9 10:19:57 2019
The disks are new, Amazon Black Friday new.
I don't use SMART.
And to be fair, I didn't ask your opinion.
Can you please check the dmesg logs on the system for any device-related failures? This may help shed some light on what you're seeing. The odd thing is that zpool status should be reporting at least one checksum or read error if you're seeing persistent errors.
After checking dmesg, what I'd suggest is to run zpool clear to reset the error counts. Then run zpool scrub, and when it's done, zpool status -v again to determine if the errors are really persistent.
You may need to run zpool scrub twice to clear the error message buffer. It contains records for the current state and previous scrub.
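As a minimal sketch of that sequence (assuming the pool name nas from the report above and an illustrative dmesg filter; run as root and adjust for the actual system):

        # look for device-related failures in the kernel log
        dmesg | grep -iE 'ata|scsi|blk|i/o error'

        # reset the per-device error counters
        zpool clear nas

        # scrub, then re-check once it has finished
        zpool scrub nas
        zpool status -v nas

        # repeat: the error list keeps records for the current and the
        # previous scrub, so a second pass may be needed to clear it
        zpool scrub nas
        zpool status -v nas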
@behlendorf Is this type of error always related to metadata or is the metadata part a special case?
What is there to do if the errors are really persistent?
Does this comment imply that these errors are more likely to happen on USB-connected drives?
If a snapshot of the pool was made before the error incident happened, will restoring the snapshot automatically "clear" the errors?
The scrub finished, with zpool status -v afterwards saying:
  scan: scrub repaired 0B in 0 days 08:54:31 with 0 errors on Wed Feb 12 20:50:37 2020
Now the metadata errors are gone. I'm still not entirely sure what caused them.
I have seen this issue as well. In my case the files in question are discovered as corrupt during a zfs send, and previous scrubs did not see them. I have seen at least one of these corrupt-file issues disappear after/during a scrub. I'm not seeing anything related to the backing drives in dmesg at all.
Additionally, I've seen corrupt files become corrupt metadata, as the OP has. It happened when I renamed the dataset with those files, restored it from a replica, and then destroyed the renamed version with the corruption. The corrupted metadata showed up until, I presume, the background destroy finished.
linux amd64, debian buster, debian backports, kernel 5.4.0-0.bpo.2-amd64, zfs-dkms 0.8.2-3~bpo10+1
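Roughly, the sequence described above looks like the following; the pool, dataset, and snapshot names (tank/data, backup/data@latest) are hypothetical, since the real ones were not given:

        # rename the dataset that contains the corrupt files
        zfs rename tank/data tank/data-old

        # restore it from a replica under the original name
        zfs send backup/data@latest | zfs receive tank/data

        # destroy the renamed copy; the destroy completes asynchronously in
        # the background, which is roughly when the <metadata> errors showed up
        zfs destroy -r tank/data-old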
You may need to run zpool scrub twice to clear the error message buffer. It contains records for the current state and previous scrub.
This saved my pool. I got errors similar to @opili892's when I played around with vdev device names while doing zfs send/receive. Stupid me. The pool ended up with this status:
  pool: zstore
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0 days 07:41:34 with 0 errors on Fri Apr 24 07:10:00 2020
config:

        NAME                           STATE     READ WRITE CKSUM
        zstore                         ONLINE       0     0     0
          mirror-0                     ONLINE       0     0     0
            WD40EFRX-red-WCC7K6XALFKX  ONLINE       0     0     0
            WD40EFRX-red-WCC7K2YAJA69  ONLINE       0     0     0
          mirror-1                     ONLINE       0     0     0
            WD40EFRX-red-WCC7K6YX6SAA  ONLINE       0     0     0
            WD40EFRX-red-WCC7K4JKN36U  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x39>
The second(!) scrub fixed it!
I would never have tried a second scrub until I read @richardelling's comment. I always thought that what one scrub cannot fix, another scrub cannot fix either. Is this behaviour documented somewhere?
@mabod: from what I know, you don't need two scrubs to fix the issue; rather, the two scrubs are required to clear the issue from zpool status (as it shows both the current and previous error reports)
OK. Does that mean that I always need to run scrub twice to get rid of any(!) error messages? What is the technical reason for not clearing the status right away when the issue is fixed?
It is simply a log of the previous pool state. I am not sure if all errors behave in this manner, though.
Seems dead simple to me; imagine this scenario:
If the first run of the scrub cleared the errors, there would be no errors shown at all when you came back to check on it (except for a brief moment during the scrub itself).
I am not sure I understand your explanation, or which of the three points actually explains the zpool status I saw:
To recap what happened:
1.) zpool status was showing two errors:
    errors: Permanent errors have been detected in the following files:
            <metadata>:<0x0>
            <metadata>:<0x39>
2.) After(!) the first scrub, the status was showing the same two errors.
3.) After(!) the second scrub, the status is not showing any errors anymore.
How does this behavior fit with your explanations?
Scrubs can take a long time (weeks), so the status shows the results of both the current scrub and the previous scrub.
-- richard
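In other words, assuming the underlying blocks are actually intact, the error list would be expected to evolve roughly like this over two scrubs (pool name zstore as above; the exact status text may vary between versions):

        zpool status -v zstore    # before: errors recorded by the last completed scan
            errors: Permanent errors have been detected in the following files:
                    <metadata>:<0x0>
                    <metadata>:<0x39>

        zpool scrub zstore        # 1st scrub finds nothing new, but status still
        zpool status -v zstore    # carries the previous scrub's error records

        zpool scrub zstore        # 2nd scrub: the "previous" record is now the
        zpool status -v zstore    # clean 1st scrub, so the list is empty
            errors: No known data errors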