My pool has errors despite redundant disks, but a scrub didn't repair them. Why?
  pool: nas
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0h0m with 0 errors on Mon Dec 9 10:19:57 2019
config:

        NAME        STATE     READ WRITE CKSUM
        nas         ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x39>
nas is running Ubuntu with ZFS version 0.7.5-1ubuntu16.6.
Did you read the Contribution guidelines?
https://github.com/zfsonlinux/zfs/blob/master/.github/CONTRIBUTING.md
This looks like a support request or a question, which should primarily be asked on the mailing list instead.
Support request? This is a bug; ZFS didn't protect my data with mirrored disks!?
@opili892 If it is a bug, we need logs (debug info) or a way to reproduce the issue at hand.
Considering you didn't confirm this actually is a bug, your post sounds more like a "help me figure out what happened" than a bug report.
To help you, some questions:
To be fair: I'm not sure this IS a bug.
The pool is corrupted with errors and can't be repaired. How many times does this need to occur before it's called a bug?
small database files
no
I didn't corrupt my own data. Are you asking if I can do it more than once?
Every month, but this pool is new. Today a scrub was run: scan: scrub repaired 0B in 0h0m with 0 errors on Mon Dec 9 10:19:57 2019
The disks are new, Amazon Black Friday new.
I don't use SMART.
And to be fair, I didn't ask your opinion.
Can you please check the dmesg logs on the system for any device-related failures? This may help shed some light on what you're seeing. The odd thing is that zpool status should be reporting at least one checksum or read error if you're seeing persistent errors.
After checking dmesg, what I'd suggest is to run zpool clear to reset the error counts. Then run zpool scrub, and when it's done, zpool status -v again to determine if the errors are really persistent.
You may need to run zpool scrub twice to clear the error message buffer. It contains records for the current state and previous scrub.
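As a minimal sketch of that sequence (assuming the pool name nas from the report above and an illustrative dmesg filter; run as root and adjust for the actual system):

        # look for device-related failures in the kernel log
        dmesg | grep -iE 'ata|scsi|blk|i/o error'

        # reset the per-device error counters
        zpool clear nas

        # scrub, then re-check once it has finished
        zpool scrub nas
        zpool status -v nas

        # repeat: the error list keeps records for the current and the
        # previous scrub, so a second pass may be needed to clear it
        zpool scrub nas
        zpool status -v nas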
@behlendorf Is this type of error always related to metadata or is the metadata part a special case?
What is there to do if the errors are really persistent?
Does this comment imply that these errors are more likely to happen on USB-connected drives?
If a snapshot of the pool was made before the error incident happened, will restoring the snapshot automatically "clear" the errors?
The scrub finished, with zpool status -v afterwards saying:
  scan: scrub repaired 0B in 0 days 08:54:31 with 0 errors on Wed Feb 12 20:50:37 2020
Now the metadata errors are gone. I'm still not entirely sure what caused them.
I have seen this issue as well. In my case the files in question are discovered as corrupt during a zfs send, and previous scrubs did not see them. I have seen at least one of these corrupt-file issues disappear after/during a scrub. I'm not seeing anything related to the backing drives in dmesg at all.
Additionally, I've seen corrupt files become corrupt metadata, as the OP has. It happened when I renamed the dataset with those files, restored it from a replica, and then destroyed the renamed version with the corruption. The corrupted metadata showed up until, I presume, the background destroy finished.
linux amd64, debian buster, debian backports, kernel 5.4.0-0.bpo.2-amd64, zfs-dkms 0.8.2-3~bpo10+1
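Roughly, the sequence described above looks like the following; the pool, dataset, and snapshot names (tank/data, backup/data@latest) are hypothetical, since the real ones were not given:

        # rename the dataset that contains the corrupt files
        zfs rename tank/data tank/data-old

        # restore it from a replica under the original name
        zfs send backup/data@latest | zfs receive tank/data

        # destroy the renamed copy; the destroy completes asynchronously in
        # the background, which is roughly when the <metadata> errors showed up
        zfs destroy -r tank/data-old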
You may need to run zpool scrub twice to clear the error message buffer. It contains records for the current state and previous scrub.
This saved my pool. I got errors similar to @opili892's when I played around with vdev device names while doing zfs send/receive. Stupid me. The pool ended up with this status:
  pool: zstore
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0 days 07:41:34 with 0 errors on Fri Apr 24 07:10:00 2020
config:

        NAME                           STATE     READ WRITE CKSUM
        zstore                         ONLINE       0     0     0
          mirror-0                     ONLINE       0     0     0
            WD40EFRX-red-WCC7K6XALFKX  ONLINE       0     0     0
            WD40EFRX-red-WCC7K2YAJA69  ONLINE       0     0     0
          mirror-1                     ONLINE       0     0     0
            WD40EFRX-red-WCC7K6YX6SAA  ONLINE       0     0     0
            WD40EFRX-red-WCC7K4JKN36U  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x39>
The second(!) scrub fixed it!
I would never have tried a second scrub until I read @richardelling's comment. I always thought that what one scrub cannot fix, another scrub cannot fix either. Is this behaviour documented somewhere?
@mabod: from what I know, you don't need two scrubs to fix the issue; rather, the two scrubs are required to clear the issue from zpool status (as it shows both the current and previous error reports)
OK. Does that mean that I always need to run scrub twice to get rid of any(!) error messages? What is the technical reason for not clearing the status right away when the issue is fixed?
It is simply a log of the previous pool state. I am not sure if all errors behave in this manner, though.
Seems dead simple to me; imagine this scenario:
If the first run of the scrub cleared the errors, there would be no errors shown at all when you came back to check on it (except for a brief moment during the scrub itself).
I am not sure I understand your explanation, or which of the three points actually explains the zpool status I saw:
To recap what happened:
1.) zpool status was showing two errors:
    errors: Permanent errors have been detected in the following files:
            <metadata>:<0x0>
            <metadata>:<0x39>
2.) After(!) the first scrub, the status was showing the same two errors.
3.) After(!) the second scrub, the status is not showing any errors anymore.
How does this behavior fit with your explanations?
Scrubs can take a long time (weeks), so the status shows the results of both the current scrub and the previous scrub.
-- richard
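In other words, assuming the underlying blocks are actually intact, the error list would be expected to evolve roughly like this over two scrubs (pool name zstore as above; the exact status text may vary between versions):

        zpool status -v zstore    # before: errors recorded by the last completed scan
            errors: Permanent errors have been detected in the following files:
                    <metadata>:<0x0>
                    <metadata>:<0x39>

        zpool scrub zstore        # 1st scrub finds nothing new, but status still
        zpool status -v zstore    # carries the previous scrub's error records

        zpool scrub zstore        # 2nd scrub: the "previous" record is now the
        zpool status -v zstore    # clean 1st scrub, so the list is empty
            errors: No known data errors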