Trying to simulate failure scenarios with a 3+1 RAIDZ1 array in order to prepare for eventualities.
# zpool create -o ashift=12 spfstank raidz1 sda sdb sdc sdd
# zfs create spfstank/part
# dd if=/dev/random of=/spfstank/part/output.txt bs=1024 count=10000
I then manually pull out /dev/sdc without shutting anything down. As expected, zpool status shows the drive in a bad state:
# zpool status
-- SNIP --
ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 UNAVAIL 16 122 0 corrupted data
-- SNIP --
This status doesn't change when I re-insert the drive. I want to simulate re-introducing a drive that's extremely incoherent relative to the state of the ZFS pool, so, after making sure the drive is offline, I introduce a raft of changes:
# zpool offline spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637
# dd if=/dev/zero of=/dev/disk/by-id/ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 bs=1024 count=100000
102 MB of changes, to be exact. Now, I want to re-introduce the drive to the pool and get ZFS to work it out. At this point, the status of the drive is:
# zpool status
-- SNIP --
ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 OFFLINE 16 122 0
-- SNIP --
I try to replace the drive with itself:
# zpool replace spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 -f
cannot replace ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 with ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637: ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 is busy
# zpool replace spfstank /dev/sdc /dev/sdc -f
invalid vdev specification
the following errors must be manually repaired:
/dev/sdc1 is part of active pool 'spfstank'
# zpool replace spfstank /dev/sdc -f
invalid vdev specification
the following errors must be manually repaired:
/dev/sdc1 is part of active pool 'spfstank'
I was able to "fix" this with:
# zpool online spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637
# zpool clear spfstank
# /sbin/zpool scrub spfstank
During the scrub, the status of the drive changes:
# zpool status
-- SNIP --
ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 ONLINE 0 0 9 (repairing)
-- SNIP --
There doesn't seem to be a way to "replace" a known incoherent drive with itself.
You didn't corrupt the disk enough. The dd left the 3rd and 4th copies of the labels intact, so it's still being recognized as part of the pool. All you need to do in this case is zpool online it. The only parts of a vdev that live at fixed locations are the labels: two at the beginning and two near the end. As long as any one of them is intact the disk is still recognized as a pool member, and because all metadata is kept in multiple copies, you'd likely need extremely severe damage to prevent a simple "online" from working.
Or as was mentioned on the mailing list, zpool labelclear -f /dev/sdc should let zpool replace work to simulate a drive swap.
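A minimal sketch of that approach, reusing the device names from above (adjust them to your own disk; the replace target here is just an example):

```
# Show which of the four label copies (L0/L1 at the start of the disk,
# L2/L3 near the end) zdb can still read on the pulled drive.
zdb -l /dev/sdc

# Wipe all label copies so the disk no longer identifies itself as a pool
# member, then re-introduce it as if it were a brand-new replacement.
zpool labelclear -f /dev/sdc
zpool replace spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 /dev/sdc
```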
@dweeezil is it safe to assume that after the disk is online'd and the scrub finishes, the disk in the FAULTED state will return to the ONLINE state?
UPDATE: After the scrub completed the disk is still in the faulted state.
This may be better mailing-list fodder, but I'm noticing similar behavior to @mcrbids and I believe this is on topic. I hope you don't mind.
Here is the zpool configuration:
```
config:

        NAME          STATE     READ WRITE CKSUM
        data          DEGRADED     0     0     0
          raidz2-0    ONLINE       0     0     0
            A0        ONLINE       0     0     0
            B0        ONLINE       0     0     0
            C0        ONLINE       0     0     0
            D0        ONLINE       0     0     0
            E0        ONLINE       0     0     0
            F0        ONLINE       0     0     0
          raidz2-1    DEGRADED     0     0     0
            A1        OFFLINE      0     0     0
            B1        ONLINE       0     0     0
            C1        ONLINE       0     0     0
            D1        ONLINE       0     0     0
            E1        ONLINE       0     0     0
            F1        ONLINE       0     0     0
          raidz2-2    DEGRADED     0     0     0
            A2        ONLINE       0     0     0
            B2        ONLINE       0     0     0
            C2        OFFLINE      0     0     0
            D2        ONLINE       0     0     0
            E2        ONLINE       0     0     0
            F2        OFFLINE      0     0     0
          raidz2-3    ONLINE       0     0     0
            A3        ONLINE       0     0     0
            B3        ONLINE       0     0     0
            C3        ONLINE       0     0     0
            D3        ONLINE       0     0     0
            E3        ONLINE       0     0     0
            F3        ONLINE       0     0     0
```
I have attempted to "borrow" a disk from one of the N+2 vdevs (raidz2-1) for the vdev at N (raidz2-2) by offlining A1 and zeroing the first few hundred megabytes:
# zpool offline data A1
# dd if=/dev/zero of=/dev/disk/by-vdev/A1 bs=64M count=10
I then edited my /etc/zfs/vdev_id.conf so that udev will give A1's disk the C2 label, and commented out the existing line that defines C2.
I then removed A1 and C2 and placed A1 in C2's drive tray. I reconnected the new C2. udev triggers and /dev/disk/by-vdev/C2 now exists.
# ls -l /dev/disk/by-vdev/C2*
lrwxrwxrwx 1 root root 9 Jul 29 16:15 /dev/disk/by-vdev/C2 -> ../../sdu
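For anyone following along, the vdev_id.conf change would look something like this; the wwn names below are placeholders, not the real device links:

```
# /etc/zfs/vdev_id.conf (illustrative only)
# Old C2 mapping commented out; A1's physical disk now carries the C2 alias.
#alias C2   /dev/disk/by-id/wwn-0x5000c5000000c222
alias  C2   /dev/disk/by-id/wwn-0x5000c5000000a111
```

followed by udevadm trigger to regenerate the /dev/disk/by-vdev links.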
When I attempt to replace the offlined C2 with the new C2, however, I get a message that C2 is busy, and the disk is automatically partitioned (by ZFS, I assume).
# zpool replace data C2 /dev/disk/by-vdev/C2
invalid vdev specification
use '-f' to override the following errors:
/dev/disk/by-vdev/C2 contains a corrupt primary EFI label.
# zpool replace -f data C2 /dev/disk/by-vdev/C2
cannot replace C2 with /dev/disk/by-vdev/C2: /dev/disk/by-vdev/C2 is busy
# ls -l /dev/disk/by-vdev/C2*
lrwxrwxrwx 1 root root 9 Jul 29 16:16 /dev/disk/by-vdev/C2 -> ../../sdu
lrwxrwxrwx 1 root root 10 Jul 29 16:16 /dev/disk/by-vdev/C2-part1 -> ../../sdu1
lrwxrwxrwx 1 root root 10 Jul 29 16:16 /dev/disk/by-vdev/C2-part9 -> ../../sdu9
_Note, the "corrupt primary EFI label" message is always present even with brand new disks that have never touched the system. Not sure what that is about. I always have to use -f when replacing._
If I had to take a guess, this has something to do with the fact that I created the pool with the /dev/disk/by-vdev/ labels and not /dev/disk/by-id/. ZFS sees the path /dev/disk/by-vdev/C2 and assumes it is just badly damaged (and, as I've learned from this thread, a label still exists at a location beyond the first several hundred megs I overwrote). Am I close here?
UPDATE: Doesn't appear to be related to which symlink was used when referencing the disk.
Would the correct course of action in replacing a disk this way, be to just zpool online the "borrowed" disk if I need to borrow disks from other vdevs in the future.
UPDATE: No. zpool online will not resilver the faulted disk. zpool replace will not allow disk reuse within the pool which I believe to be a bug.
I think there might actually be a bug here as of 0.6.3. Even if I zpool labelclear the disk I still cannot use it as a replacement in this pool.
# zpool replace -f data C2 /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX
invalid vdev specification
the following errors must be manually repaired:
/dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX is part of active pool '
As seen in the post above, the system automatically partitions the drive without my intervention. There must be some signaling beyond the ZFS label on the drive that informs ZFS that this disk is/was a member of this pool.
After I zeroed the drive fully with dd, I was able to use it as a replacement disk:
```
...
          raidz2-2       DEGRADED     0     0     0
            A2           ONLINE       0     0     0
            B2           ONLINE       0     0     0
            replacing-2  OFFLINE      0     0     0
              old        OFFLINE      0     0     0
              C2         ONLINE       0     0     0  (resilvering)
            D2           ONLINE       0     0     0
            E2           ONLINE       0     0     0
            F2           OFFLINE      0     0     0
...
```
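Roughly what that looked like, reusing the device names from earlier in this comment (a full zero pass takes hours on a 3 TB disk):

```
# Zero the entire disk so no trace of the old labels or partition table
# survives, then the replace goes through and resilvering starts.
dd if=/dev/zero of=/dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX bs=1M
zpool replace -f data C2 /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX
```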
You have to zpool labelclear the partition on the disk, not just the whole disk. Even if you give ZFS a whole disk it makes partitions on it and you have to clear those.
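In other words, something along these lines, using the device names from the comment above (-part1 is the data partition ZFS created, -part9 the small reserved one):

```
# Clear the ZFS labels on the data partition of the whole-disk vdev;
# afterwards zpool replace should accept the disk again.
zpool labelclear -f /dev/disk/by-vdev/C2-part1
```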
Noted. That's a lot less time consuming than wiping the disk. Thanks!
zpool labelclear scsi-SATA_ST3000DM001-1CH_XXXXXXX-part1 complains about the disk being part of an active pool too. I tried that after a zpool offline of /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX.
To workaround this I moved the disk to another system and did the zpool labelclear there.
After that, 'zpool replace -f tank scsi-SATA_ST3000DM001-1CH_XXXXXXX /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX' got me to resilvering.
It would be really handy to be able to do this without physically removing the disk. A prime example of the use case is when changing partitions around, e.g. dropping a partition to make more space for a zfs one.
I'm running into this as well. I don't understand: how is this not considered a bug any more?
labelclear is clearly broken: it's impossible to clear a partition that was created as part of a whole-disk pool.
Also, 'labelclear -f'-ing the drive doesn't do enough to prevent the error 'does not contain an EFI label but it may contain information in the MBR'.
Why is it even necessary for the user to reason about partitions that they didn't create?
I believe I'm running into this problem.
sudo zpool status
```
  pool: tank
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 29h17m with 0 errors on Mon Jun 11 05:41:41 2018
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            sda-enc  ONLINE       0     0     0
            sdb-enc  ONLINE       0     0     0
            sdc-enc  ONLINE       0     0     0
        logs
          log        ONLINE       0     0     0
        cache
          cache      FAULTED      0     0     0  corrupted data
          cache      ONLINE       0     0     0

errors: No known data errors
```
Cache is a logical volume on a LUKS drive. I must have done something wrong with the setup and it is not properly recognized on reboot.
sudo zpool replace -f tank cache /dev/disk/by-id/dm-name-ws1--vg-cache
cannot open '/dev/disk/by-id/dm-name-ws1--vg-cache': Device or resource busy
cannot replace cache with /dev/disk/by-id/dm-name-ws1--vg-cache: no such device in pool
sudo zpool labelclear /dev/disk/by-id/dm-name-ws1--vg-cache
labelclear operation failed.
Vdev /dev/disk/by-id/dm-name-ws1--vg-cache is a member (L2CACHE), of pool "tank".
To remove label information from this device, export or destroy
the pool, or remove /dev/disk/by-id/dm-name-ws1--vg-cache from the configuration of this pool
and retry the labelclear operation.
Any insights greatly appreciated.
EDIT: I should clarify that the cache seems to be in use, which explains why the device is busy. So maybe it's just a minor annoyance that the old cache entry can't be removed?
EDIT: Sorry, I must have just been being dumb about the paths; I was able to remove the degraded device with sudo zpool remove tank /dev/ws1-vg/cache.
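For completeness: cache (L2ARC) devices don't need zpool replace at all; removing and re-adding them is enough. A sketch using the paths from this comment (device naming may differ on your system):

```
# Drop the stale cache entry, then add the logical volume back as L2ARC.
zpool remove tank /dev/ws1-vg/cache
zpool add tank cache /dev/disk/by-id/dm-name-ws1--vg-cache
```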
I have this issue too. I can't labelclear an offline disk to reinsert it in the pool.
Workaround: strace -e pread64 zdb -l $DEV >/dev/null
Gives a bunch of offsets:
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 0) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 262144) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 12000127614976) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 12000127877120) = 262144
Clout these offsets with dd and charlie's your uncle.
Here's your free firearm. You'll find what remains of your foot somewhere near the end of your leg.
> Clout these offsets with dd and charlie's your uncle. Here's your free firearm. You'll find what remains of your foot somewhere near the end of your leg.
For the uninitiated, do you have a sample command, and do we need to divide the numbers by the disk block size? E.g. offset 12000127614976 from your example divided by a 512-byte block size = 23437749248.
You don't need optimality, just firepower. Use dd with byte units and no division is required. Anyway, I can't math.
Did anybody try wipefs? That also seems to be able to remove ZFS information from disks without overwriting the whole thing...
I've tried wipefs -a and it doesn't work.
> Did anybody try wipefs? That also seems to be able to remove ZFS information from disks without overwriting the whole thing...
According to the man page:
> When option -a is used, all magic strings that are visible for libblkid are erased. In this case the wipefs scans the device again after each modification (erase) until no magic string is found.
>
> Note that by default wipefs does not erase nested partition tables on non-whole disk devices. For this the option --force is required.
So I tried:
wipefs --all --force
But that didn't work for me...
> Workaround: strace -e pread64 zdb -l $DEV >/dev/null gives a bunch of offsets. Clout these offsets with dd and charlie's your uncle. Here's your free firearm. You'll find what remains of your foot somewhere near the end of your leg.
Sayings like "firearm" and "charlie's your uncle" are not at all intuitive for a foreigner like me :(
Can you provide an example dd command for the more unlearned of us out here? (i.e., to clarify which parameter is used to do what from this strace output.)
Thanks in advance.
Translating the pread() results from MY drives roughly into dd commands gives:
dd if=/dev/zero of=$DEV bs=1 seek=0 count=262144
dd if=/dev/zero of=$DEV bs=1 seek=262144 count=262144
dd if=/dev/zero of=$DEV bs=1 seek=12000127614976 count=262144
dd if=/dev/zero of=$DEV bs=1 seek=12000127877120 count=262144
However, the pread() values will differ for YOUR drive(s), so I strongly recommend you learn to load and aim your own firearm. The trick with dd is to use bs=1 when you don't want performance and can't do mathematics (like me).
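If bs=1 is too slow for your taste, newer GNU dd can take the seek offset in bytes directly, so a sane block size works; a sketch, with the offsets from the example above (yours will differ):

```
# Write 256 KiB of zeros at each label offset reported by zdb/strace.
# oflag=seek_bytes makes seek= a byte offset rather than a block count.
for off in 0 262144 12000127614976 12000127877120; do
  dd if=/dev/zero of="$DEV" bs=256K count=1 seek="$off" oflag=seek_bytes
done
```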
@shevek - floor sufficiently swiss-cheesed from weapons fire, and still no joy. (Edit: see end of comment.)
dozer1 had 2 disks in a mirror, sds1 and sdr1. At some point sdl (previously a USB drive) was removed, and either through a reboot or some other means, udev moved sds to sdl. The disk is 14.6T; a full dd would take 3.16 days.
```
[root@fs01 etc]# zpool status
  pool: dozer1
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        dozer1                    DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            sdr                   ONLINE       0     0     0
            17256646544208471230  OFFLINE      0     0     0  was /dev/sds1
```
[root@fs01 etc]# strace -e pread64 zdb -l /dev/sdl >/dev/null
pread64(5, "\0\1\0\0\0\0\0\0\1\0\0\0000\0\0\0\7\0\0\0\1\0\0\0\23\0\0\0doze"..., 13920, 0) = 13920
pread64(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 0) = 262144
pread64(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 262144) = 262144
pread64(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 16000900136960) = 262144
pread64(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 16000900399104) = 262144
+++ exited with 2 +++
[root@fs01 etc]# for f in 0 262144 16000900136960 16000900399104; do dd if=/dev/zero of=/dev/sdl bs=1 seek=$f count=262144; done
262144+0 records in
262144+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.507745 s, 516 kB/s
262144+0 records in
262144+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.508549 s, 515 kB/s
262144+0 records in
262144+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.499234 s, 525 kB/s
262144+0 records in
262144+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.496669 s, 528 kB/s
[root@fs01 etc]# partprobe /dev/sdl
### LSBLK shows sdl has no partitions, so far so good
[root@fs01 etc]# zpool replace -f dozer1 17256646544208471230 /dev/sdl
cannot replace 17256646544208471230 with /dev/sdl: /dev/sdl is busy, or device removal is in progress
### LSBLK shows:
...
sdl 8:176 0 14.6T 0 disk
├─sdl1 8:177 0 14.6T 0 part
└─sdl9 8:185 0 8M 0 part
...
[root@fs01 etc]# zpool replace -f dozer1 17256646544208471230 /dev/sdl
invalid vdev specification
the following errors must be manually repaired:
/dev/sdl1 is part of active pool 'dozer1'
[root@fs01 etc]# zpool labelclear -f /dev/sdl1
/dev/sdl1 is a member (ACTIVE) of pool "dozer1"
When I try to offline/delete /dev/sdl1, ZFS says it's not in the pool (I'm assuming because it's checking the cache?). When I try to add it, it checks the on-disk metadata and says it's already part of the pool.
So doing a zpool detach dozer1 17256646544208471230 and then zpool attach dozer1 /dev/sdr /dev/sdl worked like a charm! Crumbs for those who need it.
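Spelled out, for mirror vdevs only (GUID and device names are the ones from this comment):

```
# Drop the stale mirror member by its GUID, then attach the wiped disk as a
# new half of the mirror; ZFS resilvers it from sdr.
zpool detach dozer1 17256646544208471230
zpool attach dozer1 /dev/sdr /dev/sdl
```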
That being said, the fact that labelclear doesn't work as intended is still an issue.