Trying to simulate failure scenarios with a 3+1 RAIDZ1 array in order to prepare for eventualities.
# zpool create -o ashift=12 spfstank raidz1 sda sdb sdc sdd
# zfs create spfstank/part
# dd if=/dev/random of=/spfstank/part/output.txt bs=1024 count=10000
I then manually pull out /dev/sdc without shutting anything down. As expected, zpool status shows the drive in a bad state:
# zpool status
-- SNIP --
ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 UNAVAIL 16 122 0 corrupted data
-- SNIP --
This status doesn't change when I re-insert the drive. I want to simulate re-introducing a drive that's extremely incoherent relative to the state of the ZFS pool, so, after making sure the drive is offline, I introduce a raft of changes:
# zpool offline spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637
# dd if=/dev/zero of=/dev/disk/by-id/ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 bs=1024 count=100000
102 MB of changes, to be exact. Now, I want to re-introduce the drive to the pool and get ZFS to work it out. At this point, the status of the drive is:
# zpool status
-- SNIP --
ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 OFFLINE 16 122 0
-- SNIP --
I try to replace the drive with itself:
# zpool replace spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 -f
cannot replace ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 with ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637: ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 is busy
# zpool replace spfstank /dev/sdc /dev/sdc -f
invalid vdev specification
the following errors must be manually repaired:
/dev/sdc1 is part of active pool 'spfstank'
# zpool replace spfstank /dev/sdc -f
invalid vdev specification
the following errors must be manually repaired:
/dev/sdc1 is part of active pool 'spfstank'
I was able to "fix" this with:
# zpool online spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637
# zpool clear spfstank
# /sbin/zpool scrub spfstank
During the scrub, the status of the drive changes:
# zpool status
-- SNIP --
ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 ONLINE 0 0 9 (repairing)
-- SNIP --
There doesn't seem to be a way to "replace" a known incoherent drive with itself.
You didn't corrupt the disk enough. The dd left the 3rd and 4th copies of the labels intact, so it's still being recognized as part of the pool. All you need to do in this case is zpool online it. The only parts of a vdev that live at fixed locations are the labels: two at the beginning and two near the end. As long as any one of them is intact the disk is still recognized as a pool member, and because all metadata is kept in multiple copies, you'd likely need extremely severe damage to prevent a simple "online" from working.
Or as was mentioned on the mailing list, zpool labelclear -f /dev/sdc should let zpool replace work to simulate a drive swap.
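A minimal sketch of that approach, reusing the device names from above (adjust them to your own disk; the replace target here is just an example):

```
# Show which of the four label copies (L0/L1 at the start of the disk,
# L2/L3 near the end) zdb can still read on the pulled drive.
zdb -l /dev/sdc

# Wipe all label copies so the disk no longer identifies itself as a pool
# member, then re-introduce it as if it were a brand-new replacement.
zpool labelclear -f /dev/sdc
zpool replace spfstank ata-WDC_WD40EZRX-00SPEB0_WD-WCC4E0546637 /dev/sdc
```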
@dweeezil is it safe to assume that after the disk is online'd and the scrub finishes, the disk in the FAULTED state will return to the ONLINE state?
UPDATE: After the scrub completed the disk is still in the faulted state.
This may be better mailing-list fodder, but I'm noticing similar behavior to @mcrbids and I believe this is on topic. I hope you don't mind.
Here is the zpool configuration:
```
config:

        NAME          STATE     READ WRITE CKSUM
        data          DEGRADED     0     0     0
          raidz2-0    ONLINE       0     0     0
            A0        ONLINE       0     0     0
            B0        ONLINE       0     0     0
            C0        ONLINE       0     0     0
            D0        ONLINE       0     0     0
            E0        ONLINE       0     0     0
            F0        ONLINE       0     0     0
          raidz2-1    DEGRADED     0     0     0
            A1        OFFLINE      0     0     0
            B1        ONLINE       0     0     0
            C1        ONLINE       0     0     0
            D1        ONLINE       0     0     0
            E1        ONLINE       0     0     0
            F1        ONLINE       0     0     0
          raidz2-2    DEGRADED     0     0     0
            A2        ONLINE       0     0     0
            B2        ONLINE       0     0     0
            C2        OFFLINE      0     0     0
            D2        ONLINE       0     0     0
            E2        ONLINE       0     0     0
            F2        OFFLINE      0     0     0
          raidz2-3    ONLINE       0     0     0
            A3        ONLINE       0     0     0
            B3        ONLINE       0     0     0
            C3        ONLINE       0     0     0
            D3        ONLINE       0     0     0
            E3        ONLINE       0     0     0
            F3        ONLINE       0     0     0
```
I have attempted to "borrow" a disk from one of the N+2 vdevs (raidz2-1) for the vdev at N (raidz2-2) by offlining A1 and zeroing the first few hundred megabytes:
# zpool offline data A1
# dd if=/dev/zero of=/dev/disk/by-vdev/A1 bs=64M count=10
I then edited my /etc/zfs/vdev_id.conf so that udev will give A1's disk the C2 label, and commented out the existing line that defines C2.
I then removed A1 and C2 and placed A1 in C2's drive tray. I reconnected the new C2. udev triggers and /dev/disk/by-vdev/C2 now exists.
# ls -l /dev/disk/by-vdev/C2*
lrwxrwxrwx 1 root root 9 Jul 29 16:15 /dev/disk/by-vdev/C2 -> ../../sdu
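For anyone following along, the vdev_id.conf change would look something like this; the wwn names below are placeholders, not the real device links:

```
# /etc/zfs/vdev_id.conf (illustrative only)
# Old C2 mapping commented out; A1's physical disk now carries the C2 alias.
#alias C2   /dev/disk/by-id/wwn-0x5000c5000000c222
alias  C2   /dev/disk/by-id/wwn-0x5000c5000000a111
```

followed by udevadm trigger to regenerate the /dev/disk/by-vdev links.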
When I attempt to replace the offlined C2 with the new C2, however, I get a message that C2 is busy, and the disk is automatically partitioned (by ZFS, I assume).
# zpool replace data C2 /dev/disk/by-vdev/C2
invalid vdev specification
use '-f' to override the following errors:
/dev/disk/by-vdev/C2 contains a corrupt primary EFI label.
# zpool replace -f data C2 /dev/disk/by-vdev/C2
cannot replace C2 with /dev/disk/by-vdev/C2: /dev/disk/by-vdev/C2 is busy
# ls -l /dev/disk/by-vdev/C2*
lrwxrwxrwx 1 root root 9 Jul 29 16:16 /dev/disk/by-vdev/C2 -> ../../sdu
lrwxrwxrwx 1 root root 10 Jul 29 16:16 /dev/disk/by-vdev/C2-part1 -> ../../sdu1
lrwxrwxrwx 1 root root 10 Jul 29 16:16 /dev/disk/by-vdev/C2-part9 -> ../../sdu9
_Note, the "corrupt primary EFI label" message is always present even with brand new disks that have never touched the system. Not sure what that is about. I always have to use -f when replacing._
If I had to take a guess, this has something to do with the fact that I created the pool with the /dev/disk/by-vdev/ labels and not /dev/disk/by-id/. ZFS sees the path /dev/disk/by-vdev/C2 and assumes it is just badly damaged (and, as I've learned from this thread, a label still exists at a location beyond the first several hundred megs I overwrote). Am I close here?
UPDATE: Doesn't appear to be related to which symlink was used when referencing the disk.
Would the correct course of action in replacing a disk this way, be to just zpool online the "borrowed" disk if I need to borrow disks from other vdevs in the future.
UPDATE: No. zpool online will not resilver the faulted disk. zpool replace will not allow disk reuse within the pool which I believe to be a bug.
I think there might actually be a bug here as of 0.6.3. Even if I zpool labelclear the disk I still cannot use it as a replacement in this pool.
# zpool replace -f data C2 /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX
invalid vdev specification
the following errors must be manually repaired:
/dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX is part of active pool '
As seen in the post above, the system automatically partitions the drive without my intervention. There must be some signaling beyond the ZFS label on the drive that informs ZFS that this disk is/was a member of this pool.
After I zeroed the drive fully with dd, I was able to use it as a replacement disk:
```
...
          raidz2-2       DEGRADED     0     0     0
            A2           ONLINE       0     0     0
            B2           ONLINE       0     0     0
            replacing-2  OFFLINE      0     0     0
              old        OFFLINE      0     0     0
              C2         ONLINE       0     0     0  (resilvering)
            D2           ONLINE       0     0     0
            E2           ONLINE       0     0     0
            F2           OFFLINE      0     0     0
...
```
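Roughly what that looked like, reusing the device names from earlier in this comment (a full zero pass takes hours on a 3 TB disk):

```
# Zero the entire disk so no trace of the old labels or partition table
# survives, then the replace goes through and resilvering starts.
dd if=/dev/zero of=/dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX bs=1M
zpool replace -f data C2 /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX
```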
You have to zpool labelclear the partition on the disk, not just the whole disk. Even if you give ZFS a whole disk it makes partitions on it and you have to clear those.
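In other words, something along these lines, using the device names from the comment above (-part1 is the data partition ZFS created, -part9 the small reserved one):

```
# Clear the ZFS labels on the data partition of the whole-disk vdev;
# afterwards zpool replace should accept the disk again.
zpool labelclear -f /dev/disk/by-vdev/C2-part1
```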
Noted. That's a lot less time consuming than wiping the disk. Thanks!
zpool labelclear scsi-SATA_ST3000DM001-1CH_XXXXXXX-part1 complains about the disk being part of an active pool too. I tried that after a zpool offline of /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX.
To workaround this I moved the disk to another system and did the zpool labelclear there.
After that, 'zpool replace -f tank scsi-SATA_ST3000DM001-1CH_XXXXXXX /dev/disk/by-id/scsi-SATA_ST3000DM001-1CH_XXXXXXX' got me to resilvering.
It would be really handy to be able to do this without physically removing the disk. A prime example of the use case is when changing partitions around, e.g. dropping a partition to make more space for a zfs one.
I'm running into this as well. I don't understand: how is this not considered a bug any more?
labelclear is clearly broken: it's impossible to clear a partition that was created as part of a whole-disk pool.
Also, 'labelclear -f'-ing the drive doesn't do enough to prevent the error 'does not contain an EFI label but it may contain information in the MBR'.
Why is it even necessary for the user to reason about partitions that they didn't create?
I believe I'm running into this problem.
sudo zpool status
```
  pool: tank
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 29h17m with 0 errors on Mon Jun 11 05:41:41 2018
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            sda-enc  ONLINE       0     0     0
            sdb-enc  ONLINE       0     0     0
            sdc-enc  ONLINE       0     0     0
        logs
          log        ONLINE       0     0     0
        cache
          cache      FAULTED      0     0     0  corrupted data
          cache      ONLINE       0     0     0

errors: No known data errors
```
Cache is a logical volume on a LUKS drive. I must have done something wrong with the setup and it is not properly recognized on reboot.
sudo zpool replace -f tank cache /dev/disk/by-id/dm-name-ws1--vg-cache
cannot open '/dev/disk/by-id/dm-name-ws1--vg-cache': Device or resource busy
cannot replace cache with /dev/disk/by-id/dm-name-ws1--vg-cache: no such device in pool
sudo zpool labelclear /dev/disk/by-id/dm-name-ws1--vg-cache
labelclear operation failed.
Vdev /dev/disk/by-id/dm-name-ws1--vg-cache is a member (L2CACHE), of pool "tank".
To remove label information from this device, export or destroy
the pool, or remove /dev/disk/by-id/dm-name-ws1--vg-cache from the configuration of this pool
and retry the labelclear operation.
Any insights greatly appreciated.
EDIT: I should clarify that the cache seems to be in use, which explains why the device is busy. So maybe it's just a minor annoyance that the old cache entry can't be removed?
EDIT: Sorry, I must have just been being dumb about the paths; I was able to remove the degraded device with sudo zpool remove tank /dev/ws1-vg/cache.
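For completeness: cache (L2ARC) devices don't need zpool replace at all; removing and re-adding them is enough. A sketch using the paths from this comment (device naming may differ on your system):

```
# Drop the stale cache entry, then add the logical volume back as L2ARC.
zpool remove tank /dev/ws1-vg/cache
zpool add tank cache /dev/disk/by-id/dm-name-ws1--vg-cache
```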
I have this issue too. I can't labelclear an offline disk to reinsert it in the pool.
Workaround: strace -e pread64 zdb -l $DEV >/dev/null
Gives a bunch of offsets:
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 0) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 262144) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 12000127614976) = 262144
pread64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 12000127877120) = 262144
Clout these offsets with dd and charlie's your uncle.
Here's your free firearm. You'll find what remains of your foot somewhere near the end of your leg.
> Clout these offsets with dd and charlie's your uncle. Here's your free firearm. You'll find what remains of your foot somewhere near the end of your leg.
For the uninitiated, do you have a sample command, and do we need to divide the numbers by the disk block size? E.g. offset 12000127614976 from your example divided by a 512-byte block size = 23437749248.
You don't need optimality, just firepower. Use dd with byte units and no division is required. Anyway, I can't math.
Did anybody try wipefs? That also seems to be able to remove ZFS information from disks without overwriting the whole thing...
I've tried wipefs -a and it doesn't work.
> Did anybody try wipefs? That also seems to be able to remove ZFS information from disks without overwriting the whole thing...
According to the man page:
> When option -a is used, all magic strings that are visible for libblkid are erased. In this case the wipefs scans the device again after each modification (erase) until no magic string is found.
>
> Note that by default wipefs does not erase nested partition tables on non-whole disk devices. For this the option --force is required.
So I tried:
wipefs --all --force
But that didn't work for me...
> Workaround: strace -e pread64 zdb -l $DEV >/dev/null gives a bunch of offsets. Clout these offsets with dd and charlie's your uncle. Here's your free firearm. You'll find what remains of your foot somewhere near the end of your leg.
Sayings like "firearm" and "charlie's your uncle" are not at all intuitive for a foreigner like me :(
Can you provide an example dd command for the more unlearned of us out here? (i.e., to clarify which parameter is used to do what from this strace output.)
Thanks in advance.
Translating the pread() results from MY drives roughly into dd commands gives:
dd if=/dev/zero of=$DEV bs=1 seek=0 count=262144
dd if=/dev/zero of=$DEV bs=1 seek=262144 count=262144
dd if=/dev/zero of=$DEV bs=1 seek=12000127614976 count=262144
dd if=/dev/zero of=$DEV bs=1 seek=12000127877120 count=262144
However, the pread() values will differ for YOUR drive(s), so I strongly recommend you learn to load and aim your own firearm. The trick with dd is to use bs=1 when you don't want performance and can't do mathematics (like me).
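If bs=1 is too slow for your taste, newer GNU dd can take the seek offset in bytes directly, so a sane block size works; a sketch, with the offsets from the example above (yours will differ):

```
# Write 256 KiB of zeros at each label offset reported by zdb/strace.
# oflag=seek_bytes makes seek= a byte offset rather than a block count.
for off in 0 262144 12000127614976 12000127877120; do
  dd if=/dev/zero of="$DEV" bs=256K count=1 seek="$off" oflag=seek_bytes
done
```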
@shevek - floor sufficiently swiss-cheesed from weapons fire, and still no joy. (Edit: see end of comment.)
dozer1 had 2 disks in a mirror, sds1 and sdr1. At some point sdl (previously a USB drive) was removed, and either through a reboot or some other means, udev moved sds to sdl. The disk is 14.6T; a full dd would take 3.16 days.
```
[root@fs01 etc]# zpool status
  pool: dozer1
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        dozer1                    DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            sdr                   ONLINE       0     0     0
            17256646544208471230  OFFLINE      0     0     0  was /dev/sds1
```
[root@fs01 etc]# strace -e pread64 zdb -l /dev/sdl >/dev/null
pread64(5, "\0\1\0\0\0\0\0\0\1\0\0\0000\0\0\0\7\0\0\0\1\0\0\0\23\0\0\0doze"..., 13920, 0) = 13920
pread64(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 0) = 262144
pread64(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 262144) = 262144
pread64(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 16000900136960) = 262144
pread64(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144, 16000900399104) = 262144
+++ exited with 2 +++
[root@fs01 etc]# for f in 0 262144 16000900136960 16000900399104; do dd if=/dev/zero of=/dev/sdl bs=1 seek=$f count=262144; done
262144+0 records in
262144+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.507745 s, 516 kB/s
262144+0 records in
262144+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.508549 s, 515 kB/s
262144+0 records in
262144+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.499234 s, 525 kB/s
262144+0 records in
262144+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.496669 s, 528 kB/s
[root@fs01 etc]# partprobe /dev/sdl
### LSBLK shows sdl has no partitions, so far so good
[root@fs01 etc]# zpool replace -f dozer1 17256646544208471230 /dev/sdl
cannot replace 17256646544208471230 with /dev/sdl: /dev/sdl is busy, or device removal is in progress
### LSBLK shows:
...
sdl 8:176 0 14.6T 0 disk
├─sdl1 8:177 0 14.6T 0 part
└─sdl9 8:185 0 8M 0 part
...
[root@fs01 etc]# zpool replace -f dozer1 17256646544208471230 /dev/sdl
invalid vdev specification
the following errors must be manually repaired:
/dev/sdl1 is part of active pool 'dozer1'
[root@fs01 etc]# zpool labelclear -f /dev/sdl1
/dev/sdl1 is a member (ACTIVE) of pool "dozer1"
When I try to offline/delete /dev/sdl1, ZFS says it's not in the pool (I'm assuming because it's checking the cache?). When I try to add it, it checks the on-disk metadata and says it's already part of the pool.
So doing a zpool detach dozer1 17256646544208471230 and then zpool attach dozer1 /dev/sdr /dev/sdl worked like a charm! Crumbs for those who need it.
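Spelled out, for mirror vdevs only (GUID and device names are the ones from this comment):

```
# Drop the stale mirror member by its GUID, then attach the wiped disk as a
# new half of the mirror; ZFS resilvers it from sdr.
zpool detach dozer1 17256646544208471230
zpool attach dozer1 /dev/sdr /dev/sdl
```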
That being said, the fact that labelclear doesn't work as intended is still an issue.