Zfs: Top-level disk removal: operation not supported on this type of pool. How to replace disk?

Created on 25 Dec 2019  ·  40 Comments  ·  Source: openzfs/zfs

System information


Type | Version/Name
--- | ---
Distribution Name | Debian
Distribution Version | zfs-0.8.1-pve2
Linux Kernel | 5.0.18-1-pve
Architecture | x86
ZFS Version | 0.8.1-pve1
SPL Version | 0.8.1-pve1

Describe the problem you're observing

I accidentally added a disk to a pool instead of replacing a faulted one: the replace command had failed for some reason (an unresolvable or not-found disk name, I can't remember exactly). Now I cannot replace the failed device with the disk I have already physically installed in the same slot.

Even though I have offlined the new disk, I cannot remove it either; the command throws errors (see below).

What is the recommended way to replace the disk and clean up this apparently unfixable mess? There seems to be no force flag...

Describe how to reproduce the problem

  • Take out old disk from a raidz2
  • Insert new disk in same slot
  • zpool add tank <new-disk>
  • zpool offline tank <new-disk>
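For contrast, the intended recovery for a faulted raidz2 member would have been a replace, not an add. A sketch with placeholder device names (as in the steps above):

```shell
# WRONG (what happened): creates a new single-disk top-level vdev,
# striped with raidz2-0 and not removable from a raidz pool
#   zpool add tank <new-disk>

# RIGHT: resilver the failed member onto the new disk inside raidz2-0
zpool replace tank <old-disk> /dev/disk/by-id/<new-disk>
zpool status tank   # watch the resilver progress
```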

Include any warning/errors/backtraces from the system logs


Before:

        NAME                                          STATE     READ WRITE CKSUM
        tank                                          DEGRADED     0     0     0
          raidz2-0                                    DEGRADED     0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2NN9JNJ  FAULTED     33     0     0  too many errors
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0555658  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0535953  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0527669  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5JEC5V2  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7K4AR0032  ONLINE       0     0     0

After:

        NAME                                          STATE     READ WRITE CKSUM
        tank                                          DEGRADED     0     0     0
          raidz2-0                                    DEGRADED     0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2NN9JNJ  FAULTED     33     0     0  too many errors
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0555658  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0535953  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0527669  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5JEC5V2  ONLINE       0     0     0
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7K4AR0032  ONLINE       0     0     0
          ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD    DEGRADED     0     0     0  external device fault

Removing or replacing the disk fails:

$ zpool remove tank ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD
cannot remove ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD: operation not supported on this type of pool
$ zpool replace tank  ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2NN9JNJ ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD
/dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD is in use and contains a unknown filesystem.


All 40 comments

You created a single-drive vdev that is striped with your raidz2 vdev. Normally you should not be able to do that (add a new top-level vdev that has less redundancy than the existing vdevs) without the -f option to force it.

Hm, apparently it's "should not". I went through my bash history and this is what I entered:
zpool add tank /dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD, with no -f flag.

@ccremer we're always looking to improve documentation. Can you share what led you to use the zpool add command rather than the zpool replace command?

I tried to replace the faulty disk with this command:

zpool replace tank ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2NN9JNJ ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD

For some reason (I don't have the output anymore) that failed; I think it was a not-found error or similar. When googling for "zfs replace disk" I of course land on the Oracle docs: https://docs.oracle.com/cd/E19253-01/819-5461/gazgd/index.html, where the device name is entered directly. Since that didn't work, I thought "ah, maybe the pool needs to know/import the disk before replacing", and there you have it...

OK, I think this is where we can do better. The replace command is what should have succeeded, but it is possible that zpool was looking in the wrong device directory, /dev rather than /dev/disk/by-id. So the "not found" error sent you off on a wild goose chase, and that should not happen.

It is still unclear to me why '-f' wasn't required, because that code has been there for 10+ years. Later this week I'll try to reproduce.
-- richard

That sums it up pretty much.

Looking over the manpages and the output of zpool status, I think there is now a mirror between the raidz2-0 vdev and the vdev that is now inconveniently called ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD, but the disk is not actually used yet as part of the same-named vdev. Am I understanding this correctly? If so, could I still replace the old faulted disk?

You created a single drive vdev that is mirrored with your raidz2 vdev.

Isn't the single drive in stripe with the raidz vdev ?

@richardelling I think it might at least be worthwhile to document a big "Do not use add instead of replace" warning in the replace section of the documentation... just to prevent mistakes...

@drescherjm Interesting note: I just created some single-disk pools and I can at least confirm that single-disk pools (and thus vdevs) require -f to be created on master. So that part of the code is right.

So if the -f requirement isn't enforced on add, then the bug would be limited to add.

@ccremer It might be very important to note NOT to use the Oracle docs; this is not Oracle ZFS (anymore), and while most things might still work, there are no guarantees.

@HiFiPhile According to these readouts it's indeed in a stripe with the raidz2 vdev.

Edit

Looking into why force isn't working:
zpool add passes the force bool to make_root_vdev. But make_root_vdev wants an int; are B_TRUE and B_FALSE compatible with that?

Even so, make_root_vdev doesn't seem to do much with the force argument...
It only passes it to is_device_in_use, and that's about it.

I don't see any checks that require -f to prevent single-disk vdev adds in any case.

@Ornias1993 I learned that the hard way now. However, I'm not sure which docs are the ones for ZFS on Linux. On the wiki page https://github.com/zfsonlinux/zfs/wiki/Admin-Documentation I'm again presented with 3 different admin guides, including the Oracle one. And when using search engines, only the Oracle ones show up... so in a way it's hard to find the right online documentation. Coming from the Kubernetes/Docker world I did not think of manpages; I always look for online documentation first.

Regardless of the -f flag and the docs, how could I fix the mess and replace the disk in the raidz2 vdev with the wrongly-added new disk? What's the recommended approach?

Google-foo showed me a guide:
http://blog.moellenkamp.org/archives/50-How-to-remove-a-top-level-vdev-from-a-ZFS-pool.html

It's for Oracle ZFS; OpenZFS currently doesn't support remove for pools with RAIDZ vdevs in them: https://github.com/zfsonlinux/zfs/blob/master/man/man8/zpool-remove.8#L56

Yes, I also found this one, except the first command doesn't work:

$ zpool remove tank ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD
cannot remove ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD: operation not supported on this type of pool

@gmelikov The second vdev isn't raidz

@gmelikov The second vdev isn't raidz

When the primary pool storage includes a top-level raidz vdev only hot spare,
cache, and log devices can be removed.

@gmelikov SHOOT, my bad!

You created a single drive vdev that is mirrored with your raidz2 vdev.

Isn't the single drive in stripe with the raidz vdev ?

Yes you are correct, I guess I was a little sleepy replying last night..

And when using search engines, only the Oracle ones show up...

I find that annoying myself when searching for the documentation.

Coming from Kubernetes/Docker worlds I did not think of manpages, I always look for online documentation first.

This one made me laugh out loud :D
Keep it that way! :+1:

You don't need your bash history to find out what happened; zpool history <pool> should show you what happened to a pool, at any time, e.g.:

[root@taski ~]# zpool history zpool1 | head -n 10
History for 'zpool1':
2019-12-26.00:48:35 zpool create -o ashift=12 zpool2 /dev/sdb /dev/sdh
2019-12-26.00:51:53 zfs create zpool2/data
2019-12-26.00:52:51 zfs set relatime=on zpool2
2019-12-26.00:53:04 zfs set compression=on zpool2
2019-12-26.00:53:12 zfs set xattr=sa dnodesize=auto zpool2
2019-12-26.00:54:08 zfs destroy zpool2/data
2019-12-26.00:54:17 zfs create zpool2/data
2019-12-26.01:00:09 zfs snapshot -o com.sun:auto-snapshot-desc=- -r zpool2@zfs-auto-snap_frequent-2019-12-26-0000
2019-12-26.01:01:09 zfs snapshot -o com.sun:auto-snapshot-desc=- -r zpool2@zfs-auto-snap_hourly-2019-12-26-0001

Nice! Unfortunately I couldn't find the documentation in man zpool, or online. Apparently it supports some more flags; I wondered if they can filter by date or event. Right now the output takes a minute to generate, as I'm creating/deleting lots of snapshots automatically, so the full output isn't helpful at the moment.

I also still haven't got an answer on how to fix my mess here. Is destroying and recreating the zpool the best option I have? How would I keep the snapshots then? zfs send/receive?

You could send the output to a file and/or use grep to filter.
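For example (the sample lines below stand in for real output; on a live system you would pipe `zpool history tank` instead of `printf`):

```shell
# narrow a long pool history down to just the add commands
printf '%s\n' \
  '2019-12-24.13:52:40 zpool add tank /dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD' \
  '2019-12-24.12:00:00 zfs snapshot -r tank@hourly' |
  grep ' zpool add '
# prints only the "zpool add" line
```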

I also still haven't got an answer on how to fix my mess here.

Depends on how much extra storage you have. This is probably also a mailing list topic instead of a bug report.

@ccremer this seems a 2 part issue: how it happened, how can you fix it.

The _how it happened_ can be answered by having a look at the pool history, just grep for add and have a look, but that's not so much relevant in the end.

The _how can you fix it_ will, most likely and unfortunately, be answered with: you cannot. Just make a new pool with a proper config. Pools with raidz vdevs do not support top-level device removal unless it's a cache or log device. I, for example, use stripes of mirrors, and that allows me to do operations like the one you are trying. Once you have a new pool, move the data from this one over and destroy this one.

Snapshots can be transferred with the dataset to the new pool, zfs send -R will do that for you.

How much data are we talking about here? Is it something you can recreate, e.g. movies, TV shows, etc.?
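A sketch of the replication step suggested above (`newpool` is a placeholder name; `-u` keeps the received datasets unmounted until you are ready):

```shell
# one recursive snapshot, then replicate datasets, snapshots and
# properties to the new pool in a single stream
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs receive -u newpool/tankcopy
```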

found it with the grep approach: 2019-12-24.13:52:40 zpool add tank /dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD

I think I have some older 2TB drives lying around. Looking at zfs list, it seems I have

  • 4.25 TB of hard-to-recreate data, but I could give up the snapshots there, so rsync should do too
  • ~500 GB easy-to-recreate data
  • 700 GB definitely-important-data, I'll go with zfs send here.

Thank you guys for your comments. So unless you want to treat the "add without force" behaviour that @Ornias1993 mentioned as a bug, I think we can bring this discussion to an end.

@ccremer
The -f flag not being required on add is definitely a bug, but I think it's cleaner to create a new, specific issue for it (you can use my quote if you like) and close this one :)

I don't see any checks that require -f to prevent single disk vdev adds in any case.

No, zpool_do_add passes !force to make_root_vdev in the check_rep argument.

Are we sure history preserves the exact command?

edit: !false was a typo, it is !force

If it does not, then that would be the actual bug, I would say...

What good is an untrustworthy history?


I do not see an explicit ZTS test for this in zfs/tests/zfs-tests/functional/cli_root/zpool_add. Am I blind?

@scineram Yes, but as I wrote above, make_root_vdev doesn't do much with the !force:

zpool add passes the force bool to make_root_vdev. But make_root_vdev wants an int; are B_TRUE and B_FALSE compatible with that?

Even so, make_root_vdev doesn't seem to do much with the force argument...
It only passes it to is_device_in_use, and that's about it.

I don't see any checks that require -f to prevent single-disk vdev adds in any case.

This is the second report suggesting zpool add does not work as intended with mismatched replication levels (#9038): I am going to mark this as a duplicate so we don't forget to close both issues once this is fixed.

Duplicate of #9038

I have now inserted 3 drives into the same machine and made a new pool out of them. They are used solely as a transfer pool where I put the data temporarily; after that, I'll get rid of the disks again. Now I'm trying to send the data with zfs send, with encryption enabled on the transfer pool at the same time (the source is still unencrypted; I would prefer to keep the data encrypted once I have recreated the raidz2).

So my commands look like this, but I cannot seem to get it started. Before I mess things up again, any advice is appreciated :)

$ zpool history transfer
2020-01-20.19:07:26 zpool create transfer /dev/disk/by-id/ata-WDC... /dev/disk/by-id/ata-WDC... /dev/disk/by-id/ata-WDC...
2020-01-20.19:12:56 zfs set compression=lz4 transfer
2020-01-20.19:20:31 zfs set aclinherit=passthrough transfer
2020-01-20.19:47:27 zfs create -o encryption=on -o keylocation=file:///path/to/zfs/transfer.key -o keyformat=passphrase transfer/data

$ zfs snap -r tank/data@backup-2020-01-20
$ zfs send -v -R tank/data@backup-2020-01-20 | zfs receive -o encryption=on -o keyformat=passphrase -o keylocation=file:///path/to/zfs/transfer.key transfer/data
cannot receive new filesystem stream: destination 'transfer/data' exists
must specify -F to overwrite it
# with -F
$ zfs send -v -R tank/data@backup-2020-01-20 | zfs receive -o encryption=on -o keyformat=passphrase -o keylocation=file:///path/to/zfs/transfer.key -F transfer/data
cannot receive new filesystem stream: zfs receive -F cannot be used to destroy an encrypted filesystem or overwrite an unencrypted one with an encrypted one

Should I destroy the transfer/data first?
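One approach worth considering, assuming the goal is simply "data encrypted at rest on the transfer pool" (a sketch, not tested against this exact setup): don't try to overwrite the existing encrypted `transfer/data` with `-F`; instead, receive into a child of it, which inherits the parent's encryption without any `-o encryption` overrides on the receive side (the `backup` child name is a placeholder):

```shell
# transfer/data was created with encryption=on and the key settings,
# so a dataset created underneath it by receive inherits them
zfs send -v -R tank/data@backup-2020-01-20 | zfs receive transfer/data/backup
```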

What could possibly go wrong... From bad to worse:

When I tried to copy the data to the new transfer pool, the host crashed with the following errors:

Feb  1 21:19:08 vmm-1 kernel: [12098918.482173] audit: type=1400 audit(1580588348.450:33): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/lxc-start" pid=15848 comm="apparmor_parser"
Feb  1 21:19:08 vmm-1 kernel: [12098918.701348] audit: type=1400 audit(1580588348.670:35): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="lxc-container-default-cgns" pid=15851 comm="apparmor_parser"
Feb  1 21:33:40 vmm-1 kernel: [12099790.507007] ata7.00: exception Emask 0x0 SAct 0x20000000 SErr 0x0 action 0x6 frozen
Feb  1 21:33:40 vmm-1 kernel: [12099790.507304] ata7.00: failed command: WRITE FPDMA QUEUED
Feb  1 21:33:40 vmm-1 kernel: [12099790.507601] ata7.00: cmd 61/08:e8:10:1e:8c/00:00:04:00:00/40 tag 29 ncq dma 4096 out
Feb  1 21:33:40 vmm-1 kernel: [12099790.507601]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb  1 21:33:40 vmm-1 kernel: [12099790.508172] ata7.00: status: { DRDY }
Feb  1 21:33:40 vmm-1 kernel: [12099790.508468] ata7: hard resetting link
Feb  1 21:33:40 vmm-1 kernel: [12099790.849101] ata7.00: supports DRM functions and may not be fully accessible
Feb  1 21:33:40 vmm-1 kernel: [12099790.849482] ata7.00: NCQ Send/Recv Log not supported
Feb  1 21:33:40 vmm-1 kernel: [12099790.850187] ata7.00: supports DRM functions and may not be fully accessible
Feb  1 21:33:40 vmm-1 kernel: [12099790.850567] ata7.00: NCQ Send/Recv Log not supported
Feb  1 21:33:40 vmm-1 kernel: [12099790.851002] ata7.00: configured for UDMA/133
Feb  1 21:33:40 vmm-1 kernel: [12099790.851009] ata7: EH complete

After rebooting, importing the pool is not possible anymore:

root@vmm-1:/var/log# zpool import -f
   pool: tank
     id: 10161650679385460837
  state: UNAVAIL
 status: One or more devices are faulted.
 action: The pool cannot be imported due to damaged devices or data.
 config:

        tank                                          UNAVAIL  insufficient replicas
          raidz2-0                                    DEGRADED
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2NN9JNJ  UNAVAIL
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0555658  ONLINE
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0535953  ONLINE
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0527669  ONLINE
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5JEC5V2  ONLINE
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7K4AR0032  ONLINE
          ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD    FAULTED  corrupted data

I still had the old disk; my hope was that once the raidz2 vdev was whole again, the pool would have enough replicas. The old disk actually comes back online (until any scrub runs; then it should fault again with too many write errors or so):

root@vmm-1:~# zpool import -f
   pool: tank
     id: 10161650679385460837
  state: UNAVAIL
 status: One or more devices are faulted.
 action: The pool cannot be imported due to damaged devices or data.
 config:

        tank                                          UNAVAIL  insufficient replicas
          raidz2-0                                    ONLINE
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2NN9JNJ  ONLINE
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0555658  ONLINE
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0535953  ONLINE
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0527669  ONLINE
            ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5JEC5V2  ONLINE
            ata-WDC_WD40EFRX-68N32N0_WD-WCC7K4AR0032  ONLINE
          ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD    FAULTED  corrupted data

The disk is inserted and should work; I assume this is the result of my setting it offline at Christmas. I feel like ZFS does a good enough job of documenting what each command does, but a poor one when it comes to possible implications and gotchas :/

Is there ANY way I can retrieve some data out of it? Even a read-only, no-snapshot, corrupted view of the files would be sufficient for me, because I know that before I set it offline it only held about ~150 MB of possibly-corrupted data...

Commands like zpool import -fFX also failed.

Found this question: https://serverfault.com/questions/562998/zfs-bringing-a-disk-online-in-an-unavailable-pool
Researching further, there is a zdb command, for which I'm trying to find a txg_id so I can try importing with something like zpool import -o readonly=on -f -T [txg_id] tank.

root@vmm-1:~# zdb -e tank -v

Configuration for import:
        vdev_children: 2
        version: 5000
        pool_guid: 10161650679385460837
        name: 'tank'
        state: 0
        hostid: 2831157250
        hostname: 'vmm-1'
        vdev_tree:
            type: 'root'
            id: 0
            guid: 10161650679385460837
            children[0]:
                type: 'raidz'
                id: 0
                guid: 17074537606798264386
                nparity: 2
                metaslab_array: 35
                metaslab_shift: 37
                ashift: 12
                asize: 23991808425984
                is_log: 0
                create_txg: 4
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 17647865786871571347
                    whole_disk: 1
                    DTL: 222
                    create_txg: 4
                    faulted: 1
                    path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2NN9JNJ-part2'
                    devid: 'ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2NN9JNJ-part2'
                    phys_path: 'pci-0000:00:1f.2-ata-6'
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 17282544705469633954
                    whole_disk: 1
                    DTL: 228
                    create_txg: 4
                    path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0555658-part2'
                    devid: 'ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0555658-part2'
                    phys_path: 'pci-0000:00:1f.2-ata-1'
                children[2]:
                    type: 'disk'
                    id: 2
                    guid: 8666471612038363978
                    whole_disk: 1
                    DTL: 227
                    create_txg: 4
                    path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0535953-part2'
                    devid: 'ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0535953-part2'
                    phys_path: 'pci-0000:00:1f.2-ata-2'
                children[3]:
                    type: 'disk'
                    id: 3
                    guid: 3867782762648460105
                    whole_disk: 1
                    DTL: 226
                    create_txg: 4
                    path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0527669-part2'
                    devid: 'ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0527669-part2'
                    phys_path: 'pci-0000:00:1f.2-ata-3'
                children[4]:
                    type: 'disk'
                    id: 4
                    guid: 15287609802282741202
                    whole_disk: 1
                    DTL: 225
                    create_txg: 4
                    path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5JEC5V2-part2'
                    devid: 'ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5JEC5V2-part2'
                    phys_path: 'pci-0000:00:1f.2-ata-4'
                children[5]:
                    type: 'disk'
                    id: 5
                    guid: 4786417980543102686
                    whole_disk: 1
                    DTL: 267
                    create_txg: 4
                    path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K4AR0032-part1'
                    devid: 'ata-WDC_WD40EFRX-68N32N0_WD-WCC7K4AR0032-part1'
                    phys_path: 'pci-0000:00:1f.2-ata-5'
            children[1]:
                type: 'disk'
                id: 1
                guid: 14041696495835651738
                whole_disk: 1
                metaslab_array: 73622
                metaslab_shift: 34
                ashift: 12
                asize: 4000771997696
                is_log: 0
                create_txg: 23976332
                degraded: 1
                aux_state: 'external'
                path: '/dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD-part1'
                devid: 'ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD-part1'
                phys_path: 'pci-0000:04:00.0-ata-2'
        load-policy:
            load-request-txg: 18446744073709551615
            load-rewind-policy: 2
zdb: can't open 'tank': No such device or address

ZFS_DBGMSG(zdb) START:
spa.c:5490:spa_import(): spa_import: importing tank
spa_misc.c:408:spa_load_note(): spa_load(tank, config trusted): LOADING
spa_misc.c:408:spa_load_note(): spa_load(tank, config untrusted): vdev tree has 1 missing top-level vdevs.
spa_misc.c:408:spa_load_note(): spa_load(tank, config untrusted): current settings allow for maximum 0 missing top-level vdevs at this stage.
spa_misc.c:393:spa_load_failed(): spa_load(tank, config untrusted): FAILED: unable to open vdev tree [error=6]
vdev.c:179:vdev_dbgmsg_print_tree():   vdev 0: root, guid: 10161650679385460837, path: N/A, can't open
vdev.c:179:vdev_dbgmsg_print_tree():     vdev 0: raidz, guid: 17074537606798264386, path: N/A, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 17647865786871571347, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2NN9JNJ-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 1: disk, guid: 17282544705469633954, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0555658-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 2: disk, guid: 8666471612038363978, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0535953-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 3: disk, guid: 3867782762648460105, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0527669-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 4: disk, guid: 15287609802282741202, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5JEC5V2-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 5: disk, guid: 4786417980543102686, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K4AR0032-part1, healthy
vdev.c:179:vdev_dbgmsg_print_tree():     vdev 1: disk, guid: 14041696495835651738, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD-part1, faulted
spa_misc.c:408:spa_load_note(): spa_load(tank, config untrusted): UNLOADING
ZFS_DBGMSG(zdb) END

Is the create_txg: 23976332 on the last disk something of interest? Would that allow importing the pool from before that disk was added?

So you moved it to the transfer pool, that went fine, and while transferring to your new pool everything died?

In that case I don't get your attached output, because it shows the old pool with the separate (single) drive attached.

Not quite. I started copying to the transfer pool, but it didn't complete. It ran for about 30 minutes, then the host crashed.

Well, in that case this is a good thread to point people towards who added a single disk or a raid 0 :)

Found this issue: #9313
Basically it mentions this blog: https://www.delphix.com/blog/openzfs-pool-import-recovery
So I tried the procedure with zdb:

zdb -e tank -G -X tank
zdb: can't open 'tank': No such device or address

ZFS_DBGMSG(zdb) START:
spa.c:5493:spa_import(): spa_import: importing tank, max_txg=-1 (RECOVERY MODE)
spa_misc.c:408:spa_load_note(): spa_load(tank, config trusted): LOADING
spa_misc.c:408:spa_load_note(): spa_load(tank, config untrusted): vdev tree has 1 missing top-level vdevs.
spa_misc.c:408:spa_load_note(): spa_load(tank, config untrusted): current settings allow for maximum 0 missing top-level vdevs at this stage.
spa_misc.c:393:spa_load_failed(): spa_load(tank, config untrusted): FAILED: unable to open vdev tree [error=6]
vdev.c:179:vdev_dbgmsg_print_tree():   vdev 0: root, guid: 10161650679385460837, path: N/A, can't open
vdev.c:179:vdev_dbgmsg_print_tree():     vdev 0: raidz, guid: 17074537606798264386, path: N/A, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 17647865786871571347, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2NN9JNJ-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 1: disk, guid: 17282544705469633954, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0555658-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 2: disk, guid: 8666471612038363978, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0535953-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 3: disk, guid: 3867782762648460105, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0527669-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 4: disk, guid: 15287609802282741202, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5JEC5V2-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 5: disk, guid: 4786417980543102686, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K4AR0032-part1, healthy
vdev.c:179:vdev_dbgmsg_print_tree():     vdev 1: disk, guid: 14041696495835651738, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD-part1, faulted
spa_misc.c:408:spa_load_note(): spa_load(tank, config untrusted): UNLOADING
ZFS_DBGMSG(zdb) END
cd /lib
ln -s libzpool.so.2 libzpool.so

With zfs_max_missing_tvds it looks like an import could be possible:

root@vmm-1:~# zdb -e tank -G -o zfs_max_missing_tvds=1 -X tank
Dataset mos [META], ID 0, cr_txg 4, 2.19G, 9413 objects

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         0    4    16K    16K  65.0M     512   820M    0.56  DMU dnode


ZFS_DBGMSG(zdb) START:
spa.c:5493:spa_import(): spa_import: importing tank, max_txg=-1 (RECOVERY MODE)
spa_misc.c:408:spa_load_note(): spa_load(tank, config trusted): LOADING
spa_misc.c:408:spa_load_note(): spa_load(tank, config untrusted): vdev tree has 1 missing top-level vdevs.
spa_misc.c:408:spa_load_note(): spa_load(tank, config untrusted): current settings allow for maximum 1 missing top-level vdevs at this stage.
vdev.c:179:vdev_dbgmsg_print_tree():   vdev 0: root, guid: 10161650679385460837, path: N/A, degraded
vdev.c:179:vdev_dbgmsg_print_tree():     vdev 0: raidz, guid: 17074537606798264386, path: N/A, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 17647865786871571347, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2NN9JNJ-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 1: disk, guid: 17282544705469633954, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0555658-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 2: disk, guid: 8666471612038363978, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0535953-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 3: disk, guid: 3867782762648460105, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0527669-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 4: disk, guid: 15287609802282741202, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5JEC5V2-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 5: disk, guid: 4786417980543102686, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K4AR0032-part1, healthy
vdev.c:179:vdev_dbgmsg_print_tree():     vdev 1: disk, guid: 14041696495835651738, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD-part1, faulted
vdev.c:125:vdev_dbgmsg(): disk vdev '/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0555658-part2': best uberblock found for spa tank. txg 24675614
spa_misc.c:408:spa_load_note(): spa_load(tank, config untrusted): using uberblock with txg=24675614
spa_misc.c:408:spa_load_note(): spa_load(tank, config trusted): vdev tree has 1 missing top-level vdevs.
spa_misc.c:408:spa_load_note(): spa_load(tank, config trusted): current settings allow for maximum 1 missing top-level vdevs at this stage.
vdev.c:179:vdev_dbgmsg_print_tree():   vdev 0: root, guid: 10161650679385460837, path: N/A, degraded
vdev.c:179:vdev_dbgmsg_print_tree():     vdev 0: raidz, guid: 17074537606798264386, path: N/A, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 17647865786871571347, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2NN9JNJ-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 1: disk, guid: 17282544705469633954, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0555658-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 2: disk, guid: 8666471612038363978, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0535953-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 3: disk, guid: 3867782762648460105, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0527669-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 4: disk, guid: 15287609802282741202, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5JEC5V2-part2, healthy
vdev.c:179:vdev_dbgmsg_print_tree():       vdev 5: disk, guid: 4786417980543102686, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K4AR0032-part1, healthy
vdev.c:179:vdev_dbgmsg_print_tree():     vdev 1: disk, guid: 14041696495835651738, path: /dev/disk/by-id/ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD-part1, faulted
spa_misc.c:408:spa_load_note(): spa_load(tank, config trusted): spa_load_verify found 0 metadata errors and 1 data errors
spa_misc.c:408:spa_load_note(): spa_load(tank, config trusted): LOADED
spa.c:7592:spa_async_request(): spa=tank async request task=32
ZFS_DBGMSG(zdb) END

but zdb doesn't actually import the pool. Is there a way to import with zfs_max_missing_tvds=1? According to the blog, the author uses mdb to modify that setting, but mdb is not available on my Debian system, and https://github.com/max123/mdbzfs was built for Solaris and cannot be executed on my machine.

How/where can I get a version of mdb for Linux, or is there another way to import with that restriction lifted?

To me it seems illogical that you can take disks offline and ZFS keeps working even with data corruption, but add a node crash/reboot and ZFS suddenly refuses to work at all. This is where ZFS disappoints me. There should be an easier way to import faulty pools even with possible corruption; after all, it was running in that state before the reboot...

To anyone who is googling for this issue, has a similar problem, and is trying to import an otherwise healthy pool with a missing top-level device:

This is a summary of how I managed to rescue my data after accidentally adding a replacement disk to a pool instead of replacing the faulty one.

Summary

What happened

  1. In a raidz2 array, a drive was marked as faulted by a scrub due to checksum errors, so the pool became degraded.
  2. When replacing the disk with zpool replace, I got errors about the disk not being found. In an attempt to "fix" this I made a mistake that made everything worse: I executed zpool add tank <new-disk>. That created a stripe between the degraded raidz2 and the new disk instead of replacing the old one. Once done, this operation is irreversible.
  3. To make things worse, I then manually took the new drive offline with zpool offline. ZFS stopped striping, but that apparently left some data corrupted. The pool was otherwise still readable and writable.
  4. The only way to fix this pool layout is to destroy it and recreate it properly with the new disk. While copying data to a temporary pool on completely different disks with zfs send | zfs receive, the node crashed for an unknown reason, and afterwards the pool could not be imported anymore.
  5. Internet research led to modifying a ZFS module parameter, after which the import worked again.
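For reference, the difference between the fatal command in step 2 and the intended one (device names are the same placeholders used above):

```shell
# What I ran (wrong): adds <new-disk> as a second top-level vdev,
# striped next to the degraded raidz2
zpool add tank <new-disk>

# What I should have run: replaces the faulted member inside raidz2-0
zpool replace tank <old-disk> <new-disk>
```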

What went wrong

  1. According to the discussion in this thread, striping a raidz# vdev with a single disk should only be possible with zpool add -f. Right now the flag apparently makes no difference. This is considered a (duplicate) bug and is handled in #9038.
  2. Online documentation is hard to find via default search engines. There are differences between Oracle ZFS (the top search results) and OpenZFS. Also, the included man pages do not, IMO, sufficiently point out the implications of certain commands.
  3. Commands were entered without much thought (my mistake, obviously).

How to import a pool with missing top-level vdev

Obviously this only works when the damage is limited, i.e. most of the data is on one leg. In my case I had years of data on the raidz2 vdev before adding the stripe vdev.

All the flags documented in zpool import were useless in my case. Every combination (-F, -X, -T, etc.) failed with the error The pool cannot be imported due to damaged devices or data. or similar.

Reading this excellent blog post (https://www.delphix.com/blog/openzfs-pool-import-recovery) I discovered the zdb tool, which confirmed the cause of the failing import: ZFS simply will not import a pool when a top-level vdev is missing and there are insufficient replicas.
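A sketch of how zdb can be pointed at a pool that is not imported (the -e flag reads the labels of an exported/unimportable pool, -p sets the device search path; adjust paths for your system):

```shell
# Examine the pool's on-disk configuration without importing it
zdb -e -p /dev/disk/by-id tank
```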

I ran the zpool offline command almost immediately after adding the disk, so I always believed the data should be accessible and intact, since the raidz2 leg was otherwise healthy.

Luckily, there are module parameters that allowed me to import the pool. They are documented here: https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters. As it happens, the required parameter is not listed on that page 🤦‍♂ , but it is in man zfs-module-parameters.

The workaround is actually easy, but I first had to research it and learn that it exists.

# Allow a pool import even with missing top level vdevs
$ echo 1 >> /sys/module/zfs/parameters/zfs_max_missing_tvds
# to make that persistent, add an entry in `/etc/modprobe.d/zfs.conf`

# Then, let's import the tank read-only, to minimize risk of further data corruption.
$ zpool import -o readonly=on tank
$ zpool status
  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: scrub repaired 1M in 0 days 06:39:23 with 0 errors on Sun Nov 10 07:03:25 2019
config:

    NAME                                          STATE     READ WRITE CKSUM
    tank                                          DEGRADED     0     0     0
      raidz2-0                                    ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2NN9JNJ  ONLINE       0     0   129 # This is the old disk that was actually marked faulty by a scrub, it's bound to happen again.
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0555658  ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0535953  ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E0527669  ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5JEC5V2  ONLINE       0     0     0
        ata-WDC_WD40EFRX-68N32N0_WD-WCC7K4AR0032  ONLINE       0     0     0
      ata-WDC_WD40EFRX-68N32N0_WD-WCC7K5VH43CD    FAULTED      0     0     0  external device fault

errors: No known data errors
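To make the tunable survive a reboot, the modprobe entry mentioned in the commands above would look roughly like this (a sketch; verify the parameter name against man zfs-module-parameters):

```
# /etc/modprobe.d/zfs.conf
options zfs zfs_max_missing_tvds=1
```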

Now I can rescue my data with rsync etc. At this point I don't even care about snapshots or encryption anymore; I'm just happy the data is not lost completely. It was a hard time researching and figuring out a way to access the data again, even though the workaround boils down to about 3 lines of commands in the end 🙄

Lessons learned

  1. Don't. Neglect. Offsite. Backups. And. The. Monitoring. Just don't. Also, backups are worth nothing if not regularly tested.
  2. Find the correct documentation. Oracle ZFS != OpenZFS
  3. Don't change a pool layout after its initial creation (no zpool add); recreate the pool if needed. This also ensures that you have at least one recent data backup.
  4. Only ever use zpool replace when managing disk failures.
  5. Don't blindly add a force flag when a command throws an error. First investigate what the error means and why the command failed. Failure messages are often short, misleading, thrown up from a low-level library, and lack human-readable suggestions or hints for fixes.
  6. If supported, do a dry run first. (A dry run should be the default for irreversible commands IMO, with an additional "confirm" flag or something.)
  7. Expect outdated, hard-to-find or insufficient documentation
  8. Don't panic or give up easily :)
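Lesson 6 in practice: zpool add supports -n to preview the resulting pool layout without committing. Had I used it, the stripe would have been visible up front:

```shell
# Dry run: prints the vdev layout that would result, without modifying the pool
zpool add -n tank <new-disk>
```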

This might seem common sense. But we are all human and make mistakes here and there ;)

Thanks for the help to everyone involved here. I think I can go on now and maybe you also learned a thing or two.

@ccremer please file an issue against the zfs_module_parameters wiki page for the missing module parameter and assign it to me.
