Distribution Name | Scientific Linux
Distribution Version | 7.7
Linux Kernel | 3.10.0-1062.1.2.el7.x86_64
Architecture | x86-64
ZFS Version | 0.8.2
I was using ZoL version 0.8.1 from the zfs-testing repo. After upgrading to 0.8.2 from the EL7.7 repo and rebooting, I get this:
# zpool status storage
pool: storage
state: ONLINE
status: Mismatch between pool hostid and system hostid on imported pool.
This pool was previously imported into a system with a different hostid,
and then was verbatim imported into this system.
action: Export this pool on all systems on which it is imported.
Then import it to correct the mismatch.
see: http://zfsonlinux.org/msg/ZFS-8000-EY
scan: scrub repaired 0B in 0 days 07:40:32 with 0 errors on Sat Jun 29 18:33:32 2019
config:
NAME STATE READ WRITE CKSUM
storage ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
ata-ST8000AS0002-1NA17Z_Z840KE5M ONLINE 0 0 0
ata-ST8000AS0002-1NA17Z_Z840KEB0 ONLINE 0 0 0
ata-ST8000AS0002-1NA17Z_Z840KEQJ ONLINE 0 0 0
ata-ST8000AS0002-1NA17Z_Z840KHX1 ONLINE 0 0 0
ata-ST8000AS0002-1NA17Z_Z840KJNK ONLINE 0 0 0
ata-ST8000AS0002-1NA17Z_Z840KR6D ONLINE 0 0 0
ata-ST8000AS0002-1NA17Z_Z840KS8R ONLINE 0 0 0
ata-ST8000AS0002-1NA17Z_Z840KV6J ONLINE 0 0 0
ata-ST8000AS0002-1NA17Z_Z840L4A9 ONLINE 0 0 0
ata-ST8000AS0002-1NA17Z_Z840L4N4 ONLINE 0 0 0
logs
ata-M4-CT064M4SSD2_00000000111903078C20-part3 ONLINE 0 0 0
errors: No known data errors
The pool has survived multiple reboots and ZoL version upgrades in the past without such effect. The problem happened to all 3 pools on this system.
I cannot say if the problem is reproducible. I know how to fix it with export/import, I only wanted to point out that it happened by itself after version upgrade.
Host ID was never defined on this system: /etc/hostid does not exist, cat /sys/module/spl/parameters/spl_hostid returns 0 (just as before).
@vstax thank you for reporting this. This warning was inadvertently triggered by 25f06d677a81a65ca98fa3d725ab5031a4864104, which added logic to cache the hostid. As a workaround, you can clear the warning without exporting/importing the pool by briefly enabling the multihost pool property. This will cause the cached hostid to be updated.
zgenhostid
zpool set multihost=on <pool>
zpool set multihost=off <pool>
rm /etc/hostid
What is most likely causing this is the system's hostid changing during the boot process after the pool is initially imported. We'll propose a fix.
Could this kind of bug be avoided if the test suite could create ZFS filesystems with older versions and then run the test suite using the latest code?
@gdevenyi that's a good idea. The CI already automatically performs basic forwards and backwards compatibility testing for every commit, but unfortunately that testing wouldn't catch this issue, since it depends on the pool being imported during boot before networking is started (which changes the hostid). This particular scenario would be a bit more difficult to automate.
I had this occur on Gentoo - but I thought it was triggered by the intermittent error I have where the initramfs seems to have an outdated zpool.cache (I set root to cachefile=none during the upgrade).
Gentoo didn't have /etc/hostid so I created it and rebuilt initramfs, now it seems to be happier, though the hostids still don't match (I suspect 007f0100 is basically "localhost").
tank ~ # zdb -C root
MOS Configuration:
version: 5000
name: 'root'
state: 0
txg: 3691586
pool_guid: 8720697486956549366
errata: 0
hostid: 8323328
hostname: '(none)'
com.delphix:has_per_vdev_zaps
vdev_children: 1
vdev_tree:
type: 'root'
id: 0
guid: 8720697486956549366
children[0]:
type: 'mirror'
id: 0
guid: 5511327584098821731
metaslab_array: 132
metaslab_shift: 30
ashift: 9
asize: 120019484672
is_log: 0
create_txg: 4
com.delphix:vdev_zap_top: 129
children[0]:
type: 'disk'
id: 0
guid: 14569829130078630519
path: '/dev/sda1'
whole_disk: 1
DTL: 277
create_txg: 4
com.delphix:vdev_zap_leaf: 130
children[1]:
type: 'disk'
id: 1
guid: 12082120380929373497
path: '/dev/sdb1'
whole_disk: 1
DTL: 263
create_txg: 4
com.delphix:vdev_zap_leaf: 131
features_for_read:
com.delphix:embedded_data
com.delphix:hole_birth
tank ~ # hostid
007f0100
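The suspicion that 007f0100 is basically "localhost" fits how glibc's gethostid() behaves when there is no /etc/hostid: it resolves the hostname to an IPv4 address and swaps the two 16-bit halves of the address word. A minimal sketch of that arithmetic, assuming a little-endian machine (ip_to_hostid is a hypothetical helper name, not part of ZFS or glibc):

```sh
#!/bin/sh
# Sketch: reproduce the fallback hostid glibc derives from an IPv4 address
# on a little-endian machine (assumption: /etc/hostid does not exist).
# ip_to_hostid is a hypothetical helper name used only for illustration.
ip_to_hostid() {
    old_ifs=$IFS
    IFS=.
    # Split the dotted quad a.b.c.d into $1..$4.
    set -- $1
    IFS=$old_ifs
    # Swapping the 16-bit halves of the little-endian address word
    # gives the byte order b a d c when printed as hex.
    printf '%02x%02x%02x%02x\n' "$2" "$1" "$4" "$3"
}

ip_to_hostid 127.0.0.1   # prints 007f0100
```

So a hostname that resolves to 127.0.0.1 early in boot and to a real address once networking is up would yield two different hostids on the same machine.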
Is there a reason that the hostid is not generated based on something that is likely to be fairly static for a particular system? An option that I am strongly considering is creating the hostid based on the system's uuid, as implemented in my set-hostid.
I understand that there are a few things that are not perfect with this approach. In particular:
It depends on dmidecode and python3. I suspect those could be removed by writing it in C and using whatever API dmidecode is using.
@mgerdts using the dmidecode approach makes a lot of sense to me as a nice way to address this. Arguably, it would have been a better way to handle this initially, but at the time it didn't occur to us.
It does suffer from the limitations you've mentioned, but those are probably manageable. The big one would be adding a dependency on dmidecode and python3 which we should do our best to avoid.
If you're so inclined to investigate this further that would be wonderful.
Some points:
dmidecode is not often installed by default
dmidecode -t 1 | awk '$1 ~ /^UUID/ {print $NF}'
The kernel decodes the SMBIOS information so you don't need dmidecode either.
# cat /sys/devices/virtual/dmi/id/product_uuid
A6B8816B-2068-D803-2FC6-00259001EC32
Of course I seem to recall having seen lots of systems that have bogus UUID and I think it's relatively easy to find BIOS tools that let you change the UUID to anything you might want to set it to. But as long as it's not in the small blacklist (all zeros, 01020304-0506-0708-090a-0b0c0d0e0f10 IIRC) it would be fine to use.
I thought it would be easier to find a machine with a bogus UUID, but on three machines I checked amongst the ancient hardware in my lab, they all had valid UUIDs:
23C4E16A-AF21-11E0-9EE9-5572DFDA3A26
The first one above is based on the motherboard ethernet MAC, but these other two appear to be generated using the "new" UUID algorithm that does NOT reference a system MAC address.
If administrators have systems with shared storage where ZFS pools can be failed over from one host to another, it is on them to ensure that the two hosts don't have identical UUIDs or identical hostids, which is something they have to do already for that configuration.
As a fix for this, should we include the spl_hostid in the dracut generated initramfs as a module parameter? Or perhaps on the kernel command line? "spl.spl_hostid=
Grub with the zfs module can read from the pool so it should be able to read /etc/hostid directly and pass that on the command line if you wanted it to be more dynamic.
If you want to use CRC-32 to hash the UUID, you can do it all in sh. No dmidecode or python required.
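A rough sketch of that idea in plain sh, using the POSIX cksum utility. Note the assumptions: cksum computes the POSIX CRC variant, which is not the same polynomial setup as zlib's CRC-32, and uuid_to_hostid is a hypothetical helper name. The UUID is lowercased first so that the result does not depend on how the firmware reports it.

```sh
#!/bin/sh
# Sketch: derive a 32-bit hostid from a system UUID without dmidecode or python.
# Assumption: any stable 32-bit hash is acceptable; this uses the POSIX cksum
# CRC, which differs from the zlib CRC-32 algorithm.
# uuid_to_hostid is a hypothetical helper name used only for illustration.
uuid_to_hostid() {
    # Normalize case so upper- and lowercase UUIDs hash identically.
    uuid=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    # cksum prints "<crc> <byte count>"; keep only the CRC field.
    crc=$(printf '%s' "$uuid" | cksum | cut -d ' ' -f 1)
    # Format as the 8-digit hex string that hostid(1) would print.
    printf '%08x\n' "$crc"
}

# On a real system the argument could come from
# /sys/devices/virtual/dmi/id/product_uuid (usually readable by root only).
uuid_to_hostid A6B8816B-2068-D803-2FC6-00259001EC32
```

The resulting hex string could then be handed to zgenhostid, which accepts a hostid argument, though a check for the known-bogus UUIDs mentioned earlier would still be needed first.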
> MOS Configuration:
> hostid: 8323328
[snip]
> tank ~ # hostid
> 007f0100
Actually, the hostids are the same. The one in the zdb output is in decimal:
# printf "%08x\n" 8323328
007f0100
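If you need to do this comparison often, it can be wrapped in a small helper. This is only a sketch: hostid_matches is a hypothetical name, and it assumes you pass in the decimal value from zdb -C and the hex value from hostid yourself.

```sh
#!/bin/sh
# Sketch: compare a decimal hostid (as shown by zdb -C) with a hex hostid
# (as shown by hostid). hostid_matches is a hypothetical helper name.
hostid_matches() {
    # Convert the decimal pool hostid to 8-digit lowercase hex.
    pool_hex=$(printf '%08x' "$1")
    # Lowercase the system hostid in case it was supplied in uppercase.
    sys_hex=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ "$pool_hex" = "$sys_hex" ]
}

if hostid_matches 8323328 007f0100; then
    echo "match"
else
    echo "mismatch"
fi
```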
I am getting the mismatch issue on a fresh install for my root pool. Running zfs 0.8.3 within Proxmox. The posted workaround did not fix the issue either.