ZFS: kstat collisions from pool names

Created on 17 Oct 2016 · 16 comments · Source: openzfs/zfs

It dawns on me that this is potentially problematic:

/proc/spl/kstat/zfs/fm
/proc/spl/kstat/zfs/zil
/proc/spl/kstat/zfs/dbufs
/proc/spl/kstat/zfs/rpool
/proc/spl/kstat/zfs/rpool/io
/proc/spl/kstat/zfs/rpool/txgs
/proc/spl/kstat/zfs/rpool/reads
/proc/spl/kstat/zfs/rpool/dmu_tx_assign
/proc/spl/kstat/zfs/dbgmsg
/proc/spl/kstat/zfs/dmu_tx
/proc/spl/kstat/zfs/arcstats
/proc/spl/kstat/zfs/xuio_stats
/proc/spl/kstat/zfs/zfetchstats
/proc/spl/kstat/zfs/vdev_cache_stats

Specifically, what happens if the pool named rpool had instead been named zil or one of those other kstats? Things collide. I haven't tried this on a live system to see what happens yet, but I can imagine no way of handling a collision here that I like.

I think we should change (void) snprintf(name, KSTAT_STRLEN, "zfs/%s", spa_name(spa)); to (void) snprintf(name, KSTAT_STRLEN, "zfs/pool/%s", spa_name(spa)); in spa_tx_assign_init(). That would avoid the collisions. I am opening this issue to hear whether anyone is opposed to this change.
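
For concreteness, a minimal sketch of the two name formats being compared; the surrounding kstat setup in spa_tx_assign_init() is unchanged and elided:

char name[KSTAT_STRLEN];

/* current: pool kstats sit next to the module-level kstats under /proc/spl/kstat/zfs/ */
(void) snprintf(name, KSTAT_STRLEN, "zfs/%s", spa_name(spa));

/* proposed: give pools their own namespace, /proc/spl/kstat/zfs/pool/<pool name> */
(void) snprintf(name, KSTAT_STRLEN, "zfs/pool/%s", spa_name(spa));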

Labels: Design Review Needed, Inactive, Stale, Understood, Defect, good first issue

All 16 comments

There's a decent chance that will break some users' monitoring scripts. But I agree the pools should have been placed in a subdirectory from day one. _If_ we wanted to be super friendly we could consider creating symlinks to the new names when there isn't a name conflict.
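
A rough sketch of what such compatibility symlinks could look like, assuming the caller has already checked for a name conflict; spa_kstat_compat_link() and proc_zfs_dir are hypothetical names, not existing SPL or ZFS symbols:

#include <linux/proc_fs.h>

/*
 * Hedged sketch only: expose /proc/spl/kstat/zfs/<pool> as a symlink to the
 * relocated pool/<pool> directory so existing monitoring scripts keep working.
 * The caller is assumed to have verified that <pool> does not collide with a
 * module-level kstat name. KSTAT_STRLEN comes from the SPL headers.
 */
static void
spa_kstat_compat_link(const char *pool, struct proc_dir_entry *proc_zfs_dir)
{
	char target[KSTAT_STRLEN];

	(void) snprintf(target, sizeof (target), "pool/%s", pool);
	(void) proc_symlink(pool, proc_zfs_dir, target);
}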

Agreed... lots of monitoring agents look for the pools (e.g. Telegraf).

Also, this problem exists in general, which is another reason why not mounting onto a non-empty directory by default is a good idea (zpool create etc /dev/sd...).

@kpande Thanks for testing that. :)

@behlendorf I think the monitoring scripts are broken either way, although it is definitely worse with my proposed fix than with the current situation. Let's introduce this change in phases. We can have a module parameter to control the behavior and offer 3 options that I'll call legacy (what we have now), compatible (what you suggested) and standard (what I suggested). We could introduce it as legacy at first and let script authors know about it. Then switch to compatible in the major release afterward. Then switch to standard in the major release after that. That should give script authors ample time to migrate and give system administrators the ability to roll back the behavior if they for some reason have not updated their scripts. How does that sound?
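
To make the phased approach concrete, a tunable along these lines could gate the behavior; the parameter name and values below are purely illustrative and do not exist in ZFS today:

#include <linux/module.h>

/*
 * Illustrative sketch only:
 *   0 = legacy      register pool kstats as zfs/<pool> (current behavior)
 *   1 = compatible  register as zfs/pool/<pool>, plus a zfs/<pool> symlink
 *                   when the name does not collide
 *   2 = standard    register as zfs/pool/<pool> only
 */
static int zfs_pool_kstat_namespace = 0;
module_param(zfs_pool_kstat_namespace, int, 0644);
MODULE_PARM_DESC(zfs_pool_kstat_namespace,
	"Per-pool kstat namespace (0=legacy, 1=compatible, 2=standard)");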

@richardelling Thankfully, the collision doesn't cause any runtime issues aside from breaking scripts. We can do a very long transition to minimize the pain.

all good ideas, but I think we can go straight to "compatible" and not worry about "legacy"

Alright. I'll plan to write up a patch later this week. This is a low priority, but it ought to be a small patch, so I'll fit it in. :)

This _could_ cause some runtime issues:

root@linux:~# modprobe zfs
[   39.018919] spl: loading out-of-tree module taints kernel.
[   39.021217] SPL: Loaded module v0.7.0-rc4_5_g7a35f2b (DEBUG mode)
[   39.655319] znvpair: module license 'CDDL' taints kernel.
[   39.656172] Disabling lock debugging due to kernel taint
[   60.040827] ZFS: Loaded module v0.7.0-rc4_103_ge19572e (DEBUG mode), ZFS pool version 5000, ZFS filesystem version 5
root@linux:~# POOLNAME='zil'
root@linux:~# TMPDIR='/var/tmp'
root@linux:~# mountpoint -q $TMPDIR || mount -t tmpfs tmpfs $TMPDIR
root@linux:~# zpool destroy $POOLNAME
cannot open 'zil': no such pool
root@linux:~# rm -f $TMPDIR/zpool.dat
root@linux:~# fallocate -l 64m $TMPDIR/zpool.dat
root@linux:~# zpool create $POOLNAME $TMPDIR/zpool.dat
[   64.118609] ------------[ cut here ]------------
[   64.119136] WARNING: CPU: 0 PID: 738 at fs/proc/generic.c:345 proc_register+0xe4/0x110
[   64.120037] proc_dir_entry 'zfs/zil' already registered
[   64.120620] Kernel panic - not syncing: panic_on_warn set ...
[   64.120620] 
[   64.122457] CPU: 0 PID: 738 Comm: zpool Tainted: P           OE   4.9.0 #5
[   64.124222] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   64.124868]  0000000000000000 ffffffff9a736b63 ffffa8954082fb00 ffffa8954082fb78
[   64.126761]  ffffffff9a5890ca ffffffff00000008 ffffa8954082fb88 ffffa8954082fb20
[   64.128666]  000000003ef23b9a 000000000000002b ffffffff9ae1bf0e 000000000000002b
[   64.130573] Call Trace:
[   64.130863]  [<ffffffff9a736b63>] ? dump_stack+0x5c/0x79
[   64.132507]  [<ffffffff9a5890ca>] ? panic+0xe8/0x236
[   64.136116]  [<ffffffff9a684c34>] ? proc_register+0xe4/0x110
[   64.136607]  [<ffffffff9a48150e>] ? __warn+0xde/0xe0
[   64.138139]  [<ffffffff9a48156f>] ? warn_slowpath_fmt+0x5f/0x80
[   64.138756]  [<ffffffff9aa0ea7d>] ? _raw_spin_unlock_irq+0x1d/0x30
[   64.139405]  [<ffffffff9a684a7d>] ? proc_alloc_inum+0x4d/0xe0
[   64.139985]  [<ffffffff9a684c34>] ? proc_register+0xe4/0x110
[   64.140598]  [<ffffffff9a684d60>] ? proc_mkdir_data+0x60/0x90
[   64.142239]  [<ffffffffc0807dca>] ? __kstat_install+0x24a/0x2f0 [spl]
[   64.143997]  [<ffffffffc099cd44>] ? spa_stats_init+0x144/0x4f0 [zfs]
[   64.144703]  [<ffffffffc07ff176>] ? spl_kmem_zalloc+0xc6/0x190 [spl]
[   64.146448]  [<ffffffffc099a0c5>] ? spa_add+0x395/0x720 [zfs]
[   64.148123]  [<ffffffffc09906a1>] ? spa_create+0x151/0xae0 [zfs]
[   64.148793]  [<ffffffffc09d6682>] ? zfs_fill_zplprops_impl+0x262/0x450 [zfs]
[   64.150725]  [<ffffffffc09dba05>] ? zfs_ioc_pool_create+0x205/0x270 [zfs]
[   64.152459]  [<ffffffffc09dc857>] ? zfsdev_ioctl+0x627/0x740 [zfs]
[   64.154118]  [<ffffffff9a624e7b>] ? do_vfs_ioctl+0x9b/0x5f0
[   64.154753]  [<ffffffff9a625446>] ? SyS_ioctl+0x76/0x90
[   64.156359]  [<ffffffff9aa0f13b>] ? entry_SYSCALL_64_fastpath+0x1e/0xad
[   64.158188] Kernel Offset: 0x19400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   64.160363] Rebooting in 1 seconds..

> Then switch to compatible in the major release afterward. Then switch to standard in the major release after that.

Can we please push this change into the next (0.7.0) stable release?

@loli10K yes, if you can open a PR which addresses this in the next few days.

I've been thinking about this: we can't guarantee the uniqueness of kstat dirs because the pool name is truncated (KSTAT_STRLEN = 31).
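
A quick user-space illustration of the truncation; KSTAT_STRLEN is hard-coded here to the value quoted above, it normally comes from the SPL headers:

#include <stdio.h>

#define KSTAT_STRLEN 31

int
main(void)
{
	char a[KSTAT_STRLEN], b[KSTAT_STRLEN];

	(void) snprintf(a, KSTAT_STRLEN, "zfs/%s", "verylongpoolname0123456789");
	(void) snprintf(b, KSTAT_STRLEN, "zfs/%s", "verylongpoolname0123456789_second");

	/* both print "zfs/verylongpoolname0123456789": two pools, one kstat path */
	printf("%s\n%s\n", a, b);
	return (0);
}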

We can have different pool names and still get a kstat collision (albeit without the WARN()):

root@linux:~# ls -l /proc/spl/kstat/zfs
total 0
-rw-r--r-- 1 root root 0 Jul 12 22:59 abdstats
-rw-r--r-- 1 root root 0 Jul 12 22:59 arcstats
-rw-r--r-- 1 root root 0 Jul 12 22:59 dbgmsg
-rw-r--r-- 1 root root 0 Jul 12 22:59 dbufs
-rw-r--r-- 1 root root 0 Jul 12 22:59 dmu_tx
-rw-r--r-- 1 root root 0 Jul 12 22:59 fletcher_4_bench
-rw-r--r-- 1 root root 0 Jul 12 22:59 fm
-rw-r--r-- 1 root root 0 Jul 12 22:59 vdev_cache_stats
-rw-r--r-- 1 root root 0 Jul 12 22:59 vdev_raidz_bench
dr-xr-xr-x 2 root root 0 Jul 12 22:59 verylongpoolname0123456789
-rw-r--r-- 1 root root 0 Jul 12 22:59 xuio_stats
-rw-r--r-- 1 root root 0 Jul 12 22:59 zfetchstats
-rw-r--r-- 1 root root 0 Jul 12 22:59 zil
root@linux:~# 
root@linux:~# zpool list
NAME                                SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
verylongpoolname0123456789           48M  97.5K  47.9M         -     1%     0%  1.00x  ONLINE  -
verylongpoolname0123456789_second    48M   104K  47.9M         -     1%     0%  1.00x  ONLINE  -
root@linux:~

This would happen even if we moved the kstat dirs under zfs/pool/. Maybe it would make more sense just to avoid the WARN() from proc_register() and document the fact that, if you need working kstats, you shouldn't give your pools "reserved" names?

For the kstats, you can use the pool's guid (as hex, please). Then, in a stat, keep the pool's name.

@richardelling that _could_ work, but:

  1. It's SPL's job to deal with Linux procfs and its uniqueness requirement, not ZFS's: blindly trusting ZFS to always ask SPL to kstat_install() unique guids sounds wrong to me, even if it would (probably) always work.

  2. Illumos seems to have the same issue with kstat collisions, and they still use spa names:

[root@52-54-00-d3-7a-f3 ~]# uname -v
joyent_20170202T040152Z
[root@52-54-00-d3-7a-f3 ~]# zpool list | grep verylongpoolname
NAME                              SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
verylongpoolname0123456789AAAAA    48M   264K  47.7M         -     1%     0%  1.00x  ONLINE  -
verylongpoolname0123456789AAAAB    48M   264K  47.7M         -     1%     0%  1.00x  ONLINE  -
[root@52-54-00-d3-7a-f3 ~]# kstat -l | grep verylongpoolname
zfs:0:verylongpoolname0123456789AAAA:class
zfs:0:verylongpoolname0123456789AAAA:crtime
zfs:0:verylongpoolname0123456789AAAA:nread
zfs:0:verylongpoolname0123456789AAAA:nwritten
zfs:0:verylongpoolname0123456789AAAA:rcnt
zfs:0:verylongpoolname0123456789AAAA:reads
zfs:0:verylongpoolname0123456789AAAA:rlastupdate
zfs:0:verylongpoolname0123456789AAAA:rlentime
zfs:0:verylongpoolname0123456789AAAA:rtime
zfs:0:verylongpoolname0123456789AAAA:snaptime
zfs:0:verylongpoolname0123456789AAAA:wcnt
zfs:0:verylongpoolname0123456789AAAA:wlastupdate
zfs:0:verylongpoolname0123456789AAAA:wlentime
zfs:0:verylongpoolname0123456789AAAA:writes
zfs:0:verylongpoolname0123456789AAAA:wtime
[root@52-54-00-d3-7a-f3 ~]# 

and from dmesg:

2017-09-15T12:13:42.930954+00:00 52-54-00-d3-7a-f3 unix: [ID 665567 kern.warning] WARNING: kstat_create('zfs', 0, 'verylongpoolname0123456789AAAAB'): namespace collision
2017-09-15T12:20:10.376250+00:00 52-54-00-d3-7a-f3 unix: [ID 665567 kern.warning] WARNING: kstat_create('zfs', 0, 'arcstats'): namespace collision

We could deal with this issue in a similar way: emit a warning from SPL and avoid triggering the WARN() in procfs. Given the limited time available this would be a short-term objective: in the long run we can revisit this issue. Does this seem reasonable?
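
For illustration, the short-term check could take roughly this shape on the SPL side; the bookkeeping structure and helper below are hypothetical, not the actual SPL internals:

#include <linux/list.h>
#include <linux/string.h>
#include <linux/printk.h>

/* Hypothetical bookkeeping of kstat names already handed to procfs. */
struct spl_kstat_entry {
	struct list_head	ke_node;
	char			ke_module[KSTAT_STRLEN];
	char			ke_name[KSTAT_STRLEN];
};

static LIST_HEAD(spl_kstat_names);

static int
spl_kstat_name_in_use(const char *module, const char *name)
{
	struct spl_kstat_entry *ke;

	list_for_each_entry(ke, &spl_kstat_names, ke_node) {
		if (strncmp(ke->ke_module, module, KSTAT_STRLEN) == 0 &&
		    strncmp(ke->ke_name, name, KSTAT_STRLEN) == 0)
			return (1);
	}
	return (0);
}

/*
 * kstat_install() would then log a warning and bail out instead of letting
 * proc_register() fire its WARN(), e.g.:
 *
 *	if (spl_kstat_name_in_use(ksp->ks_module, ksp->ks_name)) {
 *		printk(KERN_WARNING "SPL: kstat_install('%s', '%s'): "
 *		    "namespace collision\n", ksp->ks_module, ksp->ks_name);
 *		return;
 *	}
 */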

I think we're not communicating well. What I recommend is using something like snprintf(name, KSTAT_STRLEN, "0x%16x", spa->spa_config_guid) rather than spa_name(spa).

It might be overall simpler to make a function that is more descriptive of its use, like spa_kstat_name(). If we were to take that approach, then we could try a substring hash when the name is too long.

NB, I haven't looked closely at the various spa guids recently, though, and we'll need to figure out if the spa_config_guid or spa_load_guid is appropriate at the time kstats are initialized.

In any case, today there are few consumers of this data that aren't humans. So the best approach is oriented towards humans.
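
For illustration, such a helper might look roughly like this; spa_kstat_name() is hypothetical here, and spa_guid() stands in for whichever of the guids mentioned above turns out to be appropriate:

/*
 * Hypothetical helper, sketched only to show the shape of the idea; whether
 * spa_config_guid, spa_load_guid, or the pool guid is the right key is the
 * open question raised above.
 */
static void
spa_kstat_name(spa_t *spa, char *buf, size_t len)
{
	/* "zfs/" + 16 hex digits + NUL fits comfortably within KSTAT_STRLEN */
	(void) snprintf(buf, len, "zfs/%016llx", (u_longlong_t)spa_guid(spa));
}

The human-readable pool name could then be published as a named value inside that kstat directory, in line with the "keep the pool's name in a stat" suggestion above.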

> We could deal with this issue in a similar way: emit a warning from SPL and avoid triggering the WARN() in procfs. Given the limited time available this would be a short-term objective: in the long run we can revisit this issue. Does this seem reasonable?

Yes. This sounds entirely reasonable; let's prevent the warnings for now and we can revisit a better long-term solution.

As for using guids for the pool names in /proc/, that's an option, although it's not exactly the most human-friendly naming convention. If we ignore the truncated-pool-name issue for now, we do already have unique pool names available to us, and that unlikely case could be handled in a variety of ways.

This is more of an implementation question than anything: should we be allowed to name pools and datasets after things like zfs-related commands, or pool/dataset property names?

It's not recommended, but historically nothing has outright prevented it, so it's entirely possible users have created pools with these sorts of names. Aside from the confusion this can cause, these kstat names are the only place I can think of offhand where it's an actual technical problem.

@loli10K's kstat collision detection patch was merged. I'm leaving this issue open until a more complete long term solution can be implemented.

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
