Distribution Name | CentOS
Distribution Version | CentOS Linux release 7.5.1804 (Core)
Linux Kernel | 3.10.0-862.14.4.el7.x86_64
Architecture | x86_64
ZFS Version | 0.8.0-rc2
SPL Version | 0.8.0-rc2
See http://list.zfsonlinux.org/pipermail/zfs-discuss/2018-November/032655.html for why I wanted to give allocation classes a try.
I would like to speed up metadata access by splitting metadata off from the data onto a dedicated vdev (pushing metadata to L2ARC is not an option).
For testing I'm using a virtual machine with two virtual block devices, xvdb and xvde. Later on I would like to add an SSD as a special vdev to the existing pool on our backup server to speed up metadata access.
What I'm observing does not meet my expectations - I don't see any benefit at all when experimenting with metadata access.
On the pool, I create a lot of empty files (simply touch filename.$index). After invalidating the cache with "echo 3 > /proc/sys/vm/drop_caches" and an export/re-import of the pool, I read the files' metadata with "rsync -av --dry-run /zfstestpool /tmp" (which simply does a recursive lstat() on every file).
From what I can see, most of the reads are being served by the pool's regular device and not by the special one.
What I would expect is a lot more read I/O and a lot more read bandwidth being served by the special device.
I cannot explain this behaviour; I discussed it on IRC with a user.
Steps: create a pool with a special device for metadata, create lots of empty files, and run a metadata-centric workload:
zpool create zfstestpool /dev/xvdb special /dev/xvde
<create lots of empty files>
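The steps above can be sketched as a script (a sketch only: the file count and iostat interval are illustrative, and this must run as root on throwaway devices):

```shell
# Create the test pool with a special allocation-class vdev for metadata.
zpool create zfstestpool /dev/xvdb special /dev/xvde

# Create many empty files so the workload is almost pure metadata.
for i in $(seq 1 100000); do
    touch "/zfstestpool/filename.$i"
done

# Drop caches and export/re-import so the next reads must hit the disks.
echo 3 > /proc/sys/vm/drop_caches
zpool export zfstestpool
zpool import zfstestpool

# Metadata-centric workload: a recursive lstat() of every file.
rsync -av --dry-run /zfstestpool /tmp

# In a second terminal, watch which vdev actually serves the reads.
zpool iostat -v zfstestpool 5
```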
[root@rolandtest /]# zpool list -v
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zfstestpool 2,03T 192M 2,03T - - 0% 0% 1.00x ONLINE -
xvdb 1,94T 110M 1,94T - - 0% 0,00%
special - - - - - -
xvde 99,5G 82,0M 99,4G - - 0% 0,08%
[root@rolandtest /]# zpool get all
NAME PROPERTY VALUE SOURCE
zfstestpool size 2,03T -
zfstestpool capacity 0% -
zfstestpool altroot - default
zfstestpool health ONLINE -
zfstestpool guid 6837366004087058845 -
zfstestpool version - default
zfstestpool bootfs - default
zfstestpool delegation on default
zfstestpool autoreplace off default
zfstestpool cachefile - default
zfstestpool failmode wait default
zfstestpool listsnapshots off default
zfstestpool autoexpand off default
zfstestpool dedupditto 0 default
zfstestpool dedupratio 1.00x -
zfstestpool free 2,03T -
zfstestpool allocated 192M -
zfstestpool readonly off -
zfstestpool ashift 0 default
zfstestpool comment - default
zfstestpool expandsize - -
zfstestpool freeing 0 -
zfstestpool fragmentation 0% -
zfstestpool leaked 0 -
zfstestpool multihost off default
zfstestpool checkpoint - -
zfstestpool load_guid 10393073353253220537 -
zfstestpool feature@async_destroy enabled local
zfstestpool feature@empty_bpobj enabled local
zfstestpool feature@lz4_compress active local
zfstestpool feature@multi_vdev_crash_dump enabled local
zfstestpool feature@spacemap_histogram active local
zfstestpool feature@enabled_txg active local
zfstestpool feature@hole_birth active local
zfstestpool feature@extensible_dataset active local
zfstestpool feature@embedded_data active local
zfstestpool feature@bookmarks enabled local
zfstestpool feature@filesystem_limits enabled local
zfstestpool feature@large_blocks enabled local
zfstestpool feature@large_dnode enabled local
zfstestpool feature@sha512 enabled local
zfstestpool feature@skein enabled local
zfstestpool feature@edonr enabled local
zfstestpool feature@userobj_accounting active local
zfstestpool feature@encryption enabled local
zfstestpool feature@project_quota active local
zfstestpool feature@device_removal enabled local
zfstestpool feature@obsolete_counts enabled local
zfstestpool feature@zpool_checkpoint enabled local
zfstestpool feature@spacemap_v2 active local
zfstestpool feature@allocation_classes active local
zfstestpool feature@resilver_defer enabled local
<do rsync -av --dry-run /zfstestpool /tmp>
capacity operations bandwidth
pool alloc free read write read write
----------- ----- ----- ----- ----- ----- -----
zfstestpool 191M 2,03T 3,56K 0 7,33M 0
xvdb 110M 1,94T 3,34K 0 1,67M 0
special - - - - - -
xvde 81,6M 99,4G 230 0 5,66M 0
----------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
----------- ----- ----- ----- ----- ----- -----
zfstestpool 192M 2,03T 3,78K 63 3,81M 458K
xvdb 110M 1,94T 3,54K 3 1,77M 4,00K
special - - - - - -
xvde 81,9M 99,4G 240 59 2,04M 454K
----------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
----------- ----- ----- ----- ----- ----- -----
zfstestpool 192M 2,03T 3,62K 0 1,96M 0
xvdb 110M 1,94T 3,60K 0 1,80M 0
special - - - - - -
xvde 81,9M 99,4G 25 0 161K 0
----------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
----------- ----- ----- ----- ----- ----- -----
zfstestpool 192M 2,03T 3,53K 0 7,03M 0
xvdb 110M 1,94T 3,28K 0 1,64M 0
special - - - - - -
xvde 81,9M 99,4G 253 0 5,39M 0
----------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
----------- ----- ----- ----- ----- ----- -----
zfstestpool 192M 2,03T 3,78K 0 4,03M 0
xvdb 110M 1,94T 3,55K 0 1,78M 0
special - - - - - -
xvde 81,9M 99,4G 233 0 2,26M 0
----------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
----------- ----- ----- ----- ----- ----- -----
zfstestpool 192M 2,03T 3,62K 0 1,95M 0
xvdb 110M 1,94T 3,59K 0 1,80M 0
special - - - - - -
xvde 81,9M 99,4G 25 0 162K 0
----------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
----------- ----- ----- ----- ----- ----- -----
zfstestpool 192M 2,03T 3,58K 70 9,36M 367K
xvdb 110M 1,94T 3,20K 3 1,60M 4,00K
special - - - - - -
xvde 82,0M 99,4G 387 66 7,76M 363K
----------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
----------- ----- ----- ----- ----- ----- -----
zfstestpool 192M 2,03T 3,46K 0 1,94M 0
xvdb 110M 1,94T 3,40K 0 1,70M 0
special - - - - - -
xvde 82,0M 99,4G 62 0 244K 0
----------- ----- ----- ----- ----- ----- -----
As discussed on IRC, "zfs set special_small_blocks=1k zfstestpool" was the solution to this problem.
Apparently the files' metadata is treated as a "small block". After setting it to 1k, all lstat() requests are served from the special metadata device xvde (as they should be).
Oh, sorry about the initial confusion regarding embedded_data - I only tested with it active/inactive because I was told to, but apparently it is completely unrelated. It should have been special_small_blocks... and that is a ZFS dataset property (not a zpool feature). I edited the bug report and removed all the unrelated information regarding the embedded_data property. That false information may have been the main reason why the issue was initially closed by kpande.
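One way to confirm the change takes effect (a sketch; note that special_small_blocks only affects blocks written after the property is set, so the files need to be recreated):

```shell
# Route data blocks <= 1K to the special vdev in addition to metadata.
zfs set special_small_blocks=1k zfstestpool

# Recreate the files, drop caches, re-run the rsync test, and watch
# the per-vdev read counters: xvde should now serve the traffic.
zpool iostat -v zfstestpool 5
```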
I found that the weirdness I was seeing, and the need to set special_small_blocks=1k to get the "expected behaviour", must be related to SELinux.
With zdb I found that there was an SELinux attribute on every file, and that apparently does not count as metadata.
Not sure if this is a bug, but after disabling SELinux (which is active by default on CentOS) the behaviour is exactly as expected.
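To see whether SELinux labels are present on the files at all, the xattr can be listed directly (a sketch; the path is illustrative, and getfattr comes from the attr package):

```shell
# Show the SELinux label stored as an extended attribute on a file.
getfattr -n security.selinux /zfstestpool/filename.1

# Current SELinux mode on CentOS/RHEL (Enforcing/Permissive/Disabled).
getenforce
```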
@devZer0 thanks for following up. Alternatively, setting the dnodesize=auto and xattr=sa properties on the dataset will allow the SELinux xattr to be stored in the dnode itself, which is treated as metadata. The attached spill block is considered to be data.
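A sketch of that alternative (dataset name from this report; like special_small_blocks, these properties only affect files created after they are set):

```shell
# Store xattrs as system attributes in the dnode instead of in a
# separate xattr directory, so SELinux labels travel with the metadata.
zfs set xattr=sa zfstestpool
# Let ZFS pick a larger dnode size so the SA xattrs fit inline
# (requires the large_dnode pool feature).
zfs set dnodesize=auto zfstestpool
```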
yes.
But I think it's a little unfortunate that, with just the defaults, adding a special device for metadata does not give the expected performance benefit.
I think it's really hard to guess that SELinux (which is on by default on CentOS/RHEL) plays such an important role here and throttles things, because SELinux xattrs are NOT metadata.
I guess most admins, or even experienced people, will be surprised that SELinux xattrs do NOT go to the special metadata device.
I think there should be at least a note in the docs about this; otherwise the performance benefit may be thrown away and nobody will notice (until they look closely and research why the behaviour is weird).
I spent most of yesterday evening getting a clear picture of what's happening.
I'm largely inclined to agree. With a default of xattr=on, I think there's an expectation that ls -l will be served entirely by metadata, but reading the SELinux attr will ruin that.
Let's get @don-brady's thoughts, but there's a solid case to be made that spill blocks should be considered metadata. When in use they're virtually always storing xattrs, which are metadata, and when storing SELinux xattrs in particular we want it to be fast. That small change would better align the behavior with users' expectations.
So... can we re-open the issue then?
I have cleaned up the bug report and removed the misleading/wrong information.
This issue has been resolved. PR #8361 resolved the issue with small blocks being misplaced, and I manually verified the existing code _does_ consider spill blocks to be metadata. Setting xattr=sa will result in all xattrs being stored on the special devices.