ZFS: Metadata Allocation Classes not behaving as expected

Created on 15 Nov 2018 · 10 comments · Source: openzfs/zfs

Distribution Name | CentOS
Distribution Version | CentOS Linux release 7.5.1804 (Core)
Linux Kernel | 3.10.0-862.14.4.el7.x86_64
Architecture | x86_64
ZFS Version | 0.8.0-rc2
SPL Version | 0.8.0-rc2

Describe the problem you're observing

See http://list.zfsonlinux.org/pipermail/zfs-discuss/2018-November/032655.html for why I wanted to give allocation classes a try.

I would like to speed up metadata access by splitting metadata off from data onto a dedicated vdev (pushing metadata to L2ARC is not an option).

For testing I'm using a virtual machine with two virtual block devices, xvdb and xvde. Later on I would like to add an SSD as a special vdev to the existing pool on our backup server to speed up metadata access.

What I'm observing does not meet my expectations - I don't see any benefit at all when experimenting with metadata access.

On the pool I create a lot of empty files (simply touch filename.$index). After invalidating the cache with "echo 3 > /proc/sys/vm/drop_caches" and an export/re-import of the pool, I read the files' metadata with "rsync -av --dry-run /zfstestpool /tmp" (which simply does a recursive lstat() on every file).
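For completeness, the read part of the test as I run it (just the commands already quoted above, assembled in order):

echo 3 > /proc/sys/vm/drop_caches       # invalidate the page cache
zpool export zfstestpool                # export/re-import to drop cached pool state
zpool import zfstestpool
rsync -av --dry-run /zfstestpool /tmp   # recursive lstat() of every file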

From what I can see, most of the reads are being served by the pool's regular device, not by the special one.

What I would expect is a lot more read I/O and a lot more read bandwidth being served by the special device.

I cannot explain this behaviour. I discussed it with a user on IRC and was recommended to report a bug.

Describe how to reproduce the problem

Create a pool with a special device for metadata, create lots of empty files, and run a metadata-centric workload:

zpool create  zfstestpool /dev/xvdb special /dev/xvde

<create lots of empty files>
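For example (the file count and naming here are arbitrary, just to illustrate the touch filename.$index idea from above):

for index in $(seq 1 100000); do
    touch /zfstestpool/filename.$index   # empty file, metadata only
done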

[root@rolandtest /]# zpool list -v
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zfstestpool  2,03T   192M  2,03T        -         -     0%     0%  1.00x  ONLINE  -
  xvdb       1,94T   110M  1,94T        -         -     0%  0,00%
special          -      -      -         -      -      -
  xvde       99,5G  82,0M  99,4G        -         -     0%  0,08%

[root@rolandtest /]# zpool get all
NAME         PROPERTY                       VALUE                          SOURCE
zfstestpool  size                           2,03T                          -
zfstestpool  capacity                       0%                             -
zfstestpool  altroot                        -                              default
zfstestpool  health                         ONLINE                         -
zfstestpool  guid                           6837366004087058845            -
zfstestpool  version                        -                              default
zfstestpool  bootfs                         -                              default
zfstestpool  delegation                     on                             default
zfstestpool  autoreplace                    off                            default
zfstestpool  cachefile                      -                              default
zfstestpool  failmode                       wait                           default
zfstestpool  listsnapshots                  off                            default
zfstestpool  autoexpand                     off                            default
zfstestpool  dedupditto                     0                              default
zfstestpool  dedupratio                     1.00x                          -
zfstestpool  free                           2,03T                          -
zfstestpool  allocated                      192M                           -
zfstestpool  readonly                       off                            -
zfstestpool  ashift                         0                              default
zfstestpool  comment                        -                              default
zfstestpool  expandsize                     -                              -
zfstestpool  freeing                        0                              -
zfstestpool  fragmentation                  0%                             -
zfstestpool  leaked                         0                              -
zfstestpool  multihost                      off                            default
zfstestpool  checkpoint                     -                              -
zfstestpool  load_guid                      10393073353253220537           -
zfstestpool  feature@async_destroy          enabled                        local
zfstestpool  feature@empty_bpobj            enabled                        local
zfstestpool  feature@lz4_compress           active                         local
zfstestpool  feature@multi_vdev_crash_dump  enabled                        local
zfstestpool  feature@spacemap_histogram     active                         local
zfstestpool  feature@enabled_txg            active                         local
zfstestpool  feature@hole_birth             active                         local
zfstestpool  feature@extensible_dataset     active                         local
zfstestpool  feature@embedded_data          active                         local
zfstestpool  feature@bookmarks              enabled                        local
zfstestpool  feature@filesystem_limits      enabled                        local
zfstestpool  feature@large_blocks           enabled                        local
zfstestpool  feature@large_dnode            enabled                        local
zfstestpool  feature@sha512                 enabled                        local
zfstestpool  feature@skein                  enabled                        local
zfstestpool  feature@edonr                  enabled                        local
zfstestpool  feature@userobj_accounting     active                         local
zfstestpool  feature@encryption             enabled                        local
zfstestpool  feature@project_quota          active                         local
zfstestpool  feature@device_removal         enabled                        local
zfstestpool  feature@obsolete_counts        enabled                        local
zfstestpool  feature@zpool_checkpoint       enabled                        local
zfstestpool  feature@spacemap_v2            active                         local
zfstestpool  feature@allocation_classes     active                         local
zfstestpool  feature@resilver_defer         enabled                        local

<do rsync -av --dry-run /zfstestpool /tmp>

               capacity     operations     bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
zfstestpool   191M  2,03T  3,56K      0  7,33M      0
  xvdb        110M  1,94T  3,34K      0  1,67M      0
special          -      -      -      -      -      -
  xvde       81,6M  99,4G    230      0  5,66M      0
-----------  -----  -----  -----  -----  -----  -----
               capacity     operations     bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
zfstestpool   192M  2,03T  3,78K     63  3,81M   458K
  xvdb        110M  1,94T  3,54K      3  1,77M  4,00K
special          -      -      -      -      -      -
  xvde       81,9M  99,4G    240     59  2,04M   454K
-----------  -----  -----  -----  -----  -----  -----
               capacity     operations     bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
zfstestpool   192M  2,03T  3,62K      0  1,96M      0
  xvdb        110M  1,94T  3,60K      0  1,80M      0
special          -      -      -      -      -      -
  xvde       81,9M  99,4G     25      0   161K      0
-----------  -----  -----  -----  -----  -----  -----
               capacity     operations     bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
zfstestpool   192M  2,03T  3,53K      0  7,03M      0
  xvdb        110M  1,94T  3,28K      0  1,64M      0
special          -      -      -      -      -      -
  xvde       81,9M  99,4G    253      0  5,39M      0
-----------  -----  -----  -----  -----  -----  -----
               capacity     operations     bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
zfstestpool   192M  2,03T  3,78K      0  4,03M      0
  xvdb        110M  1,94T  3,55K      0  1,78M      0
special          -      -      -      -      -      -
  xvde       81,9M  99,4G    233      0  2,26M      0
-----------  -----  -----  -----  -----  -----  -----
               capacity     operations     bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
zfstestpool   192M  2,03T  3,62K      0  1,95M      0
  xvdb        110M  1,94T  3,59K      0  1,80M      0
special          -      -      -      -      -      -
  xvde       81,9M  99,4G     25      0   162K      0
-----------  -----  -----  -----  -----  -----  -----
               capacity     operations     bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
zfstestpool   192M  2,03T  3,58K     70  9,36M   367K
  xvdb        110M  1,94T  3,20K      3  1,60M  4,00K
special          -      -      -      -      -      -
  xvde       82,0M  99,4G    387     66  7,76M   363K
-----------  -----  -----  -----  -----  -----  -----
               capacity     operations     bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
zfstestpool   192M  2,03T  3,46K      0  1,94M      0
  xvdb        110M  1,94T  3,40K      0  1,70M      0
special          -      -      -      -      -      -
  xvde       82,0M  99,4G     62      0   244K      0
-----------  -----  -----  -----  -----  -----  -----

Performance

All 10 comments

As discussed on IRC, "zfs set special_small_blocks=1k zfstestpool" was the solution to this problem.

Apparently the files' metadata is treated as "small blocks": after setting the property to 1k, all lstat() requests are served from the special metadata device xvde (as they should be).

Oh, sorry about the initial confusion regarding embedded_data - I only tested with it active/inactive because I was told to, but apparently that is completely unrelated. It should have been special_small_blocks, which is a ZFS dataset property (and not a zpool feature). I edited the bug report and removed all the unrelated information regarding the embedded_data property. That false information may have been the main reason the issue was initially closed by kpande.
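To make that concrete (it is set here on the pool's root dataset, as in the command above):

zfs get special_small_blocks zfstestpool      # dataset property, defaults to 0
zfs set special_small_blocks=1k zfstestpool   # blocks of 1k or smaller go to the special vdev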

I found that the weirdness I was seeing, and the need to set special_small_blocks=1k to get the "expected behaviour", must be related to SELinux.

With zdb I found that there is an SELinux attribute on every file, and that apparently does not count as metadata.
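(The same attribute can also be seen from userspace without zdb; the path is just an example file from the test set:)

getfattr -n security.selinux /zfstestpool/filename.1   # prints the SELinux xattr value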

Not sure if this is a bug, but after disabling SELinux (which is enabled by default on CentOS) the behaviour is exactly as expected.

@devZer0 thanks for following up. Alternatively, setting the dnodesize=auto and xattr=sa properties on the dataset will allow the SELinux xattr to be stored in the dnode itself, which is treated as metadata. The attached spill block is considered to be data.
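A minimal sketch of those settings (applied here to the pool's root dataset; they only affect files created after the change):

zfs set dnodesize=auto zfstestpool   # allow larger dnodes so SA xattrs fit in the dnode itself
zfs set xattr=sa zfstestpool         # store xattrs as system attributes rather than directory-based xattrs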

Yes.

But I think it's a little unfortunate that, with just the defaults, adding a special device for metadata does not give the expected performance benefit.

I think it's really hard to guess that SELinux (which is enabled by default on CentOS/RHEL) plays such an important role here and is throttling things, because SELinux xattrs are NOT metadata.

I guess most admins, or even experienced users, will be surprised that SELinux xattrs will NOT go to the special metadata device.

I think there should be at least a note in the docs about this; otherwise the performance benefit may be thrown away and nobody will notice (until they look closely and research why the behaviour is weird).

I spent most of yesterday evening getting a clear picture of what's happening.

I'm largely inclined to agree. With a default of xattr=on I think there's an expectation that ls -l will be served entirely from metadata, but reading the SELinux attrs will ruin that.

Let's get @don-brady's thoughts, but there's a solid case to be made that spill blocks should be considered metadata. When in use they're virtually always storing xattrs, which are metadata, and when storing SELinux xattrs in particular we want it to be fast. That small change would better align the behavior with users' expectations.

So... can we re-open the issue then?

I have cleaned up the bug report and removed the misleading/wrong information.

This issue has been resolved. PR #8361 fixed the issue with small blocks being misplaced, and I manually verified the existing code _does_ consider spill blocks to be metadata. Setting xattr=sa will result in all xattrs being stored on the special devices.
