Zfs: Read performance in 0.7.0

Created on 3 Aug 2017 · 37 comments · Source: openzfs/zfs

System information

Type | Version/Name
--- | ---
Distribution Name | CentOS
Distribution Version | 7.3
Linux Kernel | 3.10.0-514.26.2.el7
Architecture | x86_64
ZFS Version | 0.7.0-1
SPL Version | 0.7.0-1

Describe the problem you're observing

Read performance seems much slower in 0.7.0 compared to 0.6.5.11

Describe how to reproduce the problem

I have a RAIDz2 array of 10 disks, created with the following commands:

zpool create -f tank raidz2 disk{0..9}
zfs set recordsize=1M tank
zfs set atime=off tank
zfs set xattr=sa tank

zfs.conf has some minor changes:

options zfs zfs_vdev_async_write_max_active=15
options zfs zfs_vdev_async_write_min_active=5
options zfs zfs_dirty_data_max=16863235072
options zfs zfs_dirty_data_max_max=33726470144

Testing is done using iozone:

iozone -t 10 -s 24G -r 1024k -i 0 -i 1 -e -+n -+m /root/clientlist.tank

Results in 0.6.5.11:

Children see throughput for 10 initial writers  = 1167944.82 kB/sec
Children see throughput for 10 readers          = 1195458.09 kB/sec

Results in 0.7.0:

Children see throughput for 10 initial writers  =  942196.48 kB/sec
Children see throughput for 10 readers          =  711960.76 kB/sec

Are there any parameters to adjust in order to get performance similar to the old version?

Performance

All 37 comments

Just tried the latest master branch and I'm getting the same results.

@quentinleburel A wild guess: could you repeat the test with zfs module option zfs_abd_scatter_enabled=0? This is by no means the recommended option, but it might tell us where to look.
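
For reference, toggling it should be as simple as this (a sketch; if the parameter is not writable at runtime on your build, set it via /etc/modprobe.d/zfs.conf and reload the zfs module instead):

echo 0 > /sys/module/zfs/parameters/zfs_abd_scatter_enabled   # force linear ABD buffers for this test
cat /sys/module/zfs/parameters/zfs_abd_scatter_enabled        # verify the value took effect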

@ironMann zfs_abd_scatter_enabled=0 didn't help. We see read performance regressions with 0.7.x as well. The new prefetcher seems to behave poorly. Setting zfetch_max_distance larger helps a bit, but we're not getting close to what we had with the old prefetcher.

@efocht Can you characterize your access pattern (continuous, strided, random, small files, ...) ? Also, can you paste /proc/spl/kstat/zfs/zfetchstats before and after the test you do? Thanks

@ironMann The read test is sequential reads of huge files. Example: 2 zpools (raidz2), each takes 32 file reads, so 64 sequential streams in total.

zfetchstats are not really useful, it seems. Results with (more or less) default settings:

echo 8 > /sys/module/zfs/parameters/zfetch_max_streams
echo 8388608 > /sys/module/zfs/parameters/zfetch_max_distance
echo 64 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active    # default is 3
echo 1 > /sys/module/zfs/parameters/zfs_vdev_async_read_min_active
echo 2 > /sys/module/zfs/parameters/zfetch_min_sec_reap

              capacity     operations     bandwidth    syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read
pool        alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
ost0         525G  75,5T  11,0K     13   942M  55,9K      0      0      0      0     12    226      0      0      0      0
ost1         525G  75,5T  10,8K     13   920M  55,9K      0      0      0      0     21    139      0      0      0      0

...
Bandwidth: ~ 1800MB/s


# cat /tmp/zfetchstats_pre
10 1 0x01 3 144 607764375143899 694685892472550
name                            type data
hits                            4    22906538
misses                          4    80370649
max_streams                     4    80198946

# cat /tmp/zfetchstats_post
10 1 0x01 3 144 607764375143899 695035295194589
name                            type data
hits                            4    23365102
misses                          4    81976505
max_streams                     4    81772091

Delta hits        :  458564
Delta misses      : 1605856
Delta max_streams : 1573145
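
(For reference, deltas like these can be computed by diffing the two kstat snapshots; a convenience sketch, assuming the pre/post files shown above:)

awk 'NR==FNR { if ($2 == 4) pre[$1] = $3; next }
     $2 == 4 { printf "Delta %-12s: %d\n", $1, $3 - pre[$1] }' \
    /tmp/zfetchstats_pre /tmp/zfetchstats_post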

A single zfetch-related change improves the performance significantly, but zfetchstats doesn't show why:

echo $((512*1024*1024)) > /sys/module/zfs/parameters/zfetch_max_distance

[zpool iostat ...]
              capacity     operations     bandwidth    syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read
pool        alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
ost0         525G  75,5T  17,8K     13  1,49G  55,9K      0      0      0      0  13,3K    447      0      0      0      0
ost1         525G  75,5T  13,1K     14  1,09G  57,9K      0      0      0      0  7,60K    447      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

Bandwidth: ~ 2600MB/s

oposs1:queue# cat /tmp/zfetchstats_pre
10 1 0x01 3 144 607764375143899 695226841132331
name                            type data
hits                            4    23365104
misses                          4    81976617
max_streams                     4    81772201
oposs1:queue# cat /tmp/zfetchstats_post
10 1 0x01 3 144 607764375143899 695495991551268
name                            type data
hits                            4    23823667
misses                          4    83582360
max_streams                     4    83366269

Delta hits        :  458563
Delta misses      : 1605743
Delta max_streams : 1594068

As said, the performance of 0.6.5.X with the old dmu_zfetch was around 50% higher.

Interestingly, there is a huge pile of pending async_read requests in the good case.

@efocht The prefetcher code underwent a large rewrite for 0.7.0, but your use case should be handled well by the new code. No changes in zfetchstats and a large number of misses indicate that the prefetcher is not recognizing the access pattern...

I could write a patch to instrument the prefetcher to expose more metrics into zfetchstats to hopefully get the reason why there are so many misses. Would you be able to apply it on top of 0.7.1 (or master) and rerun your test?

Yes, I'm eager to test, please send me the patch!

Great, the patch is here. Alternatively, you can pull my branch based on the 0.7.1 tag. There should be plenty of information in the kstats now, but the behavior should not change.

OK, so ... bad case:

echo 8 > /sys/module/zfs/parameters/zfetch_max_streams
echo 8388608 > /sys/module/zfs/parameters/zfetch_max_distance
echo 64 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active    # default is 3
echo 1 > /sys/module/zfs/parameters/zfs_vdev_async_read_min_active
echo 2 > /sys/module/zfs/parameters/zfetch_min_sec_reap

Read bandwidth: ~1650MB/s
2 zpools, 32 files of 7GB each, sequential read.

# cat /tmp/zfetchstats_pre_0                        | # cat /tmp/zfetchstats_post_0
-------------------------------------------------------------------------------------------------------
10 1 0x01 72 3456 732457014599296 803496118397564   | 10 1 0x01 72 3456 732457014599296 803817153168806     
name                            type data           |   type data             
hits                            4    1021           |   4    459585           
hits_data                       4    885            |   4    459449           
hits_meta                       4    136            |   4    136              
misses                          4    38789          |   4    1644696          
misses_lock                     4    29             |   4    29               
reaps                           4    830            |   4    35559            
max_streams                     4    37902          |   4    1608568          
max_streams_small               4    0              |   4    267              
prefetch_bytes_1                4    0              |   4    0                
prefetch_bytes_2                4    0              |   4    0                
prefetch_bytes_4                4    0              |   4    0                
prefetch_bytes_8                4    0              |   4    0                
prefetch_bytes_16               4    0              |   4    0                
prefetch_bytes_32               4    0              |   4    0                
prefetch_bytes_64               4    0              |   4    0                
prefetch_bytes_128              4    0              |   4    0                
prefetch_bytes_256              4    0              |   4    0                
prefetch_bytes_512              4    6              |   4    6                
prefetch_bytes_1024             4    2              |   4    2                
prefetch_bytes_2048             4    0              |   4    0                
prefetch_bytes_4096             4    6              |   4    6                
prefetch_bytes_8192             4    12             |   4    12               
prefetch_bytes_16384            4    805            |   4    809              
prefetch_bytes_32768            4    13             |   4    13               
prefetch_bytes_65536            4    5              |   4    5                
prefetch_bytes_131072           4    4              |   4    4                
prefetch_bytes_262144           4    4              |   4    4                
prefetch_bytes_524288           4    4              |   4    4                
prefetch_bytes_1048576          4    4              |   4    432628           
prefetch_bytes_2097152          4    4              |   4    17388            
prefetch_bytes_4194304          4    4              |   4    8556             
prefetch_bytes_8388608          4    4              |   4    4                
prefetch_bytes_16777216         4    4              |   4    4                
prefetch_bytes_33554432         4    4              |   4    4                
prefetch_bytes_67108864         4    0              |   4    0                
prefetch_bytes_134217728        4    0              |   4    0                
prefetch_bytes_268435456        4    0              |   4    0                
prefetch_bytes_536870912        4    0              |   4    0                
prefetch_bytes_1073741824       4    0              |   4    0                
prefetch_bytes_2147483648       4    0              |   4    0                
prefetch_miss_bytes_1           4    0              |   4    0                
prefetch_miss_bytes_2           4    0              |   4    0                
prefetch_miss_bytes_4           4    0              |   4    0                
prefetch_miss_bytes_8           4    0              |   4    0                
prefetch_miss_bytes_16          4    0              |   4    0                
prefetch_miss_bytes_32          4    0              |   4    0                
prefetch_miss_bytes_64          4    0              |   4    0                
prefetch_miss_bytes_128         4    0              |   4    0                
prefetch_miss_bytes_256         4    0              |   4    0                
prefetch_miss_bytes_512         4    356            |   4    356              
prefetch_miss_bytes_1024        4    12             |   4    12               
prefetch_miss_bytes_2048        4    0              |   4    0                
prefetch_miss_bytes_4096        4    230            |   4    230              
prefetch_miss_bytes_8192        4    32             |   4    32               
prefetch_miss_bytes_16384       4    24307          |   4    24481            
prefetch_miss_bytes_32768       4    12252          |   4    12300            
prefetch_miss_bytes_65536       4    923            |   4    930              
prefetch_miss_bytes_131072      4    391            |   4    409              
prefetch_miss_bytes_262144      4    148            |   4    165              
prefetch_miss_bytes_524288      4    87             |   4    98               
prefetch_miss_bytes_1048576     4    17             |   4    1978630          
prefetch_miss_bytes_2097152     4    6              |   4    244813           
prefetch_miss_bytes_4194304     4    0              |   4    487136           
prefetch_miss_bytes_8388608     4    0              |   4    957141           
prefetch_miss_bytes_16777216    4    0              |   4    1846795          
prefetch_miss_bytes_33554432    4    0              |   4    3370144          
prefetch_miss_bytes_67108864    4    0              |   4    3031567          
prefetch_miss_bytes_134217728   4    0              |   4    515332           
prefetch_miss_bytes_268435456   4    0              |   4    231593           
prefetch_miss_bytes_536870912   4    0              |   4    82403            
prefetch_miss_bytes_1073741824  4    0              |   4    14629            
prefetch_miss_bytes_2147483648  4    0              |   4    0                             

Somewhat better case (but still 50% off the old dmu_zfetch from 0.6.5.X):

echo $((512*1024*1024)) > /sys/module/zfs/parameters/zfetch_max_distance

Read bandwidth: ~2750MB/s
2 zpools, 32 files per zpool, 7GB each file, linear read.

      before                                                 after
--------------------------------------------------------------------------------------------------------------               
10 1 0x01 72 3456 732457014599296 804657251029842 | 10 1 0x01 72 3456 732457014599296 805109494983485      
name                            type data         |   type data              
hits                            4    459595       |   4    918159            
hits_data                       4    459459       |   4    918023            
hits_meta                       4    136          |   4    136               
misses                          4    1645066      |   4    3250846           
misses_lock                     4    29           |   4    29                
reaps                           4    35569        |   4    53892             
max_streams                     4    1608928      |   4    3196385           
max_streams_small               4    627          |   4    771               
prefetch_bytes_1                4    0            |   4    0                 
prefetch_bytes_2                4    0            |   4    0                 
prefetch_bytes_4                4    0            |   4    0                 
prefetch_bytes_8                4    0            |   4    0                 
prefetch_bytes_16               4    0            |   4    0                 
prefetch_bytes_32               4    0            |   4    0                 
prefetch_bytes_64               4    0            |   4    0                 
prefetch_bytes_128              4    0            |   4    0                 
prefetch_bytes_256              4    0            |   4    0                 
prefetch_bytes_512              4    6            |   4    6                 
prefetch_bytes_1024             4    2            |   4    2                 
prefetch_bytes_2048             4    0            |   4    0                 
prefetch_bytes_4096             4    6            |   4    6                 
prefetch_bytes_8192             4    12           |   4    12                
prefetch_bytes_16384            4    819          |   4    823               
prefetch_bytes_32768            4    13           |   4    13                
prefetch_bytes_65536            4    5            |   4    5                 
prefetch_bytes_131072           4    4            |   4    4                 
prefetch_bytes_262144           4    4            |   4    4                 
prefetch_bytes_524288           4    4            |   4    4                 
prefetch_bytes_1048576          4    432628       |   4    866082            
prefetch_bytes_2097152          4    17388        |   4    23242             
prefetch_bytes_4194304          4    8556         |   4    11420             
prefetch_bytes_8388608          4    4            |   4    2757              
prefetch_bytes_16777216         4    4            |   4    2757              
prefetch_bytes_33554432         4    4            |   4    2747              
prefetch_bytes_67108864         4    0            |   4    2743              
prefetch_bytes_134217728        4    0            |   4    2698              
prefetch_bytes_268435456        4    0            |   4    2698              
prefetch_bytes_536870912        4    0            |   4    0                 
prefetch_bytes_1073741824       4    0            |   4    0                 
prefetch_bytes_2147483648       4    0            |   4    0                 
prefetch_miss_bytes_1           4    0            |   4    0                 
prefetch_miss_bytes_2           4    0            |   4    0                 
prefetch_miss_bytes_4           4    0            |   4    0                 
prefetch_miss_bytes_8           4    0            |   4    0                 
prefetch_miss_bytes_16          4    0            |   4    0                 
prefetch_miss_bytes_32          4    0            |   4    0                 
prefetch_miss_bytes_64          4    0            |   4    0                 
prefetch_miss_bytes_128         4    0            |   4    0                 
prefetch_miss_bytes_256         4    0            |   4    0                 
prefetch_miss_bytes_512         4    356          |   4    356               
prefetch_miss_bytes_1024        4    12           |   4    12                
prefetch_miss_bytes_2048        4    0            |   4    0                 
prefetch_miss_bytes_4096        4    230          |   4    230               
prefetch_miss_bytes_8192        4    32           |   4    32                
prefetch_miss_bytes_16384       4    24741        |   4    24845             
prefetch_miss_bytes_32768       4    12410        |   4    12454             
prefetch_miss_bytes_65536       4    930          |   4    930               
prefetch_miss_bytes_131072      4    409          |   4    409               
prefetch_miss_bytes_262144      4    165          |   4    165               
prefetch_miss_bytes_524288      4    98           |   4    98                
prefetch_miss_bytes_1048576     4    1978630      |   4    3767680           
prefetch_miss_bytes_2097152     4    244813       |   4    369693            
prefetch_miss_bytes_4194304     4    487136       |   4    733511            
prefetch_miss_bytes_8388608     4    957141       |   4    1438839           
prefetch_miss_bytes_16777216    4    1846795      |   4    2773366           
prefetch_miss_bytes_33554432    4    3370144      |   4    5087267           
prefetch_miss_bytes_67108864    4    3031567      |   4    5872284           
prefetch_miss_bytes_134217728   4    515332       |   4    3628274           
prefetch_miss_bytes_268435456   4    231593       |   4    1596841           
prefetch_miss_bytes_536870912   4    82403        |   4    255748            
prefetch_miss_bytes_1073741824  4    14629        |   4    27453             
prefetch_miss_bytes_2147483648  4    0            |   4    0                

@efocht Thanks for that. Are you using a larger recordsize?
I've updated my branch with a quick fix: https://github.com/ironMann/zfs/commit/c35b14c6da31a0e0119fcf4ca5a17d90d14af5fc
In my ad-hoc testing, the number of hits is doubled. If this does not help, we have to investigate your access pattern in more detail.

Thanks! I'll test ASAP.

Yes, I use recordsize=1MB. Reads are driven by a parallel file system such as Lustre or BeeGFS, so they are issued by worker threads. The nominally sequential access can therefore be slightly out of order due to the scheduling of these threads in the parallel file system above.

Unfortunately, performance is worse with ironMann/zfs@c35b14c. The number of hits is at the same level, and increasing zfetch_max_distance to 512MB even reduces performance (~1250MB/s). With a 64MB zfetch_max_distance, bandwidth is a little better (~2200MB/s) than with the 8MB default (~1750MB/s), but far from what we had with 512MB before (2750MB/s).

@efocht I figured as much; this will not have an easy fix. Increasing zfetch_max_distance has the consequence of re-prefetching the same blocks (hence the async-read pileup you noticed). The best bet with the current prefetcher is to figure out how many PFS worker threads you have and how the file access is split between them. Then try to set zfetch_max_streams just above the number of threads, and keep zfetch_max_distance at a more conservative value (8-32 MiB), for example:
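
(The numbers below are only an illustration, assuming roughly 32 concurrent read streams per pool; substitute your actual stream count.)

echo 40 > /sys/module/zfs/parameters/zfetch_max_streams                    # a bit above the per-pool stream count
echo $((16*1024*1024)) > /sys/module/zfs/parameters/zfetch_max_distance   # stays in the 8-32 MiB range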

I'll try to figure out what can be done about the redundant prefetches...

@efocht I've pushed a new initial prefetcher implementation to https://github.com/ironMann/zfs/tree/prefetch_dmu
The new prefetcher is based on a range map and should not issue any overlapping I/Os. It should pick up linear ascending and descending access on larger files (also when multiple threads are used, as in your case). It cannot (yet) detect different access streams inside a single file, but that is not your use case. I would appreciate it if you could give it a try and report kstats as usual. Thanks!

Thanks! Here are some results with the new prefetcher. The number of hits increased significantly, and reaps are rather few. The PFS on top was doing 512kB accesses; maybe that's why we see ~900k hits vs. ~450k 1MB prefetches.

/sys/module/zfs/parameters/zfetch_max_distance:8388608
/sys/module/zfs/parameters/zfetch_min_sec_reap:60
/sys/module/zfs/parameters/zfs_vdev_async_read_max_active:16
/sys/module/zfs/parameters/zfs_vdev_async_read_min_active:16

read bandwidth (32 files per zpool, file size = 7GB, sequential read):
~1650MB/s

10 1 0x01 39 1872 124593485225155 124741953996895 | 10 1 0x01 39 1872 124593485225155 125303772758506   
name                            type data         |  data           
hits                            4    6413         |  920942         
misses                          4    1206         |  4294           
hits_data                       4    5841         |  920370         
hits_meta                       4    572          |  572            
hits_forward                    4    14348        |  473356         
hits_backward                   4    20           |  20             
reaps                           4    0            |  2              
prefetch_bytes_1                4    0            |  0              
prefetch_bytes_2                4    0            |  0              
prefetch_bytes_4                4    0            |  0              
prefetch_bytes_8                4    0            |  0              
prefetch_bytes_16               4    0            |  0              
prefetch_bytes_32               4    0            |  0              
prefetch_bytes_64               4    0            |  0              
prefetch_bytes_128              4    0            |  0              
prefetch_bytes_256              4    0            |  0              
prefetch_bytes_512              4    4            |  4              
prefetch_bytes_1024             4    0            |  0              
prefetch_bytes_2048             4    0            |  0              
prefetch_bytes_4096             4    0            |  0              
prefetch_bytes_8192             4    0            |  0              
prefetch_bytes_16384            4    8            |  8              
prefetch_bytes_32768            4    6            |  6              
prefetch_bytes_65536            4    0            |  0              
prefetch_bytes_131072           4    2            |  2              
prefetch_bytes_262144           4    0            |  0              
prefetch_bytes_524288           4    0            |  0              
prefetch_bytes_1048576          4    0            |  457024         
prefetch_bytes_2097152          4    0            |  192            
prefetch_bytes_4194304          4    0            |  192            
prefetch_bytes_8388608          4    10           |  74             
prefetch_bytes_16777216         4    0            |  0              
prefetch_bytes_33554432         4    0            |  0              
prefetch_bytes_67108864         4    0            |  0              
prefetch_bytes_134217728        4    0            |  0              
prefetch_bytes_268435456        4    0            |  0              
prefetch_bytes_536870912        4    0            |  0              
prefetch_bytes_1073741824       4    0            |  0              
prefetch_bytes_2147483648       4    0            |  0              

Increasing zfetch_max_distance to 64MB helps. I switched the PFS back to 1MB transactions.

echo 96 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active
echo $((64*1024*1024)) > /sys/module/zfs/parameters/zfetch_max_distance

read bandwidth (32 files per zpool, file size = 7GB, sequential read):
~2400MB/s

10 1 0x01 39 1872 124593485225155 127887739449554 | 10 1 0x01 39 1872 124593485225155 128140379826819     
name                            type data         |  data             
hits                            4    1379884      |  1827921          
misses                          4    5669         |  16485            
hits_data                       4    1379135      |  1827172          
hits_meta                       4    749          |  749              
hits_forward                    4    934536       |  1397128          
hits_backward                   4    25           |  25               
reaps                           4    91           |  155              
prefetch_bytes_1                4    0            |  0                
prefetch_bytes_2                4    0            |  0                
prefetch_bytes_4                4    0            |  0                
prefetch_bytes_8                4    0            |  0                
prefetch_bytes_16               4    0            |  0                
prefetch_bytes_32               4    0            |  0                
prefetch_bytes_64               4    0            |  0                
prefetch_bytes_128              4    0            |  0                
prefetch_bytes_256              4    0            |  0                
prefetch_bytes_512              4    12           |  12               
prefetch_bytes_1024             4    0            |  0                
prefetch_bytes_2048             4    0            |  0                
prefetch_bytes_4096             4    83           |  83               
prefetch_bytes_8192             4    1            |  1                
prefetch_bytes_16384            4    9            |  9                
prefetch_bytes_32768            4    7            |  7                
prefetch_bytes_65536            4    0            |  0                
prefetch_bytes_131072           4    2            |  2                
prefetch_bytes_262144           4    0            |  0                
prefetch_bytes_524288           4    0            |  0                
prefetch_bytes_1048576          4    914020       |  1360292          
prefetch_bytes_2097152          4    384          |  512              
prefetch_bytes_4194304          4    384          |  576              
prefetch_bytes_8388608          4    139          |  331              
prefetch_bytes_16777216         4    0            |  128              
prefetch_bytes_33554432         4    28           |  156              
prefetch_bytes_67108864         4    0            |  64               
prefetch_bytes_134217728        4    0            |  0                
...

Thanks, I should have access to a Lustre node soon, so I'll be able to continue development and tuning there. These numbers are very reassuring, even though this is a very basic implementation. In this case, a hit just means that the prefetcher decided to act on the file access it saw. A miss is a signal that the access pattern was not recognized (or was deemed random) and that no blocks will be prefetched. Also, I think it would be best if you matched the PFS transaction size to the record size.
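
On the ZFS side the relevant check/setting would be something like this (the pool name is only an example, and recordsize only affects files written after it is set):

zfs get recordsize ost0      # confirm the dataset recordsize (1M in this setup)
zfs set recordsize=1M ost0   # applies to newly written files only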

I've sent off the new initial prefetcher as a PR: https://github.com/zfsonlinux/zfs/issues/6453#issuecomment-323693969

If it passes, I can do an Android ROM (AOSP or LineageOS) build test run to see how the build time differs before vs. after, if that is useful.

Thank you

@kernelOfTruth Thanks for the effort you're investing in this. @efocht had some success with the code, but the patch is in a very rough initial stage. I've just pushed some work from today to enable more sensible prefetching when a stream is first identified. If you go through with the test, I would be very interested in seeing zfetchstats and arcstats (a simple capture sketch follows below). I'm not sure how useful block prefetching is for small files, though. Good luck.
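
A simple way to capture the before/after snapshots around the build (a sketch, following the same convention used earlier in this thread):

cat /proc/spl/kstat/zfs/zfetchstats > /tmp/zfetchstats_pre
cat /proc/spl/kstat/zfs/arcstats    > /tmp/arcstats_pre
# ... run the build ...
cat /proc/spl/kstat/zfs/zfetchstats > /tmp/zfetchstats_post
cat /proc/spl/kstat/zfs/arcstats    > /tmp/arcstats_post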

Seems like there was quite the improvement:

previously known build time:

cache cold build [e.g. after boot] + l2arc:

~ 01:45:00 (hh:mm:ss)


build from yesterday with prior zfs_master (https://github.com/kernelOfTruth/zfs/commits/zfs_master_09.08.2017_0.7.0%2B)

no stats, repo sync (minor sync up - so the files were crawled already after reboot) + l2arc:

make completed successfully (01:30:24 (hh:mm:ss))


today's build - cache cold, + l2arc

make completed successfully (01:33:39 (hh:mm:ss))

cat /proc/spl/kstat/zfs/arcstats 
12 1 0x01 93 25296 57439232799 9010454558720
name                            type data
hits                            4    1557138
misses                          4    1082407
demand_data_hits                4    849517
demand_data_misses              4    251447
demand_metadata_hits            4    319475
demand_metadata_misses          4    787321
prefetch_data_hits              4    65412
prefetch_data_misses            4    10632
prefetch_metadata_hits          4    322734
prefetch_metadata_misses        4    33007
mru_hits                        4    710317
mru_ghost_hits                  4    11263
mfu_hits                        4    475849
mfu_ghost_hits                  4    8952
deleted                         4    305040
mutex_miss                      4    6
evict_skip                      4    80
evict_not_enough                4    0
evict_l2_cached                 4    28082894848
evict_l2_eligible               4    4732558848
evict_l2_ineligible             4    324257792
evict_l2_skip                   4    0
hash_elements                   4    823973
hash_elements_max               4    823977
hash_collisions                 4    214740
hash_chains                     4    70254
hash_chain_max                  4    4
p                               4    9104723856
c                               4    12803985656
c_min                           4    10737418240
c_max                           4    12884901888
size                            4    12769052840
compressed_size                 4    12311106048
uncompressed_size               4    24302977024
overhead_size                   4    182002688
hdr_size                        4    172241464
data_size                       4    12290973184
metadata_size                   4    202135552
dbuf_size                       4    13879120
dnode_size                      4    48231040
bonus_size                      4    12762240
anon_size                       4    700416
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    6806301184
mru_evictable_data              4    6648347648
mru_evictable_metadata          4    48835584
mru_ghost_size                  4    5257148928
mru_ghost_evictable_data        4    2922083328
mru_ghost_evictable_metadata    4    2335065600
mfu_size                        4    5686107136
mfu_evictable_data              4    5472883712
mfu_evictable_metadata          4    36806656
mfu_ghost_size                  4    6906084352
mfu_ghost_evictable_data        4    5879109120
mfu_ghost_evictable_metadata    4    1026975232
l2_hits                         4    14262
l2_misses                       4    1068107
l2_feeds                        4    9175
l2_rw_clash                     4    0
l2_read_bytes                   4    94854656
l2_write_bytes                  4    13519991296
l2_writes_sent                  4    5729
l2_writes_done                  4    5729
l2_writes_error                 4    0
l2_writes_lock_retry            4    5
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    2161
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    29310505472
l2_asize                        4    11720991232
l2_hdr_size                     4    28830240
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    30
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    2956
arc_meta_used                   4    478079656
arc_meta_limit                  4    3221225472
arc_dnode_limit                 4    322122547
arc_meta_max                    4    1787160352
arc_meta_min                    4    16777216
sync_wait_for_async             4    1389
demand_hit_predictive_prefetch  4    25071
arc_need_free                   4    0
arc_sys_free                    4    527127808
arc_raw_size                    4    0

What is especially salient is the value of prefetch_data_hits: that value was NEVER in the "positive" range before, meaning prefetch_data_misses was ALWAYS higher than prefetch_data_hits (at least from my previous observations during daily usage).
Now prefetch_data_hits even surpasses prefetch_data_misses by roughly 4x!

I'll keep an eye on it (prefetch_data_hits vs. prefetch_data_misses) and see how it behaves during everyday usage patterns of launching desktop searches, opening files, browsing, listening to music, etc.

@kernelOfTruth Thanks for sharing.
Let me try to clarify a few of the arcstats fields, and what I think could be a good metric for judging the efficiency of prefetchers:

  • prefetch_data_hits - prefetch block found in ARC. This is partly ok, but we want to keep this as low as possible to avoid useless lookups.
  • prefetch_data_misses - prefetch block not in ARC. These blocks are fetched in the async queue.
  • demand_hit_predictive_prefetch - This is the most important measure of prefetch effectiveness IMO. It counts how many times a demand (i.e. application-issued) I/O found the requested block already in the ARC thanks to the prefetcher. Each block is counted only once; it will not be counted multiple times.
  • sync_wait_for_async - Demand I/O is waiting for the async prefetch I/O to complete in order to access the block. Ideally, it should be low. Try slightly increasing the zfetch_stream_skip_blk, or reducing zfetch_max_distance.
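
A quick way to pull just these counters from a live system (a convenience sketch):

grep -E '^(prefetch_data_hits|prefetch_data_misses|demand_hit_predictive_prefetch|sync_wait_for_async) ' /proc/spl/kstat/zfs/arcstats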

Hope it helps.

I am also experiencing quite a slowdown with the 0.7 builds.
This is almost exclusively in read performance, which is approximately 60% of v0.6.5.

The zpool is built with 8x Micron 9100 SSDs in a stripe of 4 mirrors.
The zpool had the following options: ashift=13 xattr=sa acltype=posixacl

Here are the summary stats from iozone (with a file size of 16GB and 256k blocks) for the three versions I tried (I limited the RAM available on the server to 8GB for the tests).

iozone -i 0 -i 1 -t 1 -s16g -r256k -t 20 

Debian Stretch
zfs v0.6.5.9-5 (standard stretch package)

4x mirrors striped ashift=13 (1/20 thread, 256k blocks)
    - write     2.19G/sec   2.86G/sec
    - rewrite   2.33G/sec   2.77G/sec
    - read      2.53G/sec   6.26G/sec
    - reread    2.50G/sec   6.28G/sec

Stretch backports version 0.7.3-1~bpo9+1

4x mirrors striped ashift=13 (1/20 thread, 256k blocks)
    - write     2.35G/sec   2.89G/sec
    - rewrite   2.39G/sec   2.85G/sec
    - read      1.60G/sec   3.69G/sec
    - reread    1.68G/sec   3.69G/sec

Compiled version of ZFS 0.7.5-1

4x mirrors striped ashift=13 (1/20 thread, 256k blocks)
    - write     2.41G/sec   2.97G/sec
    - rewrite   2.18G/sec   2.85G/sec
    - read      1.76G/sec   3.78G/sec
    - reread    7.55G/sec   3.82G/sec

I also noticed that the 0.7 builds seem to use more CPU, but I haven't quantified that yet.

@zeigerpuppy It would be great if you could retest with the 0.7.5 tag + commit 0873bb6337452e3e028e40f5dad945b30deab185, which was merged to master. This small fix will be part of 0.7.6 and should help address the performance issues with the 0.7.x builds.
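
Something like this should produce that combination (a sketch; it assumes the release tag is named zfs-0.7.5):

git clone https://github.com/zfsonlinux/zfs.git && cd zfs
git checkout zfs-0.7.5                                     # assumed tag name for the 0.7.5 release
git cherry-pick 0873bb6337452e3e028e40f5dad945b30deab185   # the fix merged to master
./autogen.sh && ./configure && make deb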

Thanks @behlendorf
I have tried building the 0.7.5 tag + commit 0873bb6 but am having a bit of difficulty getting it to build. I have opened a new issue here: https://github.com/zfsonlinux/zfs/issues/7039

I don't think I'm doing the build properly; for some reason, when I compile SPL from the GitHub source, it builds 0.7.0-22 instead of 0.7.5-1. I think I may be messing up the build commands somehow. Is this right?:

git clone https://github.com/zfsonlinux/spl.git
cd spl
git reset --hard c9821f1;  # the latest commit
./autogen.sh 
./configure
make deb

It produces the following debs:

  • kmod-spl-4.9.0-4-amd64_0.7.0-22_amd64.deb
  • kmod-spl-devel_0.7.0-22_amd64.deb
  • kmod-spl-devel-4.9.0-4-amd64_0.7.0-22_amd64.deb
  • spl_0.7.0-22_amd64.deb
  • spl-dkms_0.7.0-22_all.deb

Then ZFS doesn't build with this SPL (which is not surprising).

I don't think I'm doing the build properly, for some reason when I compile SPL from the github source, it builds 0.7.0-22 instead of 0.7.5-1

The builds from source generate packages with 0.7.0 in the name, but they really contain the latest source you downloaded. You can check with dmesg: you will see the spl and zfs modules print the commit number when loading.
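
For example (the exact strings may vary between releases, so treat this as a sketch):

dmesg | grep -E 'SPL: Loaded|ZFS: Loaded'   # the modules print their version/commit when loading
cat /sys/module/spl/version                 # version of the currently loaded spl module
cat /sys/module/zfs/version                 # e.g. 0.7.0-228_g0873bb633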

I have tried building again; the SPL modules look fine, but I can't get make deb to build properly for ZFS commit 0873bb6.

I checked and it seems to be finding the correct SPL source (the one I built from the most recent commit)

checking spl source version... 0.7.0-22_gc9821f1

It builds the following files OK:

  • kmod-zfs-4.9.0-4-amd64_0.7.0-228_amd64.deb
  • kmod-zfs-devel_0.7.0-228_amd64.deb
  • kmod-zfs-devel-4.9.0-4-amd64_0.7.0-228_amd64.deb
  • zfs-0.7.0-228_g0873bb633.src.rpm
  • zfs-0.7.0.tar.gz
  • zfs-dkms_0.7.0-228_all.deb
  • zfs-dkms-0.7.0-228_g0873bb633.src.rpm
  • zfs-kmod-0.7.0-228_g0873bb633.src.rpm

But then it fails out when I think it's trying to build the zfs-0.7.0-228_g0873bb633.deb. I'd like to keep using debs to make it easier to manage, perhaps I could convert the zfs-0.7.0-228_g0873bb633.src.rpm to a deb?

Here's the tail of the build output:

make[8]: Leaving directory '/tmp/zfs-build-root-hKSPx4Dr/BUILD/zfs-0.7.0/tests/zfs-tests/tests/functional/largest_pool'
Making all in libzfs
make[8]: Entering directory '/tmp/zfs-build-root-hKSPx4Dr/BUILD/zfs-0.7.0/tests/zfs-tests/tests/functional/libzfs'
  CC       many_fds.o
  CCLD     many_fds
../../../../../lib/libzfs/.libs/libzfs.so: undefined reference to `lzc_reopen'
../../../../../lib/libzfs/.libs/libzfs.so: undefined reference to `lzc_load_key'
../../../../../lib/libzfs/.libs/libzfs.so: undefined reference to `lzc_unload_key'
../../../../../lib/libzfs/.libs/libzfs.so: undefined reference to `lzc_change_key'
collect2: error: ld returned 1 exit status
Makefile:637: recipe for target 'many_fds' failed
make[8]: *** [many_fds] Error 1
make[8]: Leaving directory '/tmp/zfs-build-root-hKSPx4Dr/BUILD/zfs-0.7.0/tests/zfs-tests/tests/functional/libzfs'
Makefile:622: recipe for target 'all-recursive' failed
make[7]: *** [all-recursive] Error 1
make[7]: Leaving directory '/tmp/zfs-build-root-hKSPx4Dr/BUILD/zfs-0.7.0/tests/zfs-tests/tests/functional'
Makefile:556: recipe for target 'all-recursive' failed
make[6]: *** [all-recursive] Error 1
make[6]: Leaving directory '/tmp/zfs-build-root-hKSPx4Dr/BUILD/zfs-0.7.0/tests/zfs-tests/tests'
Makefile:556: recipe for target 'all-recursive' failed
make[5]: *** [all-recursive] Error 1
make[5]: Leaving directory '/tmp/zfs-build-root-hKSPx4Dr/BUILD/zfs-0.7.0/tests/zfs-tests'
Makefile:556: recipe for target 'all-recursive' failed
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory '/tmp/zfs-build-root-hKSPx4Dr/BUILD/zfs-0.7.0/tests'
Makefile:728: recipe for target 'all-recursive' failed
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory '/tmp/zfs-build-root-hKSPx4Dr/BUILD/zfs-0.7.0'
Makefile:601: recipe for target 'all' failed
make[2]: *** [all] Error 2
make[2]: Leaving directory '/tmp/zfs-build-root-hKSPx4Dr/BUILD/zfs-0.7.0'
error: Bad exit status from /tmp/zfs-build-root-hKSPx4Dr/TMP/rpm-tmp.fqsSO5 (%build)


RPM build errors:
    Bad exit status from /tmp/zfs-build-root-hKSPx4Dr/TMP/rpm-tmp.fqsSO5 (%build)
Makefile:1213: recipe for target 'rpm-common' failed
make[1]: *** [rpm-common] Error 1
make[1]: Leaving directory '/tmp/build/zfs'
Makefile:1172: recipe for target 'rpm-utils' failed
make: *** [rpm-utils] Error 2

Further info: it looks like I'm hitting this bug in the deb build process: https://github.com/zfsonlinux/zfs/issues/6539, which is addressed in this PR: https://github.com/zfsonlinux/zfs/pull/6562
From initial reading, it looks like it can be worked around by specifying the library location at build time... will try

Update
Building from within a chroot environment works. The error is caused by the deb build pulling in old libraries. However, now I get stuck when installing the zfs-dkms_0.7.0-1_all.deb package, as it also finds the old SPL objects. Damn, so close...

I tried just symlinking /usr/src/spl-0.7.0 -> /tmp/build/spl (the new build dir), but it still throws the error on install:

dpkg -i zfs-dkms_0.7.0-1_all.deb
(Reading database ... 99846 files and directories currently installed.)
Preparing to unpack zfs-dkms_0.7.0-1_all.deb ...

Uninstall of zfs module (zfs-0.7.0-1) beginning:

------------------------------
Deleting module version: 0.7.0
completely from the DKMS tree.
------------------------------
Done.
Unpacking zfs-dkms (0.7.0-1) over (0.7.0-1) ...
Setting up zfs-dkms (0.7.0-1) ...
Loading new zfs-0.7.0 DKMS files...
Building for 4.9.0-4-amd64
Building initial module for 4.9.0-4-amd64
configure: error:
    *** Please make sure the kmod spl devel <kernel> package for your
    *** distribution is installed then try again.  If that fails you
    *** can specify the location of the spl objects with the
    *** '--with-spl-obj=PATH' option.  Failed to find spl_config.h in
    *** any of the following:
    /usr/src/spl-0.7.0/4.9.0-4-amd64
    /usr/src/spl-0.7.0
Error! Bad return status for module build on kernel: 4.9.0-4-amd64 (x86_64)
Consult /var/lib/dkms/zfs/0.7.0/build/make.log for more information.

Sorry I couldn't test the changes due to being unable to install the built Debian packages. I will happily test 0.7.6 when released.

Tested https://github.com/zfsonlinux/zfs/commit/0873bb6337452e3e028e40f5dad945b30deab185 on Debian with 0.7.4; the ARC hit rate is back to a normal 85+% (up from 15%) in desktop usage.

@zeigerpuppy You can hack around that with this; just keep in mind it's dirty (this is zsh, not bash, FYI):

cd /usr/src/spl-0.7.0 && for kver in /var/lib/dkms/spl/0.7.0/*(/); ln -sfvr $kver/x86_64 ./$kver:t

(The issue is that the debs being generated are treating the spl dkms module source directory as the dkms module's build root)
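
A rough bash equivalent of the zsh one-liner above (untested sketch, same caveats):

cd /usr/src/spl-0.7.0 && \
for kver in /var/lib/dkms/spl/0.7.0/*/; do
    ln -sfvr "${kver}x86_64" "./$(basename "$kver")"
done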

Just tested with the new release 0.7.6 which builds fine on Debian Stretch.

I'm very impressed with the new performance gains. On a deliberately memory-starved server (11G free), here are some stats:

8x NVME Micron 9100 (MTFDHAL1T2MCF) in RAIDZ2

iozone -i 0 -i 1 -t 1 -s16g -r16k  -t 20
RAIDZ2 ashift=13 (1/20 thread, 16k blocks)
    - write     1.22G/sec   5.25G/sec
    - rewrite   1.27G/sec   3.73G/sec
    - read      2.82G/sec   17.6G/sec
    - reread    2.97G/sec   17.8G/sec
iozone -i 0 -i 1 -t 1 -s16g -r256k  -t 20
RAIDZ2 ashift=13 (1/20 thread, 256k blocks)
    - write     2.48G/sec   8.90G/sec
    - rewrite   2.84G/sec   6.46G/sec
    - read      4.42G/sec   16.6G/sec
    - reread    4.59G/sec   17.2G/sec

Sounds like this was fixed in 0.7.6. Please let me know or reopen if I'm mistaken.

Yes, thanks. Speeds were definitely better with 0.7.6.

Maybe I'm quite late, but I would still like to continue looking into this issue, as it is still not fixed for our use case.

Here is a fio benchmark (1 zpool, RAIDZ2, 8+2 disks):

ZFS 0.6.5.11:

   READ: bw=1517MiB/s (1591MB/s), 190MiB/s-210MiB/s (199MB/s-220MB/s), io=160GiB (172GB), run=97434-107996msec

ZFS 0.7.12:

   READ: bw=1089MiB/s (1142MB/s), 136MiB/s-137MiB/s (143MB/s-144MB/s), io=160GiB (172GB), run=148977-150458msec

arcstats and zfetchstats are not very enlightening to me:

read_zfs0.6.log
read_zfs0.7.log
arcstats_zfs0.6.txt
arcstats_zfs0.7.txt
zfetchstats_zfs0.6.txt
zfetchstats_zfs0.7.txt
write.fio.txt
read.fio.txt
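
For anyone without the attachments who wants to reproduce a comparable sequential-read load, a fio job of roughly this shape could be used (illustrative only; the directory, job count, and sizes are assumptions, not the exact attached job):

fio --name=seqread --directory=/tank/fio --rw=read --bs=1M --size=20g \
    --numjobs=8 --ioengine=psync --group_reporting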
