zfs 0.6.5 and write performance problems

Created on 12 Apr 2016  ·  143 Comments  ·  Source: openzfs/zfs

Hi,

First post here so apologies in advance for my mistakes!

We recently upgraded ZFS on our Ubuntu 12.04.5 SAN from 0.6.2-1~precise to 0.6.5.4-1~precise. This was done mistakenly as part of a crash recovery.

This system has been in-place and stable for several years.

Following the upgrade we have been experiencing higher disk utilisation resulting in intermittent IOWAIT CPU issues for the server and its NFS clients.

We noticed several performance changes post upgrade. _Note that the upgrade occurred @ Week 13 on the graphs below_.

1) All clients of this SAN use NFS and there has been no change in the number of inbound NFS writes (dark brown line)

[graph: inbound NFS write ops]

2) zpool iostat write ops have doubled

[graph: zpool iostat write ops]

3) iostat writes have doubled for all pool disks (graph is for a sample disk)

[graph: per-disk iostat writes]

4) iostat write sizes (avgrq-sz) have halved for all pool disks

[graph: per-disk iostat avgrq-sz]

I would guess that these issues are related to this entry in the 0.6.5 changelog -

  • Improved zvol performance on all kernels (>50% higher throughput, >20% lower latency)

I've also seen write coalescing mentioned on GitHub, but looking at zpool iostat I don't see much of it happening -

san1stdpool                 48.6T  16.6T     63  1.36K   404K  9.00M
  raidz2                    8.04T  2.83T     14    306  88.6K  2.12M
    scsi-2001b4d2031085703      -      -      3    268  29.6K  1.55M
    scsi-2001b4d2061836505      -      -      4    265  38.4K  1.53M
    scsi-2001b4d2061930609      -      -      3    260  23.2K  1.52M
    scsi-2001b4d2061838504      -      -      3    266  15.2K  1.55M
  raidz2                    8.10T  2.78T      7    205   101K  1.40M
    scsi-2001b4d2004191406      -      -      2    154  40.8K  1.04M
    scsi-2001b4d2061656700      -      -      1    156  34.4K  1.05M
    scsi-2001b4d2061540302      -      -      2    156  17.6K  1.07M
    scsi-2001b4d2061648702      -      -      1    159  19.2K  1.08M
  raidz2                    8.11T  2.77T     11    169   118K  1.11M
    scsi-2001b4d2061055300      -      -      3    138  44.8K   865K
    scsi-2001b4d2061651708      -      -      3    136  42.4K   861K
    scsi-2001b4d2061645707      -      -      2    133  19.2K   861K
    scsi-2001b4d2061558207      -      -      2    137  23.2K   875K
  raidz2                    8.11T  2.76T     11    188  34.9K  1.07M
    scsi-2001b4d2061040403      -      -      1    154  7.20K   891K
    scsi-2001b4d2061556002      -      -      2    156  9.60K   892K
    scsi-2001b4d2061652709      -      -      4    151  17.6K   866K
    scsi-2001b4d2061634008      -      -      2    152  11.2K   873K
  raidz2                    8.12T  2.76T      8    236  28.6K  1.39M
    scsi-2001b4d2061541201      -      -      1    170  7.20K  1.12M
    scsi-2001b4d2061642603      -      -      2    169  12.0K  1.10M
    scsi-2001b4d2061650808      -      -      2    168  9.60K  1.09M
    scsi-2001b4d2061543300      -      -      2    167  9.60K  1.10M
  raidz2                    8.13T  2.75T     10    281  32.3K  1.89M
    scsi-2001b4d2061653704      -      -      2    241  12.8K  1.40M
    scsi-2001b4d2061643704      -      -      3    241  16.0K  1.40M
    scsi-2001b4d2061651706      -      -      2    243  8.80K  1.40M
    scsi-2001b4d2061555002      -      -      2    242  11.2K  1.41M
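
(The per-vdev figures above are from a zpool iostat sample along these lines; the 30-second interval is an assumption on my part:)

    # per-vdev read/write ops and bandwidth, two samples 30 seconds apart (interval assumed)
    zpool iostat -v san1stdpool 30 2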

What would you guys suggest? Downgrade to 0.6.4?

Thanks for your help

Labels: Inactive, Performance

Most helpful comment

@behlendorf: maybe it's time to revert the bio work. It's obviously causing a major slowdown and there's no solution in sight. I've not seen anything from @ryao in a while on this (or actually anything, hope he's OK), and I'm not sure anyone else is working on/knows how to solve the problem. If LLNL had to back the workload with blockdev, would you allow engineering to implement a pathologically slow system that literally dropped 100x in performance on a single commit and was never fixed? This has blocked many production implementations for us; client engineering teams shoot us down and spend 30X on ... name a vendor ... Then they're stuck in proprietary hell, locked in to the people who now own their data, with other performance and consistency problems.
With crypto in place, this is probably now the most serious outstanding feature gap. Can this get prioritized please, or reverted if there is no developer bandwidth to fix it?

All 143 comments

@crollorc There have been a whole lot more fundamental changes since 0.6.2 than just the zvol rework you pointed out. Among other things that could impact a write-heavy load is the write throttle overhaul of e8b96c6. You should check out some of the new-ish kstats such as /proc/spl/kstat/zfs/dmu_tx and also the per-pool histograms in /proc/spl/kstat/zfs/<pool>/dmu_tx_assign. The new system may be absorbing writes much more quickly than previously. The zvol rework you mention lowers latency a lot and could cause excessive write throttling. It would depend on the amount of memory in the system and the speed at which your drives can handle the writes.
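
For example, a simple before/after snapshot 60 seconds apart (the interval is arbitrary) would be enough to see which counters are moving:

    # sample the global dmu_tx counters and the per-pool assign-latency histogram twice
    cat /proc/spl/kstat/zfs/dmu_tx /proc/spl/kstat/zfs/san1stdpool/dmu_tx_assign
    sleep 60
    cat /proc/spl/kstat/zfs/dmu_tx /proc/spl/kstat/zfs/san1stdpool/dmu_tx_assign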

@dweeezil Thank you for the speedy response!

The new system may be absorbing writes much more quickly than previously.

Note that sync=disabled was set up on all pools a long time ago to get reasonable NFS performance

    root@san1:~# zfs get sync san1stdpool
    NAME         PROPERTY  VALUE     SOURCE
    san1stdpool  sync      disabled  local

It would depend on the amount of memory in the system and the speed at which your drives can handle the writes.

We use an Areca ARC-1880 controller and a bunch of WD30EZRX disks. Here are some details on system memory, ZFS tunings and iostat for a sample disk

    root@san1:~# grep Mem /proc/meminfo
    MemTotal:       148500592 kB
    MemFree:        52012052 kB

    root@san1:~# cat /etc/modprobe.d/zfs.conf
    options zfs zfs_arc_max=80530636800
    options zfs zfs_arc_meta_limit=21474836480
    options zfs zfs_prefetch_disable=1

    root@san1:~# iostat -dx 30 sdx
    Linux 3.2.0-57-generic (san1)   12/04/16        _x86_64_        (8 CPU)
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sdx               0.00     0.00    3.79  172.44    82.42  1225.43    14.84     0.38    2.14   63.47    0.79   1.27  22.42

    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sdx               0.00     0.00    5.83  147.07    45.60  1149.20    15.63     0.22    1.46   32.50    0.23   1.28  19.52

    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sdx               0.00     0.00    5.40  168.37    53.87  1211.07    14.56     0.40    2.30   56.27    0.57   1.64  28.52

There have been a whole lot more fundamental changes since 0.6.2 than just the zvol rework you pointed out. Among other things that could impact a write-heavy load is the write throttle overhaul of e8b96c6. You should check out some of the new-ish kstats such as /proc/spl/kstat/zfs/dmu_tx and also the per-pool histograms in /proc/spl/kstat/zfs/<pool>/dmu_tx_assign.

Below are some details for the kstats mentioned.

dmu_tx_assigned is increasing fast and dmu_tx_group is increasing slowly:

    dmu_tx (60s apart):
        root@san1:~# cat /proc/spl/kstat/zfs/dmu_tx
        5 1 0x01 11 528 5451072961 1032291543673689
        name                            type data
        dmu_tx_assigned                 4    594152327
        dmu_tx_delay                    4    0
        dmu_tx_error                    4    0
        dmu_tx_suspended                4    0
        dmu_tx_group                    4    264018
        dmu_tx_memory_reserve           4    0
        dmu_tx_memory_reclaim           4    0
        dmu_tx_dirty_throttle           4    3169
        dmu_tx_dirty_delay              4    0
        dmu_tx_dirty_over_max           4    0
        dmu_tx_quota                    4    0

        root@san1:~# cat /proc/spl/kstat/zfs/dmu_tx
        5 1 0x01 11 528 5451072961 1032351549214250
        name                            type data
        dmu_tx_assigned                 4    594190604
        dmu_tx_delay                    4    0
        dmu_tx_error                    4    0
        dmu_tx_suspended                4    0
        dmu_tx_group                    4    264044
        dmu_tx_memory_reserve           4    0
        dmu_tx_memory_reclaim           4    0
        dmu_tx_dirty_throttle           4    3169
        dmu_tx_dirty_delay              4    0
        dmu_tx_dirty_over_max           4    0
        dmu_tx_quota                    4    0

    dmu_tx_assign (60s apart):
        root@san1:~# cat /proc/spl/kstat/zfs/san1stdpool/dmu_tx_assign
        19 1 0x01 32 1536 15812990669 1032205962575895
        name                            type data
        1 ns                            4    0
        2 ns                            4    0
        4 ns                            4    0
        8 ns                            4    0
        16 ns                           4    0
        32 ns                           4    0
        64 ns                           4    0
        128 ns                          4    0
        256 ns                          4    39
        512 ns                          4    181
        1024 ns                         4    123
        2048 ns                         4    229
        4096 ns                         4    309
        8192 ns                         4    611
        16384 ns                        4    4702
        32768 ns                        4    7820
        65536 ns                        4    9830
        131072 ns                       4    8734
        262144 ns                       4    8626
        524288 ns                       4    7764
        1048576 ns                      4    6640
        2097152 ns                      4    6854
        4194304 ns                      4    8979
        8388608 ns                      4    14816
        16777216 ns                     4    23768
        33554432 ns                     4    26059
        67108864 ns                     4    21855
        134217728 ns                    4    19184
        268435456 ns                    4    21004
        536870912 ns                    4    23511
        1073741824 ns                   4    21004
        2147483648 ns                   4    24505

        root@san1:~# cat /proc/spl/kstat/zfs/san1stdpool/dmu_tx_assign
        19 1 0x01 32 1536 15812990669 1032266427393963
        name                            type data
        1 ns                            4    0
        2 ns                            4    0
        4 ns                            4    0
        8 ns                            4    0
        16 ns                           4    0
        32 ns                           4    0
        64 ns                           4    0
        128 ns                          4    0
        256 ns                          4    39
        512 ns                          4    181
        1024 ns                         4    123
        2048 ns                         4    229
        4096 ns                         4    309
        8192 ns                         4    611
        16384 ns                        4    4702
        32768 ns                        4    7820
        65536 ns                        4    9831
        131072 ns                       4    8735
        262144 ns                       4    8626
        524288 ns                       4    7764
        1048576 ns                      4    6640
        2097152 ns                      4    6854
        4194304 ns                      4    8979
        8388608 ns                      4    14816
        16777216 ns                     4    23768
        33554432 ns                     4    26059
        67108864 ns                     4    21856
        134217728 ns                    4    19184
        268435456 ns                    4    21007
        536870912 ns                    4    23512
        1073741824 ns                   4    21004
        2147483648 ns                   4    24510

Thanks again for any pointers.

@crollorc These aren't a good sign:

    536870912 ns                    4    23512
    1073741824 ns                   4    21004
    2147483648 ns                   4    24510

Please try echo 100 > /sys/module/zfs/parameters/zfs_txg_history and then monitor the tail of /proc/spl/kstat/zfs/san1stdpool/txgs. My guess is that there's going to be a lot of dirty data built up and you'll see txgs taking a long time to sync. With 148GiB of RAM, by default, it'll allow 14.8GiB of dirty data and will start delaying transactions once 60% of that's filled. If the dirty data is building up that much, it's quite a bit to be flushing to a set of 6 vdevs comprised of 5400 RPM drives. Your sync=disabled setting removes most of the pressure from the writes and is likely letting lots of dirty data build up.
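
If it helps, a rough way to collect that (the 5-second watch interval is arbitrary):

    # keep a 100-entry txg history for the pool and watch the most recent entries;
    # ndirty is in bytes and otime/qtime/wtime/stime are in nanoseconds
    echo 100 > /sys/module/zfs/parameters/zfs_txg_history
    watch -n 5 'tail -n 20 /proc/spl/kstat/zfs/san1stdpool/txgs'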

One other note: You had suspected changes to the zvol code but it's not clear that your system is using zvols at all, especially if it's just an NFS server.

@dweeezil Thanks again for your help!

One other note: You had suspected changes to the zvol code but it's not clear that your system is using zvols at all, especially if it's just an NFS server.

Yes, sorry for this red herring!

Please try echo 100 > /sys/module/zfs/parameters/zfs_txg_history and then monitor the tail of /proc/spl/kstat/zfs/san1stdpool/txgs. My guess is that there's going to be a lot of dirty data built up and you'll see txgs taking a long time to sync. With 148GiB of RAM, by default, it'll allow 14.8GiB of dirty data and will start delaying transactions once 60% of that's filled. If the dirty data is building up that much, it's quite a bit to be flushing to a set of 6 vdevs comprised of 5400 RPM drives. Your sync=disabled setting removes most of the pressure from the writes and is likely letting lots of dirty data build up.

Some txgs below. As far as I can tell, for this sample the avg sync time is 0.785s and max dirty data is 389MB. Is that right?

Does it look like we have excessive dirty data? If so, should we change some options to reduce this?

Also, the thing I find most confusing post upgrade is the halving of average write size and doubling of write IO. Could this indicate some sort of issue with the coalescing or merging of txg async writes?
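
For what it's worth, I arrived at those figures with something roughly like the following (column positions assumed from the txgs header, stime in ns):

    # average sync time (stime, ns -> s) and peak dirty data (ndirty, bytes -> MB)
    # over completed txgs; column positions taken from the txgs header below
    awk 'NR > 2 && $3 == "C" { s += $12; n++; if ($4 > d) d = $4 }
         END { printf "avg stime %.3fs  max ndirty %.0fMB\n", s/n/1e9, d/1e6 }' \
        /proc/spl/kstat/zfs/san1stdpool/txgs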

root@san1:~# tail -100 /proc/spl/kstat/zfs/san1stdpool/txgs
18 0 0x01 100 11200 15812979317 1051533841667804
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
15119926 1051317597136163 C     67319808     1138176      30715392     22       4244     2288632025   41059        90431        539574908
15119927 1051319885768188 C     67148288     324096       29394944     11       4660     3023421149   42716        56583        421626699
15119928 1051322909189337 C     622592       286208       9233920      7        2013     133828       158336       421440564    418241927
15119929 1051322909323165 C     67190272     0            31543808     0        4339     3105828843   40835        34503        207592661
15119930 1051326015152008 C     67112960     246272       30817280     18       4012     3288812978   32739        23840        1613413641
15119931 1051329303964986 C     67280896     797696       32596992     8        4321     2928346818   27324        51089        1421791320
15119932 1051332232311804 C     67201024     723456       28093952     7        4179     2135289574   30965        67649        272378251
15119933 1051334367601378 C     185868800    553984       34114560     9        3945     272499109    23381        4422         2392586832
15119934 1051334640100487 C     388984320    1226752      68935680     16       6189     2392672328   25820        17348        3349258721
15119935 1051337032772815 C     69060608     654336       33100288     18       3841     3349315820   22396        5053         2735950416
15119936 1051340382088635 C     73426944     2148864      34294272     46       4346     2735995882   22438        58645        3479784551
15119937 1051343118084517 C     72868864     719360       34494464     16       4195     3479911953   35964        67916        4541587497
15119938 1051346597996470 C     72126464     273408       29708288     5        4051     4541745505   40595        71655        470704815
15119939 1051351139741975 C     67288064     14848        34340352     1        4389     2762774648   35648        59742        290774122
15119940 1051353902516623 C     67375616     0            29429248     0        3693     1752902477   32134        57074        213941754
15119941 1051355655419100 C     67144192     121344       30396416     2        4474     2401879065   33550        56413        227860347
15119942 1051358057298165 C     311296       181760       6515200      15       1388     117155       24713        227814025    3711561552
15119943 1051358057415320 C     77940736     211456       35246592     11       5057     3939418192   39508        25284        420930211
15119944 1051361996833512 C     67150848     499712       34809856     8        4503     4519004489   35635        59046        250941851
15119945 1051366515838001 C     245760       0            6135296      0        978      149579       25826        250866431    90757888
15119946 1051366515987580 C     161980928    3584         32300544     1        3054     341680448    23303        160649       328860642
15119947 1051366857668028 C     67266560     75264        28854272     5        3882     3171883004   33868        53873        1542083954
15119948 1051370029551032 C     67233792     119296       31291392     2        3844     4015941251   33066        61815        222378832
15119949 1051374045492283 C     67112448     73216        30317056     1        4108     2828123890   32982        51705        232453458
15119950 1051376873616173 C     294912       3072         4966912      1        959      94905        23727        232425852    219545549
15119951 1051376873711078 C     67220480     0            28463104     0        3923     3582735732   32428        56405        173446117
15119952 1051380456446810 C     67352064     0            32849408     0        4644     4557654099   33427        62170        260095880
15119954 1051386876962366 C     311296       7168         4756992      3        968      64058        104865       217125334    391759430
15119955 1051386877026424 C     67350016     512          29517312     1        3781     2162472992   37046        58895        243407587
15119956 1051389039499416 C     67367936     4096         26507264     2        3857     2859501850   32602        16814        213039912
15119957 1051391899001266 C     294912       3584         6612480      1        1160     56747        23807        213015136    282602466
15119958 1051391899058013 C     84591616     5120         23961088     2        2823     495659675    19990        9958         193463258
15119959 1051392394717688 C     67204608     0            32209408     0        4709     3241553670   29093        58300        196789366
15119960 1051395636271358 C     67344384     3584         21316608     1        3268     1424572876   29297        51815        218067837
15119961 1051397060844234 C     133722112    10240        27375104     3        3207     218169319    21447        14854        276689871
15119962 1051397279013553 C     425984       75264        4948992      2        885      43184        34551        276654583    839144752
15119963 1051397279056737 C     86631424     418304       28002816     10       3119     1115928940   34058        44815        3867624148
15119964 1051398394985677 C     67249152     116224       34108928     5        4120     3976622524   31413        56118        1811913426
15119965 1051402371608201 C     442368       271360       6130176      3        1145     116962       30316        1811861424   316975495
15119966 1051402371725163 C     67272704     7168         30453248     2        3696     4079832998   5366991      56363        177157625
15119967 1051406451558161 C     606208       6656         5591040      2        1041     5474039      28348        177085766    463744406
15119968 1051406457032200 C     67382784     11264        28375552     3        3694     4750365650   31639        58194        251079004
15119969 1051411207397850 C     67136000     174080       30353920     4        4368     2648867338   33360        19377        333135826
15119970 1051413856265188 C     67301376     0            32245248     0        4648     3323272140   33377        49368        260578413
15119971 1051417179537328 C     67398656     0            30830592     0        4187     3326830255   34880        82275        193184454
15119972 1051420506367583 C     704512       13824        5534208      2        1028     168081       1436206      191703067    287482413
15119973 1051420506535664 C     67311104     4096         28228608     2        3945     1893336652   33214        20482        254137463
15119974 1051422399872316 C     67278336     0            21765632     0        2773     348983041    27356        47972        145083331
15119975 1051422748855357 C     67352064     0            26700288     0        4364     4445354130   28759        90471        275635672
15119976 1051427194209487 C     67309568     4096         17486848     2        2648     287537177    19723        36940        285099234
15119977 1051427481746664 C     294912       3584         6289920      1        1418     78881        15976        285067202    372916241
15119978 1051427481825545 C     99746816     138240       32547328     3        3813     658033143    37851        112851       281154981
15119979 1051428139858688 C     67299840     0            35225088     0        4658     4791695098   29047        34142        238786288
15119980 1051432931553786 C     67450368     442880       32909312     9        4657     3382521809   29070        53666        455676317
15119981 1051436314075595 C     62214656     0            28602368     0        4430     4998970069   33190        48473        212310971
15119982 1051441313045664 C     67166208     7680         29568512     1        4322     4977762671   33126        56693        220063626
15119984 1051448657016982 C     67188224     39424        28069888     1        3891     3938588146   33204        63339        255697567
15119985 1051452595605128 C     206512128    262144       36888576     2        3808     255814316    20576        5260         433042218
15119986 1051452851419444 C     294912       0            5493760      0        941      30191        67419        432978226    245940283
15119987 1051452851449635 C     118355456    740864       30279680     25       3021     679079720    36735        46412        5243127770
15119988 1051453530529355 C     107720192    454144       44655616     10       5230     5243222808   22147        4742         2608530245
15119989 1051458773752163 C     193300480    567808       42410496     14       4978     2608579548   23583        4823         2329348098
15119990 1051461382331711 C     67341824     647168       27520000     11       3886     2861814130   84761564     51831        2984429468
15119991 1051464244145841 C     10947072     136704       10307584     6        1463     84872604     25224        2984351067   376682757
15119992 1051464329018445 C     67301888     131072       33234944     1        3847     4520813041   31963        54605        222658607
15119993 1051468849831486 C     67185152     0            30845952     0        4001     3531488284   41844        54481        199061853
15119994 1051472381319770 C     67261952     0            33922048     0        4376     3678401469   33453        55968        232224320
15119995 1051476059721239 C     67164672     131072       37501440     1        5200     4175202659   34296        60155        225753980
15119996 1051480234923898 C     67061760     3584         33344512     1        5753     3162438635   33377        60652        293532139
15119997 1051483397362533 C     67233280     0            27124224     0        3267     892441673    40501        17950        188961671
15119998 1051484289804206 C     72597504     7680         19038720     2        2022     189064281    25234        38076        499031632
15119999 1051484478868487 C     67157504     0            29156352     0        3885     4877431971   33126        58347        294803691
15120000 1051489356300458 C     67275776     0            17293312     0        2613     488698066    25123        15277        186339567
15120001 1051489844998524 C     67120640     189952       29140480     2        4070     2988750048   31981        54116        217461923
15120002 1051492833748572 C     67270144     30208        30515200     1        4008     2598587018   33083        24376        207672250
15120003 1051495432335590 C     67363328     0            23166976     0        3080     1092220744   59143        27075        197765246
15120004 1051496524556334 C     311296       77824        5860864      3        944      107004       115234       197634991    284243810
15120005 1051496524663338 C     67376128     34816        21529088     1        3238     1492765489   33958        61471        177605761
15120006 1051498017428827 C     67281920     88064        23701504     3        3359     1563872137   34359        56987        2221304322
15120007 1051499581300964 C     442368       446464       6493696      7        972      142416       121732       2221137710   579516387
15120008 1051499581443380 C     80051200     0            27474432     0        3577     2800846052   31181        36791        237096877
15120009 1051502382289432 C     67284480     0            24519168     0        3317     843436262    32395        21238        187852604
15120010 1051503225725694 C     606208       288256       5197312      9        973      70125        137521       187704935    632766615
15120011 1051503225795819 C     89024512     65536        26772992     7        3249     820634999    35718        45433        2604648562
15120012 1051504046430818 C     69491712     264192       26582016     6        3784     2604750202   20538        38766        322755582
15120014 1051509437880034 C     72758272     207872       21197824     7        2681     169953556    24450        4840         2003911666
15120015 1051509607833590 C     67322880     31744        33702912     7        4161     4465702675   32512        58383        1077354867
15120016 1051514073536265 C     81204224     0            25047040     0        3331     1077512643   31088        13559        159332753
15120017 1051515151048908 C     67190784     187904       14333440     2        2120     330590061    23868        65989        925606183
15120018 1051515481638969 C     294912       5632         5998080      1        899      110777       16520        925574862    298434801
15120019 1051515481749746 C     115534848    302080       32688640     6        3619     1224080377   27726        44641        1976941306
15120020 1051516705830123 C     67234816     262144       31841280     2        4065     4021937649   29437        65558        910526372
15120021 1051520727767772 C     67363328     0            33303552     0        3697     3745727914   29722        61160        191961445
15120022 1051524473495686 C     67283968     242176       31216640     4        4147     2288904899   40747        58234        653195841
15120023 1051526762400585 C     67494400     0            31962112     0        4467     3981323257   30209        34169        226151116
15120024 1051530743723842 C     67171840     5632         30683648     5        4057     2780591447   28499        32628        233295354
15120025 1051533524315289 O     0            0            0            0        0        0            0            0            0

Thanks again.

@crollorc I may have gotten ahead of myself here. So much has changed since 0.6.2 and I latched on to the new write throttle. There have also been big ARC changes. In order to get a better picture of what's going on, we also need to see your arcstats.

Are these changes:

options zfs zfs_arc_max=80530636800
options zfs zfs_arc_meta_limit=21474836480

left over from 0.6.2 or did you add them after the upgrade? If these are from 0.6.2, I'd suggest removing these and watching the arcstats.

@dweeezil

There have also been big ARC changes. In order to get a better picture of what's going on, we also need to see your arcstats.

I've added some arcstats and arc_summary below (hopefully with the right params).

Are these changes:
options zfs zfs_arc_max=80530636800
options zfs zfs_arc_meta_limit=21474836480
left over from 0.6.2 or did you add them after the upgrade? If these are from 0.6.2, I'd suggest removing these and watching the arcstats.

They are left over and were tuned to deal with some problems with a large ARC in 0.6.2. We were also dropping caches (echo 3 > /proc/sys/vm/drop_caches) every hour on 0.6.2 but stopped post upgrade.

We are very dependent on the high ARC hit rates we get (ARC: 93%, L2ARC: 68%). How would I revert these options to their defaults on the live system? What size will the ARC default to, and will this result in the ARC being cleared or will it resize dynamically?
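
Would something along these lines be the right way to go about it? I'm assuming that writing 0 reverts a parameter to its built-in default, and I'm not sure whether that takes full effect without a module reload:

    # guess: 0 should mean "use the built-in default" for these parameters
    echo 0 > /sys/module/zfs/parameters/zfs_arc_max
    echo 0 > /sys/module/zfs/parameters/zfs_arc_meta_limit
    # and drop the overrides so they don't come back on the next module load
    sed -i '/zfs_arc_max\|zfs_arc_meta_limit/d' /etc/modprobe.d/zfs.conf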

Thanks again for your help.

root@san1:~# arcstat.py  2
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
22:19:34     1     0     33     0   33     0    0     0   33    75G   75G
22:19:36  4.5K   227      5   227    5     0    0   167   43    75G   75G
22:19:38  2.9K   199      6   199    6     0    0   163    7    74G   75G
22:19:40   749   132     17   132   17     0    0    95   29    74G   75G
22:19:42  4.4K   220      4   220    4     0    0   159   10    74G   75G
22:19:44  1.4K    60      4    60    4     0    0    31    3    74G   75G
22:19:46  1.6K   119      7   119    7     0    0    84    8    74G   75G
22:19:48  1.5K   128      8   128    8     0    0    71    7    75G   75G
22:19:50  2.3K   333     14   333   14     0    0   311   17    74G   75G
22:19:52  1.0K    81      8    81    8     0    0    44   32    75G   75G
22:19:54  4.5K   106      2   106    2     0    0    72    7    75G   75G
22:19:56  2.1K    31      1    31    1     0    0     7   35    75G   75G
22:19:58  3.0K    96      3    96    3     0    0    54    5    75G   75G
22:20:00  6.9K   353      5   353    5     0    0   304    5    75G   75G
22:20:02  1.5K   146      9   146    9     0    0    79   42    75G   75G
22:20:05   23K   190      0   190    0     0    0   138    0    75G   75G
22:20:07  5.3K   150      2   150    2     0    0   101    7    75G   75G
22:20:09  4.1K   325      8   325    8     0    0   254   17    75G   75G
22:20:11  3.8K   129      3   129    3     0    0    89    4    75G   75G
22:20:13  3.5K   127      3   127    3     0    0    78    6    75G   75G
22:20:15   968   118     12   118   12     0    0    78   40    75G   75G
22:20:17  1.4K   115      8   115    8     0    0    83    9    75G   75G
22:20:19   537    87     16    87   16     0    0    66   72    75G   75G

root@san1:~# arc_summary.py

------------------------------------------------------------------------
ZFS Subsystem Report                            Tue Apr 12 22:20:47 2016
ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                73.11m
        Mutex Misses:                           8.20k
        Evict Skips:                            8.20k

ARC Size:                               100.00% 75.00   GiB
        Target Size: (Adaptive)         100.00% 75.00   GiB
        Min Size (Hard Limit):          0.04%   32.00   MiB
        Max Size (High Water):          2400:1  75.00   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       93.75%  70.31   GiB
        Frequently Used Cache Size:     6.25%   4.69    GiB

ARC Hash Breakdown:
        Elements Max:                           9.19m
        Elements Current:               92.81%  8.53m
        Collisions:                             160.23m
        Chain Max:                              6
        Chains:                                 918.16k

ARC Total accesses:                                     3.54b
        Cache Hit Ratio:                93.96%  3.33b
        Cache Miss Ratio:               6.04%   213.86m
        Actual Hit Ratio:               93.80%  3.32b

        Data Demand Efficiency:         94.10%  1.08b
        Data Prefetch Efficiency:       30.63%  13.31m

        CACHE HITS BY CACHE LIST:
          Most Recently Used:           8.86%   294.70m
          Most Frequently Used:         90.97%  3.03b
          Most Recently Used Ghost:     1.40%   46.42m
          Most Frequently Used Ghost:   3.08%   102.49m

        CACHE HITS BY DATA TYPE:
          Demand Data:                  30.47%  1.01b
          Prefetch Data:                0.12%   4.07m
          Demand Metadata:              69.34%  2.31b
          Prefetch Metadata:            0.07%   2.25m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  29.69%  63.48m
          Prefetch Data:                4.32%   9.23m
          Demand Metadata:              64.88%  138.75m
          Prefetch Metadata:            1.12%   2.40m

L2 ARC Summary: (HEALTHY)
        Low Memory Aborts:                      288
        Free on Write:                          475.27k
        R/W Clashes:                            7.58k
        Bad Checksums:                          0
        IO Errors:                              0

L2 ARC Size: (Adaptive)                         664.73  GiB
        Compressed:                     42.35%  281.54  GiB
        Header Size:                    0.09%   626.96  MiB

L2 ARC Evicts:
        Lock Retries:                           60
        Upon Reading:                           1

L2 ARC Breakdown:                               213.86m
        Hit Ratio:                      68.97%  147.50m
        Miss Ratio:                     31.03%  66.36m
        Feeds:                                  1.06m

L2 ARC Writes:
        Writes Sent:                    100.00% 1.02m

File-Level Prefetch: (HEALTHY)
DMU Efficiency:                                 9.40m
        Hit Ratio:                      31.19%  2.93m
        Miss Ratio:                     68.81%  6.47m

        Colinear:                               6.47m
          Hit Ratio:                    0.01%   536
          Miss Ratio:                   99.99%  6.46m

        Stride:                                 2.72m
          Hit Ratio:                    99.99%  2.72m
          Miss Ratio:                   0.01%   349

DMU Misc:
        Reclaim:                                6.46m
          Successes:                    0.35%   22.78k
          Failures:                     99.65%  6.44m

        Streams:                                214.18k
          +Resets:                      0.09%   200
          -Resets:                      99.91%  213.98k
          Bogus:                                0


ZFS Tunable:
        zfs_arc_sys_free                                  0
        zfs_arc_lotsfree_percent                          10
        l2arc_norw                                        0
        l2arc_feed_again                                  1
        l2arc_nocompress                                  0
        l2arc_noprefetch                                  1
        l2arc_feed_min_ms                                 200
        l2arc_feed_secs                                   1
        l2arc_headroom_boost                              200
        l2arc_headroom                                    2
        l2arc_write_boost                                 8388608
        l2arc_write_max                                   8388608
        zfs_arc_num_sublists_per_state                    8
        zfs_arc_min_prefetch_lifespan                     0
        zfs_arc_average_blocksize                         8192
        zfs_disable_dup_eviction                          0
        zfs_arc_p_min_shift                               0
        zfs_arc_shrink_shift                              0
        zfs_arc_p_dampener_disable                        1
        zfs_arc_p_aggressive_disable                      1
        zfs_arc_grow_retry                                0
        zfs_arc_meta_strategy                             1
        zfs_arc_meta_adjust_restarts                      4096
        zfs_arc_meta_prune                                10000
        zfs_arc_meta_min                                  0
        zfs_arc_meta_limit                                21474836480
        zfs_arc_max                                       80530636800
        zfs_arc_min                                       0
        zfs_dbuf_state_index                              0
        zfs_dedup_prefetch                                0
        zfs_nopwrite_enabled                              1
        zfs_mdcomp_disable                                0
        zfs_send_corrupt_data                             0
        zfs_pd_bytes_max                                  52428800
        zfetch_array_rd_sz                                1048576
        zfetch_block_cap                                  256
        zfetch_min_sec_reap                               2
        zfetch_max_streams                                8
        zfs_prefetch_disable                              1
        zfs_max_recordsize                                1048576
        zfs_delay_scale                                   500000
        zfs_dirty_data_sync                               67108864
        zfs_dirty_data_max_max                            38016151552
        zfs_dirty_data_max                                15206460620
        zfs_delay_min_dirty_percent                       60
        zfs_dirty_data_max_max_percent                    25
        zfs_dirty_data_max_percent                        10
        zfs_free_max_blocks                               100000
        zfs_no_scrub_prefetch                             0
        zfs_no_scrub_io                                   0
        zfs_resilver_min_time_ms                          3000
        zfs_free_min_time_ms                              1000
        zfs_scan_min_time_ms                              1000
        zfs_scan_idle                                     50
        zfs_scrub_delay                                   4
        zfs_resilver_delay                                2
        zfs_top_maxinflight                               32
        zfs_zevent_console                                0
        zfs_zevent_cols                                   80
        zfs_zevent_len_max                                2048
        metaslab_bias_enabled                             1
        metaslab_lba_weighting_enabled                    1
        metaslab_fragmentation_factor_enabled             1
        zfs_metaslab_fragmentation_threshold              70
        zfs_mg_fragmentation_threshold                    85
        zfs_mg_noalloc_threshold                          0
        metaslab_preload_enabled                          1
        metaslab_debug_unload                             0
        metaslab_debug_load                               0
        metaslab_aliquot                                  524288
        zio_taskq_batch_pct                               75
        spa_load_verify_data                              1
        spa_load_verify_metadata                          1
        spa_load_verify_maxinflight                       10000
        zfs_autoimport_disable                            1
        spa_config_path                                   /etc/zfs/zpool.cache
        spa_slop_shift                                    5
        spa_asize_inflation                               24
        zfs_deadman_enabled                               1
        zfs_deadman_synctime_ms                           1000000
        zfs_free_leak_on_eio                              0
        zfs_recover                                       0
        zfs_flags                                         0
        zfs_txg_history                                   100
        zfs_read_history_hits                             0
        zfs_read_history                                  0
        zfs_txg_timeout                                   5
        metaslabs_per_vdev                                200
        zfs_vdev_cache_bshift                             16
        zfs_vdev_cache_size                               0
        zfs_vdev_cache_max                                16384
        zfs_vdev_scheduler                                noop
        zfs_vdev_mirror_switch_us                         10000
        zfs_vdev_sync_write_min_active                    10
        zfs_vdev_sync_write_max_active                    10
        zfs_vdev_sync_read_min_active                     10
        zfs_vdev_sync_read_max_active                     10
        zfs_vdev_scrub_min_active                         1
        zfs_vdev_scrub_max_active                         2
        zfs_vdev_async_write_min_active                   1
        zfs_vdev_async_write_max_active                   10
        zfs_vdev_async_read_min_active                    1
        zfs_vdev_async_read_max_active                    3
        zfs_vdev_async_write_active_min_dirty_percent     30
        zfs_vdev_async_write_active_max_dirty_percent     60
        zfs_vdev_max_active                               1000
        zfs_vdev_write_gap_limit                          4096
        zfs_vdev_read_gap_limit                           32768
        zfs_vdev_aggregation_limit                        131072
        zfs_expire_snapshot                               300
        zfs_admin_snapshot                                0
        zfs_dbgmsg_maxsize                                4194304
        zfs_dbgmsg_enable                                 0
        zfs_immediate_write_sz                            32768
        zfs_read_chunk_size                               1048576
        zil_slog_limit                                    1048576
        zfs_nocacheflush                                  0
        zil_replay_disable                                0
        zfs_sync_pass_rewrite                             2
        zfs_sync_pass_dont_compress                       5
        zfs_sync_pass_deferred_free                       2
        zio_requeue_io_start_cut_in_line                  1
        zio_delay_max                                     30000
        zvol_prefetch_bytes                               131072
        zvol_max_discard_blocks                           16384
        zvol_major                                        230
        zvol_inhibit_dev                                  0

@crollorc Please cat /proc/spl/kstat/zfs/arcstats periodically when the problem is happening and post the output (ideally to a gist or paste or something).
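
Something as simple as this (interval and count are arbitrary) is fine:

    # ten timestamped arcstats snapshots, one minute apart, collected into a single file
    for i in $(seq 10); do date; cat /proc/spl/kstat/zfs/arcstats; sleep 60; done > arcstats.log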

@dweeezil

Thanks again for all your pointers.

Please cat /proc/spl/kstat/zfs/arcstats periodically when the problem is happening and post the output (ideally to a gist or paste or something).

We haven't noticed any problems with the ARC or reads in general since the upgrade, so I'm not sure when I should be looking at arcstats.

Because the ARC and L2ARC are so efficient, only a small proportion of disk IOs are reads -

    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sdx               0.00     0.00    3.79  172.44    82.42  1225.43    14.84     0.38    2.14   63.47    0.79   1.27  22.42

However, writes have doubled post-upgrade, which gives us a lot less disk IOPS headroom. When IOPS hit 220 or so, IO queue depth and service time jump, followed by SAN and client load spikes linked to IO_WAIT.

@crollorc Both your iostat and txgs history do show an average write size of about 8K. I would have expected ec8501e, which _is_ in 0.6.5.4, to help aggregate them better. The small writes, which were clearly shown in your original iostat graph, do appear to be the problem.

If there is a bigger underlying issue here, I'd expect there to be a lot more problem reports given how widespread the use of NFS server is. The only unusual thing about your setup is that you're not using an SSD log device but instead have set sync=disabled (and maybe that's not as unusual as it ought to be).
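
It might also be worth double-checking that the vdev aggregation tunables are still at their defaults (your tunable dump above suggests they are):

    # vdev I/O aggregation settings; 0.6.5 defaults should be 131072 / 4096 / 32768
    grep . /sys/module/zfs/parameters/zfs_vdev_aggregation_limit \
           /sys/module/zfs/parameters/zfs_vdev_write_gap_limit \
           /sys/module/zfs/parameters/zfs_vdev_read_gap_limit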

@dweeezil

Both your iostat and txgs history do show an average write size of about 8K. I would have expected ec8501e, which is in 0.6.5.4, to help aggregate them better. The small writes, which were clearly shown in your original iostat graph, do appear to be the problem.
If there is a bigger underlying issue here, I'd expect there to be a lot more problem reports given how widespread the use of NFS server is. The only unusual thing about your setup is that you're not using an SSD log device but instead have set sync=disabled (and maybe that's not as unusual as it ought to be).

One thing I didn't mention is that we have not done a 'zpool upgrade' so that we can downgrade if necessary.

We do have compression=on, and I disabled it to see what effect it would have.

iostat avgrq-sz has doubled (back to where it was pre-upgrade) but w/s has not dropped. Also, there are no wrqm/s, which seems odd to me.

compression=on:
root@san1:~# iostat -dx 30 sdx
Linux 3.2.0-57-generic (san1)   12/04/16        _x86_64_        (8 CPU)
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdx               0.00     0.00    5.83  147.07    45.60  1149.20    15.63     0.22    1.46   32.50    0.23   1.28  19.52

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdx               0.00     0.00    5.40  168.37    53.87  1211.07    14.56     0.40    2.30   56.27    0.57   1.64  28.52

compression=off:    
root@san1:~# iostat -dx 30 sdx
Linux 3.2.0-57-generic (san1)   13/04/16        _x86_64_        (8 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdx               0.00     0.00    4.47  167.83    29.60  3029.73    35.51     0.98    5.95  132.15    2.59   3.10  53.36

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdx               0.00     0.00    3.47  104.90    24.53  1603.60    30.05     0.47    4.33   55.69    2.63   3.06  33.12

Any ideas on data I could provide, tests I could do or steps I could take to help identify the root-cause or mitigate the problem?

Thanks again.

@dweeezil

We tested sync=standard on 0.6.2 with an SSD-based ZIL, but the performance was poor, so we moved to sync=disabled.

We are planning to try sync=standard again to see if this eliminates the small writes.

Does this sound reasonable?

@dweeezil

We are planning to try sync=standard again to see if this eliminates the small writes.

We are now running with sync=standard, and request sizes to the SLOG SSDs and the vdevs are both now around 15 sectors, so it looks like we are hitting the same issue with both sync=disabled and sync=standard.

Any ideas!?

@crollorc Unrelated to this issue, your earlier issue with sync=standard and an SSD log device may very well have been fixed by b39c22b. As to this current issue, I can't tell why you're not getting any write block aggregation but were in the past. The only thing I can think of at the moment is that the new write throttle is starting to sync the txgs before there's enough dirty data to actually aggregate the writes. The new write throttle is controlled much less by time and more by the amount of dirty data. Maybe try increasing zfs_dirty_data_sync from its default of 64MiB to something several times larger. The commit comment in e8b96c60 outlines all the new tunables (there are a lot of them) and the rationale behind the new dirty-data-based logic.
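
For example, something like this would show the current dirty-data settings and bump the sync threshold (the 4x value is only a starting point):

    # inspect the write-throttle tunables, then raise zfs_dirty_data_sync from 64MiB to 256MiB
    grep . /sys/module/zfs/parameters/zfs_dirty_data_* \
           /sys/module/zfs/parameters/zfs_delay_min_dirty_percent
    echo $((256*1024*1024)) > /sys/module/zfs/parameters/zfs_dirty_data_sync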

@dweeezil Thanks for the advice.

I hiked up zfs_dirty_data_sync by a factor of 5

echo "335544320">>/sys/module/zfs/parameters/zfs_dirty_data_sync

This has led to larger, slower txg syncs but has had no effect on write sizes, unfortunately

root@san1:~# cat /proc/spl/kstat/zfs/san1stdpool/txgs
18 0 0x01 100 11200 15812979317 1220138956414077
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
15165732 1219547306928970 C     103823872    8437248      108766208    687      6628     4998911226   21627        86364        5969940592
15165733 1219552305840196 C     133959680    16027136     133283840    890      8009     5970068750   19449        9661         9281926622
15165734 1219558275908946 C     419964416    10758144     418938880    948      10639    9281975753   38762642     12841        11173108648
15165735 1219567557884699 C     2506752      593408       7084544      46       1216     38783744     67981        11173038382  481888713
15165736 1219567596668443 C     335713280    14246912     329070592    911      11272    14922927079  33046        59473        10498697281
15165737 1219582519595522 C     208999424    19658240     209769984    1177     9328     10498829161  23350        11915        11889600512
15165738 1219593018424683 C     360969728    12538368     360671744    812      11064    11889653459  240645955    26036        12310457932
15165739 1219604908078142 C     3392512      1349632      11851264     142      2093     240679883    16497        12310439528  2259345943
15165740 1219605148758025 C     244354048    18569216     249533952    1169     11906    17309564928  29802        68381        12566762372
15165741 1219622458322953 C     614432768    8825344      612707328    687      13992    12566918310  18515761     24347        12789978166
15165742 1219635025241263 C     2703360      268288       9133056      46       1363     18546521     15738        12789962159  1117922924
15165743 1219635043787784 C     254418432    4844544      247903232    615      10461    17789000422  19663        28888        8831526786
15165744 1219652832788206 C     356394496    5272064      357357568    532      9933     8831597067   20160        20300        8794093168
15165745 1219661664385273 C     175232512    3508736      66828288     622      8274     8794205731   30647        27209        5537575188
15165746 1219670458591004 C     109496320    8479232      48517120     487      6868     5537646814   19917        4868         4994873362
15165747 1219675996237818 C     78661632     2893824      38667264     337      5141     4999150577   1095103      24052        2955321620
15165748 1219680995388395 C     335664128    10828800     71731200     695      7917     3664853246   31489        17776        5392371236
15165749 1219684660241641 C     425984       1851904      6299648      140      1209     66072        29508        5392331163   1340747972
15165750 1219684660307713 C     335826944    11036160     90896384     729      9605     10279364198  25490        16578        8948091217
15165751 1219694939671911 C     151248896    2181632      63707648     550      7995     8948185595   27002        35457        7073099465
15165752 1219703887857506 C     136689664    2355200      58725376     596      7312     7073210909   26658848     11622        5380774336
15165753 1219710961068415 C     174258688    2200576      58366464     410      6945     5407481756   28565        13310        4190204779
15165754 1219716368550171 C     106930688    5412352      50428928     493      6421     5000909906   13078253     22659        4051350662
15165755 1219721369460077 C     72669696     2042368      38909952     475      6270     5011741115   102642288    68493        3704350875
15165756 1219726381201192 C     77281792     1138688      35868672     267      5944     5099776317   30340        55163        2992220012
15165757 1219731480977509 C     86577664     1547776      42548736     441      6350     4999683313   34274        66523        2982665268
15165758 1219736480660822 C     125505024    2061312      51666944     434      7412     4999731645   4336226      66781        3676232961
15165760 1219745346040784 C     146661888    7630336      52675584     512      7413     6593591562   26484        43135        4919952556
15165761 1219751939632346 C     85427712     8373760      40056832     618      5884     4996010229   28315        61489        6691483521
15165762 1219756935642575 C     85024768     3826176      42681856     483      5982     6691619637   28521        43215        3149760693
15165763 1219763627262212 C     111463424    7518208      48117248     398      6533     4999845277   29544        46361        3226251398
15165764 1219768627107489 C     84978688     2159104      40262144     336      6124     4999694156   22488        56248        2707472379
15165765 1219773626801645 C     228312064    2410496      73106432     568      8755     4999814413   28472        67061        4537423339
15165766 1219778626616058 C     152351232    1446400      48518144     478      7144     4999704757   34587511     152924       3419871570
15165767 1219783626320815 C     85893632     1113088      42266112     346      6851     5031810548   29674        49372        2609414857
15165768 1219788658131363 C     80108544     3673088      40606208     531      7238     4999698306   29588        63670        5271838972
15165769 1219793657829669 C     107587072    3384320      49200640     778      7184     5271952524   18752        14482        6422220740
15165770 1219798929782193 C     283225088    1183232      75063808     337      8025     6422272179   25795865     26462        3306554213
15165771 1219805352054372 C     517490176    6026240      105960960    294      9675     3332407159   24361670     94654        7777553278
15165772 1219808684461531 C     328192       120832       8216576      38       1331     24462006     15774        7777537645   646071454
15165773 1219808708923537 C     335872512    2001408      96568832     196      10248    9625785224   3967403      58006        4411652001
15165774 1219818334708761 C     8011776      1228800      10922496     61       2146     4075025      25498        4411583229   1014418827
15165775 1219818338783786 C     122765312    285696       56177664     33       8769     9409308807   29039        17008        1427350402
15165776 1219827748092593 C     112653824    47104        52990464     19       7448     5003795561   29668        65926        823252720
15165777 1219832751888154 C     213636096    615936       68867072     22       7773     4999762114   29651        50609        582089623
15165778 1219837751650268 C     72818176     669184       40657920     90       7355     4999748160   29023        56489        2090515439
15165779 1219842751398428 C     69604864     148480       37343232     11       5940     4999820073   29721        75034        406247655
15165780 1219847751218501 C     92731392     64512        43635712     23       6909     4999625288   27797        56131        807718543
15165781 1219852750843789 C     136089600    16896        57552896     9        8189     4999839967   28295        66847        333888819
15165782 1219857750683756 C     119449088    397312       45880320     8        6320     5003233374   29865        36752        399538499
15165783 1219862753917130 C     290128896    865792       69095936     39       7460     4996205772   28943        49386        897644253
15165784 1219867750122902 C     272039424    941056       76449792     14       8472     4999853280   36926        58131        472168373
15165785 1219872749976182 C     57738752     230912       30526976     28       5454     4999638757   38953        15394        926580755
15165786 1219877749614939 C     112080384    2570752      49501696     62       7126     4999814566   30159        55156        1658388365
15165787 1219882749429505 C     118583808    927744       46692352     98       6621     5003692554   30183        26637        1538912251
15165788 1219887753122059 C     123766272    1391104      48965632     344      7353     4999812097   18289533     49957        3153770173
15165790 1219897768677826 C     115807232    5521408      48044544     599      7817     4999747181   27994        26217        5278535095
15165791 1219902768425007 C     107308032    3016192      43960320     421      6410     5278607879   30748        13269        3327519422
15165792 1219908047032886 C     88134144     2544640      44220928     428      6451     4996884894   24708        42158        3524859611
15165793 1219913043917780 C     101572096    1798656      47307264     402      6889     4999796341   29642        62216        3315015885
15165794 1219918043714121 C     201362432    9166848      64028160     856      8845     4999652480   21997593     51567        5958306468
15165795 1219923043366601 C     475703296    13393408     104780288    1027     11005    5980368461   53319        5879         10768368601
15165796 1219929023735062 C     411564032    5027328      102991360    670      10336    10768440854  18369        5008         9290714054
15165797 1219939792175916 C     216221184    6014464      81198592     520      9805     9290800250   26214        26780        9057522982
15165798 1219949082976166 C     269355008    2806784      96839680     480      9100     9057594477   19550        8993         7197828302
15165799 1219958140570643 C     120171520    1551872      52727808     352      6954     7197876001   20248892     185085       3336507211
15165800 1219965338446644 C     75444224     2197504      34736128     353      5539     5018575574   13689230     14698        3150206648
15165801 1219970357022218 C     72051712     1256960      35189248     346      5776     5011799878   29297        54952        2945432727
15165802 1219975368822096 C     113371648    1550336      51550720     328      6780     4999786866   29772        48439        2812015577
15165803 1219980368608962 C     148939264    1693184      50263040     441      6723     4999695466   22113        66972        3445798711
15165804 1219985368304428 C     369689600    1588736      80213504     408      7596     3445919602   22786        39240        4828515597
15165805 1219988814224030 C     146930176    2231296      60336128     503      7374     4997634368   25424        60583        5170268217
15165806 1219993811858398 C     97103360     1856000      44809216     446      6840     5170369617   20135        5213         5155910764
15165807 1219998982228015 C     66098688     3091456      36184064     371      5531     5155950845   19672        5059         3950018276
15165808 1220004138178860 C     101739520    1630208      47604736     361      6301     4996990395   17013433     51559        3971532466
15165809 1220009135169255 C     123042304    1344000      50500096     312      6664     5015688785   39308655     35128        4235817268
15165810 1220014150858040 C     257754112    1262080      64723456     335      7160     5035757232   37256        56746        3359800759
15165811 1220019186615272 C     78577664     1421824      40296448     356      6142     4999793735   18817233     42229        3218582558
15165812 1220024186409007 C     97896448     1326080      44792320     446      7251     5015675033   30864        70346        4524430587
15165813 1220029202084040 C     77125120     1417216      39683584     347      6158     4999808552   33373        48808        2920449939
15165814 1220034201892592 C     102322176    2422272      44349440     441      6314     4999800786   33444        62751        3078155648
15165815 1220039201693378 C     206567936    2083840      58301440     300      7239     4999646245   30567        53322        2940542504
15165816 1220044201339623 C     335696896    3860992      73400832     456      8006     4018577811   4150463      61669        5106145008
15165817 1220048219917434 C     1097728      546816       6807552      112      1469     4234043      29725        5106100328   797276855
15165818 1220048224151477 C     143298560    2547712      60027392     492      7608     10104533934  28355        52513        4927874857
15165820 1220063328383773 C     94451712     4605440      45061120     484      6269     4999770108   29912        47488        3966882626
15165821 1220068328153881 C     164284416    3402752      56995840     352      6653     4999796208   21953        86098        3702911755
15165822 1220073327950089 C     114455552    2962432      45576192     269      5641     4999694622   42065        16165        2549991152
15165823 1220078327644711 C     108544000    2788864      50128384     393      6284     5003727717   24171        59502        4126122936
15165824 1220083331372428 C     73833472     1990144      38971392     333      5788     4999829832   6680903      75739        2725572440
15165825 1220088331202260 C     90086400     2100736      43217920     319      6279     5003778113   28462        70563        2831876066
15165826 1220093334980373 C     121577984    4802560      56598016     857      8714     4999677310   28872        55741        7081100546
15165827 1220098334657683 C     339081728    3565568      86428160     499      9814     7081202852   47905        45680        8622300546
15165828 1220105415860535 C     330521088    7106048      98919936     903      10811    8622406877   22133        4582         12519017260
15165829 1220114038267412 C     204268032    58990592     79695360     1248     9552     12519064508  23334        4602         10409391997
15165830 1220126557331920 S     0            0            0            0        0        10409433593  19369        6400         0
15165831 1220136966765513 O     0            0            0            0        0        0            0            0            0

The only way we've found of increasing the average write size is to set compression=off.
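For anyone wanting to check or toggle this, it's just the compression property (the dataset name below is illustrative, not our real one; note that blocks already written stay compressed until they are rewritten):

zfs get compression,compressratio tank/fs
zfs set compression=off tank/fs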

@crollorc I could not duplicate this problem with a quick test. I had no problem getting about 128KiB average write size as reported by iostat when copying large files to an NFS exported zfs filesystem. When rsyncing lots of small files, the average write size was all over the place, exactly as I'd expect. I'll try a few more tests with varying pool geometries but at the moment, I'm not able to reproduce the problem.

@dweeezil Thanks for trying.

Looking at the NFSv3 traffic, I found that most (70%) of writes were 4K despite having wsize=1048576 in all client mount options:

root@san1:~# tcpdump -r /tmp/dump1|grep -i 'write fh'|wc -l
5555
root@san1:~# tcpdump -r /tmp/dump1|grep -i 'write fh'|awk '{print $10}'|sort|uniq -c|sort -nr|head
   3856 4096
    476 8192
    318 12288
     92 16384
     90 18
     50 16
     44 20480
     30 91374
     24 512
     22 24576

I had a look at the clients and I think the root cause for the 4K writes may be that many of these clients are running Linux and Windows KVM VMs, which I would guess have a 4K block size.

However, I don't understand why iostat avgrq-sz halved and iostat w/s doubled post upgrade.

@crollorc I think your problem boils down to no aggregation. During my simple tests, 8K streaming writes were aggregated nicely. Could you try a simple dd bs=8k if=/dev/zero of=zeroes test both directly and also over NFS to an otherwise quiescent filesystem with compression disabled (even if only temporarily) to verify that aggregation is working at all.
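Something along these lines would do (paths and device names below are illustrative; adjust for your pool and NFS mount):

# directly on the server
dd if=/dev/zero of=/tank/test/zeroes bs=8k count=250000
# the same test from an NFS client mount of that filesystem
dd if=/dev/zero of=/mnt/tank-test/zeroes bs=8k count=250000
# meanwhile, on the server, watch avgrq-sz and wrqm/s for the pool disks
iostat -x 5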

@dweeezil

I think your problem boils down to no aggregation. During my simple tests, 8K streaming writes were aggregated nicely.

One thing I am unclear on: should I see merged writes in iostat for all aggregations, or do they happen inside ZFS? Where exactly should I look for evidence of aggregation?

We ran without compression on 0.6.2 and I noted a large proportion of writes were merged in iostat. Last year we turned on compression and merged writes were significantly reduced but still present. So far on 0.6.5 we see no merged writes at all (prior to these tests), with compression both on and off. You can see this in the graph below for a sample pool member (unfortunately I don't have graphs illustrating merged writes with compression off on 0.6.2).

image

Could you try a simple dd bs=8k if=/dev/zero of=zeroes test both directly and also over NFS to an otherwise quiescent filesystem with compression disabled (even if only temporarily) to verify that aggregation is working at all.

Compression is off at the moment, but I cannot do this on an entirely quiescent filesystem right now.

For the direct dd, I found that the request size jumped to 100-300 for the duration of the test, with intermittent drops (I guess between flushes). There were also merged writes during the test.

image

image

For the NFS dd, I found that the request size jumped to 100-150 for the duration of the test. There were also a few merged writes, but far fewer than for the direct dd.

image

image

@crollorc

One thing I am unclear on: should I see merged writes in iostat for all aggregations, or do they happen inside ZFS? Where exactly should I look for evidence of aggregation?

I was mainly concerned about the aggregation performed within ZFS and was concentrating on the new write throttle as possibly being the cause of your regression because of your write-heavy load. The new write throttle work also included some refactoring of ZFS' aggregation code.

As to the block layer's scheduler merging: since you're using whole disks, ZoL should be disabling the elevator, but it looks like you may have cfq enabled. I don't remember offhand whether this logic has changed over time in ZoL or whether there have been any bugs with it. I'd suggest you check with something like head /sys/block/*/queue/scheduler and, if they're not set to noop, change them all to noop and see what happens.
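For example (sdb below is illustrative; repeat for each pool member):

head /sys/block/*/queue/scheduler
echo noop > /sys/block/sdb/queue/scheduler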

@dweeezil

I was mainly concerned about the aggregation performed within ZFS and was concentrating on the new write throttle as possibly being the cause of your regression because of your write-heavy load. The new write throttle work also included some refactoring of ZFS' aggregation code.

OK, I guess this was demonstrated by the jump in iostat avgrq-sz (versus the dd block size) in the dd results.

As to the block layer's scheduler merging: since you're using whole disks, ZoL should be disabling the elevator, but it looks like you may have cfq enabled. I don't remember offhand whether this logic has changed over time in ZoL or whether there have been any bugs with it. I'd suggest you check with something like head /sys/block/*/queue/scheduler and, if they're not set to noop, change them all to noop and see what happens.

We are already using the noop scheduler.

I wonder whether others using compression=off and a noop elevator also see no write merges, as this was definitely not the behavior we saw on 0.6.2.

In any case, thanks for all your help

@crollorc Can you please clarify whether this problem occurs with 0.6.4? One interesting post-0.6.4 change was the task priority fix. Prior to 0.6.5, the zio write threads ran at a much lower priority (higher raw priority value under Linux) than they were supposed to. If you had a lot of nfsd threads pushing writes to them, it's possible that the ZFS aggregation code had fewer IOs to work with. If you don't see this issue with 0.6.4 but do see it with 0.6.5, it might be due to the task priority adjustment.

Hi,

I'm seeing the same behaviour after updating from 0.6.4 to 0.6.5. Version 0.6.5 is much heavier on the writes being flushed to disks. My observations:

  • Dirty data settings seem to be okay: no throttle is being hit.
  • Before the update there were quite a few merged IOs; after the update, none.
  • Disabling/enabling compression does not change the situation.
  • There are ~4 times more writes being submitted to disk.
  • There is ~2 times more disk utilization (caused by writes).

My pool:

  pool: userdata
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: scrub canceled on Thu May 12 09:19:13 2016
config:

    NAME                                             STATE     READ WRITE CKSUM
    userdata                                         ONLINE       0     0     0
      mirror-0                                       ONLINE       0     0     0
        scsi-35000c50058ab00e7                       ONLINE       0     0     0
        scsi-35000c50058ab3d43                       ONLINE       0     0     0
      mirror-1                                       ONLINE       0     0     0
        scsi-35000c50058ab1e6f                       ONLINE       0     0     0
        scsi-35000c50058ab42ef                       ONLINE       0     0     0
      mirror-2                                       ONLINE       0     0     0
        scsi-35000c50058ab7a8b                       ONLINE       0     0     0
        scsi-35000c50058ab2da3                       ONLINE       0     0     0
    logs
      mirror-3                                       ONLINE       0     0     0
        scsi-SATA_INTEL_SSDSC2BA1BTTV437401F9100FGN  ONLINE       0     0     0
        scsi-SATA_INTEL_SSDSC2BA1BTTV43730834100FGN  ONLINE       0     0     0
    cache
      scsi-35e83a9710005e87d-part1                   ONLINE       0     0     0
    spares
      scsi-35000c500625ff913                         AVAIL

Txgs:

txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
10793211 312084106776788  C     68587008     8612352      20948480     756      4330     10000167442  4250         5997         2692029204
10793212 312094106944230  C     62390272     7127552      23287808     1401     4039     10004438963  7385         9050         3514455600
10793213 312104111383193  C     80222720     10573824     31437312     1148     5861     10000334677  4690         6464         2527310364
10793214 312114111717870  C     61217792     6523392      25288704     1347     5121     10000254737  9817         11140        3950554512
10793215 312124111972607  C     64391168     9314304      40939008     1033     4203     10000733954  6719         12856        4010043119
10793216 312134112706561  C     59781632     24415744     24387072     1285     3270     9999964766   6732         8780         3851472777
10793217 312144112671327  C     44187648     4140544      14866432     786      2452     9999383400   4533         6290         3181786317
10793218 312154112054727  C     60757504     8823296      22792704     1392     3056     10000743534  4368         9225         3982234157
10793219 312164112798261  C     42994176     3391488      15173120     609      2472     10000054193  4357         8645         2739292926
10793220 312174112852454  C     37067776     5884928      14753280     927      2202     9999231541   4567         9956         3228039487
10793221 312184112083995  C     56749568     5754368      20854272     1166     3333     10000896011  4931         7163         3946754118
10793222 312194112980006  C     56281088     8103936      21481984     857      4062     10000885793  4550         8623         3045669876
10793223 312204113865799  C     71583232     4467712      28741632     694      4329     10000099090  7305         9762         3024174973
10793224 312214113964889  C     70732800     6185984      30164992     861      3869     9999279064   4789         7715         3627366045
10793225 312224113243953  C     61769728     17265664     22879232     1494     4572     10000041677  6715         12030        3915833094
10793226 312234113285630  C     70991360     44334080     39017472     1627     5993     10000022878  4590         8579         4292663828
10793227 312244113308508  C     55891456     34114560     20900864     1136     3888     10000849053  6443         7677         3264109345
10793229 312264113423647  C     47859200     30145536     19790848     1883     2812     10000797960  4874         6283         4633259277
10793230 312274114221607  C     182593024    19094528     72361984     1641     10974    9999955309   7216         11589        4878448168
10793231 312284114176916  C     105703424    4952576      33337344     676      5762     9999255379   4655         10300        2943566267
10793232 312294113432295  C     56116224     8543744      20745216     1125     3288     10000125459  4935         7960         3491906603
10793233 312304113557754  C     66479616     6044160      26195456     1108     4850     10001592357  6613         9548         3806978876
10793234 312314115150111  C     54763008     4553216      26543104     582      3274     10000157898  5583         11509        3701671774
10793235 312324115308009  C     42426880     14307840     15518208     1194     2520     9999282329   4727         11780        3694537855
10793236 312334114590338  C     57604608     14051328     21906944     1464     3866     10000930645  5122         9420         4263669008
10793237 312344115520983  C     58293248     5001728      22450688     924      3346     10000983924  11828        8393         3497986043
10793238 312354116504907  C     46852096     7068672      17994240     1316     2770     10001032639  8727         8535         3058301827
10793239 312364117537546  C     42475008     9290240      16683008     901      3047     10000060973  5130         6193         3822290645
10793240 312374117598519  C     51773440     8830464      19445760     1164     3034     10000071321  4729         8658         3780294235

screen shot 2016-06-06 at 11 15 23

screen shot 2016-06-06 at 11 14 43

screen shot 2016-06-06 at 11 06 15

@erikjanhofstede I never got a chance to test 0.6.4, unfortunately.

Did you track iostat avgrq-sz? I found it halved for me with compression=on between 0.6.3 and 0.6.5.

@crollorc I only tested between 0.6.4 and 0.6.5. I don't have any stats from 0.6.3. I tried with and without compression on 0.6.5, but didn't notice any difference by the way.

I'm trying to reproduce the issue on a testing server, but that pool is almost empty and there's no workload either.

@dweeezil If this issue is caused by the new priorities, is it possible that disabling spl_taskq_thread_priority could give us any insight into this?

@crollorc I don't have any graphs when I was updating from 0.6.3 to 0.6.4 many months ago, but there was no performance regression for me as far as I can remember. Only improvements.

@erikjanhofstede in the latest master code zpool iostat has been extended to include latency (-w) and request size (-r) histograms. They provide an excellent view into how well (or poorly) the ZFS I/O pipeline is merging requests _before_ they are submitted to the block device.

If you have a test system available, and since we suspect the thread priority changes, I'd suggest running your test case using the master source and collecting the request size histogram. You can then revert the thread priority changes, 1229323d5f82bcb0525b312c80330968cd41466e, and run the same test to see if there's a difference. I'd be very interested in your results.
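Roughly something like this (assuming the usual in-tree autotools build with a matching SPL already installed, and using the pool name from your earlier zpool status):

# on master, under your test workload: request-size and latency histograms
zpool iostat -r userdata 60 2
zpool iostat -w userdata 60 2
# then revert the thread priority change, rebuild/reload, and repeat the collection
git revert --no-edit 1229323d5f82bcb0525b312c80330968cd41466e
./autogen.sh && ./configure && make -s -j$(nproc) && make install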

@behlendorf I've reverted 1229323 (had to fix some conflicts though). My test consists of running the commands below in parallel; otherwise the server is idle, with an almost empty zpool, unlike the server I posted graphs about earlier.

while true; do dd if=CentOS-7-x86_64-Everything-1511.iso_1 of=CentOS-7-x86_64-Everything-1511.iso_2 bs=4k; done
while true; do dd if=CentOS-7-x86_64-Everything-1511.iso_3 of=CentOS-7-x86_64-Everything-1511.iso_4 bs=4k; done
while true; do dd if=CentOS-7-x86_64-Everything-1511.iso_5 of=CentOS-7-x86_64-Everything-1511.iso_6 bs=4k; done

The results running on master, polled over a period of 60 seconds (zpool iostat -r 60 2):

userdata      sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             9      0      0      0      0      0     25      0      0      0
1K              1      0      6      0      0      0     15      1      0      0
2K              7      0      0      0      0      0     22      2      0      0
4K              1      0      0      0      4      0     24      4      0      0
8K              0      0      0      0      0      0     10      7      0      0
16K             0      0      0      0      0      0      5      4      0      0
32K             0      0      0      0      3      0     11      3      0      0
64K             0      0      0      0     16      0    106     18      0      0
128K            0      0      0      0    321      0  2.13K      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------

userdata      sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            27      0      0      0      0      0     21      0      0      0
1K              1      0      5      0      0      0     15      0      0      0
2K              9      0      0      0      0      0     19      3      0      0
4K              5      0      0      0      4      0     23      3      0      0
8K              0      0      0      0      0      0     10      7      0      0
16K             0      0      0      0      0      0      3      3      0      0
32K             0      0      0      0      3      0     13      3      0      0
64K             0      0      0      0     21      0    122     17      0      0
128K            0      0      0      0    309      0  1.97K      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------

The results running on master, without 1229323, polled over a period of 60 seconds (zpool iostat -r 60 2):

userdata      sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            27      0      0      0      0      0     35      0      0      0
1K              5      0      8      0      0      0     27      2      0      0
2K              8      0      0      0      0      0     25      3      0      0
4K              1      0      0      0      6      0     25      4      0      0
8K              1      0      0      0      0      0     13      9      0      0
16K             0      0      0      0      0      0      7      4      0      0
32K             0      0      0      0      2      0      8      4      0      0
64K             0      0      0      0     12      1     98     20      0      0
128K            2      0      0      0    235      0  2.32K      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------

userdata      sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            77      0      0      0      0      0     41      0      0      0
1K              3      0      5      0      1      0     30      1      0      0
2K             11      0      0      0      1      0     22      4      0      0
4K              7      0      0      0      4      0     30      6      0      0
8K              1      0      0      0      0      0     11      8      0      0
16K             0      0      0      0      0      0     11      3      0      0
32K             0      0      0      0      3      0     17      4      0      0
64K             0      0      0      0     14      0    160     21      0      0
128K            2      0      0      0    192      0  2.13K      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------

@erikjanhofstede thanks for running the test. Based on these results it doesn't look like 1229323 has any real effect.

@behlendorf I'm unable to reproduce the issue I'm having on my testing server, but it's very clearly visible on production servers. Can you think of a safe way to try some things out on a production server? In the meantime I'll continue to try to reproduce the issue.

@erikjanhofstede How about reverting the priority changes on your production server? That should be completely safe; after all, ZoL had the old priority settings for quite a long time. My original suggestion that the priority changes were somehow involved was based on the guess that there was sufficient additional load from other tasks on the system, running at normal user priority (100), which starved the zio threads (running at priority 120 IIRC) and allowed more pending IO to be built up and increased the effectiveness of the coalescing.

@dweeezil I'm running a server with 0.6.5.6 with 1229323 reverted, but that doesn't make any difference. Are there any other commits that I need to revert besides this one to figure out if these new priorities are causing the issue?

How about this pull request? And commit? 62aa81a

Besides reverting these changes, how about just setting spl_taskq_thread_priority=0? Will this revert the changes?

@erikjanhofstede The pull request to which you refer is the one which resulted in 1229323 being committed. It looks like neither 62aa81a nor the related 9dc5ffb were in 0.6.4, either. I had forgotten about the related SPL changes which went along with 1229323 so, yes, in order to restore the old (incorrect) 0.6.4 priorities, you'd need to revert both of them in your SPL as well.

Setting spl_taskq_thread_priority=0 will cause all the zio threads to run at priority 120, which will exactly match most of your user processes, so this won't quite duplicate the behavior under 0.6.4. Here's a bit of further explanation:

You can get a quick display of the priorities of your zio tasks by running ps -eL -o pri_baz,cmd | grep z_. Keep in mind that 120 is the default Linux priority; lower values mean higher priority and higher values mean lower priority. On a 0.6.5 system, all of the zio threads except for the write issue threads run at priority 100, which is a very high priority. The write issue threads run at the slightly lower, but still high, priority of 101. In 0.6.4, the situation was quite different, and it's been a while so I may not be describing it exactly right: I think all the zio threads were running at priority 129 or 130, the very lowest user priority possible, except that the write issue threads were running at a priority value one less (a higher priority). The situation was exactly the reverse of what was desired. The effect, however, was that _all_ the zio threads were running at a much lower priority than typical user processes and could have potentially been starved of CPU. According to my theory, that might have allowed more IOs to build up, with the side effect of making coalescing more effective.
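To illustrate (the /etc/modprobe.d path is just the usual convention; the option only takes effect after the spl/zfs modules are reloaded or the machine is rebooted):

# show current zio thread priorities (120 = Linux default; lower value = higher priority)
ps -eL -o pri_baz,cmd | grep z_
# run the taskq threads at the default priority instead of the 0.6.5 boosted values
echo "options spl spl_taskq_thread_priority=0" >> /etc/modprobe.d/spl.conf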

Jumping on the bandwagon here. Any further developments on this?

Background: Upgraded multiple (5 in total) established systems (running smoothly for a year+) from 0.6.4.2 to the current 0.6.5.7. Immediate noticeable jump in the # of write operations across all systems with corresponding higher latencies. I've attempted to use some of the new 0.6.5 tuning parameters to regain the 0.6.4 performance levels to no avail.

I'll be downgrading the heaviest IO system back to 0.6.4 tomorrow during a maintenance window, but have other systems available for testing. In particular, we have a nearly-identical drbd "mirror" of the system I'll be downgrading. Open to suggestions; the initial plan on the drbd mirror system is to start with a downgrade to 0.6.5.4 (the version reported by the OP) and step back from there until the write IO hopefully drops to the 0.6.4 level.

If a particular patch / tuning parameter(s) has been identified that addresses the above, I would be very interested in that info as well.

Thanks in advance!

@dweeezil @olw2005 Reverting 1229323, 62aa81a and 9dc5ffb didn't make any difference. This issue is still present.

It 'feels' like the metaslab selection process picks different metaslabs to write the data to, compared to 0.6.4.2. Are there any other suspicious commits I can try to revert (safely)?

@erikjanhofstede What is your pool geometry (zpool status)? My testing has also shown that the priorities don't seem to make any difference.

@dweeezil

pool: userdata
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(5) for details.
scan: scrub canceled on Sun Jun 12 04:29:37 2016
config:

NAME                                             STATE     READ WRITE CKSUM
userdata                                         ONLINE       0     0     0
  mirror-0                                       ONLINE       0     0     0
    scsi-35000c50059117303                       ONLINE       0     0     0
    scsi-35000c500626018f7                       ONLINE       0     0     0
  mirror-1                                       ONLINE       0     0     0
    scsi-35000c50062602c3b                       ONLINE       0     0     0
    scsi-35000c500625fb53f                       ONLINE       0     0     0
  mirror-2                                       ONLINE       0     0     0
    scsi-35000c500625ff1e3                       ONLINE       0     0     0
    scsi-35000c500591ae2a7                       ONLINE       0     0     0
logs
  mirror-3                                       ONLINE       0     0     0
    scsi-SATA_INTEL_SSDSC2BA1BTTV335106X9100FGN  ONLINE       0     0     0
    scsi-SATA_INTEL_SSDSC2BA1BTTV335403VE100FGN  ONLINE       0     0     0
cache
  scsi-35e83a9710005e4fb-part1                   ONLINE       0     0     0
spares
  scsi-35000c500591a75bf                         AVAIL

errors: No known data errors

@erikjanhofstede changes to the metaslab code could definitely cause this, and they would more significantly impact older, fuller, more fragmented pools like yours. However, the majority of those changes went into 0.6.3, so I'd have expected both releases to be affected. Still, I'd suggest investigating the following tunables, which are described in zfs-module-parameters(5).

If your system has a few GB of extra memory, I'd suggest starting by setting metaslab_debug_unload=1. This prevents ZFS from unloading any of the metaslabs from memory, which can otherwise generate significant additional IO if they're being unloaded and reloaded constantly. If this resolves the issue, you may be able to fine-tune things using some of the other options.
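You should be able to flip it at runtime and also make it persistent, along these lines (check that the parameter exists and is writable under /sys/module/zfs/parameters/ on your build first):

echo 1 > /sys/module/zfs/parameters/metaslab_debug_unload
echo "options zfs metaslab_debug_unload=1" >> /etc/modprobe.d/zfs.conf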

       metaslab_aliquot (ulong)
                   Metaslab granularity, in bytes. This is roughly similar to  what  would  be
                   referred  to  as  the  "stripe  size" in traditional RAID arrays. In normal
                   operation, ZFS will try to write this amount of data to  a  top-level  vdev
                   before moving on to the next one.

                   Default value: 524,288.

       metaslab_bias_enabled (int)
                   Enable  metaslab  group biasing based on its vdev's over- or under-utiliza‐
                   tion relative to the pool.

                   Use 1 for yes (default) and 0 for no.

       metaslab_debug_load (int)
                   Load all metaslabs during pool import.

                   Use 1 for yes and 0 for no (default).

       metaslab_debug_unload (int)
                   Prevent metaslabs from being unloaded.

                   Use 1 for yes and 0 for no (default).

       metaslab_fragmentation_factor_enabled (int)
                   Enable use of the fragmentation metric in computing metaslab weights.

                   Use 1 for yes (default) and 0 for no.

       metaslabs_per_vdev (int)
                   When a vdev is added, it will be divided into approximately  (but  no  more
                   than) this number of metaslabs.

                   Default value: 200.

       metaslab_preload_enabled (int)
                   Enable metaslab group preloading.

                   Use 1 for yes (default) and 0 for no.

       metaslab_lba_weighting_enabled (int)
                   Give  more  weight to metaslabs with lower LBAs, assuming they have greater
                   bandwidth as is typically the case on a modern  constant  angular  velocity
                   disk drive.

                   Use 1 for yes (default) and 0 for no.

       zfs_metaslab_fragmentation_threshold (int)
                   Allow  metaslabs  to keep their active state as long as their fragmentation
                   percentage is less than or equal to this value.  An  active  metaslab  that
                   exceeds  this threshold will no longer keep its active status allowing bet‐
                   ter metaslabs to be selected.

                   Default value: 70.

       zfs_mg_fragmentation_threshold (int)
                   Metaslab groups are considered eligible for allocations if their fragmenta‐
                   tion metric (measured as a percentage) is less than or equal to this value.
                   If a metaslab group exceeds this threshold then it will be  skipped  unless
                   all  metaslab  groups  within  the  metaslab  class  have also crossed this
                   threshold.

                   Default value: 85.

       zfs_mg_noalloc_threshold (int)
                   Defines a threshold at which metaslab groups should be eligible for alloca‐
                   tions.  The value is expressed as a percentage of free space beyond which a
                   metaslab group is always eligible for allocations.  If a  metaslab  group's
                   free space is less than or equal to the threshold, the allocator will avoid
                   allocating to that group unless all groups in the  pool  have  reached  the
                   threshold.   Once  all  groups  have  reached the threshold, all groups are
                   allowed to accept allocations.  The default value of 0 disables the feature
                   and causes all metaslab groups to be eligible for allocations.

                   This  parameter  allows  to deal with pools having heavily imbalanced vdevs
                   such as would be the case when a new vdev  has  been  added.   Setting  the
                   threshold to a non-zero percentage will stop allocations from being made to
                   vdevs that aren't filled to  the  specified  percentage  and  allow  lesser
                   filled  vdevs  to  acquire more allocations than they otherwise would under
                   the old zfs_mg_alloc_failures facility.

                   Default value: 0.

I'll note, however, that 0.6.4 should have the very same metaslab improvements so it's not something that should have changed between 0.6.4.X and 0.6.5.X.

@dweeezil @behlendorf I'll start with trying to set metaslab_debug_unload=1, see if that makes any difference. Is this something I can do online, or does it require a reboot/re-load of the kernel module?

@erikjanhofstede @dweeezil @behlendorf Spent most of today running through versions of 0.6.5.X on a more isolated system. The problem appears to manifest between 0.6.5.1 and 0.6.5.2 (see graphs below). These are taken from a stand-alone system which acts as storage for video recording. The steady-state IO load is a constant stream of around 1.5 MB/s of writes from a single virtual machine. The zfs server has 4 GB of RAM (and what turned out to be a really low-end controller from Dell) and a single disk (actually a six-disk hardware RAID10 array) presented to zfs. In normal production it runs with sync=disabled, which dramatically lowers the latency.

dtc-nvr_disk_io_zfs_0_6_5_0

dtc-nvr_disk_io_zfs_0_6_5_2_again

@olw2005 what did you use to produce those charts?

@mailinglists35 Cropped from the VMware vsphere client. =)

@olw2005 thanks! You've really narrowed the field of possible causes. Of the 13 patches applied between 0.6.5.1 and 0.6.5.2 only the following few have the potential to have caused something like this. In order of most to least likely:

057b87c Fix synchronous behavior in __vdev_disk_physio()
c1013eb Linux 4.3 compat: bio_end_io_t / BIO_UPTODATE
f4943dc Avoid blocking in arc_reclaim_thread()

My money's on 057b87c as the most likely cause. I suspect you're now unconditionally waiting on a bio where the previous code wasn't, perhaps due to the exact kernel configure check. This may in fact be more correct behavior even if it's not as performant. @olw2005, can you please:

  • Post exactly what kernel version you're using.
  • Check /sys/block/sdX/queue/rotational to verify whether these are reported as rotational or non-rotational disks.
  • Try reverting just 057b87c from 0.6.5.2 (or newer) and see if it resolves the issue (see the sketch below). Reverting any of these patches is safe; the worst case is that it might result in a hang, but even that's doubtful if you never observed an issue with 0.6.5.1.
  • Try reverting the other two suggested patches, one at a time, if reverting 057b87c didn't resolve the issue.

That should give us a very clear idea of what the issue is. Thanks for doing all the heavy lifting on this!
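For reference, a rough sketch of the revert-and-rebuild cycle (the zfs-0.6.5.2 tag name and the autotools steps are assumptions; a matching SPL needs to be installed already):

git clone https://github.com/zfsonlinux/zfs.git && cd zfs
git checkout zfs-0.6.5.2              # or whichever release you're testing
git revert --no-edit 057b87c          # repeat for c1013eb / f4943dc as needed
./autogen.sh && ./configure && make -s -j$(nproc) && make install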

@olw2005 What's your timeline for reverting 057b87c and testing this?

@behlendorf I'd like to revert 057b87c on a production system, but aa159af is causing conflicts. I'm no programmer, and unable to fix this conflict. Can I revert this commit safely too? Thanks :)

@behlendorf @erikjanhofstede Just trying to do my part for the cause. =) Time permitting, I'll try to remove the above patch sets today. Will also be confirming the problem on another (higher end) system today if I can sneak in the downgrade. To answer your questions above:

[root@dtc-sannvr ~]# cat /etc/redhat-release
CentOS release 6.7 (Final)

[root@dtc-sannvr ~]# uname -r
2.6.32-573.12.1.el6.x86_64
(Also occurs with the newer Centos 6.8 kernel 2.6.32-642.el6.x86_64)

[root@dtc-sannvr ~]# cat /sys/block/sda/queue/rotational
1

@erikjanhofstede reverting 057b87c and aa159af will introduce a small race in the IO pipeline which can cause it to stall. Depending on the workloads on your system and its configuration you might never hit this race or you might hit it daily. It's impossible to say. Worst case you'll need to reboot the system.

@olw2005 thanks for the info.

Ok, further testing today. My git-fu is rusty, so for simplicity I used the zfs "point-in-time" as of commit f4943dc, followed by the next commit, 057b87c. The first graph shows performance for zfs commit f4943dc. The second graph shows the tail end of f4943dc on the left and 057b87c on the right. The contrast is readily apparent.

dtc-nvr_disk_io_zfs_commit_f4943dc

dtc-nvr_disk_io_zfs_commit_057b87c

However, the plot thickens... As mentioned earlier, I downgraded one of our higher-spec hardware servers this morning from 0.6.5.7 to 0.6.5.1 hoping to see the same decrease in latency / increase in performance. But instead saw the reverse -- performance of 0.6.5.1 was significantly worse. (Side note, this one also ran very nicely on 0.6.4.2.)

Must be some other commit that has something to do with the different hardware / zfs configuration in this case? I'll increment the version number here as well to try to pin it down, but it will take longer as I need to "sneak it in" outside of business hours at that particular location. Hardware and zfs config details for this server:

2 x Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
192GB RAM
HP P421 RAID controller (2 GB cache)
12 x 3TB SATA 7200rpm disks
zero SSD's

[root@sctc-san2 ~]# cat /etc/redhat-release
CentOS release 6.8 (Final)

[root@sctc-san2 ~]# uname -r
2.6.32-642.el6.x86_64

[root@sctc-san2 ~]# cat /etc/modprobe.d/zfs.conf
options zfs zfs_nocacheflush=1 zfs_prefetch_disable=1 zfs_arc_max=137438953472 zfs_arc_min=103079215104

[root@sctc-san2 ~]# zpool status
pool: pool
state: ONLINE
scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    pool        ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdg     ONLINE       0     0     0
      raidz2-1  ONLINE       0     0     0
        sdh     ONLINE       0     0     0
        sdi     ONLINE       0     0     0
        sdj     ONLINE       0     0     0
        sdk     ONLINE       0     0     0
        sdl     ONLINE       0     0     0
        sdm     ONLINE       0     0     0

errors: No known data errors

[root@sctc-san2 ~]# zpool get all | grep -v default
NAME PROPERTY VALUE SOURCE
pool size 32.5T -
pool capacity 31% -
pool health ONLINE -
pool autoreplace on local
pool dedupratio 1.00x -
pool free 22.3T -
pool allocated 10.2T -
pool readonly off -
pool expandsize - -
pool fragmentation 39% -
pool feature@async_destroy enabled local
pool feature@empty_bpobj active local
pool feature@lz4_compress active local
pool feature@spacemap_histogram active local
pool feature@enabled_txg active local
pool feature@hole_birth active local
pool feature@extensible_dataset enabled local
pool feature@embedded_data active local
pool feature@bookmarks enabled local
pool feature@filesystem_limits enabled local
pool feature@large_blocks enabled local

[root@sctc-san2 ~]# zfs get all pool | grep -v default
NAME PROPERTY VALUE SOURCE
pool type filesystem -
pool creation Wed Sep 2 11:43 2015 -
pool used 6.81T -
pool available 14.2T -
pool referenced 32.0K -
pool compressratio 1.68x -
pool mounted no -
pool mountpoint none local
pool checksum off local
pool compression lz4 local
pool atime off local
pool xattr off local
pool version 5 -
pool utf8only off -
pool normalization none -
pool casesensitivity sensitive -
pool usedbysnapshots 0 -
pool usedbydataset 32.0K -
pool usedbychildren 6.81T -
pool usedbyrefreservation 0 -
pool refcompressratio 1.00x -
pool written 32.0K -
pool logicalused 11.4T -
pool logicalreferenced 9.50K -

@olw2005 Do all of these systems you're testing use zfs_nocacheflush=1?

@dweeezil In general, yes. Not tied to the setting, but if I recall correctly (and it's been a few years since I set it, so I could be wrong) the HP Smart Array controllers we're using (P421s with 2GB of BBWC) perform better without the flush.

However, come to think of it, the graphs and testing above (apart from the last entry) are on Dell systems with different, lower-end controllers. Hmm...

I'd be curious to know some of the block queue settings of the underlying vdevs for anyone seeing this issue. At the very least, /sys/block/<block_device>/queue/max_sectors_kb. Also whether anyone is using large blocks (zpool get feature@large_blocks is "active"). Pinging @olw2005.

@dweeezil /sys/block/*/queue/max_sectors_kb is set to 512 for all zpool devices over here. The large_blocks feature is disabled (I haven't upgraded the zpool).

I've been going over the various related commits: b39c22b, 5592404 and aa159af. The latter two are, respectively, 057b87c and cd887ab on zfs-0.6.5-release.

@olw2005 reported seeing different behavior between 057b87c and its previous commit.

One of the most interesting things I realized about 5592404 (057b87c) is that the new wait parameter to __vdev_disk_physio() will _never_ be set under ZoL because ZoL doesn't use spa_import_rootpool() which means, ultimately, vdev_disk_physio() is never used. This has a couple of effects: first, there will never be any wait_for_completion(&dr->dr_comp) calls and, second, the calls to complete() in vdev_disk_physio_completion(), if they even happen, will not actually be unblocking a waiter.

Prior to 5592404 (057b87c), however, the wait_for_completion(&dr->dr_comp) would happen any time a dio was marked as synchronous, which pretty much means IOs initiated from zio_flush(). In other words, the wait_for_completion(&dr->dr_comp) should have happened relatively frequently.

I think the extra waiting was actually causing the better performance (larger IOs) because it allowed the bio's to stay plugged longer.

As for aa159af (cd887ab), it seems the only reason it actually fixed something is because it effectively eliminated the calls to complete() altogether since it added a requirement that wait be set.

Which brings us back to b39c22b: It converted the bio's to READ_SYNC or WRITE_SYNC depending on their ZIO type in order to effectively bypass the plugging and implicit delay which would otherwise have occurred. This seems like a perfectly reasonable thing to do: generate synchronous bio requests for operations which ZFS, itself, thinks ought to be synchronous.

How about this for a fix: [EDIT] 35a550e 0caf657 (pinging @behlendorf). Totally untested at the moment but posted for feedback. I'll try actually using it later.

Of course I realized right after writing the patch that it's probably not much different than looking at flags in __vdev_disk_physio() with the advantage that it's looking for ZFS-specific flags (the ZIO priority).

Hello,
If this helps: I noticed a similar latency increase after updating ZoL from 0.6.4.2 to 0.6.5.x. This was introduced with commit 37f9dac592bf5889c3efb305c48ac39b4c7dd140. The last commit with the good latency was dca8c34da4212ed85a92111f26bd3a3cd782f270.

What do you think about introducing a module parameter that allows the old mechanism to be used?

Preliminary testing shows that 0caf657 (note the updated commit) seems to do the Right Thing: sync writes from userland do cause sync writes to a slog without any unplug timer expiration. Also, the synchronous blocking (completion waiting) is working again, which, if my theory is correct, will restore the aggregation and the lower latency for which this issue was originally opened.

I was using a 2.6.32-573.22.1.el6 during all my previous testing to be in line with the kernel used by @olw2005.

Next, I decided to try a 4.6.3 kernel and in order to get a baseline behavior, started with current master and reverted aa159af, 5592404 and b39c22b. The behavior described in [EDIT, ugh] b39c22b aa159af did _not_ occur; there were no unplug timer events nor were there any wait_for_completion() calls. Unless some other post-0.6.5.x ZoL change affected things, it seems changes in the kernel's block layer may also come into play.

What kernel version are you using @crollorc? Forgive me if you mentioned it already but this issue is getting very long and it's difficult to search it for kernel version numbers.

@dweeezil We were on 3.2.0-102-generic on Ubuntu 12.04.5 for the performance problems which I described.

I don't see any mention of the kernel being used in #3529 or the commentary in b39c22b. Testing with a stock 3.2 also didn't show any plug/timer-unplug activity on a non-rotational (or rotational) slog device (or even on a normal vdev) so I have a feeling the behavior only occurs under older kernels; likely only the 2.6.X EL kernels.

@olw2005 Your difference between f4943dc and 057b87c is most likely explained by the difference in synchronous operations introduced with 057b87c. Since all your devices were listed as rotational, there will be _fewer_ synchronous writes. Apparently the 2.6.32 EL kernels benefit from the synchronous writes by bypassing the io queue plugging.

Other kernels are likely impacted differently by this family of changes which introduce synchronous reads and writes to the bio layer.

@crollorc Are all your vdevs rotational? I still can't come up with a good explanation as to why you'd see a difference so long as you're running 0.6.5.4 or newer.

@dweeezil Playing catchup here, reading through lots of comments. Are you suggesting trying out that patch (b39c22b)? Would that be 0.6.5.7 + patch? It looks pretty innocuous. I assume that the chance of it munging my data is relatively low? _chuckle_

About the rotational vs. non-rotational. The config / stat dump above is from a smaller server, with all rotational disks. (No SSD SLOG / L2ARC devices.) Our larger servers DO have separate SLOG devices. I gather this could make a difference? Possibly a different patch set is needed depending on drive configuration?

I can run through one build + reboot cycle per day on the all-rotational server. Let me know what patch set you're thinking. And [hopefully it goes without saying but] thank you very much. Your efforts are appreciated! =)

Also not sure if it's still applicable, but here's the info you requested earlier. Pretty much the same on all our servers apart from one which I had not upgraded the "feature set" on (and that one is currently back-rev'd to 0.6.4.2):

[root@sctc-san2 ~]# cat /sys/block/sd[b-m]/queue/max_sectors_kb
512
512
512
512
512
512
512
512
512
512
512
512
[root@sctc-san2 ~]# zpool get feature@large_blocks
NAME PROPERTY VALUE SOURCE
pool feature@large_blocks enabled local

@dweeezil @crollorc @behlendorf The comments above also mention / imply that the older kernel version in CentOS 6 is a contributing factor. Any thoughts on whether the somewhat-less-old CentOS 7 kernel (3.10.0 + backports) might play a little nicer with 0.6.5.x? I've been running a tiny C7 + zfs server at home on 0.6.5 for months without issue, not that I'd even notice a performance decrease given its light usage. These servers will need to be upgraded at some point in the next year or so anyway.

@olw2005 It looks like 2.6.38 was the last kernel to have the unplug timer mentioned in b39c22b that started the series of commits I've referenced. [EDIT: Yet I can see unplug timer events on a 4.5.5 kernel]

The sync-zio-to-sync-bio approach may have been the wrong way to deal with the unplug timer issue in < 2.6.39 kernels. Among other things, it very well may be contributing to the increased latency and smaller request sizes being discussed in this issue.

I'm working on a patch stack to deal with the < 2.6.39 kernel unplug timer issue, to revert the sync-zio-to-sync-bio conversion and also to remove (#ifdef out) the whole root pool import chain (spa_import_rootpool, spa_generate_rootconf, vdev_disk_physio, etc.) for the Linux port.

@dweeezil Sounds like a bit of doing there. Meantime I'll see if I can find a suitable server to test out a newer kernel.

I just pushed 1bfdd1f to dweeezil:syncfixes as a starting point. It _does_ cherry-pick cleanly onto 0.6.5.7 and should restore at least the BIO sync behavior to its pre-0.6.5 state. It also contains a fairly comprehensive commit comment with a bit of rationale.

@olw2005 Did you have the chance to try dweeezil's 1bfdd1f?

@dweeezil Is this a 'production ready' fix? (so it won't mangle my data)

@erikjanhofstede ran it through the buildbot via https://github.com/zfsonlinux/zfs/pull/4835
superficial errors are from whitespace.

three buildbots passed: http://build.zfsonlinux.org/builders/Amazon%202015.09%20x86_64%20%28TEST%29/builds/1738 , http://build.zfsonlinux.org/builders/Amazon%202015.09%20x86_64%20SIMD%20%28TEST%29/builds/826 , http://build.zfsonlinux.org/builders/CentOS%206.7%20x86_64%20%28TEST%29/builds/1754

the other ones errored out in ztest:
http://build.zfsonlinux.org/builders/Debian%208%20x86_64%20%28TEST%29/builds/1691 , http://build.zfsonlinux.org/builders/Ubuntu%2014.04%20x86_64%20%28TEST%29/builds/1787

<-- are those buildbot-related?

zfstests + zfsstress:
http://build.zfsonlinux.org/builders/CentOS%207.1%20x86_64%20%28TEST%29/builds/1750

<-- needs further investigation, or is it known behavior?

anyway - I took the plunge and it's stable so far in production

I'd wait for @dweeezil to confirm or advise against

I've had to step away from this issue for a bit to work on some other stuff. The 1bfdd1f patch _should_ have returned the sync BIO behavior to pre-0.6.5; compared to 0.6.5.7 on a system where no non-rotational vdevs are being used, it should be a noop. Other than that, all it does is remove some code which is dead under ZoL. In other words, it ought to be pretty safe.

I do still intend on coming up with a fix for pre-2.6.39 kernels.

@dweeezil @erikjanhofstede Status update. There's definitely something to the old-kernel theory. I haven't tried the patch out yet, but I just ran a baseline of vanilla 0.6.5.7 using an ELRepo kernel, 3.10.102-1.el6.elrepo.x86_64, vs. the RH 2.6.32. Holy cow, what a difference! The first pic is the 2.6.32 kernel with zfs 0.6.5.7, comparing two different SCST versions (no substantive difference).

dtc-nvr_disk_io_scst_comparison_zfs_0_6_5_7

Second pic is the 3.10 kernel and vanilla zfs 0.6.5.7. Note the significant drop in latency!

dtc-nvr_disk_io_kernel3 10_zfs_0_6_5_7

@dweeezil @erikjanhofstede Third performance graph with the 3.10 kernel and zfs 0.6.5.7 + patch set 1bfdd1f mentioned above. (Also to clarify, the graphs are a vmware client which dumps a more-or-less constant stream of video recordings at ~1.5 MB/s write speed.) It's not the enormous win that the newer kernel version is, but it does appear the patch has a positive impact on write latency / performance -- maybe 20-25% better?

dtc-nvr_disk_io_kernel_3 10_zfs_0_6_5_7_plus_patch

Confirmed: the latency appears to be really down while doing multimedia (watching HD or HQ videos while smoothing them up to 60 FPS) or playing two music streams at once; no to barely perceptible interruptions or latency spikes

while either running zpool scrub or running a metadata-heavy workload (where #4850 might also help quite nicely; looking forward to testing it in real-world usage once the buildbots pass)

@dweeezil thanks for all those changes !

See #4858 for a version of the 1bfdd1f patch which should fix the unplug timer issue with 2.6.32-2.6.38 kernels.

Both #4850 and #4858 LGTM for merging to master. However, it would be great to get some independent confirmation that these two changes resolve this issue. We'll of course want to evaluate cherry picking these fixes back to the release branch. Thanks @dweeezil!

@behlendorf Actually, #4850 was created to solve problems unrelated to this one, related instead to some other work I'm doing for which there is no specific PR currently filed. However, I think it'll fix a whole lot of issues with the typical rsync-interleaved-with-lots-of-data workloads, and I've indicated a few issues in its commit log that it ought to help with.

@behlendorf I'm planning on testing these changes this week.

In order to help this along, I'd like to try to summarize this issue a bit. First off, there are possibly different underlying issues. The series of commits isolated by @olw2005 which add/fix synchronous BIO requests may have nothing to do with the original issue reported by @crollorc or others. Here's a quick summary:

@crollorc You posted the original issue, running a 3.2 kernel under Ubuntu Precise with ZoL 0.6.5.4, and have no nonrot devices in your pool. Since you're running 0.6.5.4, sync behavior _should_ be identical to that of 0.6.4 in that sync BIO requests should never happen and there should be no extraneous completion activity. My various patches, culminating with the #4858 sync fix PR, should _not_, in theory, change anything for your case. If you've tried them, I'd sure like to know whether they had any effect, but it sure seems like they shouldn't have. Based on other subsequent comments, it does seem like it might be worth trying a newer kernel. Since you're running Precise, you can install the Trusty HWE from Ubuntu and get a much newer 3.13 kernel.
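
(For reference, installing the Trusty HWE stack on Precise is roughly the sketch below; the package name is from memory, so treat it as an assumption and double-check against the Ubuntu LTS enablement documentation.)

    # Install the Ubuntu 12.04 hardware enablement (HWE) kernel from the 14.04 stack
    sudo apt-get update
    sudo apt-get install linux-generic-lts-trusty
    sudo reboot
    # After reboot, confirm the new kernel and that the DKMS modules rebuilt against it
    uname -r
    dkms status | grep -E 'spl|zfs'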

@olw2005 Your initial report concerned upgrading from 0.6.4.2 to 0.6.5.7 on a 2.6.32EL kernel. Apparently some of your servers have SSD log devices and some don't. For those that do, 0.6.5.7 _will_ cause some sync BIO requests and the requisite waits for completion and possibly cause some difference in performance. It also sounds like you've got various settings for either or both of sync=disabled and zfs_nocacheflush=1 (which is a whole separate story, BTW) and also have tested different storage controllers; some with and some without BBU (and an unspecified writeback policy set in the controller) so it's a bit hard to tell exactly what's being measured in all cases. I'm a bit confused by this particular posting of yours (emphasis mine):

Third performance graph with the 3.10 kernel and zfs 0.6.5.7 + patch set 1bfdd1f mentioned above. (Also to clarify, the graphs are a vmware client which dumps a more-or-less constant stream of video recordings at ~1.5 MB/s write speed.) It's not the enormous win that the newer kernel version is, but it does appear the patch has a positive impact on write latency / performance -- maybe 20-25% better?

Wasn't the 3.10 kernel the "newer kernel"? Did you really mean that you were running the 2.6.32 kernel with the 1bfdd1f patch? If you were really running 3.10 with the 1bfdd1f patch, I'd expect the results to be identical to the 3.10 kernel without it. If it was really with the 2.6.32 kernel, you should use the newer patch in #4858.

@erikjanhofstede Your initial report concerned an update from 0.6.4 to 0.6.5 which based on a subsequent follow-up is 0.6.5.6. I don't see any mention of the kernel version. Since you _do_ have an SSD log device and apparently aren't changing any of the sync/flushing parameters, the commits isolated by @olw2005 may very well be coming into play and the #4858 patch may very well make a difference. I presume that's what you're going to test so it'll be interesting to hear the results.

@dweeezil I'm running 0.6.5.7 + #4858 for 5 days now, and I don't see any difference in performance. My kernel: 2.6.32-673.8.1.lve1.4.14.el6.x86_64

Below are some charts of the utilization (%) and writes per second made in graphite. The server is running on 0.6.5.7 + #4858 starting from the 20th around 06:30.

screen shot 2016-07-25 at 11 56 54

screen shot 2016-07-25 at 11 55 56

@dweeezil Thanks for the response! I believe the comparison there was between kernel 3.10 and zfs 0.6.5.7 with and without the 1bfdd1f patch.

In general we're using sync=enabled. Sync=disabled was used to improve performance on the aforementioned "test" box, but sync=enabled was set for the above testing / graphs.

As you mentioned, we do have a variety of hardware configurations. I noticed the performance hit on all the systems upgrading from 0.6.4.2 to 0.6.5.7. The "test" system has the advantage of a known "static" I/O load for more accurate comparisons and is one I can easily take down for maintenance. (As an aside, all the systems use controllers with battery-backed write cache. My understanding is the zfs_nocacheflush option is ok in that scenario. In truth I haven't noted much performance difference with it on or off.)
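
(For anyone following along, a quick sketch of checking those two knobs on a running system; the dataset name is just the one from the output further down.)

    # Current sync policy for the zvol (standard/always/disabled)
    zfs get sync pool/nvrstore
    # Whether ZFS is skipping cache flushes to the disks (0 = flushes enabled)
    cat /sys/module/zfs/parameters/zfs_nocacheflush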

Last week I reinstalled the above "test" system with a more optimal drive configuration, zfs v0.6.4.2 and default options for the kernel module (followed by re-importing the zvol with zfs send / recv). It began its life running 0.6.5, so I did not have comparison data for zfs 0.6.4.2 vs. 0.6.5 under a known constant load. I'll post the graphs in a follow-up. But the difference is significant.

@dweeezil @erikjanhofstede @crollorc @behlendorf As mentioned above, I now have direct comparison data for 0.6.4.2 and 0.6.5.7 given a constant I/O load (~1.5 MB / sec write I/O, negligible read I/O). The results are consistent with the performance degradation on other upgraded systems. Short version is: 0.6.4.2 = great, 0.6.5.7 = really not great, 0.6.5.7 with newer 3.10 kernel = somewhere in the middle, 0.6.5.1 with stock kernel = somewhere in the middle.

A few notes on the test system hardware / config: 4 GB RAM, 8 x 4TB SATA disks, Dell RAID controller (lsilogic-based, disks are in JBOD mode), Centos 6.8 (stock kernel), default zfs kernel options. The pool contains a single zvol shared to a single vm (in vmware) via iscsi (scst).

[root@dtc-sannvr ~]# zpool status
pool: pool
state: ONLINE
scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    pool        ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sda     ONLINE       0     0     0
        sdb     ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
      mirror-2  ONLINE       0     0     0
        sde     ONLINE       0     0     0
        sdf     ONLINE       0     0     0
      mirror-3  ONLINE       0     0     0
        sdg     ONLINE       0     0     0
        sdh     ONLINE       0     0     0

errors: No known data errors

[root@dtc-sannvr ~]# zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
pool           5.07T  9.21T    19K  none
pool/nvrstore  5.07T  9.21T  5.07T  -

[root@dtc-sannvr ~]# zfs get all pool | grep local
pool  mountpoint   none  local
pool  checksum     off   local
pool  compression  off   local
pool  atime        off   local
pool  xattr        off   local

[root@dtc-sannvr ~]# zfs get all pool/nvrstore | grep -v default | grep -v -
NAME           PROPERTY     VALUE  SOURCE
pool/nvrstore  volsize      16T    local
pool/nvrstore  checksum     off    inherited from pool
pool/nvrstore  compression  off    inherited from pool

GRAPHS
Write latency is the highlighted graph. Read latency is also depicted, but is inconsequential. Note that because of the (helpful) change in drive configuration, these results are not comparable with the previous ones.

zfs 0.6.4.2, centos 6.8 with stock RH kernel
dtc-nvr_disk_io_separate_os_disk_zfs_8rawdisks_mirror_stock_kernel_zfs_0_6_4_2_2nd_sample

zfs 0.6.5.7, centos 6.8 with stock RH kernel and also with the elrepo 3.10 kernel
dtc-nvr_disk_io_separate_os_disk_zfs_8rawdisks_mirror_kernel_comparison_zfs_0_6_5_7

zfs 0.6.5.7 + patch, elrepo 3.10 kernel = no change vs no patch (this is as expected)
dtc-nvr_disk_io_separate_os_disk_zfs_8rawdisks_mirror_kernel_3 10_zfs_0_6_5_7 patch

Observation: Roughly 20x increase in latency from 0.6.4.2 to 0.6.5.7. The newer 3.10 kernel cuts the latency in half to roughly a 10x increase. The qualitative experience was the same upgrading zfs on our other (main) systems.

I should have more time now to do additional patch testing, other diagnostic commands, etc. Just let me know if I can be of any assistance!

Warm Regards,
OLW

Further update to this issue after scouring the issues list and running tests on various revisions of zfs. I found a number of bug reports along the lines of "performance drop in virtual environment using 0.6.5" which are similar to the OP's and my issue. From these and lots of testing I've identified & verified three distinct sources of this performance drop. The first two have been or can be mitigated. The third issue is mentioned briefly / tangentially in bug report #3754 and that one looks somewhat more intractable. It appears (at least to my layman's eye) to be deeply intertwined with subsequent commits in the zvol portion(s) of the code.

As a reminder, the test case (shown in the graphs below) is a VMware virtual machine. A zvol is presented via iSCSI (SCST) and formatted with VMFS (the VMware filesystem). I/O is only from the one VM, and consists of a continuous write stream of around 1.5 MB/s with sporadic / very light read I/O.

Issue 1. As previously noted above, zfs commit b39c22b. Testing confirms this was later fixed by commit 057b87c. Any zfs version with b39c22b (and without 057b87c) takes a big latency hit.

Issue 2. Spl commit f7a973d (the taskq_dynamic feature). The "fix" for this is mentioned in another issue report.

echo 0 > /sys/module/spl/parameters/spl_taskq_thread_dynamic
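
(To make that setting survive a reboot, something along these lines should work; a sketch, and the conf file name is arbitrary.)

    # Disable dynamic taskqs at module load time
    echo "options spl spl_taskq_thread_dynamic=0" > /etc/modprobe.d/spl.conf
    # Verify after the next module load / reboot
    cat /sys/module/spl/parameters/spl_taskq_thread_dynamic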

Fairly dramatic & instantaneous difference as shown here. Testing with other zfs revisions shows they all behave the same for this specific load case.
avg write latency under 10ms (with the option set to zero)

dtc-sannvr_from_dtc-esx-01s_zfs_git_7a27ad0_8rawdisks_mirror_stock_kernel_19aug2016

Also note, the zfs revision in use here is 0.6.5.0 (commit 7a27ad0) minus the "zvol branch" commits [notably commit 37f9dac] and minus commit b39c22b.

Issue 3. Zfs commit 37f9dac ("zvol processing should use struct bio"). See above graph and graphs below. These are just a representative subset of the overall testing.

0.6.4.2 baseline
avg write latency ~6ms.
dtc-nvr_disk_io_separate_os_disk_zfs_8rawdisks_mirror_stock_kernel_zfs_0_6_4_2_2nd_sample

0.6.5.0 excluding b39c22b, spl_taskq_thread_dynamic=0, zvol branch commits INCLUDED
avg write latency increases to ~100ms.
dtc-sannvr_from_dtc-esx-01s_zfs_git_065_8rawdisks_mirror_stock_kernel_19aug2016

zfs git master branch (commit f26b4b3) minus the original commit 37f9dac and numerous later zvol-related commits that fail to cleanly apply due to interdependencies
avg write latency ~4 ms.
dtc-sannvr_from_dtc-esx-01s_zfs_git_master_f26b4b3_minus_many_zvol_commits_8rawdisks_mirror_stock_kernel_19aug2016

zfs git master branch (as of 22 Aug 2016) with all the trimmings (spl_taskq_thread_dynamic=0)
avg write latency ~75-80 ms
dtc-sannvr_from_dtc-esx-01s_zfs_git_master_22aug2016_8rawdisks_mirror_stock_kernel_19aug2016

I was content to just sit at 0.6.4.2 until this was all sorted out, but I started reading about the various hole_birth issues, whose fixes are contained in the 0.6.5 (and master) branch. (And a checksum comparison of one of my zvols tells me we are indeed affected by one or more of the hole_birth issues.) So now I have an interest in getting this performance issue sorted out sooner rather than later, but I'm not sure where to take it from here. Suggestions? Also, I'm sure I've left something out from the above. Questions, comments, etc. please let me know.

Final note. For reference, here are the other bug reports which mention similar issues: #4065, #4880, #3888, #4042, #4792, #4135, #3902, #4619, #4265, #3754, #4429, #3324, #3950.

Local testing showed no performance degradation from the "zvol branch" commits, but the other two commits mentioned above did make a big difference in local performance. The zvol branch performance issue only seems to express itself when a zvol is shared out via some mechanism: iscsi, fibre-channel, etc. This is echoed in some of the other bug reports listed above.

[EDIT @crollorc] @olw2005 I've not yet fully digested your new information, but am I correct in saying that the comments regarding dynamic taskqs are new information? In the context of the pre-reworked zvol code in which the "zvol" taskq still exists, this might make sense from the standpoint that dynamic taskqs require a certain number of sequential operations to occur before spawning new threads, and that could impact latency.

It sounds like one of the real questions is why the zvol rework of 37f9dac (zvol processing should use struct bio) impacts latency so much for your workload in which the zvol is exported via iscsi. I've never looked into SCST so I've got no idea how it hooks into the zvol block devices. Have you tried running blktrace/blkparse on the zvols in these various situations to see whether there's anything obviously different?

@dweeezil I couldn't find the specific spot that references the dynamic taskq setting but I'm reasonably sure it was in one of the above issue #'s. I think the comment / suggestion of setting it to zero came from @behlendorf so it's not completely new. =)

I don't think there's anything special about SCST. In our case, I'm using block devices (as opposed to file-based devices which invoke the linux page cache). I believe scst uses the normal block device hooks in the kernel (hopefully my terminology is correct here)? The other aforementioned issues covered an array of different software. I think tgtd, lio and virtio were specifically mentioned, possibly some others I'm forgetting. The common themes seem to be zvols with various flavors of 0.6.5 presented to virtual machines / virtual environment via iscsi, fcoe, etc.

Not familiar with blktrace/blkparse, I'll look at the man pages and give them a go. That said, I did manage to compile cherry-picked versions of zfs containing the new iostat -w and -r options (with and without the zvol commits). I didn't see any difference in the -w or -r output between the two cases -- they basically look identical.

I agree, the question is why does 37f9dac have such a pronounced effect.

@dweeezil Ok, block trace data incoming... Would it be worthwhile including @ryao on this? The code in question is his baby; he may be able to shed additional light on things?

Ten seconds each of the zvol block device and sdb (one of the underlying physical disks in the pool - for reference, the pool is 8 x 4 TB drives set up as 4 sets of mirrored disks). First two links are the "bad" performance results. The zvol output looks odd, nothing but merges?

Blktrace of "bad" performance results:
zvol - http://pastebin.com/Ad18277Y
sdb - http://pastebin.com/iTvmbmgh

Blktrace of good performance results:
zvol - http://pastebin.com/zGzUvw6q
sdb - http://pastebin.com/3HRWXd5y

Again as a reminder, the I/O load is ~1.5 MB/s constant write stream. Nothing too taxing. The output is hieroglyphics to me, but based on the file size difference (particularly in the sdb traces) I'd say there's something peculiar going on. (However, I can say that the nvrstore_# processes are from the scst iscsi target software.) Thoughts?

[root@dtc-sannvr ~]# wc -l *_trace*.txt
4622 sdb_trace_2.txt <-- bad perf
1809 sdb_trace.txt <-- good perf
275 zvol_trace_2.txt <-- bad perf
1314 zvol_trace.txt <-- good perf
8020 total
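
(For reference, the traces above were captured roughly like the sketch below; the zvol device name /dev/zd0 is an assumption - substitute whatever /dev/zvol/pool/nvrstore resolves to on the box.)

    # Capture 10 seconds of block-layer events on the zvol and one leaf disk
    blktrace -w 10 -d /dev/zd0 -d /dev/sdb
    # Turn the per-CPU binary traces into readable text plus a binary dump for btt
    blkparse -i sdb -o sdb_trace.txt -d sdb_trace.bin
    blkparse -i zd0 -o zvol_trace.txt -d zvol_trace.bin
    # Summarize queue/driver/device times (Q2Q, I2D, D2C, ...)
    btt -i sdb_trace.bin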

As I'm writing this, it occurs to me that blktrace output from within the virtual machine might be useful too??

Partial btt output shown below for the above two cases, somewhat more intelligible than the raw data. I'm not that familiar with blktrace interpretation, but the btt output does highlight significant differences in the treatment of the I/O's. The "good performance" run spends significantly more time in "I2D" (time spent in the request queue) as opposed to "D2C" (driver and device time).

This is taken from the /dev/sdb "good perf" run (zfs without the 0.6.5 zvol changes):
==================== All Devices ====================

        ALL           MIN           AVG           MAX             N

Q2Q     0.000009794   0.040899125   2.595884926           197
Q2G     0.000000357   0.000001027   0.000016522           198
G2I     0.000000547   0.000001754   0.000002867           198
I2D     0.000000381   0.002006721   0.002950211           198
D2C     0.000124630   0.001236773   0.022486040           198
Q2C     0.000126222   0.003246276   0.022490799           198

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 ( 8, 16)  |   0.0317%   0.0540%   0.0000%  61.8161%  38.0982%
---------- | --------- --------- --------- --------- ---------
  Overall  |   0.0317%   0.0540%   0.0000%  61.8161%  38.0982%

And from the sdb "bad perf" run here (zfs master with all the zvol changes):
==================== All Devices ====================

        ALL           MIN           AVG           MAX             N

Q2Q     0.000021895   0.016768780   1.912321773           512
Q2G     0.000000340   0.000000551   0.000001722           513
G2I     0.000000500   0.000001081   0.000016659           513
I2D     0.000000331   0.000000545   0.000001505           513
D2C     0.000116908   0.000794772   0.033262668           513
Q2C     0.000118730   0.000796950   0.033264640           513

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 ( 8, 16)  |   0.0692%   0.1357%   0.0000%   0.0684%  99.7267%
---------- | --------- --------- --------- --------- ---------
  Overall  |   0.0692%   0.1357%   0.0000%   0.0684%  99.7267%

@olw2005 I'm looking at the blktrace/blkparse output of the zvols now and, before drawing any conclusions, I wanted to mention that the output of blkparse is rather poorly documented in the man page. In particular, the "RWBS" field's output is not fully documented. The "M" in the field stands for "meta" as in "metadata", which means the REQ_META flag has been passed to the bio layer. It's not clear what kernel you're using right now, but at least in 4.6 that flag isn't actually paid attention to by anything. It is set by a few filesystems such as ext4, gfs2 and xfs.

I've not run blktrace on a zvol in a long time and it seems the new zvol architecture only allows blktrace to see the queue request (I'm not sure why right now), but seeing the RWBS field is at least somewhat useful. Reducing the "good" zvol trace to just the queue requests and comparing the two, the only thing that really sticks out to me is that the "bad" trace seems to interleave the requests between CPU cores much more evenly whereas the "good" trace performs a bunch on one CPU core and then a bunch on the other. Also, the "good" trace involved 3 CPU cores whereas the "bad" one only involved 2. The time between queue write requests doesn't look too much different between the two (examined only by eye).

I'm curious what the "nvrstore" processes are. Something to do with SCST?

I'm still looking at the traces of the underlying vdev. The first thing that sticks out to me are all the timer unplugs in the "good" trace, which I find odd because I'd normally associate them with worse performance. Are both these traces using the same kernel? What kernel version?

@dweeezil First, thanks for taking the time. =) Kernel version in use at the moment is stock Centos 6, 2.6.32-642.3.1. As a reminder, the behavior is the same with the newer 3.10 el repo "long term branch" kernel.

You're correct about the "nvrstore" processes. They are part of SCST, named after the "block device" name in its configuration.

The output for the "bad" zvol trace seemed a bit suspect to me (almost like it wasn't capturing everything??), which is why I ran it again at the physical disk level. Note I also included the first portion of the "btt" command output, which consolidates the raw information from blkparse into a more concise form. It shows the "good performance" trace spends about 2/3 of its time in "request queue" whereas the "bad" one spends almost no time there. Not sure what that means exactly, but I guess (??) the "bad" one is dumping everything down to the physical disk just as soon as it receives it?

One other point of interest, I had a go yesterday at unraveling a portion of the changes in commit 37f9dac. My programming skill isn't good enough to tackle zvol.c (not to mention the latest version of that file has changed significantly due to later patches), so I looked at the include/linux/blkdev_compat.h header file. I modified the ifdef checks to revert it back to previous settings (WRITE_BARRIER, REQ_HARDBARRIER, and REQ_FUA). The resulting performance was still bad, so I'm guessing that just leaves zvol.c itself?

This whole discussion makes me wonder if https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140 was the right way to go here. Removing the block layer interfaces has caused some real compatibility issues - qemu, for instance, can't use virtio-scsi in passthrough (LUN) mode without the SG interface, which is now gone - and the latency problem is bad enough to be a blocker to adoption by storage vendors and informed engineers alike looking to run high-performance storage with the consistency and data portability features of ZFS.

Using loop shims isn't ideal, including tcm_loop (SCST just plain doesn't work without the scsi-generic layer), as under heavy load, data errors start to creep in from writes to the "wrong offset." I think I have this documented in an issue from a while back, but the end result is that abusive DB workloads over iSCSI end up in data corruption in the shim layer. Interestingly, LVM doesn't seem to do this - we feed ZVOLs as blockdevs to LXC containers which then build LVM atop them for Cinder volumes. Works great, except for the whole performance issue...

If there are GPL-specific block device related functions, can they be translated via SPL to match the Illumos counterparts such that the ZFS code stays close to Illumos' version and SPL handles the idiosyncrasies of wrapping the Linux side? ZVOLs are a major use case for virtualization, and IMO, getting performance to a point of being able to compete with some of the flash storage abominations flooding the market (don't get me wrong, some are great) is going to significantly help adoption and therefore support/commit-rate.

During the slow test, have you measured the queue depth to the zvol (iostat -x, avgqu-sz)?
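
(For reference, watching that live would look something like the sketch below; zd0 is an assumed zvol device name.)

    # Extended per-device stats every second; avgqu-sz is the average queue depth
    iostat -x zd0 sdb 1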

@richardelling I think blktrace captures that (see links above). According to the manual, read & write depth are "the maximum queue depth ever observed". The output for the good test case shows write depth of 32 at the zvol level.

Unfortunately the "slow" test case was missing stats at the zvol block device level. I recall reading loss of disk stats was a side-effect of the zvol patchset. That said I also recall reading a later patch -- possibly only in the main branch -- that reintroduced those stats. I'll have to check if my "bad" test case included the zvol stats patch. I'm guessing not given the lack of interesting blktrace output at that level.

Side note, I did get valid output at the physical disk level but I'm not sure how useful that is. Per-cpu stats at that level showed a higher write depth for the "slow" test case.

Thoughts, suggestions, comments would be most welcome.

@sempervictus Appreciate your response. If you have any suggested workarounds / etc. I'd be very interested. I would really like to get to the latest version which includes the various hole birth issue fixes. It hasn't directly affected our environment so far as I can tell, but recent testing (shasum of the block device) of my offsite zvol snapshots shows that they have diverged from the primary snapshots. =/

Found the commit (8198d18ca7e05ec30139e10d658a8b06c0ec4c55) that brought back the zvol stats. Unfortunately no stats for me ... the RHEL / Centos 6 kernel fails that configure check. =/

My musings are mostly directed at @ryao. The zvol rework was his baby iirc.

As I understand the zvol problem, it has to do with the zvol being accessed in ranges as a file (essentially) as opposed to operations on multiple files in a dataset. How the intermediate layers of the old implementation amortized this, I'm not sure (though I have a guess or two), or why the new approach bottlenecks... What I do find odd is that the issues only crop up under certain I/O patterns - we have tried blk-mq and schedulers from bfq to deadline on the underlying devices and they rarely hit 20% load when this occurs. Other workloads, like ASM over iSCSI, scream (sync I/O at single depth to block device arrays imported via LIO iSCSI from zvols). I vote for gremlins as the probable cause, and don't yet have a new stable build to benchmark on hardware. Will post back some fio and iostat data when I do.

Been away trying various fio tests to nail down performance variance between 0.6.4.2 and 0.6.5.x without any success as yet.

Anyway, as mentioned in #4880, I've used ryao's 0.6.4.2 backport of essentially the zvol commit 37f9dac592bf5889c3efb305c48ac39b4c7dd140. This was also responsible for the blktrace / blkparse breakage, so only queue commands can be seen.

I'm also running zvols to support VMs under Xen; a full summary is in the issue above. This commit is certainly confirmed as the cause, but it seems very tricky to extricate from the current code line. I was wondering if the passage of I/O requests through the virtualisation layer is exacerbating things with bursts of multiple I/O requests to the zvol, but simple fio tests on quiescent disks with various simple sync / async workloads are not showing this.

It is however painfully evident during our benchmarking. If I can help by applying any patch for this commit let me know, and I'll feed back results.

Getting back into this issue, I finally backed off my test system to a sufficiently old kernel (3.2.0) that allowed me to run these older revisions of spl/zfs. I just finished running a quick baseline test against a 10GiB zvol on a pool consisting of 4 10GiB tmpfs-backed vdevs. The SPL is at 4fa4cab which brings it to the time frame in which the zvol rework of 37f9dac was done.
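
(For anyone wanting to reproduce a similar setup, roughly the sketch below; the mount point and tmpfs size are assumptions, while the pool/zvol names match the fio job that follows.)

    # Back the pool with sparse files on tmpfs so disk speed is out of the picture
    mount -t tmpfs -o size=48G tmpfs /mnt/ramdisk
    truncate -s 10G /mnt/ramdisk/vdev{0,1,2,3}
    zpool create tank /mnt/ramdisk/vdev0 /mnt/ramdisk/vdev1 /mnt/ramdisk/vdev2 /mnt/ramdisk/vdev3
    # 10GiB test zvol used as the fio target
    zfs create -V 10G tank/v1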

The test is using fio with this configuration:

[test]
    blocksize=8k
    scramble_buffers=1
    disk_util=0
    invalidate=0
    size=256m
    numjobs=32
    create_serialize=1
    direct=1
    filename=/dev/zvol/tank/v1
    offset=0
    offset_increment=256m
    group_reporting=1
    ioengine=libaio
    iodepth=10
    rw=write
    thread=1
    time_based=1
    runtime=1m
    fsync=0
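
Saved as, say, zvol-write.fio (the filename is just an example), the job is run as:

    # Requires root to open /dev/zvol/tank/v1 with O_DIRECT
    fio zvol-write.fio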

With ZFS at 782b2c3, right before the zvol rework, this test yields 30.9K IOPS with an average latency of 10.3ms. Advancing ZFS to 37f9dac (applying the base zvol re-work) brings the IOPS up to 34.9K and drops the average latency to 9.1ms. Not a huge difference but in every regard, the performance is _better_ with the new zvol code.

My guess is that anyone seeing a performance regression due to the zvol rework is using a consumer of the zvols that may not be interfacing with them in an obvious manner. The common thread here appears to be that various iSCSI targets are involved or something else in-kernel is in front of the zvol. I ran some similar tests on my laptop in which I configured LIO to provide the zvol as a target and _did_ get worse performance across LIO, but it wasn't a very controlled experiment.

I'm going to run a more extensive battery of tests now that I have the system set up correctly but for normal native zvol access by an application, I'm not seeing any performance regression at all between 37f9dac and 782b2c3. In particular, I intend on configuring LIO and trying some tests over that path.

@dweeezil Yep that comports with what I've seen on my own systems and read from other issue reports here on github. Local performance of a zvol seems to be as good or better. [Note, there were a couple of other performance issues in the early 6.5 releases that were later addressed.] All the cases with performance degradation from 0.6.4.x --> 0.6.5.y [at least every one that I have found] involve iscsi or other consumers of the zvol.

@dweeezil This is just a shot in the dark but could this be at all related to pull request #5181 ?

@dweeezil: If you're looking to see interface problems with consumers, try SCST or qemu passthru SCSI volumes as LUNs. As I mentioned in the zvol issue, this behavior seems worst under non-sync sustained writes; fio with direct I/O should not hit that use case.


@olw2005 w.r.t. #5181, that involves ZoL's use of disk vdevs. It doesn't have anything to do with zvols (other than the way ZFS might write to the underlying storage when zvols are written to). I did start looking into the iblock LIO module at the OpenZFS developer conference with @richardelling, who is also seeing this issue in some places. It was interesting to note that iblock_submit_bios() does, itself, do explicit plug/unplug, which it would seem ought to help aggregation at the expense of potentially a bit more latency. My next set of tests will be with the zvol exported via LIO. I'll see about testing with qemu and passthrough SCSI as well.

@dweeezil I watched the live stream of day 1. Very exciting, lots of good stuff coming. =) If I can assist in any way or if you happen to find any workaround / hack / fix / patch / etc. please l.m.k.

Also you may already know this, but just in case... If you are testing early 0.6.5 versions you'll want to revert / exclude commit b39c22b [reverted ~ 0.6.5.2] and manually set spl_taskq_thread_dynamic=0 [made the default ~ 0.6.5.4-ish??]. Both of those have a pronounced impact on performance, and definitely clouded my testing results early on.

There's been so much creep in this issue that I'm not sure where to go at this point to help push it along. There are numerous combinations of kernels, zvol vs. no zvol, NFS vs. no NFS, iSCSI vs. no iSCSI, SCST or LIO, etc. etc. The common thread appears to be a decrease in average block size written to the leaf vdevs and also higher write latency.

I would like to suggest that anyone experiencing a post-0.6.4 regression of this sort install the recently tagged 0.7.0 release candidate. It's currently at 0.7.0-rc1 and should be pretty easy to build packages for most distros. Beyond all the other fixes and improvements, the big attraction for this type of problem is the vastly improved zpool iostat capability, which is perfect for troubleshooting this type of issue. For those who've not seen it, check out 7e94507 and 193a37c for some details. This new detail, combined with that from the regular iostat, ought to be very useful. Among other things, it now gives a view of the ZFS-provided aggregation which has existed all along but which has been impossible to monitor. Unfortunately, I don't know of anyone trying to port this to 0.6.5-release; hence my suggestion to try 0.7.0-rc1.
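
(As a concrete example of what that looks like on 0.7.0-rc1; flags are from the commits referenced above, so double-check the zpool(8) man page on your build.)

    # Per-vdev latency histograms (total/disk/queue wait)
    zpool iostat -w pool 5
    # Request size histograms, including the ZFS-level aggregation columns
    zpool iostat -r pool 5
    # Average latencies per vdev alongside the usual bandwidth/ops columns
    zpool iostat -v -l pool 5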

Finally, I'd like to suggest that anyone trying 0.7.0-rc1 and finding a regression please open a new issue and reference this one.

From my standpoint, now that I've got a large test system running a sufficiently old kernel, I will complete a set of tests to compare some write loads between 0.6.4 and 0.6.5 and also to see how much difference it makes when using zvols between accessing them directly versus accessing them via LIO.

I'm not sure what to make of this but figured I would post it as it's an interesting data point and perhaps someone here can make more sense of it. (Same setup as previously outlined, vmware virtual machine --> raw device map [e.g. passthrough] or virtual vmfs disk [qualitatively the same either way] --> scst iscsi target --> zvol.) Running the following as an additional I/O stream from within the client vm:

while true; do sleep .005 ; dd if=/dev/zero of=foo2 bs=8k count=1 oflag=sync 2>/dev/null; done

Basically this injects a small amount of sync I/O into the mix. But it also has the net effect of lowering the disk write I/O latency from the client side down to a "normal" number. As you might expect, it varies linearly with the amount of sleep time.

Moreover, I ran the same on our production video recording server (also mentioned above, normal operation generates ~1.5MB/s of continuous write I/O) and saw the same thing. Latency went way down (from ~100+ ms to ~2 ms) and video playback (generating a read I/O stream) was no longer choppy. Note, I tried this with zfs versions 0.6.5.8, 0.7.0rc1 and 0.6.4.2. Both 0.6.5.8 and 0.7.0rc1 show a massive latency reduction; 0.6.4.2 showed no change, it was low latency either way.

Again, not sure what the heck it means, but it's definitely interesting.

Hi. My first post here. I've been using ZOL on 50+ servers for several years now, great filesystem thanks for all your hard work.

I am stuck on 0.6.4.2, as with anything later the performance has made it unusable. They are all CentOS 6.8, mainly used by rsync to store backups on, but also some Samba shares. I have also tried CentOS 7 with the same results. 0.6.4.2 is great for performance, but it would be great to be able to upgrade.

Is this thread being looked at? It hasn't been updated for months.

Thanks.

@markr123 Just as an FYI. I finally punted on this issue and switched our VM(ware) storage environment from zvol + iscsi to zfs + nfs v4. The latter configuration does not suffer the performance penalty in the newer zfs 0.6.5+ versions, enabling an upgrade from 0.6.4.2.

I gather the bit of [zvol] code that changed here is deeply embedded / intertwined at this point, making it difficult to undo?
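
(In case it helps anyone else making the same switch, the NFS side can be as simple as the sketch below; the dataset name and client subnet are assumptions.)

    # Share a filesystem dataset instead of exporting a zvol over iSCSI
    zfs create pool/vmstore
    zfs set sharenfs="rw=@192.168.10.0/24,no_root_squash,sync" pool/vmstore
    zfs set atime=off pool/vmstore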

@markr123:
I recently performed some Iometer performance tests with 0.7.0-rc3 and 0.6.5.9 with reverted:
37f9dac592bf5889c3efb305c48ac39b4c7dd140

The results for 0.7.0-rc3:

Test name | Latency | Avg iops | Avg MBps | cpu load
----------|---------|----------|----------|---------
Random Write | 44.00 | 273 | 0 | 2%
Random Read | 32.30 | 371 | 0 | 1%
VDI | 30.70 | 390 | 1 | 2%
Exchange 2003 | 29.05 | 413 | 1 | 2%
Exchange 2007 | 31.06 | 386 | 3 | 1%
OLTP | 29.00 | 413 | 1 | 1%
SQL Server | 34.91 | 343 | 5 | 1%
CCTV | 15.45 | 776 | 24 | 1%
Database | 19.00 | 631 | 4 | 1%
Fileserver | 24.23 | 495 | 15 | 1%
Media Streaming | 12.47 | 962 | 60 | 3%
WEB Server | 0.62 | 19364 | 75 | 16%
Sequential Read | 19.46 | 616 | 616 | 23%
Sequential Write | 26.62 | 450 | 450 | 7%

The results for 0.6.5.9 with reverted 37f9dac592bf5889c3efb305c48ac39b4c7dd140:

Test name | Latency | Avg iops | Avg MBps | cpu load
----------|---------|----------|----------|---------
Random Write | 0.33 | 36728 | 17 | 5%
Random Read | 0.31 | 38568 | 18 | 5%
VDI | 0.28 | 42762 |167 | 7%
Exchange 2003 | 0.28 | 43511 | 169 | 7%
Exchange 2007 | 0.29 | 41223 | 322 | 7%
OLTP | 0.28 | 42798 | 167 | 7%
SQL Server | 0.32 | 37013 | 578 | 6%
CCTV | 0.55 | 21850 | 682 | 5%
Database | 0.28 | 42273 | 330 | 7%
Fileserver | 0.46 | 26333 | 822 | 5%
Media Streaming | 0.99 | 12097 | 756 | 3%
WEB Server | 0.35 | 33869 | 132 | 5%
Sequential Read | 18.90 | 634 | 634 | 1%
Sequential Write | 17.35 | 691 | 691 | 3%

There is a huge performance drop introduced by 37f9dac592bf5889c3efb305c48ac39b4c7dd140.

@ab-oe What were the details of the system you performed this benchmark against?

I've reported similar issues traced back to this same commit on a Xen virtualised system, just wondered if your setup was easier / harder to re-create for further investigation?

@koplover The system with ZoL ran on Debian Jessie with a mainline 4.4 kernel. For the iSCSI target I used SCST. The initiator was a machine running Windows Server 2012 R2.
There was one pool with a zvol exported over iSCSI, both created using the default settings. The zfs and spl module parameters remained unchanged. The pool consists of 4 mirror groups, each containing one physical drive. No read cache / write log devices. The initiator machine was connected to the target over 10GbE.

@ab-oe could you please run the same tests using PR #5824. This PR reverts some key parts of commit 37f9dac which introduced this performance issue.

@behlendorf thank you. It is better now but still much worse than on the older version, especially when random I/O is performed.


0.6.5.9 without 37f9dac

Test name | Latency | Avg iops | Avg MBps | cpu load
----------|---------|----------|----------|---------
Random Write | 0.33 | 36550 | 17 | 19%
Random Read | 0.30 | 40192 | 19 | 25%
VDI | 0.28 | 42673 | 166 | 27%
Exchange 2003 | 0.28 | 43330 | 169 | 31%
Exchange 2007 | 0.29 | 41471 | 323 | 33%
OLTP | 0.28 | 43432 | 169 | 32%
SQL Server | 0.33 | 36424 | 569 | 36%
CCTV | 0.54 | 22053 | 689 | 25%
Database | 0.30 | 39844 | 311 | 33%
Fileserver | 0.46 | 26113 | 816 | 39%
Media Streaming | 1.06 | 11321 | 707 | 30%
WEB Server | 0.36 | 33078 | 129 | 27%
Sequential Read | 18.65 | 643 | 643 | 24%
Sequential Write | 17.96 | 667 | 667 | 10%


0.7.0-rc3 with patches

Test name | Latency | Avg iops | Avg MBps | cpu load
----------|---------|----------|----------|---------
Random Write | 5.26 | 2286 | 1 | 0%
Random Read | 2.40 | 5006 | 2 | 1%
VDI | 3.09 | 3882 | 15 | 1%
Exchange 2003 | 1.57 | 7659 |29 | 2%
Exchange 2007 | 0.52 | 23168 | 181 | 6%
OLTP | 0.32 | 37381 |146 | 7%
SLQ Server | 0.42 | 28724 | 448 | 5%
CCTV | 0.89 | 13484 | 421 | 4%
Database | 3.28 | 3661 |28 | 1%
Fileserver | 0.51 | 23634 |738 | 6%
Media Streaming | 1.62 | 7385 |461 | 3%
WEB Server | 0.34 | 35323 | 137 | 6%
Sequential Read | 19.02 | 630 | 630 | 1%
Sequential Write | 23.79 | 504 | 504 | 2%

@behlendorf Little update: I performed several runs of this test with ZoL 0.7.0-rc3 plus #5824, and it seems that performance gets better with each run, which didn't happen before. In the stable state the results look as follows:


0.7.0-rc3 with patches

Test name | Latency | Avg iops | Avg MBps | cpu load
----------|---------|----------|----------|---------
Random Write|0.31|38226|18|8%
Random Read|0.31|38313|18|6%
VDI|0.29|42025|164|10%
Exchange 2003|0.37|32147|125|6%
Exchange 2007|0.35|34291|267|8%
OLTP|0.34|35158|137|7%
SQL Server|0.40|30173|471|7%
CCTV|2.02|4731|147|2%
Database|0.32|37510|293|7%
Fileserver|0.59|20317|634|5%
Media Streaming|1.45|8249|515|3%
WEB Server|0.34|35353|138|6%
Sequential Read|18.33|654|654|1%
Sequential Write|24.06|498|498|1%

This looks quite good except for sequential write, which still suffers from a performance loss. Any clue why the performance on the old version is good and stable all the time, while 0.7.0-rc3 + #5824 needs some warm-up?

@ab-oe thanks for collecting this data.

My best guess as to why the sequential write performance is down compared to v0.6.5.9 is that the updated patch takes a range lock for correctness when updating the volume. So depending on the workload you might see some additional contention. The previous code allowed IOs to potentially complete out of order, which wasn't strictly correct.

As for why a warm up time is required, that's an interesting observation. How long would you say it took to warm up? Did you happen to observe arcstat.py during the testing and notice a change in caching behavior or hit rate? There were massive changes in the ARC between 0.6.5.9 and 0.7.0-rc3, including keeping it compressed in memory and breaking it up so it's all scatter-gather IO.
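
(Something like the sketch below would capture that on the next run; the field list is from the arcstat.py bundled with ZoL, so treat the exact names as an assumption.)

    # Sample ARC hit/miss behaviour and ARC size every 5 seconds during the benchmark
    arcstat.py -f time,read,hits,miss,hit%,arcsz,c 5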

@behlendorf thank you for the explanation. The strange thing is that the test file for Iometer is created before every test run, so the test data differs from the previous run.

How long would you say it took to warm up?

The third run gave me the first stable results. Further runs gave me similar results.

Did you happen to observe arcstat.py during the testing and notice a change in caching behavior or hit rate?

Unfortunately, no. When I get back this environment for test I'll check it.

My 5 cents:
Recently I've run such tests and saw that at the beginning of a write test (0.7.0-rc3) there are a lot of reads:
I ran fio with a write-only job and watched the HDDs with iostat. For the first minute or two, there were a lot of reads (but fio issues only writes!!!).
With 0.6.4.2/0.6.5.9 there is no such behaviour.
Sadly, I didn't look at arcstat during these tests.

@redtex For existing volumes the reads are likely caused by partial block updates. Although this behavior would be the same between 0.6.x and 0.7.0-rc3.

@behlendorf I think it's not partial updates, because the zvol blocksize=4k and the fio blocksize=4k. No compression. And I want to say again, the reads are present ONLY with 0.7.0-rc3. I can make a short video if it helps.

@redtex another possibility is the zvol prefetch. The first and last 128k of a zvol will be prefetched when it's added to the system to make it more likely that blkid will get a cache hit when probing the volume. You could try setting zvol_prefetch_bytes=0 and see if that eliminates the behavior for the benchmarks.
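
(For reference, that would look something like the sketch below; whether the parameter is writable at runtime on a given build is an assumption - otherwise set it via /etc/modprobe.d and reimport.)

    # Disable zvol prefetch-on-attach (affects zvols added/imported after the change)
    echo 0 > /sys/module/zfs/parameters/zvol_prefetch_bytes
    cat /sys/module/zfs/parameters/zvol_prefetch_bytes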

Writing from phone, so don't have the PR # in front of me, but I have tried to address some of these issues at the Linux block layer with decent results. Ryao has said that we need to implement a fix lower in the stack for correct operation. Personally, I think this is a first order class of bug - zvols are too unpredictable and overall slow to compete with conventional storage for iscsi and direct image bases on virtual hosts, which is probably the most common use case in general purpose IT for them. Performance regressions cost consumer confidence, which gets even more hurt when they are ignored for a year... This really needs to be fixed properly and permanently.

I am seeing the performance issue across CentOS 6 and CentOS 7 and with no zvols, just rsync writing files in and also Samba writing files. On a production system it has made the 0.6.5 branch unusable as it often hangs. We have not been able to upgrade past 0.6.4.2, whose performance is great. I am desperate to upgrade for better stability but cannot due to the performance.

Over here it's the same as @markr123 CentOS6, no zvol, any upgrade from 0.6.4.2 to 0.6.5 results in a dramatic write performance degradation. You can see my findings in this thread, but if you need any more info/testing I'll be happy to assists you in this.

Hi. This problem/ticket has now been open for over a year and the performance issue is stopping most if not all people from upgrading past the 0.6.4.2 version.

Is this problem being looked at or worked on? Anything I can send you? as it seems to have stalled.

A new version of ZOL may come out but if this problem is not fixed there is no point as no one can upgrade to it.

thanks for all your hard work on ZOL. It's the best filesystem out there for Linux.

I believe @ryao is currently digging into zvols at work. There may be a light at the end of this tunnel yet. Plus, people can't stay on 0.6.4 for compatibility reasons.
So many features are getting added that imports on even slightly outdated systems don't work RW.
Consumers are upgrading, realizing there's a problem, trying btrfs, failing, trying other things, failing, and coming back to zfs angry (seen this a few times). Users expect everything to improve by orders of magnitude with each release and work by way of an easy button which lets them tune all the details without knowing them or having to actually tune them... This pressure goes up to their IT staff, and the buck sorta stops there. It's a major perception problem because they don't realize this isn't a commercial effort (on Linux), and expect commercial-grade function for free...
My $0.02 is that we can't have performance regressions in releases; it qualifies as a first-order bug at the same level as data integrity. Doesn't have to improve, but can't get worse. The fallout lasts years, and with the shaky history behind ZFS in the eyes of corporate decision makers, we need many years of solid releases to build confidence.
The fact that we are out of tree makes that many times harder - Linus can do something silly and ZoL just has to adapt. Past that case, in our own code, we should be testing our ops for time and resource consumption like we test our features/data for return values, to have consistent metrics across commits, since a change in the SPA could cause the DMU to wait or memory usage to spike elsewhere for buffers, and only testing the functions changed by the commit wouldn't expose those metrics. It's not a small problem to address, but probably doable, with enough blood, sweat, and profiling mechanisms.

@sempervictus agreed, ZFS needs a kind of "Phoronix Test Suite" for filesystems that tracks performance (and latency?) regressions and improvements regularly.

The buildbots & tests will test integrity and functionality, but I guess asking for regular performance- or latency-related tests (for each commit) would be too much and not entirely suitable?

Dedicated runs on servers - from Raspberry Pi and Pi2 ("little", small systems) to server farms (Lustre, etc.) - would be much better suited to catch "corner", scaling and other cases.

@sempervictus agreed. We need additional automated testing which is run regularly in order to prevent performance regressions.

ZFS needs a kind of "Phoronix Test Suite" for Filesystems

I think that would be an entirely reasonable place to start. If we can put together a meaningful battery of ZFS performance tests for the PTS which take less than 6 hours to run they could be added to the buildbot and run regularly. Having a predefined suite of tests anyone can easily run to benchmark their hardware would be convenient. @kernelOfTruth if you're already familiar with the PTS would you be interested in putting something together?

Hi, Is there any progress with this? Sorry to chase but I would really like to upgrade to the new versions of zfs but am stuck on 6.4.2 until performance issues are fixed.
Thanks.

You might be interested in giving 0.7.0-rc4 a try, it has various performance enhancements. There are also various ZVOL/ZIL improvements currently in the works c/o @ryao. Not sure if any of them are relevant to your usage case. Feel free to look through the list of current pull requests and give some of those a try as well.

thanks for coming back so quick on a Sunday.

my main problem has been lots of rsync writes coming in at once to zfs. 6.4.2 handled these really well, but from 6.5 onwards the cpu usage went high and the rsyncs and all filesystem writes blocked. samba writing was also stalled.

Anybody who's experiencing this issue already tried the 0.7.x release?

I do, with HEAD at 376994828fd3753aba75d492859727ca76f6a293 and a RHEL6 kernel.

12 1 0x01 92 25024 8325332541 4337538365561
name                            type data
hits                            4    11106114
misses                          4    5791566
demand_data_hits                4    5166935
demand_data_misses              4    1052871
demand_metadata_hits            4    1997434
demand_metadata_misses          4    4092990
prefetch_data_hits              4    2495568
prefetch_data_misses            4    348742
prefetch_metadata_hits          4    1446177
prefetch_metadata_misses        4    296963
mru_hits                        4    1703164
mru_ghost_hits                  4    0
mfu_hits                        4    7881109
mfu_ghost_hits                  4    0
deleted                         4    169
mutex_miss                      4    0
evict_skip                      4    47
evict_not_enough                4    0
evict_l2_cached                 4    0
evict_l2_eligible               4    2530816
evict_l2_ineligible             4    0
evict_l2_skip                   4    0
hash_elements                   4    2100829
hash_elements_max               4    2101303
hash_collisions                 4    156175
hash_chains                     4    63727
hash_chain_max                  4    4
p                               4    67573442560
c                               4    135146885120
c_min                           4    33554432
c_max                           4    135146885120
size                            4    87287478000
compressed_size                 4    77323989504
uncompressed_size               4    123392078848
overhead_size                   4    5172511232
hdr_size                        4    713580208
data_size                       4    75959115264
metadata_size                   4    6537385472
dbuf_size                       4    902585728
dnode_size                      4    2319490368
bonus_size                      4    855320960
anon_size                       4    473503232
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    41532464128
mru_evictable_data              4    36605826560
mru_evictable_metadata          4    419079168
mru_ghost_size                  4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_size                        4    40490533376
mfu_evictable_data              4    38751274496
mfu_evictable_metadata          4    407086080
mfu_ghost_size                  4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_hits                         4    0
l2_misses                       4    5791532
l2_feeds                        4    8026
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    80685456896
l2_writes_sent                  4    7791
l2_writes_done                  4    7791
l2_writes_error                 4    0
l2_writes_lock_retry            4    42
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    6895
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    103058641920
l2_asize                        4    66307313152
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    264192
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    11328362736
arc_meta_limit                  4    101360163840
arc_dnode_limit                 4    10136016384
arc_meta_max                    4    11353754696
arc_meta_min                    4    16777216
sync_wait_for_async             4    22570
demand_hit_predictive_prefetch  4    234743
arc_need_free                   4    0
arc_sys_free                    4    4223340160
23 0 0x01 86 9632 12311035077 4362579253819
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
13884744 4164236604749    C     36271104     123392000    61915136     1123     1778     2110737401   6910         1941900771   2215539012  
13884745 4166347342150    C     35784192     118595584    60448768     1094     1778     1941919728   6631         2215550371   1876709958  
13884746 4168289261878    C     63946752     109617152    75390976     1010     2069     2215566746   5264         1876714240   2504001285  
13884747 4170504828624    C     45242880     133758976    43941888     1206     1775     1876725141   4365         2504018775   2415840671  
13884748 4172381553765    C     26675712     131870720    32587776     1212     1640     2504031394   6207         2415853093   2484338338  
13884749 4174885585159    C     23851520     107413504    28540928     1009     1945     2415870320   5403         2484350763   2328118255  
13884750 4177301455479    C     49197568     120823808    47751168     1160     1851     2484366136   6345         2328119881   2555925449  
13884751 4179785821615    C     34044416     120647680    35971072     1030     1630     2328134325   6501         2555932784   1866568996  
13884752 4182113955940    C     36459008     139554816    37232640     1200     1915     2556038602   5064         1866484048   2234865186  
13884753 4184669994542    C     29173760     125607936    32129024     1070     1632     1866505692   6882         2234860797   1936573602  
13884754 4186536500234    C     29129728     116928512    39550976     1013     1655     2234879641   5388         1936584095   2055491295  
13884755 4188771379875    C     72613888     157794304    56950784     1347     2109     1936598521   6485         2055498967   2703213206  
13884756 4190707978396    C     57694720     203923456    59187200     1702     1929     2055557736   6087         2703183238   2910482353  
13884757 4192763536132    C     41032704     152723456    52813824     1295     1676     2703199501   6481         2910483823   2012891445  
13884758 4195466735633    C     106031616    132505600    85508096     1130     2197     2910517873   5054         2012876880   2258632437  
13884759 4198377253506    C     29989376     142020608    31498240     1217     1490     2012887710   5152         2258648842   2024240489  
13884760 4200390141216    C     39697920     111620096    40534016     1055     1967     2258683609   5520         2024231689   2771066342  
13884761 4202648824825    C     100624384    105779200    77398016     1039     2486     2024245121   6332         2771080493   2276465887  
13884762 4204673069946    C     52517376     118792192    56631296     1027     2295     2771093324   4882         2276485176   1815269833  
13884763 4207444163270    C     89269760     179376128    80846848     1474     2303     2276499597   6223         1815274214   2528415135  
13884764 4209720662867    C     86455808     153219072    71598080     1296     2304     1815287497   5182         2528422794   2096165541  
13884765 4211535950364    C     50241536     112902144    47398912     1055     2062     2528434602   5115         2096181106   2575307696  
13884766 4214064384966    C     48568832     127492096    53805056     1142     1758     2096197167   5995         2575320771   2110255466  
13884767 4216160582133    C     41707520     150327296    43040768     1442     1704     2575335623   4913         2110260731   2743316802  
13884768 4218735917756    C     54598144     115286016    51937280     1186     1652     2110275185   6364         2743328586   2166999266  
13884769 4220846192941    C     38547968     82055168     41779200     894      1915     2743356757   5090         2166992049   2345322866  
13884770 4223589549698    C     29343744     134176768    35012608     1233     1700     2167005958   5656         2345328367   1993619416  
13884772 4228101897894    C     36727808     140693504    37928960     1218     1483     1993648727   8629         2162407073   1855508114  
13884773 4230095546621    C     23354368     136331264    31248384     1533     1467     2162423513   5743         1855514720   2420886278  
13884774 4232257970134    C     72283136     127410176    54194176     1500     1873     1855528761   4571         2420896618   2809874073  
13884775 4234113498895    C     50669056     122269696    69353472     1092     2079     2420909258   7177053      2802708436   2118281111  
13884776 4236534408153    C     35153920     142364672    61358080     1222     1777     2809893566   6081         2118295890   2007734999  
13884777 4239344301719    C     47450112     140697600    63119360     1246     1830     2118320919   14139        2007720866   2196866389  
13884778 4241462622638    C     56240640     143757312    80420864     1344     1900     2007748896   7129         2196865716   2541512318  
13884779 4243470371534    C     28441600     132440064    42409984     1261     1666     2196880689   6153         2541526140   2072776457  
13884780 4245667252223    C     53864448     125136896    111034368    1128     2333     2541543644   6552         2072800582   2149140917  
13884781 4248208795867    C     83119616     141053952    107102208    1294     2219     2072813715   4112         2149148544   2564207783  
13884782 4250281609582    C     84401664     108220416    88825856     1065     2360     2149158675   5242         2564215611   2516223608  
13884783 4252430768257    C     53362176     136302592    51167232     1225     1963     2564242028   4741         2516217376   2108411013  
13884784 4254995010285    C     23246848     77152256     39084032     899      1731     2516233278   6925         2108422419   2390715576  
13884785 4257511243563    C     100516864    98656256     77406208     1013     2314     2108443540   7946         2390712547   2697941751  
13884786 4259619687103    C     59678208     127086592    74522624     1479     2087     2390728524   5162         2697948082   2518428769  
13884787 4262010415627    C     51313152     133058560    56836096     1160     1850     2697961085   5785         2518438066   1874581249  
13884788 4264708376712    C     38499840     130056192    40427520     1175     2138     2518456463   4040859      1870546505   2491954978  
13884789 4267226833175    C     58814976     90529792     117776384    864      3098     1874607776   4026         2491956053   2475562255  
13884790 4269101440951    C     86475776     139202560    174727168    1246     4186     2491969522   4868         2475578556   2327543107  
13884791 4271593410473    C     108779520    129941504    191479808    1104     2934     2475594576   5917         2327543433   2065292104  
13884792 4274069005049    C     136452608    145252352    256712704    1456     4042     2327557865   6033         2065305318   2788907047  
13884793 4276396562914    C     77019136     140771328    155049984    1189     2780     2065320358   5701         2788918012   2412242775  
13884794 4278461883272    C     143178752    76984320     251576320    852      4020     2788933634   5398         2412245428   2815951378  
13884795 4281250816906    C     69138944     145485824    91815936     1220     2216     2412258432   4946         2815966106   2335489136  
13884796 4283663075338    C     114761216    141082624    158588928    1228     3151     2815978557   4281         2335508011   2782529898  
13884797 4286479053895    C     35223040     119554048    37945344     1068     1612     2335521727   4598         2782534251   1982206309  
13884798 4288814575622    C     64258560     143339520    108339200    1204     2616     2782545586   3910         1982223964   2600873631  
13884799 4291597121208    C     24773632     207495168    36782080     1733     1582     1982237772   5099         2600891868   1916979266  
13884800 4293579358980    C     48922112     137027584    75210752     1200     1574     2600906689   5127         1916990316   2014285410  
13884802 4298097293340    C     33272320     150700032    102039552    1253     2404     2014286990   5110         2156809695   2035838820  
13884803 4300111580330    C     81604608     138477568    126861312    1192     2630     2156824003   5153         2035852326   2064413620  
13884804 4302268404333    C     110749696    125366272    170745856    1088     2985     2035864271   4277         2064431564   2524985032  
13884805 4304304268604    C     60029952     151941120    90644480     1284     2082     2064447577   6658         2524985414   2341167947  
13884806 4306368716181    C     65350656     158601216    110608384    1384     2593     2524997687   4230         2341175864   2778715774  
13884807 4308893713868    C     41444864     140292096    89128960     1219     2097     2341186801   4016         2778725366   2245347679  
13884808 4311234900669    C     54941696     148480000    108142592    1286     2384     2778736522   5175         2245354933   2465445800  
13884809 4314013637191    C     71305216     141561856    114016256    1190     2394     2245371433   35507        2465417436   2043746134  
13884810 4316259008624    C     76064256     147419136    115892224    1243     2522     2465460281   5492         2043752965   2362551344  
13884811 4318724468905    C     95911936     131731456    146939904    1220     2857     2043767537   4649         2362556913   2703281041  
13884812 4320768236442    C     46643712     154566656    106479616    1346     2253     2362580955   5659         2703282138   2281451691  
13884813 4323130817397    C     74088960     150323200    212606976    1552     3717     2703298387   6454         2281453627   3030101641  
13884814 4325834115784    C     60939264     130662400    140001280    1279     2537     2281466788   4724         3030117687   2297613763  
13884815 4328115582572    C     165896704    142925824    366075904    1382     5678     3030143591   4028         2297608111   3558481627  
13884816 4331145726163    C     142273024    85852160     291889152    903      4652     2297622990   5684         3558483940   3239762671  
13884817 4333443349153    C     264770560    165347328    435494912    1628     7283     3558495720   4809         3239816411   4048062420  
13884818 4337001844873    C     162929664    86028288     291962880    902      4540     3239834932   7746         4048068030   2584101426  
13884819 4340241679805    C     137316352    94298112     281124864    1217     3654     4048085836   5507         2584105436   2336033214  
13884820 4344289765641    C     95045120     89255936     175349760    910      2625     2584123986   6581         2336032299   1924461857  
13884821 4346873889627    C     97711616     87232512     184975360    877      3075     2336051996   7233         1924461104   1861019067  
13884822 4349209941623    C     84547584     97206272     157278208    1083     2972     1924476383   4748         1861035267   2099725077  
13884823 4351134418006    C     88662016     90320896     175153152    905      2417     1861048221   5092         2099731154   1837980140  
13884824 4352995466227    C     112260608    90034176     182132736    931      3010     2099747002   5675         1837983064   1818140317  
13884825 4355095213229    C     54755840     77709312     99287040     692      1760     1837997759   5168         1818144607   1574600950  
13884826 4356933210988    C     60847616     78692352     102064128    716      1724     1818156051   14286179     1560326897   1501072823  
13884827 4358751367039    S     0            0            0            0        0        1574634670   4651         1501064607   0           
13884828 4360326001709    W     0            0            0            0        0        1501074833   4289         0            0           
13884829 4361827076542    O     0            0            0            0        0        0            0            0            0  
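
(For reference, a txg history table like the one above is typically read from the per-pool kstat shown below; the pool name is a placeholder, and the kstat is only populated when zfs_txg_history is non-zero, as in the parameter dump that follows.)

# txg history kstat for a given pool; <pool> is a placeholder for the pool name
cat /proc/spl/kstat/zfs/<pool>/txgs

# number of txg entries retained is controlled by this module parameter
cat /sys/module/zfs/parameters/zfs_txg_history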
dbuf_cache_hiwater_pct = 10
dbuf_cache_lowater_pct = 10
dbuf_cache_max_bytes = 104857600
dbuf_cache_max_shift = 5
dmu_object_alloc_chunk_shift = 7
ignore_hole_birth = 1
l2arc_feed_again = 1
l2arc_feed_min_ms = 200
l2arc_feed_secs = 1
l2arc_headroom = 2
l2arc_headroom_boost = 200
l2arc_noprefetch = 1
l2arc_norw = 0
l2arc_write_boost = 8388608
l2arc_write_max = 8388608
metaslab_aliquot = 524288
metaslab_bias_enabled = 1
metaslab_debug_load = 0
metaslab_debug_unload = 0
metaslab_fragmentation_factor_enabled = 1
metaslab_lba_weighting_enabled = 1
metaslab_preload_enabled = 1
metaslabs_per_vdev = 200
send_holes_without_birth_time = 1
spa_asize_inflation = 24
spa_config_path = /etc/zfs/zpool.cache
spa_load_verify_data = 1
spa_load_verify_maxinflight = 10000
spa_load_verify_metadata = 1
spa_slop_shift = 5
zfetch_array_rd_sz = 1048576
zfetch_max_distance = 8388608
zfetch_max_streams = 8
zfetch_min_sec_reap = 2
zfs_abd_scatter_enabled = 1
zfs_abd_scatter_max_order = 10
zfs_admin_snapshot = 1
zfs_arc_average_blocksize = 8192
zfs_arc_dnode_limit = 0
zfs_arc_dnode_limit_percent = 10
zfs_arc_dnode_reduce_percent = 10
zfs_arc_grow_retry = 0
zfs_arc_lotsfree_percent = 10
zfs_arc_max = 0
zfs_arc_meta_adjust_restarts = 4096
zfs_arc_meta_limit = 0
zfs_arc_meta_limit_percent = 75
zfs_arc_meta_min = 0
zfs_arc_meta_prune = 10000
zfs_arc_meta_strategy = 1
zfs_arc_min = 0
zfs_arc_min_prefetch_lifespan = 0
zfs_arc_p_aggressive_disable = 1
zfs_arc_pc_percent = 0
zfs_arc_p_dampener_disable = 1
zfs_arc_p_min_shift = 0
zfs_arc_shrink_shift = 0
zfs_arc_sys_free = 0
zfs_autoimport_disable = 1
zfs_compressed_arc_enabled = 1
zfs_dbgmsg_enable = 0
zfs_dbgmsg_maxsize = 4194304
zfs_dbuf_state_index = 0
zfs_deadman_checktime_ms = 5000
zfs_deadman_enabled = 1
zfs_deadman_synctime_ms = 1000000
zfs_dedup_prefetch = 0
zfs_delay_min_dirty_percent = 60
zfs_delay_scale = 500000
zfs_delete_blocks = 20480
zfs_dirty_data_max = 4294967296
zfs_dirty_data_max_max = 4294967296
zfs_dirty_data_max_max_percent = 25
zfs_dirty_data_max_percent = 10
zfs_dirty_data_sync = 824633720832
zfs_dmu_offset_next_sync = 0
zfs_expire_snapshot = 300
zfs_flags = 0
zfs_free_bpobj_enabled = 1
zfs_free_leak_on_eio = 0
zfs_free_max_blocks = 100000
zfs_free_min_time_ms = 1000
zfs_immediate_write_sz = 32768
zfs_max_recordsize = 1048576
zfs_mdcomp_disable = 0
zfs_metaslab_fragmentation_threshold = 70
zfs_metaslab_segment_weight_enabled = 1
zfs_metaslab_switch_threshold = 2
zfs_mg_fragmentation_threshold = 85
zfs_mg_noalloc_threshold = 0
zfs_multihost_fail_intervals = 5
zfs_multihost_history = 0
zfs_multihost_import_intervals = 10
zfs_multihost_interval = 1000
zfs_multilist_num_sublists = 0
zfs_nocacheflush = 0
zfs_nopwrite_enabled = 1
zfs_no_scrub_io = 0
zfs_no_scrub_prefetch = 0
zfs_object_mutex_size = 64
zfs_pd_bytes_max = 52428800
zfs_per_txg_dirty_frees_percent = 30
zfs_prefetch_disable = 0
zfs_read_chunk_size = 1048576
zfs_read_history = 0
zfs_read_history_hits = 0
zfs_recover = 0
zfs_resilver_delay = 2
zfs_resilver_min_time_ms = 3000
zfs_scan_idle = 50
zfs_scan_min_time_ms = 1000
zfs_scrub_delay = 4
zfs_send_corrupt_data = 0
zfs_sync_pass_deferred_free = 2
zfs_sync_pass_dont_compress = 5
zfs_sync_pass_rewrite = 2
zfs_sync_taskq_batch_pct = 75
zfs_top_maxinflight = 32
zfs_txg_history = 200
zfs_txg_timeout = 20
zfs_vdev_aggregation_limit = 131072
zfs_vdev_async_read_max_active = 3
zfs_vdev_async_read_min_active = 1
zfs_vdev_async_write_active_max_dirty_percent = 60
zfs_vdev_async_write_active_min_dirty_percent = 30
zfs_vdev_async_write_max_active = 10
zfs_vdev_async_write_min_active = 2
zfs_vdev_cache_bshift = 16
zfs_vdev_cache_max = 16384
zfs_vdev_cache_size = 0
zfs_vdev_max_active = 1000
zfs_vdev_mirror_non_rotating_inc = 0
zfs_vdev_mirror_non_rotating_seek_inc = 1
zfs_vdev_mirror_rotating_inc = 0
zfs_vdev_mirror_rotating_seek_inc = 5
zfs_vdev_mirror_rotating_seek_offset = 1048576
zfs_vdev_queue_depth_pct = 1000
zfs_vdev_raidz_impl = [fastest] original scalar sse2 ssse3 
zfs_vdev_read_gap_limit = 32768
zfs_vdev_scheduler = noop
zfs_vdev_scrub_max_active = 2
zfs_vdev_scrub_min_active = 1
zfs_vdev_sync_read_max_active = 10
zfs_vdev_sync_read_min_active = 10
zfs_vdev_sync_write_max_active = 10
zfs_vdev_sync_write_min_active = 10
zfs_vdev_write_gap_limit = 4096
zfs_zevent_cols = 80
zfs_zevent_console = 0
zfs_zevent_len_max = 384
zil_replay_disable = 0
zil_slog_bulk = 786432
zio_delay_max = 30000
zio_dva_throttle_enabled = 1
zio_requeue_io_start_cut_in_line = 1
zio_taskq_batch_pct = 75
zvol_inhibit_dev = 0
zvol_major = 230
zvol_max_discard_blocks = 16384
zvol_prefetch_bytes = 131072
zvol_request_sync = 0
zvol_threads = 32
zvol_volmode = 1
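
(A parameter dump like the one above can be reproduced with something along these lines; the exact set of parameters depends on the module version.)

# print every zfs module parameter as "name = value", matching the dump above
for f in /sys/module/zfs/parameters/*; do
  echo "$(basename "$f") = $(cat "$f")"
done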

@snajpa Are you still having these write performance issues?

I've been running 0.7.1 for a couple of days now and the write performance issue I was having with 0.6.5.x seems to be gone!

Unfortunately I can't upgrade that often; the next round of updates won't be for another few weeks. I'll get back to this issue within a month at most with concrete data.

Hi @erikjanhofstede, could you please share the ZFS parameters you applied after upgrading to 0.7.1? Thanks in advance.
@snajpa Do you have any updates regarding the new packages for Ubuntu 16.04? Thanks.

@nawang87 I have some custom parameters in place for this specific configuration and workload. I didn't apply any new custom parameter settings for 0.7.1, except for renaming zil_slog_limit to zil_slog_bulk.
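
(For anyone doing the same rename, a hedged sketch of what it might look like in a modprobe config; the value is only the one shown in the parameter dump above, not a recommendation.)

# /etc/modprobe.d/zfs.conf -- hypothetical example of carrying a setting across the rename
# 0.6.5.x:
#   options zfs zil_slog_limit=<bytes>
# 0.7.x (parameter renamed):
options zfs zil_slog_bulk=786432    # value as shown in the parameter dump above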

I performed some tests and noticed, like @ab-oe, that without commit https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140 the performance was pretty good on 0.6.5.x (kernel 4.4.45).

ZOL 0.6.5.x with commit https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140

zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00


dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync  count=10000
480903168 bytes (481 MB) copied, 74.4589 s, 6.5 MB/s

dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync  count=10000 
508559360 bytes (509 MB) copied, 80.3149 s, 6.3 MB/s

ZOL 0.6.5.x with reverted commit https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140

zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00

dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync  count=10000
1310720000 bytes (1.3 GB) copied, 3.33955 s, 392 MB/s

dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync  count=10000 
10485760000 bytes (10 GB) copied, 14.1663 s, 740 MB/s

ZOL 0.7 HEAD and 0.7.1 show similar results to the following:

zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00

dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync  count=10000
455213056 bytes (455 MB) copied, 71.3401 s, 6.4 MB/s

dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync  count=10000
484442112 bytes (484 MB) copied, 74.2631 s, 6.5 MB/s
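
(When reproducing these comparisons it can help to watch the pool while the dd runs; a minimal sketch, assuming the pool name from the commands above.)

# in another shell while dd runs: per-vdev write ops/bandwidth at 1-second intervals
zpool iostat -v Pool-0 1

# per-disk request size and utilisation on the backing devices
iostat -x 1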


I noticed that ZOL HEAD by default also runs only one zvol kernel process:

ps ax|grep zvol
 61303 ?        S<     0:00 [zvol]

htop during I/O shows that the zvol process does not consume any resources:

61303 root        0 -20     0     0     0 S  0.0  0.0  0:00.00 zvol

The spl module with spl_taskq_thread_dynamic=0 runs 32 zvol processes:

ps ax|grep zvol
 58474 ?        S<     0:00 [zvol]
 58475 ?        S<     0:00 [zvol]
 ....
 58504 ?        S<     0:00 [zvol]
 58505 ?        S<     0:00 [zvol]

but htop during I/O still shows that the zvol processes do not consume any resources:

58474 root        0 -20     0     0     0 S  0.0  0.0  0:00.00 zvol
58475 root        0 -20     0     0     0 S  0.0  0.0  0:00.00 zvol
...

58504 root        0 -20     0     0     0 S  0.0  0.0  0:00.00 zvol
58505 root        0 -20     0     0     0 S  0.0  0.0  0:00.00 zvol

However, ZOL 0.6.5.x with commit https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140 reverted shows the zvol processes actually consuming resources (CPU%):

 20416 root       0 -20       0      0      0 S   2.0  0.0   0:01.83 zvol                                                                                                   
 20403 root       0 -20       0      0      0 S   1.7  0.0   0:01.83 zvol                                                                                                   
 20407 root       0 -20       0      0      0 S   1.7  0.0   0:01.82 zvol                                                                                                   
 20414 root       0 -20       0      0      0 S   1.7  0.0   0:02.18 zvol                                                                                                                                                                                              
 20404 root       0 -20       0      0      0 S   1.3  0.0   0:01.93 zvol                                                                                                   
 20406 root       0 -20       0      0      0 S   1.3  0.0   0:01.79 zvol                                                                                                   
 20413 root       0 -20       0      0      0 S   1.3  0.0   0:02.03 zvol                                                                                                   
 20417 root       0 -20       0      0      0 S   1.3  0.0   0:02.07 zvol                                                                                                   
 20425 root       0 -20       0      0      0 S   1.3  0.0   0:02.06 zvol                                                                                                   
 20429 root       0 -20       0      0      0 S   1.3  0.0   0:01.96 zvol  
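
(For reference, a hedged sketch of how spl_taskq_thread_dynamic=0 can be made persistent for tests like the one above; this is plain module-parameter mechanics, not a tuning recommendation.)

# /etc/modprobe.d/spl.conf -- applied the next time the spl module is loaded
options spl spl_taskq_thread_dynamic=0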

@behlendorf: maybe it's time to revert the bio work. It's obviously causing a major slowdown and there's no solution in sight. I've not seen anything from @ryao in a while on this (or actually anything, hope he's OK), and I'm not sure anyone else is working on/knows how to solve the problem. If LLNL had to back the workload with blockdev, would you allow engineering to implement a pathologically slow system which literally dropped 100x in performance on a single commit and was never fixed? This has blocked many production implementations for us; client engineering teams shoot us down and spend 30X on ... name a vendor ... Then they're stuck in proprietary hell, locked in to the people who now own their data, with other performance and consistency problems.
With crypto in place, this is probably now the most serious outstanding feature gap. Can this get prioritized please, or reverted if there is no developer bandwidth to fix it?

Listing the following commits (for reference) that touch on the bio work (or related changes); it may help to track down the issue and avoid having to revert all of it ☕️ (a sketch for browsing these locally follows the list):

https://github.com/zfsonlinux/zfs/commit/d4541210f3e07930dfefb6723d21cd8b313bb156 Linux 3.14 compat: Immutable biovec changes in vdev_disk.c [via git search]

https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140 zvol processing should use struct bio [mentioned here]

https://github.com/zfsonlinux/zfs/commit/2727b9d3b63a938c1797d31378e6a5a1dcd43573 Use uio for zvol_{read,write}

https://github.com/zfsonlinux/zfs/commit/a765a34a3142a1dcedd256e076a54e6ae61953ca Clean up zvol request processing to pass uio and fix porting regressions

https://github.com/zfsonlinux/zfs/pull/4316 Reduce overhead in zvol_write/zvol_read via existing hold
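
(A hedged sketch for browsing these locally; the file paths are an assumption about where the bio-related changes mostly live.)

# assuming a local clone of the repository
git clone https://github.com/zfsonlinux/zfs.git && cd zfs

# list commits touching the zvol and vdev_disk bio paths
git log --oneline -- module/zfs/zvol.c module/zfs/vdev_disk.c

# inspect one of the commits listed above
git show --stat 37f9dac592bf5889c3efb305c48ac39b4c7dd140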

I noticed one more performance regression, introduced in ZOL 0.7.2:

ZOL 0.7.1-1:

cat /sys/module/zfs/version 
0.7.1-1

zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00

dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync  count=5000
5000+0 records in
5000+0 records out
655360000 bytes (655 MB) copied, 16.8642 s, 38.9 MB/s

dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync  count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 23.6626 s, 44.3 MB/s

ZOL 0.7.2-1:

cat /sys/module/zfs/version 
0.7.2-1

zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00

dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync  count=5000
5000+0 records in
5000+0 records out
655360000 bytes (655 MB) copied, 94.3687 s, 6.9 MB/s

dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync  count=500 
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 75.3563 s, 7.0 MB/s

After reverting each commit individually, it seems that commit https://github.com/zfsonlinux/zfs/pull/6477/commits/a5b91f32e966197967639b93dced259a170b84ee introduced the slowdown:

ZOL 0.7.2-1 without https://github.com/zfsonlinux/zfs/pull/6477/commits/a5b91f32e966197967639b93dced259a170b84ee :

zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00

dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync  count=5000
5000+0 records in
5000+0 records out
655360000 bytes (655 MB) copied, 16.7465 s, 39.1 MB/s

dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync  count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.0499 s, 43.6 MB/s

By the way: with the direct flag, performance is generally good, but not every application can use direct I/O, so it would be nice to refactor the bio work:

ZOL 0.7.2-1:

dd if=/dev/zero of=/dev/zd0 bs=128k oflag=direct  count=50000
50000+0 records in
50000+0 records out
6553600000 bytes (6.6 GB) copied, 13.6889 s, 479 MB/s

dd if=/dev/zero of=/dev/zd0 bs=1M oflag=direct  count=50000
50000+0 records in
50000+0 records out
52428800000 bytes (52 GB) copied, 47.4183 s, 1.1 GB/s

I may provide additional tests on a server experiencing the same issue (on 0.6.5.x).
How easily can commit 37f9dac be rolled back in the source code to reproduce the issue?

@sempervictus @arturpzol @odoucet I've read through this issue again, but I'm not sure I completely understand. Can you help me understand your target workload? From the recently posted test results it looks like your focus is entirely on synchronous, single-threaded IO, which is unfortunately exactly where I'd expect performance to be the worst, particularly for a raidz pool without a dedicated log device.

Let me try to summarize the state of things in 0.7.2 and master and do my best to answer the questions posted. Then I can properly investigate your performance concerns and we can determine what can be done.

maybe it's time to revert the bio work

The flaw in the bio work referenced above has already been reverted. This was done in commits 692e55b8fea00a0d0bd46188d68031292f04e4a8 and 8fa5250f5d779e577406c581fc2d7fbf0baceea5, which reinstated the task queues to resolve the asynchronous IO performance issue.

commit a5b91f3 introduced that slowdown:

That's a bigger hit than I would have expected; we'll have to look into that.

Using the current master code I tested this worst-case scenario as posted above. The volume was created with __sync=always__ and I ran a single-threaded dd for the workload. That means every single 128K block needs to be written to the log, which is part of the primary pool, and we can't return to user space to handle another write until we're absolutely sure it's on disk. Then we need to write it again as part of the normal IO path. This is absolutely going to be slow.

That all said, you're right. In my testing it _does_ look even worse than I would have expected: I see a large write multiplier which I can't immediately explain. Let me investigate what's going on and get back to you.

On the upside, if you stick with the default values I do see reasonable performance numbers.
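
(For the sync=always scenario described above, a hedged sketch of adding a dedicated log device so that ZIL traffic no longer lands on the raidz vdevs; device names are placeholders, not recommendations.)

# add a dedicated log (SLOG) device to the pool from the earlier examples
zpool add Pool-0 log /dev/disk/by-id/<fast-log-device>

# or mirrored, if log-device availability matters
zpool add Pool-0 log mirror /dev/disk/by-id/<log-a> /dev/disk/by-id/<log-b>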

@behlendorf The dd scenario was only for testing and for easy reproduction of the issue.

In my main environment I use SCST, which exports zvols to iSCSI initiators. Initially I noticed with ZOL 0.7 master that formatting a disk on the Windows side took very long:

ZOL 0.6.5.x with reverted bio work:

zvol size: 10TB, sync: always - about 18 seconds

ZOL 0.7 HEAD:

zvol size: 10TB, sync: always - about 4 minutes 36 seconds

so I tried to eliminate the SCST, network, and iSCSI initiator layers and switched to local I/O using dd, to be sure that it is something in the ZoL layer.

I understand that with sync=always and without a dedicated log device things have to be slow, but why is 0.6.5.x with the bio work reverted, or 0.6.4.2, so fast with the same test scenario? Does that mean 0.6.5.x with the bio work reverted, or 0.6.4.2, is not fully synchronous? On the other hand that cannot be, because I performed many tests with forced shutdowns of the node and have never experienced data corruption.

Of course, I can use the default value of the sync property for the zvol, but that does not protect me from data corruption in the case of a power failure.

@behlendorf are you able to perform the dd test with ZOL 0.6.4.2 and with 0.6.5.x/0.7.x? You will notice very big performance differences.

@odoucet you can simply use 0.6.4.2 or revert the bio work on the first released 0.6.5.
To do that, download the patch from https://uploadfiles.io/pba86 .

It is a diff between the released 0.6.5 and https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140
Then patch your zfs sources, e.g.:

patch -p1 < ./revert_bio.patch

or use git to do it.

For later releases in the v0.6.5 series (e.g. v0.6.5.1) it will be harder to revert because of additional source code dependencies.
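
(A hedged sketch of the git route mentioned above; the tag name is an assumption, and as noted the revert may not apply cleanly on later point releases without manual conflict resolution.)

git clone https://github.com/zfsonlinux/zfs.git && cd zfs
git checkout -b revert-bio zfs-0.6.5
git revert 37f9dac592bf5889c3efb305c48ac39b4c7dd140
# then rebuild and reinstall the modules as usual for your distribution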

Make sure to run the ZFS test suite (zfs-tests) after those changes.

ZFS is about data integrity, after all; if the modified filesystem code puts your data through the grinder (data corruption, no integrity, other issues), it's pretty much useless.

@kernelOfTruth I agree 100% that ZFS data integrity and everything else needs to be tested with the bio work reverted.
I only wanted to show the performance differences between 0.6.4.2 and 0.6.5.

@behlendorf I tested 37f9dac with PR #5824 on top; it has the same poor performance as described in most posts here. Restoring the 32 zvol threads has no impact on performance. The performance on 0.7.0-rc4 was better because of incorrect ranges being locked for indirect writes (which, by the way, led to broken data on zvols), but after that fix the performance is as bad as it was on 37f9dac.

I found another performance regression, described in #6728, which affects the discard operation on zvols.
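
(For anyone wanting to reproduce the discard behaviour locally, a minimal sketch; the offset and length are arbitrary, and /dev/zd0 is the zvol device used in the earlier tests.)

# time a discard of the first 1 GiB of the zvol
time blkdiscard -o 0 -l 1073741824 /dev/zd0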
