Hi,
First post here so apologies in advance for my mistakes!
We recently upgraded ZFS on our Ubuntu 12.04.5 SAN from 0.6.2-1~precise to 0.6.5.4-1~precise. This was done mistakenly as part of a crash recovery.
This system has been in-place and stable for several years.
Following the upgrade we have been experiencing higher disk utilisation resulting in intermittent IOWAIT CPU issues for the server and its NFS clients.
We noticed several performance changes post upgrade. _Note that the upgrade occurred @ Week 13 on the graphs below_.
1) All clients of this SAN use NFS and there has been no change in the number of inbound NFS writes (dark brown line)

2) zpool iostat write ops have doubled

3) iostat writes have doubled for all pool disks (graph is for a sample disk)

4) iostat write sizes (avgrq-sz) have halved for all pool disks

I would guess that these issues are related to this entry in the 0.6.5 changelog -
- Improved zvol performance on all kernels (>50% higher throughput, >20% lower latency)
I've also seen write coalescing mentioned on GitHub, but looking at zpool iostat I don't see much of it happening -
san1stdpool 48.6T 16.6T 63 1.36K 404K 9.00M
raidz2 8.04T 2.83T 14 306 88.6K 2.12M
scsi-2001b4d2031085703 - - 3 268 29.6K 1.55M
scsi-2001b4d2061836505 - - 4 265 38.4K 1.53M
scsi-2001b4d2061930609 - - 3 260 23.2K 1.52M
scsi-2001b4d2061838504 - - 3 266 15.2K 1.55M
raidz2 8.10T 2.78T 7 205 101K 1.40M
scsi-2001b4d2004191406 - - 2 154 40.8K 1.04M
scsi-2001b4d2061656700 - - 1 156 34.4K 1.05M
scsi-2001b4d2061540302 - - 2 156 17.6K 1.07M
scsi-2001b4d2061648702 - - 1 159 19.2K 1.08M
raidz2 8.11T 2.77T 11 169 118K 1.11M
scsi-2001b4d2061055300 - - 3 138 44.8K 865K
scsi-2001b4d2061651708 - - 3 136 42.4K 861K
scsi-2001b4d2061645707 - - 2 133 19.2K 861K
scsi-2001b4d2061558207 - - 2 137 23.2K 875K
raidz2 8.11T 2.76T 11 188 34.9K 1.07M
scsi-2001b4d2061040403 - - 1 154 7.20K 891K
scsi-2001b4d2061556002 - - 2 156 9.60K 892K
scsi-2001b4d2061652709 - - 4 151 17.6K 866K
scsi-2001b4d2061634008 - - 2 152 11.2K 873K
raidz2 8.12T 2.76T 8 236 28.6K 1.39M
scsi-2001b4d2061541201 - - 1 170 7.20K 1.12M
scsi-2001b4d2061642603 - - 2 169 12.0K 1.10M
scsi-2001b4d2061650808 - - 2 168 9.60K 1.09M
scsi-2001b4d2061543300 - - 2 167 9.60K 1.10M
raidz2 8.13T 2.75T 10 281 32.3K 1.89M
scsi-2001b4d2061653704 - - 2 241 12.8K 1.40M
scsi-2001b4d2061643704 - - 3 241 16.0K 1.40M
scsi-2001b4d2061651706 - - 2 243 8.80K 1.40M
scsi-2001b4d2061555002 - - 2 242 11.2K 1.41M
What would you guys suggest? Downgrade to 0.6.4?
Thanks for your help
@crollorc There have been a whole lot more fundamental changes since 0.6.2 than just the zvol rework you pointed out. Among other things that could impact a write-heavy load is the write throttle overhaul of e8b96c6. You should check out some of the new-ish kstats such as /proc/spl/kstat/zfs/dmu_tx and also the per-pool histograms in /proc/spl/kstat/zfs/<pool>/dmu_tx_assign. The new system may be absorbing writes much more quickly than previously. The zvol rework you mention lowers latency a lot, which could cause excessive write throttling; whether it does would depend on the amount of memory in the system and the speed at which your drives can handle the writes.
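A minimal sketch (not from the original posts) of one way to sample those kstats a minute apart and see what is moving; the pool name is the one used elsewhere in this thread:
# snapshot the tx kstats and the per-pool assign histogram, 60 seconds apart
cat /proc/spl/kstat/zfs/dmu_tx > /tmp/dmu_tx.1
cat /proc/spl/kstat/zfs/san1stdpool/dmu_tx_assign > /tmp/dmu_tx_assign.1
sleep 60
cat /proc/spl/kstat/zfs/dmu_tx > /tmp/dmu_tx.2
cat /proc/spl/kstat/zfs/san1stdpool/dmu_tx_assign > /tmp/dmu_tx_assign.2
diff /tmp/dmu_tx.1 /tmp/dmu_tx.2
diff /tmp/dmu_tx_assign.1 /tmp/dmu_tx_assign.2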
@dweeezil Thank you for the speedy response!
The new system may be absorbing writes much more quickly than previously.
Note that sync=disabled was set up on all pools a long time ago to get reasonable NFS performance
root@san1:~# zfs get sync san1stdpool
NAME PROPERTY VALUE SOURCE
san1stdpool sync disabled local
It would depend on the amount of memory in the system and the speed at which your drives can handle the writes.
We use an Areca ARC-1880 controller and a bunch of WD30EZRX disks. Here are some details on system memory, ZFS tunings and iostat for a sample disk
root@san1:~# grep Mem /proc/meminfo
MemTotal: 148500592 kB
MemFree: 52012052 kB
root@san1:~# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=80530636800
options zfs zfs_arc_meta_limit=21474836480
options zfs zfs_prefetch_disable=1
root@san1:~# iostat -dx 30 sdx
Linux 3.2.0-57-generic (san1) 12/04/16 _x86_64_ (8 CPU)
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdx 0.00 0.00 3.79 172.44 82.42 1225.43 14.84 0.38 2.14 63.47 0.79 1.27 22.42
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdx 0.00 0.00 5.83 147.07 45.60 1149.20 15.63 0.22 1.46 32.50 0.23 1.28 19.52
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdx 0.00 0.00 5.40 168.37 53.87 1211.07 14.56 0.40 2.30 56.27 0.57 1.64 28.52
There have been a whole lot more fundamental changes since 0.6.2 than just the zvol rework you pointed out. Among other things that could impact a write-heavy load is the write throttle overhaul of e8b96c6. You should check out some of the new-ish kstats such as /proc/spl/kstat/zfs/dmu_tx and also the per-pool histograms in /proc/spl/kstat/zfs/<pool>/dmu_tx_assign.
Below are some details for the kstats mentioned.
dmu_tx_assigned is increasing fast and dmu_tx_group is increasing slowly:
dmu_tx (60s apart):
root@san1:~# cat /proc/spl/kstat/zfs/dmu_tx
5 1 0x01 11 528 5451072961 1032291543673689
name type data
dmu_tx_assigned 4 594152327
dmu_tx_delay 4 0
dmu_tx_error 4 0
dmu_tx_suspended 4 0
dmu_tx_group 4 264018
dmu_tx_memory_reserve 4 0
dmu_tx_memory_reclaim 4 0
dmu_tx_dirty_throttle 4 3169
dmu_tx_dirty_delay 4 0
dmu_tx_dirty_over_max 4 0
dmu_tx_quota 4 0
root@san1:~# cat /proc/spl/kstat/zfs/dmu_tx
5 1 0x01 11 528 5451072961 1032351549214250
name type data
dmu_tx_assigned 4 594190604
dmu_tx_delay 4 0
dmu_tx_error 4 0
dmu_tx_suspended 4 0
dmu_tx_group 4 264044
dmu_tx_memory_reserve 4 0
dmu_tx_memory_reclaim 4 0
dmu_tx_dirty_throttle 4 3169
dmu_tx_dirty_delay 4 0
dmu_tx_dirty_over_max 4 0
dmu_tx_quota 4 0
dmu_tx_assign (60s apart):
root@san1:~# cat /proc/spl/kstat/zfs/san1stdpool/dmu_tx_assign
19 1 0x01 32 1536 15812990669 1032205962575895
name type data
1 ns 4 0
2 ns 4 0
4 ns 4 0
8 ns 4 0
16 ns 4 0
32 ns 4 0
64 ns 4 0
128 ns 4 0
256 ns 4 39
512 ns 4 181
1024 ns 4 123
2048 ns 4 229
4096 ns 4 309
8192 ns 4 611
16384 ns 4 4702
32768 ns 4 7820
65536 ns 4 9830
131072 ns 4 8734
262144 ns 4 8626
524288 ns 4 7764
1048576 ns 4 6640
2097152 ns 4 6854
4194304 ns 4 8979
8388608 ns 4 14816
16777216 ns 4 23768
33554432 ns 4 26059
67108864 ns 4 21855
134217728 ns 4 19184
268435456 ns 4 21004
536870912 ns 4 23511
1073741824 ns 4 21004
2147483648 ns 4 24505
root@san1:~# cat /proc/spl/kstat/zfs/san1stdpool/dmu_tx_assign
19 1 0x01 32 1536 15812990669 1032266427393963
name type data
1 ns 4 0
2 ns 4 0
4 ns 4 0
8 ns 4 0
16 ns 4 0
32 ns 4 0
64 ns 4 0
128 ns 4 0
256 ns 4 39
512 ns 4 181
1024 ns 4 123
2048 ns 4 229
4096 ns 4 309
8192 ns 4 611
16384 ns 4 4702
32768 ns 4 7820
65536 ns 4 9831
131072 ns 4 8735
262144 ns 4 8626
524288 ns 4 7764
1048576 ns 4 6640
2097152 ns 4 6854
4194304 ns 4 8979
8388608 ns 4 14816
16777216 ns 4 23768
33554432 ns 4 26059
67108864 ns 4 21856
134217728 ns 4 19184
268435456 ns 4 21007
536870912 ns 4 23512
1073741824 ns 4 21004
2147483648 ns 4 24510
Thanks again for any pointers.
@crollorc These aren't a good sign:
536870912 ns 4 23512
1073741824 ns 4 21004
2147483648 ns 4 24510
Please try echo 100 > /sys/module/zfs/parameters/zfs_txg_history and then monitor the tail of /proc/spl/kstat/zfs/san1stdpool/txgs. My guess is that there's going to be a lot of dirty data built up and you'll see txgs taking a long time to sync. With 148GiB of RAM, by default, it'll allow 14.8GiB of dirty data and will start delaying transactions once 60% of that's filled. If the dirty data is building up that much, it's quite a bit to be flushing to a set of 6 vdevs comprised of 5400 RPM drives. Your sync=disabled setting removes most of the pressure from the writes and is likely letting lots of dirty data build up.
One other note: You had suspected changes to the zvol code but it's not clear that your system is using zvols at all, especially if it's just an NFS server.
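A small sketch (not from the original exchange) for checking the write-throttle limits described above on a live 0.6.5 system; the parameter names match the tunable dump posted later in this thread:
# zfs_dirty_data_max defaults to zfs_dirty_data_max_percent (10%) of RAM;
# transaction delays begin once zfs_delay_min_dirty_percent (60%) of it is dirty
dirty_max=$(cat /sys/module/zfs/parameters/zfs_dirty_data_max)
delay_pct=$(cat /sys/module/zfs/parameters/zfs_delay_min_dirty_percent)
echo "dirty_data_max: $((dirty_max / 1048576)) MiB"
echo "delay threshold: $((dirty_max * delay_pct / 100 / 1048576)) MiB"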
@dweeezil Thanks again for your help!
One other note: You had suspected changes to the zvol code but it's not clear that your system is using zvols at all, especially if it's just an NFS server.
Yes, sorry for this red herring!
Please try echo 100 > /sys/module/zfs/parameters/zfs_txg_history and then monitor the tail of /proc/spl/kstat/zfs/san1stdpool/txgs. My guess is that there's going to be a lot of dirty data built up and you'll see txgs taking a long time to sync. With 148GiB of RAM, by default, it'll allow 14.8GiB of dirty data and will start delaying transactions once 60% of that's filled. If the dirty data is building up that much, it's quite a bit to be flushing to a set of 6 vdevs comprised of 5400 RPM drives. Your sync=disabled setting removes most of the pressure from the writes and is likely letting lots of dirty data build up.
Some txgs below. As far as I can tell, for this sample the avg sync time is 0.785s and max dirty data is 389MB. Is that right?
Does it look like we have excessive dirty data? If so, should we change some options to reduce this?
Also, the thing I find most confusing post upgrade is the halving of average write size and doubling of write IO. Could this indicate some sort of issue with the coalescing or merging of txg async writes?
root@san1:~# tail -100 /proc/spl/kstat/zfs/san1stdpool/txgs
18 0 0x01 100 11200 15812979317 1051533841667804
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
15119926 1051317597136163 C 67319808 1138176 30715392 22 4244 2288632025 41059 90431 539574908
15119927 1051319885768188 C 67148288 324096 29394944 11 4660 3023421149 42716 56583 421626699
15119928 1051322909189337 C 622592 286208 9233920 7 2013 133828 158336 421440564 418241927
15119929 1051322909323165 C 67190272 0 31543808 0 4339 3105828843 40835 34503 207592661
15119930 1051326015152008 C 67112960 246272 30817280 18 4012 3288812978 32739 23840 1613413641
15119931 1051329303964986 C 67280896 797696 32596992 8 4321 2928346818 27324 51089 1421791320
15119932 1051332232311804 C 67201024 723456 28093952 7 4179 2135289574 30965 67649 272378251
15119933 1051334367601378 C 185868800 553984 34114560 9 3945 272499109 23381 4422 2392586832
15119934 1051334640100487 C 388984320 1226752 68935680 16 6189 2392672328 25820 17348 3349258721
15119935 1051337032772815 C 69060608 654336 33100288 18 3841 3349315820 22396 5053 2735950416
15119936 1051340382088635 C 73426944 2148864 34294272 46 4346 2735995882 22438 58645 3479784551
15119937 1051343118084517 C 72868864 719360 34494464 16 4195 3479911953 35964 67916 4541587497
15119938 1051346597996470 C 72126464 273408 29708288 5 4051 4541745505 40595 71655 470704815
15119939 1051351139741975 C 67288064 14848 34340352 1 4389 2762774648 35648 59742 290774122
15119940 1051353902516623 C 67375616 0 29429248 0 3693 1752902477 32134 57074 213941754
15119941 1051355655419100 C 67144192 121344 30396416 2 4474 2401879065 33550 56413 227860347
15119942 1051358057298165 C 311296 181760 6515200 15 1388 117155 24713 227814025 3711561552
15119943 1051358057415320 C 77940736 211456 35246592 11 5057 3939418192 39508 25284 420930211
15119944 1051361996833512 C 67150848 499712 34809856 8 4503 4519004489 35635 59046 250941851
15119945 1051366515838001 C 245760 0 6135296 0 978 149579 25826 250866431 90757888
15119946 1051366515987580 C 161980928 3584 32300544 1 3054 341680448 23303 160649 328860642
15119947 1051366857668028 C 67266560 75264 28854272 5 3882 3171883004 33868 53873 1542083954
15119948 1051370029551032 C 67233792 119296 31291392 2 3844 4015941251 33066 61815 222378832
15119949 1051374045492283 C 67112448 73216 30317056 1 4108 2828123890 32982 51705 232453458
15119950 1051376873616173 C 294912 3072 4966912 1 959 94905 23727 232425852 219545549
15119951 1051376873711078 C 67220480 0 28463104 0 3923 3582735732 32428 56405 173446117
15119952 1051380456446810 C 67352064 0 32849408 0 4644 4557654099 33427 62170 260095880
15119954 1051386876962366 C 311296 7168 4756992 3 968 64058 104865 217125334 391759430
15119955 1051386877026424 C 67350016 512 29517312 1 3781 2162472992 37046 58895 243407587
15119956 1051389039499416 C 67367936 4096 26507264 2 3857 2859501850 32602 16814 213039912
15119957 1051391899001266 C 294912 3584 6612480 1 1160 56747 23807 213015136 282602466
15119958 1051391899058013 C 84591616 5120 23961088 2 2823 495659675 19990 9958 193463258
15119959 1051392394717688 C 67204608 0 32209408 0 4709 3241553670 29093 58300 196789366
15119960 1051395636271358 C 67344384 3584 21316608 1 3268 1424572876 29297 51815 218067837
15119961 1051397060844234 C 133722112 10240 27375104 3 3207 218169319 21447 14854 276689871
15119962 1051397279013553 C 425984 75264 4948992 2 885 43184 34551 276654583 839144752
15119963 1051397279056737 C 86631424 418304 28002816 10 3119 1115928940 34058 44815 3867624148
15119964 1051398394985677 C 67249152 116224 34108928 5 4120 3976622524 31413 56118 1811913426
15119965 1051402371608201 C 442368 271360 6130176 3 1145 116962 30316 1811861424 316975495
15119966 1051402371725163 C 67272704 7168 30453248 2 3696 4079832998 5366991 56363 177157625
15119967 1051406451558161 C 606208 6656 5591040 2 1041 5474039 28348 177085766 463744406
15119968 1051406457032200 C 67382784 11264 28375552 3 3694 4750365650 31639 58194 251079004
15119969 1051411207397850 C 67136000 174080 30353920 4 4368 2648867338 33360 19377 333135826
15119970 1051413856265188 C 67301376 0 32245248 0 4648 3323272140 33377 49368 260578413
15119971 1051417179537328 C 67398656 0 30830592 0 4187 3326830255 34880 82275 193184454
15119972 1051420506367583 C 704512 13824 5534208 2 1028 168081 1436206 191703067 287482413
15119973 1051420506535664 C 67311104 4096 28228608 2 3945 1893336652 33214 20482 254137463
15119974 1051422399872316 C 67278336 0 21765632 0 2773 348983041 27356 47972 145083331
15119975 1051422748855357 C 67352064 0 26700288 0 4364 4445354130 28759 90471 275635672
15119976 1051427194209487 C 67309568 4096 17486848 2 2648 287537177 19723 36940 285099234
15119977 1051427481746664 C 294912 3584 6289920 1 1418 78881 15976 285067202 372916241
15119978 1051427481825545 C 99746816 138240 32547328 3 3813 658033143 37851 112851 281154981
15119979 1051428139858688 C 67299840 0 35225088 0 4658 4791695098 29047 34142 238786288
15119980 1051432931553786 C 67450368 442880 32909312 9 4657 3382521809 29070 53666 455676317
15119981 1051436314075595 C 62214656 0 28602368 0 4430 4998970069 33190 48473 212310971
15119982 1051441313045664 C 67166208 7680 29568512 1 4322 4977762671 33126 56693 220063626
15119984 1051448657016982 C 67188224 39424 28069888 1 3891 3938588146 33204 63339 255697567
15119985 1051452595605128 C 206512128 262144 36888576 2 3808 255814316 20576 5260 433042218
15119986 1051452851419444 C 294912 0 5493760 0 941 30191 67419 432978226 245940283
15119987 1051452851449635 C 118355456 740864 30279680 25 3021 679079720 36735 46412 5243127770
15119988 1051453530529355 C 107720192 454144 44655616 10 5230 5243222808 22147 4742 2608530245
15119989 1051458773752163 C 193300480 567808 42410496 14 4978 2608579548 23583 4823 2329348098
15119990 1051461382331711 C 67341824 647168 27520000 11 3886 2861814130 84761564 51831 2984429468
15119991 1051464244145841 C 10947072 136704 10307584 6 1463 84872604 25224 2984351067 376682757
15119992 1051464329018445 C 67301888 131072 33234944 1 3847 4520813041 31963 54605 222658607
15119993 1051468849831486 C 67185152 0 30845952 0 4001 3531488284 41844 54481 199061853
15119994 1051472381319770 C 67261952 0 33922048 0 4376 3678401469 33453 55968 232224320
15119995 1051476059721239 C 67164672 131072 37501440 1 5200 4175202659 34296 60155 225753980
15119996 1051480234923898 C 67061760 3584 33344512 1 5753 3162438635 33377 60652 293532139
15119997 1051483397362533 C 67233280 0 27124224 0 3267 892441673 40501 17950 188961671
15119998 1051484289804206 C 72597504 7680 19038720 2 2022 189064281 25234 38076 499031632
15119999 1051484478868487 C 67157504 0 29156352 0 3885 4877431971 33126 58347 294803691
15120000 1051489356300458 C 67275776 0 17293312 0 2613 488698066 25123 15277 186339567
15120001 1051489844998524 C 67120640 189952 29140480 2 4070 2988750048 31981 54116 217461923
15120002 1051492833748572 C 67270144 30208 30515200 1 4008 2598587018 33083 24376 207672250
15120003 1051495432335590 C 67363328 0 23166976 0 3080 1092220744 59143 27075 197765246
15120004 1051496524556334 C 311296 77824 5860864 3 944 107004 115234 197634991 284243810
15120005 1051496524663338 C 67376128 34816 21529088 1 3238 1492765489 33958 61471 177605761
15120006 1051498017428827 C 67281920 88064 23701504 3 3359 1563872137 34359 56987 2221304322
15120007 1051499581300964 C 442368 446464 6493696 7 972 142416 121732 2221137710 579516387
15120008 1051499581443380 C 80051200 0 27474432 0 3577 2800846052 31181 36791 237096877
15120009 1051502382289432 C 67284480 0 24519168 0 3317 843436262 32395 21238 187852604
15120010 1051503225725694 C 606208 288256 5197312 9 973 70125 137521 187704935 632766615
15120011 1051503225795819 C 89024512 65536 26772992 7 3249 820634999 35718 45433 2604648562
15120012 1051504046430818 C 69491712 264192 26582016 6 3784 2604750202 20538 38766 322755582
15120014 1051509437880034 C 72758272 207872 21197824 7 2681 169953556 24450 4840 2003911666
15120015 1051509607833590 C 67322880 31744 33702912 7 4161 4465702675 32512 58383 1077354867
15120016 1051514073536265 C 81204224 0 25047040 0 3331 1077512643 31088 13559 159332753
15120017 1051515151048908 C 67190784 187904 14333440 2 2120 330590061 23868 65989 925606183
15120018 1051515481638969 C 294912 5632 5998080 1 899 110777 16520 925574862 298434801
15120019 1051515481749746 C 115534848 302080 32688640 6 3619 1224080377 27726 44641 1976941306
15120020 1051516705830123 C 67234816 262144 31841280 2 4065 4021937649 29437 65558 910526372
15120021 1051520727767772 C 67363328 0 33303552 0 3697 3745727914 29722 61160 191961445
15120022 1051524473495686 C 67283968 242176 31216640 4 4147 2288904899 40747 58234 653195841
15120023 1051526762400585 C 67494400 0 31962112 0 4467 3981323257 30209 34169 226151116
15120024 1051530743723842 C 67171840 5632 30683648 5 4057 2780591447 28499 32628 233295354
15120025 1051533524315289 O 0 0 0 0 0 0 0 0 0
Thanks again.
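For reference, a sketch (not from the original posts) of how the figures above can be pulled from the txgs kstat; it assumes the column order shown in the header, with times in nanoseconds and ndirty in bytes, and skips the two kstat header lines:
tail -n +3 /proc/spl/kstat/zfs/san1stdpool/txgs | awk '
  $3 == "C" { stime += $12; n++; if ($4 > nd) nd = $4 }
  END { printf "avg sync time: %.3f s  max ndirty: %.0f MB\n", stime/n/1e9, nd/1e6 }'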
@crollorc I may have gotten ahead of myself here. So much has changed since 0.6.2 and I latched on to the new write throttle. There have also been big ARC changes. In order to get a better picture of what's going on, we also need to see your arcstats.
Are these changes:
options zfs zfs_arc_max=80530636800
options zfs zfs_arc_meta_limit=21474836480
left over from 0.6.2 or did you add them after the upgrade? If these are from 0.6.2, I'd suggest removing these and watching the arcstats.
@dweeezil
There have also been big ARC changes. In order to get a better picture of what's going on, we also need to see your arcstats.
I've added some arcstats and arc_summary below (hopefully with the right params).
Are these changes:
options zfs zfs_arc_max=80530636800
options zfs zfs_arc_meta_limit=21474836480
left over from 0.6.2 or did you add them after the upgrade? If these are from 0.6.2, I'd suggest removing these and watching the arcstats.
They are left over and were tuned to deal with some problems with a large ARC in 0.6.2. We were also dropping caches (echo 3 > /proc/sys/vm/drop_caches) every hour on 0.6.2 but stopped post upgrade.
We are very dependent on the high ARC hit rates we get (ARC: 93%, L2ARC: 68%). How would I default these options from the live system? What size will the ARC default to, and will this result in the ARC being cleared or will it resize dynamically?
Thanks again for your help.
root@san1:~# arcstat.py 2
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
22:19:34 1 0 33 0 33 0 0 0 33 75G 75G
22:19:36 4.5K 227 5 227 5 0 0 167 43 75G 75G
22:19:38 2.9K 199 6 199 6 0 0 163 7 74G 75G
22:19:40 749 132 17 132 17 0 0 95 29 74G 75G
22:19:42 4.4K 220 4 220 4 0 0 159 10 74G 75G
22:19:44 1.4K 60 4 60 4 0 0 31 3 74G 75G
22:19:46 1.6K 119 7 119 7 0 0 84 8 74G 75G
22:19:48 1.5K 128 8 128 8 0 0 71 7 75G 75G
22:19:50 2.3K 333 14 333 14 0 0 311 17 74G 75G
22:19:52 1.0K 81 8 81 8 0 0 44 32 75G 75G
22:19:54 4.5K 106 2 106 2 0 0 72 7 75G 75G
22:19:56 2.1K 31 1 31 1 0 0 7 35 75G 75G
22:19:58 3.0K 96 3 96 3 0 0 54 5 75G 75G
22:20:00 6.9K 353 5 353 5 0 0 304 5 75G 75G
22:20:02 1.5K 146 9 146 9 0 0 79 42 75G 75G
22:20:05 23K 190 0 190 0 0 0 138 0 75G 75G
22:20:07 5.3K 150 2 150 2 0 0 101 7 75G 75G
22:20:09 4.1K 325 8 325 8 0 0 254 17 75G 75G
22:20:11 3.8K 129 3 129 3 0 0 89 4 75G 75G
22:20:13 3.5K 127 3 127 3 0 0 78 6 75G 75G
22:20:15 968 118 12 118 12 0 0 78 40 75G 75G
22:20:17 1.4K 115 8 115 8 0 0 83 9 75G 75G
22:20:19 537 87 16 87 16 0 0 66 72 75G 75G
root@san1:~# arc_summary.py
------------------------------------------------------------------------
ZFS Subsystem Report Tue Apr 12 22:20:47 2016
ARC Summary: (HEALTHY)
Memory Throttle Count: 0
ARC Misc:
Deleted: 73.11m
Mutex Misses: 8.20k
Evict Skips: 8.20k
ARC Size: 100.00% 75.00 GiB
Target Size: (Adaptive) 100.00% 75.00 GiB
Min Size (Hard Limit): 0.04% 32.00 MiB
Max Size (High Water): 2400:1 75.00 GiB
ARC Size Breakdown:
Recently Used Cache Size: 93.75% 70.31 GiB
Frequently Used Cache Size: 6.25% 4.69 GiB
ARC Hash Breakdown:
Elements Max: 9.19m
Elements Current: 92.81% 8.53m
Collisions: 160.23m
Chain Max: 6
Chains: 918.16k
ARC Total accesses: 3.54b
Cache Hit Ratio: 93.96% 3.33b
Cache Miss Ratio: 6.04% 213.86m
Actual Hit Ratio: 93.80% 3.32b
Data Demand Efficiency: 94.10% 1.08b
Data Prefetch Efficiency: 30.63% 13.31m
CACHE HITS BY CACHE LIST:
Most Recently Used: 8.86% 294.70m
Most Frequently Used: 90.97% 3.03b
Most Recently Used Ghost: 1.40% 46.42m
Most Frequently Used Ghost: 3.08% 102.49m
CACHE HITS BY DATA TYPE:
Demand Data: 30.47% 1.01b
Prefetch Data: 0.12% 4.07m
Demand Metadata: 69.34% 2.31b
Prefetch Metadata: 0.07% 2.25m
CACHE MISSES BY DATA TYPE:
Demand Data: 29.69% 63.48m
Prefetch Data: 4.32% 9.23m
Demand Metadata: 64.88% 138.75m
Prefetch Metadata: 1.12% 2.40m
L2 ARC Summary: (HEALTHY)
Low Memory Aborts: 288
Free on Write: 475.27k
R/W Clashes: 7.58k
Bad Checksums: 0
IO Errors: 0
L2 ARC Size: (Adaptive) 664.73 GiB
Compressed: 42.35% 281.54 GiB
Header Size: 0.09% 626.96 MiB
L2 ARC Evicts:
Lock Retries: 60
Upon Reading: 1
L2 ARC Breakdown: 213.86m
Hit Ratio: 68.97% 147.50m
Miss Ratio: 31.03% 66.36m
Feeds: 1.06m
L2 ARC Writes:
Writes Sent: 100.00% 1.02m
File-Level Prefetch: (HEALTHY)
DMU Efficiency: 9.40m
Hit Ratio: 31.19% 2.93m
Miss Ratio: 68.81% 6.47m
Colinear: 6.47m
Hit Ratio: 0.01% 536
Miss Ratio: 99.99% 6.46m
Stride: 2.72m
Hit Ratio: 99.99% 2.72m
Miss Ratio: 0.01% 349
DMU Misc:
Reclaim: 6.46m
Successes: 0.35% 22.78k
Failures: 99.65% 6.44m
Streams: 214.18k
+Resets: 0.09% 200
-Resets: 99.91% 213.98k
Bogus: 0
ZFS Tunable:
zfs_arc_sys_free 0
zfs_arc_lotsfree_percent 10
l2arc_norw 0
l2arc_feed_again 1
l2arc_nocompress 0
l2arc_noprefetch 1
l2arc_feed_min_ms 200
l2arc_feed_secs 1
l2arc_headroom_boost 200
l2arc_headroom 2
l2arc_write_boost 8388608
l2arc_write_max 8388608
zfs_arc_num_sublists_per_state 8
zfs_arc_min_prefetch_lifespan 0
zfs_arc_average_blocksize 8192
zfs_disable_dup_eviction 0
zfs_arc_p_min_shift 0
zfs_arc_shrink_shift 0
zfs_arc_p_dampener_disable 1
zfs_arc_p_aggressive_disable 1
zfs_arc_grow_retry 0
zfs_arc_meta_strategy 1
zfs_arc_meta_adjust_restarts 4096
zfs_arc_meta_prune 10000
zfs_arc_meta_min 0
zfs_arc_meta_limit 21474836480
zfs_arc_max 80530636800
zfs_arc_min 0
zfs_dbuf_state_index 0
zfs_dedup_prefetch 0
zfs_nopwrite_enabled 1
zfs_mdcomp_disable 0
zfs_send_corrupt_data 0
zfs_pd_bytes_max 52428800
zfetch_array_rd_sz 1048576
zfetch_block_cap 256
zfetch_min_sec_reap 2
zfetch_max_streams 8
zfs_prefetch_disable 1
zfs_max_recordsize 1048576
zfs_delay_scale 500000
zfs_dirty_data_sync 67108864
zfs_dirty_data_max_max 38016151552
zfs_dirty_data_max 15206460620
zfs_delay_min_dirty_percent 60
zfs_dirty_data_max_max_percent 25
zfs_dirty_data_max_percent 10
zfs_free_max_blocks 100000
zfs_no_scrub_prefetch 0
zfs_no_scrub_io 0
zfs_resilver_min_time_ms 3000
zfs_free_min_time_ms 1000
zfs_scan_min_time_ms 1000
zfs_scan_idle 50
zfs_scrub_delay 4
zfs_resilver_delay 2
zfs_top_maxinflight 32
zfs_zevent_console 0
zfs_zevent_cols 80
zfs_zevent_len_max 2048
metaslab_bias_enabled 1
metaslab_lba_weighting_enabled 1
metaslab_fragmentation_factor_enabled 1
zfs_metaslab_fragmentation_threshold 70
zfs_mg_fragmentation_threshold 85
zfs_mg_noalloc_threshold 0
metaslab_preload_enabled 1
metaslab_debug_unload 0
metaslab_debug_load 0
metaslab_aliquot 524288
zio_taskq_batch_pct 75
spa_load_verify_data 1
spa_load_verify_metadata 1
spa_load_verify_maxinflight 10000
zfs_autoimport_disable 1
spa_config_path /etc/zfs/zpool.cache
spa_slop_shift 5
spa_asize_inflation 24
zfs_deadman_enabled 1
zfs_deadman_synctime_ms 1000000
zfs_free_leak_on_eio 0
zfs_recover 0
zfs_flags 0
zfs_txg_history 100
zfs_read_history_hits 0
zfs_read_history 0
zfs_txg_timeout 5
metaslabs_per_vdev 200
zfs_vdev_cache_bshift 16
zfs_vdev_cache_size 0
zfs_vdev_cache_max 16384
zfs_vdev_scheduler noop
zfs_vdev_mirror_switch_us 10000
zfs_vdev_sync_write_min_active 10
zfs_vdev_sync_write_max_active 10
zfs_vdev_sync_read_min_active 10
zfs_vdev_sync_read_max_active 10
zfs_vdev_scrub_min_active 1
zfs_vdev_scrub_max_active 2
zfs_vdev_async_write_min_active 1
zfs_vdev_async_write_max_active 10
zfs_vdev_async_read_min_active 1
zfs_vdev_async_read_max_active 3
zfs_vdev_async_write_active_min_dirty_percent 30
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_vdev_max_active 1000
zfs_vdev_write_gap_limit 4096
zfs_vdev_read_gap_limit 32768
zfs_vdev_aggregation_limit 131072
zfs_expire_snapshot 300
zfs_admin_snapshot 0
zfs_dbgmsg_maxsize 4194304
zfs_dbgmsg_enable 0
zfs_immediate_write_sz 32768
zfs_read_chunk_size 1048576
zil_slog_limit 1048576
zfs_nocacheflush 0
zil_replay_disable 0
zfs_sync_pass_rewrite 2
zfs_sync_pass_dont_compress 5
zfs_sync_pass_deferred_free 2
zio_requeue_io_start_cut_in_line 1
zio_delay_max 30000
zvol_prefetch_bytes 131072
zvol_max_discard_blocks 16384
zvol_major 230
zvol_inhibit_dev 0
@crollorc Please cat /proc/spl/kstat/zfs/arcstats periodically when the problem is happening and post the output (ideally to a gist or paste or something).
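One way to capture that (a sketch only; the log path is arbitrary and not from the original posts):
# grab arcstats once a minute for an hour while the problem is visible
for i in $(seq 60); do
    date >> /tmp/arcstats.log
    cat /proc/spl/kstat/zfs/arcstats >> /tmp/arcstats.log
    sleep 60
done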
@dweeezil
Thanks again for all your pointers.
Please cat /proc/spl/kstat/zfs/arcstats periodically when the problem is happening and post the output (ideally to a gist or paste or something).
We haven't noticed any problems with the ARC or reads in general since the upgrade, so I'm unclear on when I should be looking at arcstats.
Because the ARC and L2ARC are so efficient, only a small proportion of disk IOs are reads -
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdx 0.00 0.00 3.79 172.44 82.42 1225.43 14.84 0.38 2.14 63.47 0.79 1.27 22.42
However, writes have doubled post-upgrade, which gives us a lot less disk IOPS headroom. When IOPS hit 220 or so, IO queue depth and service time jump, followed by SAN and client load linked to iowait.
@crollorc Both your iostat and txgs history do show an average write size of about 8K. I would have expected ec8501e, which _is_ in 0.6.5.4, to help aggregate them better. The small writes, which were clearly shown in your original iostat graph, do appear to be the problem.
If there is a bigger underlying issue here, I'd expect there to be a lot more problem reports given how widespread NFS server use is. The only unusual thing about your setup is that you're not using an SSD log device but instead have set sync=disabled (and maybe that's not as unusual as it ought to be).
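For anyone comparing notes, the vdev aggregation knobs can be read directly (a sketch, not from the original posts; the parameter names are taken from the tunable dump above):
grep . /sys/module/zfs/parameters/zfs_vdev_aggregation_limit \
       /sys/module/zfs/parameters/zfs_vdev_write_gap_limit \
       /sys/module/zfs/parameters/zfs_vdev_read_gap_limit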
@dweeezil
Both your iostat and txgs history do show an average write size of about 8K. I would have expected ec8501e, which is in 0.6.5.4, to help aggregate them better. The small writes, which were clearly shown in your original iostat graph, do appear to be the problem.
If there is a bigger underlying issue here, I'd expect there to be a lot more problem reports given how widespread NFS server use is. The only unusual thing about your setup is that you're not using an SSD log device but instead have set sync=disabled (and maybe that's not as unusual as it ought to be).
One thing I didn't mention is that we have not done a 'zpool upgrade' so that we can downgrade if necessary.
We do have compression=on, so I disabled it to see what effect it would have.
iostat avgrq-sz has doubled (to where it was pre-upgrade) but w/s has not dropped. Also, there are no wrqm/s, which seems odd to me.
compression=on:
root@san1:~# iostat -dx 30 sdx
Linux 3.2.0-57-generic (san1) 12/04/16 _x86_64_ (8 CPU)
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdx 0.00 0.00 5.83 147.07 45.60 1149.20 15.63 0.22 1.46 32.50 0.23 1.28 19.52
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdx 0.00 0.00 5.40 168.37 53.87 1211.07 14.56 0.40 2.30 56.27 0.57 1.64 28.52
compression=off:
root@san1:~# iostat -dx 30 sdx
Linux 3.2.0-57-generic (san1) 13/04/16 _x86_64_ (8 CPU)
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdx 0.00 0.00 4.47 167.83 29.60 3029.73 35.51 0.98 5.95 132.15 2.59 3.10 53.36
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdx 0.00 0.00 3.47 104.90 24.53 1603.60 30.05 0.47 4.33 55.69 2.63 3.06 33.12
Any ideas on data I could provide, tests I could do, or steps I could take to help identify the root cause or mitigate the problem?
Thanks again.
@dweeezil
We tested sync=standard on 0.6.2 with an SSD-based ZIL but the performance was poor, so we moved to sync=disabled.
We are planning to try sync=standard again to see if this eliminates the small writes.
Does this sound reasonable?
@dweeezil
We are planning to try sync=standard again to see if this eliminates the small writes.
We are now running with sync=standard, and request sizes to the slog SSDs and the vdev disks are both now around 15 sectors, so it looks like we are hitting the same issue with both sync=disabled and sync=standard.
Any ideas!?
@crollorc Unrelated to this issue, your earlier issue with sync=standard and an SSD log device may very well have been fixed by b39c22b. As to this current issue, I can't tell why you're not getting any write block aggregation but were in the past. The only thing I can think of at the moment is that the new write throttle is starting to sync the txgs before there's enough dirty data to actually aggregate the writes. The new write throttle is controlled much less by time and more by the amount of dirty data. Maybe try increasing zfs_dirty_data_sync from its default of 64MiB to something several times larger. The commit comment in e8b96c60 outlines all the new tunables (there are a lot of them) and the rationale behind the new dirty-data-based logic.
@dweeezil Thanks for the advice.
I hiked up zfs_dirty_data_sync by a factor of 5
echo "335544320">>/sys/module/zfs/parameters/zfs_dirty_data_sync
This has led to larger, slower txg syncs but unfortunately has had no effect on write sizes.
root@san1:~# cat /proc/spl/kstat/zfs/san1stdpool/txgs
18 0 0x01 100 11200 15812979317 1220138956414077
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
15165732 1219547306928970 C 103823872 8437248 108766208 687 6628 4998911226 21627 86364 5969940592
15165733 1219552305840196 C 133959680 16027136 133283840 890 8009 5970068750 19449 9661 9281926622
15165734 1219558275908946 C 419964416 10758144 418938880 948 10639 9281975753 38762642 12841 11173108648
15165735 1219567557884699 C 2506752 593408 7084544 46 1216 38783744 67981 11173038382 481888713
15165736 1219567596668443 C 335713280 14246912 329070592 911 11272 14922927079 33046 59473 10498697281
15165737 1219582519595522 C 208999424 19658240 209769984 1177 9328 10498829161 23350 11915 11889600512
15165738 1219593018424683 C 360969728 12538368 360671744 812 11064 11889653459 240645955 26036 12310457932
15165739 1219604908078142 C 3392512 1349632 11851264 142 2093 240679883 16497 12310439528 2259345943
15165740 1219605148758025 C 244354048 18569216 249533952 1169 11906 17309564928 29802 68381 12566762372
15165741 1219622458322953 C 614432768 8825344 612707328 687 13992 12566918310 18515761 24347 12789978166
15165742 1219635025241263 C 2703360 268288 9133056 46 1363 18546521 15738 12789962159 1117922924
15165743 1219635043787784 C 254418432 4844544 247903232 615 10461 17789000422 19663 28888 8831526786
15165744 1219652832788206 C 356394496 5272064 357357568 532 9933 8831597067 20160 20300 8794093168
15165745 1219661664385273 C 175232512 3508736 66828288 622 8274 8794205731 30647 27209 5537575188
15165746 1219670458591004 C 109496320 8479232 48517120 487 6868 5537646814 19917 4868 4994873362
15165747 1219675996237818 C 78661632 2893824 38667264 337 5141 4999150577 1095103 24052 2955321620
15165748 1219680995388395 C 335664128 10828800 71731200 695 7917 3664853246 31489 17776 5392371236
15165749 1219684660241641 C 425984 1851904 6299648 140 1209 66072 29508 5392331163 1340747972
15165750 1219684660307713 C 335826944 11036160 90896384 729 9605 10279364198 25490 16578 8948091217
15165751 1219694939671911 C 151248896 2181632 63707648 550 7995 8948185595 27002 35457 7073099465
15165752 1219703887857506 C 136689664 2355200 58725376 596 7312 7073210909 26658848 11622 5380774336
15165753 1219710961068415 C 174258688 2200576 58366464 410 6945 5407481756 28565 13310 4190204779
15165754 1219716368550171 C 106930688 5412352 50428928 493 6421 5000909906 13078253 22659 4051350662
15165755 1219721369460077 C 72669696 2042368 38909952 475 6270 5011741115 102642288 68493 3704350875
15165756 1219726381201192 C 77281792 1138688 35868672 267 5944 5099776317 30340 55163 2992220012
15165757 1219731480977509 C 86577664 1547776 42548736 441 6350 4999683313 34274 66523 2982665268
15165758 1219736480660822 C 125505024 2061312 51666944 434 7412 4999731645 4336226 66781 3676232961
15165760 1219745346040784 C 146661888 7630336 52675584 512 7413 6593591562 26484 43135 4919952556
15165761 1219751939632346 C 85427712 8373760 40056832 618 5884 4996010229 28315 61489 6691483521
15165762 1219756935642575 C 85024768 3826176 42681856 483 5982 6691619637 28521 43215 3149760693
15165763 1219763627262212 C 111463424 7518208 48117248 398 6533 4999845277 29544 46361 3226251398
15165764 1219768627107489 C 84978688 2159104 40262144 336 6124 4999694156 22488 56248 2707472379
15165765 1219773626801645 C 228312064 2410496 73106432 568 8755 4999814413 28472 67061 4537423339
15165766 1219778626616058 C 152351232 1446400 48518144 478 7144 4999704757 34587511 152924 3419871570
15165767 1219783626320815 C 85893632 1113088 42266112 346 6851 5031810548 29674 49372 2609414857
15165768 1219788658131363 C 80108544 3673088 40606208 531 7238 4999698306 29588 63670 5271838972
15165769 1219793657829669 C 107587072 3384320 49200640 778 7184 5271952524 18752 14482 6422220740
15165770 1219798929782193 C 283225088 1183232 75063808 337 8025 6422272179 25795865 26462 3306554213
15165771 1219805352054372 C 517490176 6026240 105960960 294 9675 3332407159 24361670 94654 7777553278
15165772 1219808684461531 C 328192 120832 8216576 38 1331 24462006 15774 7777537645 646071454
15165773 1219808708923537 C 335872512 2001408 96568832 196 10248 9625785224 3967403 58006 4411652001
15165774 1219818334708761 C 8011776 1228800 10922496 61 2146 4075025 25498 4411583229 1014418827
15165775 1219818338783786 C 122765312 285696 56177664 33 8769 9409308807 29039 17008 1427350402
15165776 1219827748092593 C 112653824 47104 52990464 19 7448 5003795561 29668 65926 823252720
15165777 1219832751888154 C 213636096 615936 68867072 22 7773 4999762114 29651 50609 582089623
15165778 1219837751650268 C 72818176 669184 40657920 90 7355 4999748160 29023 56489 2090515439
15165779 1219842751398428 C 69604864 148480 37343232 11 5940 4999820073 29721 75034 406247655
15165780 1219847751218501 C 92731392 64512 43635712 23 6909 4999625288 27797 56131 807718543
15165781 1219852750843789 C 136089600 16896 57552896 9 8189 4999839967 28295 66847 333888819
15165782 1219857750683756 C 119449088 397312 45880320 8 6320 5003233374 29865 36752 399538499
15165783 1219862753917130 C 290128896 865792 69095936 39 7460 4996205772 28943 49386 897644253
15165784 1219867750122902 C 272039424 941056 76449792 14 8472 4999853280 36926 58131 472168373
15165785 1219872749976182 C 57738752 230912 30526976 28 5454 4999638757 38953 15394 926580755
15165786 1219877749614939 C 112080384 2570752 49501696 62 7126 4999814566 30159 55156 1658388365
15165787 1219882749429505 C 118583808 927744 46692352 98 6621 5003692554 30183 26637 1538912251
15165788 1219887753122059 C 123766272 1391104 48965632 344 7353 4999812097 18289533 49957 3153770173
15165790 1219897768677826 C 115807232 5521408 48044544 599 7817 4999747181 27994 26217 5278535095
15165791 1219902768425007 C 107308032 3016192 43960320 421 6410 5278607879 30748 13269 3327519422
15165792 1219908047032886 C 88134144 2544640 44220928 428 6451 4996884894 24708 42158 3524859611
15165793 1219913043917780 C 101572096 1798656 47307264 402 6889 4999796341 29642 62216 3315015885
15165794 1219918043714121 C 201362432 9166848 64028160 856 8845 4999652480 21997593 51567 5958306468
15165795 1219923043366601 C 475703296 13393408 104780288 1027 11005 5980368461 53319 5879 10768368601
15165796 1219929023735062 C 411564032 5027328 102991360 670 10336 10768440854 18369 5008 9290714054
15165797 1219939792175916 C 216221184 6014464 81198592 520 9805 9290800250 26214 26780 9057522982
15165798 1219949082976166 C 269355008 2806784 96839680 480 9100 9057594477 19550 8993 7197828302
15165799 1219958140570643 C 120171520 1551872 52727808 352 6954 7197876001 20248892 185085 3336507211
15165800 1219965338446644 C 75444224 2197504 34736128 353 5539 5018575574 13689230 14698 3150206648
15165801 1219970357022218 C 72051712 1256960 35189248 346 5776 5011799878 29297 54952 2945432727
15165802 1219975368822096 C 113371648 1550336 51550720 328 6780 4999786866 29772 48439 2812015577
15165803 1219980368608962 C 148939264 1693184 50263040 441 6723 4999695466 22113 66972 3445798711
15165804 1219985368304428 C 369689600 1588736 80213504 408 7596 3445919602 22786 39240 4828515597
15165805 1219988814224030 C 146930176 2231296 60336128 503 7374 4997634368 25424 60583 5170268217
15165806 1219993811858398 C 97103360 1856000 44809216 446 6840 5170369617 20135 5213 5155910764
15165807 1219998982228015 C 66098688 3091456 36184064 371 5531 5155950845 19672 5059 3950018276
15165808 1220004138178860 C 101739520 1630208 47604736 361 6301 4996990395 17013433 51559 3971532466
15165809 1220009135169255 C 123042304 1344000 50500096 312 6664 5015688785 39308655 35128 4235817268
15165810 1220014150858040 C 257754112 1262080 64723456 335 7160 5035757232 37256 56746 3359800759
15165811 1220019186615272 C 78577664 1421824 40296448 356 6142 4999793735 18817233 42229 3218582558
15165812 1220024186409007 C 97896448 1326080 44792320 446 7251 5015675033 30864 70346 4524430587
15165813 1220029202084040 C 77125120 1417216 39683584 347 6158 4999808552 33373 48808 2920449939
15165814 1220034201892592 C 102322176 2422272 44349440 441 6314 4999800786 33444 62751 3078155648
15165815 1220039201693378 C 206567936 2083840 58301440 300 7239 4999646245 30567 53322 2940542504
15165816 1220044201339623 C 335696896 3860992 73400832 456 8006 4018577811 4150463 61669 5106145008
15165817 1220048219917434 C 1097728 546816 6807552 112 1469 4234043 29725 5106100328 797276855
15165818 1220048224151477 C 143298560 2547712 60027392 492 7608 10104533934 28355 52513 4927874857
15165820 1220063328383773 C 94451712 4605440 45061120 484 6269 4999770108 29912 47488 3966882626
15165821 1220068328153881 C 164284416 3402752 56995840 352 6653 4999796208 21953 86098 3702911755
15165822 1220073327950089 C 114455552 2962432 45576192 269 5641 4999694622 42065 16165 2549991152
15165823 1220078327644711 C 108544000 2788864 50128384 393 6284 5003727717 24171 59502 4126122936
15165824 1220083331372428 C 73833472 1990144 38971392 333 5788 4999829832 6680903 75739 2725572440
15165825 1220088331202260 C 90086400 2100736 43217920 319 6279 5003778113 28462 70563 2831876066
15165826 1220093334980373 C 121577984 4802560 56598016 857 8714 4999677310 28872 55741 7081100546
15165827 1220098334657683 C 339081728 3565568 86428160 499 9814 7081202852 47905 45680 8622300546
15165828 1220105415860535 C 330521088 7106048 98919936 903 10811 8622406877 22133 4582 12519017260
15165829 1220114038267412 C 204268032 58990592 79695360 1248 9552 12519064508 23334 4602 10409391997
15165830 1220126557331920 S 0 0 0 0 0 10409433593 19369 6400 0
15165831 1220136966765513 O 0 0 0 0 0 0 0 0 0
The only way we've found of increasing the average write size is to set compression=off.
@crollorc I could not duplicate this problem with a quick test. I had no problem getting about 128KiB average write size as reported by iostat when copying large files to an NFS exported zfs filesystem. When rsyncing lots of small files, the average write size was all over the place, exactly as I'd expect. I'll try a few more tests with varying pool geometries but at the moment, I'm not able to reproduce the problem.
@dweeezil Thanks for trying.
Looking at the NFSv3 traffic, I found that most (70%) of writes were 4K despite having wsize=1048576 in all client mount options
root@san1:~# tcpdump -r /tmp/dump1|grep -i 'write fh'|wc -l
5555
root@san1:~# tcpdump -r /tmp/dump1|grep -i 'write fh'|awk '{print $10}'|sort|uniq -c|sort -nr|head
3856 4096
476 8192
318 12288
92 16384
90 18
50 16
44 20480
30 91374
24 512
22 24576
I had a look at the clients and I think the root cause of the 4K writes may be that many of these clients are running Linux and Windows KVM VMs, which I would guess probably have a 4K block size.
However, I don't understand why iostat avgrq-sz halved and iostat w/s doubled post-upgrade.
@crollorc I think your problem boils down to no aggregation. During my simple tests, 8K streaming writes were aggregated nicely. Could you try a simple dd bs=8k if=/dev/zero of=zeroes test, both directly and also over NFS, to an otherwise quiescent filesystem with compression disabled (even if only temporarily), to verify that aggregation is working at all?
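A sketch of that test (not from the original posts; the dataset path and client mount point are placeholders, not paths from this system):
# ~1 GiB of 8K writes directly into the pool on the server
dd if=/dev/zero of=/san1stdpool/testfs/zeroes bs=8k count=131072
# the same thing from an NFS client against the exported filesystem
dd if=/dev/zero of=/mnt/nfs/zeroes bs=8k count=131072
# meanwhile, on the server, watch avgrq-sz and wrqm/s for a pool disk
iostat -dx 5 sdx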
@dweeezil
I think your problem boils down to no aggregation. During my simple tests, 8K streaming writes were aggregated nicely.
One thing I am unclear on is whether I should see merged writes in iostat for all aggregation or whether it happens inside ZFS - and where exactly should I look for evidence of aggregation?
We ran without compression on 0.6.2 and I noted that a large proportion of writes were merged in iostat. Last year we turned on compression and merged writes were significantly reduced but still present. So far on 0.6.5 we see no merged writes at all (prior to these tests) with compression either on or off. You can see this in this graph for a sample pool member (unfortunately I don't have graphs illustrating merged writes with compression off on 0.6.2)

Could you try a simple dd bs=8k if=/dev/zero of=zeroes test both directly and also over NFS to an otherwise quiescent filesystem with compression disabled (even if only temporarily) to verify that aggregation is working at all.
Compression=off at the moment but I cannot do this on an entirely quiescent fs right now.
For the direct dd, I found that the request size jumped to 100-300 for the duration of the test, with intermittent drops (I guess between flushes). Also, there were merged writes during the test.


For the NFS dd, I found that the request size jumped to 100-150 for the duration of the test. There were also a few merged writes, but far fewer than for the direct dd.


@crollorc
One thing I am unclear on is if I should see merged writes in iostat for all aggregations or are they happening inside ZFS and where exactly should I look for evidence of aggregation?
I was mainly concerned about the aggregation performed within ZFS and was concentrating on the new write throttle as possibly being the cause of your regression because of your write-heavy load. The new write throttle work also included some refactoring of ZFS' aggregation code.
As to the block layer's scheduler merging, since you're using whole disks, ZoL should be disabling the elevator but it looks like you may have got cfq enabled. I don't remember, offhand, whether this logic has changed over time in ZoL or whether there have been any bugs with it. I'd suggest you check with something like head /sys/block/*/queue/scheduler and if they're not noop, then change them all to noop and see what happens.
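A sketch of that check and the corresponding change (not from the original posts; the sd* glob is an assumption about the device names):
head /sys/block/sd*/queue/scheduler
# if anything other than [noop] is selected:
for q in /sys/block/sd*/queue/scheduler; do echo noop > "$q"; done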
@dweeezil
I was mainly concerned about the aggregation performed within ZFS and was concentrating on the new write throttle as possibly being the cause of your regression because of your write-heavy load. The new write throttle work also included some refactoring of ZFS' aggregation code.
OK, I guess this was demonstrated by the jump in iostat avgrq-sz (versus the dd bs) in the dd results.
As to the block layer's scheduler merging, since you're using whole disks, ZoL should be disabling the elevator but it looks like you may have got cfq enabled. I don't remember, offhand, whether this logic has changed over time in ZoL or whether there have been any bugs with it. I'd suggest you check with something like head /sys/block/*/queue/scheduler and if they're not noop, then change them all to noop and see what happens.
We are already using the noop scheduler.
I wonder if others using compression=off and a noop elevator also see no write merges, as this was definitely not the behavior we saw in 0.6.2.
In any case, thanks for all your help
@crollorc Can you please clarify whether this problem occurs with 0.6.4? One interesting post-0.6.4 change was the task priority fix. Prior to 0.6.5, the zio write threads ran at a much lower priority (higher raw priority value under Linux) than they were supposed to. If you had a lot of nfsd threads pushing writes to them, it's possible that the ZFS aggregation code had fewer IOs to work with. If you don't see this issue with 0.6.4 but do see it with 0.6.5, it might be due to the task priority adjustment.
Hi,
I'm seeing the same behaviour after updating from 0.6.4 to 0.6.5. Version 0.6.5 is much heavier on the writes being flushed to disks. My observations:
My pool:
pool: userdata
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(5) for details.
scan: scrub canceled on Thu May 12 09:19:13 2016
config:
NAME STATE READ WRITE CKSUM
userdata ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
scsi-35000c50058ab00e7 ONLINE 0 0 0
scsi-35000c50058ab3d43 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
scsi-35000c50058ab1e6f ONLINE 0 0 0
scsi-35000c50058ab42ef ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
scsi-35000c50058ab7a8b ONLINE 0 0 0
scsi-35000c50058ab2da3 ONLINE 0 0 0
logs
mirror-3 ONLINE 0 0 0
scsi-SATA_INTEL_SSDSC2BA1BTTV437401F9100FGN ONLINE 0 0 0
scsi-SATA_INTEL_SSDSC2BA1BTTV43730834100FGN ONLINE 0 0 0
cache
scsi-35e83a9710005e87d-part1 ONLINE 0 0 0
spares
scsi-35000c500625ff913 AVAIL
Txgs:
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
10793211 312084106776788 C 68587008 8612352 20948480 756 4330 10000167442 4250 5997 2692029204
10793212 312094106944230 C 62390272 7127552 23287808 1401 4039 10004438963 7385 9050 3514455600
10793213 312104111383193 C 80222720 10573824 31437312 1148 5861 10000334677 4690 6464 2527310364
10793214 312114111717870 C 61217792 6523392 25288704 1347 5121 10000254737 9817 11140 3950554512
10793215 312124111972607 C 64391168 9314304 40939008 1033 4203 10000733954 6719 12856 4010043119
10793216 312134112706561 C 59781632 24415744 24387072 1285 3270 9999964766 6732 8780 3851472777
10793217 312144112671327 C 44187648 4140544 14866432 786 2452 9999383400 4533 6290 3181786317
10793218 312154112054727 C 60757504 8823296 22792704 1392 3056 10000743534 4368 9225 3982234157
10793219 312164112798261 C 42994176 3391488 15173120 609 2472 10000054193 4357 8645 2739292926
10793220 312174112852454 C 37067776 5884928 14753280 927 2202 9999231541 4567 9956 3228039487
10793221 312184112083995 C 56749568 5754368 20854272 1166 3333 10000896011 4931 7163 3946754118
10793222 312194112980006 C 56281088 8103936 21481984 857 4062 10000885793 4550 8623 3045669876
10793223 312204113865799 C 71583232 4467712 28741632 694 4329 10000099090 7305 9762 3024174973
10793224 312214113964889 C 70732800 6185984 30164992 861 3869 9999279064 4789 7715 3627366045
10793225 312224113243953 C 61769728 17265664 22879232 1494 4572 10000041677 6715 12030 3915833094
10793226 312234113285630 C 70991360 44334080 39017472 1627 5993 10000022878 4590 8579 4292663828
10793227 312244113308508 C 55891456 34114560 20900864 1136 3888 10000849053 6443 7677 3264109345
10793229 312264113423647 C 47859200 30145536 19790848 1883 2812 10000797960 4874 6283 4633259277
10793230 312274114221607 C 182593024 19094528 72361984 1641 10974 9999955309 7216 11589 4878448168
10793231 312284114176916 C 105703424 4952576 33337344 676 5762 9999255379 4655 10300 2943566267
10793232 312294113432295 C 56116224 8543744 20745216 1125 3288 10000125459 4935 7960 3491906603
10793233 312304113557754 C 66479616 6044160 26195456 1108 4850 10001592357 6613 9548 3806978876
10793234 312314115150111 C 54763008 4553216 26543104 582 3274 10000157898 5583 11509 3701671774
10793235 312324115308009 C 42426880 14307840 15518208 1194 2520 9999282329 4727 11780 3694537855
10793236 312334114590338 C 57604608 14051328 21906944 1464 3866 10000930645 5122 9420 4263669008
10793237 312344115520983 C 58293248 5001728 22450688 924 3346 10000983924 11828 8393 3497986043
10793238 312354116504907 C 46852096 7068672 17994240 1316 2770 10001032639 8727 8535 3058301827
10793239 312364117537546 C 42475008 9290240 16683008 901 3047 10000060973 5130 6193 3822290645
10793240 312374117598519 C 51773440 8830464 19445760 1164 3034 10000071321 4729 8658 3780294235



@erikjanhofstede I never got a chance to test 0.6.4, unfortunately.
Did you track iostat avgrq-sz? I found this halved for me with compression=on between 0.6.3 and 0.6.5.
@crollorc I only tested between 0.6.4 and 0.6.5. I don't have any stats from 0.6.3. I tried with and without compression on 0.6.5, but didn't notice any difference by the way.
I'm trying to reproduce the issue on a testing server, but that pool is almost empty and there's no workload either.
@dweeezil If this issue is caused by the new priorities, is it possible that disabling spl_taskq_thread_priority can give us any insight into this?
@crollorc I don't have any graphs when I was updating from 0.6.3 to 0.6.4 many months ago, but there was no performance regression for me as far as I can remember. Only improvements.
@erikjanhofstede In the latest master code zpool iostat has been extended to include latency (-w) and request size (-r) histograms. They provide an excellent view into how well (or poorly) the ZFS I/O pipeline is merging requests _before_ they are submitted to the block device.
If you have a test system available and we suspect the thread priority changes, then I'd suggest running your test case using the master source and collecting the request size histogram. You can then revert the thread priority changes, 1229323d5f82bcb0525b312c80330968cd41466e, and run the same test to see if there's a difference. I'd be very interested in your results.
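Collecting the histograms described above amounts to something like this (a sketch; the pool name and interval are just placeholders taken from later in this thread):
# request-size histogram, two 60-second samples
zpool iostat -r userdata 60 2
# latency histogram over the same interval
zpool iostat -w userdata 60 2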
@behlendorf I've reverted 1229323 (had to fix some conflicts though). My test consists of running the commands below in parallel; otherwise the server is idle, with an almost empty zpool, unlike the server that I posted graphs about before.
while true; do dd if=CentOS-7-x86_64-Everything-1511.iso_1 of=CentOS-7-x86_64-Everything-1511.iso_2 bs=4k; done
while true; do dd if=CentOS-7-x86_64-Everything-1511.iso_3 of=CentOS-7-x86_64-Everything-1511.iso_4 bs=4k; done
while true; do dd if=CentOS-7-x86_64-Everything-1511.iso_5 of=CentOS-7-x86_64-Everything-1511.iso_6 bs=4k; done
The results running on master, polled over a period of 60 seconds (zpool iostat -r 60 2):
userdata sync_read sync_write async_read async_write scrub
req_size ind agg ind agg ind agg ind agg ind agg
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
512 9 0 0 0 0 0 25 0 0 0
1K 1 0 6 0 0 0 15 1 0 0
2K 7 0 0 0 0 0 22 2 0 0
4K 1 0 0 0 4 0 24 4 0 0
8K 0 0 0 0 0 0 10 7 0 0
16K 0 0 0 0 0 0 5 4 0 0
32K 0 0 0 0 3 0 11 3 0 0
64K 0 0 0 0 16 0 106 18 0 0
128K 0 0 0 0 321 0 2.13K 0 0 0
256K 0 0 0 0 0 0 0 0 0 0
512K 0 0 0 0 0 0 0 0 0 0
1M 0 0 0 0 0 0 0 0 0 0
2M 0 0 0 0 0 0 0 0 0 0
4M 0 0 0 0 0 0 0 0 0 0
8M 0 0 0 0 0 0 0 0 0 0
16M 0 0 0 0 0 0 0 0 0 0
--------------------------------------------------------------------------------
userdata sync_read sync_write async_read async_write scrub
req_size ind agg ind agg ind agg ind agg ind agg
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
512 27 0 0 0 0 0 21 0 0 0
1K 1 0 5 0 0 0 15 0 0 0
2K 9 0 0 0 0 0 19 3 0 0
4K 5 0 0 0 4 0 23 3 0 0
8K 0 0 0 0 0 0 10 7 0 0
16K 0 0 0 0 0 0 3 3 0 0
32K 0 0 0 0 3 0 13 3 0 0
64K 0 0 0 0 21 0 122 17 0 0
128K 0 0 0 0 309 0 1.97K 0 0 0
256K 0 0 0 0 0 0 0 0 0 0
512K 0 0 0 0 0 0 0 0 0 0
1M 0 0 0 0 0 0 0 0 0 0
2M 0 0 0 0 0 0 0 0 0 0
4M 0 0 0 0 0 0 0 0 0 0
8M 0 0 0 0 0 0 0 0 0 0
16M 0 0 0 0 0 0 0 0 0 0
--------------------------------------------------------------------------------
The results running on master, without 1229323, polled over a period of 60 seconds (zpool iostat -r 60 2):
userdata sync_read sync_write async_read async_write scrub
req_size ind agg ind agg ind agg ind agg ind agg
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
512 27 0 0 0 0 0 35 0 0 0
1K 5 0 8 0 0 0 27 2 0 0
2K 8 0 0 0 0 0 25 3 0 0
4K 1 0 0 0 6 0 25 4 0 0
8K 1 0 0 0 0 0 13 9 0 0
16K 0 0 0 0 0 0 7 4 0 0
32K 0 0 0 0 2 0 8 4 0 0
64K 0 0 0 0 12 1 98 20 0 0
128K 2 0 0 0 235 0 2.32K 0 0 0
256K 0 0 0 0 0 0 0 0 0 0
512K 0 0 0 0 0 0 0 0 0 0
1M 0 0 0 0 0 0 0 0 0 0
2M 0 0 0 0 0 0 0 0 0 0
4M 0 0 0 0 0 0 0 0 0 0
8M 0 0 0 0 0 0 0 0 0 0
16M 0 0 0 0 0 0 0 0 0 0
--------------------------------------------------------------------------------
userdata sync_read sync_write async_read async_write scrub
req_size ind agg ind agg ind agg ind agg ind agg
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
512 77 0 0 0 0 0 41 0 0 0
1K 3 0 5 0 1 0 30 1 0 0
2K 11 0 0 0 1 0 22 4 0 0
4K 7 0 0 0 4 0 30 6 0 0
8K 1 0 0 0 0 0 11 8 0 0
16K 0 0 0 0 0 0 11 3 0 0
32K 0 0 0 0 3 0 17 4 0 0
64K 0 0 0 0 14 0 160 21 0 0
128K 2 0 0 0 192 0 2.13K 0 0 0
256K 0 0 0 0 0 0 0 0 0 0
512K 0 0 0 0 0 0 0 0 0 0
1M 0 0 0 0 0 0 0 0 0 0
2M 0 0 0 0 0 0 0 0 0 0
4M 0 0 0 0 0 0 0 0 0 0
8M 0 0 0 0 0 0 0 0 0 0
16M 0 0 0 0 0 0 0 0 0 0
--------------------------------------------------------------------------------
@erikjanhofstede thanks for running the test. Based on these results it doesn't look like 1229323 has any real effect.
@behlendorf I'm unable to reproduce the issue I'm having on my testing server, but it's very clearly visible on production servers. Can you think of a safe way to try some things out on a production server? In the meantime I'll continue to try to reproduce the issue.
@erikjanhofstede How about reverting the priority changes on your production server? That should be completely safe; after all, ZoL had the old priority settings for quite a long time. My original suggestion that the priority changes were somehow involved was based on the guess that there was sufficient additional load from other tasks on the system, running at the normal user priority (120), which starved the zio threads (running at a lower priority of around 129-130 under 0.6.4, IIRC) and allowed more pending IO to build up, increasing the effectiveness of the coalescing.
@dweeezil I'm running a server with 0.6.5.6 with 1229323 reverted, but that doesn't make any difference. Are there any other commits that I need to revert besides this one to figure out if these new priorities are causing the issue?
How about this pull request? And this commit, 62aa81a?
Besides reverting these changes, how about just setting spl_taskq_thread_priority=0? Will this revert the changes?
@erikjanhofstede The pull request to which you refer is the one which resulted in 1229323 being committed. It looks like neither 62aa81a nor the related 9dc5ffb were in 0.6.4, either. I had forgotten about the related SPL changes which went along with 1229323 so, yes, in order to restore the old (incorrect) 0.6.4 priorities, you'd need to revert both of them in your SPL as well.
Setting spl_taskq_thread_priority=0 will cause all the zio threads to run at priority 120, which exactly matches most of your user processes, so this won't quite duplicate the behavior under 0.6.4. Here's a bit of further explanation:
You can get a quick display of the priorities of your zio tasks by running ps -eL -o pri_baz,cmd | grep z_. Keep in mind that 120 is the default Linux priority; lower values mean higher priority and higher values mean lower priority. In a 0.6.5 system, all of the zio threads except for the write issue threads run at priority 100, which is a very high priority. The write issue threads run at the slightly lower, but still high, priority of 101. In 0.6.4, the situation was quite different, and it's been a while so I may not be describing it exactly correctly: I think all the zio threads were running at priority 129 or 130, the very lowest user priority possible, except that the write issue threads were running at a priority value one less (a higher priority). The situation was exactly the reverse of what was desired. The effect, however, is that _all_ the zio threads were running at a much lower priority than typical user processes and could potentially have been starved of CPU. According to my theory, that might have allowed more IOs to build up, with the side effect of making coalescing more effective.
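As a quick sanity check, something along these lines summarizes how many zio threads are running at each priority (a minimal sketch assuming procps-ng's pri_baz column used above; the bracketed thread-name pattern may need adjusting on other versions):
# List each zio kernel thread with its scheduling priority (pri_baz)
ps -eL -o pri_baz,cmd | grep '\[z_'
# Summarize: number of zio threads at each priority value
ps -eL -o pri_baz,cmd | awk '$2 ~ /^\[z_/ {n[$1]++} END {for (p in n) print p, n[p]}' | sort -n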
Jumping on the bandwagon here. Any further developments on this?
Background: Upgraded multiple (5 in total) established systems (running smoothly for a year+) from 0.6.4.2 to the current 0.6.5.7. There was an immediate, noticeable jump in the # of write operations across all systems, with correspondingly higher latencies. I've attempted to use some of the new 0.6.5 tuning parameters to regain the 0.6.4 performance levels, to no avail.
I'll be downgrading the heaviest IO system back to 0.6.4 tomorrow during a maintenance window, but have other systems available for testing. In particular, we have a nearly-identical drbd "mirror" of the system I'll be downgrading. I'm open to suggestions; the initial plan on the drbd mirror system is to start with a downgrade to 0.6.5.4 (the version reported by the OP) and step back from there until the write IO hopefully drops down to the 0.6.4 level.
If a particular patch / tuning parameter(s) has been identified that addresses the above, I would be very interested in that info as well.
Thanks in advance!
@erikjanhofstede What is your pool geometry (zpool status)? My testing has also shown that the priorities don't seem to make any difference.
@dweeezil
pool: userdata
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(5) for details.
scan: scrub canceled on Sun Jun 12 04:29:37 2016
config:
NAME STATE READ WRITE CKSUM
userdata ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
scsi-35000c50059117303 ONLINE 0 0 0
scsi-35000c500626018f7 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
scsi-35000c50062602c3b ONLINE 0 0 0
scsi-35000c500625fb53f ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
scsi-35000c500625ff1e3 ONLINE 0 0 0
scsi-35000c500591ae2a7 ONLINE 0 0 0
logs
mirror-3 ONLINE 0 0 0
scsi-SATA_INTEL_SSDSC2BA1BTTV335106X9100FGN ONLINE 0 0 0
scsi-SATA_INTEL_SSDSC2BA1BTTV335403VE100FGN ONLINE 0 0 0
cache
scsi-35e83a9710005e4fb-part1 ONLINE 0 0 0
spares
scsi-35000c500591a75bf AVAIL
errors: No known data errors
@erikjanhofstede changes to the metaslab code could definitely cause this, and they would more significantly impact older, fuller, more fragmented pools like yours. However, the majority of those changes went into 0.6.3, so I'd have expected both releases to be impacted. Still, I'd suggest investigating the following tunables, which are described in zfs-module-parameters(5).
If your system has a few GB of extra memory I'd suggest starting by setting metaslab_debug_unload=1. This will prevent ZFS from unloading any of the metaslabs from memory which could generate significant additional IO if they're being unloaded/loaded constantly. If this resolves the issue you may be able to fine tune things using some of the other options.
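For reference, a minimal sketch of applying that suggestion (whether the parameter is writable at runtime may vary by release, and the zfs.conf file name is just a convention; any file under /etc/modprobe.d/ works):
# Apply at runtime, assuming the parameter is writable on this ZoL version
echo 1 > /sys/module/zfs/parameters/metaslab_debug_unload
# Confirm the current value
cat /sys/module/zfs/parameters/metaslab_debug_unload
# Make it persistent across module reloads
echo "options zfs metaslab_debug_unload=1" >> /etc/modprobe.d/zfs.conf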
metaslab_aliquot (ulong)
Metaslab granularity, in bytes. This is roughly similar to what would be
referred to as the "stripe size" in traditional RAID arrays. In normal
operation, ZFS will try to write this amount of data to a top-level vdev
before moving on to the next one.
Default value: 524,288.
metaslab_bias_enabled (int)
Enable metaslab group biasing based on its vdev's over- or under-utiliza‐
tion relative to the pool.
Use 1 for yes (default) and 0 for no.
metaslab_debug_load (int)
Load all metaslabs during pool import.
Use 1 for yes and 0 for no (default).
metaslab_debug_unload (int)
Prevent metaslabs from being unloaded.
Use 1 for yes and 0 for no (default).
metaslab_fragmentation_factor_enabled (int)
Enable use of the fragmentation metric in computing metaslab weights.
Use 1 for yes (default) and 0 for no.
metaslabs_per_vdev (int)
When a vdev is added, it will be divided into approximately (but no more
than) this number of metaslabs.
Default value: 200.
metaslab_preload_enabled (int)
Enable metaslab group preloading.
Use 1 for yes (default) and 0 for no.
metaslab_lba_weighting_enabled (int)
Give more weight to metaslabs with lower LBAs, assuming they have greater
bandwidth as is typically the case on a modern constant angular velocity
disk drive.
Use 1 for yes (default) and 0 for no.
zfs_metaslab_fragmentation_threshold (int)
Allow metaslabs to keep their active state as long as their fragmentation
percentage is less than or equal to this value. An active metaslab that
exceeds this threshold will no longer keep its active status allowing bet‐
ter metaslabs to be selected.
Default value: 70.
zfs_mg_fragmentation_threshold (int)
Metaslab groups are considered eligible for allocations if their fragmenta‐
tion metric (measured as a percentage) is less than or equal to this value.
If a metaslab group exceeds this threshold then it will be skipped unless
all metaslab groups within the metaslab class have also crossed this
threshold.
Default value: 85.
zfs_mg_noalloc_threshold (int)
Defines a threshold at which metaslab groups should be eligible for alloca‐
tions. The value is expressed as a percentage of free space beyond which a
metaslab group is always eligible for allocations. If a metaslab group's
free space is less than or equal to the threshold, the allocator will avoid
allocating to that group unless all groups in the pool have reached the
threshold. Once all groups have reached the threshold, all groups are
allowed to accept allocations. The default value of 0 disables the feature
and causes all metaslab groups to be eligible for allocations.
This parameter allows to deal with pools having heavily imbalanced vdevs
such as would be the case when a new vdev has been added. Setting the
threshold to a non-zero percentage will stop allocations from being made to
vdevs that aren't filled to the specified percentage and allow lesser
filled vdevs to acquire more allocations than they otherwise would under
the old zfs_mg_alloc_failures facility.
Default value: 0.
I'll note, however, that 0.6.4 should have the very same metaslab improvements so it's not something that should have changed between 0.6.4.X and 0.6.5.X.
@dweeezil @behlendorf I'll start with trying to set metaslab_debug_unload=1, see if that makes any difference. Is this something I can do online, or does it require a reboot/re-load of the kernel module?
@erikjanhofstede @dweeezil @behlendorf Spent most of today running through versions of 0.6.5.X on a more isolated system. The problem appears to manifest between 0.6.5.1 and 0.6.5.2 (see graphs below). These are taken from a stand-alone system which acts as storage for video recording. The steady-state IO load is a constant stream of around 1.5 MBps writes from a single virtual machine. The zfs server has 4 GB ram (and what turned out to be a really low-end controller from Dell) and a single disk (actually a six-disk hardware RAID10 array) presented to zfs. In normal production, it runs with sync=disabled which dramatically lowers the latency.


@olw2005 what did you use to produce those charts?
@mailinglists35 Cropped from the VMware vsphere client. =)
@olw2005 thanks! You've really narrowed the field of possible causes. Of the 13 patches applied between 0.6.5.1 and 0.6.5.2 only the following few have the potential to have caused something like this. In order of most to least likely:
057b87c Fix synchronous behavior in __vdev_disk_physio()
c1013eb Linux 4.3 compat: bio_end_io_t / BIO_UPTODATE
f4943dc Avoid blocking in arc_reclaim_thread()
My money's on 057b87c as the most likely cause. I suspect you're now unconditionally waiting on a bio where the previous code perhaps wasn't, due to the exact kernel configure check. This may in fact be more correct behavior even if it's not as performant. @olw2005 can you please check:
/sys/block/sdX/queue/rotational to verify whether these disks are reported as rotational or non-rotational. That should give us a very clear idea of what the issue is. Thanks for doing all the heavy lifting on this!
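For anyone following along, a one-liner to check that flag for every disk at once (the sd* glob is just an example; substitute the actual pool members):
# Print the rotational flag per disk: 1 = rotational, 0 = non-rotational (SSD)
for d in /sys/block/sd*/queue/rotational; do echo "$d: $(cat "$d")"; done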
@behlendorf @erikjanhofstede Just trying to do my part for the cause. =) Time permitting, I'll try to remove the above patch sets today. Will also be confirming the problem on another (higher end) system today if I can sneak in the downgrade. To answer your questions above:
[root@dtc-sannvr ~]# cat /etc/redhat-release
CentOS release 6.7 (Final)
[root@dtc-sannvr ~]# uname -r
2.6.32-573.12.1.el6.x86_64
(Also occurs with the newer Centos 6.8 kernel 2.6.32-642.el6.x86_64)
[root@dtc-sannvr ~]# cat /sys/block/sda/queue/rotational
1
@erikjanhofstede reverting 057b87c and aa159af will introduce a small race in the IO pipeline which can cause it to stall. Depending on the workloads on your system and its configuration you might never hit this race or you might hit it daily. It's impossible to say. Worst case you'll need to reboot the system.
@olw2005 thanks for the info.
Ok, further testing today. My git-fu is rusty so for simplicity, I used the zfs "point-in-time" as of commits f4943dc followed by the next commit, 057b87c. First graph shows performance for zfs-commit-f4943dc. The second graph shows the tail end of f4943dc on the left and 057b87c on the right. The contrast is readily apparent.


However, the plot thickens... As mentioned earlier, I downgraded one of our higher-spec hardware servers this morning from 0.6.5.7 to 0.6.5.1 hoping to see the same decrease in latency / increase in performance. But instead saw the reverse -- performance of 0.6.5.1 was significantly worse. (Side note, this one also ran very nicely on 0.6.4.2.)
Must be some other commit that has something to do with the different hardware / zfs configuration in this case. ?? I'll increment the version number here as well to try to pin it down, but it will take longer as I need to "sneak it in" outside of business hours at that particular location. Hardware and zfs config details for this server:
2 x Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
192GB RAM
HP P421 RAID controller (2 GB cache)
12 x 3TB SATA 7200rpm disks
zero SSD's
[root@sctc-san2 ~]# cat /etc/redhat-release
CentOS release 6.8 (Final)
[root@sctc-san2 ~]# uname -r
2.6.32-642.el6.x86_64
[root@sctc-san2 ~]# cat /etc/modprobe.d/zfs.conf
options zfs zfs_nocacheflush=1 zfs_prefetch_disable=1 zfs_arc_max=137438953472 zfs_arc_min=103079215104
[root@sctc-san2 ~]# zpool status
pool: pool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
pool ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg ONLINE 0 0 0
raidz2-1 ONLINE 0 0 0
sdh ONLINE 0 0 0
sdi ONLINE 0 0 0
sdj ONLINE 0 0 0
sdk ONLINE 0 0 0
sdl ONLINE 0 0 0
sdm ONLINE 0 0 0
errors: No known data errors
[root@sctc-san2 ~]# zpool get all | grep -v default
NAME PROPERTY VALUE SOURCE
pool size 32.5T -
pool capacity 31% -
pool health ONLINE -
pool autoreplace on local
pool dedupratio 1.00x -
pool free 22.3T -
pool allocated 10.2T -
pool readonly off -
pool expandsize - -
pool fragmentation 39% -
pool feature@async_destroy enabled local
pool feature@empty_bpobj active local
pool feature@lz4_compress active local
pool feature@spacemap_histogram active local
pool feature@enabled_txg active local
pool feature@hole_birth active local
pool feature@extensible_dataset enabled local
pool feature@embedded_data active local
pool feature@bookmarks enabled local
pool feature@filesystem_limits enabled local
pool feature@large_blocks enabled local
[root@sctc-san2 ~]# zfs get all pool | grep -v default
NAME PROPERTY VALUE SOURCE
pool type filesystem -
pool creation Wed Sep 2 11:43 2015 -
pool used 6.81T -
pool available 14.2T -
pool referenced 32.0K -
pool compressratio 1.68x -
pool mounted no -
pool mountpoint none local
pool checksum off local
pool compression lz4 local
pool atime off local
pool xattr off local
pool version 5 -
pool utf8only off -
pool normalization none -
pool casesensitivity sensitive -
pool usedbysnapshots 0 -
pool usedbydataset 32.0K -
pool usedbychildren 6.81T -
pool usedbyrefreservation 0 -
pool refcompressratio 1.00x -
pool written 32.0K -
pool logicalused 11.4T -
pool logicalreferenced 9.50K -
@olw2005 Do all of these systems you're testing use zfs_nocacheflush=1?
@dweeezil In general, yes. I'm not tied to the setting, but if I recall correctly (and it's been a few years since I set it, so I could be wrong) the HP Smart Array controllers we're using (P421s with 2GB of BBWC) perform better without the flush.
However, come to think of it the graphs and testing above (apart from the last entry above) are on Dell systems with different lower-end controllers. Hmm...
I'd be curious to know some of the block queue settings of the underlying vdevs for anyone seeing this issue. At the very least, /sys/block/<block_device>/queue/max_sectors_kb. Also whether anyone is using large blocks (zpool get feature@large_blocks is "active"). Pinging @olw2005.
@dweeezil /sys/block/*/queue/max_sectors_kb is set to 512 for all zpool devices over here. The large_blocks feature is disabled (I haven't upgraded the zpool).
I've been going over the various related commits: b39c22b, 5592404 and aa159af. The latter two are, respectively, 057b87c and cd887ab on zfs-0.6.5-release.
@olw2005 reported seeing different behavior between 057b87c and its previous commit.
One of the most interesting things I realized about 5592404 (057b87c) is that the new wait parameter to __vdev_disk_physio() will _never_ be set under ZoL because ZoL doesn't use spa_import_rootpool() which means, ultimately, vdev_disk_physio() is never used. This has a couple of effects: first, there will never be any wait_for_completion(&dr->dr_comp) calls and, second, the calls to complete() in vdev_disk_physio_completion(), if they even happen, will not actually be unblocking a waiter.
Prior to 5592404 (057b87c), however, the wait_for_completion(&dr->dr_comp) would happen any time a dio was marked as synchronous, which pretty much means IOs initiated from zio_flush(). In other words, the wait_for_completion(&dr->dr_comp) should have happened relatively frequently.
I think the extra waiting was actually causing the better performance (larger IOs) because it allowed the bio's to stay plugged longer.
As for aa159af (cd887ab), it seems the only reason it actually fixed something is because it effectively eliminated the calls to complete() altogether since it added a requirement that wait be set.
Which brings us back to b39c22b: It converted the bio's to READ_SYNC or WRITE_SYNC depending on their ZIO type in order to effectively bypass the plugging and implicit delay which would otherwise have occurred. This seems like a perfectly reasonable thing to do: generate synchronous bio requests for operations which ZFS, itself, thinks ought to be synchronous.
How about this for a fix: [EDIT] 35a550e 0caf657 (pinging @behlendorf). Totally untested at the moment but posted for feedback. I'll try actually using it later.
Of course I realized right after writing the patch that it's probably not much different than looking at flags in __vdev_disk_physio() with the advantage that it's looking for ZFS-specific flags (the ZIO priority).
Hello,
In case this helps: I noticed a similar latency increase after updating ZoL from 0.6.4.2 to 0.6.5.x. This was introduced with commit 37f9dac592bf5889c3efb305c48ac39b4c7dd140. The last commit with the good latency was dca8c34da4212ed85a92111f26bd3a3cd782f270.
What do you think about introducing a module parameter that allows using the old mechanism?
Preliminary testing shows that 0caf657 (note updated commit) seems to do the Right Thing: sync writes from userland do cause sync writes to a slog without any unplug timer expiration. Also, the synchronous blocking (completion waiting) is working again, which, if my theory is correct, will restore the aggregation and the lower latency for which this issue was originally opened.
I was using a 2.6.32-573.22.1.el6 during all my previous testing to be in line with the kernel used by @olw2005.
Next, I decided to try a 4.6.3 kernel and in order to get a baseline behavior, started with current master and reverted aa159af, 5592404 and b39c22b. The behavior described in [EDIT, ugh] b39c22b aa159af did _not_ occur; there were no unplug timer events nor were there any wait_for_completion() calls. Unless some other post-0.6.5.x ZoL change affected things, it seems changes in the kernel's block layer may also come into play.
What kernel version are you using @crollorc? Forgive me if you mentioned it already but this issue is getting very long and it's difficult to search it for kernel version numbers.
@dweeezil We were on 3.2.0-102-generic on Ubuntu 12.04.5 for the performance problems which I described.
I don't see any mention of the kernel being used in #3529 or the commentary in b39c22b. Testing with a stock 3.2 also didn't show any plug/timer-unplug activity on a non-rotational (or rotational) slog device (or even on a normal vdev) so I have a feeling the behavior only occurs under older kernels; likely only the 2.6.X EL kernels.
@olw2005 Your difference between f4943dc and 057b87c is most likely explained by the difference in synchronous operations introduced with 057b87c. Since all your devices were listed as rotational, there will be _fewer_ synchronous writes. Apparently the 2.6.32 EL kernels benefit from the synchronous writes by bypassing the io queue plugging.
Other kernels are likely impacted differently by this family of changes which introduce synchronous reads and writes to the bio layer.
@crollorc Are all your vdevs rotational? I still can't come up with a good explanation as to why you'd see a difference so long as you're running 0.6.5.4 or newer.
@dweeezil Playing catch-up here, reading through lots of comments. Are you suggesting trying out that patch (b39c22b)? Would that be 0.6.5.7 + patch? It looks pretty innocuous. I assume that the chance of it munging my data is relatively low? _chuckle_
About the rotational vs. non-rotational. The config / stat dump above is from a smaller server, with all rotational disks. (No SSD SLOG / L2ARC devices.) Our larger servers DO have separate SLOG devices. I gather this could make a difference? Possibly a different patch set is needed depending on drive configuration?
I can run through one build + reboot cycle per day on the all-rotational server. Let me know what patch set you're thinking. And [hopefully it goes without saying but] thank you very much. Your efforts are appreciated! =)
Also not sure if it's still applicable, but here's the info you requested earlier. Pretty much the same on all our servers apart from one which I had not upgraded the "feature set" on (and that one is currently back-rev'd to 0.6.4.2):
[root@sctc-san2 ~]# cat /sys/block/sd[b-m]/queue/max_sectors_kb
512
512
512
512
512
512
512
512
512
512
512
512
[root@sctc-san2 ~]# zpool get feature@large_blocks
NAME PROPERTY VALUE SOURCE
pool feature@large_blocks enabled local
@dweeezil @crollorc @behlendorf The comments above also mention / imply that the older kernel version in CentOS 6 is a contributing factor. Any thoughts on whether the somewhat-less-old CentOS 7 kernel (3.10.0 + backports) might play a little nicer with 0.6.5.x? I've been running a tiny C7 + zfs server at home on 0.6.5 for months without issue, not that I'd even notice a performance decrease given its light usage. These servers will need to be upgraded at some point in the next year or so anyway.
@olw2005 It looks like 2.6.38 was the last kernel to have the unplug timer mentioned in b39c22b that started the series of commits I've referenced. [EDIT: Yet I can see unplug timer events on a 4.5.5 kernel]
The sync-zio-to-sync-bio approach may have been the wrong way to deal with the unplug timer issue in < 2.6.39 kernels. Among other things, it very well may be contributing to the increased latency and smaller request sizes being discussed in this issue.
I'm working on a patch stack to deal with the < 2.6.39 kernel unplug timer issue, to revert the sync-zio-to-sync-bio conversion and also to remove (#ifdef out) the whole root pool import chain (spa_import_rootpool, spa_generate_rootconf, vdev_disk_physio, etc.) for the Linux port.
@dweeezil Sounds like a bit of doing there. Meantime I'll see if I can find a suitable server to test out a newer kernel.
I just pushed 1bfdd1f to dweeezil:syncfixes as a starting point. It _does_ cherry pick cleanly on to 0.6.5.7 and should restore at least the BIO sync behavior to pre-0.6.5 behavior. It also contains a fairly comprehensive commit comment with a bit of rationale.
@olw2005 Did you have the chance to try dweeezil's 1bfdd1f?
@dweeezil Is this a 'production ready' fix? (so it won't mangle my data)
@erikjanhofstede ran it through the buildbot via https://github.com/zfsonlinux/zfs/pull/4835
superficial errors are from whitespace,
three buildbots passed: http://build.zfsonlinux.org/builders/Amazon%202015.09%20x86_64%20%28TEST%29/builds/1738 , http://build.zfsonlinux.org/builders/Amazon%202015.09%20x86_64%20SIMD%20%28TEST%29/builds/826 , http://build.zfsonlinux.org/builders/CentOS%206.7%20x86_64%20%28TEST%29/builds/1754
the other ones errored out in ztest:
http://build.zfsonlinux.org/builders/Debian%208%20x86_64%20%28TEST%29/builds/1691 , http://build.zfsonlinux.org/builders/Ubuntu%2014.04%20x86_64%20%28TEST%29/builds/1787
<-- are those buildbot related ?
zfstests + zfsstress:
http://build.zfsonlinux.org/builders/CentOS%207.1%20x86_64%20%28TEST%29/builds/1750
<-- needs further investigation, or is it a known behavior ?
anyway - I took the plunge and it's stable so far in production
I'd wait for @dweeezil to confirm or advise against
I've had to step away from this issue for a bit to work on some other stuff. The 1bfdd1f patch _should_ have returned the sync BIO behavior to pre-0.6.5 and, compared to 0.6.5.7 in which no non-rotational vdevs are being used, it should be a noop. Other than that, all it does is remove some code which is dead under ZoL. In other words, it ought to be pretty safe.
I do still intend on coming up with a fix for pre-2.6.39 kernels.
@dweeezil @erikjanhofstede Status update. There's definitely something to the old kernel theory. I haven't tried the patch out yet, but just ran a baseline of vanilla 0.6.5.7 using an ELRepo kernel 3.10.102-1.el6.elrepo.x86_64 vs. the RH 2.6.32. Holy cow, what a difference! First pic is the 2.6.32 kernel, zfs 0.6.5.7, comparing two different SCST versions (no substantive difference).

Second pic is the 3.10 kernel and vanilla zfs 0.6.5.7. Note the significant drop in latency!

@dweeezil @erikjanhofstede Third performance graph with the 3.10 kernel and zfs 0.6.5.7 + patch set 1bfdd1f mentioned above. (Also to clarify, the graphs are a vmware client which dumps a more-or-less constant stream of video recordings at ~1.5 MB/s write speed.) It's not the enormous win that the newer kernel version is, but it does appear the patch has a positive impact on write latency / performance -- maybe 20-25% better?

Confirmed, the latency appears to be really down while doing multimedia (watching HD or HQ videos while smoothing it up to 60 FPS), or 2 music streams at once - no to barely perceptible interruptions or latency spikes
while either running zpool scrub, or running a metadata-heavy workload (where #4850 might also help quite nicely, looking forward to test it in real-world usage once the buildbots pass)
@dweeezil thanks for all those changes !
See #4858 for a version of the 1bfdd1f patch which should fix the unplug timer issue with 2.6.32-2.6.38 kernels.
Both #4850 and #4858 LGTM for merging to master. However, it would be great to get some independent confirmation that these two changes resolve this issue. We'll of course want to evaluate cherry picking these fixes back to the release branch. Thanks @dweeezil!
@behlendorf Actually, #4850 was created to solve problems unrelated to this one, but instead related to some other work I'm doing for which there is no specific PR currently filed, however, I think it'll fix a whole lot of issues with the typical rsync-interleaved-with-lots-of-data workloads and have indicated a few issues in its commit log for which it ought to help.
@behlendorf I'm planning on testing these changes this week.
In order to help this along, I'd like to try to summarize this issue a bit. First off, there are possibly several different underlying issues. The series of commits isolated by @olw2005 which add/fix synchronous BIO requests may have nothing to do with the original issue reported by @crollorc or others. Here's a quick summary:
@crollorc You posted the original issue, running a 3.2 kernel under Ubuntu Precise with ZoL 0.6.5.4 and have no nonrot devices in your pool. Since you're running 0.6.5.4, sync behavior _should_ be identical to that of 0.6.4 in that sync BIO requests should never happen and there should be no extraneous completion activity. My various patches, culminating with the #4858 sync fix PR should _not_, in theory, change anything for your case. If you've tried them, I'd sure like to know whether they had any effect but it sure seems like they should have not. Based on other subsequent comments, it does seem like it might be worth trying a newer kernel. Since you're running Precise, you can install the Trusty HWE from Ubuntu and get a much newer 3.13 kernel.
@olw2005 Your initial report concerned upgrading from 0.6.4.2 to 0.6.5.7 on a 2.6.32EL kernel. Apparently some of your servers have SSD log devices and some don't. For those that do, 0.6.5.7 _will_ cause some sync BIO requests and the requisite waits for completion and possibly cause some difference in performance. It also sounds like you've got various settings for either or both of sync=disabled and zfs_nocacheflush=1 (which is a whole separate story, BTW) and also have tested different storage controllers; some with and some without BBU (and an unspecified writeback policy set in the controller) so it's a bit hard to tell exactly what's being measured in all cases. I'm a bit confused by this particular posting of yours (emphasis mine):
Third performance graph with the 3.10 kernel and zfs 0.6.5.7 + patch set 1bfdd1f mentioned above. (Also to clarify, the graphs are a vmware client which dumps a more-or-less constant stream of video recordings at ~1.5 MB/s write speed.) It's not the enormous win that the newer kernel version is, but it does appear the patch has a positive impact on write latency / performance -- maybe 20-25% better?
Wasn't the 3.10 kernel the "newer kernel"? Did you really mean that you were running the 2.6.32 kernel with the 1bfdd1f patch? If you were really running 3.10 with the 1bfdd1f patch, I'd expect the results to be identical to the 3.10 kernel without it. If it was really with the 2.6.32 kernel, you should use the newer patch in #4858.
@erikjanhofstede Your initial report concerned an update from 0.6.4 to 0.6.5 which based on a subsequent follow-up is 0.6.5.6. I don't see any mention of the kernel version. Since you _do_ have an SSD log device and apparently aren't changing any of the sync/flushing parameters, the commits isolated by @olw2005 may very well be coming into play and the #4858 patch may very well make a difference. I presume that's what you're going to test so it'll be interesting to hear the results.
@dweeezil I've been running 0.6.5.7 + #4858 for 5 days now, and I don't see any difference in performance. My kernel: 2.6.32-673.8.1.lve1.4.14.el6.x86_64
Below are some charts of the utilization (%) and writes per second, made in Graphite. The server has been running 0.6.5.7 + #4858 starting from the 20th at around 06:30.


@dweeezil Thanks for the response! I believe the comparison there was between kernel 3.10 and zfs 0.6.5.7 with and without the 1bfdd1f patch.
In general we're using sync=enabled. sync=disabled was used to improve performance on the aforementioned "test" box, but sync=enabled was set for the above testing / graphs.
As you mentioned, we do have a variety of hardware configurations. I noticed the performance hit on all the systems upgraded from 0.6.4.2 to 0.6.5.7. The "test" system has the advantage of a known "static" I/O load for more accurate comparisons and is one I can easily take down for maintenance. (As an aside, all the systems use controllers with battery-backed write cache. My understanding is the zfs_nocacheflush option is OK in that scenario. In truth I haven't noted much performance difference with it on or off.)
Last week I reinstalled the above "test" system with a more optimal drive configuration, zfs v0.6.4.2 and default options for the kernel module (followed by re-importing the zvol with zfs send / recv). It began its life running 0.6.5, so I previously did not have comparison data for zfs 0.6.4.2 vs. 0.6.5 under a known constant load. I'll post the graphs in a follow-up, but the difference is significant.
@dweeezil @erikjanhofstede @crollorc @behlendorf As mentioned above, I now have direct comparison data for 0.6.4.2 and 0.6.5.7 given a constant I/O load (~1.5 MB / sec write I/O, negligible read I/O). The results are consistent with the performance degradation on other upgraded systems. Short version is: 0.6.4.2 = great, 0.6.5.7 = really not great, 0.6.5.7 with newer 3.10 kernel = somewhere in the middle, 0.6.5.1 with stock kernel = somewhere in the middle.
A few notes on the test system hardware / config: 4 GB RAM, 8 x 4TB SATA disks, Dell RAID controller (lsilogic-based, disks are in JBOD mode), Centos 6.8 (stock kernel), default zfs kernel options. The pool contains a single zvol shared to a single vm (in vmware) via iscsi (scst).
[root@dtc-sannvr ~]# zpool status
pool: pool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
pool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sda ONLINE 0 0 0
sdb ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
sdg ONLINE 0 0 0
sdh ONLINE 0 0 0
errors: No known data errors
[root@dtc-sannvr ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
pool 5.07T 9.21T 19K none
pool/nvrstore 5.07T 9.21T 5.07T -
[root@dtc-sannvr ~]# zfs get all pool | grep local
pool mountpoint none local
pool checksum off local
pool compression off local
pool atime off local
pool xattr off local
[root@dtc-sannvr ~]# zfs get all pool/nvrstore | grep -v default | grep -v -
NAME PROPERTY VALUE SOURCE
pool/nvrstore volsize 16T local
pool/nvrstore checksum off inherited from pool
pool/nvrstore compression off inherited from pool
GRAPHS
Write latency is the highlighted graph. Read latency is also depicted, but is inconsequential. Note because of the (helpful) change in drive configuration, these results are not comparable with the previous ones.
zfs 0.6.4.2, centos 6.8 with stock RH kernel

zfs 0.6.5.7, centos 6.8 with stock RH kernel and also with the elrepo 3.10 kernel

zfs 0.6.5.7 + patch, elrepo 3.10 kernel = no change vs no patch (this is as expected)

Observation: Roughly 20x increase in latency from 0.6.4.2 to 0.6.5.7. The newer 3.10 kernel cuts the latency in half to roughly a 10x increase. The qualitative experience was the same upgrading zfs on our other (main) systems.
I should have more time now to do additional patch testing, other diagnostic commands, etc. Just let me know if I can be of any assistance!
Warm Regards,
OLW
Further update to this issue after scouring the issues list and running tests on various revisions of zfs. I found a number of bug reports along the lines of "performance drop in virtual environment using 0.6.5" which are similar to the OP's and my issue. From these and lots of testing I've identified & verified three distinct sources of this performance drop. The first two have been or can be mitigated. The third issue is mentioned briefly / tangentially in bug report #3754 and that one looks somewhat more intractable. It appears (at least to my layman's eye) to be deeply intertwined with subsequent commits in the zvol portion(s) of the code.
As a reminder, the test case (shown in the graphs below) is a VMware virtual machine. A zvol is present via iscsi (scst) and formatted with vmfs (vmware filesystem). I/O is only from the one vm, and consists of a continuous write stream of around 1.5 MB/s with sporadic / very light read I/O.
Issue 1. As previously noted above, zfs commit b39c22b. Testing confirms this was later fixed by commit 057b87c. Any zfs version with b39c22b (and without 057b87c) takes a big latency hit.
Issue 2. Spl commit f7a973d (the taskq_dynamic feature). The "fix" for this is mentioned in another issue report.
echo 0 > /sys/module/spl/parameters/spl_taskq_thread_dynamic
Fairly dramatic & instantaneous difference as shown here. Testing with other zfs revisions shows they all behave the same for this specific load case.
avg write latency under 10ms (with the option set to zero)

Also note, the zfs revision in use here is 0.6.5.0 (commit 7a27ad0) minus the "zvol branch" commits [notably commit 37f9dac] and minus commit b39c22b.
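As a side note for anyone applying the Issue 2 workaround, a minimal sketch of making it persistent (the spl.conf file name is just a convention; any file under /etc/modprobe.d/ works):
# Disable dynamic taskqs at runtime, as above
echo 0 > /sys/module/spl/parameters/spl_taskq_thread_dynamic
# Persist across reboots / module reloads
echo "options spl spl_taskq_thread_dynamic=0" >> /etc/modprobe.d/spl.conf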
Issue 3. Zfs commit 37f9dac ("zvol processing should use struct bio"). See above graph and graphs below. These are just a representative subset of the overall testing.
0642 baseline
avg write latency ~6ms.

0650 excluding b39c22b, spl_taskq_thread_dynamic=0, zvol branch commits INCLUDED
avg write latency increases to ~100ms.

zfs git master branch (commit f26b4b3) minus the original commit 37f9dac and numerous later zvol-related commits that fail to apply cleanly due to interdependencies
avg write latency ~4 ms.

zfs git master branch (as of 22 Aug 2016) with all the trimmings (spl_taskq_thread_dynamic=0)
avg write latency ~75-80 ms

I was content to just sit at 0.6.4.2 until this was all sorted out. But I started reading about the various hole_birth issues which have various fixes contained in the 0.6.5 (and master) branch. (And a checksum comparison of one of my zvols tells me we are indeed affected by one or more of the hole_birth issues.) So now I have an interest in getting this performance issue sorted out sooner rather than later. But I'm not sure where to take it from here. Suggestions? Also, I'm sure I've left something out from the above. Questions, comments, etc. please let me know.
Final note. For reference, here are the other bug reports which mention similar issues: 4065, 4880, 3888, 4042, 4792, 4135, 3902, 4619, 4265, 3754, 4429, 3324, 3950.
Local testing showed no performance degradation from the "zvol branch" commits, but the other two commits mentioned above did make a big difference in local performance. The zvol branch performance issue only seems to express itself when a zvol is shared out via some mechanism: iscsi, fibre-channel, etc. This is echoed in some of the other bug reports listed above.
[EDIT @crollorc] @olw2005 I've not yet fully digested your new information, but am I correct in saying that the comments regarding dynamic taskqs are new information? In the context of the pre-rework zvol code in which the "zvol" taskq still exists, this might make sense from the standpoint that dynamic taskqs require a certain number of sequential operations to occur before spawning new threads, and that could impact latency.
It sounds like one of the real questions is why the zvol rework of 37f9dac (zvol processing should use struct bio) impacts latency so much for your workload in which the zvol is exported via iscsi. I've never looked into SCST so I've got no idea how it hooks into the zvol block devices. Have you tried running blktrace/blkparse on the zvols in these various situations to see whether there's anything obviously different?
@dweeezil I couldn't find the specific spot that references the dynamic taskq setting but I'm reasonably sure it was in one of the above issue #'s. I think the comment / suggestion of setting it to zero came from @behlendorf so it's not completely new. =)
I don't think there's anything special about SCST. In our case, I'm using block devices (as opposed to file-based devices which invoke the linux page cache). I believe scst uses the normal block device hooks in the kernel (hopefully my terminology is correct here)? The other aforementioned issues covered an array of different software. I think tgtd, lio and virtio were specifically mentioned, possibly some others I'm forgetting. The common themes seem to be zvols with various flavors of 0.6.5 presented to virtual machines / virtual environment via iscsi, fcoe, etc.
I'm not familiar with blktrace/blkparse; I'll look at the man pages and give them a go. That said, I did manage to compile cherry-picked versions of zfs containing the new iostat -w and -r options (with and without the zvol commits). I didn't see any difference in the -w or -r output between the two cases -- they basically look identical.
I agree, the question is why does 37f9dac have such a pronounced effect.
@dweeezil Ok, block trace data incoming... Would it be worthwhile including @ryao on this? The code in question is his baby; he may be able to shed additional light on things?
Ten seconds each of the zvol block device and sdb (one of the underlying physical disks in the pool - for reference, the pool is 8 x 4 TB drives setup as 4 sets of mirrored disks). First two links are the "bad" performance results. The zvol output looks odd, nothing but merges?
Blktrace of "bad" performance results:
zvol - http://pastebin.com/Ad18277Y
sdb - http://pastebin.com/iTvmbmgh
Blktrace of good performance results:
zvol - http://pastebin.com/zGzUvw6q
sdb - http://pastebin.com/3HRWXd5y
Again as a reminder, the I/O load is ~1.5 MB/s constant write stream. Nothing too taxing. The output is hieroglyphics to me, but based on the file size difference (particularly in the sdb traces) I'd say there's something peculiar going on. (However, I can say that the nvrstore_# processes are from the scst iscsi target software.) Thoughts?
[root@dtc-sannvr ~]# wc -l *_trace*.txt
4622 sdb_trace_2.txt <-- bad perf
1809 sdb_trace.txt <-- good perf
275 zvol_trace_2.txt <-- bad perf
1314 zvol_trace.txt <-- good perf
8020 total
As I'm writing this, it occurs to me that blktrace output from within the virtual machine might be useful too??
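For anyone wanting to reproduce this kind of capture, a hedged sketch of a typical blktrace/blkparse/btt workflow that produces output like the above (device name, duration and file names are examples):
# Capture 10 seconds of block-layer events from one leaf vdev
blktrace -d /dev/sdb -w 10 -o sdb_trace
# Convert the per-CPU trace files to readable text and a binary dump for btt
blkparse -i sdb_trace -d sdb_trace.bin > sdb_trace.txt
# Summarize queue/driver/device latencies (Q2Q, I2D, D2C, etc.)
btt -i sdb_trace.bin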
Partial btt output shown below for the above two cases, somewhat more intelligible than the raw data. I'm not that familiar with blktrace interpretation, but the btt output does highlight significant differences in the treatment of the I/O's. The "good performance" run spends significantly more time in "I2D" (time spent in the request queue) as opposed to "D2C" (driver and device time).
This is taken from the /dev/sdb "good perf" run (zfs without the 0.6.5 zvol changes):
==================== All Devices ====================
ALL MIN AVG MAX N
Q2Q 0.000009794 0.040899125 2.595884926 197
Q2G 0.000000357 0.000001027 0.000016522 198
G2I 0.000000547 0.000001754 0.000002867 198
I2D 0.000000381 0.002006721 0.002950211 198
D2C 0.000124630 0.001236773 0.022486040 198
Q2C 0.000126222 0.003246276 0.022490799 198
==================== Device Overhead ====================
DEV | Q2G G2I Q2M I2D D2C
---------- | --------- --------- --------- --------- ---------
( 8, 16) | 0.0317% 0.0540% 0.0000% 61.8161% 38.0982%
---------- | --------- --------- --------- --------- ---------
Overall | 0.0317% 0.0540% 0.0000% 61.8161% 38.0982%
And from the sdb "bad perf" run here (zfs master with all the zvol changes):
==================== All Devices ====================
ALL MIN AVG MAX N
Q2Q 0.000021895 0.016768780 1.912321773 512
Q2G 0.000000340 0.000000551 0.000001722 513
G2I 0.000000500 0.000001081 0.000016659 513
I2D 0.000000331 0.000000545 0.000001505 513
D2C 0.000116908 0.000794772 0.033262668 513
Q2C 0.000118730 0.000796950 0.033264640 513
==================== Device Overhead ====================
DEV | Q2G G2I Q2M I2D D2C
---------- | --------- --------- --------- --------- ---------
( 8, 16) | 0.0692% 0.1357% 0.0000% 0.0684% 99.7267%
---------- | --------- --------- --------- --------- ---------
Overall | 0.0692% 0.1357% 0.0000% 0.0684% 99.7267%
@olw2005 I'm looking at the blktrace/blkparse output of the zvols now, and before drawing any conclusions, I wanted to mention that the output of blkparse is rather poorly documented in the man page. In particular, the "RWBS" field's output is not fully documented. The "M" in the field stands for "meta" as in "metadata", which means the REQ_META flag has been passed to the bio layer. It's not clear what kernel you're using right now, but at least in 4.6 that flag isn't actually paid attention to by anything. It is set by a few filesystems such as ext4, gfs2 and xfs.
I've not run blktrace on a zvol in a long time, and it seems the new zvol architecture only allows blktrace to see the queue requests (I'm not sure why right now), but seeing the RWBS field is at least somewhat useful. Reducing the "good" zvol trace to just the queue requests and comparing the two, the only thing that really sticks out to me is that the "bad" trace seems to interleave the requests between CPU cores much more evenly, whereas the "good" trace performs a bunch on one CPU core and then a bunch on the other. Also, the "good" trace involved 3 CPU cores whereas the "bad" one only involved 2. The time between queue write requests doesn't look too different between the two (examined only by eye).
I'm curious what the "nvrstore" processes are. Something to do with SCST?
I'm still looking at the traces of the underlying vdev. The first thing that sticks out to me is all the timer unplugs in the "good" trace, which I find odd because I'd normally associate them with worse performance. Are both these traces using the same kernel? What kernel version?
@dweeezil First, thanks for taking the time. =) The kernel version in use at the moment is stock CentOS 6, 2.6.32-642.3.1. As a reminder, the behavior is the same with the newer 3.10 ELRepo "long term branch" kernel.
You're correct about the "nvrstore" processes. They are part of SCST, named after the "block device" name in its configuration.
The output for the "bad" zvol trace seemed a bit suspect to me (almost like it wasn't capturing everything??), which is why I ran it again at the physical disk level. Note I also included the first portion of the "btt" command output, which consolidates the raw information from blkparse into a more concise form. It shows the "good performance" trace spends about 2/3 of its time in "request queue" whereas the "bad" one spends almost no time there. Not sure what that means exactly, but I guess (??) the "bad" one is dumping everything down to the physical disk just as soon as it receives it?
One other point of interest, I had a go yesterday at unraveling a portion of the changes in commit 37f9dac. My programming skill isn't good enough to tackle zvol.c (not to mention the latest version of that file has changed significantly due to later patches), so I looked at the include/linux/blkdev_compat.h header file. I modified the ifdef checks to revert it back to previous settings (WRITE_BARRIER, REQ_HARDBARRIER, and REQ_FUA). The resulting performance was still bad, so I'm guessing that just leaves zvol.c itself?
This whole discussion makes me wonder if https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140 was the right way to go here. Removing the block layer interfaces has caused some real compatibility issues - qemu for instance, can't use virtio-scsi in passthrough (lun) mode without the SG interface which is now gone, and the latency thing is bad enough to be a blocker to adoption by storage vendors and informed engineers alike looking to run high performance storage with the consistency and data portability features of ZFS.
Using loop shims isn't ideal, including tcm_loop (SCST just plain doesn't work without the scsi-generic layer), as under heavy load data errors start to creep in from writes to the "wrong offset." I think I have this documented in an issue from a while back, but the end result is that abusive DB workloads over iSCSI end up in data corruption in the shim layer. Interestingly, LVM doesn't seem to do this - we feed ZVOLs as block devices to LXC containers which then build LVM atop them for Cinder volumes. Works great, except for the whole performance issue...
If there are GPL-specific block-device-related functions, can they be translated via SPL to match the Illumos counterparts, such that the ZFS code stays close to Illumos' version and SPL handles the idiosyncrasies of wrapping the Linux side? ZVOLs are a major use case for virtualization, and IMO getting performance to a point of being able to compete with some of the flash storage abominations flooding the market (don't get me wrong, some are great) is going to significantly help adoption and therefore support/commit rate.
During the slow test, have you measured the queue depth to the zvol (iostat -x, avgqu-sz) ?
@richardelling I think blktrace captures that (see links above). According to the manual, read & write depth are "the maximum queue depth ever observed". The output for the good test case shows write depth of 32 at the zvol level.
Unfortunately the "slow" test case was missing stats at the zvol block device level. I recall reading loss of disk stats was a side-effect of the zvol patchset. That said I also recall reading a later patch -- possibly only in the main branch -- that reintroduced those stats. I'll have to check if my "bad" test case included the zvol stats patch. I'm guessing not given the lack of interesting blktrace output at that level.
Side note, I did get valid output at the physical disk level but I'm not sure how useful that is. Per-cpu stats at that level showed a higher write depth for the "slow" test case.
Thoughts, suggestions, comments would be most welcome.
@sempervictus Appreciate your response. If you have any suggested workarounds / etc. I'd be very interested. I would really like to get to the latest version which includes the various hole birth issue fixes. It hasn't directly affected our environment so far as I can tell, but recent testing (shasum of the block device) of my offsite zvol snapshots shows that they have diverged from the primary snapshots. =/
Found the commit (8198d18ca7e05ec30139e10d658a8b06c0ec4c55) that brought back the zvol stats. Unfortunately no stats for me ... the RHEL / Centos 6 kernel fails that configure check. =/
My musings are mostly directed at @ryao. The zvol rework was his baby iirc.
As I understand the zvol problem, it has to do with the zvol being accessed in ranges as a file (essentially) as opposed to operations on multiple files in a dataset. How the intermediate layers of the old implementation amortized this, I'm not sure (though I have a guess or two), or why the new approach bottlenecks... What I do find odd is that the issues only crop up under certain I/O patterns - we have tried blk-mq and schedulers from bfq to deadline on the underlying devices and they rarely hit 20% load when this occurs. Other workloads, like ASM over iSCSI, scream (sync I/O at single depth to block device arrays imported via LIO iSCSI from zvols). I vote for gremlins as the probable cause, and don't yet have a new stable build to benchmark on hardware. Will post back some fio and iostat data when I do.
Been away trying various fio tests to nail down performance variance between 0.6.4.2 and 0.6.5.x without any success as yet.
Anyway, as mentioned in #4880, I've used ryao's 0.6.4.2 backport of essentially the zvol commit 37f9dac592bf5889c3efb305c48ac39b4c7dd140; this was also responsible for the blktrace / blkparse breakage, so only queue commands can be seen.
I'm also running zvols to support VMs under Xen; a full summary is in the issue above. This commit is certainly confirmed as the cause, but it seems very tricky to extricate it from the current code line. I was wondering if the passage of I/O requests through the virtualisation layer is exacerbating things with bursts of multiple I/O requests to the zvol, but simple fio tests on quiescent disks with various simple sync / async workloads are not showing this.
It is, however, painfully evident during our benchmarking. If I can help by applying any patch for this commit let me know, and I'll feed back results.
Getting back into this issue, I finally backed off my test system to a sufficiently old kernel (3.2.0) that allowed me to run these older revisions of spl/zfs. I just finished running a quick baseline test against a 10GiB zvol on a pool consisting of 4 10GiB tmpfs-backed vdevs. The SPL is at 4fa4cab which brings it to the time frame in which the zvol rework of 37f9dac was done.
The test is using fio with this configuration:
[test]
blocksize=8k
scramble_buffers=1
disk_util=0
invalidate=0
size=256m
numjobs=32
create_serialize=1
direct=1
filename=/dev/zvol/tank/v1
offset=0
offset_increment=256m
group_reporting=1
ioengine=libaio
iodepth=10
rw=write
thread=1
time_based=1
runtime=1m
fsync=0
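(Assuming the job file above is saved as, say, zvol-write.fio, it is run simply as shown below; note that rw=write with direct=1 against /dev/zvol/tank/v1 overwrites the zvol's contents, so only point it at a scratch volume.)
# Run the fio job file; the file name here is just an example
fio zvol-write.fio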
With ZFS at 782b2c3, right before the zvol rework, this test yields 30.9K IOPS with an average latency of 10.3ms. Advancing ZFS to 37f9dac (applying the base zvol re-work) brings the IOPS up to 34.9K and drops the average latency to 9.1ms. Not a huge difference but in every regard, the performance is _better_ with the new zvol code.
My guess is that anyone seeing a performance regression due to the zvol rework is using a consumer of the zvols that may not be interfacing with them in an obvious manner. The common thread here appears to be that various iSCSI targets are involved or something else in-kernel is in front of the zvol. I ran some similar tests on my laptop in which I configured LIO to provide the zvol as a target and _did_ get worse performance across LIO, but it wasn't a very controlled experiment.
I'm going to run a more extensive battery of tests now that I have the system set up correctly but for normal native zvol access by an application, I'm not seeing any performance regression at all between 37f9dac and 782b2c3. In particular, I intend on configuring LIO and trying some tests over that path.
@dweeezil Yep that comports with what I've seen on my own systems and read from other issue reports here on github. Local performance of a zvol seems to be as good or better. [Note, there were a couple of other performance issues in the early 6.5 releases that were later addressed.] All the cases with performance degradation from 0.6.4.x --> 0.6.5.y [at least every one that I have found] involve iscsi or other consumers of the zvol.
@dweeezil This is just a shot in the dark but could this be at all related to pull request #5181 ?
@dweeezil: If you're looking to see interface problems with consumers, try SCST or qemu passthrough SCSI volumes as LUNs. As I mentioned in the zvol issue, this behavior seems worst under non-sync sustained writes; fio with direct I/O should not hit that use case.
@olw2005 w.r.t. #5181, that involves ZoL's use of disk vdevs. It doesn't have anything to do with zvols (other than the way ZFS might write to the underlying storage when zvols are written to). I did start looking into the iblock LIO module at the OpenZFS developer conference with @richardelling, who is also seeing this issue in some places. It was interesting to note that iblock_submit_bios() does, itself, do explicit plug/unplug, which it would seem ought to help aggregation at the expense of potentially a bit more latency. My next set of tests will be with the zvol exported via LIO. I'll see about testing with qemu and passthrough SCSI as well.
@dweeezil I watched the live stream of day 1. Very exciting, lots of good stuff coming. =) If I can assist in any way or if you happen to find any workaround / hack / fix / patch / etc. please l.m.k.
Also you may already know this, but just in case... If you are testing early 0.6.5 versions you'll want to revert / exclude commit b39c22b [reverted ~ 0.6.5.2] and manually set spl_taskq_thread_dynamic=0 [made the default ~ 0.6.5.4-ish??]. Both of those have a pronounced impact on performance, and definitely clouded my testing results early on.
There's been so much creep in this issue that I'm not sure where to go at this point to help push it along. There are numerous combinations of kernels, zvol vs. no zvol, NFS vs. no NFS, iSCSI vs. no iSCSI, SCST or LIO, etc. etc. The common thread appears to be a decrease in average block size written to the leaf vdevs and also higher write latency.
I would like to suggest that anyone experiencing a post-0.6.4 regression of this sort install the recently tagged 0.7.0 release candidate. It's currently at 0.7.0-rc1 and should be pretty easy to build packages for most distros. Other than all the other fixes and improvements, the big attraction for this type of problem is the vastly improved zpool iostat capability which is perfect for troubleshooting this type of issue. For those who've not seen it, check out 7e94507 and 193a37c for some details. This new detail, combined with that from the regular iostat ought to be very useful. Among other things, it now gives a view of the ZFS-provided aggregation which has existed all along but which has been impossible to monitor. Unfortunately, I don't know of anyone trying to port this to 0.6.5-release; hence my suggestion to try 0.7.0-rc1.
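For example, on 0.7.0-rc1 the following give both views (the pool name and interval are placeholders):
# Request-size histograms, showing individual vs. aggregated I/Os
zpool iostat -r tank 60
# Latency histograms (total wait, disk wait, queue waits)
zpool iostat -w tank 60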
Finally, I'd like to suggest that anyone trying 0.7.0-rc1 and finding a regression please open a new issue and reference this one.
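For reference, the extended views in 0.7.0-rc1 zpool iostat look roughly like the following (flags are from memory, so check zpool(8) on the rc build; <pool> is a placeholder):
zpool iostat -r <pool> 5     # per-vdev request size histograms (shows aggregation)
zpool iostat -w <pool> 5     # per-vdev latency histograms
zpool iostat -l -q <pool> 5  # average latencies and queue depths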
From my standpoint, now that I've got a large test system running a sufficiently old kernel, I will complete a set of tests to compare some write loads between 0.6.4 and 0.6.5 and also to see how much difference it makes when using zvols between accessing them directly versus accessing them via LIO.
I'm not sure what to make of this but figured I would post it as it's an interesting data point and perhaps someone here can make more sense of it. (Same setup as previously outlined, vmware virtual machine --> raw device map [e.g. passthrough] or virtual vmfs disk [qualitatively the same either way] --> scst iscsi target --> zvol.) Running the following as an additional I/O stream from within the client vm:
while true; do sleep .005 ; dd if=/dev/zero of=foo2 bs=8k count=1 oflag=sync 2>/dev/null; done
Basically this injects a small amount of sync I/O into the mix. But it also has the net effect of lowering the disk write I/O latency from the client side down to a "normal" number. As you might expect, it varies linearly with the amount of sleep time.
Moreover, I ran the same on our production video recording server (also mentioned above, normal operation generates ~1.5MB/s of continuous write I/O) and saw the same thing. Latency went way down (from ~100+ ms to ~2 ms) and video playback (generating a read I/O stream) was no longer choppy. Note, I tried this with zfs versions 0.6.5.8, 0.7.0rc1 and 0.6.4.2. Both 0.6.5.8 and 0.7.0rc1 show a massive latency reduction; 0.6.4.2 showed no change, it was low latency either way.
Again, not sure what the heck it means, but it's definitely interesting.
Hi. My first post here. I've been using ZOL on 50+ servers for several years now, great filesystem thanks for all your hard work.
I am stuck on 6.4.2, as the performance of anything later makes it unusable. They are all CentOS 6.8, mainly used by rsync to store backups, but also some Samba shares. I have also tried CentOS 7 with the same results. 6.4.2 is great for performance, but it would be great to upgrade.
Is this thread being looked at? It hasn't been updated for months.
Thanks.
@markr123 Just as an FYI. I finally punted on this issue and switched our VM(ware) storage environment from zvol + iscsi to zfs + nfs v4. The latter configuration does not suffer the performance penalty in the newer zfs 0.6.5+ versions, enabling an upgrade from 0.6.4.2.
I gather the bit of [zvol] code that changed here is deeply embedded / intertwined at this point, making it difficult to undo?
@markr123:
I recently performed some I/O meter performance tests with 0.7.0-rc3, and with 0.6.5.9 with the following commit reverted:
37f9dac592bf5889c3efb305c48ac39b4c7dd140
The results for 0.7.0-rc3:
Test name | Latency | Avg iops | Avg MBps | cpu load
----------|---------|----------|----------|---------
Random Write | 44.00 | 273 | 0 | 2%
Random Read | 32.30 | 371 | 0 | 1%
VDI | 30.70 | 390 | 1 | 2%
Exchange 2003 | 29.05 | 413 | 1 | 2%
Exchange 2007 | 31.06 | 386 | 3 | 1%
OLTP | 29.00 | 413 | 1 | 1%
SQL Server | 34.91 | 343 | 5 | 1%
CCTV | 15.45 | 776 | 24 | 1%
Database | 19.00 | 631 | 4 | 1%
Fileserver | 24.23 | 495 | 15 | 1%
Media Streaming | 12.47 | 962 | 60 | 3%
WEB Server | 0.62 | 19364 | 75 | 16%
Sequential Read | 19.46 | 616 | 616 | 23%
Sequential Write | 26.62 | 450 | 450 | 7%
The results for 0.6.5.9 with reverted 37f9dac592bf5889c3efb305c48ac39b4c7dd140:
Test name | Latency | Avg iops | Avg MBps | cpu load
----------|---------|----------|----------|---------
Random Write | 0.33 | 36728 | 17 | 5%
Random Read | 0.31 | 38568 | 18 | 5%
VDI | 0.28 | 42762 | 167 | 7%
Exchange 2003 | 0.28 | 43511 | 169 | 7%
Exchange 2007 | 0.29 | 41223 | 322 | 7%
OLTP | 0.28 | 42798 | 167 | 7%
SQL Server | 0.32 | 37013 | 578 | 6%
CCTV | 0.55 | 21850 | 682 | 5%
Database | 0.28 | 42273 | 330 | 7%
Fileserver | 0.46 | 26333 | 822 | 5%
Media Streaming | 0.99 | 12097 | 756 | 3%
WEB Server | 0.35 | 33869 | 132 | 5%
Sequential Read | 18.90 | 634 | 634 | 1%
Sequential Write | 17.35 | 691 | 691 | 3%
There is a huge performance drop introduced by 37f9dac592bf5889c3efb305c48ac39b4c7dd140.
@ab-oe What were the details of the system you performed this benchmark against?
I've reported similar issues traced back to this same commit on a Xen virtualised system, just wondered if your setup was easier / harder to re-create for further investigation?
@koplover The system with ZoL ran on Debian Jessie with a mainline 4.4 kernel. For the iSCSI target I used SCST. For the initiator, a machine with Windows Server 2012 R2 was used.
There was one pool with a zvol exported over iSCSI, both created using the default settings. The zfs and spl module parameters remained unchanged. The pool consists of 4 mirrored groups, each containing one physical drive. No read cache/write log devices. The initiator machine was connected to the target over 10GbE.
@ab-oe could you please run the same tests using PR #5824. This PR reverts some key parts of commit 37f9dac which introduced this performance issue.
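If it helps with setting that up, GitHub exposes pull requests as fetchable refs, so a tree with the PR applied can be built roughly like this (the local branch name is arbitrary):
git fetch origin pull/5824/head:pr-5824
git checkout pr-5824
./autogen.sh && ./configure && make -j$(nproc)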
@behlendorf thank you. It is better now, but still much worse than on the older version, especially when random I/O is performed.
Test name | Latency | Avg iops | Avg MBps | cpu load
----------|---------|----------|----------|---------
Random Write | 0.33 | 36550 | 17 | 19%
Random Read | 0.30 | 40192 | 19 | 25%
VDI | 0.28 | 42673 | 166 | 27%
Exchange 2003 | 0.28 | 43330 | 169 | 31%
Exchange 2007 | 0.29 | 41471 | 323 | 33%
OLTP | 0.28 | 43432 | 169 | 32%
SQL Server | 0.33 | 36424 | 569 | 36%
CCTV | 0.54 | 22053 | 689 | 25%
Database | 0.30 | 39844 | 311 | 33%
Fileserver | 0.46 | 26113 | 816 | 39%
Media Streaming | 1.06 | 11321 | 707 | 30%
WEB Server | 0.36 | 33078 | 129 | 27%
Sequential Read | 18.65 | 643 | 643 | 24%
Sequential Write | 17.96 | 667 | 667 | 10%
Test name | Latency | Avg iops | Avg MBps | cpu load
----------|---------|----------|----------|---------
Random Write | 5.26 | 2286 | 1 | 0%
Random Read | 2.40 | 5006 | 2 | 1%
VDI | 3.09 | 3882 | 15 | 1%
Exchange 2003 | 1.57 | 7659 | 29 | 2%
Exchange 2007 | 0.52 | 23168 | 181 | 6%
OLTP | 0.32 | 37381 | 146 | 7%
SQL Server | 0.42 | 28724 | 448 | 5%
CCTV | 0.89 | 13484 | 421 | 4%
Database | 3.28 | 3661 | 28 | 1%
Fileserver | 0.51 | 23634 | 738 | 6%
Media Streaming | 1.62 | 7385 | 461 | 3%
WEB Server | 0.34 | 35323 | 137 | 6%
Sequential Read | 19.02 | 630 | 630 | 1%
Sequential Write | 23.79 | 504 | 504 | 2%
@behlendorf A little update: I performed several runs of this test with ZoL 0.7.0-rc3 plus #5824 and it seems that performance gets better with each run, which didn't happen before. In the stable state the results look as follows:
Test name | Latency | Avg iops | Avg MBps | cpu load
----------|---------|----------|----------|---------
Random Write|0.31|38226|18|8%
Random Read|0.31|38313|18|6%
VDI|0.29|42025|164|10%
Exchange 2003|0.37|32147|125|6%
Exchange 2007|0.35|34291|267|8%
OLTP|0.34|35158|137|7%
SQL Server|0.40|30173|471|7%
CCTV|2.02|4731|147|2%
Database|0.32|37510|293|7%
Fileserver|0.59|20317|634|5%
Media Streaming|1.45|8249|515|3%
WEB Server|0.34|35353|138|6%
Sequential Read|18.33|654|654|1%
Sequential Write|24.06|498|498|1%
This looks quite good except for sequential write, which still suffers from a performance loss. Any clue why performance on the old version is good and stable all the time, while 0.7.0-rc3 + #5824 needs some warm-up?
@ab-oe thanks for collecting this data.
My best guess as to why the sequential write performance is down compared to v0.6.5.9 is that the updated patch takes a range lock for correctness when updating the volume. So depending on the workload you might see some additional contention. The previous code allowed IOs to potentially complete out of order, which wasn't strictly correct.
As for why a warm-up time is required, that's an interesting observation. How long would you say it took to warm up? Did you happen to observe arcstat.py during the testing and notice a change in caching behavior or hit rate? There were massive changes in the ARC between 0.6.5.9 and 0.7.0-rc3, including keeping it compressed in memory and breaking it up so it's all scatter-gather IO.
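Something as simple as the following running alongside the next test would show it (arcstat.py ships with ZoL; the field list here is only a suggestion from memory):
arcstat.py -f time,read,hits,miss,hit%,arcsz,c 1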
@behlendorf thank you for the explanation. The strange thing is that the test file for I/O meter is created before every test run, so the test data differs from the previous run.
How long would you say it took to warm up?
The third run gave me the first stable results. Further runs gave me similar results.
Did you happen to observe arcstat.py during the testing and notice a change in caching behavior or hit rate?
Unfortunately, no. When I get back this environment for test I'll check it.
My 5 cents:
Recently I ran such tests and saw that, at the beginning of a write test on 0.7.0-rc3, there are a lot of reads:
I ran fio with a write-only job and watched the HDDs with iostat. For the first minute or two there were a lot of reads (but fio issues only writes!).
With 0.6.4.2/0.6.5.9 there is no such behaviour.
Sadly, I didn't look at arcstat during these tests.
@redtex For existing volumes the reads are likely caused by partial block updates, although this behavior would be the same between 0.6.x and 0.7.0-rc3.
@behlendorf I think it's not partial updates, because the zvol blocksize=4k and the fio blocksize=4k. No compression. And I want to say again, the reads are present ONLY with 0.7.0-rc3. I can make a short video if it helps.
@redtex another possibility is the zvol prefetch. The first and last 128k of a zvol will be prefetched when it's added to the system to make it more likely that blkid will get a cache hit when probing the volume. You could try setting zvol_prefetch_bytes=0 and see if that eliminates the behavior for the benchmarks.
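A runtime change along these lines (it reverts on module reload) should be enough to test that:
echo 0 > /sys/module/zfs/parameters/zvol_prefetch_bytes
cat /sys/module/zfs/parameters/zvol_prefetch_bytes    # confirm the new value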
Writing from phone, so don't have the PR # in front of me, but I have tried to address some of these issues at the Linux block layer with decent results. Ryao has said that we need to implement a fix lower in the stack for correct operation. Personally, I think this is a first-order class of bug - zvols are too unpredictable and overall too slow to compete with conventional storage for iSCSI and direct image backing on virtual hosts, which is probably their most common use case in general-purpose IT. Performance regressions cost consumer confidence, which gets hurt even more when they are ignored for a year... This really needs to be fixed properly and permanently.
I am seeing the performance issue across CentOS 6 and CentOS 7 and with no zvols, just rsync writing files in and also Samba writing files. On a production system it has made the 0.6.5 branch unusable as it often hangs. We have not been able to upgrade past 6.4.2, whose performance is great. I am desperate to upgrade for better stability but cannot due to the performance.
Over here it's the same as @markr123: CentOS 6, no zvol, any upgrade from 0.6.4.2 to 0.6.5 results in a dramatic write performance degradation. You can see my findings in this thread, but if you need any more info/testing I'll be happy to assist you.
Hi. This problem/ticket has now been open for over a year and the performance issue is stopping most if not all people from upgrading past the 0.6.4.2 version.
Is this problem being looked at or worked on? Anything I can send you? It seems to have stalled.
A new version of ZOL may come out, but if this problem is not fixed there is no point, as no one can upgrade to it.
thanks for all your hard work on ZOL. It's the best filesystem out there for Linux.
I believe @ryao is currently digging into zvols at work. There may be a light at the end of this tunnel yet. Plus, people can't stay on 0.6.4 for compatibility reasons.
So many features are getting added that imports on even slightly outdated systems don't work RW.
Consumers are upgrading, realizing there's a problem, trying btrfs, failing, trying other things, failing, and coming back to zfs angry (seen this a few times). Users expect everything to improve by orders of magnitude with each release and work by way of an easy button which lets them tune all the details without knowing them or having to actually tune them... This pressure goes up to their IT staff, and the buck sorta stops there. It's a major perception problem because they don't realize this isn't a commercial effort (on Linux), and expect commercial-grade function for free...
My $0.02 is that we can't have performance regressions in releases; it qualifies as a first-order bug at the same level as data integrity. Doesn't have to improve, but can't get worse. The fallout lasts years, and with the shaky history behind ZFS in the eyes of corporate decision makers, we need many years of solid releases to build confidence.
The fact that we are out of tree makes that many times harder - Linus can do something silly and ZoL just has to adapt. Past that case, in our own code, we should be testing our ops for time and resource consumption like we test our features/data for return values, to have consistent metrics across commits, since a change in the SPA could cause the DMU to wait, or memory usage could spike elsewhere for buffers, and only testing the functions changed by the commit wouldn't expose those metrics. It's not a small problem to address, but probably doable, with enough blood, sweat, and profiling mechanisms.
@sempervictus agreed, ZFS needs a kind of "Phoronix Test Suite" for Filesystems that tracks performance (and latency ?) regressions and improvements regularly,
the buildbots & tests will check integrity and functionality, but I guess asking for regular performance- or latency-related tests (for each commit) would be too much and not entirely suitable?
Dedicated runs on servers (from Raspberry Pi and Pi2 - "little", small systems) or server farms (Lustre, etc.) would be much better suited to catch corner, scaling and other cases.
@sempervictus agreed. We need additional automated testing which is run regularly in order to prevent performance regressions.
ZFS needs a kind of "Phoronix Test Suite" for Filesystems
I think that would be an entirely reasonable place to start. If we can put together a meaningful battery of ZFS performance tests for the PTS which take less than 6 hours to run they could be added to the buildbot and run regularly. Having a predefined suite of tests anyone can easily run to benchmark their hardware would be convenient. @kernelOfTruth if you're already familiar with the PTS would you be interested in putting something together?
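Even before a full PTS profile exists, a small fixed fio job run against a scratch zvol for each tagged release would catch regressions of the kind reported in this issue; a rough sketch, with the device path, size and runtime only as placeholders rather than an agreed benchmark:
fio --name=zvol-sync-write --filename=/dev/zd0 --rw=write --bs=128k \
    --ioengine=libaio --iodepth=1 --sync=1 --size=2G --runtime=60 --time_based --group_reporting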
Hi, Is there any progress with this? Sorry to chase but I would really like to upgrade to the new versions of zfs but am stuck on 6.4.2 until performance issues are fixed.
Thanks.
You might be interested in giving 0.7.0-rc4 a try, it has various performance enhancements. There are also various ZVOL/ZIL improvements currently in the works c/o @ryao. Not sure if any of them are relevant to your usage case. Feel free to look through the list of current pull requests and give some of those a try as well.
Thanks for coming back so quickly on a Sunday.
My main problem has been lots of rsync writes coming in at once to ZFS. 6.4.2 handled these really well, but from 6.5 onwards the CPU usage went high and the rsyncs and all filesystem writes blocked. Samba writes also stalled.
Anybody who's experiencing this issue already tried the 0.7.x release?
I do, with HEAD at 376994828fd3753aba75d492859727ca76f6a293 and a RHEL6 kernel.
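(The dumps below are the standard kstats and module parameters; for anyone wanting to collect the same on their own system, they come from roughly these locations:)
cat /proc/spl/kstat/zfs/arcstats
cat /proc/spl/kstat/zfs/<pool>/txgs
for p in /sys/module/zfs/parameters/*; do echo "$(basename $p) = $(cat $p)"; done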
12 1 0x01 92 25024 8325332541 4337538365561
name type data
hits 4 11106114
misses 4 5791566
demand_data_hits 4 5166935
demand_data_misses 4 1052871
demand_metadata_hits 4 1997434
demand_metadata_misses 4 4092990
prefetch_data_hits 4 2495568
prefetch_data_misses 4 348742
prefetch_metadata_hits 4 1446177
prefetch_metadata_misses 4 296963
mru_hits 4 1703164
mru_ghost_hits 4 0
mfu_hits 4 7881109
mfu_ghost_hits 4 0
deleted 4 169
mutex_miss 4 0
evict_skip 4 47
evict_not_enough 4 0
evict_l2_cached 4 0
evict_l2_eligible 4 2530816
evict_l2_ineligible 4 0
evict_l2_skip 4 0
hash_elements 4 2100829
hash_elements_max 4 2101303
hash_collisions 4 156175
hash_chains 4 63727
hash_chain_max 4 4
p 4 67573442560
c 4 135146885120
c_min 4 33554432
c_max 4 135146885120
size 4 87287478000
compressed_size 4 77323989504
uncompressed_size 4 123392078848
overhead_size 4 5172511232
hdr_size 4 713580208
data_size 4 75959115264
metadata_size 4 6537385472
dbuf_size 4 902585728
dnode_size 4 2319490368
bonus_size 4 855320960
anon_size 4 473503232
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 41532464128
mru_evictable_data 4 36605826560
mru_evictable_metadata 4 419079168
mru_ghost_size 4 0
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 0
mfu_size 4 40490533376
mfu_evictable_data 4 38751274496
mfu_evictable_metadata 4 407086080
mfu_ghost_size 4 0
mfu_ghost_evictable_data 4 0
mfu_ghost_evictable_metadata 4 0
l2_hits 4 0
l2_misses 4 5791532
l2_feeds 4 8026
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 80685456896
l2_writes_sent 4 7791
l2_writes_done 4 7791
l2_writes_error 4 0
l2_writes_lock_retry 4 42
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
l2_free_on_write 4 6895
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 103058641920
l2_asize 4 66307313152
l2_hdr_size 4 0
memory_throttle_count 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
arc_no_grow 4 0
arc_tempreserve 4 264192
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 11328362736
arc_meta_limit 4 101360163840
arc_dnode_limit 4 10136016384
arc_meta_max 4 11353754696
arc_meta_min 4 16777216
sync_wait_for_async 4 22570
demand_hit_predictive_prefetch 4 234743
arc_need_free 4 0
arc_sys_free 4 4223340160
23 0 0x01 86 9632 12311035077 4362579253819
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
13884744 4164236604749 C 36271104 123392000 61915136 1123 1778 2110737401 6910 1941900771 2215539012
13884745 4166347342150 C 35784192 118595584 60448768 1094 1778 1941919728 6631 2215550371 1876709958
13884746 4168289261878 C 63946752 109617152 75390976 1010 2069 2215566746 5264 1876714240 2504001285
13884747 4170504828624 C 45242880 133758976 43941888 1206 1775 1876725141 4365 2504018775 2415840671
13884748 4172381553765 C 26675712 131870720 32587776 1212 1640 2504031394 6207 2415853093 2484338338
13884749 4174885585159 C 23851520 107413504 28540928 1009 1945 2415870320 5403 2484350763 2328118255
13884750 4177301455479 C 49197568 120823808 47751168 1160 1851 2484366136 6345 2328119881 2555925449
13884751 4179785821615 C 34044416 120647680 35971072 1030 1630 2328134325 6501 2555932784 1866568996
13884752 4182113955940 C 36459008 139554816 37232640 1200 1915 2556038602 5064 1866484048 2234865186
13884753 4184669994542 C 29173760 125607936 32129024 1070 1632 1866505692 6882 2234860797 1936573602
13884754 4186536500234 C 29129728 116928512 39550976 1013 1655 2234879641 5388 1936584095 2055491295
13884755 4188771379875 C 72613888 157794304 56950784 1347 2109 1936598521 6485 2055498967 2703213206
13884756 4190707978396 C 57694720 203923456 59187200 1702 1929 2055557736 6087 2703183238 2910482353
13884757 4192763536132 C 41032704 152723456 52813824 1295 1676 2703199501 6481 2910483823 2012891445
13884758 4195466735633 C 106031616 132505600 85508096 1130 2197 2910517873 5054 2012876880 2258632437
13884759 4198377253506 C 29989376 142020608 31498240 1217 1490 2012887710 5152 2258648842 2024240489
13884760 4200390141216 C 39697920 111620096 40534016 1055 1967 2258683609 5520 2024231689 2771066342
13884761 4202648824825 C 100624384 105779200 77398016 1039 2486 2024245121 6332 2771080493 2276465887
13884762 4204673069946 C 52517376 118792192 56631296 1027 2295 2771093324 4882 2276485176 1815269833
13884763 4207444163270 C 89269760 179376128 80846848 1474 2303 2276499597 6223 1815274214 2528415135
13884764 4209720662867 C 86455808 153219072 71598080 1296 2304 1815287497 5182 2528422794 2096165541
13884765 4211535950364 C 50241536 112902144 47398912 1055 2062 2528434602 5115 2096181106 2575307696
13884766 4214064384966 C 48568832 127492096 53805056 1142 1758 2096197167 5995 2575320771 2110255466
13884767 4216160582133 C 41707520 150327296 43040768 1442 1704 2575335623 4913 2110260731 2743316802
13884768 4218735917756 C 54598144 115286016 51937280 1186 1652 2110275185 6364 2743328586 2166999266
13884769 4220846192941 C 38547968 82055168 41779200 894 1915 2743356757 5090 2166992049 2345322866
13884770 4223589549698 C 29343744 134176768 35012608 1233 1700 2167005958 5656 2345328367 1993619416
13884772 4228101897894 C 36727808 140693504 37928960 1218 1483 1993648727 8629 2162407073 1855508114
13884773 4230095546621 C 23354368 136331264 31248384 1533 1467 2162423513 5743 1855514720 2420886278
13884774 4232257970134 C 72283136 127410176 54194176 1500 1873 1855528761 4571 2420896618 2809874073
13884775 4234113498895 C 50669056 122269696 69353472 1092 2079 2420909258 7177053 2802708436 2118281111
13884776 4236534408153 C 35153920 142364672 61358080 1222 1777 2809893566 6081 2118295890 2007734999
13884777 4239344301719 C 47450112 140697600 63119360 1246 1830 2118320919 14139 2007720866 2196866389
13884778 4241462622638 C 56240640 143757312 80420864 1344 1900 2007748896 7129 2196865716 2541512318
13884779 4243470371534 C 28441600 132440064 42409984 1261 1666 2196880689 6153 2541526140 2072776457
13884780 4245667252223 C 53864448 125136896 111034368 1128 2333 2541543644 6552 2072800582 2149140917
13884781 4248208795867 C 83119616 141053952 107102208 1294 2219 2072813715 4112 2149148544 2564207783
13884782 4250281609582 C 84401664 108220416 88825856 1065 2360 2149158675 5242 2564215611 2516223608
13884783 4252430768257 C 53362176 136302592 51167232 1225 1963 2564242028 4741 2516217376 2108411013
13884784 4254995010285 C 23246848 77152256 39084032 899 1731 2516233278 6925 2108422419 2390715576
13884785 4257511243563 C 100516864 98656256 77406208 1013 2314 2108443540 7946 2390712547 2697941751
13884786 4259619687103 C 59678208 127086592 74522624 1479 2087 2390728524 5162 2697948082 2518428769
13884787 4262010415627 C 51313152 133058560 56836096 1160 1850 2697961085 5785 2518438066 1874581249
13884788 4264708376712 C 38499840 130056192 40427520 1175 2138 2518456463 4040859 1870546505 2491954978
13884789 4267226833175 C 58814976 90529792 117776384 864 3098 1874607776 4026 2491956053 2475562255
13884790 4269101440951 C 86475776 139202560 174727168 1246 4186 2491969522 4868 2475578556 2327543107
13884791 4271593410473 C 108779520 129941504 191479808 1104 2934 2475594576 5917 2327543433 2065292104
13884792 4274069005049 C 136452608 145252352 256712704 1456 4042 2327557865 6033 2065305318 2788907047
13884793 4276396562914 C 77019136 140771328 155049984 1189 2780 2065320358 5701 2788918012 2412242775
13884794 4278461883272 C 143178752 76984320 251576320 852 4020 2788933634 5398 2412245428 2815951378
13884795 4281250816906 C 69138944 145485824 91815936 1220 2216 2412258432 4946 2815966106 2335489136
13884796 4283663075338 C 114761216 141082624 158588928 1228 3151 2815978557 4281 2335508011 2782529898
13884797 4286479053895 C 35223040 119554048 37945344 1068 1612 2335521727 4598 2782534251 1982206309
13884798 4288814575622 C 64258560 143339520 108339200 1204 2616 2782545586 3910 1982223964 2600873631
13884799 4291597121208 C 24773632 207495168 36782080 1733 1582 1982237772 5099 2600891868 1916979266
13884800 4293579358980 C 48922112 137027584 75210752 1200 1574 2600906689 5127 1916990316 2014285410
13884802 4298097293340 C 33272320 150700032 102039552 1253 2404 2014286990 5110 2156809695 2035838820
13884803 4300111580330 C 81604608 138477568 126861312 1192 2630 2156824003 5153 2035852326 2064413620
13884804 4302268404333 C 110749696 125366272 170745856 1088 2985 2035864271 4277 2064431564 2524985032
13884805 4304304268604 C 60029952 151941120 90644480 1284 2082 2064447577 6658 2524985414 2341167947
13884806 4306368716181 C 65350656 158601216 110608384 1384 2593 2524997687 4230 2341175864 2778715774
13884807 4308893713868 C 41444864 140292096 89128960 1219 2097 2341186801 4016 2778725366 2245347679
13884808 4311234900669 C 54941696 148480000 108142592 1286 2384 2778736522 5175 2245354933 2465445800
13884809 4314013637191 C 71305216 141561856 114016256 1190 2394 2245371433 35507 2465417436 2043746134
13884810 4316259008624 C 76064256 147419136 115892224 1243 2522 2465460281 5492 2043752965 2362551344
13884811 4318724468905 C 95911936 131731456 146939904 1220 2857 2043767537 4649 2362556913 2703281041
13884812 4320768236442 C 46643712 154566656 106479616 1346 2253 2362580955 5659 2703282138 2281451691
13884813 4323130817397 C 74088960 150323200 212606976 1552 3717 2703298387 6454 2281453627 3030101641
13884814 4325834115784 C 60939264 130662400 140001280 1279 2537 2281466788 4724 3030117687 2297613763
13884815 4328115582572 C 165896704 142925824 366075904 1382 5678 3030143591 4028 2297608111 3558481627
13884816 4331145726163 C 142273024 85852160 291889152 903 4652 2297622990 5684 3558483940 3239762671
13884817 4333443349153 C 264770560 165347328 435494912 1628 7283 3558495720 4809 3239816411 4048062420
13884818 4337001844873 C 162929664 86028288 291962880 902 4540 3239834932 7746 4048068030 2584101426
13884819 4340241679805 C 137316352 94298112 281124864 1217 3654 4048085836 5507 2584105436 2336033214
13884820 4344289765641 C 95045120 89255936 175349760 910 2625 2584123986 6581 2336032299 1924461857
13884821 4346873889627 C 97711616 87232512 184975360 877 3075 2336051996 7233 1924461104 1861019067
13884822 4349209941623 C 84547584 97206272 157278208 1083 2972 1924476383 4748 1861035267 2099725077
13884823 4351134418006 C 88662016 90320896 175153152 905 2417 1861048221 5092 2099731154 1837980140
13884824 4352995466227 C 112260608 90034176 182132736 931 3010 2099747002 5675 1837983064 1818140317
13884825 4355095213229 C 54755840 77709312 99287040 692 1760 1837997759 5168 1818144607 1574600950
13884826 4356933210988 C 60847616 78692352 102064128 716 1724 1818156051 14286179 1560326897 1501072823
13884827 4358751367039 S 0 0 0 0 0 1574634670 4651 1501064607 0
13884828 4360326001709 W 0 0 0 0 0 1501074833 4289 0 0
13884829 4361827076542 O 0 0 0 0 0 0 0 0 0
dbuf_cache_hiwater_pct = 10
dbuf_cache_lowater_pct = 10
dbuf_cache_max_bytes = 104857600
dbuf_cache_max_shift = 5
dmu_object_alloc_chunk_shift = 7
ignore_hole_birth = 1
l2arc_feed_again = 1
l2arc_feed_min_ms = 200
l2arc_feed_secs = 1
l2arc_headroom = 2
l2arc_headroom_boost = 200
l2arc_noprefetch = 1
l2arc_norw = 0
l2arc_write_boost = 8388608
l2arc_write_max = 8388608
metaslab_aliquot = 524288
metaslab_bias_enabled = 1
metaslab_debug_load = 0
metaslab_debug_unload = 0
metaslab_fragmentation_factor_enabled = 1
metaslab_lba_weighting_enabled = 1
metaslab_preload_enabled = 1
metaslabs_per_vdev = 200
send_holes_without_birth_time = 1
spa_asize_inflation = 24
spa_config_path = /etc/zfs/zpool.cache
spa_load_verify_data = 1
spa_load_verify_maxinflight = 10000
spa_load_verify_metadata = 1
spa_slop_shift = 5
zfetch_array_rd_sz = 1048576
zfetch_max_distance = 8388608
zfetch_max_streams = 8
zfetch_min_sec_reap = 2
zfs_abd_scatter_enabled = 1
zfs_abd_scatter_max_order = 10
zfs_admin_snapshot = 1
zfs_arc_average_blocksize = 8192
zfs_arc_dnode_limit = 0
zfs_arc_dnode_limit_percent = 10
zfs_arc_dnode_reduce_percent = 10
zfs_arc_grow_retry = 0
zfs_arc_lotsfree_percent = 10
zfs_arc_max = 0
zfs_arc_meta_adjust_restarts = 4096
zfs_arc_meta_limit = 0
zfs_arc_meta_limit_percent = 75
zfs_arc_meta_min = 0
zfs_arc_meta_prune = 10000
zfs_arc_meta_strategy = 1
zfs_arc_min = 0
zfs_arc_min_prefetch_lifespan = 0
zfs_arc_p_aggressive_disable = 1
zfs_arc_pc_percent = 0
zfs_arc_p_dampener_disable = 1
zfs_arc_p_min_shift = 0
zfs_arc_shrink_shift = 0
zfs_arc_sys_free = 0
zfs_autoimport_disable = 1
zfs_compressed_arc_enabled = 1
zfs_dbgmsg_enable = 0
zfs_dbgmsg_maxsize = 4194304
zfs_dbuf_state_index = 0
zfs_deadman_checktime_ms = 5000
zfs_deadman_enabled = 1
zfs_deadman_synctime_ms = 1000000
zfs_dedup_prefetch = 0
zfs_delay_min_dirty_percent = 60
zfs_delay_scale = 500000
zfs_delete_blocks = 20480
zfs_dirty_data_max = 4294967296
zfs_dirty_data_max_max = 4294967296
zfs_dirty_data_max_max_percent = 25
zfs_dirty_data_max_percent = 10
zfs_dirty_data_sync = 824633720832
zfs_dmu_offset_next_sync = 0
zfs_expire_snapshot = 300
zfs_flags = 0
zfs_free_bpobj_enabled = 1
zfs_free_leak_on_eio = 0
zfs_free_max_blocks = 100000
zfs_free_min_time_ms = 1000
zfs_immediate_write_sz = 32768
zfs_max_recordsize = 1048576
zfs_mdcomp_disable = 0
zfs_metaslab_fragmentation_threshold = 70
zfs_metaslab_segment_weight_enabled = 1
zfs_metaslab_switch_threshold = 2
zfs_mg_fragmentation_threshold = 85
zfs_mg_noalloc_threshold = 0
zfs_multihost_fail_intervals = 5
zfs_multihost_history = 0
zfs_multihost_import_intervals = 10
zfs_multihost_interval = 1000
zfs_multilist_num_sublists = 0
zfs_nocacheflush = 0
zfs_nopwrite_enabled = 1
zfs_no_scrub_io = 0
zfs_no_scrub_prefetch = 0
zfs_object_mutex_size = 64
zfs_pd_bytes_max = 52428800
zfs_per_txg_dirty_frees_percent = 30
zfs_prefetch_disable = 0
zfs_read_chunk_size = 1048576
zfs_read_history = 0
zfs_read_history_hits = 0
zfs_recover = 0
zfs_resilver_delay = 2
zfs_resilver_min_time_ms = 3000
zfs_scan_idle = 50
zfs_scan_min_time_ms = 1000
zfs_scrub_delay = 4
zfs_send_corrupt_data = 0
zfs_sync_pass_deferred_free = 2
zfs_sync_pass_dont_compress = 5
zfs_sync_pass_rewrite = 2
zfs_sync_taskq_batch_pct = 75
zfs_top_maxinflight = 32
zfs_txg_history = 200
zfs_txg_timeout = 20
zfs_vdev_aggregation_limit = 131072
zfs_vdev_async_read_max_active = 3
zfs_vdev_async_read_min_active = 1
zfs_vdev_async_write_active_max_dirty_percent = 60
zfs_vdev_async_write_active_min_dirty_percent = 30
zfs_vdev_async_write_max_active = 10
zfs_vdev_async_write_min_active = 2
zfs_vdev_cache_bshift = 16
zfs_vdev_cache_max = 16384
zfs_vdev_cache_size = 0
zfs_vdev_max_active = 1000
zfs_vdev_mirror_non_rotating_inc = 0
zfs_vdev_mirror_non_rotating_seek_inc = 1
zfs_vdev_mirror_rotating_inc = 0
zfs_vdev_mirror_rotating_seek_inc = 5
zfs_vdev_mirror_rotating_seek_offset = 1048576
zfs_vdev_queue_depth_pct = 1000
zfs_vdev_raidz_impl = [fastest] original scalar sse2 ssse3
zfs_vdev_read_gap_limit = 32768
zfs_vdev_scheduler = noop
zfs_vdev_scrub_max_active = 2
zfs_vdev_scrub_min_active = 1
zfs_vdev_sync_read_max_active = 10
zfs_vdev_sync_read_min_active = 10
zfs_vdev_sync_write_max_active = 10
zfs_vdev_sync_write_min_active = 10
zfs_vdev_write_gap_limit = 4096
zfs_zevent_cols = 80
zfs_zevent_console = 0
zfs_zevent_len_max = 384
zil_replay_disable = 0
zil_slog_bulk = 786432
zio_delay_max = 30000
zio_dva_throttle_enabled = 1
zio_requeue_io_start_cut_in_line = 1
zio_taskq_batch_pct = 75
zvol_inhibit_dev = 0
zvol_major = 230
zvol_max_discard_blocks = 16384
zvol_prefetch_bytes = 131072
zvol_request_sync = 0
zvol_threads = 32
zvol_volmode = 1
@snajpa And still having these write performance issues?
I'm running 0.7.1 for a couple of days now and the write performance issue I was having with 0.6.5.x seems to be gone!
Unfortunately I can't upgrade that often, next round of updates won't be for another few weeks. I'll get back to this issue within a ~month max with concrete data.
Hi @erikjanhofstede, could you please share the zfs parameters applied after upgrade to 0.7.1. Thanx in advance.
@snajpa Do you have any updates regarding the new packages for ubuntu 16.04 ? Thanx
@nawang87 I have some custom parameters in place for this specific configuration and workload. I didn't apply any custom parameter settings except renaming zil_slog_limit to zil_slog_bulk for running 0.7.1.
I performed some tests and noticed, similar to @ab-oe, that without commit https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140 performance was pretty good on 0.6.5.x (kernel 4.4.45).
ZOL 0.6.5.x with commit https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140
zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00
dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync count=10000
480903168 bytes (481 MB) copied, 74.4589 s, 6.5 MB/s
dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync count=10000
508559360 bytes (509 MB) copied, 80.3149 s, 6.3 MB/s
ZOL 0.6.5.x with reverted commit https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140
zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00
dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync count=10000
1310720000 bytes (1.3 GB) copied, 3.33955 s, 392 MB/s
dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync count=10000
10485760000 bytes (10 GB) copied, 14.1663 s, 740 MB/s
ZOL 0.7 HEAD and 0.7.1 give results similar to the following:
zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00
dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync count=10000
455213056 bytes (455 MB) copied, 71.3401 s, 6.4 MB/s
dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync count=10000
484442112 bytes (484 MB) copied, 74.2631 s, 6.5 MB/s
Noticed that ZOL head by default also has only one zvol kernel process running:
ps ax|grep zvol
61303 ? S< 0:00 [zvol]
htop during I/O shows that the zvol process does not consume resources:
61303 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 zvol
spl module with spl_taskq_thread_dynamic=0 runs 32 zvol processes:
ps ax|grep zvol
58474 ? S< 0:00 [zvol]
58475 ? S< 0:00 [zvol]
....
58504 ? S< 0:00 [zvol]
58505 ? S< 0:00 [zvol]
but anyway, htop during I/O shows that the zvol processes do not consume resources:
58474 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 zvol
58475 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 zvol
...
58504 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 zvol
58505 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 zvol
However, ZOL 0.6.5.x with reverted commit https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140 shows that the zvol processes do occupy resources (CPU%):
20416 root 0 -20 0 0 0 S 2.0 0.0 0:01.83 zvol
20403 root 0 -20 0 0 0 S 1.7 0.0 0:01.83 zvol
20407 root 0 -20 0 0 0 S 1.7 0.0 0:01.82 zvol
20414 root 0 -20 0 0 0 S 1.7 0.0 0:02.18 zvol
20404 root 0 -20 0 0 0 S 1.3 0.0 0:01.93 zvol
20406 root 0 -20 0 0 0 S 1.3 0.0 0:01.79 zvol
20413 root 0 -20 0 0 0 S 1.3 0.0 0:02.03 zvol
20417 root 0 -20 0 0 0 S 1.3 0.0 0:02.07 zvol
20425 root 0 -20 0 0 0 S 1.3 0.0 0:02.06 zvol
20429 root 0 -20 0 0 0 S 1.3 0.0 0:01.96 zvol
@behlendorf: maybe it's time to revert the bio work. It's obviously causing a major slowdown and there's no solution in sight. I've not seen anything from @ryao in a while on this (or actually anything, hope he's OK), and I'm not sure anyone else is working on/knows how to solve the problem. If LLNL had to back the workload with blockdev, would you allow engineering to implement a pathologically slow system which literally dropped 100x in performance on a single commit and was never fixed? This has blocked many production implementations for us; client engineering teams shoot us down and spend 30X on ... name a vendor ... Then they're stuck in proprietary hell, locked in to the people who now own their data, with other performance and consistency problems.
With crypto in place, this is probably now the most serious outstanding feature gap. Can this get prioritized please, or reverted if there is no developer bandwidth to fix it?
Listing the following commits (for reference) that touch on the bio work (or related); may it help to track down the issue and avoid having to revert all of it ☕️:
https://github.com/zfsonlinux/zfs/commit/d4541210f3e07930dfefb6723d21cd8b313bb156 Linux 3.14 compat: Immutable biovec changes in vdev_disk.c [via git search]
https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140 zvol processing should use struct bio [mentioned here]
https://github.com/zfsonlinux/zfs/commit/2727b9d3b63a938c1797d31378e6a5a1dcd43573 Use uio for zvol_{read,write}
https://github.com/zfsonlinux/zfs/commit/a765a34a3142a1dcedd256e076a54e6ae61953ca Clean up zvol request processing to pass uio and fix porting regressions
https://github.com/zfsonlinux/zfs/pull/4316 Reduce overhead in zvol_write/zvol_read via existing hold
I noticed one more performance regression, introduced in ZOL 0.7.2:
ZOL 0.7.1-1:
cat /sys/module/zfs/version
0.7.1-1
zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00
dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync count=5000
5000+0 records in
5000+0 records out
655360000 bytes (655 MB) copied, 16.8642 s, 38.9 MB/s
dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 23.6626 s, 44.3 MB/s
ZOL 0.7.2-1:
cat /sys/module/zfs/version
0.7.2-1
zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00
dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync count=5000
5000+0 records in
5000+0 records out
655360000 bytes (655 MB) copied, 94.3687 s, 6.9 MB/s
dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 75.3563 s, 7.0 MB/s
After reverting each commit individually, it seems that commit https://github.com/zfsonlinux/zfs/pull/6477/commits/a5b91f32e966197967639b93dced259a170b84ee introduced that slowdown:
ZOL 0.7.2-1 without https://github.com/zfsonlinux/zfs/pull/6477/commits/a5b91f32e966197967639b93dced259a170b84ee :
zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00
dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync count=5000
5000+0 records in
5000+0 records out
655360000 bytes (655 MB) copied, 16.7465 s, 39.1 MB/s
dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.0499 s, 43.6 MB/s
By the way: with the direct flag performance is generally good, but not every application can use direct I/O, so it would be nice to refactor the bio work:
ZOL 0.7.2-1:
dd if=/dev/zero of=/dev/zd0 bs=128k oflag=direct count=50000
50000+0 records in
50000+0 records out
6553600000 bytes (6.6 GB) copied, 13.6889 s, 479 MB/s
dd if=/dev/zero of=/dev/zd0 bs=1M oflag=direct count=50000
50000+0 records in
50000+0 records out
52428800000 bytes (52 GB) copied, 47.4183 s, 1.1 GB/s
I may provide additional tests on a server experiencing the same issue (on 0.6.5.x).
How easily can commit 37f9dac be rolled back in the source code to reproduce the issue?
@sempervictus @arturpzol @odoucet I've read through this issue again, but I'm not sure I completely understand. Can you help me understand your target workload? From the recently posted test results it looks like your focus is entirely on synchronous single-threaded IO, which is unfortunately exactly where I'd expect performance to be the worst, particularly for a raidz pool without a dedicated log device.
Let me try and summarize the state of things in 0.7.2 and master and do my best to answer the questions posted. Then I can properly investigate your performance concerns and we can determine what can be done.
maybe its time to revert the bio work
The flaw in the bio work referenced above has already been reverted. This was done in commits 692e55b8fea00a0d0bd46188d68031292f04e4a8 and 8fa5250f5d779e577406c581fc2d7fbf0baceea5, which reinstated the task queues to resolve the asynchronous IO performance issue.
commit a5b91f3 introduced that slowdown:
That's a bigger hit than I would have expected; we'll have to look in to that.
Using the current master code I tested this worst case scenario as posted above. The volume was created with __sync=always__ and I ran a single threaded dd for the workload. That meant that every single 128K block needs to be written to the log, which is part of the primary pool, and we can't return to user space to handle another write until we're absolutely sure it's on disk. Then we need to write it again as part of the normal IO path. This is absolutely going to be slow.
That all said, you're right. In my testing it _does_ look even worse than what I would have expected. I see a large write multiplier which I can't immediately explain; let me investigate what's going on and get back to you.
On the upside, if you stick with the default values I do see reasonable performance numbers.
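For anyone wanting to compare like with like, the two knobs in play for this worst case are the dataset sync property and whether the pool has a separate log device; roughly (the log device path is only a placeholder):
zfs get sync Pool-0/vol00
zfs set sync=standard Pool-0/vol00                 # default; only honors syncs the application requests
zpool add Pool-0 log /dev/disk/by-id/<fast-ssd>    # a dedicated SLOG absorbs the log writes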
@behlendorf the dd scenario was only for testing and for easy reproduction of the issue.
In my main environment I use SCST, which shares zvols to an iSCSI initiator. Initially I noticed a very long disk format on the Windows side with ZOL 0.7 master:
ZOL 0.6.5.x with reverted bio work:
zvol size: 10TB, sync: always - about 18 seconds
ZOL 0.7 HEAD:
zvol size: 10TB, sync: always - about 4 minutes 36 seconds
so I tried to eliminate the SCST, network and iSCSI initiator layers and switched to local I/O using dd, to be sure that it is something in the ZoL layer.
I understand that sync=always without a dedicated log device has to be slow, but why are 0.6.5.x with the bio work reverted, and 0.6.4.2, so fast in the same test scenario? Does it mean that 0.6.5.x with reverted bio or 0.6.4.2 is not fully synchronized? On the other hand it cannot be, because I performed many tests with forced shutdown of the node and have never experienced data corruption.
Of course, I can use the default value of the sync property for the zvol, but that does not protect me from data corruption in case of power failure.
@behlendorf are you able to perform the dd test with ZOL 0.6.4.2 and 0.6.5.x/0.7.x? You will notice very big performance differences.
@odoucet you can simply use 0.6.4.2 or revert the bio work on the first released 0.6.5.
To do that, try downloading the patch from https://uploadfiles.io/pba86 .
It is a diff between the released 0.6.5 and https://github.com/zfsonlinux/zfs/commit/37f9dac592bf5889c3efb305c48ac39b4c7dd140
Then patch your zfs sources, e.g.:
patch -p1 < ./revert_bio.patch
or use git to do it.
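If you prefer git over a raw patch, the rough equivalent on a checkout of the ZoL sources would be something like the following (the zfs-0.6.5 tag name is assumed; the revert may need manual conflict resolution):
git checkout -b revert-bio zfs-0.6.5
git revert 37f9dac592bf5889c3efb305c48ac39b4c7dd140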
For later releases in the 0.6.5 series (e.g. v0.6.5.1) it will be harder to revert because of additional source code dependencies.
Make sure to run the ZFS test suite after those changes.
ZFS is about data integrity after all; if the modified filesystem code puts your data through the grinder (data corruption, no integrity, issues) it's pretty much useless.
@kernelOfTruth I agree 100% that ZFS data integrity and everything else needs to be tested with the bio work reverted.
I only wanted to show the performance differences between 0.6.4.2 and 0.6.5.
@behlendorf I tested 37f9dac with PR #5824 on top; it has the same poor performance as described in most posts here. Restoring the 32 zvol threads has no impact on performance. The performance on 0.7.0-rc4 was better because of incorrect ranges being locked for indirect writes (which, by the way, led to broken data on zvols), but after the fix the performance is as bad as it was on 37f9dac.
I found another performance regression described in #6728 which affects discard operation on zvols.