Our server has an HP SmartArray RAID controller with four 2 TB SATA drives in a hardware RAID10 configuration and 12 GB RAM. On top of that we have a few ext4 partitions for the system and a zpool on an additional partition. We are using zfs-0.6.2-r1 on Gentoo with the 3.10.1-hardened-r1 kernel.
# parted /dev/sda print
Model: HP LOGICAL VOLUME (scsi)
Disk /dev/sda: 4001GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 1049kB 64.0MB 62.9MB ext2 boot boot
2 64.0MB 2112MB 2048MB linux-swap(v1) swap
3 2112MB 4160MB 2048MB ext4 root
4 4160MB 4672MB 513MB ext2 tmp
5 4672MB 14.9GB 10.2GB ext4 var
6 14.9GB 25.2GB 10.2GB ext4 usr
7 25.2GB 26.0GB 848MB zfs pool-log-1
8 26.0GB 4001GB 3975GB zfs pool-db-1
# zpool status
pool: pool-db
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Wed Oct 16 15:23:43 2013
config:
NAME STATE READ WRITE CKSUM
pool-db ONLINE 0 0 0
sda8 ONLINE 0 0 0
errors: No known data errors
# zfs get all pool-db|grep local
pool-db recordsize 16K local
pool-db compression lz4 local
pool-db atime off local
pool-db primarycache metadata local
# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=4294967296
options zfs zfs_prefetch_disable=1
options zfs zfs_nocacheflush=1
We tested performance with "iozone -n 128M -g 1G -r 16 -O -a C 1", running it on an ext4 partition with the cfq I/O scheduler and then on ZFS, with both the cfq and noop I/O schedulers. The results were as follows:
ext4:
Using minimum file size of 131072 kilobytes.
Using maximum file size of 1048576 kilobytes.
Record Size 16 KB
OPS Mode. Output is in operations per second.
Auto Mode
Command line used: iozone -n 128M -g 1G -r 16 -O -a C 1
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
131072 16 51001 146345 359566 360945 337506 144383 340793 231587 339691 136617 137417 356936 343220
262144 16 71109 148043 362326 366705 333510 144512 343120 208439 327117 139089 139238 360760 361695
524288 16 73212 120953 363653 364416 334869 141474 342375 234181 351229 167829 171961 439114 442930
1048576 16 56005 54332 275627 370318 331604 95633 259598 245874 408339 120053 57568 282868 364282
iozone test complete.
zfs with cfq, prefetch disabled:
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
131072 16 49283 56028 166325 167464 159784 82341 163369 94593 160000 79332 79220 107510 107191
262144 16 46140 57947 9992 10154 10424 36484 19761 68573 6459 45519 33259 7701 10157
524288 16 67088 36380 8529 10332 10288 51737 116464 90049 8279 53336 33732 8418 10239
1048576 16 73994 40824 8856 10637 10218 55370 8829 83333 3907 53302 56661 15011 16213
zfs with cfq, prefetch enabled:
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
131072 16 47652 72157 158993 158646 152531 79260 154644 72881 152089 76110 76152 5541 9932
262144 16 55126 33221 4905 10748 10497 47493 151959 87790 7433 46743 30377 4641 10317
524288 16 66327 34825 8441 10404 10389 51715 118680 96467 7924 51023 31693 4529 10460
1048576 16 49780 63566 18251 16269 10118 45524 12063 81380 2915 57960 40290 8639 10233
zfs with noop, prefetch disabled:
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
131072 16 49091 64081 166217 167439 158691 82665 10466 57773 11120 41502 49651 166829 166038
262144 16 55188 42318 9315 11107 10858 37382 15678 72751 8944 45727 34565 9471 11138
524288 16 65049 37889 8840 11131 10912 41559 19469 75323 5792 51099 35179 66470 67360
1048576 16 69045 40268 9252 10933 10834 47960 12193 84699 3406 43824 68876 15921 14669
zfs with noop, prefetch enabled:
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
131072 16 47380 67498 163816 164089 156035 79816 158950 92026 156157 76874 43619 30008 29649
262144 16 59965 32874 5129 10729 10632 46395 99301 87307 5981 46248 28464 8901 10079
524288 16 64776 40197 8994 10899 10583 44736 25038 69032 3720 51457 34901 8954 10707
1048576 16 70999 43519 7401 10742 10652 59807 9917 68851 573 41784 65564 14715 8492
So, what we see here is that ZFS performs especially badly on reads. Should it be that way, or is something really wrong with our configuration?
Try creating your pool with ashift=13.
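Since ashift is fixed at pool creation time, that means destroying and re-creating the pool; a minimal sketch, assuming the same partition and dataset properties as above (this wipes the existing pool):
# zpool destroy pool-db
# zpool create -o ashift=13 -O recordsize=16K -O compression=lz4 -O atime=off -O primarycache=metadata pool-db /dev/sda8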
Re-created the pool with ashift=13; now some of the reads perform better, but not all, especially stride read and backward read. And even the best numbers are still two times better on ext4.
cfq, prefetch disabled:
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
131072 16 47653 75679 170110 170247 161406 83819 164759 93200 161667 81141 80934 109365 108888
262144 16 61288 38627 106955 124989 156678 76070 5157 66504 931 47698 75137 166412 166678
524288 16 60276 35168 128903 168093 158139 57583 3316 76493 827 39020 62791 130818 168441
1048576 16 48302 46831 73181 8990 865 38819 3401 78254 503 44583 42114 117810 2842
noop, prefetch disabled:
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
131072 16 49439 85065 168061 167522 160690 83504 163594 97212 160480 80165 80202 165354 166587
262144 16 43223 70812 110294 156597 161515 80035 4808 65580 679 46039 61195 8177 8096
524288 16 64946 38672 82476 4448 4234 42875 23231 80221 910 51090 74154 75802 95033
1048576 16 67175 7315 6094 7538 513 42603 7057 83787 474 46433 65267 4272 12864
Also, what seems strange to me is that I have read somewhere that noop is the recommended Linux I/O scheduler for ZFS, but in our tests it performs worse than cfq.
With primarycache=all and compression=off on pool-db you won't be comparing uncompressed file contents cached completely in RAM for ext4 vs. ZFS being forced to forget everything it knows about the contents of the file ASAP (plus having to de-/compress it on every access).
Setting primarycache to anything other than all _will_ cripple performance massively, since ZFS is quite smart about caching - which it relies on to counter the overhead of CoW (and all the other nice features which ext4 doesn't have to deal with, since it doesn't support them).
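For a more apples-to-apples run, something along these lines against the existing pool should do (a hedged sketch, not a recommendation for production):
# zfs set primarycache=all pool-db
# zfs set compression=off pool-db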
Note
The root filesystem of the pool _might_ behave a bit differently than its child filesystems (e.g. you most likely won't be able to use zfs recv on it in some scenarios, since it can't be destroyed). Best practice for ZFS is to use the root filesystem of the pool only as a container (it can even have canmount=off or mountpoint=none) that just passes inherited properties down to the 'real' filesystems|volumes. So maybe do the test on a child filesystem instead.
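A sketch of that container pattern (the child name is only an example):
# zfs set canmount=off pool-db
# zfs create pool-db/data
pool-db/data then inherits recordsize, compression, atime, etc. from pool-db, and the benchmark can run there.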
Running dstat -c -l -d -D sda -r --disk-util in another terminal might be interesting to you, since ZFS operates quite differently from traditional filesystems: what you might see is that one test is impacted by the disk writes left over from the test before it (since you have sync=standard and didn't instruct iozone to fsync, ZFS will still be flushing writes while iozone has already started reading for the next test).
Rule of thumb: ZFS is _different_, but the last word in filesystems ;)
Part of the problem is that ZFS will limit the number of in-flight IOs sent to the block device, to avoid overwhelming a disk while still taking advantage of write reordering. When you use it on top of hardware RAID, the number of in-flight IOs required to keep the disks busy increases, but ZFS still throttles as if it were talking to a single disk.
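One knob you could experiment with here is the 0.6.x-era module parameter zfs_vdev_max_pending (default 10); whether raising it actually helps on a hardware RAID LUN that hides several spindles behind one device would need testing, e.g. in /etc/modprobe.d/zfs.conf:
options zfs zfs_vdev_max_pending=32
or at runtime via /sys/module/zfs/parameters/zfs_vdev_max_pending.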
As for noop versus cfq, noop is recommended because ZFS has its own internal IO scheduler and using CFQ just adds CPU overhead to IO processing. In your case, you are putting ZFS on a partition and the other partitions likely use CFQ. You could be losing some performance because Linux is doing CFQ on one partition and noop on another.
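For reference, the scheduler for the whole disk can be switched at runtime, e.g.:
# echo noop > /sys/block/sda/queue/scheduler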
With that being said, ZFS is really meant to be given direct control of the disks and can be expected to take a performance hit when it does not have that control. Hardware RAID was designed for in-place filesystems, which tend to overwrite data where it already sits, so it does not affect them the way it affects ZFS.
I was told in IRC that zfsonlinux/zfs@7a6144076166944655d86f1449be8566d1a3c71a improved synchronous performance of dd on NFS shares backed by ZFS datasets by about a factor of 2. It might help here.
@kristapsk Your initial issue report prodded me to run iozone for the first time in many years. Here are a few observations: with files that fit into RAM and/or the file system's cache (ARC for ZFS), read performance should effectively be testing how quickly data can be fetched from memory. You had disabled ZFS's caching of file contents with the primarycache=metadata setting, so it's no surprise that read performance was awful.
My second observation applies to testing under either ZFS or ext4: for some reason, iozone doesn't take fsync() time into account by default; you must use the -e option to make it do so. It seems like that ought to be a default that can't be overridden. During my initial test, I was getting impossibly high throughput numbers for both ZFS and ext4 until I discovered that the option even existed. It typically makes the largest difference in the initial write and re-write tests, during which the scratch file is populated.
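Concretely, that just means adding -e to the invocation you used, e.g.:
# iozone -e -n 128M -g 1G -r 16 -O -a C 1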
Next, I'll mention that the combination of iozone's block size and ZFS's recordsize can have a large impact on performance. You're clearly aware of this, given that you set recordsize to 16K under ZFS and were using 16K blocks in iozone. I mention this mainly as it relates to comparing ZFS to ext4 or other file systems, because there are potentially some significant performance penalties in ZFS when using a small recordsize. In particular, a small recordsize will cause ZFS to use a lot more space for metadata than a larger recordsize would. In my opinion, recordsize is something that should be tuned against a real workload as opposed to a benchmark. Presumably you're testing with a 16k recordsize because you're planning on running an application that performs all I/O in 16k chunks. I'd suggest benchmarking with whatever that application may be and trying various settings for recordsize.
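One non-destructive way to do that is to give each candidate recordsize its own child filesystem and point the workload at it (the names here are just placeholders):
# zfs create -o recordsize=16K pool-db/rs16k
# zfs create -o recordsize=128K pool-db/rs128k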
I'll share a few of my baseline iozone tests and observations in a subsequent post.
@ryao, so, you think ZFS could perform significantly better as RAIDZ than on hardware RAID? I didn't get your point about "because Linux is doing CFQ on one partition and noop on another" - when I switched the I/O scheduler, I did it for the whole disk / all partitions, via /sys/block/sda/queue/scheduler.
@dweeezil, yes, 16k is the recordsize for InnoDB (its page size), that's why we are benchmarking with that. I will do benchmarking with Percona's tpcc-mysql on the mysql datadir, on both ext4 and ZFS, and then compare the results.
@kristapsk I had not realized that you had changed the scheduler for the entire disk. In that case, disregard that comment. I also had not realized that you wanted to run PostgreSQL. Read performance should not matter as much for PostgreSQL as it does for other applications because PostgreSQL implements its own cache algorithm that is similar to ARC, but tuned for database workloads.
ZFS would perform better if you did the equivalent of RAID 10 by doing zpool create name mirror /path/to/device/0 /path/to/device/1 mirror /path/to/device/2 /path/to/device/3. If you do that, I suggest ashift=12 unless you are certain that your disks have 512-byte sectors and are not misreporting their sector size for Windows XP compatibility. That should maximize your IOPS performance. I would expect at least a factor of 2 improvement.
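Spelled out with the controller in JBOD/HBA mode and with hypothetical device names (ideally use /dev/disk/by-id/ paths instead), that would look something like:
# zpool create -o ashift=12 pool-db mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde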
I expect that the patches in #1775 will further improve things. They should be in 0.6.3, which will be released later this year. If you want, I could provide instructions on how to get the code early with the 9999 ebuilds, although I would suggest waiting for it to be merged to HEAD.
@kristapsk It just dawned on me that InnoDB implies MySQL, not PostgreSQL. Disregard what I said about the cache algorithm. MySQL does not have that. Also, you will want to use primarycache=all for MySQL.
@kristapsk There are a ton of tuning guides for ZFS/MySQL out there. Some of them pre-date L2ARC and other recent ZFS features. The one at https://blogs.oracle.com/realneel/entry/mysql_innodb_zfs_best_practices appears to be pretty good, but there are a lot more out there.
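The usual advice in those guides boils down to something like the following (dataset names are hypothetical, and the exact values should be validated against your workload):
# zfs set primarycache=all pool-db
# zfs create -o recordsize=16K pool-db/mysql-data
# zfs create -o recordsize=128K pool-db/mysql-log
with the InnoDB data files on the 16K dataset and the InnoDB log files on the one with the larger recordsize.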
@ryao It looks like when InnoDB is used, MySQL does have a caching layer similar to PostgreSQL's, at least according to the document at http://dev.mysql.com/doc/refman/5.5/en/innodb-buffer-pool.html.
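That also means the InnoDB buffer pool and zfs_arc_max need to be balanced against each other so the two caches don't fight over the 12 GB of RAM; a hypothetical my.cnf fragment for such a split:
[mysqld]
# leave room for the 4 GB ARC configured in /etc/modprobe.d/zfs.conf
innodb_buffer_pool_size = 4G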
@kristapsk In a previous post in this issue, I said I'd share some of my iozone test results. Unfortunately, I'm not likely to be able to do that, because I ended up burning a lot of time just trying to figure out what the numbers it was showing me actually meant. I had decided it made sense to use its "throughput mode", in which it spawns multiple processes/threads so that work is done in parallel. Some of the results I got didn't make much sense and/or were highly variable. I guess this is to be expected given the small file sizes I was working with. I started digging through the source and adding debugging, etc., and then, after seeing your notes regarding InnoDB, realized that whatever results I might come up with wouldn't be very helpful.
In any case, I think you're on the right track to do your testing with MySQL as opposed to iozone. It does sound like you're planning on using a synthetic benchmark. I'd suggest trying to rig up a test using your actual workload, too. In particular, it would be worth testing whether you really need to lower the recordsize to 16k for your workload. The tuning for InnoDB should be similar to that which is typically done for PostgreSQL except for the 16k vs. 8k recordsize.
For completeness, I'll mention that in my limited work with iozone, in which I tested freshly-created ext4 and ZFS pools of identical size in dedicated disk partitions in the same general area of the hard drive, ZFS typically performed equal to or sometimes much better than ext4, except for single-threaded writes and re-writes. I was mainly exploring the effects of different combinations of file I/O size and ZFS's recordsize.
@dweeezil, yes, we have studied this and other guides. That's why we have the recordsize=16k and primarycache=metadata settings, zfs_arc_max set to 4G, etc. Unfortunately, today's tpcc-mysql tests also show significantly better results for ext4 (this time actually ext4 under LVM2) than for ZFS. :(
Unfortunately, we could not get any benchmark where ZFS performs better than ext4, even when using RAIDZ. We decided to go the proven way (LVM+ext4) this time, but I clearly see that ZFS could be useful for us in the future.
Ignore any complaints by kristapsk.
kristapsk made one of the biggest mistakes you can make when running ZFS and basically disregarded every ZFS guide on the internet.
You are NOT supposed to run ZFS on top of hardware RAID; it completely defeats the reason to use ZFS.
ZFS wants to see all the disks directly, and it handles all the caching and writing to the disks itself. ZFS basically turns your entire computer into a big RAID card.
So don't run ZFS on top of hardware RAID and then complain about the performance.
I would disagree with the above: I've seen drastically better performance running on LVM stripes and caching than with ZFS striped vdevs (non-raid) and ARC/ZIL. Why non-raid? For cloud providers, where the storage provided is a virtual RAID1/10 block device attached to the virtual machine, it rarely makes sense to add additional parity/mirroring on top of that. If you don't trust the block devices the cloud provider attaches to your virtual machine, you have bigger problems.
Still, ZFS can make a lot of sense in those deployments: it adds deduplication, compression, snapshots, and excellent Docker/Kubernetes compatibility. On the LVM side, with lvmcache configured, random writes can be smoothed out, and so on.