Our server has an HP SmartArray RAID controller with four 2 TB SATA drives in a hardware RAID10 configuration and 12 GB RAM. On top of that we have a few ext4 partitions for the system and a zpool on an additional partition. We are using zfs-0.6.2-r1 on Gentoo with the 3.10.1-hardened-r1 kernel.
# parted /dev/sda print
Model: HP LOGICAL VOLUME (scsi)
Disk /dev/sda: 4001GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 1049kB 64.0MB 62.9MB ext2 boot boot
2 64.0MB 2112MB 2048MB linux-swap(v1) swap
3 2112MB 4160MB 2048MB ext4 root
4 4160MB 4672MB 513MB ext2 tmp
5 4672MB 14.9GB 10.2GB ext4 var
6 14.9GB 25.2GB 10.2GB ext4 usr
7 25.2GB 26.0GB 848MB zfs pool-log-1
8 26.0GB 4001GB 3975GB zfs pool-db-1
# zpool status
pool: pool-db
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Wed Oct 16 15:23:43 2013
config:
NAME STATE READ WRITE CKSUM
pool-db ONLINE 0 0 0
sda8 ONLINE 0 0 0
errors: No known data errors
# zfs get all pool-db|grep local
pool-db recordsize 16K local
pool-db compression lz4 local
pool-db atime off local
pool-db primarycache metadata local
# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=4294967296
options zfs zfs_prefetch_disable=1
options zfs zfs_nocacheflush=1
We tested performance with "iozone -n 128M -g 1G -r 16 -O -a C 1", running it on an ext4 partition with the cfq I/O scheduler and then on ZFS, with both the cfq and noop I/O schedulers. The results were as follows:
ext4:
Using minimum file size of 131072 kilobytes.
Using maximum file size of 1048576 kilobytes.
Record Size 16 KB
OPS Mode. Output is in operations per second.
Auto Mode
Command line used: iozone -n 128M -g 1G -r 16 -O -a C 1
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
131072 16 51001 146345 359566 360945 337506 144383 340793 231587 339691 136617 137417 356936 343220
262144 16 71109 148043 362326 366705 333510 144512 343120 208439 327117 139089 139238 360760 361695
524288 16 73212 120953 363653 364416 334869 141474 342375 234181 351229 167829 171961 439114 442930
1048576 16 56005 54332 275627 370318 331604 95633 259598 245874 408339 120053 57568 282868 364282
iozone test complete.
zfs with cfq, prefetch disabled:
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
131072 16 49283 56028 166325 167464 159784 82341 163369 94593 160000 79332 79220 107510 107191
262144 16 46140 57947 9992 10154 10424 36484 19761 68573 6459 45519 33259 7701 10157
524288 16 67088 36380 8529 10332 10288 51737 116464 90049 8279 53336 33732 8418 10239
1048576 16 73994 40824 8856 10637 10218 55370 8829 83333 3907 53302 56661 15011 16213
zfs with cfq, prefetch enabled:
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
131072 16 47652 72157 158993 158646 152531 79260 154644 72881 152089 76110 76152 5541 9932
262144 16 55126 33221 4905 10748 10497 47493 151959 87790 7433 46743 30377 4641 10317
524288 16 66327 34825 8441 10404 10389 51715 118680 96467 7924 51023 31693 4529 10460
1048576 16 49780 63566 18251 16269 10118 45524 12063 81380 2915 57960 40290 8639 10233
zfs with noop, prefetch disabled:
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
131072 16 49091 64081 166217 167439 158691 82665 10466 57773 11120 41502 49651 166829 166038
262144 16 55188 42318 9315 11107 10858 37382 15678 72751 8944 45727 34565 9471 11138
524288 16 65049 37889 8840 11131 10912 41559 19469 75323 5792 51099 35179 66470 67360
1048576 16 69045 40268 9252 10933 10834 47960 12193 84699 3406 43824 68876 15921 14669
zfs with noop, prefetch enabled:
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
131072 16 47380 67498 163816 164089 156035 79816 158950 92026 156157 76874 43619 30008 29649
262144 16 59965 32874 5129 10729 10632 46395 99301 87307 5981 46248 28464 8901 10079
524288 16 64776 40197 8994 10899 10583 44736 25038 69032 3720 51457 34901 8954 10707
1048576 16 70999 43519 7401 10742 10652 59807 9917 68851 573 41784 65564 14715 8492
So, what we see here is that ZFS performs especially badly on reads. Should it be that way, or is something really wrong with our configuration?
Try creating your pool with ashift=13.
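Since ashift is fixed at pool creation time, that means destroying and re-creating the pool; a minimal sketch, assuming the same partition and dataset properties as above (this wipes the existing pool):
# zpool destroy pool-db
# zpool create -o ashift=13 -O recordsize=16K -O compression=lz4 -O atime=off -O primarycache=metadata pool-db /dev/sda8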
Re-created the pool with ashift=13; now some of the reads perform better, but not all, especially stride read and backward read. And even the best numbers are still two times better on ext4.
cfq, prefetch disabled:
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
131072 16 47653 75679 170110 170247 161406 83819 164759 93200 161667 81141 80934 109365 108888
262144 16 61288 38627 106955 124989 156678 76070 5157 66504 931 47698 75137 166412 166678
524288 16 60276 35168 128903 168093 158139 57583 3316 76493 827 39020 62791 130818 168441
1048576 16 48302 46831 73181 8990 865 38819 3401 78254 503 44583 42114 117810 2842
noop, prefetch disabled:
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
131072 16 49439 85065 168061 167522 160690 83504 163594 97212 160480 80165 80202 165354 166587
262144 16 43223 70812 110294 156597 161515 80035 4808 65580 679 46039 61195 8177 8096
524288 16 64946 38672 82476 4448 4234 42875 23231 80221 910 51090 74154 75802 95033
1048576 16 67175 7315 6094 7538 513 42603 7057 83787 474 46433 65267 4272 12864
Also, what seems strange to me is that I have read somewhere that noop is the recommended Linux I/O scheduler for ZFS, but in our tests it performs worse than cfq.
With primarycache=all and compression=off on pool-db you won't be comparing uncompressed file contents cached completely in RAM for ext4 vs. ZFS being forced to forget everything it knows about the contents of the file ASAP (plus having to de-/compress it on every access).
Setting primarycache to anything other than all _will_ cripple performance massively, since ZFS is quite smart about caching - which it relies on to counter the overhead of CoW (and all the other nice features which ext4 doesn't have to deal with, since it doesn't support them).
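For a more apples-to-apples run, something along these lines against the existing pool should do (a hedged sketch, not a recommendation for production):
# zfs set primarycache=all pool-db
# zfs set compression=off pool-db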
Note
The root filesystem of the pool _might_ behave a bit differently than its child filesystems (e.g. you most likely won't be able to use zfs recv on it in some scenarios, since it can't be destroyed). Best practice for ZFS is to use the root filesystem of the pool only as a container (it can even have canmount=off or mountpoint=none) that just passes inherited properties down to the 'real' filesystems|volumes. So maybe do the test on a child filesystem instead.
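A sketch of that container pattern (the child name is only an example):
# zfs set canmount=off pool-db
# zfs create pool-db/data
pool-db/data then inherits recordsize, compression, atime, etc. from pool-db, and the benchmark can run there.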
Running dstat -c -l -d -D sda -r --disk-util in another terminal might be interesting to you, since ZFS operates quite differently from traditional filesystems: what you might see is that one test is impacted by the disk writes left over from the test before it (since you have sync=standard and didn't instruct iozone to fsync, ZFS will still be flushing writes while iozone has already started reading for the next test).
Rule of thumb: ZFS is _different_, but the last word in filesystems ;)
Part of the problem is that ZFS will limit the number of in-flight IOs sent to the block device, to avoid overwhelming a disk while still taking advantage of write reordering. When you use it on top of hardware RAID, the number of in-flight IOs required to keep the disks busy increases, but ZFS still throttles as if it were talking to a single disk.
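One knob you could experiment with here is the 0.6.x-era module parameter zfs_vdev_max_pending (default 10); whether raising it actually helps on a hardware RAID LUN that hides several spindles behind one device would need testing, e.g. in /etc/modprobe.d/zfs.conf:
options zfs zfs_vdev_max_pending=32
or at runtime via /sys/module/zfs/parameters/zfs_vdev_max_pending.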
As for noop versus cfq, noop is recommended because ZFS has its own internal IO scheduler and using CFQ just adds CPU overhead to IO processing. In your case, you are putting ZFS on a partition and the other partitions likely use CFQ. You could be losing some performance because Linux is doing CFQ on one partition and noop on another.
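For reference, the scheduler for the whole disk can be switched at runtime, e.g.:
# echo noop > /sys/block/sda/queue/scheduler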
With that being said, ZFS is really meant to be given direct control of the disks and can be expected to take a performance hit when it does not have that control. Hardware RAID was designed for in-place filesystems, which tend to overwrite data where it already sits, so it does not affect them the way it affects ZFS.
I was told in IRC that zfsonlinux/zfs@7a6144076166944655d86f1449be8566d1a3c71a improved synchronous performance of dd on NFS shares backed by ZFS datasets by about a factor of 2. It might help here.
@kristapsk Your initial issue report prodded me to run iozone for the first time in many years. Here are a few observations: with files that fit into RAM and/or the file system's cache (ARC for ZFS), read performance should effectively be testing how quickly data can be fetched from memory. You had disabled ZFS's caching of file contents with the primarycache=metadata setting, so it's no surprise that read performance was awful.
My second observation applies to testing under either ZFS or ext4: for some reason, iozone doesn't take fsync() time into account by default; you must use the -e option to make it do so. It seems like that ought to be a default that can't be overridden. During my initial test, I was getting impossibly high throughput numbers for both ZFS and ext4 until I discovered that the option even existed. It typically makes the largest difference in the initial write and re-write tests, during which the scratch file is populated.
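Concretely, that just means adding -e to the invocation you used, e.g.:
# iozone -e -n 128M -g 1G -r 16 -O -a C 1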
Next, I'll mention that the combination of iozone's block size and ZFS's recordsize can have a large impact on performance. You're clearly aware of this, given that you set recordsize to 16K under ZFS and were using 16K blocks in iozone. I mention this mainly as it relates to comparing ZFS to ext4 or other file systems, because there are potentially some significant performance penalties in ZFS when using a small recordsize. In particular, a small recordsize will cause ZFS to use a lot more space for metadata than a larger recordsize would. In my opinion, recordsize is something that should be tuned against a real workload as opposed to a benchmark. Presumably you're testing with a 16k recordsize because you're planning on running an application that performs all I/O in 16k chunks. I'd suggest benchmarking with whatever that application may be and trying various settings for recordsize.
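One non-destructive way to do that is to give each candidate recordsize its own child filesystem and point the workload at it (the names here are just placeholders):
# zfs create -o recordsize=16K pool-db/rs16k
# zfs create -o recordsize=128K pool-db/rs128k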
I'll share a few of my baseline iozone tests and observations in a subsequent post.
@ryao, so, you think ZFS could perform significantly better as RAIDZ than on hardware RAID? I didn't get your point about "because Linux is doing CFQ on one partition and noop on another" - when I switched the I/O scheduler, I did it for the whole disk / all partitions, via /sys/block/sda/queue/scheduler.
@dweeezil, yes, 16k is the recordsize for InnoDB (its page size), that's why we are benchmarking with that. I will do benchmarking with Percona's tpcc-mysql on the mysql datadir, on both ext4 and ZFS, and then compare the results.
@kristapsk I had not realized that you had changed the scheduler for the entire disk. In that case, disregard that comment. I also had not realized that you wanted to run PostgreSQL. Read performance should not matter as much for PostgreSQL as it does for other applications because PostgreSQL implements its own cache algorithm that is similar to ARC, but tuned for database workloads.
ZFS would perform better if you did the equivalent of RAID 10 by doing zpool create name mirror /path/to/device/0 /path/to/device/1 mirror /path/to/device/2 /path/to/device/3. If you do that, I suggest ashift=12 unless you are certain that your disks have 512-byte sectors and are not misreporting their sector size for Windows XP compatibility. That should maximize your IOPS performance. I would expect at least a factor of 2 improvement.
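Spelled out with the controller in JBOD/HBA mode and with hypothetical device names (ideally use /dev/disk/by-id/ paths instead), that would look something like:
# zpool create -o ashift=12 pool-db mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde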
I expect that the patches in #1775 will further improve things. They should be in 0.6.3, which will be released later this year. If you want, I could provide instructions on how to get the code early with the 9999 ebuilds, although I would suggest waiting for it to be merged to HEAD.
@kristapsk It just dawned on me that InnoDB implies MySQL, not PostgreSQL. Disregard what I said about the cache algorithm. MySQL does not have that. Also, you will want to use primarycache=all for MySQL.
@kristapsk There are a ton of tuning guides for ZFS/MySQL out there. Some of them pre-date L2ARC and other recent ZFS features. The one at https://blogs.oracle.com/realneel/entry/mysql_innodb_zfs_best_practices appears to be pretty good, but there are a lot more out there.
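The usual advice in those guides boils down to something like the following (dataset names are hypothetical, and the exact values should be validated against your workload):
# zfs set primarycache=all pool-db
# zfs create -o recordsize=16K pool-db/mysql-data
# zfs create -o recordsize=128K pool-db/mysql-log
with the InnoDB data files on the 16K dataset and the InnoDB log files on the one with the larger recordsize.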
@ryao It looks like when InnoDB is used, MySQL does have a caching layer similar to PostgreSQL's, at least according to the document at http://dev.mysql.com/doc/refman/5.5/en/innodb-buffer-pool.html.
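That also means the InnoDB buffer pool and zfs_arc_max need to be balanced against each other so the two caches don't fight over the 12 GB of RAM; a hypothetical my.cnf fragment for such a split:
[mysqld]
# leave room for the 4 GB ARC configured in /etc/modprobe.d/zfs.conf
innodb_buffer_pool_size = 4G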
@kristapsk In a previous post in this issue, I said I'd share some of my iozone test results. Unfortunately, I'm not likely to be able to do that, because I ended up burning a lot of time just trying to figure out what the numbers it was showing me actually meant. I had decided it made sense to use its "throughput mode", in which it spawns multiple processes/threads so that work is done in parallel. Some of the results I got didn't make much sense and/or were highly variable. I guess this is to be expected given the small file sizes I was working with. I started digging through the source and adding debugging, etc., and then, after seeing your notes regarding InnoDB, realized that whatever results I might come up with wouldn't be very helpful.
In any case, I think you're on the right track to do your testing with MySQL as opposed to iozone. It does sound like you're planning on using a synthetic benchmark. I'd suggest trying to rig up a test using your actual workload, too. In particular, it would be worth testing whether you really need to lower the recordsize to 16k for your workload. The tuning for InnoDB should be similar to that which is typically done for PostgreSQL except for the 16k vs. 8k recordsize.
For completeness, I'll mention that in my limited work with iozone, in which I tested freshly-created ext4 and ZFS pools of identical size in dedicated disk partitions in the same general area of the hard drive, ZFS typically performed equal to or sometimes much better than ext4, except for single-threaded writes and re-writes. I was mainly exploring the effects of different combinations of file I/O size and ZFS's recordsize.
@dweeezil, yes, we have studied this and other guides. That's why we have the recordsize=16k and primarycache=metadata settings, zfs_arc_max set to 4G, etc. Unfortunately, today's tpcc-mysql tests also show significantly better results for ext4 (this time actually ext4 under LVM2) than for ZFS. :(
Unfortunately, we could not get any benchmark where ZFS performs better than ext4, even when using RAIDZ. We decided to go the proven way (LVM+ext4) this time, but I clearly see that ZFS could be useful for us in the future.
Ignore any complaints by kristapsk.
kristapsk made one of the biggest mistakes you can make when running ZFS and basically disregarded every ZFS guide on the internet.
You are NOT supposed to run ZFS on top of hardware RAID; it completely defeats the reason to use ZFS.
ZFS wants to see all the disks directly, and it handles all the caching and writing to the disks itself. ZFS basically turns your entire computer into a big RAID card.
So don't run ZFS on top of hardware RAID and then complain about the performance.
I would disagree with the above: I've seen drastically better performance running on LVM stripes and caching than with ZFS striped vdevs (non-raid) and ARC/ZIL. Why non-raid? For cloud providers, where the storage provided is a virtual RAID1/10 block device attached to the virtual machine, it rarely makes sense to add additional parity/mirroring on top of that. If you don't trust the block devices the cloud provider attaches to your virtual machine, you have bigger problems.
Still, ZFS can make a lot of sense in those deployments: it adds deduplication, compression, snapshots, and excellent Docker/Kubernetes compatibility. On the LVM side, with lvmcache configured, random writes can be smoothed out, and so on.