ZFS: Feature request: change ashift default value to 12

Created on 19 Jul 2017 · 25 comments · Source: openzfs/zfs

Describe the problem you're observing

The majority of modern storage hardware uses 4K or even 8K blocks, but ZFS defaults to ashift=9 (512-byte blocks). While there is a hard-coded list to force some hardware to use ashift=12 when it reports itself as a 512b device, in my experience the majority of common hardware will end up at ashift=9 if ashift is not manually specified.

Describe how to reproduce the problem

Create a vdev using Samsung 840, 840 Pro, 850 EVO, or 850 Pro SSDs. Or Western Digital HDDs. Or any of a gigantic laundry list of other common storage hardware. If you don't specify ashift, it ends up being 9, and your performance ends up being atrocious. Since ashift is immutable per vdev and you can't remove a vdev from the pool (and even the new code from mahrens to enable removing vdevs is something of a heroic last-ditch effort), this effectively means screwing this up a single time necessitates destroying and rebuilding the pool.

Suggested mitigation

Default ashift, when not specified, to ashift=12, or even ashift=13. The penalty for setting ashift higher than the native blocksize is very, very low - increased use of slack space, and a small write amplification penalty only when writing amounts of data smaller than 4K (or 8K, for ashift=13). Writes this small are uncommon, given that databases tend to operate in 16K pages, making this a much more livable sub-optimization than an ashift that is too small, which results in roughly 10x write amplification.

Given the disparity in penalties paid, it makes much more sense to ask people who are sure they want 512b blocks to manually specify ashift=9, than to ask people to manually specify ashift=12. Personally, I'd go as far as saying ashift=13 should be the default, since 8K drives are becoming more and more common, and it's likely that a pool with ashift=12 drives will eventually need a disk replacement that might be ashift=13. But anything would be an improvement over the current situation.
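
For anyone reading along, the current workaround is to specify ashift explicitly at vdev creation time and verify it afterwards; a minimal example, with the pool name and device path as placeholders:

zpool create -o ashift=12 tank /dev/disk/by-id/ata-EXAMPLE
zdb -C tank | grep ashift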

Feature

Most helpful comment

I think it is true that modern hardware is generally moving to larger sector sizes, even if some of them still report 512-byte sectors for compatibility reasons. Changing the default to ashift=12 doesn't prevent users from selecting ashift=9 if their workload benefits from it (e.g. millions of tiny files); it's just a recognition that more hardware today works better with ashift=12, which should therefore be the default.

Also, as new drives increasingly ship with 4KB sectors, it will be impossible to add them to existing pools created with ashift=9 (e.g. to replace failing drives). The converse is not true, so it is much safer to make ashift=12 the default.

All 25 comments

Which ZFS version are we talking about? I think there's an "issue template" that should help provide meaningful information when opening a new issue.

The majority of modern storage hardware uses 4K or even 8K blocks, but ZFS defaults to ashift=9, and 512 byte blocks.

0.7.0-rc5 seems to work fine here:

root@linux:~# blockdev --getss --getpbsz /dev/vdb
512
512
root@linux:~# blockdev --getss --getpbsz /dev/vdc
4096
4096
root@linux:~# zpool create B /dev/vdb && zdb -C B | grep ashift
                ashift: 9
root@linux:~# zpool create C /dev/vdc && zdb -C C | grep ashift
                ashift: 12
root@linux:~# 

I'm on 0.6.5.6-0ubuntu17.

You're misunderstanding the issue. Your /dev/vdc correctly reports itself as a 4K device. Many, many devices lie and report themselves as 512b devices when they're actually 4K devices. That's when you have the problem with ashift defaulting to 9. For example, these Samsung SSDs:

root@box:~# blockdev --getss --getpbsz /dev/disk/by-id/ata-Samsung_SSD_840_PRO_Series_S12SNEAD309333E
512
512
root@box:~# blockdev --getss --getpbsz /dev/disk/by-id/ata-Samsung_SSD_850_PRO_512GB_S250NSAG403718A
512
512

I know from repeated experience that if I forget to set ashift manually when creating a new vdev using these drives (which are actually 8K sector devices), I end up with an ashift=9 vdev. I've seen the same thing happen with other makes of SSD and with some rust disks as well; the manufacturers still like to lie about the hardware blocksize so that their equipment will work with older operating systems that are confused by anything other than a 512b block.

The suggestion here is that since ashift=12 has minimal impact on 512b devices, ashift=9 should not be set unless the user manually specifies it.

The suggestion here is that since ashift=12 has minimal impact on 512b devices, ashift=9 should not be set unless the user manually specifies it.

This is not true when it comes to space efficiency for raidz*. 512b is clearly superior to 4k and vastly superior to 8k for most workloads.

While maintaining the whitelist is difficult, it is the only viable solution.

While maintaining the whitelist is difficult, it is the only viable solution.

"Viable" is a little arguable here. This basically strikes me as you saying "no way, I don't want to have to remember to type ashift=9!" vs me saying "no way, I don't want to have to remember to type ashift=12 or ashift=13!"

Another possibility would be issuing a warning if ashift isn't set, requiring some sort of force argument to ACTUALLY set a vdev with "whatever looks good" as the ashift. Basically I'd just like it to be more difficult to instantly and immutably destroy a pool's performance without so much as a warning.

I think it is true that modern hardware is generally moving to larger sector sizes, even if some of them still report 512-byte sectors for compatibility reasons. Changing the default to ashift=12 doesn't prevent users from selecting ashift=9 if their workload benefits from it (e.g. millions of tiny files); it's just a recognition that more hardware today works better with ashift=12, which should therefore be the default.

Also, as new drives increasingly ship with 4KB sectors, it will be impossible to add them to existing pools created with ashift=9 (e.g. to replace failing drives). The converse is not true, so it is much safer to make ashift=12 the default.

This is a thread that will never die, as long as drives misrepresent themselves :-(

Hear me now, believe me later: whitelist is the only viable solution to misrepresentation.

I don't see how it's the "only" solution. Having a list of drives with the correct ashifts is possible and desirable with either default. Not making this change isn't equivalent to "using whitelists", it's equivalent to "We believe ashift=9 is the best default".

Pros of ashift=9 default:

  • Much better space efficiency in raidz

Pros of ashift=12 default:

  • Much better performance on SSDs and other drives that have 4k or larger sectors

So those are the tradeoffs. We should weigh them keeping in mind how many drives are sold these days with 512 byte native sectors vs 4k+ native sectors.

The goal should be "never need to type ashift=", if for no other reason than ashift is something few people can explain to the average sysadmin. The implementation is via whitelists because disks lie.

Unless you can propose a way to suddenly make the ZFS whitelist exhaustive - which it demonstrably isn't - we should put some thought into a default that is the least bad.

The current ashift default _is_ the least bad default. It's not always set to 9; it's dynamic and is set to what the drive reports as its physical block size, see: https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev_disk.c#L316-L332 and https://github.com/zfsonlinux/zfs/blob/master/include/linux/blkdev_compat.h#L295-L315

So if the drive doesn't lie about its physical block size, the right "default" will be used. To deal with lying drives the ashift option was added to zpool create in ZoL. I get that some people run into lying drives a lot but in my experience, the majority of devices produced now do not lie.
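
To see what a given drive is reporting (and therefore what default the current code will pick), you can check the kernel's view directly; sdX is a placeholder:

cat /sys/block/sdX/queue/logical_block_size
cat /sys/block/sdX/queue/physical_block_size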

We are not going to hard-code ANY ashift value, be that 9 or 12 or 13, etc.

That depends on what "least bad" means. If it means "use space more efficiently", then ashift=9 is least bad. If it means "don't prevent your pool from using modern replacement drives when the old ones fail" then ashift=12 is least bad. IMHO, the fallout from using ashift=12 by default is some percentage more space being used, which is increasingly a non-issue as drives get larger, and this can still be handled by specifying ashift=9 for the cases where space usage is critical.

However, using ashift=9 (whether it is correct for the current drives or not) is going to increasingly be a problem moving forward since 512-byte sector devices will just stop being available, forcing users to do full backup/restore of their pool in order to replace a faulty drive.

The performance impact of using ashift=9 on drives that actually have 4KB sectors can be significant, but isn't the worst problem. There is a second hidden impact, namely that the 4KB sectors are updated with read-modify-write, and if that fails in the middle of the write it can corrupt adjacent sectors on the drive for ZFS 512-byte blocks that were not involved in the write at all, essentially re-introducing the "RAID write hole" because multiple 512-byte "ZFS blocks" are actually sharing the same sector on disk. This won't happen often, because crashes in the middle of sector updates that cause a CRC error are relatively rare, and there have to be pre-existing 512-byte blocks that are adjacent to new blocks being written, but the CRC errors are definitely seen in the field occasionally (independent of ZFS).

Earlier discussion on ashift: https://github.com/zfsonlinux/zfs/issues/289

The current ashift default is the least bad default. It's not always set to 9; it's dynamic and is set to what the drive reports as its physical block size, see: [elided]

Sorry, this is right, my bad.

I get that some people run into lying drives a lot but in my experience, the majority of devices produced now do not lie.

Do we have any better data on this? I believe that my fairly new NVMe SSD actually lies; I'll have to recheck.

Every single Samsung SSD produced for the last several years, including brand new 850 Pro, lies and reports 512 512. I've encountered lying rust drives as well; I think they were Western Digital Black.

To deal with lying drives the ashift option was added to zpool create in ZoL.

And it is very discouraging, for the devs who put so much effort into implementing and testing this feature, to see users simply "forgetting" to use it.

Answer to the initial request: there is no hard-coded default ashift value; ZFS asks the device for it, so the initial request is not accurate as stated.

I think this is a problem of "lying" hardware, not a ZFS one. As an engineer, you can check your hardware before any installation. That's also why you can set ashift to anything you want.

About the request to make some white- or blacklist - who will support it? We can only start discussing that once somebody is ready to do the work. I don't see any need for it.

I think this topic is now better suited for the mailing list. Thanks everybody for participating; closing.

About the request to make some white- or blacklist - who will support it?

??? Good question; it already exists. https://github.com/zfsonlinux/zfs/blob/6eb6073a044653016013b1a72de03a1257e899c5/cmd/zpool/zpool_vdev.c#L107

Incidentally, it shows the 830 and 840 as already in the database.
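
For what it's worth, my understanding is that the entries in that list are matched against the drive's inquiry vendor/product strings; you can read those from sysfs to check whether a given drive would be covered (sdX is a placeholder):

cat /sys/block/sdX/device/vendor
cat /sys/block/sdX/device/model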

@grantwwu thanks - so that answers the question, and anyone can just open a PR adding new disks to it.

@jimsalterjrs I do not share your experience. Here's the proof:
Vendor identification: SAMSUNG
Product identification: MZILS1T9HCHP/0NW
Unit serial number: S32CNCAH301443
from /sys/block/DEVICE/queue
logical_block_size 512
physical_block_size 4096
optimal_io_size 8192

Right now, the default is to honor the physical_block_size from the drive. @jimsalterjrs has proposed to make the default a hard-coded 4 kiB (or 8kiB) if I understand correctly. If the default is 4 kiB, this is bad for drives that correctly report 8 kiB.

I would like to suggest that the default be: honor the _larger_ of: physical_block_size from the drive, 4kiB. The whitelist would override.
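
Just to make that rule concrete, a rough sketch of what it would compute (sdX is a placeholder, and this assumes the reported size is a power of two):

pbs=$(cat /sys/block/sdX/queue/physical_block_size)
[ "$pbs" -lt 4096 ] && pbs=4096                  # never drop below 4 KiB
ashift=0; n=$pbs
while [ "$n" -gt 1 ]; do n=$((n / 2)); ashift=$((ashift + 1)); done
echo "would use ashift=$ashift"                  # whitelist entries would still override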

Note that the current state of affairs is difficult for the root-on-ZFS HOWTO (or any other instructions). If I tell people to hard-code ashift=12 (which I do), that's harmful to anyone with a drive properly reporting (or _in the whitelist_ with) 8 kiB or larger. If I leave it at the default, we get all the usual problems from lying drives. It'd be nice to either fix this behavior, or have some way to specify a _minimum_ ashift on the command line.

"Hard code" is a bit harsh. I'd settle for a warning about using default ashift values that needed a --force argument to override. Right now, given that there are lots of drives that lie, and that adding a vdev with a bad ashift - particulary a bad low ashift - means instantly and immutably damaging a pool, I think it's pretty criminally risky not to specify it directly in the first place.

Another option would be simply to require ashift whenever a vdev is added - even if the setting is default; i.e. zpool create tank sdb would throw an informational error, but zpool create -o ashift=default tank sdb would work just the same as a bare create does now.

Just posting for reference, I have some Samsung 850 EVO SSDs in my machines. I had to set ashift manually for these as it defaulted to 9.

Machine 1: 2 disk mirrored pool

Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 EVO 250GB
# cat /sys/block/sd{a,b}/queue/{logical_block_size,physical_block_size,optimal_io_size}
512
512
0
512
512
0

Machine 2: Combo L2ARC/SLOG for platter pool

Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 EVO 120GB
# cat /sys/block/sd{d,g}/queue/{logical_block_size,physical_block_size,optimal_io_size}
512
512
0
512
512
0

I just wanted to make a comment on something said at the start:

The penalty for setting ashift higher than native blocksize is very, very low

It turns out that there can be a substantial space penalty on raidz1/2/3, which varies depending on the number of devices in the vdev. The details are in this spreadsheet, but to take a couple of more extreme examples:

  • with ashift=12 and recordsize=128KiB:

    • a 7 disk raidz2 wastes 7.1%

    • an 11 disk raidz2 wastes 7.4%

  • but with ashift=9 and recordsize=128KiB:

    • a 7 disk raidz2 wastes 0.45%

    • an 11 disk raidz2 wastes 0.67%

When you have a raidz2 vdev of 17 disks (7.54% overhead), these figures become significant; 7% represents a whole disk wasted. (Aside: an 18 disk raidz2 has zero wastage!)
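
To show where a figure like that comes from, a rough back-of-the-envelope sketch for the 7 disk raidz2 / ashift=12 case (my arithmetic, not taken from the spreadsheet):

# 128 KiB record / 4 KiB sectors            -> 32 data sectors
# 5 data disks per row, ceil(32/5) = 7 rows -> 14 parity sectors
# 32 + 14 = 46, padded up to a multiple of (parity+1) = 3 -> 48 sectors allocated
# an ideal allocation would cost 32 * 7/5 = 44.8 sectors
echo "scale=3; (48 - 44.8) / 44.8 * 100" | bc    # ~7.1% wasted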

However I do agree that in the vast majority of cases, the performance benefits from ashift=12 with modern drives make it the right choice as a default. I'm just pointing out that for certain special cases, such as bulk archival servers, there may be benefit in setting this back to ashift=9.

We have seen considerably worse compression ratios when using ashift=13 on Samsung 850 EVOs (in raidz1 pools)

When ashift was set to 9, we would get 5-6X compression using gzip2; with ashift set to 13, it dropped to 2-3X.

Have others seen this behavior too?
