Zfs: Want to specify block size when creating file

Created on 27 May 2020 · 12 Comments · Source: openzfs/zfs

Databases created on ZFS datasets routinely run into read-modify-write overhead when the dataset recordsize does not match the database page size; for example, updating a 16 KB page on a dataset with the default 128 KB recordsize forces ZFS to read, modify, and rewrite a full 128 KB record. We could make this better by implementing an ioctl operation on directories that creates a file with a caller-specified record size. Databases could then check whether they are creating files on ZFS and, if so, use the ioctl operation to ensure that the record sizes they need are used.

This is a solution in search of users. I am just putting this out there for feedback to see what others think about it before doing anything else with it.

Feature

Most helpful comment

I would suggest a new API to change the block size of a file. It would only work if the file has no more than one block. The DMU already supports this - see dmu_object_set_blocksize().

The advantage of this over a new file-creation API (e.g. your proposed ZFS_IOC_DIR_CREATE_FILE) is that I think it would integrate better with existing system calls. E.g. they can create the file however they like (perhaps through many layers of abstraction in userspace), and then do a new call to set the block size (which might fail on non-ZFS filesystems or older ZFS code).

It would require an extension to the on-disk format

Are you thinking this would be needed to indicate that this file's block size should not be increased up to the recordsize setting, when its size is increased? For most cases, we could ensure this by setting dn_maxblkid=2, which prevents subsequent changes to the block size. The exception would be if the file is truncated to zero length, we also reset dn_maxblkid=0 and the file's blocksize can be changed again. Not sure if that's good enough, but thought I'd throw it out there.

All 12 comments

I _believe_ we discussed this on a phone call, but I can't find notes about it.

The easiest way to do this is to create a dataset just for the database, since you can set the recordsize property easily enough that way. Setting something on a directory is harder, since there are attributes, but not properties -- and the attributes are not inherited. Doing only one level (that is, /var/db and all the files in it, rather than /var/db and have it impact /var/db/subdir/file1) seems like it would be pretty accomplishable with an extended attribute. There are likely other tricks I am not thinking of to do it as well.

The easiest way to do this is to create a dataset just for the database, since you can set the recordsize property easily enough that way.

This is the old way, which tends to bite users who are unaware of it. Talking to the NixOS developers made me realize this would be a problem because their SQLite package database is automatically going to suffer IO amplification, even though it does not necessarily qualify for its own dataset. It also isn't always practical for things like embedded SQLite databases. SQLite, for example, is in so many places that it is hard to keep track of them all, and they are unlikely to get their own datasets. When one just happens to get hit hard, it would be best if the record sizes of its datafiles were set correctly.

Setting something on a directory is harder, since there are attributes, but not properties -- and the attributes are not inherited. Doing only one level (that is, /var/db and all the files in it, rather than /var/db and have it impact /var/db/subdir/file1) seems like it would be pretty accomplishable with an extended attribute. There are likely other tricks I am not thinking of to do it as well.

This isn't what I had in mind. I had in mind a replacement for the following with the added caveat that we specify a recordsize to be used for file creation:

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

void
createFile(const char *path)
{
    /* Preserve errno; failures are deliberately ignored here. */
    int old = errno;
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd != -1)
        close(fd);
    errno = old;
}

Instead of that, we would have something like:

#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/vfs.h>
#include <unistd.h>

#define ZFS_SUPER_MAGIC 0x2fc12fc1 /* Real value */
#define ZFS_IOC_DIR_CREATE_FILE 0x00F00F00 /* Made up value */

void
createFileNew(const char *path, int recordsize)
{
    struct statfs buf;
    int fd, ret;
    /* basename()/dirname() may modify their arguments, hence the copies. */
    char *basecopy = strdup(path);
    char *dircopy = strdup(path);
    char *base = basename(basecopy);
    char *dir = dirname(dircopy);

    /* statfs() the parent directory; the file itself does not exist yet. */
    if (statfs(dir, &buf) != 0 || buf.f_type != ZFS_SUPER_MAGIC) {
        createFile(path);
        goto out;
    }

    fd = open(dir, O_RDONLY | O_DIRECTORY);

    /* Silent failure from `open()`. Fine for an example as this isn't production code */
    if (fd == -1)
        goto out;

    /* Hypothetical ioctl: create `base` inside `dir` with the given recordsize. */
    ret = ioctl(fd, ZFS_IOC_DIR_CREATE_FILE, base, recordsize);

    /* Fall back to an ordinary create if the ioctl is not supported. */
    if (ret == -1 && (errno == ENOTTY || errno == EINVAL))
        createFile(path);

    close(fd);
    /*
     * The irony of freeing memory in an example that is not production
     * quality is not lost on me, but I dislike leaks.
     */
out:
    free(basecopy);
    free(dircopy);
}

Of course, programs do not usually do exactly this, but I am using this for an example to illustrate my idea. If we had an ioctl that did this, then software like databases could check for ZFS and try it when on ZFS in order to ensure that the files are made with optimal record sizes.

Users can already get this sort of thing by changing the recordsize on a dataset, making the database and then changing the recordsize back. The files created under the old recordsize will remain under the old recordsize. However, few people know this trick. They really should be using a dedicated dataset, but it seems like it would be better if we took measures to minimize the need for this in the first place.

It would require an extension to the on-disk format, but we could probably create a file attribute for ensuring that the record size is sticky and have it be set so that it is preserved across send/recv. That would go a step beyond a hack to prevent users from having the wrong record size on software that has a strong need for a specific setting.

Another benefit of this would be in improving results from benchmarks done by people who don't know how to set recordsize properly for various workloads. There are plenty of such benchmarks on the internet. If we can get the applications that benefit from proper record size to set their own recordsizes on a per file basis, the benchmark numbers would be a much more accurate representation of ZFS performance.

That was it: the suggestion was to allow an ioctl to change the recordsize of a file. Nothing inherited or very complicated, just some glue to make it settable to userspace.

That was it: the suggestion was to allow an ioctl to change the recordsize of a file. Nothing inherited or very complicated, just some glue to make it settable to userspace.

Changing the recordsize of a file isn't exactly possible. It is set at file creation time. I heard from sef that not much came of that conference call that mentioned this.

It is an idea that I had years ago and had discarded as pointless. I revisited it today after the NixOS developers inspired me to think of the potential for databases and other recordsize sensitive software to hook into it.

Adding a file attribute to make the recordsize be preserved by send/recv would be a step beyond that though. It also would be somewhat hacky as we would be breaking layering internally and possibly other things to make it work.

@kithrup I just realized that you are sef. Sorry for not recognizing you.

I would suggest a new API to change the block size of a file. It would only work if the file has no more than one block. The DMU already supports this - see dmu_object_set_blocksize().

The advantage of this over a new file-creation API (e.g. your proposed ZFS_IOC_DIR_CREATE_FILE) is that I think it would integrate better with existing system calls. E.g. they can create the file however they like (perhaps through many layers of abstraction in userspace), and then do a new call to set the block size (which might fail on non-ZFS filesystems or older ZFS code).

It would require an extension to the on-disk format

Are you thinking this would be needed to indicate that this file's block size should not be increased up to the recordsize setting, when its size is increased? For most cases, we could ensure this by setting dn_maxblkid=2, which prevents subsequent changes to the block size. The exception would be if the file is truncated to zero length, we also reset dn_maxblkid=0 and the file's blocksize can be changed again. Not sure if that's good enough, but thought I'd throw it out there.

I would suggest a new API to change the block size of a file. It would only work if the file has no more than one block. The DMU already supports this - see dmu_object_set_blocksize().

The advantage of this over a new file-creation API (e.g. your proposed ZFS_IOC_DIR_CREATE_FILE) is that I think it would integrate better with existing system calls. E.g. they can create the file however they like (perhaps through many layers of abstraction in userspace), and then do a new call to set the block size (which might fail on non-ZFS filesystems or older ZFS code).

I like the idea of an ioctl for changing the record size of a zero-length file better than my original idea. If we can get userland software on board with this, let's go with that.
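To make that concrete, here is a minimal sketch of what the create-then-set flow could look like from userspace. ZFS_IOC_SET_BLOCKSIZE is a made-up name and value, and the argument convention is invented for illustration; the point is only that the file is created through the normal path and the new call touches it while it is still zero-length, which is the case dmu_object_set_blocksize() can already handle.

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define ZFS_IOC_SET_BLOCKSIZE 0x00F00F01 /* Hypothetical request, made-up value */

/*
 * Create a file and, while it is still zero-length, ask ZFS to pin its
 * block size. On non-ZFS filesystems or older ZFS code the ioctl simply
 * fails and the file keeps the dataset default.
 */
static int
create_with_blocksize(const char *path, uint64_t blocksize)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);

    if (fd == -1)
        return (-1);

    /* Failure (e.g. ENOTTY on non-ZFS) is non-fatal: the file just keeps the dataset default. */
    (void) ioctl(fd, ZFS_IOC_SET_BLOCKSIZE, &blocksize);
    return (fd);
}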

It would require an extension to the on-disk format

Are you thinking this would be needed to indicate that this file's block size should not be increased up to the recordsize setting, when its size is increased? For most cases, we could ensure this by setting dn_maxblkid=2, which prevents subsequent changes to the block size. The exception would be if the file is truncated to zero length, we also reset dn_maxblkid=0 and the file's blocksize can be changed again. Not sure if that's good enough, but thought I'd throw it out there.

This should be good enough. I actually had not known the function of this field, so I just learned something new. Thanks. :)

Talking with @ryao, it was suggested that this become a dataset property, so that unintended or malicious use of the feature can be prevented.

Rough idea of the options for the property:
recordsize_control (a better name should be suggested)
off = default to legacy behavior and ignore specific recordsize requests
on = allow files to be created with a specific recordsize
denied = return EPERM to an application trying to set the recordsize

We currently do not have a property value called denied. It would be better to use existing values from other properties for consistency. Also, while we could use on and off to match the xattr property, I would prefer that the obvious option for turning this off give applications EPERM so that ones that really do benefit could warn the user about it. This would be inconsistent with the on and off used in the xattr property, so I think a different set of property values should be used. In particular, I am partial to the following:

standard = default value that permits userland applications to set record sizes on 0 length files
disabled = return EPERM
legacy = return ENOTTY

The standard and disabled values are already used in the sync property. The semantics of standard would be the same while the semantics of disabled would be similar in that we turned things off in a way that tries to play nicely with other layers. legacy is already used in the mountpoint and dnodesize properties to mean what was done in the past and restoring the current behavior would be consistent with that.

This knob would be to allow the system administrator to maintain final say over the record sizes used. It would be there so that system administrators could make the decision to override the hints given to us by applications should the need arise. As long as userland applications do not misuse it, that need should never arise. Also, the name is still tentative, but I feel like recordsize_application_hints would make the most sense for this property.

Edit: An earlier version of this used EINVAL. It turns out that unimplemented ioctls return ENOTTY, so I edited this to use ENOTTY.
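To sketch how an application might distinguish these cases, here is what the error handling could look like, using the same made-up ZFS_IOC_SET_BLOCKSIZE request as in the sketch above (none of this is an existing interface; the errno mapping is simply the proposal from this comment):

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>

#define ZFS_IOC_SET_BLOCKSIZE 0x00F00F01 /* Hypothetical request, made-up value */

/*
 * React to the proposed property values: success = standard,
 * ENOTTY = legacy (or not ZFS at all), EPERM = disabled by the administrator.
 */
static void
request_blocksize(int fd, uint64_t blocksize)
{
    if (ioctl(fd, ZFS_IOC_SET_BLOCKSIZE, &blocksize) == 0)
        return;

    switch (errno) {
    case ENOTTY:
        /* legacy or non-ZFS: silently accept the dataset default. */
        break;
    case EPERM:
        /* disabled: the administrator overrode the hint; worth warning about. */
        fprintf(stderr, "warning: filesystem refused %llu-byte records; "
            "expect read-modify-write overhead\n",
            (unsigned long long)blocksize);
        break;
    default:
        /* EINVAL and friends: an unsupported size was requested. */
        break;
    }
}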

The semantics of the ioctl call have not been fully fleshed out, but it occurs to me that when doing that, we should consider the interactions between this ioctl, this property and the recordsize setting in two scenarios:

  1. When applications set a recordsize larger than the dataset recordsize, should this be allowed?
  2. When applications set a recordsize that is too small, like 512, especially given that most systems use ashift=12 vdevs these days.

I am thinking that we could allow the system administrator to set values ranging from 512 to 1M just like on the recordsize property. When an application attempts to set a record size smaller than this, we would round the request up to the administrator's setting or the value of the zfs_max_recordsize LKM parameter, whichever is smaller. Furthermore, the maximum record size that applications may set would be dictated by the current setting of the global zfs_max_recordsize LKM parameter.

I am not sure whether we should require that userspace pass a power of 2 or that we should require userspace to pass the log base 2 of the record size. Illegal sizes would return EINVAL while I imagine that we could return the size set on success. This might be of interest to userspace if we were to increase the record size or decrease it based on system constraints. If we go with the log base 2 of the size, we would not have an invalid size as everything would then be rounded up or down in that case.
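As a rough sketch of the clamping and rounding being described (512 is the real SPA_MINBLOCKSIZE floor, but treating zfs_max_recordsize as a compile-time constant and the helper itself are purely illustrative):

#include <stdint.h>

#define MIN_BLOCKSIZE   512             /* SPA_MINBLOCKSIZE */
#define MAX_RECORDSIZE  (1ULL << 20)    /* stand-in for the zfs_max_recordsize LKM parameter */

/*
 * Round an application-supplied record size up to the next power of two,
 * then clamp it to the administrator's minimum and the global maximum.
 * Returning the size actually applied lets the ioctl report it to userspace.
 */
static uint64_t
clamp_recordsize(uint64_t requested, uint64_t admin_min)
{
    uint64_t size = MIN_BLOCKSIZE;

    /* Round up so odd sizes never reach the DMU; avoids needing EINVAL for them. */
    while (size < requested)
        size <<= 1;

    if (size < admin_min)
        size = admin_min;
    if (size > MAX_RECORDSIZE)
        size = MAX_RECORDSIZE;

    return (size);
}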

Another Gentoo developer (chutzpah on freenode IRC) suggested that we send a patch to SQLite upstream to detect the optimal record size and set its own internal record size accordingly. Our variable record size would require it to make a 16MB sparse file before doing statx to get the recordsize, but this is doable and should be applicable to all filesystems. It would not work with PostgreSQL, MySQL, VMs or bit torrent clients. Also, things like FIO that are told to test 4KB IO have similarly rigid requirements; being able to give us a hint that 4KB IO is needed seems fine to me as long as other applications get on board, but we would need to talk to their upstreams.
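Roughly, the probe being suggested for SQLite could look like the following (fstat()'s st_blksize is used in place of statx for brevity; whether a purely sparse file is enough to grow the block size, or data has to actually be written, is exactly the sort of detail that would need testing):

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Probe the effective record size of the filesystem backing `dir` by
 * growing a temporary file to 16MB and reading back st_blksize.
 * Returns 0 on any error so the caller can keep its default page size.
 */
static long
probe_recordsize(const char *dir)
{
    char path[4096];
    struct stat st;
    long blksize = 0;
    int fd;

    snprintf(path, sizeof (path), "%s/.recordsize-probe-XXXXXX", dir);
    fd = mkstemp(path);
    if (fd == -1)
        return (0);

    /* Extend the sparse file so the block size can grow to the full recordsize. */
    if (ftruncate(fd, 16 * 1024 * 1024) == 0 && fstat(fd, &st) == 0)
        blksize = (long)st.st_blksize;

    close(fd);
    unlink(path);
    return (blksize);
}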

Getting SQLite to set its own internal record size might reduce the value of this ioctl somewhat, but it still has enough applications to make this seem worthwhile to me if we can get other projects on board. In the case of SQLite, I think whether they can use this might depend on how much freedom they have in setting their record size. I know that they can go up to 128KB, although I am not sure if they can handle the 16MB case that we support, so they might still benefit in very rare situations. Also, they would benefit whenever the application developer / user overrides the default.

Feedback from clever on freenode IRC, one of the NixOS developers, was that we should test SQLite to see which approach is superior. That is getting SQLite to scale up to match us or getting us to scale down to match SQLite's default. This might be easier said than done, but I agree that we should evaluate it.

Users can already get this sort of thing by changing the recordsize on a dataset, making the database and then changing the recordsize back. The files created under the old recordsize will remain under the old recordsize. However, few people know this trick. They really should be using a dedicated dataset, but it seems like it would be better if we took measures to minimize the need for this in the first place.

This is exactly what I am doing right now for specific files which benefit from a lower recordsize than the default 128K when a new dataset is not practical. For what it's worth, I think it is a good idea and I agree with @ahrens' proposal.
