We haven't yet integrated ZFS with filefrag on Linux, but this certainly seems worth looking into to me. My initial understanding is that to make this work we would just need to implement the FIEMAP ioctl, although filefrag is basically an e2fsprogs utility, so it may not be the right tool for the job.
Brian, most Linux filesystems support FIEMAP today, including ext3/4 and Lustre; it was originally based on a similar XFS ioctl and has been adopted by other filesystems since then. While filefrag is packaged with e2fsprogs, it is a candidate to move into util-linux-ng at some point (along with lsattr/chattr), as other e2fsprogs-developed libraries/tools like libblkid, libuuid, libcom_err, and fsck already have.
The other alternative to FIEMAP is FIBMAP, but that ioctl is limited to root, and it only allows returning a single block at a time. Not only is FIBMAP very inefficient for large files, but it also misses some additional features that are nice to have on modern filesystems, like xattr mapping, unwritten extents, delalloc blocks, etc.
I would recommend starting by looking at __generic_block_fiemap() in the kernel, to get an idea of what needs to be done on the FIEMAP side of the code. Note that very new kernels also support the SEEK_HOLE and SEEK_DATA arguments to llseek() (see generic_file_llseek()). For ZFS blocks, FIEMAP_EXTENT_MERGED should be set on merged ranges of blocks.
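For reference, here is a minimal userspace sketch of the SEEK_HOLE/SEEK_DATA interface (assuming a kernel and libc new enough to expose the constants; the file name is just an example):

#define _GNU_SOURCE	/* for SEEK_DATA/SEEK_HOLE */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Walk the data regions of a file using SEEK_DATA/SEEK_HOLE. */
int main(int argc, char *argv[])
{
        off_t data = 0, hole;
        int fd = open(argc > 1 ? argv[1] : "testfile", O_RDONLY);

        if (fd < 0)
                return 1;

        /* Each SEEK_DATA finds the next data region; SEEK_HOLE finds
         * its end.  A fully-allocated file reports a single data run
         * ending at the implicit hole at EOF. */
        while ((data = lseek(fd, data, SEEK_DATA)) >= 0) {
                hole = lseek(fd, data, SEEK_HOLE);
                if (hole < 0)
                        break;
                printf("data: %lld..%lld\n", (long long)data,
                       (long long)hole - 1);
                data = hole;
        }
        close(fd);
        return 0;
}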
Lustre uses a patched version of filefrag which allows returning the underlying device for each extent (fe_device), because a file is not located on a single LUN. For Lustre this is an index value (0-N). For ZFS one might return either the Linux block device number (major << 16 | minor), or 32 bits of the VDEV GUID, or similar.
struct fiemap_extent {
        __u64 fe_logical;  /* logical offset in bytes for the start of
                            * the extent from the beginning of the file */
        __u64 fe_physical; /* physical offset in bytes for the start
                            * of the extent from the beginning of the disk */
        __u64 fe_length;   /* length in bytes for this extent */
        __u64 fe_reserved64[2];
        __u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
        __u32 fe_device;   /* device number */
        __u32 fe_reserved[2];
};
One of the fe_reserved64[2] fields was (in my mind at least) reserved for returning the actual length of the extent, for compressed blocks. I haven't given a lot of thought to whether the existing fe_length should be treated as the physical length or the logical length (currently both are the same), so ideas are welcome. If anyone goes down that road, please send me a patch for e2fsprogs filefrag so I can submit it upstream.
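For anyone picking this up, the userspace side is just an ioctl on the file descriptor; roughly something like the following sketch (error handling mostly trimmed — filefrag does the same thing with more care):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>

/* Minimal FIEMAP caller: map up to 32 extents of the whole file. */
int main(int argc, char *argv[])
{
        struct fiemap *fm;
        unsigned int i;
        int fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                return 1;

        fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
        fm->fm_length = FIEMAP_MAX_OFFSET;      /* whole file */
        fm->fm_flags = FIEMAP_FLAG_SYNC;        /* flush delalloc first */
        fm->fm_extent_count = 32;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
                return 1;

        for (i = 0; i < fm->fm_mapped_extents; i++)
                printf("%u: logical %llu physical %llu length %llu flags 0x%x\n",
                       i, (unsigned long long)fm->fm_extents[i].fe_logical,
                       (unsigned long long)fm->fm_extents[i].fe_physical,
                       (unsigned long long)fm->fm_extents[i].fe_length,
                       fm->fm_extents[i].fe_flags);
        return 0;
}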
As for the ZFS side, I'd recommend looking at the Solaris ZPL layer to see how they implement SEEK_HOLE and SEEK_DATA traversal of the blocks in the dnode.
Don't forget about reporting dirty pages (ARC buffers) in memory, or copying a file that was just written will result in an empty copy. Even though such blocks are not allocated on disk yet, FIEMAP_EXTENT_DELALLOC should be set on them to indicate that the space is in use.
The most recent (but as yet unlanded) patches for handling compressed extents in FIEMAP are at https://lwn.net/Articles/607552/ and http://linux-fsdevel.vger.kernel.narkive.com/S8u3GLaY/patch-0-6-v5-fiemap-introduce-data-compressed-and-phys-length-flags
I'm the one who originally requested this on the zfsonlinux mailing list, in order to generate fragmentation reports.
Knowing the actual extent of fragmentation is fairly important, given how badly ZFS performance falls over when fragmentation becomes widespread (short version: it's pretty ugly, and you don't want to experience it).
Brian, what's the state of play?
@Stoatwblr this issue is waiting for a developer with enough available time and interest to tackle it. I agree it would be great to have.
This is also a requirement for running the shake defragmenter.
@behlendorf, I saw your FIEMAP project was listed as the runner-up project for the ZFS DevCon Hackathon. Awesome that you had the opportunity to work on this. I'd be interested to review the patch when it is available, and would be interested to discuss how to handle ditto blocks with FIEMAP, if you haven't already implemented this.
It would also be very useful to add FIEMAP support to Lustre osd-zfs so that it can export this information to the client.
@adilger thanks! So I don't have any code ready to share just yet but I hope to fairly soon. When I have something I'd love to get your input. The OpenZFS summit was a good opportunity to give some careful thought as to how to best go about this.
I was happy to see the FIEMAP interface already provides most of what's needed. Ditto blocks are one exception you mentioned; another concern is including the block device in the extent.
For returning the block device in the extent, I would suggest to use the same mechanism as Lustre does for returning the OST index to the caller. We return the OST index (__u32) in a reserved field:
#define fe_device fe_reserved[0]
that hasn't yet been reserved in the upstream kernel, though I should probably do that at some point. For ZFS it could either return the actual __u32 block device number (kdev_t, which would make it easier for users to locate), or it could e.g. return the low 32 bits of the VDEV GUID (which would be consistent across runs, but need another level of lookup to resolve to a specific disk).
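To illustrate the two options (both helpers here are hypothetical; nothing like them exists in ZFS yet):

#include <stdint.h>

/* Option 1: Linux block device number, (major << 16 | minor) as
 * suggested earlier.  Easy for users to resolve via /dev, but may
 * change if devices are renumbered across boots. */
static uint32_t fe_device_from_devt(uint32_t major, uint32_t minor)
{
        return (major << 16) | minor;
}

/* Option 2: low 32 bits of the VDEV GUID.  Stable across runs, but
 * needs a zpool-level lookup to map back to a disk, and truncation
 * means it is not guaranteed to be unique. */
static uint32_t fe_device_from_guid(uint64_t vdev_guid)
{
        return (uint32_t)(vdev_guid & 0xffffffffULL);
}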
The Lustre-patched filefrag also passes a flag FIEMAP_FLAG_DEVICE_ORDER=0x40000000 to indicate that the lower layers should return the extents in device order (i.e. all of OST0000 first, then OST0001, or wherever the stripes are located) so that they are not interleaved across devices at every stripe_size boundary.
$ filefrag -v /myth/tmp/2stripe
Filesystem type is: bd00bd0
File size of /myth/tmp/2stripe is 20971520 (20480 blocks of 1024 bytes)
 ext:  device_logical:        physical_offset: length:  dev: flags:
   0:       0..  10239: 3844867072..3844877311:  10240: 0002: net
   1:       0..   4095: 2938242048..2938246143:   4096: 0000: net
   2:    4096..   8191: 2938363904..2938367999:   4096: 0000: net
   3:    8192..  10239: 2938380288..2938382335:   2048: 0000: last,net
For handling ditto blocks, the FIEMAP_FLAG_DEVICE_ORDER option would also segregate the ditto copies between multiple VDEVs, with the same logical file offset returned for each copy of a given block. For display by filefrag this is fine, and most tools that use FIEMAP only care whether there is data at a given offset or not, so reporting the same logical offset two or three times should be fine.
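To make that concrete, here is a mocked-up filefrag -v excerpt for a copies=2 file with one 4096-block extent duplicated across two VDEVs (all offsets invented for illustration):

 ext:  device_logical:      physical_offset: length:  dev: flags:
   0:       0..   4095:  1048576..  1052671:   4096: 0000:
   1:       0..   4095:  2097152..  2101247:   4096: 0001: last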
@adilger that all makes good sense and I generally agree.
We return the OST index (__u32) in a reserved field:
It would be great to get this reserved in the kernel. When looking at the existing interfaces I was surprised that btrfs hasn't already done this. My feeling here is that the kdev_t would be the most useful thing to return. One potential gotcha I see with this approach is that we'll need to return some reserved value when the device is faulted or missing.
The Lustre-patched filefrag also passes a flag FIEMAP_FLAG_DEVICE_ORDER=0x40000000
That's handy! Although if we're reporting the kdev_t as the device, then what exactly is device order? Ordering numerically by kdev_t values won't match the zpool status ordering. But you're right, it would still prevent interleaving, which is probably sufficient.
or it could e.g. return the low 32 bits of the VDEV GUID
The full VDEV GUID could additionally be returned in fe_reserved64[0] if there's a legitimate use for it. I don't think we want to be reporting truncated GUIDs, which are very likely, but not guaranteed, to be unique.
ditto blocks
What are your thoughts on adding a new FIEMAP_EXTENT_DATA_DUPLICATE flag, which would be set on all extents that have multiple copies?
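Something like the following, where the actual bit value is just a placeholder until one is reserved upstream (it would need to avoid colliding with other proposed flags, like the compression ones):

/* Placeholder value, not reserved upstream: set on every extent that
 * has one or more additional (ditto) copies elsewhere in the pool. */
#define FIEMAP_EXTENT_DATA_DUPLICATE    0x00004000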
Mirrors and RAIDZ
Handling mirrors is relatively straightforward, since we can return the extent multiple times, once for each device. There are some complications with compression, though; more on that in a minute.
As for RAIDZ and dRAID, in order to return correct physical information we need to add an extent for every vdev spanned by the stripe. That's potentially a large number of small extents, even assuming we ignore the parity information. Will that cause any problems for filefrag?
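For a rough sense of scale (a back-of-the-envelope example, ignoring parity and allocation padding): a 128 KiB block on an 8+1 RAIDZ1 vdev splits into eight 16 KiB data columns, one per data disk, so a fully mapped 1 GiB file of such blocks (8192 of them) would report on the order of 65,536 extents instead of 8,192.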
Compression
The fiemap_extent structure only includes a single length value, fe_length, which implies the logical and physical lengths are the same. With compression this simply isn't true; is there an existing way to handle this that I'm not seeing?
Splitting blocks
When ZFS splits a compressed or encrypted block while writing it, either because it's RAIDZ or because a gang block is needed, there is no meaningful logical-to-physical offset mapping for the extent. Given the current interface I'm not sure what those extents should look like. One option would be to add the extent multiple times, with each entry referencing a different physical offset that holds a partial portion of the block.
An alternative to all of this would be to report the extent from the perspective of the top-level RAIDZ vdev. That would push the issue up into the caller, but my feeling is that it wouldn't be particularly useful, since it would be difficult at best for callers to calculate the physical offsets.
@behlendorf see http://linux-fsdevel.vger.kernel.narkive.com/S8u3GLaY/patch-0-6-v5-fiemap-introduce-data-compressed-and-phys-length-flags for details on how compressed extents should be handled. The reserved64[0] field would become the physical length of the extent, and the existing length field would be the logical length. There should be a new extent flag for compressed blocks, but it makes sense to always fill in the physical length even if the blocks are not compressed.
Ideally, that patch series could be refreshed and submitted upstream, to ensure that the flags/fields are fixed for the future.
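To sketch what that would look like on the ZFS side (fe_phys_length and FIEMAP_EXTENT_DATA_COMPRESSED are from the unlanded patch series above, and the helper itself is hypothetical; BP_GET_LSIZE()/BP_GET_PSIZE() are the standard ZFS block-pointer accessors):

#define fe_phys_length  fe_reserved64[0]  /* proposed in the patch series */

/* Fill one extent for a (possibly compressed) block, following the
 * proposal: fe_length is the logical length, and the physical length
 * is always filled in, compressed or not. */
static void zfs_fill_fiemap_extent(struct fiemap_extent *fe,
                                   uint64_t logical, uint64_t physical,
                                   const blkptr_t *bp)
{
        fe->fe_logical = logical;
        fe->fe_physical = physical;
        fe->fe_length = BP_GET_LSIZE(bp);       /* logical length */
        fe->fe_phys_length = BP_GET_PSIZE(bp);  /* on-disk length */
        fe->fe_flags = 0;
        if (BP_GET_PSIZE(bp) != BP_GET_LSIZE(bp))
                fe->fe_flags |= FIEMAP_EXTENT_ENCODED |
                                FIEMAP_EXTENT_DATA_COMPRESSED;
}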
For anyone who just wants to display file fragmentation: it's not complete, and I'm not 100% sure it even works, but the script I wrote for #7110 might nevertheless be a good start.
Brian, any news on this? Would it be possible to post a WIP/RFC patch so that it can be reviewed and (maybe) someone else can work on it?
Nothing new to add. I wish there were a patch worth opening a PR for, but the prototype version I had was very basic, having been thrown together for a hackathon.
@adilger I've opened #7545 with a proper FIEMAP implementation. If you have any time to review or test it would be appreciated!