Since the introduction of the Compressed ARC feature (6950 ARC should cache compressed data), it has been possible to disable the feature using the tunable: compressed_arc_enabled=0
It is unclear how many users operate systems with the compressed ARC feature disabled; however, it is clear that it gets a lot less testing than the default case. Over time, the assumption that the ARC is compressed, and dealing with the corner cases when it is not, has increased the complexity of the code base, and it often stands in the way of new features.
A number of developers have expressed a desire to retire the ability to disable the Compressed ARC.
We likely need to follow the not-yet-established OpenZFS Deprecation Policy, to give users warning that this feature is going away, and to give those with use cases for disabling the compressed ARC a chance to make those use cases known to us.
I've been mulling this over for a couple of days, and I think your "pathological" reasons are quite compelling. I honestly can't think of a reason for turning it off, unless you're not using compression for the entire pool, since otherwise we're going to be spending time decompressing blocks at some point along the chain no matter what.
I also just realized that we never put compressed_arc_enabled into the man pages.
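For reference while it's missing from the man pages, here is roughly how the tunable is exposed on the two platforms discussed in this thread (a sketch based on the paths and sysctl names that appear in later comments; runtime writability of the FreeBSD sysctl may depend on the version):

```sh
# Linux (ZoL): module parameter, can be flipped at runtime
cat /sys/module/zfs/parameters/zfs_compressed_arc_enabled
echo 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled   # disable compressed ARC
echo 1 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled   # restore the default

# FreeBSD: sysctl / loader tunable
sysctl vfs.zfs.compressed_arc_enabled
sysctl vfs.zfs.compressed_arc_enabled=0   # may need a loader.conf entry plus a reboot on some versions
```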
I think compressed_arc_enabled=0 has valid use cases (and performance impacts), thus should not be removed.
ARC compression trades CPU for RAM, and since every read will eventually lead to the data being decompressed, it can very well reduce performance (the TPS metric) in some scenarios due to the overhead of repeatedly decompressing one and the same data block - which might well offset any benefit gained from being able to fit more blocks into the available RAM.
The 1st issue (illumos and FreeBSD) could be avoided by compressing the data, when evicting it to L2ARC, with the algorithm it originally had on the pool - which is, IIRC, what ZoL does in the 2nd issue (leading to more data fitting into L2ARC).
The 3rd issue is IMHO a non-issue, as the data has to be decompressed anyway: it was requested for a read, else it wouldn't be fetched from the L2ARC in the first place.
The 4th issue states in the linked comment:
The performance overhead of this will be relatively low [...]
Regarding offload cards (and improved software implementations) producing a compressed representation that differs from the corresponding/old software implementation:
This breaks at least dedup and nop-write (in the sense of not finding a match in case an accelerator is added/removed - not in terms of data loss), and should get a clear warning in the documentation.
Bottom line
The listed issues could all (possibly except the encryption case, haven't wrapped my head around that yet) be addressed by tracking the original on-disk block checksum through the ARC and giving the L2ARC header an on-L2-disk checksum (so reads can be verified against the raw data coming from the cache drive).
This should decouple ARC compression from the on-disk format (allowing even blocks stored with compression=off to be compressed inside the ARC, and the other way around). That might increase the size of the (L2)ARC headers a bit, but it should remove any problems related to de-/recompression while avoiding the potential performance penalty of repeatedly decompressing MRU/MFU blocks (as the ARC could decompress these once and drop the compressed version, so it won't need to hold onto both).
This would make it completely irrelevant whether a clean block in the ARC has been decompressed, (re-)compressed (either for a trip through the L2ARC or just to save some space when the block doesn't get accessed that often anymore) or has been sitting there in its on-disk form (compressed or not) from the very beginning.
And possibly enable us to add knobs to tune ARC compression behaviour _per dataset_, later.
I think compressed_arc_enabled=0 has valid use cases (and performance impacts), thus should not be removed.
ARC compression trades CPU for RAM, and since every read will eventually lead to the data being decompressed, it can very well reduce performance (the TPS metric) in some scenarios due to the overhead of repeatedly decompressing one and the same data block - which might well offset any benefit gained from being able to fit more blocks into the available RAM.
This is not exactly true. There is a dbuf cache that avoids decompressing the same block repeatedly if it is accessed frequently enough. Depending on the compression algorithm, the cost to decompress can be quite low.
However, yes, as I mentioned in the original post, if you have a very large working set, it could still be useful, as expanding the dbuf cache to compensate results in excessive double caching.
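For anyone who wants to experiment with that trade-off, the dbuf cache has its own size knob on ZoL; a hedged sketch (the dbuf_cache_max_bytes name is my assumption for 0.7/0.8-era modules, so check which parameters your module actually exposes first):

```sh
# See which dbuf-cache parameters this module version exposes
ls /sys/module/zfs/parameters/ | grep dbuf

# Inspect, then raise the dbuf cache cap (bytes); assumes dbuf_cache_max_bytes exists here
cat /sys/module/zfs/parameters/dbuf_cache_max_bytes
echo $((1024 * 1024 * 1024)) > /sys/module/zfs/parameters/dbuf_cache_max_bytes
```

The double-caching concern above is exactly why this is a trade-off rather than a fix: whatever the dbuf cache holds uncompressed sits in addition to the compressed copy in the ARC.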
The 1st issue (illumos and FreeBSD) could be avoided by compressing the data, when evicting it to L2ARC, with the algorithm it originally had on the pool - which is, IIRC, what ZoL does in the 2nd issue (leading to more data fitting into L2ARC).
The only reason ZoL is different is that the way the L2ARC works was changed for the ZFS Native Crypto work, which has not been ported to FreeBSD and illumos yet.
The L2ARC is supposed to boost performance; having to use up CPU to compress blocks before writing them to the L2ARC works against that, and compression is much more expensive than decompression. Neither solution is ideal.
The same algorithm may not be available, as the point about QAT (Intel QuickAssist crypto/compression accelerator) shows.
The 3rd issue is IMHO a non-issue, as the data has to be decompressed anyway: it was requested for a read, else it wouldn't be fetched from the L2ARC in the first place.
An argument could be made that the L2ARC should be a cache of the in-memory representation, not the on-disk representation. In the case where the compressed_arc is enabled, they are the same. When it is disabled, doing extra work in one or both directions reduces the usefulness of the cache.
The 4th issue states in the linked comment:
The performance overhead of this will be relatively low [...]
Regarding offload cards (and improved software implementations) producing a compressed representation that differs from the corresponding/old software implementation:
This breaks at least dedup and nop-write (in the sense of not finding a match in case an accelerator is added/removed - not in terms of data loss), and should get a clear warning in the documentation.
Bottom line
The listed issues could all (possibly except the encryption case, haven't wrapped my head around that yet) be addressed by tracking the original on-disk block checksum through the ARC and giving the L2ARC header an on-L2-disk checksum (so reads can be verified against the raw data coming from the cache drive).
Expanding the size of every ARC header by 32 bytes would be a huge cost. Storing the checksum with the data on the L2ARC risks the checksum being corrupted along with the data, although maybe that is less of a concern since the probability of a checksum being corrupted in a way that it matches the data are low.
The thought of removing this tunable came out of a discussion of how to extend the L2 header in the ARC to contain the uncompressed data checksum, since the blockpointer checksum is of the compressed version. There was a strong desire not to increase the size of the L2 header, since there may be a very large number of them in memory if the L2ARC device is large.
This should decouple ARC compression from the on-disk format (allowing even blocks stored with compression=off to be compressed inside the ARC, and the other way around). That might increase the size of the (L2)ARC headers a bit, but it should remove any problems related to de-/recompression while avoiding the potential performance penalty of repeatedly decompressing MRU/MFU blocks (as the ARC could decompress these once and drop the compressed version, so it won't need to hold onto both).
This doesn't really make sense to me. I definitely would not want the ARC trying to compress blocks read from disk that were not compressed on disk, because of the latency, and because the most likely reason a block on disk is not compressed is that compression failed to yield sufficient gains.
This would make it completely irrelevant whether a clean block in the ARC has been decompressed, (re-)compressed (either for a trip through the L2ARC or just to save some space when the block doesn't get accessed that often anymore) or has been sitting there in its on-disk form (compressed or not) from the very beginning.
One of the things that makes the compressed ARC better than most other memory compression features out there, is that it does not spend time and memory trying to compress data at the worst possible time, when there is demand for memory.
And possibly enable us to add knobs to tune ARC compression behaviour _per dataset_, later.
That would be a big change.
The L2ARC is supposed to boost performance; having to use up CPU to compress blocks before writing them to the L2ARC works against that, and compression is much more expensive than decompression. Neither solution is ideal.
L2ARC is intended to boost read performance, yes.
An argument could be made that the L2ARC should be a cache of the in-memory representation, not the on-disk representation. In the case where the compressed_arc is enabled, they are the same. When it is disabled, doing extra work in one or both directions reduces the usefulness of the cache.
I see the usefulness of L2ARC in being backed by a device with way lower latency (more read IOPS) than the average pool drive, so fetching data from it has lower latency than fetching from the main pool - the decompression cost should play no big role in comparison (and it likely also applies when fetching the block from the pool). Plus, being able to fetch from L2 doesn't impact the IOPS budget of the pool.
Expanding the size of every ARC header by 32 bytes would be a huge cost. Storing the checksum with the data on the L2ARC risks the checksum being corrupted along with the data, although maybe that is less of a concern since the probability of a checksum being corrupted in a way that it matches the data are low.
The thought of removing this tunable came out of a discussion of how to extend the L2 header in the ARC to contain the uncompressed data checksum, since the blockpointer checksum is of the compressed version. There was a strong desire not to increase the size of the L2 header, since there may be a very large number of them in memory if the L2ARC device is large.
I can follow the argument about header size, please disregard my comment in that direction as it wasn't well thought out.
Regarding self-checksumming: it's OK for other stuff (among others: uberblocks), so it should be fine to store the checksum with the data - possibly enriched with some more information like the original DVA and birth (so one can't accidentally read a stale block with an intact self-checksum instead of the one we actually want, but I have no idea if such extra checks would actually be needed).
This doesn't really make sense to me. I definitely would not want the ARC trying to compress blocks read from disk that were not compressed on disk, because of the latency, and because the most likely reason a block on disk is not compressed is that compression failed to yield sufficient gains.
That depends, I guess. My line of thought was the following:
When a block is written to disk (with compression!=off) it is, according to https://github.com/zfsonlinux/zfs/blob/cc99f275a28c43fe450a66a7544f73c4935f7361/module/zfs/zio.c#L1589 (which rounds the compressed size up to a multiple of the smallest-ashift device and zeroes the tail), only stored compressed if the result needs fewer physical blocks than the uncompressed version, and, according to https://github.com/zfsonlinux/zfs/blob/c3bd3fb4ac49705819666055ff1206a9fa3d1b9e/module/zfs/zio_compress.c#L123, only when the compression yields at least 12.5%.
So looking at an ashift=12 pool and one 8k block of a zvol that could compress to, say, 4.1k - this block would be stored uncompressed on-disk and thus need 8k space when read back into ARC.
_Now_, if the ARC could compress that buffer when it isn't in use frequently enough, it could (if I've got this right) shrink the uncompressed 8K buffer to 4.5k (9 blocks of SPA_MINBLOCKSIZE, the granularity at which the ARC seems to cut its data buffers) - it wouldn't need to abide by the rules for on-disk compression, as that buffer would never be written back to disk (thanks to CoW, unless I'm missing something about how resilver works). A 43% reduction in ARC space use (granted: I constructed a quite optimal case to make the point) could be an interesting saving, don't you think?
The same goes for evicting into L2, even more so there, as L2 is byte-addressed and written wrap-around (not free-space-mapped and block-addressed like the rest of the pool), so the worker thread that evicts to L2 could well batch several writes together (removing the real alignment need for all but the first block of the write). The compressed 4.5k ARC buffer would then only need the actually used 4.1k (plus a little for the self-checksum discussed above) to go onto the L2 drive - gaining us another ~9% (compared to storing the compressed buffer, or ~48% compared to evicting the uncompressed block to L2).
Currently we have a background thread that feeds the L2ARC; couldn't we have something like it compress ARC buffers that have fallen out of grace (instead of directly evicting them, as the L2 feeder does, or dropping them)? And if we can do that, couldn't the ARC keep frequently used buffers decompressed, to avoid the double caching via dbufs (in case I got that correctly)?
One of the things that makes the compressed ARC better than most other memory compression features out there, is that it does not spend time and memory trying to compress data at the worst possible time, when there is demand for memory.
The tricky part could be to get good behaviour in case of memory pressure, with that I agree.
But this all is, for the moment, a rough idea I just got from digging through the source.
I haven't wrapped my head around all of it yet, so please correct me if I took a wrong turn somewhere.
But should I be right: even with the CPU overhead of background-compressing ARC buffers, the space saving it might give could make it well worth spending a currently free bit in arc_flags_t on decoupling ARC compression from the on-disk format.
That would be a big change.
Possibly. But maybe worth it.
ISTM like you're advocating removing an ARC feature because it interferes with L2ARC. However, relatively few people use L2ARC, while everyone uses ARC.
As for recompressing, AIUI, the blocks are uncompressed when used, but the compressed blocks remain, so there is no recompression.
The theoretical arguments both ways are interesting. However, practical experience can help us weigh the importance of the pros and cons.
@GregorKopka and @richardelling, are you using compressed_arc_enabled=0? If so could you elaborate on what makes this compelling for your use case?
@richardelling
relatively few people use L2ARC, while everyone uses ARC.
But (we think) almost nobody sets compressed_arc_enabled=0. Certainly fewer folks than are using L2ARC.
As for recompressing, AIUI, the blocks are uncompressed when used, but the compressed blocks remain, so there is no recompression.
With compressed_arc_enabled=0, the blocks are stored in memory uncompressed, so they are not uncompressed when used. The compressed version is not available in memory, which is why it needs to be recompressed when writing it to the L2ARC.
I agree that almost nobody sets compressed_arc_enabled=0. Based on many studies of tunables over the years, few people actually tune any of them. When they do tune, it is because of what we used to call "/etc/system viruses" or "I read it on the internet".
To frame the discussion, is the following table correct?
| compress_arc_enabled | ARC contents | ARC efficiency | L2ARC impact |
|---|---|---|---|
| 0 | only uncompressed block in ARC| reducing ARC efficiency | blocks must be recompressed to send to L2ARC, reducing L2ARC efficiency |
| 1 (default) | both compressed and uncompressed blocks in ARC | possibly reducing number of blocks eligible to be in the ARC and therefore reducing ARC efficiency | easy to pass compressed block to L2ARC, improving L2ARC efficiency |
possibly reducing number of blocks eligible to be in the ARC and therefore reducing ARC efficiency
Having compressed_arc turned on does not change any of the eligibility criteria. It just means that compressed blocks take less space than if they were stored in the ARC uncompressed, so you can fit more data in the ARC, increasing its efficiency. There may, however, be a small performance impact from decompressing a block repeatedly each time it is read from the ARC, rather than just once as it is loaded into the ARC.
@richardelling I'm not sure exactly what you mean by "ARC efficiency". All blocks are eligible to be in the ARC, regardless of compress_arc_enabled, though less may fit in the ARC with compress_arc_enabled=0. Here's an updated table that matches my understanding, with "efficiency" a proxy for "ARC hit rate". Also note that the impacts only apply to data that's compressed on disk. Data that's stored uncompressed on disk is handled the same regardless of compress_arc_enabled (no compression / decompression).
| compress_arc_enabled | ARC contents | ARC efficiency | L2ARC impact |
|---|---|---|---|
| 0 | uncompressed in ARC | reduced ARC efficiency (ARC stores fewer blocks because they are bigger in memory) | blocks must be recompressed to send to L2ARC, increasing L2ARC CPU usage |
| 1 (default) | matches what's on disk (compressed or uncompressed) | good ARC efficiency (ARC stores maximum number of blocks) | easy to pass compressed block to L2ARC, negligible L2ARC CPU usage |
yeah, I was afraid the "efficiency" word would cause confusion. What I mean is the ARC is a constrained resource containing a limited number of bytes. With compressed ARC, each compressed block consumes its lsize + (compressed) psize, for some period of time. Clearly, for higher compression ratios, the efficiency is better. But my argument is a rathole... don't go there now.
For the expected common case where compression ratios are high, the new framing is better.
With compressed ARC, each compressed block consumes its lsize + (compressed) psize, for some period of time
I think you're talking about the need to store the uncompressed version in memory while it's being accessed. compress_arc_enabled=0 has no impact on this. Even if the data is stored uncompressed in the ARC, an additional in-memory copy is made while it is being accessed, due to ABD. Additionally, this memory may continue to be used after the access completes, due to the dbuf cache. We think of this space as being owned by the dbuf cache, not the ARC, because the dbuf layer controls how much memory is used by it, and the eviction policy.
I'm currently running no production systems with compressed_arc_enabled=0, though I had experimented with it and vaguely remember seeing _slightly_ better performance when booting multiple diskless clients (backed by cloned zvols exported over iSCSI) in parallel - basically the 'repeated decompression' case. For reasons long forgotten by now, it hasn't been kept disabled.
I still _suspect_, though (assuming I understood the code well enough that my view - that on-disk data is read into and stored in the ARC in the 2^ashift-sized blocks of the vdevs it comes from - is somewhat correct), that (re-)compressing the data (after it has been delivered to the DMU, for which it needs to be decompressed anyway) at SPA_MINBLOCKSIZE granularity would lead to a more effective ARC, as it could store more data compared to storing the verbatim on-disk representation as it comes from the drives. Especially for data from vdevs with a higher ashift (12 or 13) and/or a lower record-/volblocksize.
I still _suspect_, though (assuming I understood the code well enough that my view - that on-disk data is read into and stored in the ARC in the 2^ashift-sized blocks of the vdevs it comes from - is somewhat correct), that (re-)compressing the data (after it has been delivered to the DMU, for which it needs to be decompressed anyway) at SPA_MINBLOCKSIZE granularity would lead to a more effective ARC, as it could store more data compared to storing the verbatim on-disk representation as it comes from the drives. Especially for data from vdevs with a higher ashift (12 or 13) and/or a lower record-/volblocksize.
In the case where we need to re-compress before writing to the L2ARC, it must be compressed in exactly the same way as the on-disk version, or the checksum will not match. The L2ARC used to have its own separate checksum, since it was usually compressed while the in-ARC version was not, but this was removed to make the L2ARC use a lot less ram per block that is cached there.
Couldn't the L2 data self-checksum on disk, or is the checksum needed _after_ the data has been read back into RAM?
1) Data is compressed on disk.
2) It is read from disk, and the checksum is compared.
3) It is then stored in the ARC (compressed, or uncompressed, based on the setting).
4) If it nears the tail end of the cache, then it is written to the L2ARC (if the ARC copy is uncompressed, it is recompressed with the same settings, so the checksum will match later).
5) When it is read back from the L2ARC, the checksum is compared again, against the version in the ARC header, which is from the block pointer (the checksum of the original on-disk version).
The L2ARC does not store its own checksum (it used to, but this was a waste of memory).
I know that it currently works that way; that doesn't answer my question of whether reading back from L2 _needs_ to compare against the checksum in the L1 header, _or_ whether the data being read could be verified through an L2 on-disk header (self-checksum, DVA, TXG) that would only 'waste' space on the L2 drives. That header could well be discarded (from RAM) after the read is verified and the retrieved payload is decompressed (which needs to be done anyway, else the L2 read wouldn't have happened in the first place).
This came up while thinking about the 'counter arguments' part: an ARC behaviour where the hot data is (and stays) uncompressed to avoid constant decompression, while the cooling data gets compressed (using otherwise idle CPU time) to squeeze as much into the available ARC space as possible, could be way more effective than the current approach. Should that be a derail... sorry.
It appears that utilizing L2ARC with zfs_compressed_arc_enabled=0 results in a kernel panic - #8454
For what it's worth, seeing how a not-often-used (and so not well tested) codepath (i.e. compressed ARC off) caused a kernel panic, I agree with the premise of this issue: compressed ARC should be mandatory. This would simplify code management, significantly reducing the possibility of messing up when making other changes.
This was discussed at the Feb 26 OpenZFS meeting (link below), and the input was positive. So I think we should move forward with this proposal. @allanjude would you like to open a PR?
https://www.youtube.com/watch?v=EXstK9ckcZQ
I'll get started on it this week.
Compressed ARC has a huge (4x) performance penalty in some cases: https://blog.lexa.ru/2019/05/10/zfs_vfszfscompressed_arc_enabled0.html
Please, don't remove control of this!
On Fri, May 10, 2019 at 06:57:40AM -0700, ptx0 wrote:
better to open an issue and resolve your performance than it is to
make them leave compressed arc tunable in and not get zstd
compression. disabling compressed arc should not be a requirement
for performance.
Are you kidding? Or are you really ready to resolve this issue by donating powerful hardware?
@slw Compressed ARC should not have a huge performance impact compared to uncompressed ARC. It sounds like you have a workload where that is not the case. We would like to investigate and fix that. Could you open a separate issue describing the problem you're having with compressed ARC?
On Fri, May 10, 2019 at 12:43:58PM -0700, Matthew Ahrens wrote:
@slw Compressed ARC should not have a huge performance impact
compared to uncompressed ARC. It sounds like you have a workload
where that is not the case. We would like to investigate and fix
that. Could you open a separate issue describing the problem you're
having with compressed ARC?
This is not my setup; it is Alex Tutubalin's setup. Can you contact him directly at [email protected]? English is OK.
@ahrens I don't have a case with a major performance difference, but I can see a reproducible ~10% difference on my Intel i5-5200U with FIO's buffer_compress_percentage=50.
I didn't think it would be a huge difference, but after my brief tests I'm against mandatory compression.
Reproducer:
# zfs get compression,recordsize,primarycache rpool/home/gmelikov/fio
NAME PROPERTY VALUE SOURCE
rpool/home/gmelikov/fio compression lz4 inherited from rpool
rpool/home/gmelikov/fio recordsize 128K default
rpool/home/gmelikov/fio primarycache all default
# echo 1 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
$ fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=read --bs=128k --direct=0 --size=512M --numjobs=2 --runtime=48 --group_reporting -time_based --buffer_compress_percentage=50
$ rm ./*.0
# echo 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
$ fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=read --bs=128k --direct=0 --size=512M --numjobs=2 --runtime=48 --group_reporting -time_based --buffer_compress_percentage=50
randwrite: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K, ioengine=libaio, iodepth=1
...
fio-2.16
Starting 2 processes
Jobs: 2 (f=2): [R(2)] [100.0% done] [3712MB/0KB/0KB /s] [29.7K/0/0 iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=2): err= 0: pid=20250: Fri May 10 23:43:22 2019
read : io=177351MB, bw=3694.8MB/s, iops=29557, runt= 48001msec
slat (usec): min=44, max=5843, avg=65.92, stdev=14.33
clat (usec): min=0, max=1988, avg= 1.10, stdev= 1.84
lat (usec): min=45, max=5872, avg=67.03, stdev=14.57
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 1], 95.00th=[ 2],
| 99.00th=[ 2], 99.50th=[ 3], 99.90th=[ 5], 99.95th=[ 6],
| 99.99th=[ 20]
lat (usec) : 2=90.41%, 4=9.14%, 10=0.41%, 20=0.03%, 50=0.01%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%
lat (msec) : 2=0.01%
cpu : usr=2.42%, sys=97.39%, ctx=1097, majf=0, minf=79
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=1418807/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: io=177351MB, aggrb=3694.8MB/s, minb=3694.8MB/s, maxb=3694.8MB/s, mint=48001msec, maxt=48001msec
randwrite: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K, ioengine=libaio, iodepth=1
...
fio-2.16
Starting 2 processes
Jobs: 2 (f=2): [R(2)] [100.0% done] [4242MB/0KB/0KB /s] [33.1K/0/0 iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=2): err= 0: pid=15436: Fri May 10 23:39:59 2019
read : io=204526MB, bw=4260.9MB/s, iops=34086, runt= 48001msec
slat (usec): min=36, max=10262, avg=56.93, stdev=21.23
clat (usec): min=0, max=2163, avg= 1.11, stdev= 1.83
lat (usec): min=37, max=10266, avg=58.04, stdev=21.38
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 1], 80.00th=[ 1], 90.00th=[ 2], 95.00th=[ 2],
| 99.00th=[ 3], 99.50th=[ 4], 99.90th=[ 5], 99.95th=[ 6],
| 99.99th=[ 19]
lat (usec) : 2=88.87%, 4=10.61%, 10=0.48%, 20=0.03%, 50=0.01%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%
lat (msec) : 4=0.01%
cpu : usr=2.70%, sys=97.01%, ctx=1233, majf=0, minf=82
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=1636208/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: io=204526MB, aggrb=4260.9MB/s, minb=4260.9MB/s, maxb=4260.9MB/s, mint=48001msec, maxt=48001msec
Enabled (!) compression could be devastating for performance on some systems. Here is an example (sorry, it is a Russian blog):
https://blog.lexa.ru/2019/05/10/zfs_vfszfscompressed_arc_enabled0.html
Bottom line of this post: on FreeBSD with ARC compression enabled (the default), some operations (backup checksum verification) run at a throughput of 600-800 Mbit/s, while with compression disabled (and on old versions of FreeBSD) it is 2-2.5 Gbit/s.
If you want to make compression always-on, you should fix such pathological cases first.
In my own experience (on FreeBSD again), compression on a server with media files (effectively incompressible) makes ARC efficiency significantly lower in terms of hit rate, as the difference between Wired memory and ARC becomes larger and the ARC becomes effectively SMALLER for the same amount of physical RAM.
Compressed ARC should have no impact on uncompressible data. It only impacts reads of data that is already compressed on disk. Are you seeing evidence to the contrary?
On Sat, May 11, 2019 at 07:36:40AM -0700, Richard Elling wrote:
Compressed ARC should have no impact on uncompressible data. It
only impacts reads of data that is already compressed on disk. Are
you seeing evidence to the contrary?
Metadata is still compressible. High demand for metadata plus a low-end CPU can cause a high performance impact.
On Sat, May 11, 2019 at 07:36:40AM -0700, Richard Elling wrote: Compressed ARC should have no impact on uncompressible data. It only impacts reads of data that is already compressed on disk. Are you seeing evidence to the contrary?
Metadata is still compressible. High demand for metadata plus a low-end CPU can cause a high performance impact.
People understand that "compressed ARC" never actually compresses anything, right?
It just defers decompressing data that is already compressed on disk, until each time it is actually read from the ARC, so it can store the compressed version in the ARC and maintain a higher cache hit ratio.
On Sat, May 11, 2019 at 08:28:42AM -0700, Allan Jude wrote:
On Sat, May 11, 2019 at 07:36:40AM -0700, Richard Elling wrote: Compressed ARC should have no impact on uncompressible data. It only impacts reads of data that is already compressed on disk. Are you seeing evidence to the contrary?
Metadata is still compressible. High demand for metadata plus a low-end CPU can cause a high performance impact.
People understand that "compressed ARC" never actually compresses anything, right?
It just defers decompressing data that is already compressed on disk, until each time it is actually read from the ARC, so it can store the compressed version in the ARC and maintain a higher cache hit ratio.
Cache efficiency depends not only on the cache hit rate, but also on the distribution of the demanded elements. As a result, additional elements in the ARC cache can have a very low impact on the cache hit ratio, while permanent decompression of compressed metadata in the cache can cause a high performance penalty.
In this issue's comments we see two performance impact results. That is a fact.
slw,
I'm unable to read Russian but I see that the linked page references use of
an Atom CPU. When you say "low end CPU", is this what you're referring to?
On Sat, May 11, 2019 at 08:51:48AM -0700, jwittlincohen wrote:
slw,
I'm unable to read Russian but I see that the linked page references use of
an Atom CPU. When you say "low end CPU", is this what you're referring to?
Yes (and the i5-5200U is also not a powerful CPU -- only 3486 in PassMark).
My [adapted] translation of this post (sorry for the bad English; you can also try Google Translate):
===
I have a FreeBSD-12 based file server.
I saw the backup verification speed drop from 2-2.5 Gbit/s on FreeBSD 11.1
(Acronis Backup limited) to 600-800 Mbit/s.
top shows almost all of the ARC as compressed, but the compression ratio is very poor
[from a comment].
vfs.zfs.compressed_arc_enabled=0 helped to resolve this.
[I use an Intel(R) Atom(TM) CPU C3758 @ 2.20GHz (2200.08-MHz K8-class
@allanjude decompression speed will always affect the ARC pipeline, it's not free - or what is your point? Unfortunately, there will be cases where someone needs RAM speed, not RAM capacity. Don't get me wrong, compressed ARC is a great feature, but in a world of NVMe drives and high IOPS you may need to disable compression for better speed.
So the question is: ARC capacity vs ARC speed. IMHO we should retain a mechanism to decide.
@kpande thanks for clarification, so:
The uncompressed data should be short-lived allowing the ARC to cache a much larger amount of data. The DMU would also maintain a smaller cache of uncompressed blocks to minimize the impact of decompressing frequently accessed blocks.
So with this, we have a "smaller cache of uncompressed blocks". But it doesn't change my point significantly.
The argument in favour of removing the ability to disable compressed ARC is about reducing code complexity, and about no longer having to work around assumptions (that recompressing a record will result in the same checksum) that may not hold true.
@allanjude decompression speed will always affect the ARC pipeline, it's not free - or what is your point? Unfortunately, there will be cases where someone needs RAM speed, not RAM capacity. Don't get me wrong, compressed ARC is a great feature, but in a world of NVMe drives and high IOPS you may need to disable compression for better speed.
So the question is: ARC capacity vs ARC speed. IMHO we should retain a mechanism to decide.
While I agree in principle, the reality seems to be that the uncompressed code path receives much less in-the-field testing. This leaves the door open to very ugly bugs like the one I discovered some time ago: https://github.com/zfsonlinux/zfs/issues/8454
So, for what it's worth, I agree with simplifying the ARC code path by removing the uncompressed case.
While I agree in principle, the reality seems to be that the uncompressed code path receives much less in-the-field testing.
To be honest, both paths are now broken at the very heart of the ARC algorithm (adaptation), and nobody cares (yes, it is FreeBSD again, but AFAIK this code is shared by all ZFS implementations):
https://reviews.freebsd.org/D19094
Yes, it is a separate issue, but it is linked to compressed ARC at its root, as this bug was introduced with compressed ARC.
On Sat, May 11, 2019 at 02:15:13PM -0700, shodanshok wrote:
@allanjude decompression speed will always affect the ARC pipeline, it's not free - or what is your point? Unfortunately, there will be cases where someone needs RAM speed, not RAM capacity. Don't get me wrong, compressed ARC is a great feature, but in a world of NVMe drives and high IOPS you may need to disable compression for better speed.
So the question is: ARC capacity vs ARC speed. IMHO we should retain a mechanism to decide.
While I agree in principle, the reality seems to be that the uncompressed code path receives much less in-the-field testing. This leaves the door open to very ugly bugs like the one I discovered some time ago: https://github.com/zfsonlinux/zfs/issues/8454
So, for what it's worth, I agree with simplifying the ARC code path by removing the uncompressed case.
I have many hosts with L2ARC and vfs.zfs.compressed_arc_enabled=0, and nothing crashes. I run FreeBSD.
This issue looks Linux-specific. And this Linux-specific issue is causing a regression for all systems. Good catch; the new upstream will be very toxic.
PS: a previous case -- the ABD introduction -- caused about a 30% performance drop (https://lists.freebsd.org/pipermail/freebsd-fs/2018-August/026612.html).
I have an interest in collecting data regarding the overhead metric for compressed ARC. It is not clear to me that it behaves well for all use cases. So, rather than guessing, I've created a Grafana dashboard for observing compressed ARC data. If testers would be so kind as to share the results for interesting workloads, it would be much appreciated.
https://github.com/richardelling/grafana-dashboards/blob/master/linux/Compressed-ARC.json
To use this dashboard
While I agree in principle, the reality seems to be that the uncompressed code path receives much less in-the-field testing.
To be honest, both paths are now broken at the very heart of the ARC algorithm (adaptation), and nobody cares (yes, it is FreeBSD again, but AFAIK this code is shared by all ZFS implementations):
https://reviews.freebsd.org/D19094
Yes, it is a separate issue, but it is linked to compressed ARC at its root, as this bug was introduced with compressed ARC.
Well, the linked issue is very different from the one I reported (which caused immediate kernel panic).
As I stated above, I have nothing against uncompressed ARC by itself. But if that code path is left "collecting dust" (due to low use on the field), maybe it is better to remove it entirely.
due to low use on the field
I'm curious what you're basing that on. I mean, is there some automated metrics reporting in ZFS I don't know about? Because I fail to see how the limited number of people who are aware of this change represents the install base of ZFS, let alone the hardware. Regarding the crash, is there not a test case for running without it?
due to low use on the field
I'm curious what you're basing that on. I mean, is there some automated metrics reporting in ZFS I don't know about? Because I fail to see how the limited number of people who are aware of this change represents the install base of ZFS, let alone the hardware. Regarding the crash, is there not a test case for running without it?
From the dev talks and the mailing list questions I had the impression that the vast majority of ZoL users have compressed ARC enabled. Moreover, the bug I described in https://github.com/zfsonlinux/zfs/issues/8454 is so quickly reproducible that I would expect anyone not using compressed ARC to hit it pretty soon. But I can be wrong, of course.
I'm unable to read Russian but I see that the linked page references use of an Atom CPU. When you say "low end CPU", is this what you're referring to?
OK, this is my blog post, my home server, etc, etc, etc.
CPU: Intel(R) Atom(TM) CPU C3758
Yes, this is a server Atom CPU (Denverton) with 8 cores, ECC RAM and a 10Gbit NIC. These CPUs are targeted at NAS and similar systems, so I expect a lot of mid-sized NAS systems etc. will use something similar.
Motherboard: Supermicro A2SDi-H-TF: https://www.supermicro.com/products/motherboard/atom/A2SDi-H-TF.cfm
RAM: 48GB
Disk setup:
7x HGST HUS726060ALE614 in RAIDZ2 (this is 'Data' array) + L2ARC (512G NVME drive)
2x HGST HUS726060ALE614 in a ZFS mirror - this is the 'backup dataset'. No L2ARC, but secondarycache for this dataset is set to metadata.
ZFS block record size: 1M for both datasets (and all filesystems), compression is off
OS: FreeBSD 12.0-STABLE r344512 GENERIC, updated to FreeBSD 12.0 on Feb 25, 2019 (according to kernel build date).
Problem description:
the top utility shows: lots of gigabytes in the ARC are compressed, with a compression ratio of 1.1:1 (I do not remember the exact ratio of compressed size to full ARC size)
My actions: vfs.zfs.compressed_arc_enabled=0
(because almost all of my files are not compressible: this is media data (compressed images), movies, compressed backup files)
And, of course, reboot.
I've rebooted several times after that (to see ARC usage/arc fill on my usual data access patterns), so right now:
uptime is ~2days
backup verification speed: as expected
I'll keep watching: it may not be a compressed ARC problem but, for example, some different problem which only appears after weeks or months (under my typical load).
Concerning disabling uncompressed ARC: there are some not-very-fast CPUs in NAS boxes/small servers/etc, so it is better to have an option to lower the CPU load.
OK, I can suffer the backup verification slowdown for a few more days.
Commented out vfs.zfs.compressed_arc_enabled=0 and rebooted.
Will monitor backup/verification speed for several days starting tomorrow. Stay tuned.
Followup: I could not wait for backup verification, so I tried re-reading a set of files several times:
fileset size: 34GB, my ARC limit is 42GB.
after two re-reads, top shows: ARC: 38G Total, 2982M MFU, 35G MRU, 976K Anon, 12M Header, 44M Other / 37G Compressed, 37G Uncompressed, 1.01:1 Ratio
According to this, most of ARC is compressed, although I do not use compression on ZFS volumes. Looks like data was compressed on the fly??
Data read speed (measured via tar cf - folderlist | mbuffer -s 64k -m 64m -o /dev/null) is disgusting:
summary: 33.7 GiByte in 1min 14.8sec - average of 461 MiB/s
Usually (without ARC compression) I see about 1 GB/s if the data resides in L2ARC (it is a slow-for-NVMe Patriot Hellfire M.2 NVMe drive with a max read speed of about 1.2 GB/s).
I expected much higher speed for from-memory reads.
Will continue my tests tomorrow.
* top shows this ARC breakdown: ARC: 38G Total, 2982M MFU, 35G MRU, 976K Anon, 12M Header, 44M Other / 37G Compressed, 37G Uncompressed, 1.01:1 Ratio
According to this, most of ARC is compressed, although I do not use compression on ZFS volumes. Looks like data was compressed on the fly??
I think you are misreading this, which is possibly my fault for the way I named the fields in top on FreeBSD. It means that basically NONE of your ARC contents are compressed.
Of the 38GB of ARC contents, 37GB of it is uncompressed. So I would expect very very little impact from the compressed arc feature being turned on or off. Blocks are only stored compressed if the compression provides at least a 12.5% savings.
A highly compressed ARC looks like this:
ARC: 20G Total, 2846M MFU, 16G MRU, 64K Anon, 509M Header, 60M Other
19G Compressed, 45G Uncompressed, 2.45:1 Ratio
That is to say, the ARC is 20GB total. Compressed, the MFU+MRU take 19GB, and if they were not compressed they would take 45GB of RAM, resulting in a 2.45:1 compression ratio.
top doesn't actually tell you what percentage of the blocks are compressed etc, just the physical size (compressed) and logical size (how much ram it would take if the blocks were not compressed)
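For anyone who would rather not decode top's labels, the same physical-vs-logical accounting is exported as ARC kstats; a sketch (field and sysctl names as I recall them from compressed-ARC-era code, so verify against your own arcstats output):

```sh
# Linux: physical (compressed) size vs logical (uncompressed) size of cached blocks
awk '/^(compressed_size|uncompressed_size)/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats

# FreeBSD equivalent
sysctl kstat.zfs.misc.arcstats.compressed_size kstat.zfs.misc.arcstats.uncompressed_size
```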
Data read speed (measured via tar cf - folderlist | mbuffer -s 64k -m 64m -o /dev/null) is disgusting:
summary: 33.7 GiByte in 1min 14.8sec - average of 461 MiB/s
Part of the issue here is that things are being written to the pipe in 4kb blocks, rather than larger.
Your results here may also vary wildly after a reboot, having nothing to do with compressed or uncompressed ARC.
Reading a 3.2GB uncompressable file, 100% from ARC with different mbuffer block sizes. I repeated the 4k test at the end to show that there was no change in the cache effects.
> mbuffer -s 4k -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 7.1sec - average of 462 MiB/s
> mbuffer -s 8k -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 4.2sec - average of 784 MiB/s
> mbuffer -s 16k -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 2.7sec - average of 1200 MiB/s
> mbuffer -s 64k -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 1.6sec - average of 2000 MiB/s
> mbuffer -s 128k -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 1.4sec - average of 2286 MiB/s
> mbuffer -s 1m -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 1.3sec - average of 2558 MiB/s
> mbuffer -s 1m -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 1.3sec - average of 2548 MiB/s
> mbuffer -s 1m -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 1.3sec - average of 2571 MiB/s
> mbuffer -s 4k -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 7.0sec - average of 464 MiB/s
Usually (without ARC compression) I see about 1GB/sec if data resides in L2ARC (it is slow /for NVME/ patriot hellfire M2/NVME drive with max read speed about 1.2GB/sec)
I expected much higher speed for from-memory reads.
Will continue my tests tomorrow.
There seem to be a lot of variables in your testing. I'd recommend for the benchmark, just disabling L2ARC all together to avoid it impacting the results.
I'd also recommend running top -aS and/or top -aHS to see what kernel threads are using CPU time, as I doubt the decompression threads could be the cause in this case.
@alextutubalin Thanks for reporting your problem and sharing some details with us here.
ZFS block record size: 1M for both datasets (and all filesystems), compression is off
Interesting. Since you are not compressing ZFS user data, compressed_arc_enabled should only be able to impact metadata (which is always compressed). The overall performance impact should be negligible. Even more so for sequential data access (like from tar), where there's no possibility of reading an indirect block more than once.
top shows:
37G Compressed, 37G Uncompressed, 1.01:1 Ratio
According to this, most of ARC is compressed, although I do not use compression on ZFS volumes. Looks like data was compressed on the fly??
No, "compressed ARC" does not compress anything ("on the fly" or otherwise). As @allanjude mentioned, it keeps the data in the ARC in the same format as on disk - compressed or not. top is telling you that the size after compression (if compressed) is almost the same as the size before compression. Which makes sense since your user data is not compressed (size before compression == size after (nonexistent) compression.
Same test, same machine:
> sysctl vfs.zfs.compressed_arc_enabled
vfs.zfs.compressed_arc_enabled: 0
> mbuffer -s 4k -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 7.3sec - average of 448 MiB/s
> mbuffer -s 8k -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 4.2sec - average of 775 MiB/s
> mbuffer -s 16k -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 2.7sec - average of 1187 MiB/s
> mbuffer -s 64k -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 1.6sec - average of 2015 MiB/s
> mbuffer -s 128k -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 1.4sec - average of 2272 MiB/s
> mbuffer -s 1m -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 1.3sec - average of 2545 MiB/s
> mbuffer -s 4k -m 64m -i bigfile.xz -o /dev/null
summary: 3256 MiByte in 7.0sec - average of 465 MiB/s
So compressed_arc_enabled being on or off doesn't seem to make a difference at all for incompressible data, as expected, since it doesn't do anything different if the data is not compressed on disk.
Folks,
1) when testing L2ARC performance (with compressed ARC disabled), with more or less the same test but a larger file selection (200GB instead of 30GB), I've just seen speeds above 1 GB/s (limited by my L2ARC speed).
I'll upgrade to the latest FreeBSD 12-stable and redo tests with and without arc compressed later today (it is 7:30am here in Moscow).
Now I'm waiting for my backups till 9-10am or so.
2) I've used mbuffer's 64k block size, not 4k. Yes, it looks like that limits IO speed too; here is a quick test made with a 1M block size (compressed ARC is still enabled):
tar cf - 2019-0* | mbuffer -s 1m -m 64m -o /dev/null
summary: 33.7 GiByte in 1min 05.9sec - average of 523 MiB/s
(repeated several times to ensure files are in ARC again).
Slightly better than 460 MB/s, but far from perfect.
Compared with mbuffer reading a 64GB all-zeroes file (made with dd seek=..):
mbuffer -i file64g.img -s 1m -m 64m -o /dev/null
summary: 64.0 GiByte in 30.0sec - average of 2183 MiB/s
Not that much either, but 4x higher than for real on-disk files, so it looks like mbuffer should not be the limiting factor.
I'll accurately re-do tests later today. I'm interested in my real media library read speed, so I'll do tests with these files.
Upgraded FreeBSD to latest 12-stable: FreeBSD 12.0-STABLE r347562 GENERIC
Today backup verification looks good (expected 2+Gbit/sec for old /not-in-arc/ files and 3.5Gbit/sec for just created/placed in ARC).
Will keep watching.
I wanted to speak up on behalf of the users who have been using the compressed_arc_enabled=0 tunable to restore reasonable performance but haven't yet discovered this thread. Compressed ARC completely trashes performance, and if anything, my vote would be to eliminate compressed ARC and make uncompressed mandatory.
Our usage case is a ZFS backed mail store, all SSD zpool with compression=gzip-1 on a server with lots of memory.
I present this extremely simple testcase, which represents a user searching an IMAP folder (mbox) for a specific term. ZFS v0.7.13-1 on Linux v4.19.34
ztest1:~# echo 3 > /proc/sys/vm/drop_caches
ztest1:~# cat /sys/module/zfs/parameters/zfs_compressed_arc_enabled
0 # Confirm off
# First test with compressed_arc=0
ztest1:~# time grep -c "Lorem Ipsum" /zfs/mail/inboxes.test/mbox_test_user_2gb
6
real 0m4.426s
user 0m2.682s
sys 0m1.577s
# Repeat test
ztest1:~# time grep -c "Lorem Ipsum" /zfs/mail/inboxes.test/mbox_test_user_2gb
6
real 0m3.936s
user 0m2.411s
sys 0m1.488s
# Enable compressed_arc and flush caches
ztest1:~# echo -en 1 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
ztest1:~# echo 3 > /proc/sys/vm/drop_caches
ztest1:~# cat /sys/module/zfs/parameters/zfs_compressed_arc_enabled
1 # Confirm on
# First test with compressed_arc=1
ztest1:~# time grep -c "Lorem Ipsum" /zfs/mail/inboxes.test/mbox_test_user_2gb
6
real 0m11.601s
user 0m2.555s
sys 0m8.772s
ztest1:~# time grep -c "Lorem Ipsum" /zfs/mail/inboxes.test/mbox_test_user_2gb
6
real 0m10.531s
user 0m2.312s
sys 0m8.047s
# Disable compressed_arc, flush caches, and test a second time as a sanity check
ztest1:~# echo -en 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
ztest1:~# echo 3 > /proc/sys/vm/drop_caches
ztest1:~# cat /sys/module/zfs/parameters/zfs_compressed_arc_enabled
0 # Confirm off again
# Second test with compressed_arc=0
ztest1:~# time grep -c "Lorem Ipsum" /zfs/mail/inboxes.test/mbox_test_user_2gb
6
real 0m4.557s
user 0m2.608s
sys 0m1.626s
# Repeat second test with compressed_arc=0
ztest1:~# time grep -c "Lorem Ipsum" /zfs/mail/inboxes.test/mbox_test_user_2gb
6
real 0m3.958s
user 0m2.487s
sys 0m1.443s
ztest1:~#
@idvsolutions Could you please re-run your "grep test", but instead of grepping, use "md5sum" to make sure the file is being read sequentially and in its entirety. And all that I'm terribly interested in is the initial "empty cache" numbers, which I'm surprised are so different in your grep test with compressed ARC disabled versus enabled. The reason I'm surprised is that, theoretically, both tests must be reading and, therefore, decompressing the entire file. If they still differ a lot, the next step would be to run the tests under "perf record" and see where that extra time is being spent.
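Roughly the run being asked for, as a sketch (the mailbox path and tunable path are copied from the earlier comments; the perf invocation is a generic one, adjust to taste):

```sh
# cold-cache sequential read with compressed ARC off, then on
echo 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
echo 3 > /proc/sys/vm/drop_caches
time md5sum /zfs/mail/inboxes.test/mbox_test_user_2gb

echo 1 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
echo 3 > /proc/sys/vm/drop_caches
perf record -a -g -- md5sum /zfs/mail/inboxes.test/mbox_test_user_2gb
perf report --stdio | head -n 50   # where is the extra system time going?
```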
@ptx0 Yes, we are following zstd with much interest also. My understanding is that forcing compressed ARC is only 1 possible solution to the issue, though. I am not advocating to make compressed_arc_enabled=0 mandatory, but I am saying people ARE using this functionality, it IS impacting real-world workloads, and it would be a Big Deal to lose it.
I tried to pick a simple example that would be easy for others to replicate, but we see the same issue across the board with compressed data whether it's mail store, syslog data, CDRs, ie virtually any data that benefits from compression and you'd want to read/search repeatedly.
@dweeezil I believe the "empty cache" numbers are so poor because the compressed ARC path has a single-threaded bottleneck at some point. Monitoring total CPU usage while testing shows all CPUs in used on the "empty cache" run with compressed_arc_enabled=0 but only 1 CPU used with compressed_arc_enabled=1. grep was chosen because it represents an example real-world workload, where the numbers for md5sum are comperable, md5sum actually has higher CPU overhead and somewhat masks the issue. Doing a straight "cat" though demonstrates an even bigger difference between the "cached" numbers.
# echo -en 1 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
# echo 3 > /proc/sys/vm/drop_caches
# time cat /zfs/mail/inboxes/mbox_test_user_2gb > /dev/null
real 0m8.839s
user 0m0.010s
sys 0m8.494s
# time cat /zfs/mail/inboxes/mbox_test_user_2gb > /dev/null
real 0m7.993s
user 0m0.000s
sys 0m7.809s
# echo -en 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
# echo 3 > /proc/sys/vm/drop_caches
# time cat /zfs/mail/inboxes/mbox_test_user_2gb > /dev/null
real 0m1.889s
user 0m0.000s
sys 0m0.547s
# time cat /zfs/mail/inboxes/mbox_test_user_2gb > /dev/null
real 0m0.420s
user 0m0.000s
sys 0m0.417s
And here are the md5 numbers just for fun:
# echo -en 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
# echo 3 > /proc/sys/vm/drop_caches
# time md5sum /zfs/mail/inboxes/mbox_test_user_2gb
c9a58a7d7ffb32f22f30b2663314e185 /zfs/mail/inboxes/mbox_test_user_2gb
real 0m4.717s
user 0m3.768s
sys 0m0.657s
# echo -en 1 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
# echo 3 > /proc/sys/vm/drop_caches
# time md5sum /zfs/mail/inboxes/mbox_test_user_2gb
c9a58a7d7ffb32f22f30b2663314e185 /zfs/mail/inboxes/mbox_test_user_2gb
real 0m13.194s
user 0m3.577s
sys 0m9.309s
Just to be clear, this is not only a zfsonlinux issue. We see the same issue with FreeBSD and utilize zfs.compressed_arc_enabled=0 on all of our FreeNAS/FreeBSD 11+ systems (FreeBSD 10.3 did not have this problem)
dilos (illumos based)
root@dev2:~# dd if=/dev/urandom of=/var/tmp/2gb.bin bs=1M count=2000 status=progress
2024800256 bytes (2.0 GB, 1.9 GiB) copied, 24 s, 84.4 MB/s
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB, 2.0 GiB) copied, 24.8379 s, 84.4 MB/s
root@dev2:~# mdb -ke 'zfs_compressed_arc_enabled::print'
0x1
root@dev2:~# time cat /var/tmp/2gb.bin > /dev/null
real 0m0.507s
user 0m0.011s
sys 0m0.495s
root@dev2:~# mdb -kwe 'zfs_compressed_arc_enabled/W 0t0'
zfs_compressed_arc_enabled: 0x1 = 0x0
root@dev2:~# time cat /var/tmp/2gb.bin > /dev/null
real 0m0.506s
user 0m0.011s
sys 0m0.495s
root@dev2:~# mdb -kwe 'zfs_compressed_arc_enabled/W 0t1'
zfs_compressed_arc_enabled: 0 = 0x1
root@dev2:~# time cat /var/tmp/2gb.bin > /dev/null
real 0m0.507s
user 0m0.011s
sys 0m0.495s
No differences when changing the ARC setting - on real hardware: a 1U server with one Xeon CPU and 32GB of RAM.
On Tue, May 21, 2019 at 04:16:04PM -0700, Igor K wrote:
dilos (illumos based)
[...] no differences when changing the ARC setting - on real hardware: a 1U server with one Xeon CPU and 32GB of RAM
Really, there were no changes in the ARC: every cat run used the result cached by the first run, with zfs_compressed_arc_enabled=1.
I have been able to reproduce the problem as described by @idvsolutions.
The issue occurs when dealing with blocks that are actually compressed, so whoever is doing tests with random data (@slw) is not going to see an issue, because the blocks won't actually be compressed and there will be no difference in behavior with zfs_compressed_arc_enabled set to 0 or 1.
After a quick observation of my test system while running the test, the issue does, indeed, seem to be related to parallelism: without compressed ARC, the decompression happens in numerous zio threads, whereas, with compressed ARC, the decompression occurs in a single thread.
Here's my test script (the "linux.2gb" file is the one that compresses and the "random.2gb" file is the one that does not compress):
# unload the ZFS modules, then reload them with the requested
# compressed-ARC setting passed as the first script argument
/usr/share/zfs/zfs.sh -u
modprobe zfs zfs_compressed_arc_enabled=$1
# re-import the file-backed test pool
zpool import -d /tank.img tank
# time a full sequential read of each file with a cold ARC
echo linux.2gb
time md5sum /tank/fs/linux.2gb
echo random.2gb
time md5sum /tank/fs/random.2gb
The file full of compressed blocks (linux.2gb) reads about 3 times faster [EDIT: actually, about 2.25 times faster] when compressed ARC is disabled.
I'll investigate further but I suspect this is, in fact, a somewhat unintended side-effect of the design of compressed ARC and is also likely a result of these tests all being single-threaded. I suspect the situation would be much better when, say, using fio and multiple threads to test sequential reading.
I added one more step to my test script: another copy of the "linux.2gb" file, this one written with compression=lz4. It reads substantially faster with compressed ARC enabled, however not quite as fast as when it's disabled.
Further testing shows that the decompression parallelism seen in the zfs_compressed_arc_enabled=0 case is almost completely due to prefetching. When prefetching is disabled, the sequential read performance is almost identical whether or not compressed ARC is enabled.
This type of benchmark appears to be a somewhat pathological case: Sequential reading of a file consisting completely of compressed blocks is faster without compressed ARC because the decompression happens during the zios launched by prefetching and gains performance due to parallelism.
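(For anyone who wants to repeat that check: prefetching can be toggled at runtime on Linux through the zfs_prefetch_disable module parameter; a minimal sketch, reusing the test file from the script above:)
# disable prefetch, repeat the sequential read, then restore the default
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
time md5sum /tank/fs/linux.2gb
echo 0 > /sys/module/zfs/parameters/zfs_prefetch_disable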
Folks,
I know almost nothing about ZFS internals, so my (rhetorical) question may be completely wrong.
Do I understand correctly that:
1) If compressed_arc_enabled=1, then the ARC stores compressed blocks (if those blocks were compressed on disk), right?
2) So repeated reads of the same file (block set) will result in repeated decompression of the same data, right?
If so, then especially when a slow compression method is used (e.g. gzip), repeated reads of an ARC-cached file will definitely be slower than if uncompressed data were cached in the ARC. This may or may not be acceptable, depending on the specific case (slow/fast CPU, compression ratio, ARC size).
And yes, if the ARC is large enough for the uncompressed active dataset, it is better to have an uncompressed ARC as an option (even if single-threaded decompression is fixed in the future).
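(As a side note, whether compressed ARC is actually buying back much RAM on a given system can be read straight from the arcstats the module already exports; a Linux example, kstat names as in current ZoL:)
# compare the compressed vs uncompressed footprint of the current ARC contents
awk '$1 ~ /^(compressed_size|uncompressed_size|overhead_size)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats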
@ptx0, the dbuf cache is another cache on top of the ARC, right? So instead of a single (uncompressed) copy of a block, using the dbuf cache would result in a compressed block in the ARC plus an uncompressed block in the dbuf cache? That looks like a waste of RAM.
Sequential reading of a file consisting completely of compressed blocks is faster without compressed ARC because the decompression happens during the zios launched by prefetching and gains performance due to parallelism.
@dweeezil Thanks for your investigation, that analysis makes sense to me. I wonder if there's any performance difference with compressed ARC on/off if the workload is multi-threaded sequential reads of compressed data? E.g. NCPU's files, and one thread reading each file.
Any thoughts on how we could close that performance gap for single-threaded sequential reads? Maybe prefetches could be decompressed asynchronously into some cache (the dbuf cache, or a separate cache)? Maybe we could evict the decompressed copy after arc_min_prefetch_ms (default 1 second)?
@idvsolutions gzip-1 is not as good as zstd, which you would gain access to by allowing compressed ARC to become mandatory.
As I understand it, the problem is solely a checksum mismatch against the original on-disk pool checksum when the compressor used for recompression isn't bit-identical:
Wouldn't it be a good enough solution to change L2ARC on-disk blocks to be _payload, original block DVA, compression algorithm used, self-checksum_? That should be enough to verify the self-checksum to guarantee data integrity, to check that the L2 on-disk header's DVA matches the L1 ARC header so we know the right data is being read back, and to pick the correct decompressor for the L2 compressed representation (regardless of how the original block is compressed on disk in the pool).
The extra tail of the L2 on-disk data block could be released together with the compressed representation when the buffer is decompressed directly. In the compressed-ARC case, the additional L2 on-disk header tail could (after updating the ARC_FLAG_COMPRESS_x bits in arc_buf_hdr->arc_flags to the compression mode used for the L2 trip) either be ignored if it doesn't cross a SPA_MINBLOCKSIZE granularity boundary, or, if it does, the last SPA block of the buffer could be released (as it would only contain parts of the then-useless L2 on-disk header).
Pros:
Cons:
I also did some tests, changing the compression algorithm and the number of threads. I ran the benchmark on a system with an Intel i7-7700HQ CPU @ 2.80/3.80GHz (base/turbo) and a fast NVMe device (the OEM version of a Samsung 960 EVO 500 GB).
Each test was repeated two times: the first with a cold/empty ARC, the second with a hot ARC.
Summary of results (aggregate performance of the "hot ARC" run, sequential read with fio):
Some considerations:
Below you can find the annotated benchmark runs...
# lz4 dataset
[root@gdanti-lenovo g.danti]# zfs get all tank/test | grep "inherited\|local\|recordsize"
tank/test recordsize 128K default
tank/test compression lz4 inherited from tank
tank/test xattr sa inherited from tank
tank/test relatime on inherited from tank
# create a compressible, 1 GB sized file
[root@gdanti-lenovo g.danti]# fio --name=test --filename=/tank/test/test.img --size=1G --rw=read --buffer_pattern=0xDEADBEEF --buffer_compress_percentage=75
[root@gdanti-lenovo g.danti]# du -hs /tank/test/test.img
285M /tank/test/test.img
# compressed arc enabled
[root@gdanti-lenovo g.danti]# cat /sys/module/zfs/parameters/zfs_compressed_arc_enabled
1
[root@gdanti-lenovo g.danti]# zpool export tank; zpool import tank
[root@gdanti-lenovo g.danti]# fio --name=test --filename=/tank/test/test.img --size=1G --rw=read
READ: bw=1082MiB/s (1135MB/s), 1082MiB/s-1082MiB/s (1135MB/s-1135MB/s), io=1024MiB (1074MB), run=946-946msec
READ: bw=1298MiB/s (1361MB/s), 1298MiB/s-1298MiB/s (1361MB/s-1361MB/s), io=1024MiB (1074MB), run=789-789msec
[root@gdanti-lenovo g.danti]# zpool export tank; zpool import tank
[root@gdanti-lenovo g.danti]# fio --name=test --filename=/tank/test/test.img --size=1G --rw=read --numjobs=4
READ: bw=3101MiB/s (3251MB/s), 775MiB/s-830MiB/s (813MB/s-870MB/s), io=4096MiB (4295MB), run=1234-1321msec
READ: bw=3070MiB/s (3220MB/s), 768MiB/s-920MiB/s (805MB/s-965MB/s), io=4096MiB (4295MB), run=1113-1334msec
# compressed arc disabled
[root@gdanti-lenovo g.danti]# cat /sys/module/zfs/parameters/zfs_compressed_arc_enabled
0
[root@gdanti-lenovo g.danti]# zpool export tank; zpool import tank
[root@gdanti-lenovo g.danti]# fio --name=test --filename=/tank/test/test.img --size=1G --rw=read
READ: bw=1036MiB/s (1087MB/s), 1036MiB/s-1036MiB/s (1087MB/s-1087MB/s), io=1024MiB (1074MB), run=988-988msec
READ: bw=1354MiB/s (1420MB/s), 1354MiB/s-1354MiB/s (1420MB/s-1420MB/s), io=1024MiB (1074MB), run=756-756msec
[root@gdanti-lenovo g.danti]# zpool export tank; zpool import tank
[root@gdanti-lenovo g.danti]# fio --name=test --filename=/tank/test/test.img --size=1G --rw=read --numjobs=4
READ: bw=3048MiB/s (3196MB/s), 762MiB/s-766MiB/s (799MB/s-803MB/s), io=4096MiB (4295MB), run=1337-1344msec
READ: bw=3489MiB/s (3658MB/s), 872MiB/s-977MiB/s (915MB/s-1025MB/s), io=4096MiB (4295MB), run=1048-1174msec
---
# gzip dataset
[root@gdanti-lenovo g.danti]# zfs get all tank/test | grep "inherited\|local"
tank/test compression gzip local
tank/test xattr sa inherited from tank
tank/test relatime on inherited from tank
# create a compressible, 1 GB sized file
[root@gdanti-lenovo g.danti]# rm -f /tank/test/test.img
[root@gdanti-lenovo g.danti]# fio --name=test --filename=/tank/test/test.img --size=1G --rw=read --buffer_pattern=0xDEADBEEF --buffer_compress_percentage=75
[root@gdanti-lenovo g.danti]# du -hs /tank/test/test.img
289M /tank/test/test.img
# compressed arc enabled
[root@gdanti-lenovo g.danti]# cat /sys/module/zfs/parameters/zfs_compressed_arc_enabled
1
[root@gdanti-lenovo g.danti]# zpool export tank; zpool import tank
[root@gdanti-lenovo g.danti]# fio --name=test --filename=/tank/test/test.img --size=1G --rw=read
READ: bw=374MiB/s (392MB/s), 374MiB/s-374MiB/s (392MB/s-392MB/s), io=1024MiB (1074MB), run=2738-2738msec
READ: bw=438MiB/s (459MB/s), 438MiB/s-438MiB/s (459MB/s-459MB/s), io=1024MiB (1074MB), run=2340-2340msec
[root@gdanti-lenovo g.danti]# zpool export tank; zpool import tank
[root@gdanti-lenovo g.danti]# fio --name=test --filename=/tank/test/test.img --size=1G --rw=read --numjobs=4
READ: bw=1243MiB/s (1304MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s), io=4096MiB (4295MB), run=3294-3294msec
READ: bw=1252MiB/s (1313MB/s), 313MiB/s-313MiB/s (328MB/s-328MB/s), io=4096MiB (4295MB), run=3271-3271msec
# compressed arc disabled
[root@gdanti-lenovo g.danti]# cat /sys/module/zfs/parameters/zfs_compressed_arc_enabled
0
[root@gdanti-lenovo g.danti]# zpool export tank; zpool import tank
[root@gdanti-lenovo g.danti]# fio --name=test --filename=/tank/test/test.img --size=1G --rw=read
READ: bw=1037MiB/s (1088MB/s), 1037MiB/s-1037MiB/s (1088MB/s-1088MB/s), io=1024MiB (1074MB), run=987-987msec
READ: bw=1424MiB/s (1493MB/s), 1424MiB/s-1424MiB/s (1493MB/s-1493MB/s), io=1024MiB (1074MB), run=719-719msec
[root@gdanti-lenovo g.danti]# zpool export tank; zpool import tank
[root@gdanti-lenovo g.danti]# fio --name=test --filename=/tank/test/test.img --size=1G --rw=read --numjobs=4
READ: bw=2960MiB/s (3103MB/s), 740MiB/s-740MiB/s (776MB/s-776MB/s), io=4096MiB (4295MB), run=1383-1384msec
READ: bw=3185MiB/s (3340MB/s), 796MiB/s-972MiB/s (835MB/s-1020MB/s), io=4096MiB (4295MB), run=1053-1286msec
@idvsolutions Could you please re-run your "grep test", but instead of grepping, use "md5sum" to make sure the file is being read sequentially and in its entirety. And all that I'm terribly interested in is the initial "empty cache" numbers, which I'm surprised are so different in your grep test with compressed ARC disabled versus enabled. The reason I'm surprised is that, theoretically, both tests must be reading and, therefore, decompressing the entire file. If they still differ a lot, the next step would be to run the tests under "perf record" and see where that extra time is being spent.
I think the big difference here is the read size. Reading a file 4kb at a time will be much slower than using dd to read it 1mb at a time.
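(For example, the effect of the read size alone can be seen by reading the same cached file with dd at different block sizes; illustrative only, reusing the file path from the earlier test:)
# small vs large read size against a warm cache
time dd if=/zfs/mail/inboxes/mbox_test_user_2gb of=/dev/null bs=4k
time dd if=/zfs/mail/inboxes/mbox_test_user_2gb of=/dev/null bs=1M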
I also ran some multi-threaded sequential read tests and they more-or-less confirm @shodanshok's conclusions above which I think can be summarized as "without compressed ARC, workloads which benefit from prefetch gain performance from parallel decompression in the zio layer". And, of course, this benefit is greatly magnified for the more expensive decompression algorithms. In my case, I ran some tests with "xdd" on a series of 256MiB files, totalling 2GiB and somewhat as expected, the performance difference between compressed ARC on/off was even _greater_ than the single-threaded case because each of the threads has its own prefetch stream and the system on which I ran the test has a lot of CPU cores.
@ahrens It seems that if prefetched blocks could be asynchronously decompressed into the dbuf cache when they're prefetched, it would likely mitigate most, if not all of the difference between compressed ARC on/off.
Sequential reading of a file consisting completely of compressed blocks is faster without compressed ARC because the decompression happens during the zios launched by prefetching and gains performance due to parallelism.
@dweeezil Thanks for your investigation, that analysis makes sense to me. I wonder if there's any performance difference with compressed ARC on/off if the workload is multi-threaded sequential reads of compressed data? E.g. NCPU's files, and one thread reading each file.
Any thoughts on how we could close that performance gap for single-threaded sequential reads? Maybe prefetches could be decompressed asynchronously into some cache (the dbuf cache, or a separate cache)? Maybe we could evict the decompressed copy after arc_min_prefetch_ms (default 1 second)?
@ahrens We had this same discussion last week at BSDCan: a single-threaded read from the compressed ARC is limited by decompression latency, because the read only requests one block at a time.
I did tests with multiple threads, and they could each read the file at the same speed up to the point where I started to run out of CPU or memory bandwidth, so threading seems to be the answer.
I agree the solution is likely some type of 'prefetch (predecompress) into the dbuf cache', but I am not sure how complex that will be, or how we avoid wasting too much CPU time decompressing blocks that may never get used.
My benchmark results:
16GB file compressed with LZ4 at 1.48x
Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz (10 cores, 20 threads)
| | CARC on | CARC off |
| -------- | ---------- | ---------- |
| ABD on | 2.2GB/s | 3.8GB/s |
| ABD off | 2.9GB/s | 8.6GB/s |
I also ran some multi-threaded sequential read tests and they more-or-less confirm @shodanshok's conclusions above which I think can be summarized as "without compressed ARC, workloads which benefit from prefetch gain performance from parallel decompression in the zio layer". And, of course, this benefit is greatly magnified for the more expensive decompression algorithms. In my case, I ran some tests with "xdd" on a series of 256MiB files, totalling 2GiB and somewhat as expected, the performance difference between compressed ARC on/off was even _greater_ than the single-threaded case because each of the threads has its own prefetch stream and the system on which I ran the test has a lot of CPU cores.
@ahrens It seems that if prefetched blocks could be asynchronously decompressed into the dbuf cache when they're prefetched, it would likely mitigate most, if not all of the difference between compressed ARC on/off.
Not sure this will work. The dbuf cache is generally quite short-lived. If you are reading off disk, you are likely already bottlenecked by the backing storage; it is the case of a cache hit where the decompress read-ahead is likely to make the bigger impact. There would still likely be some gain from what you propose, though. I just think we need to find a way to generalize it so that it also works for a re-read.
I propose we create a new issue to track 'compressed ARC performance', and move that discussion there, and likely abandon the idea of removing the compressed_arc_enabled tunable.
@allanjude What do you mean with "ABD off"? Are you referring to ABD scatter/gather feature?
Yes, abd_scatter_enabled=0
On FreeBSD, ABD is not required, as the kernel can allocate very large memory regions as needed. However, the ABD feature does help reduce memory fragmentation, so it is turned on by default. The extra copies needed to switch between linear and scatter-gather buffers, though, seem to have a relatively large impact on throughput.
There is a concurrent investigation in FreeBSD into how some of this can be avoided.
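(For anyone wanting to reproduce the ABD on/off comparison on Linux, the corresponding module parameter there appears to be zfs_abd_scatter_enabled; a hedged sketch:)
# check, then disable, scatter ABD allocations for a quick comparison run
cat /sys/module/zfs/parameters/zfs_abd_scatter_enabled
echo 0 > /sys/module/zfs/parameters/zfs_abd_scatter_enabled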
My benchmark results:
16GB file compressed with LZ4 at 1.48x
Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz (10 cores, 20 threads)
| | CARC on | CARC off |
| -------- | ---------- | ---------- |
| ABD on | 2.2GB/s | 3.8GB/s |
| ABD off | 2.9GB/s | 8.6GB/s |
* With ABD off, CARC on, and the dbuf cache enlarged to 1.25GB, a 1GB file completely cached in the dbuf cache managed 8.2GB/s
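(For reference, enlarging the dbuf cache like that can be done on Linux through the dbuf_cache_max_bytes module parameter; a sketch assuming a recent ZoL release, where some versions may require setting it at module load time instead:)
# raise the dbuf cache ceiling to roughly 1.25 GiB
echo $((1280 * 1024 * 1024)) > /sys/module/zfs/parameters/dbuf_cache_max_bytes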
I would like to put forward that this one specific benchmark should not necessarily block this change. As this result shows, ABD was the far bigger performance hit, yet nobody proposed switching that off by default, let alone removing it.
For this single-threaded, sequential, prefetched read test, maybe the more relevant comparison would be an unprefetched benchmark. Not sure if @allanjude tested that back then, but if prefetching into compressed ARC beats that, it could still be considered a win, on top of the significantly simpler-to-maintain code and the correctness benefits (#10342, #8454).
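A rough sketch of that comparison, reusing the pool and fio invocation from the runs above (Linux tunable names; illustrative, not a definitive methodology):
# compare compressed ARC on/off with prefetch disabled, so neither side
# benefits from parallel decompression in the prefetch zios
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
for carc in 1 0; do
    echo $carc > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
    zpool export tank; zpool import tank    # start each run with a cold ARC
    fio --name=test --filename=/tank/test/test.img --size=1G --rw=read
done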