@AttilaFueloep has indicated that the original question's premise, the belief that the slow performance on non-AVX processors was due to a non-accelerated GHASH, was in error. The more likely cause of the low performance is the FPU handling cost, which may be resolved by processing larger chunks on non-AVX processors, as is already done on processors with the AVX instructions; see his comments below.
This thread is now a request to implement such chunking.
*Above edited 9.10.2020*
On my machine, which has AES-NI but not the AVX extensions, accelerated GCM encryption is still dog slow. Intel's docs seem to indicate that SSE GHASH acceleration is possible, may provide a performance benefit similar to AVX-accelerated GHASH (see https://github.com/openzfs/zfs/pull/9749), and already has code available; see gcm_sse.asm, referenced at page 14:
Any possibility of implementing? Thank you.
The link [6] in the PDF seems to be dead, so I couldn't even find out whether the provided implementation is a 'proof of concept/educational, do not use for actual security' one or supposedly an audited, secure one.
The last GCM improvements were taken from OpenSSL AFAIK, so if they carry an optimized routine for this hardware, the chances of it finding its way into OpenZFS are substantially higher.
I appreciate your reply. FYI, I think this may be what we would be looking for: https://boringssl.googlesource.com/boringssl/+/refs/heads/master/crypto/fipsmodule/modes/asm/ghash-x86.pl
It indicates a more than 10x improvement. Thank you.
@electricboogie May I inquire why you didn't use the "question" or "feature request" forms? This is clearly a question or a feature request (which one depends on personal opinion, but that's why we offer both options).
(Asking because of #10833 and feedback on #10779 )
I was ignorant of those forms. Do you have a suggestion about how best to handle this now that we are where we are, so that my feature request is well received? Thank you. Appreciate your feedback.
OpenSSL seems to also carry that, at least judging by the header comment.
https://github.com/openssl/openssl/blob/master/crypto/modes/asm/ghash-x86_64.pl
Maybe @AttilaFueloep can comment on or even port this :)
I bet a number of Intel (Pentium|Celeron) J owners would be thankful.
@electricboogie No problem.
@behlendorf Please label this "Type: Feature" :)
(it's also a feature worth looking into)
FYI, might be Google/BoringSSL only, but see also: https://boringssl.googlesource.com/boringssl/+/refs/heads/master/crypto/fipsmodule/modes/asm/ghash-ssse3-x86_64.pl
Unfortunately it's more than GHASH; you'd need an SSE equivalent of aesni_gcm_[en|de]crypt(). That routine is a complete AES-GCM implementation in assembler (an AES-NI-CTR+GHASH stitch, as the comment says) and requires AVX. I skimmed over the OpenSSL sources but to no avail.
Without that, processing larger chunks of encryption data would roughly double performance by reducing the FPU state handling overhead, and it wouldn't be that hard to implement. But I'm afraid I don't have the capacity to make this happen anytime soon. If someone wants to tackle this, I'm happy to help.
As a side note, I'm wondering what kind of CPU would have AES-NI but no AVX? Does it support MOVBE and PCLMULQDQ?
Thank you for your response.
I have an Intel J4205. CPUID indicates it has the MOVBE and PCLMULQDQ instructions. Pleased to provide full cpuid output if needed.
I shouldn't pretend I know that the non-accelerated GHASH is the root of the problem; however, fio and OpenSSL benchmarks seem to indicate there is some performance to be gained somewhere:
$ dmesg | grep gcm
[ 7.708031] SSE version of gcm_enc/dec engaged.
$ openssl speed -evp aes-128-gcm
...
OpenSSL 1.1.1f 31 Mar 2020
...
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-gcm 202107.48k 538017.41k 880800.68k 1039108.78k 1085216.09k 1089465.00k
$ fio --direct=1 --name=read --ioengine=libaio --rw=read --bs=128k --size=512m --numjobs=8 --iodepth=1 --group_reporting
read: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=1
...
fio-3.16
...
READ: bw=124MiB/s (131MB/s), 124MiB/s-124MiB/s (131MB/s-131MB/s), io=4096MiB (4295MB), run=32903-32903msec
$ hdparm -t /dev/sda
/dev/sda:
Timing buffered disk reads: 1368 MB in 3.00 seconds = 455.71 MB/sec
Excuse me. My processor does have the PCLMULDQ instruction but not the VPCLMULQDQ instruction.
$ cpuid | grep -i pcl
PCLMULDQ instruction = true
VPCLMULQDQ instruction = false
PCLMULDQ instruction = true
VPCLMULQDQ instruction = false
PCLMULDQ instruction = true
VPCLMULQDQ instruction = false
PCLMULDQ instruction = true
VPCLMULQDQ instruction = false
@AttilaFueloep Excuse my (profound) ignorance; my reading of your explanation as to why an AES-NI-CTR+GHASH stitch is necessary is that the ZFS modules have to do as much as possible while they hold the FPU (because using the kernel SIMD interfaces is out of the question). Wouldn't it be possible just to ifdef an SSE-accelerated GHASH into the current stitch borrowed from the OpenSSL asm module or from a closely related fork? I'm sure that all sounds easier than it is.
And see: https://github.com/openssl/openssl/blob/master/crypto/modes/asm/ghash-x86_64.pl
Understand your point that not a lot of enterprise workloads are running on CPUs like this. However, I'm willing to gamble that a huge number of home NAS systems have the same or similar CPUs.
Appreciate your guidance.
Crypto in ZFS uses the ICP kernel module, which is a port of the Illumos crypto code and is CDDL-licensed. Therefore it can't use the GPL-only Linux kernel SIMD interfaces. The AES-GCM implementation is written in C and operates on single GCM blocks. Encryption and GHASH calculation are accelerated by using the AES-NI and PCLMULQDQ SIMD instructions. Since AES-GCM requires one AES and one GHASH calculation per block, this results in two FPU saves and two FPU restores per 16-byte block. This, of course, is a massive overhead, slowing down the operation. Please see the PR description in #9749.
The easiest way to improve performance in the non-AVX case would be to reduce this overhead by implementing the chunking described in the PR for the non-AVX cases too. This would roughly double the performance (tested while developing #9749). If you compare e.g. gcm_mode_encrypt_contiguous_blocks() against gcm_mode_encrypt_contiguous_blocks_avx(), you'll see the chunking implemented at line 1182. I could do this, but unfortunately right now I have a number of higher-priority tasks in my queue.
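For illustration, the shape of that chunking would be something like this (a minimal sketch only, not the actual implementation; the helper name and GCM_CHUNK_SIZE are made up, kfpu_begin()/kfpu_end() are ICP's existing FPU wrappers, and MIN() comes from the ZFS headers):

```c
/*
 * Sketch: take the FPU once per chunk instead of twice per 16-byte block,
 * mirroring what the AVX path already does.
 */
#define	GCM_CHUNK_SIZE	(32 * 1024)	/* same order of magnitude as gcm_avx_chunk_size */

static void
gcm_encrypt_chunked_sketch(const uint8_t *data, size_t len)
{
	size_t done = 0;

	while (done < len) {
		size_t chunk = MIN(len - done, GCM_CHUNK_SIZE);

		kfpu_begin();	/* one FPU save per chunk ... */
		/* ... AES-NI CTR + PCLMULQDQ GHASH over data[done .. done + chunk) ... */
		kfpu_end();	/* ... and one FPU restore per chunk */

		done += chunk;
	}
}
```

The existing per-block AES/GHASH helpers would still do the work inside the loop; the only change is how often the FPU state is saved and restored.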
Beyond that there is some optimization margin in the GHASH implementation too, but I'd guess it wouldn't be that massive. To improve performance further one would need an SSE assembler version of aesni-gcm-x86_64.S, but since the combination of AES-NI and SSE without AVX is quite uncommon, I doubt that such code exists. If someone knows of such code I'd appreciate any pointer.
Wouldn't it be possible just to ifdef in an SSE accelerated GHASH into the current stitch
If you look at the source you'll realize that it heavily utilizes VEX instructions (the ones starting with the letter v) and therefore requires AVX to run anyhow. So unfortunately this is not possible.
I agree that having fast encryption for a broader range of architectures would be a good thing to have, but as it currently stands this would require implementation of one of the above options.
My processor does have the PCLMULDQ but not the VPCLMULQDQ instruction.
Yes, VEX instructions require AVX. Could you post here the output of cat /sys/module/icp/parameters/icp_aes_impl and cat /sys/module/icp/parameters/icp_gcm_impl please?
$ dmesg | grep gcm [ 7.708031] SSE version of gcm_enc/dec engaged.
That message is from the Linux kernel crypto code, which we can't use due to licence issues (GPL vs. CDDL).
Appreciate your response.
cat /sys/module/icp/parameters/icp_aes_impl
cycle fastest generic x86_64 [aesni]
cat /sys/module/icp/parameters/icp_gcm_impl
cycle [fastest] generic pclmulqdq
So you are using hardware acceleration for both AES and GHASH, and are paying the high FPU handling price. IIRC saving the FPU state on Goldmont is twice as expensive as on Ivy Bridge, so I'd expect "chunked encryption" to at least double your throughput.
@AttilaFueloep I really do appreciate this free education. Understand now that the bottleneck is the high FPU handling cost. And understand, as well, if you don't have the bandwidth to implement this right now, but would you mind if I reframe this question and your comments as an open feature request, according to @Ornias1993's form?
Thank you.
You can tag and ask behlendorf to just re-label this as "Type: Feature" ;)
Thank you @Ornias1993. @behlendorf Could you re-label this question as "Type: Feature"? I would suggest a new title, given what @AttilaFueloep has told us re: the likely root of this performance issue. Perhaps "ICP: Implement Larger Encrypt/Decrypt Chunks for Non-AVX Processors"? Of course, I would defer to @AttilaFueloep or you others who have been so helpful.
@electricboogie Feel free to rename the title and intro-text yourself! :)
Thanks for your great ideas though :)
FYI, perf top output on 0.8.3 also indicates that @AttilaFueloep is correct that FPU save and restore overhead during a fio run is the core issue. Pleased to provide additional info if needed. Thanks!
Samples: 327K of event 'cycles', 4000 Hz, Event count (approx.): 114513049547 lost: 0/0 drop: 0/65143
Overhead Shared Object Symbol
15.12% [kernel] [k] kfpu_restore_xsave.constprop.0
15.08% [kernel] [k] kfpu_restore_xsave.constprop.0
12.15% [kernel] [k] kfpu_save_xsave.constprop.0
12.05% [kernel] [k] kfpu_save_xsave.constprop.0
5.27% [kernel] [k] aes_encrypt_intel
4.58% [kernel] [k] aes_xor_block
4.15% [kernel] [k] gcm_mul_pclmulqdq
2.81% [kernel] [k] aes_encrypt_block
2.38% [kernel] [k] aes_aesni_encrypt
This weekend I found some time to look more into this. Given the fact that even the newest Intel Atoms (Tremont) do not support AVX but do support AES-NI and SSE, the home NAS use case is a valid point to consider.
Looking around, I think I've found some suitable code to use, and I'll try to come up with something once time permits. This may take a while though, and I can't give any ETA. Once done, it should perform comparably to the AVX implementation. As soon as I have something to test, I'll let you know.
Thanks for bringing this up.
I was curious about this as well so I did some testing on a system with an Intel Celeron N3060 I've been playing around with. I definitely think there is more performance that can be squeezed out of the "lower-end" non-AVX-equipped CPUs. As @AttilaFueloep mentioned, even current Intel low-power CPUs lack AVX, so it seems reasonable to invest a bit of time to at least determine if something can be done to improve performance and/or how hard it would be to implement.
I'd be happy to do any additional testing and/or test any patches if needed!
Relevant Test System Specifications:
CPU: Intel Celeron N3060
Memory: 8GB DDR3L
Kernel: 4.19.0-11-amd64 (most recent kernel in Debian Buster at time of testing)
OpenZFS Version: 0.8.4 (latest in buster-backports at time of testing)
ZFS Encryption: aes-128-gcm
ZFS Parameters: sync=standard, compression=off, recordsize=128K (performance was similar with sync=disabled)
$ cat /sys/module/icp/parameters/icp_aes_impl
cycle [fastest] generic x86_64 aesni
$ cat /sys/module/icp/parameters/icp_gcm_impl
cycle [fastest] generic pclmulqdq
$ openssl speed -evp aes-128-gcm
Doing aes-128-gcm for 3s on 16 size blocks: 21128961 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 64 size blocks: 10614883 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 256 size blocks: 3805791 aes-128-gcm's in 2.99s
Doing aes-128-gcm for 3s on 1024 size blocks: 1077630 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 8192 size blocks: 138814 aes-128-gcm's in 2.99s
Doing aes-128-gcm for 3s on 16384 size blocks: 69491 aes-128-gcm's in 2.98s
OpenSSL 1.1.1d 10 Sep 2019
built on: Mon Apr 20 20:23:01 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-8Ocme2/openssl-1.1.1d=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-gcm 112687.79k 226450.84k 325846.99k 367831.04k 380322.50k 382060.59k
$ dd if=/dev/zero of=./zero.000 bs=1M count=16K
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 435.618 s, 39.4 MB/s
I also compiled the kernel with the FPU begin/end functions re-exported and tested again. I used the current, stock Debian kernel source; the only change was the FPU function exports, so it was otherwise identical to the previous test. As expected, performance was better, though I was surprised by how much (about 80%!).
$ dd if=/dev/zero of=/testpool/temp/zero.000 bs=1M count=16K
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 236.754 s, 72.6 MB/s
As mentioned above, I'm working on a prototype which I expect to perform at least as well as OpenSSL (380 MB/s). I'll let you know once there is something to test. As an added benefit, once the SSE stuff is working, it's straightforward to add support for avx2, avx512 and avx512-vaes as well.
I also compiled the kernel with the FPU begin/end functions re-exported and tested again. I used the current, stock Debian kernel source; the only change was the FPU function exports, so it was otherwise identical to the previous test. As expected, performance was better, though I was surprised by how much (about 80%!).
Yes, that resembles the 100% I observed. If you can use the kernel FPU functions, the FPU state is only saved on context switches, which has essentially the same effect as processing larger chunks while disabling preemption.
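For context, the difference between the two paths looks roughly like this (a sketch only; kernel_fpu_begin()/kernel_fpu_end() are the GPL-only Linux interfaces, kfpu_begin()/kfpu_end() are ICP's own wrappers, everything else is illustrative):

```c
/*
 * GPL-only Linux path (not usable from the CDDL-licensed ICP): as described
 * above, the expensive state save effectively only happens on a context
 * switch, so the whole buffer can be processed in one go.
 */
kernel_fpu_begin();
/* ... process the whole buffer ... */
kernel_fpu_end();

/*
 * ICP path: every begin/end pair pays the full save/restore cost
 * (kfpu_save_xsave/kfpu_restore_xsave in the perf output above), so the
 * only lever is doing more work per pair, i.e. chunking.
 */
kfpu_begin();
/* ... process one chunk ... */
kfpu_end();
```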
@AttilaFueloep How about ensuring the default gcm_avx_chunk_size is at least SPA_OLD_MAXBLOCKSIZE (after rounding)? Would that allow all data blocks with the default recordsize, metadata blocks (+indirect blocks? Are they authenticated?), and ZIL blocks to be encrypted in one go?
I don't think that's useful for two reasons.
First, while using the FPU we disable interrupts and preemption to make sure the FPU regs won't get clobbered in between. To avoid starving the system, we should minimize the time we stay in this state, so the smaller the chunk size, the better.
And second, there are diminishing returns with increasing chunk size. Let's do a rough estimate: two calls per 16 bytes result in a 100% overhead, so one call per 32 KiB produces 0.025% overhead, which is negligible already. In retrospect I think the chosen value of 32 KiB is already quite large and a value of 16 KiB or 8 KiB would have been a better choice. I'm planning to refine the default value after running some benchmarks.
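Spelling that estimate out: the 100% figure implies one FPU save/restore pair costs about as much as the crypto work on 8 bytes, so one pair per 32 KiB chunk gives roughly 8/32768 ≈ 0.024% overhead, and one pair per 8 KiB still only about 8/8192 ≈ 0.1%.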