Currently, there's no assembly implementation for SHA3 hashing for ARM platforms (specifically ARMv7). On ARMv7+ there are vector assembly instructions (known as NEON) available which greatly speed up the speed of SHA3 hashing. There is an upstream reference implementation (here: https://github.com/KeccakTeam/KeccakCodePackage/blob/master/lib/low/KeccakP-1600-times2/OptimizedAsmARM/KeccakP-1600-inplace-pl2-armv7a-neon-le-gcc.s) available that implements SHA3 hashing using these vector instructions and so I have ported this to Go.
Unfortunately, there is no support in the Go assembler/dissassembler for ARMv7 vector instructions, and so I wrote a small tool (available: https://github.com/anonymouse64/asm2go) which translates native assembly code for ARM into Go's plan9 based assembly unsupported opcode syntax in order to integrate the upstream implementation in Go.
I see approximately 3-4 time speedup in SHA3 hashing on a reference Raspberry Pi 3 Model B Revision 1.2 board:
benchmark old ns/op new ns/op delta
BenchmarkPermutationFunction-4 19033 6054 -68.19%
BenchmarkSha3_512_MTU-4 388484 137001 -64.73%
BenchmarkSha3_384_MTU-4 279054 100177 -64.10%
BenchmarkSha3_256_MTU-4 224595 81443 -63.74%
BenchmarkSha3_224_MTU-4 210459 79196 -62.37%
BenchmarkShake128_MTU-4 181445 66482 -63.36%
BenchmarkShake256_MTU-4 199495 71572 -64.12%
BenchmarkShake256_16x-4 2704755 1094933 -59.52%
BenchmarkShake256_1MiB-4 151870003 53953383 -64.47%
BenchmarkSha3_512_1MiB-4 283838790 97048578 -65.81%
benchmark old MB/s new MB/s speedup
BenchmarkPermutationFunction-4 10.51 33.03 3.14x
BenchmarkSha3_512_MTU-4 3.48 9.85 2.83x
BenchmarkSha3_384_MTU-4 4.84 13.48 2.79x
BenchmarkSha3_256_MTU-4 6.01 16.58 2.76x
BenchmarkSha3_224_MTU-4 6.41 17.05 2.66x
BenchmarkShake128_MTU-4 7.44 20.31 2.73x
BenchmarkShake256_MTU-4 6.77 18.86 2.79x
BenchmarkShake256_16x-4 6.06 14.96 2.47x
BenchmarkShake256_1MiB-4 6.90 19.43 2.82x
I opened a CL providing this implementation here: https://go-review.googlesource.com/c/crypto/+/119255 however I have not received any feedback on the CL, so here I am opening this issue to hopefully get more visibility on this.
Ref: #28126
As I wasn't aware of the mentioned Assembly Policy mentioned in the issue, I will provide responses to some of the points mentioned there that should be discussed when considering assembly contributions:
This function is on the "fast path" when performing SHA3 hashing of large files, etc. I don't have specific benchmarks of implementing other parts of the hashing algorithm in assembly to compare to, but it should be clear from both the speedup I got and the fact that the amd64 implementation only implements this same function in assembly that the function I'm providing is the "minimal" amount to implement in assembly.
I would argue this is okay because the generated assembly is generated from a relatively small amount of hand-written native assembly. The reason why I didn't write the native assembly in plan9 assembly is because Go's assembler doesn't support the ARM vector instructions needed for this implementation (and using vector instructions is why the implementation is much faster than the native Go implementation).
Additionally, this small hand-written assembly comes from the upstream reference implementation for SHA3, and is in fact the same place where the original amd64 assembly implementation comes from.
In the future, this assembly code could be simplified with either support for vector instructions in plan9 assembly code, or if the go compiler gained the ability to emit vector instructions when compiling specifically for ARMv7+ processors.
SHA3 is likely not going to become a legacy primitive anytime soon, considering it was specifically chosen in a public, open manner to replace older hashing mechanisms
Specifically for our use case at Canonical Ltd., our package management system snapd uses SHA3 hashing for almost all (possibly all not 100% sure) hashing of files.
One particular place this lack of performance on ARM hits us is during the first boot of a newly flashed Ubuntu Core ARM device, where the system when installing snaps (and other tasks) takes up a good portion of the time performing SHA3 hashes. An example difference for a customer device I can't unfortunately release more details on, performs the first boot in a factory environment about 1-2 minutes faster (approx 10% faster) when using this assembly implementation, and this time/percentage increases with the number of snaps the customer wants to configure their device with initially.
/cc @FiloSottile
Hi folks, any update on this? Is it now too late for this to be considered for go 1.12?
@anonymouse64 hi there. It is too late for this to make it into 1.12 since we are deep into the freeze period where only bugs are being fixed. More info is here: https://golang.org/wiki/Go-Release-Cycle
Okay, so can I at least get a decision about whether an assembly implementation of SHA3 for ARM is acceptable or not (disregarding what specific go release it goes into)?
My CL has been open for almost 6 months and this issue has been open for almost 2 months and I have not yet heard any feedback on whether it is acceptable or not. I would like to at least know if it is acceptable, as I was told that I should attempt to upstream this implementation rather than try to maintain our own fork of the SHA3 crypto library. If it's not acceptable, I would need to prepare for making the case for us to maintain a fork which would be highly unfortunate so it would be good for me to know.
I'm sorry you've been waiting so long for a response. Hopefully this finds itself at the top of @FiloSottile's notifications and can be looked at more closely soon.
Change https://golang.org/cl/119255 mentions this issue: sha3: implement assembly implementation on ARM
Hi @FiloSottile any comment on this?
This provides significant improvement and should be considered
Any update on this, even an ACK/NACK on the generic approach would be great at this point, it's getting close to two years with no response on whether the approach is okay :-/
Most helpful comment
Okay, so can I at least get a decision about whether an assembly implementation of SHA3 for ARM is acceptable or not (disregarding what specific go release it goes into)?
My CL has been open for almost 6 months and this issue has been open for almost 2 months and I have not yet heard any feedback on whether it is acceptable or not. I would like to at least know if it is acceptable, as I was told that I should attempt to upstream this implementation rather than try to maintain our own fork of the SHA3 crypto library. If it's not acceptable, I would need to prepare for making the case for us to maintain a fork which would be highly unfortunate so it would be good for me to know.