The discussion at https://github.com/golang/go/issues/31586 includes several techniques for optimizing xor of byte slices, including using encoding/binary (with native endianness) and loop unrolling. It'd be nice to apply them to safeXORBytes, which is used on many non-amd64 platforms. An optimized safeXORBytes might even be fast enough that we could delete fastXORBytes.
Somewhat related: #30553
cc @nhooyr
(Aside: I've wanted encoding/binary.NativeEndian to exist on multiple occasions. Here's another one.)
Will give this a shot on the weekend.
(Aside: I've wanted encoding/binary.NativeEndian to exist on multiple occasions. Here's another one.)
I agree this would be very useful. Some discussion in the past: https://groups.google.com/forum/#!topic/golang-nuts/3GEzwKfRRQw
Will give this a shot on the weekend.
Awesome. No rush, in any case—the freeze just started, so nothing will go in for three months.
Thanks for the pointer to previous discussion of native endianness. I’ll file a proposal to discuss.
If you implement a generic, fast xor function, it could probably stand to be used in x/crypto/sha3 — the current implementation there is somewhat dubious due to alignment assumptions.
(See CL 203837.)
CC @FiloSottile
@bcmills good point, thanks. If we put it in a place that both x/crypto and crypto/cipher can reach it, that'd presumably also resolve #30553.
Proposal at #35398.
Added in https://github.com/nhooyr/go/commit/24ce00e112af127052f7c46e5ec50cadb1cab97f
Results against xorBytesSSE2 on amd64:
$ /Users/nhooyr/src/golang/go/bin/go test -run=_ -bench=XORBytes
goos: darwin
goarch: amd64
pkg: crypto/cipher
BenchmarkXORBytes/8Bytes-8 176461338 6.70 ns/op 1194.11 MB/s
BenchmarkXORBytes/128Bytes-8 100000000 11.6 ns/op 11062.06 MB/s
BenchmarkXORBytes/2048Bytes-8 12895827 95.2 ns/op 21517.42 MB/s
BenchmarkXORBytes/32768Bytes-8 627291 1596 ns/op 20527.27 MB/s
BenchmarkFastSafeXORBytes/8Bytes-8 147817434 7.92 ns/op 1009.53 MB/s
BenchmarkFastSafeXORBytes/128Bytes-8 66896612 17.2 ns/op 7438.88 MB/s
BenchmarkFastSafeXORBytes/2048Bytes-8 6217075 185 ns/op 11069.10 MB/s
BenchmarkFastSafeXORBytes/32768Bytes-8 397446 3007 ns/op 10897.15 MB/s
PASS
ok crypto/cipher 12.014s
So safe fast is still about 2x as slow. Still 10x faster in my testing than basic simple xor though.
https://github.com/nhooyr/go/commit/68261d7dcf45c3e97f74d7d851301c945bf0eda7
Added safe xor to the benchmark:
$ /Users/nhooyr/src/golang/go/bin/go test -run=_ -bench=XORBytes
goos: darwin
goarch: amd64
pkg: crypto/cipher
BenchmarkXORBytes/default/8B-8 180542707 6.79 ns/op 1178.83 MB/s
BenchmarkXORBytes/safe/8B-8 135818607 8.73 ns/op 916.11 MB/s
BenchmarkXORBytes/fastSafe/8B-8 139375618 8.62 ns/op 927.86 MB/s
BenchmarkXORBytes/default#01/128B-8 100000000 10.9 ns/op 11787.06 MB/s
BenchmarkXORBytes/safe#01/128B-8 13410099 83.8 ns/op 1527.65 MB/s
BenchmarkXORBytes/fastSafe#01/128B-8 60092373 19.7 ns/op 6490.28 MB/s
BenchmarkXORBytes/default#02/2048B-8 13841852 83.5 ns/op 24532.14 MB/s
BenchmarkXORBytes/safe#02/2048B-8 974487 1180 ns/op 1735.13 MB/s
BenchmarkXORBytes/fastSafe#02/2048B-8 5222248 220 ns/op 9314.25 MB/s
BenchmarkXORBytes/default#03/32768B-8 855936 1448 ns/op 22634.24 MB/s
BenchmarkXORBytes/safe#03/32768B-8 61302 19043 ns/op 1720.75 MB/s
BenchmarkXORBytes/fastSafe#03/32768B-8 332857 3440 ns/op 9524.70 MB/s
PASS
ok crypto/cipher 17.433s
Sometimes nearly 3x as slow as the current default.
cc @FiloSottile for removing assembly in favor of pure Go
Filippo, is 2x-3x in this case within range of switching to pure Go?
There may also be further optimizations available (either in the pure Go or in the compiler); I haven't checked yet, and may not for a little while.
cc @CAFxX who I believe likes optimizing stuff like this
Most helpful comment
(Aside: I've wanted encoding/binary.NativeEndian to exist on multiple occasions. Here's another one.)