In the WebSocket protocol, clients have to generate a random masking key for every frame sent to the server and XOR mask the frame payload with the key. The server will unmask it on retrieval.
The mask/unmask function looks like this https://github.com/gorilla/websocket/blob/0ec3d1bd7fe50c503d6df98ee649d81f4857c564/mask_safe.go#L9-L15
This is masking byte by byte but can be made significantly faster if the masking was done word by word. Unfortunately, the Go compiler does not do this automatically so you have to use unsafe and write this yourself: https://github.com/gorilla/websocket/blob/0ec3d1bd7fe50c503d6df98ee649d81f4857c564/mask.go#L13-L54
Benchmarks at https://github.com/gorilla/websocket/issues/31
It would be nice if the Go compiler did this automatically.
The two links you posted are the same (copy-paste mistake?). The 2nd one in particular does not use unsafe.
My bad, fixed.
The second link was supposed to be https://github.com/gorilla/websocket/blob/0ec3d1bd7fe50c503d6df98ee649d81f4857c564/mask.go#L13-L54
Thank you for filing this issue @nhooyr and @ALTree for the triaging!
Kindly paging @josharian @randall77 @martisch.
Please see #28113 #30553 and #28465
If you want fast xor, there are 4 different implementations in crypto/cipher.
I don't know whether the compiler can do so much optimization.
The compiler already recognizes sequences of byte loads and coalesces them into a single multi-byte loads. See e.g. the amd64 rules starting at https://github.com/golang/go/blob/e6ae4e86ad59c45f302a8828e77e6c234307fce4/src/cmd/compile/internal/ssa/gen/AMD64.rules#L1531. We could in theory write similar rules that convert a sequence of byte-wise load/xor/store operations to an XORQmodify.
In fact, I believe there is a way of writing this that would allow the compiler to do this optimization now. This code:
import "encoding/binary"
func f(key [4]byte, b []byte) {
k := binary.LittleEndian.Uint32(key[:])
v := binary.LittleEndian.Uint32(b)
binary.LittleEndian.PutUint32(b, v^k)
}
Generates (at its core, surrounded by other stuff):
MOVQ "".b+40(SP), AX
XORL (AX), DX
MOVL DX, (AX)
That seems pretty good.
What do you think, @nhooyr ?
(And if this is seriously performance-critical, you might want to manually inline the LittleEndian rountines and also duplicate key into a uint64 so that you can do 8 bytes at a time.)
That seems very reasonable @josharian
Will give this is a shot and let you know how the performance compares.
So benchmarked and here are the results on my machine (Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz):
$ go test -bench=BenchmarkMaskBytes
goos: darwin
goarch: amd64
pkg: github.com/gorilla/websocket
BenchmarkMaskBytes/size-2/align-4/byte-8 500000000 3.74 ns/op 535.10 MB/s
BenchmarkMaskBytes/size-2/align-4/gorilla-8 300000000 3.93 ns/op 508.44 MB/s
BenchmarkMaskBytes/size-2/align-4/gobwas-8 300000000 4.67 ns/op 428.21 MB/s
BenchmarkMaskBytes/size-2/align-4/nhooyr-8 300000000 4.73 ns/op 422.43 MB/s
BenchmarkMaskBytes/size-2/align-4/crypto/cipher-8 100000000 16.9 ns/op 118.23 MB/s
BenchmarkMaskBytes/size-4/align-4/byte-8 300000000 5.58 ns/op 716.61 MB/s
BenchmarkMaskBytes/size-4/align-4/gorilla-8 300000000 5.56 ns/op 719.80 MB/s
BenchmarkMaskBytes/size-4/align-4/gobwas-8 200000000 7.27 ns/op 550.31 MB/s
BenchmarkMaskBytes/size-4/align-4/nhooyr-8 200000000 7.04 ns/op 567.92 MB/s
BenchmarkMaskBytes/size-4/align-4/crypto/cipher-8 100000000 19.6 ns/op 204.31 MB/s
BenchmarkMaskBytes/size-16/align-4/byte-8 100000000 13.7 ns/op 1170.31 MB/s
BenchmarkMaskBytes/size-16/align-4/gorilla-8 100000000 20.2 ns/op 793.15 MB/s
BenchmarkMaskBytes/size-16/align-4/gobwas-8 200000000 6.59 ns/op 2426.43 MB/s
BenchmarkMaskBytes/size-16/align-4/nhooyr-8 100000000 13.6 ns/op 1174.09 MB/s
BenchmarkMaskBytes/size-16/align-4/crypto/cipher-8 100000000 16.2 ns/op 985.94 MB/s
BenchmarkMaskBytes/size-32/align-4/byte-8 100000000 23.9 ns/op 1336.39 MB/s
BenchmarkMaskBytes/size-32/align-4/gorilla-8 100000000 20.6 ns/op 1551.76 MB/s
BenchmarkMaskBytes/size-32/align-4/gobwas-8 200000000 8.13 ns/op 3934.47 MB/s
BenchmarkMaskBytes/size-32/align-4/nhooyr-8 100000000 15.6 ns/op 2054.30 MB/s
BenchmarkMaskBytes/size-32/align-4/crypto/cipher-8 100000000 24.7 ns/op 1298.15 MB/s
BenchmarkMaskBytes/size-512/align-4/byte-8 5000000 345 ns/op 1480.68 MB/s
BenchmarkMaskBytes/size-512/align-4/gorilla-8 30000000 53.8 ns/op 9525.00 MB/s
BenchmarkMaskBytes/size-512/align-4/gobwas-8 50000000 39.9 ns/op 12840.94 MB/s
BenchmarkMaskBytes/size-512/align-4/nhooyr-8 20000000 83.6 ns/op 6126.90 MB/s
BenchmarkMaskBytes/size-512/align-4/crypto/cipher-8 10000000 241 ns/op 2122.82 MB/s
BenchmarkMaskBytes/size-4096/align-4/byte-8 500000 2674 ns/op 1531.63 MB/s
BenchmarkMaskBytes/size-4096/align-4/gorilla-8 5000000 285 ns/op 14322.91 MB/s
BenchmarkMaskBytes/size-4096/align-4/gobwas-8 5000000 282 ns/op 14478.90 MB/s
BenchmarkMaskBytes/size-4096/align-4/nhooyr-8 3000000 539 ns/op 7596.43 MB/s
BenchmarkMaskBytes/size-4096/align-4/crypto/cipher-8 1000000 1818 ns/op 2252.51 MB/s
PASS
ok github.com/gorilla/websocket 58.645s
Source code at https://github.com/nhooyr/websocket/blob/40b4f235f2e2730ad1d8b81852e7610a8251d080/mask_test.go#L136-L174
Interestingly, the crypto/cipher assembly XOR was the slowest but I might be using it wrong.
Your solution is 2x slower than gorilla and gobwas but fast enough for me as I do not want to use unsafe.
Thanks @josharian.
I'll leave this issue open in case you think there is still improvement that can be made.
How I implemented what you described:
// xor applies the WebSocket masking algorithm to p
// with the given key where the first 3 bits of pos
// are the starting position in the key.
// See https://tools.ietf.org/html/rfc6455#section-5.3
//
// The returned value is the position of the next byte
// to be used for masking in the key. This is so that
// unmasking can be performed without the entire frame.
func nhooyr(key [4]byte, keyPos int, b []byte) int {
// If the payload is greater than 16 bytes, then it's worth
// masking 8 bytes at a time.
// Optimization from https://github.com/golang/go/issues/31586#issuecomment-485530859
if len(b) > 16 {
// We first create a key that is 8 bytes long
// and is aligned on the keyPos correctly.
var alignedKey [8]byte
for i := range alignedKey {
alignedKey[i] = key[(i+keyPos)&3]
}
k := binary.LittleEndian.Uint64(alignedKey[:])
// Then we xor until b is less than 8 bytes.
for len(b) >= 8 {
v := binary.LittleEndian.Uint64(b)
binary.LittleEndian.PutUint64(b, v^k)
b = b[8:]
}
}
// xor remaining bytes.
for i := range b {
b[i] ^= key[keyPos&3]
keyPos++
}
return keyPos & 3
}
@nhooyr this is a case in which a little unrolling helps. Add something along these lines:
for len(b) >= 32 {
v := binary.LittleEndian.Uint64(b)
binary.LittleEndian.PutUint64(b, v^k)
v = binary.LittleEndian.Uint64(b[8:])
binary.LittleEndian.PutUint64(b[8:], v^k)
v = binary.LittleEndian.Uint64(b[16:])
binary.LittleEndian.PutUint64(b[16:], v^k)
v = binary.LittleEndian.Uint64(b[24:])
binary.LittleEndian.PutUint64(b[24:], v^k)
b = b[32:]
}
I see 30%+ improvements from that. You may want to experiment a bit to find the sweet spot.
What's going on here is that reslicing (b = b[8:]) is expensive because the compiler needs to take care to avoid creating a pointer past the end of b. Unrolling dilutes that cost. Some of the techniques discussed in #27857 might help reduce that cost as well, but those don't appear likely to land before Go 1.14 at the earliest.
And with that, I think I'll go ahead and close this issue. Thanks very much for filing it--issues like these are helpful in finding opportunities to improve the compiler and runtime.
Sweet, will use that.
Thank you for the tips.
@josharian out of curiosity but do you have any idea why the crypto/cipher xor was so slow even though its written in assembly?
Unrolling the loop to 128 makes it about as fast as gobwas/ws and gorilla/websocket for smaller sizes and faster for larger sizes 馃帀
$ go test -bench=XOR -run=^\$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkXOR/2/basic-8 300000000 4.07 ns/op 491.49 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/2/fast-8 300000000 4.73 ns/op 423.15 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/16/basic-8 100000000 12.6 ns/op 1273.17 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/16/fast-8 100000000 13.4 ns/op 1197.26 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/32/basic-8 100000000 23.1 ns/op 1388.07 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/32/fast-8 100000000 23.8 ns/op 1343.96 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/512/basic-8 5000000 349 ns/op 1466.36 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/512/fast-8 30000000 44.4 ns/op 11531.75 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/4096/basic-8 500000 2582 ns/op 1585.85 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/4096/fast-8 5000000 274 ns/op 14920.34 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/16384/basic-8 200000 10239 ns/op 1600.04 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/16384/fast-8 1000000 1073 ns/op 15267.46 MB/s 0 B/op 0 allocs/op
PASS
ok nhooyr.io/websocket 20.611s
crypto/cipher is fast for
largeBuffer := make([]byte, size)
xor(largeBuffer, ...)
But what you did is
largeBuffer := make([]byte, size)
for i := 0; i < 16; i += 16 {
xor(largeBuffer[i:i+16], ...)
}
This is slow.
If you want that fast SIMD implementation, I think the assembly code needs to be changed for the WebSocket mask algorithm.
@crvv makes sense got it.
@josharian final code looks like https://github.com/nhooyr/websocket/blob/cfdbe6819dba18b345df68fa06ed1e499874cad8/xor.go#L28-L115
So just tiered loops going getting smaller in unrolled size, this doesn't seem to have any negative affect on performance at low or high byte sizes. However I noticed something interesting. if I change the for loops after the 128 unrolled loop into into if statements which they logically are as the loops after cannot run for more than a single iteration, performance drops 30% at higher byte sizes which is confusing.
$ benchcmp /tmp/if.txt /tmp/for.txt
benchmark old ns/op new ns/op delta
BenchmarkXOR/2/basic-8 4.24 3.99 -5.90%
BenchmarkXOR/2/fast-8 4.49 4.89 +8.91%
BenchmarkXOR/16/basic-8 12.6 12.6 +0.00%
BenchmarkXOR/16/fast-8 12.6 12.6 +0.00%
BenchmarkXOR/32/basic-8 22.6 22.7 +0.44%
BenchmarkXOR/32/fast-8 13.4 13.2 -1.49%
BenchmarkXOR/512/basic-8 340 327 -3.82%
BenchmarkXOR/512/fast-8 62.9 44.2 -29.73%
BenchmarkXOR/4096/basic-8 2576 2588 +0.47%
BenchmarkXOR/4096/fast-8 409 280 -31.54%
BenchmarkXOR/16384/basic-8 10700 10291 -3.82%
BenchmarkXOR/16384/fast-8 1722 1089 -36.76%
benchmark old MB/s new MB/s speedup
BenchmarkXOR/2/basic-8 471.71 501.50 1.06x
BenchmarkXOR/2/fast-8 444.95 409.10 0.92x
BenchmarkXOR/16/basic-8 1274.44 1269.48 1.00x
BenchmarkXOR/16/fast-8 1265.40 1272.86 1.01x
BenchmarkXOR/32/basic-8 1417.10 1410.06 1.00x
BenchmarkXOR/32/fast-8 2388.94 2425.89 1.02x
BenchmarkXOR/512/basic-8 1502.64 1565.02 1.04x
BenchmarkXOR/512/fast-8 8134.34 11575.80 1.42x
BenchmarkXOR/4096/basic-8 1589.47 1582.41 1.00x
BenchmarkXOR/4096/fast-8 10009.68 14615.04 1.46x
BenchmarkXOR/16384/basic-8 1531.21 1591.97 1.04x
BenchmarkXOR/16384/fast-8 9513.69 15044.05 1.58x
if I change the for loops after the 128 unrolled loop into into if statements which they logically are as the loops after cannot run for more than a single iteration, performance drops 30% at higher byte sizes which is confusing.
Interesting. Branch misprediction, perhaps? Loop head unrolling? It might be instructive to look at the generated code and at what perf counters tell you. If you learn more, please share, in case it is something we can teach the compiler to fix automatically.
as fast as gobwas/ws and gorilla/websocket
Hooray!
One important caveat: This is all tuned for amd64, which is little-endian and supports unaligned loads and stores. It'll work correctly on all architectures (because it is pure Go), but it might not be fast on them.
You could probably make it fast on a per-architecture basis (using build flags), if that matters to you. You might even be able to write a useful standalone package that lets you efficiently (per-architecture) take a byte slice and process in turn some leading bytes, a bunch of uint64s, and then some trailing bytes.
Just to note the aligning key loop can be reduced to a bits.RotateLeft64
@renthraysk To clarify, that would involve changing the key's type to uint64 from [4]byte right? i.e there is no way to just replace the loop with bits.RotateLeft64 directly.
$ go test -bench=BenchmarkXOR -run=^\$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkXOR/2/basic-8 298493414 3.95 ns/op 506.25 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/2/fast-8 273740404 4.43 ns/op 451.88 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/2/fast2-8 256999929 4.63 ns/op 432.01 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/16/basic-8 96471366 12.5 ns/op 1280.55 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/16/fast-8 96853706 12.0 ns/op 1330.76 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/16/fast2-8 222916975 5.42 ns/op 2951.17 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/32/basic-8 53923512 21.8 ns/op 1469.46 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/32/fast-8 95436625 12.2 ns/op 2621.93 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/32/fast2-8 189894780 6.17 ns/op 5187.79 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/512/basic-8 3593703 369 ns/op 1388.04 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/512/fast-8 32837448 36.0 ns/op 14223.76 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/512/fast2-8 40974423 27.6 ns/op 18531.62 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/4096/basic-8 474385 2628 ns/op 1558.64 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/4096/fast-8 5921575 212 ns/op 19308.44 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/4096/fast2-8 6145724 191 ns/op 21422.74 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/16384/basic-8 118580 9822 ns/op 1668.03 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/16384/fast-8 1565948 767 ns/op 21355.19 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/16384/fast2-8 1560382 760 ns/op 21569.24 MB/s 0 B/op 0 allocs/op
PASS
ok nhooyr.io/websocket 26.730s
fast2 is with bits.RotateLeft64, pretty significant speedup, especially from 32 till 512, where the speed pretty much doubled.
馃帄
Yeah, might have misremembered, and using binary.X.Uint32() to read 4 bytes in, perform the alignment rotation with RotateLeft32() and then expanded to 64 bits x<<32|x.
Also the last remain byte loop can use RotateLeft() instead of reading from the key array.
https://github.com/nhooyr/websocket/commit/646e967341551f499e7fc5a2286d02c9c0c238b7
https://github.com/nhooyr/websocket/blob/646e967341551f499e7fc5a2286d02c9c0c238b7/frame.go#L325-L443
$ go test -bench=BenchmarkXOR -run=^\$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkXOR/2/basic-8 344487918 3.43 ns/op 582.99 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/2/fast-8 312039567 3.91 ns/op 511.00 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/3/basic-8 285057960 4.31 ns/op 696.14 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/3/fast-8 295787498 4.04 ns/op 742.36 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/4/basic-8 244778098 4.75 ns/op 841.51 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/4/fast-8 361466797 3.31 ns/op 1209.12 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/8/basic-8 166173404 7.09 ns/op 1129.10 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/8/fast-8 262743075 4.55 ns/op 1757.55 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/16/basic-8 100000000 11.8 ns/op 1361.28 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/16/fast-8 260987524 4.57 ns/op 3500.76 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/32/basic-8 56211588 21.1 ns/op 1515.55 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/32/fast-8 216847578 5.46 ns/op 5859.79 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/128/basic-8 13744023 83.5 ns/op 1532.62 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/128/fast-8 128491503 9.30 ns/op 13764.28 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/512/basic-8 3784251 320 ns/op 1601.17 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/512/fast-8 41965264 26.9 ns/op 19009.42 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/4096/basic-8 458898 2460 ns/op 1665.01 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/4096/fast-8 5955132 198 ns/op 20699.90 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/16384/basic-8 117724 9961 ns/op 1644.87 MB/s 0 B/op 0 allocs/op
BenchmarkXOR/16384/fast-8 1532142 760 ns/op 21545.42 MB/s 0 B/op 0 allocs/op
PASS
ok nhooyr.io/websocket 31.076s
Faster than basic XOR at even 3 bytes :)
Thanks @renthraysk
Another optimisation. Current code seems to be producing a bounds check per PutUint64(). Instead of passing an open ended slice b[x:] to PutUint64() using b[x:x+8] will eliminate them.
Eg
https://gist.github.com/renthraysk/5617b6aea3852c80abfa662d8ca2ff37
Thanks. Will give it a shot at some point. https://github.com/nhooyr/websocket/pull/171#issuecomment-550582028
@renthraysk I wonder if the bound checks could be eliminated if I reverse the order in which the uint64s are xored in each loop. Thus after the first bounds check, the rest of the PutUint64's in each loop would not need a bounds check.
Did not have any effect. Going to go with what you suggested instead.
About a 20% speedup with values >= 128 bytes with your suggestion implemented. 馃殌 馃敟
See https://github.com/nhooyr/websocket/pull/171/commits/15d0a18fa2042fb0bb8735374c7efe041e90c014
Thanks again.
Yeah, reordering wouldn't have made a difference. It was the _ = b[7] line in PutUint64() that was cause the bounds checks. The compiler failed to prove it's not needed with PutUint64(b[8:], x) however it does eliminate the first check in PutUint64(b, x).
Looks like deficiency in the bounds checking elimination.
cc @zdjones
Most helpful comment
@nhooyr this is a case in which a little unrolling helps. Add something along these lines:
I see 30%+ improvements from that. You may want to experiment a bit to find the sweet spot.
What's going on here is that reslicing (
b = b[8:]) is expensive because the compiler needs to take care to avoid creating a pointer past the end ofb. Unrolling dilutes that cost. Some of the techniques discussed in #27857 might help reduce that cost as well, but those don't appear likely to land before Go 1.14 at the earliest.And with that, I think I'll go ahead and close this issue. Thanks very much for filing it--issues like these are helpful in finding opportunities to improve the compiler and runtime.