Go: cmd/compile: optimize XOR masking code

Created on 20 Apr 2019  路  30Comments  路  Source: golang/go

In the WebSocket protocol, clients have to generate a random masking key for every frame sent to the server and XOR mask the frame payload with the key. The server will unmask it on retrieval.

The mask/unmask function looks like this https://github.com/gorilla/websocket/blob/0ec3d1bd7fe50c503d6df98ee649d81f4857c564/mask_safe.go#L9-L15

This is masking byte by byte but can be made significantly faster if the masking was done word by word. Unfortunately, the Go compiler does not do this automatically so you have to use unsafe and write this yourself: https://github.com/gorilla/websocket/blob/0ec3d1bd7fe50c503d6df98ee649d81f4857c564/mask.go#L13-L54

Benchmarks at https://github.com/gorilla/websocket/issues/31

It would be nice if the Go compiler did this automatically.

FrozenDueToAge NeedsInvestigation Performance

Most helpful comment

@nhooyr this is a case in which a little unrolling helps. Add something along these lines:

        for len(b) >= 32 {
            v := binary.LittleEndian.Uint64(b)
            binary.LittleEndian.PutUint64(b, v^k)
            v = binary.LittleEndian.Uint64(b[8:])
            binary.LittleEndian.PutUint64(b[8:], v^k)
            v = binary.LittleEndian.Uint64(b[16:])
            binary.LittleEndian.PutUint64(b[16:], v^k)
            v = binary.LittleEndian.Uint64(b[24:])
            binary.LittleEndian.PutUint64(b[24:], v^k)
            b = b[32:]
        }

I see 30%+ improvements from that. You may want to experiment a bit to find the sweet spot.

What's going on here is that reslicing (b = b[8:]) is expensive because the compiler needs to take care to avoid creating a pointer past the end of b. Unrolling dilutes that cost. Some of the techniques discussed in #27857 might help reduce that cost as well, but those don't appear likely to land before Go 1.14 at the earliest.

And with that, I think I'll go ahead and close this issue. Thanks very much for filing it--issues like these are helpful in finding opportunities to improve the compiler and runtime.

All 30 comments

The two links you posted are the same (copy-paste mistake?). The 2nd one in particular does not use unsafe.

Thank you for filing this issue @nhooyr and @ALTree for the triaging!

Kindly paging @josharian @randall77 @martisch.

Please see #28113 #30553 and #28465

And https://github.com/golang/go/blob/68d4b1265ec7915dccfccf6c0e32f9ab2d9c3a86/src/crypto/cipher/xor_generic.go#L45

If you want fast xor, there are 4 different implementations in crypto/cipher.
I don't know whether the compiler can do so much optimization.

The compiler already recognizes sequences of byte loads and coalesces them into a single multi-byte loads. See e.g. the amd64 rules starting at https://github.com/golang/go/blob/e6ae4e86ad59c45f302a8828e77e6c234307fce4/src/cmd/compile/internal/ssa/gen/AMD64.rules#L1531. We could in theory write similar rules that convert a sequence of byte-wise load/xor/store operations to an XORQmodify.

In fact, I believe there is a way of writing this that would allow the compiler to do this optimization now. This code:

import "encoding/binary"

func f(key [4]byte, b []byte) {
    k := binary.LittleEndian.Uint32(key[:])
    v := binary.LittleEndian.Uint32(b)
    binary.LittleEndian.PutUint32(b, v^k)
}

Generates (at its core, surrounded by other stuff):

MOVQ    "".b+40(SP), AX
XORL    (AX), DX
MOVL    DX, (AX)

That seems pretty good.

What do you think, @nhooyr ?

(And if this is seriously performance-critical, you might want to manually inline the LittleEndian rountines and also duplicate key into a uint64 so that you can do 8 bytes at a time.)

That seems very reasonable @josharian

Will give this is a shot and let you know how the performance compares.

So benchmarked and here are the results on my machine (Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz):

$ go test -bench=BenchmarkMaskBytes
goos: darwin
goarch: amd64
pkg: github.com/gorilla/websocket
BenchmarkMaskBytes/size-2/align-4/byte-8            500000000            3.74 ns/op  535.10 MB/s
BenchmarkMaskBytes/size-2/align-4/gorilla-8         300000000            3.93 ns/op  508.44 MB/s
BenchmarkMaskBytes/size-2/align-4/gobwas-8          300000000            4.67 ns/op  428.21 MB/s
BenchmarkMaskBytes/size-2/align-4/nhooyr-8          300000000            4.73 ns/op  422.43 MB/s
BenchmarkMaskBytes/size-2/align-4/crypto/cipher-8           100000000           16.9 ns/op   118.23 MB/s
BenchmarkMaskBytes/size-4/align-4/byte-8                    300000000            5.58 ns/op  716.61 MB/s
BenchmarkMaskBytes/size-4/align-4/gorilla-8                 300000000            5.56 ns/op  719.80 MB/s
BenchmarkMaskBytes/size-4/align-4/gobwas-8                  200000000            7.27 ns/op  550.31 MB/s
BenchmarkMaskBytes/size-4/align-4/nhooyr-8                  200000000            7.04 ns/op  567.92 MB/s
BenchmarkMaskBytes/size-4/align-4/crypto/cipher-8           100000000           19.6 ns/op   204.31 MB/s
BenchmarkMaskBytes/size-16/align-4/byte-8                   100000000           13.7 ns/op  1170.31 MB/s
BenchmarkMaskBytes/size-16/align-4/gorilla-8                100000000           20.2 ns/op   793.15 MB/s
BenchmarkMaskBytes/size-16/align-4/gobwas-8                 200000000            6.59 ns/op 2426.43 MB/s
BenchmarkMaskBytes/size-16/align-4/nhooyr-8                 100000000           13.6 ns/op  1174.09 MB/s
BenchmarkMaskBytes/size-16/align-4/crypto/cipher-8          100000000           16.2 ns/op   985.94 MB/s
BenchmarkMaskBytes/size-32/align-4/byte-8                   100000000           23.9 ns/op  1336.39 MB/s
BenchmarkMaskBytes/size-32/align-4/gorilla-8                100000000           20.6 ns/op  1551.76 MB/s
BenchmarkMaskBytes/size-32/align-4/gobwas-8                 200000000            8.13 ns/op 3934.47 MB/s
BenchmarkMaskBytes/size-32/align-4/nhooyr-8                 100000000           15.6 ns/op  2054.30 MB/s
BenchmarkMaskBytes/size-32/align-4/crypto/cipher-8          100000000           24.7 ns/op  1298.15 MB/s
BenchmarkMaskBytes/size-512/align-4/byte-8                   5000000           345 ns/op    1480.68 MB/s
BenchmarkMaskBytes/size-512/align-4/gorilla-8               30000000            53.8 ns/op  9525.00 MB/s
BenchmarkMaskBytes/size-512/align-4/gobwas-8                50000000            39.9 ns/op  12840.94 MB/s
BenchmarkMaskBytes/size-512/align-4/nhooyr-8                20000000            83.6 ns/op  6126.90 MB/s
BenchmarkMaskBytes/size-512/align-4/crypto/cipher-8         10000000           241 ns/op    2122.82 MB/s
BenchmarkMaskBytes/size-4096/align-4/byte-8                   500000          2674 ns/op    1531.63 MB/s
BenchmarkMaskBytes/size-4096/align-4/gorilla-8               5000000           285 ns/op    14322.91 MB/s
BenchmarkMaskBytes/size-4096/align-4/gobwas-8                5000000           282 ns/op    14478.90 MB/s
BenchmarkMaskBytes/size-4096/align-4/nhooyr-8                3000000           539 ns/op    7596.43 MB/s
BenchmarkMaskBytes/size-4096/align-4/crypto/cipher-8         1000000          1818 ns/op    2252.51 MB/s
PASS
ok      github.com/gorilla/websocket    58.645s

Source code at https://github.com/nhooyr/websocket/blob/40b4f235f2e2730ad1d8b81852e7610a8251d080/mask_test.go#L136-L174

Interestingly, the crypto/cipher assembly XOR was the slowest but I might be using it wrong.

Your solution is 2x slower than gorilla and gobwas but fast enough for me as I do not want to use unsafe.

Thanks @josharian.

I'll leave this issue open in case you think there is still improvement that can be made.

How I implemented what you described:

// xor applies the WebSocket masking algorithm to p
// with the given key where the first 3 bits of pos
// are the starting position in the key.
// See https://tools.ietf.org/html/rfc6455#section-5.3
//
// The returned value is the position of the next byte
// to be used for masking in the key. This is so that
// unmasking can be performed without the entire frame.
func nhooyr(key [4]byte, keyPos int, b []byte) int {
    // If the payload is greater than 16 bytes, then it's worth
    // masking 8 bytes at a time.
    // Optimization from https://github.com/golang/go/issues/31586#issuecomment-485530859
    if len(b) > 16 {
        // We first create a key that is 8 bytes long
        // and is aligned on the keyPos correctly.
        var alignedKey [8]byte
        for i := range alignedKey {
            alignedKey[i] = key[(i+keyPos)&3]
        }
        k := binary.LittleEndian.Uint64(alignedKey[:])

        // Then we xor until b is less than 8 bytes.
        for len(b) >= 8 {
            v := binary.LittleEndian.Uint64(b)
            binary.LittleEndian.PutUint64(b, v^k)
            b = b[8:]
        }
    }

    // xor remaining bytes.
    for i := range b {
        b[i] ^= key[keyPos&3]
        keyPos++
    }
    return keyPos & 3
}

@nhooyr this is a case in which a little unrolling helps. Add something along these lines:

        for len(b) >= 32 {
            v := binary.LittleEndian.Uint64(b)
            binary.LittleEndian.PutUint64(b, v^k)
            v = binary.LittleEndian.Uint64(b[8:])
            binary.LittleEndian.PutUint64(b[8:], v^k)
            v = binary.LittleEndian.Uint64(b[16:])
            binary.LittleEndian.PutUint64(b[16:], v^k)
            v = binary.LittleEndian.Uint64(b[24:])
            binary.LittleEndian.PutUint64(b[24:], v^k)
            b = b[32:]
        }

I see 30%+ improvements from that. You may want to experiment a bit to find the sweet spot.

What's going on here is that reslicing (b = b[8:]) is expensive because the compiler needs to take care to avoid creating a pointer past the end of b. Unrolling dilutes that cost. Some of the techniques discussed in #27857 might help reduce that cost as well, but those don't appear likely to land before Go 1.14 at the earliest.

And with that, I think I'll go ahead and close this issue. Thanks very much for filing it--issues like these are helpful in finding opportunities to improve the compiler and runtime.

Sweet, will use that.

Thank you for the tips.

@josharian out of curiosity but do you have any idea why the crypto/cipher xor was so slow even though its written in assembly?

Unrolling the loop to 128 makes it about as fast as gobwas/ws and gorilla/websocket for smaller sizes and faster for larger sizes 馃帀

$ go test -bench=XOR -run=^\$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkXOR/2/basic-8          300000000            4.07 ns/op  491.49 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/2/fast-8           300000000            4.73 ns/op  423.15 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/16/basic-8         100000000           12.6 ns/op  1273.17 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/16/fast-8          100000000           13.4 ns/op  1197.26 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/32/basic-8         100000000           23.1 ns/op  1388.07 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/32/fast-8          100000000           23.8 ns/op  1343.96 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/512/basic-8         5000000           349 ns/op    1466.36 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/512/fast-8         30000000            44.4 ns/op  11531.75 MB/s          0 B/op          0 allocs/op
BenchmarkXOR/4096/basic-8         500000          2582 ns/op    1585.85 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/4096/fast-8         5000000           274 ns/op    14920.34 MB/s          0 B/op          0 allocs/op
BenchmarkXOR/16384/basic-8        200000         10239 ns/op    1600.04 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/16384/fast-8        1000000          1073 ns/op    15267.46 MB/s          0 B/op          0 allocs/op
PASS
ok      nhooyr.io/websocket 20.611s

crypto/cipher is fast for

largeBuffer := make([]byte, size)
xor(largeBuffer, ...)

But what you did is

largeBuffer := make([]byte, size)
for i := 0; i < 16; i += 16 {
    xor(largeBuffer[i:i+16], ...)
}

This is slow.

If you want that fast SIMD implementation, I think the assembly code needs to be changed for the WebSocket mask algorithm.

@crvv makes sense got it.

@josharian final code looks like https://github.com/nhooyr/websocket/blob/cfdbe6819dba18b345df68fa06ed1e499874cad8/xor.go#L28-L115

So just tiered loops going getting smaller in unrolled size, this doesn't seem to have any negative affect on performance at low or high byte sizes. However I noticed something interesting. if I change the for loops after the 128 unrolled loop into into if statements which they logically are as the loops after cannot run for more than a single iteration, performance drops 30% at higher byte sizes which is confusing.

$ benchcmp /tmp/if.txt /tmp/for.txt 
benchmark                      old ns/op     new ns/op     delta
BenchmarkXOR/2/basic-8         4.24          3.99          -5.90%
BenchmarkXOR/2/fast-8          4.49          4.89          +8.91%
BenchmarkXOR/16/basic-8        12.6          12.6          +0.00%
BenchmarkXOR/16/fast-8         12.6          12.6          +0.00%
BenchmarkXOR/32/basic-8        22.6          22.7          +0.44%
BenchmarkXOR/32/fast-8         13.4          13.2          -1.49%
BenchmarkXOR/512/basic-8       340           327           -3.82%
BenchmarkXOR/512/fast-8        62.9          44.2          -29.73%
BenchmarkXOR/4096/basic-8      2576          2588          +0.47%
BenchmarkXOR/4096/fast-8       409           280           -31.54%
BenchmarkXOR/16384/basic-8     10700         10291         -3.82%
BenchmarkXOR/16384/fast-8      1722          1089          -36.76%

benchmark                      old MB/s     new MB/s     speedup
BenchmarkXOR/2/basic-8         471.71       501.50       1.06x
BenchmarkXOR/2/fast-8          444.95       409.10       0.92x
BenchmarkXOR/16/basic-8        1274.44      1269.48      1.00x
BenchmarkXOR/16/fast-8         1265.40      1272.86      1.01x
BenchmarkXOR/32/basic-8        1417.10      1410.06      1.00x
BenchmarkXOR/32/fast-8         2388.94      2425.89      1.02x
BenchmarkXOR/512/basic-8       1502.64      1565.02      1.04x
BenchmarkXOR/512/fast-8        8134.34      11575.80     1.42x
BenchmarkXOR/4096/basic-8      1589.47      1582.41      1.00x
BenchmarkXOR/4096/fast-8       10009.68     14615.04     1.46x
BenchmarkXOR/16384/basic-8     1531.21      1591.97      1.04x
BenchmarkXOR/16384/fast-8      9513.69      15044.05     1.58x

if I change the for loops after the 128 unrolled loop into into if statements which they logically are as the loops after cannot run for more than a single iteration, performance drops 30% at higher byte sizes which is confusing.

Interesting. Branch misprediction, perhaps? Loop head unrolling? It might be instructive to look at the generated code and at what perf counters tell you. If you learn more, please share, in case it is something we can teach the compiler to fix automatically.

as fast as gobwas/ws and gorilla/websocket

Hooray!

One important caveat: This is all tuned for amd64, which is little-endian and supports unaligned loads and stores. It'll work correctly on all architectures (because it is pure Go), but it might not be fast on them.

You could probably make it fast on a per-architecture basis (using build flags), if that matters to you. You might even be able to write a useful standalone package that lets you efficiently (per-architecture) take a byte slice and process in turn some leading bytes, a bunch of uint64s, and then some trailing bytes.

Just to note the aligning key loop can be reduced to a bits.RotateLeft64

@renthraysk To clarify, that would involve changing the key's type to uint64 from [4]byte right? i.e there is no way to just replace the loop with bits.RotateLeft64 directly.

$ go test -bench=BenchmarkXOR -run=^\$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkXOR/2/basic-8          298493414            3.95 ns/op  506.25 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/2/fast-8           273740404            4.43 ns/op  451.88 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/2/fast2-8          256999929            4.63 ns/op  432.01 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/16/basic-8         96471366            12.5 ns/op  1280.55 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/16/fast-8          96853706            12.0 ns/op  1330.76 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/16/fast2-8         222916975            5.42 ns/op 2951.17 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/32/basic-8         53923512            21.8 ns/op  1469.46 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/32/fast-8          95436625            12.2 ns/op  2621.93 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/32/fast2-8         189894780            6.17 ns/op 5187.79 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/512/basic-8         3593703           369 ns/op    1388.04 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/512/fast-8         32837448            36.0 ns/op  14223.76 MB/s          0 B/op          0 allocs/op
BenchmarkXOR/512/fast2-8        40974423            27.6 ns/op  18531.62 MB/s          0 B/op          0 allocs/op
BenchmarkXOR/4096/basic-8         474385          2628 ns/op    1558.64 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/4096/fast-8         5921575           212 ns/op    19308.44 MB/s          0 B/op          0 allocs/op
BenchmarkXOR/4096/fast2-8        6145724           191 ns/op    21422.74 MB/s          0 B/op          0 allocs/op
BenchmarkXOR/16384/basic-8        118580          9822 ns/op    1668.03 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/16384/fast-8        1565948           767 ns/op    21355.19 MB/s          0 B/op          0 allocs/op
BenchmarkXOR/16384/fast2-8       1560382           760 ns/op    21569.24 MB/s          0 B/op          0 allocs/op
PASS
ok      nhooyr.io/websocket 26.730s

fast2 is with bits.RotateLeft64, pretty significant speedup, especially from 32 till 512, where the speed pretty much doubled.

馃帄

Yeah, might have misremembered, and using binary.X.Uint32() to read 4 bytes in, perform the alignment rotation with RotateLeft32() and then expanded to 64 bits x<<32|x.

Also the last remain byte loop can use RotateLeft() instead of reading from the key array.

https://github.com/nhooyr/websocket/commit/646e967341551f499e7fc5a2286d02c9c0c238b7

https://github.com/nhooyr/websocket/blob/646e967341551f499e7fc5a2286d02c9c0c238b7/frame.go#L325-L443

$ go test -bench=BenchmarkXOR -run=^\$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkXOR/2/basic-8          344487918            3.43 ns/op  582.99 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/2/fast-8           312039567            3.91 ns/op  511.00 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/3/basic-8          285057960            4.31 ns/op  696.14 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/3/fast-8           295787498            4.04 ns/op  742.36 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/4/basic-8          244778098            4.75 ns/op  841.51 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/4/fast-8           361466797            3.31 ns/op 1209.12 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/8/basic-8          166173404            7.09 ns/op 1129.10 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/8/fast-8           262743075            4.55 ns/op 1757.55 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/16/basic-8         100000000           11.8 ns/op  1361.28 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/16/fast-8          260987524            4.57 ns/op 3500.76 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/32/basic-8         56211588            21.1 ns/op  1515.55 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/32/fast-8          216847578            5.46 ns/op 5859.79 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/128/basic-8        13744023            83.5 ns/op  1532.62 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/128/fast-8         128491503            9.30 ns/op 13764.28 MB/s          0 B/op          0 allocs/op
BenchmarkXOR/512/basic-8         3784251           320 ns/op    1601.17 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/512/fast-8         41965264            26.9 ns/op  19009.42 MB/s          0 B/op          0 allocs/op
BenchmarkXOR/4096/basic-8         458898          2460 ns/op    1665.01 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/4096/fast-8         5955132           198 ns/op    20699.90 MB/s          0 B/op          0 allocs/op
BenchmarkXOR/16384/basic-8        117724          9961 ns/op    1644.87 MB/s           0 B/op          0 allocs/op
BenchmarkXOR/16384/fast-8        1532142           760 ns/op    21545.42 MB/s          0 B/op          0 allocs/op
PASS
ok      nhooyr.io/websocket 31.076s

Faster than basic XOR at even 3 bytes :)

Thanks @renthraysk

Another optimisation. Current code seems to be producing a bounds check per PutUint64(). Instead of passing an open ended slice b[x:] to PutUint64() using b[x:x+8] will eliminate them.
Eg
https://gist.github.com/renthraysk/5617b6aea3852c80abfa662d8ca2ff37

Thanks. Will give it a shot at some point. https://github.com/nhooyr/websocket/pull/171#issuecomment-550582028

@renthraysk I wonder if the bound checks could be eliminated if I reverse the order in which the uint64s are xored in each loop. Thus after the first bounds check, the rest of the PutUint64's in each loop would not need a bounds check.

Did not have any effect. Going to go with what you suggested instead.

About a 20% speedup with values >= 128 bytes with your suggestion implemented. 馃殌 馃敟

See https://github.com/nhooyr/websocket/pull/171/commits/15d0a18fa2042fb0bb8735374c7efe041e90c014

Thanks again.

Yeah, reordering wouldn't have made a difference. It was the _ = b[7] line in PutUint64() that was cause the bounds checks. The compiler failed to prove it's not needed with PutUint64(b[8:], x) however it does eliminate the first check in PutUint64(b, x).

Looks like deficiency in the bounds checking elimination.

cc @zdjones

Was this page helpful?
0 / 5 - 0 ratings

Related issues

natefinch picture natefinch  路  3Comments

stub42 picture stub42  路  3Comments

michaelsafyan picture michaelsafyan  路  3Comments

longzhizhi picture longzhizhi  路  3Comments

bbodenmiller picture bbodenmiller  路  3Comments