Go: crypto/tls: slow server-side handshake performance for RSA certificates without client session cache

Created on 20 Apr 2017 · 8 Comments · Source: golang/go

Both Go 1.8 and Go tip show very slow server-side handshake performance for RSA certificates when the client doesn't use a TLS session cache:

$ go get -u github.com/valyala/fasthttp/fasthttputil
$ GOMAXPROCS=1 go test github.com/valyala/fasthttp/fasthttputil -bench=TLSHandshake
goos: linux
goarch: amd64
pkg: github.com/valyala/fasthttp/fasthttputil
BenchmarkPlainHandshake                                       300000          3953 ns/op
BenchmarkTLSHandshakeWithClientSessionCache                    20000         81960 ns/op
BenchmarkTLSHandshakeWithoutClientSessionCache                   500       3493016 ns/op
BenchmarkTLSHandshakeWithCurvesWithClientSessionCache          20000         80307 ns/op
BenchmarkTLSHandshakeWithCurvesWithoutClientSessionCache         500       3518508 ns/op
PASS
ok      github.com/valyala/fasthttp/fasthttputil    11.683s
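
For reference, a ns/op figure from `go test -bench` converts to operations per second as 1e9 divided by ns/op. A minimal sketch of that arithmetic, using two values copied from the output above (the function name `handshakesPerSecond` is just an illustrative choice):

```go
package main

import "fmt"

// handshakesPerSecond converts a `go test -bench` ns/op figure into
// operations per second for the single core used by the benchmark.
func handshakesPerSecond(nsPerOp float64) float64 {
	return 1e9 / nsPerOp
}

func main() {
	// ns/op values copied from the benchmark output above.
	withCache := handshakesPerSecond(81960)
	withoutCache := handshakesPerSecond(3493016)

	fmt.Printf("with client session cache:    %.0f handshakes/s\n", withCache)
	fmt.Printf("without client session cache: %.0f handshakes/s\n", withoutCache)
}
```

With these inputs the second figure comes out at roughly 286 handshakes per second.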

The results show that a single amd64 core can perform only about 300 handshakes per second from new clients without session tickets. This is very discouraging compared to openssl, as described on https://istlsfastyet.com/ :

$ openssl version
OpenSSL 1.0.2g  1 Mar 2016
$ openssl speed ecdh
...
                              op      op/s
 256 bit ecdh (nistp256)   0.0001s  12797.0
 384 bit ecdh (nistp384)   0.0007s   1416.8
 521 bit ecdh (nistp521)   0.0005s   1968.0

Note that openssl performs 12797 256-bit ecdh operations per second on a single CPU core. This is roughly 40x higher than the results from the comparable BenchmarkTLSHandshakeWithCurvesWithoutClientSessionCache above. Below are CPU profiles for this benchmark:

Mixed client and server profile:

(pprof) top20
Showing nodes accounting for 154.20ms, 87.56% of 176.10ms total
Dropped 200 nodes (cum <= 0.88ms)
Showing top 20 nodes out of 103
      flat  flat%   sum%        cum   cum%
   82.50ms 46.85% 46.85%    82.50ms 46.85%  math/big.addMulVVW /home/aliaksandr/work/go-tip/src/math/big/arith_amd64.s
   19.50ms 11.07% 57.92%   113.20ms 64.28%  math/big.nat.montgomery /home/aliaksandr/work/go-tip/src/math/big/nat.go
   13.70ms  7.78% 65.70%    13.70ms  7.78%  runtime.memmove /home/aliaksandr/work/go-tip/src/runtime/memmove_amd64.s
    7.40ms  4.20% 69.90%    21.10ms 11.98%  math/big.nat.divLarge /home/aliaksandr/work/go-tip/src/math/big/nat.go
       5ms  2.84% 72.74%        5ms  2.84%  math/big.mulAddVWW /home/aliaksandr/work/go-tip/src/math/big/arith_amd64.s
    3.40ms  1.93% 74.67%     3.40ms  1.93%  crypto/sha256.block /home/aliaksandr/work/go-tip/src/crypto/sha256/sha256block_amd64.s
    2.70ms  1.53% 76.21%     2.70ms  1.53%  math/big.subVV /home/aliaksandr/work/go-tip/src/math/big/arith_amd64.s
    2.70ms  1.53% 77.74%     2.70ms  1.53%  p256MulInternal /home/aliaksandr/work/go-tip/src/crypto/elliptic/p256_asm_amd64.s
    2.60ms  1.48% 79.22%     2.60ms  1.48%  math/big.addVV /home/aliaksandr/work/go-tip/src/math/big/arith_amd64.s
    2.60ms  1.48% 80.69%     2.60ms  1.48%  p256SqrInternal /home/aliaksandr/work/go-tip/src/crypto/elliptic/p256_asm_amd64.s
    2.10ms  1.19% 81.89%     2.10ms  1.19%  math/big.shlVU /home/aliaksandr/work/go-tip/src/math/big/arith_amd64.s
    1.80ms  1.02% 82.91%     1.80ms  1.02%  syscall.Syscall /home/aliaksandr/work/go-tip/src/syscall/asm_linux_amd64.s
    1.30ms  0.74% 83.65%     1.30ms  0.74%  crypto/elliptic.p256Sqr /home/aliaksandr/work/go-tip/src/crypto/elliptic/p256_asm_amd64.s
    1.10ms  0.62% 84.27%     2.30ms  1.31%  sync.(*Pool).Put /home/aliaksandr/work/go-tip/src/sync/pool.go
       1ms  0.57% 84.84%     4.20ms  2.39%  math/big.nat.add /home/aliaksandr/work/go-tip/src/math/big/nat.go
       1ms  0.57% 85.41%     3.70ms  2.10%  math/big.nat.mulAddWW /home/aliaksandr/work/go-tip/src/math/big/nat.go
       1ms  0.57% 85.97%        1ms  0.57%  sync.(*Mutex).Lock /home/aliaksandr/work/go-tip/src/sync/mutex.go
       1ms  0.57% 86.54%        1ms  0.57%  sync.(*Mutex).Unlock /home/aliaksandr/work/go-tip/src/sync/mutex.go
    0.90ms  0.51% 87.05%    24.90ms 14.14%  math/big.(*Int).GCD /home/aliaksandr/work/go-tip/src/math/big/int.go
    0.90ms  0.51% 87.56%     5.40ms  3.07%  math/big.(*Int).Mul /home/aliaksandr/work/go-tip/src/math/big/int.go

Server profile:

(pprof) top20 Server
Showing nodes accounting for 149.20ms, 84.72% of 176.10ms total
Dropped 110 nodes (cum <= 0.88ms)
Showing top 20 nodes out of 86
      flat  flat%   sum%        cum   cum%
   82.50ms 46.85% 46.85%    82.50ms 46.85%  math/big.addMulVVW /home/aliaksandr/work/go-tip/src/math/big/arith_amd64.s
   19.50ms 11.07% 57.92%   113.20ms 64.28%  math/big.nat.montgomery /home/aliaksandr/work/go-tip/src/math/big/nat.go
   13.40ms  7.61% 65.53%    13.40ms  7.61%  runtime.memmove /home/aliaksandr/work/go-tip/src/runtime/memmove_amd64.s
    7.40ms  4.20% 69.73%    21.10ms 11.98%  math/big.nat.divLarge /home/aliaksandr/work/go-tip/src/math/big/nat.go
       5ms  2.84% 72.57%        5ms  2.84%  math/big.mulAddVWW /home/aliaksandr/work/go-tip/src/math/big/arith_amd64.s
    2.70ms  1.53% 74.11%     2.70ms  1.53%  math/big.subVV /home/aliaksandr/work/go-tip/src/math/big/arith_amd64.s
    2.60ms  1.48% 75.58%     2.60ms  1.48%  math/big.addVV /home/aliaksandr/work/go-tip/src/math/big/arith_amd64.s
    2.30ms  1.31% 76.89%     2.30ms  1.31%  crypto/sha256.block /home/aliaksandr/work/go-tip/src/crypto/sha256/sha256block_amd64.s
    2.10ms  1.19% 78.08%     2.10ms  1.19%  math/big.shlVU /home/aliaksandr/work/go-tip/src/math/big/arith_amd64.s
    1.40ms   0.8% 78.88%     1.40ms   0.8%  p256MulInternal /home/aliaksandr/work/go-tip/src/crypto/elliptic/p256_asm_amd64.s
    1.30ms  0.74% 79.61%     1.30ms  0.74%  syscall.Syscall /home/aliaksandr/work/go-tip/src/syscall/asm_linux_amd64.s
    1.20ms  0.68% 80.30%     1.20ms  0.68%  p256SqrInternal /home/aliaksandr/work/go-tip/src/crypto/elliptic/p256_asm_amd64.s
    1.10ms  0.62% 80.92%     2.30ms  1.31%  sync.(*Pool).Put /home/aliaksandr/work/go-tip/src/sync/pool.go
       1ms  0.57% 81.49%     4.20ms  2.39%  math/big.nat.add /home/aliaksandr/work/go-tip/src/math/big/nat.go
       1ms  0.57% 82.06%     3.70ms  2.10%  math/big.nat.mulAddWW /home/aliaksandr/work/go-tip/src/math/big/nat.go
       1ms  0.57% 82.62%        1ms  0.57%  sync.(*Mutex).Lock /home/aliaksandr/work/go-tip/src/sync/mutex.go
       1ms  0.57% 83.19%        1ms  0.57%  sync.(*Mutex).Unlock /home/aliaksandr/work/go-tip/src/sync/mutex.go
    0.90ms  0.51% 83.70%    24.90ms 14.14%  math/big.(*Int).GCD /home/aliaksandr/work/go-tip/src/math/big/int.go
    0.90ms  0.51% 84.21%     5.40ms  3.07%  math/big.(*Int).Mul /home/aliaksandr/work/go-tip/src/math/big/int.go
    0.90ms  0.51% 84.72%   114.30ms 64.91%  math/big.nat.expNNMontgomery /home/aliaksandr/work/go-tip/src/math/big/nat.go

Client profile:

(pprof) top20 Client
Showing nodes accounting for 14.10ms, 8.01% of 176.10ms total
Showing top 20 nodes out of 202
      flat  flat%   sum%        cum   cum%
    2.60ms  1.48%  1.48%     2.60ms  1.48%  p256SqrInternal /home/aliaksandr/work/go-tip/src/crypto/elliptic/p256_asm_amd64.s
    2.10ms  1.19%  2.67%     2.10ms  1.19%  p256MulInternal /home/aliaksandr/work/go-tip/src/crypto/elliptic/p256_asm_amd64.s
    1.60ms  0.91%  3.58%     2.50ms  1.42%  math/big.nat.divLarge /home/aliaksandr/work/go-tip/src/math/big/nat.go
    1.10ms  0.62%  4.20%     1.10ms  0.62%  crypto/sha256.block /home/aliaksandr/work/go-tip/src/crypto/sha256/sha256block_amd64.s
    0.90ms  0.51%  4.71%     0.90ms  0.51%  crypto/elliptic.p256Sqr /home/aliaksandr/work/go-tip/src/crypto/elliptic/p256_asm_amd64.s
    0.80ms  0.45%  5.17%     4.50ms  2.56%  crypto/elliptic.p256PointDoubleAsm /home/aliaksandr/work/go-tip/src/crypto/elliptic/p256_asm_amd64.s
    0.80ms  0.45%  5.62%     0.80ms  0.45%  math/big.addMulVVW /home/aliaksandr/work/go-tip/src/math/big/arith_amd64.s
    0.70ms   0.4%  6.02%     0.70ms   0.4%  syscall.Syscall /home/aliaksandr/work/go-tip/src/syscall/asm_linux_amd64.s
    0.50ms  0.28%  6.30%     1.30ms  0.74%  math/big.basicMul /home/aliaksandr/work/go-tip/src/math/big/nat.go
    0.40ms  0.23%  6.53%     0.40ms  0.23%  math/big.subVV /home/aliaksandr/work/go-tip/src/math/big/arith_amd64.s
    0.30ms  0.17%  6.70%     0.30ms  0.17%  crypto/elliptic.p256Select /home/aliaksandr/work/go-tip/src/crypto/elliptic/p256_asm_amd64.s
    0.30ms  0.17%  6.87%     0.30ms  0.17%  crypto/hmac.New /home/aliaksandr/work/go-tip/src/crypto/hmac/hmac.go
    0.30ms  0.17%  7.04%     0.30ms  0.17%  math/big.mulAddVWW /home/aliaksandr/work/go-tip/src/math/big/arith_amd64.s
    0.30ms  0.17%  7.21%     0.30ms  0.17%  p256SubInternal /home/aliaksandr/work/go-tip/src/crypto/elliptic/p256_asm_amd64.s
    0.30ms  0.17%  7.38%     0.30ms  0.17%  runtime.mallocgc /home/aliaksandr/work/go-tip/src/runtime/mbitmap.go
    0.30ms  0.17%  7.55%     0.30ms  0.17%  runtime.memmove /home/aliaksandr/work/go-tip/src/runtime/memmove_amd64.s
    0.20ms  0.11%  7.67%     0.60ms  0.34%  crypto/elliptic.p256PointAddAffineAsm /home/aliaksandr/work/go-tip/src/crypto/elliptic/p256_asm_amd64.s
    0.20ms  0.11%  7.78%     1.70ms  0.97%  encoding/asn1.parseField /home/aliaksandr/work/go-tip/src/encoding/asn1/asn1.go
    0.20ms  0.11%  7.89%     0.20ms  0.11%  math/big.nat.setBytes /home/aliaksandr/work/go-tip/src/math/big/nat.go
    0.20ms  0.11%  8.01%     0.20ms  0.11%  runtime.heapBitsSetType /home/aliaksandr/work/go-tip/src/runtime/mbitmap.go

As you can see, the client side takes about 1/10 of the CPU time of the server side.

@agl , @vkrasnov

Labels: Performance, help wanted

All 8 comments

@valyala , you should use ECDSA instead of RSA if you can. RSA is not very optimized in Go.
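
One way to get a rough feel for the difference outside of a full handshake is to time the private-key signature alone, since that is the server-side operation that depends on the certificate type. The sketch below is only an assumption-laden micro-comparison, not the benchmark used in this issue: the RSA-2048 and P-256 key sizes, the helper names `avgDuration` and `compareSigners`, and the iteration count are all illustrative choices.

```go
package main

import (
	"crypto"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"fmt"
	"time"
)

// avgDuration runs sign n times (n must be > 0) and returns the mean time per call.
func avgDuration(n int, sign func() error) (time.Duration, error) {
	start := time.Now()
	for i := 0; i < n; i++ {
		if err := sign(); err != nil {
			return 0, err
		}
	}
	return time.Since(start) / time.Duration(n), nil
}

// compareSigners times n signatures with a freshly generated RSA-2048 key
// and a P-256 ECDSA key. Both key sizes are assumptions for illustration.
func compareSigners(n int) (rsaAvg, ecdsaAvg time.Duration, err error) {
	digest := sha256.Sum256([]byte("stand-in for the signed handshake data"))

	rsaKey, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return 0, 0, err
	}
	ecKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return 0, 0, err
	}

	rsaAvg, err = avgDuration(n, func() error {
		_, err := rsa.SignPKCS1v15(rand.Reader, rsaKey, crypto.SHA256, digest[:])
		return err
	})
	if err != nil {
		return 0, 0, err
	}
	ecdsaAvg, err = avgDuration(n, func() error {
		_, err := ecdsa.SignASN1(rand.Reader, ecKey, digest[:])
		return err
	})
	return rsaAvg, ecdsaAvg, err
}

func main() {
	rsaAvg, ecdsaAvg, err := compareSigners(50)
	if err != nil {
		panic(err)
	}
	fmt.Printf("RSA-2048 sign:    %v per op\n", rsaAvg)
	fmt.Printf("ECDSA P-256 sign: %v per op\n", ecdsaAvg)
}
```

The absolute numbers depend on the Go version and CPU, so treat the output as a relative indication only.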

@vkrasnov , then ECDSA should probably go before RSA in initDefaultCipherSuites?

@valyala , you need an ECDSA certificate, you can try it and see if it helps:

go run `go env GOROOT`/src/crypto/tls/generate_cert.go --host=localhost --ecdsa-curve=P256
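
If generating the certificate inside a test or benchmark is more convenient than running the script, something along these lines should work. This is a minimal sketch for testing only, not what generate_cert.go does line for line: the helper names `newECDSACert` and `describeKey`, the serial number, the organization name, and the validity window are all placeholder assumptions.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newECDSACert returns a self-signed P-256 certificate for host.
// Intended for tests and benchmarks only.
func newECDSACert(host string) (tls.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return tls.Certificate{}, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1), // placeholder serial
		Subject:               pkix.Name{Organization: []string{"Example Org"}},
		DNSNames:              []string{host},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		KeyUsage:              x509.KeyUsageDigitalSignature,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
	}
	// Self-signed: the template serves as both certificate and parent.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return tls.Certificate{}, err
	}
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}, nil
}

// describeKey reports the dynamic type of the certificate's private key.
func describeKey(c tls.Certificate) string {
	return fmt.Sprintf("%T", c.PrivateKey)
}

func main() {
	cert, err := newECDSACert("localhost")
	if err != nil {
		panic(err)
	}
	// The server-side config simply carries the ECDSA certificate.
	cfg := &tls.Config{Certificates: []tls.Certificate{cert}}
	fmt.Printf("certificates: %d, key type: %s\n", len(cfg.Certificates), describeKey(cert))
}
```

The resulting `tls.Certificate` can be placed in the server's `tls.Config` in the same way as one loaded from the PEM files the script writes.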

@vkrasnov , thanks - this raised the performance from 300 handshakes per second to 2000 handshakes per second on a single CPU core:

$ GOMAXPROCS=1 go test -run=111 -bench=Handshake -cpuprofile=cpu.pprof -benchtime=1s
goos: linux
goarch: amd64
pkg: github.com/valyala/fasthttp/fasthttputil
BenchmarkPlainHandshake                                       300000          3968 ns/op
BenchmarkTLSHandshakeWithClientSessionCache                    20000         89539 ns/op
BenchmarkTLSHandshakeWithoutClientSessionCache                  3000        554257 ns/op
BenchmarkTLSHandshakeWithCurvesWithClientSessionCache          20000         90366 ns/op
BenchmarkTLSHandshakeWithCurvesWithoutClientSessionCache        3000        570461 ns/op
PASS
ok      github.com/valyala/fasthttp/fasthttputil    10.383s

Are there plans to improve handshake performance for RSA certificates?

Are there plans to improve handshake performance for RSA certificates?

I personally have no such immediate plans. RSA is past its prime, and its usage is constantly dropping.

Added some benchmarks https://golang.org/cl/44730/

Change https://golang.org/cl/74851 mentions this issue: math/big: speed-up addMulVVW on amd64

@vkrasnov @bradfitz Does this mean that RSA is not recommended and that RSA-related code generally won't be optimized in the future?

According to https://www.ssl.com/article/comparing-ecdsa-vs-rsa/ ECDSA is significantly more vulnerable to Shor’s algorithm (a quantum computing attack) than RSA, and I'm more concerned about that than about the benefits of ECDSA at the moment.
