We are still trying to track down the problem but in two of our services we have observedspanner.healthChecker consuming much more memory than expected with repeated calls to context.WithDeadline.
(pprof) top10
2308001 of 2340227 total (98.62%)
Dropped 758 nodes (cum <= 11701)
Showing top 10 nodes out of 84 (cum >= 70167)
flat flat% sum% cum cum%
1714972 73.28% 73.28% 1714972 73.28% context.WithDeadline
436926 18.67% 91.95% 436926 18.67% runtime.deferproc.func1
32768 1.40% 93.35% 65536 2.80% encoding/asn1.parseField
32768 1.40% 94.75% 32768 1.40% github.com/lyft/spannerproxy/vendor/github.com/golang/protobuf/proto.(*Buffer).DecodeStringBytes
32768 1.40% 96.15% 32768 1.40% reflect.(*structType).Field
32768 1.40% 97.55% 32768 1.40% strconv.formatBits
16384 0.7% 98.25% 16384 0.7% syscall.anyToSockaddr
8192 0.35% 98.60% 40960 1.75% github.com/lyft/spannerproxy/vendor/github.com/golang/protobuf/proto.(*Buffer).dec_slice_struct
455 0.019% 98.62% 65991 2.82% crypto/x509.parseCertificate
0 0% 98.62% 70167 3.00% crypto/tls.(*Conn).Handshake
I am digging through the health checker code but any insight here would be helpful. Thanks!
Full heap profile here:
I believe https://code-review.googlesource.com/#/c/gocloud/+/17250 has the fix; it's hard for me to test.
I'm going to reopen until you confirm the fix.
By the way, excellent debugging info, thanks for that.
Thanks for the quick turnaround! That seems to have fixed it.
(pprof) top20
26479 of 26495 total (99.94%)
Dropped 321 nodes (cum <= 132)
Showing top 20 nodes out of 56 (cum >= 5461)
flat flat% sum% cum cum%
16384 61.84% 61.84% 16384 61.84% github.com/lyft/dbproxy/vendor/google.golang.org/grpc.(*clientStream).SendMsg
5461 20.61% 82.45% 5461 20.61% context.WithDeadline
2048 7.73% 90.18% 2793 10.54% github.com/lyft/dbproxy/vendor/google.golang.org/grpc/transport.(*http2Server).operateHeaders
1260 4.76% 94.93% 1260 4.76% runtime.malg
745 2.81% 97.75% 745 2.81% github.com/lyft/dbproxy/vendor/google.golang.org/grpc/transport.(*decodeState).processHeaderField
410 1.55% 99.29% 410 1.55% encoding/pem.Decode
171 0.65% 99.94% 171 0.65% crypto/tls.(*Conn).readHandshake
0 0% 99.94% 581 2.19% crypto/tls.(*Conn).Handshake
0 0% 99.94% 581 2.19% crypto/tls.(*Conn).clientHandshake
0 0% 99.94% 581 2.19% crypto/tls.(*clientHandshakeState).doFullHandshake
0 0% 99.94% 410 1.55% crypto/x509.(*CertPool).AppendCertsFromPEM
0 0% 99.94% 410 1.55% crypto/x509.(*Certificate).Verify
0 0% 99.94% 410 1.55% crypto/x509.initSystemRoots
0 0% 99.94% 410 1.55% crypto/x509.loadSystemRoots
0 0% 99.94% 410 1.55% crypto/x509.systemRootsPool
0 0% 99.94% 16384 61.84% github.com/lyft/dbproxy/services/proxy.(*Service).Get
0 0% 99.94% 16384 61.84% github.com/lyft/dbproxy/services/proxy/grpc/handlers.Get.func1
0 0% 99.94% 16384 61.84% github.com/lyft/dbproxy/vendor/cloud.google.com/go/spanner.(*RowIterator).Do
0 0% 99.94% 16384 61.84% github.com/lyft/dbproxy/vendor/cloud.google.com/go/spanner.(*RowIterator).Next
0 0% 99.94% 5461 20.61% github.com/lyft/dbproxy/vendor/cloud.google.com/go/spanner.(*healthChecker).worker
Woot.
Most helpful comment
I'm going to reopen until you confirm the fix.
By the way, excellent debugging info, thanks for that.