Flux: Goroutine leak in fluxd?

Created on 14 Dec 2018 · 7 Comments · Source: fluxcd/flux

We've observed fluxd's memory usage slowly increasing until it reaches the memory limit (arbitrarily set to 300Mi by us) and is OOMKilled by Kubernetes:

[image: fluxd memory usage graph]

With @ncabatoff's guidance I used the profiler to dump the goroutines and found several thousand. Here's a gist with the output of lsof and netstat plus the goroutine list. (10.244.3.20 is the IP address assigned to the flux-memcached pod)
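For readers unfamiliar with the Go tooling, here is a minimal sketch of how a goroutine dump can be produced from inside a Go program with the standard `runtime/pprof` package. It is not fluxd's code: the leaking goroutines in `main` are a made-up example, and how fluxd exposes its profiler is not shown in this thread.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// dumpGoroutines returns a text dump of all current goroutine stacks.
// debug=1 groups identical stacks together with a count, which makes
// thousands of goroutines stuck at the same place easy to spot.
func dumpGoroutines() (string, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Hypothetical leak: goroutines blocked forever on a channel
	// that nothing ever writes to or closes.
	block := make(chan struct{})
	for i := 0; i < 100; i++ {
		go func() { <-block }()
	}

	dump, err := dumpGoroutines()
	if err != nil {
		fmt.Println("dump failed:", err)
		return
	}
	fmt.Println("goroutines:", runtime.NumGoroutine())
	fmt.Println(dump)
}
```

In a dump like this, a leak usually shows up as one stack trace with a very large count next to it.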

We also have a lot of logs like the following:

ts=2018-12-14T15:11:47.467587321Z caller=warming.go:192 component=warmer canonical_name=mcr.microsoft.com/k8s/metrics/adapter auth={map[]} err="requesting tags: json: cannot unmarshal array into Go value of type struct { Tags []string \"json:\\\"tags\\\"\" }"
ts=2018-12-14T15:11:51.219396002Z caller=warming.go:192 component=warmer canonical_name=mcr.microsoft.com/k8s/aad-pod-identity/mic auth={map[]} err="requesting tags: json: cannot unmarshal array into Go value of type struct { Tags []string \"json:\\\"tags\\\"\" }"
ts=2018-12-14T15:11:54.583460254Z caller=warming.go:192 component=warmer canonical_name=iqsandbox.azurecr.io/gameday/quackserver auth={map[]} err="requesting tags: Get https://iqsandbox.azurecr.io/v2/gameday/quackserver/tags/list: unauthorized: authentication required"
ts=2018-12-14T15:11:55.710904979Z caller=warming.go:192 component=warmer canonical_name=mcr.microsoft.com/k8s/aad-pod-identity/nmi auth={map[]} err="requesting tags: json: cannot unmarshal array into Go value of type struct { Tags []string \"json:\\\"tags\\\"\" }"

iqsandbox.azurecr.io is a container registry which doesn't have a pull secret present in the namespace Flux is running in. Anecdotally, this issue seems to have started (or gotten worse) around the time we started running containers using images from mcr.microsoft.com, so maybe something in that code path is the culprit?
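The `cannot unmarshal array` error in the logs above can be reproduced in isolation. The sketch below decodes into the same struct shape named in the log message; the payloads are assumptions for illustration only (what mcr.microsoft.com actually returned is not shown in this thread), and `decodeTags` and `isArrayMismatch` are hypothetical helpers, not Flux functions.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// decodeTags decodes a tag-list response body into the struct shape
// that appears in the log message.
func decodeTags(body []byte) ([]string, error) {
	var resp struct {
		Tags []string `json:"tags"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	return resp.Tags, nil
}

// isArrayMismatch reports whether err is a JSON type error caused by
// finding an array where another Go type was expected.
func isArrayMismatch(err error) bool {
	var typeErr *json.UnmarshalTypeError
	return errors.As(err, &typeErr) && typeErr.Value == "array"
}

func main() {
	// Assumed "expected" shape: an object with a "tags" field.
	tags, err := decodeTags([]byte(`{"name":"example/repo","tags":["v1","v2"]}`))
	fmt.Println(tags, err)

	// Assumed failing shape: a top-level JSON array instead of an object.
	_, err = decodeTags([]byte(`["v1","v2"]`))
	fmt.Println(err, isArrayMismatch(err))
}
```

This only shows that the decoder received a top-level array where it expected an object. It says nothing about whether this code path is related to the leak.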

I'll be happy to provide any additional diagnostic info you need. Thanks again to @ncabatoff for walking me through this so far (I don't have any experience with the golang toolchain and his help was invaluable).

All 7 comments

One extra detail: we think the memory increase is due to the goroutines because @brantb also produced a pprof heap SVG, which shows that the heap accounts for only ~80Mi of the 262Mi of memory used.

Whoops, I forgot to include that. Here's the heap graph.

@brantb I am not 100% sure that #1672 will solve the problem, but it will surely help.

Interestingly, the number of authentication-error logs from the warmer roughly matches the number of descriptors leaked:

# grep component=warmer "flux-log.txt"  | grep unauthorized | wc -l
    3791
# cat /proc/$FLUXPID/net/sockstat
sockets: used 3808
TCP: inuse 16 orphan 0 tw 27 alloc 3046 mem 2750
UDP: inuse 0 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
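One common way a correlation like this arises in Go HTTP clients is an error path that returns without closing the response body, which can keep the socket and the transport goroutines serving it alive. Whether that is what happened in fluxd is not confirmed in this thread. The sketch below only illustrates the pattern, using a hypothetical `requestTags` helper and a local test server that always answers 401.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// requestTags is a hypothetical helper, not Flux's code. It closes the
// response body on every path, including non-200 responses.
func requestTags(client *http.Client, url string) ([]byte, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	// Without this Close, each failed request can leave a socket and
	// its connection goroutines behind.
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		// Drain a bounded amount so the connection can be reused.
		_, _ = io.Copy(io.Discard, io.LimitReader(resp.Body, 4096))
		return nil, fmt.Errorf("requesting tags: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// newUnauthorizedServer starts a local test server that rejects every
// request with 401, standing in for a registry without credentials.
func newUnauthorizedServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "authentication required", http.StatusUnauthorized)
	}))
}

func main() {
	srv := newUnauthorizedServer()
	defer srv.Close()

	_, err := requestTags(srv.Client(), srv.URL+"/v2/example/repo/tags/list")
	fmt.Println(err)
}
```

Comparing the socket count before and after a burst of failing requests, as done above with /proc, is one way to check whether a given client has this problem.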

We've been running master-2441121d for over a week now and the memory profile looks much healthier:

[image: memory usage graph]

I'm going to call this one a win. Thanks again, @2opremio & @ncabatoff!

@brantb nice UI - what is it?

@davidkarlsen That's Azure Monitor, for their managed Kubernetes service (AKS). 😄
