Cloud-on-k8s: Potential memory leak in the operator

Created on 11 Oct 2019  ·  17Comments  ·  Source: elastic/cloud-on-k8s

After running a recent 1.0.0-beta1-b10 for two days with metrics reporting turned on and fetching those via metricbeat into Elasticsearch it seems that memory consumption has roughly doubled even though the number of clusters managed has been more or less constant (I have created and destroyed a few test clusters in that time)

Worth noting regarding the severity of this issue that the process's memory after two days of uptime is still at 25Mi which half of what we have as requests and far below the limit of 150Mi.

Operator manifest used:
https://github.com/pebrc/cloud-on-k8s/blob/dd77e135df28a80c10d603fd272009928eee0618/config/samples/beats/all-in-one.yaml#L3015-L3054

Go mstats struct with comments: https://golang.org/src/runtime/mstats.go

Visualisation: Heap Alloc bytes averaged in 10m interval
Screenshot 2019-10-11 at 20 28 25

Visualisation: Heap Objects averaged in 10m intervals
Screenshot 2019-10-11 at 18 31 10

Goroutines: averaged in 10m intervals are constant 👍
Screenshot 2019-10-11 at 18 41 42

Number of Reconciliations per controller: max in 10m intervals (note this is a counter so increase over time is expected)
Screenshot 2019-10-11 at 20 36 55

>bug

Most helpful comment

I'm not sure either is necessary since this works for now, and when we land a fix upstream (or someone else does) we can revisit this. But there were two other options we discussed out of band I thought might be worth noting here:

1) the default resync period is already 10h, so all items in the informer cache get flushed to the event handler. We could even just not requeue at all (at least assuming longer lived certificates) and instead rely on requeuing. This doesn't feel great, but is an option. Ref code and issue

2) I was curious how cert-manager handles this since it seemed like they would have very similar problems. If I'm reading it right, cert-manager wraps the workqueue and forgets items if they're already present before re-adding them (which should clear out the old timer?). Which is a fancier version of our old approach and might be worth considering. Reference:
https://github.com/jetstack/cert-manager/blob/master/pkg/scheduler/scheduler.go#L65
https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/controller.go#L122

All 17 comments

I restarted the operator with the latest bc12 and pprof profiling turned on. Memory still seems to be increasing over time.

Screenshot 2019-10-13 at 10 26 42

I took a heap profile right after restarting the operator and then again multiple times afterwards.
Comparing the first and the latest profile with pprof -normalize -base=$HOME/pprof/pprof.elastic-operator.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz ~/pprof/pprof.elastic-operator.alloc_objects.alloc_space.inuse_objects.inuse_space.007.pb.gz

the one thing that sticks out is
Screenshot 2019-10-13 at 10 19 55

I did the same test and I also saw a steady increase of the RSS memory _(I have not been able to configure metricbeat to collect the go metrics)_:
image
A memory dump gave me the same suspect:
image

I might be wrong but there is something suspicious in the source code of the go client.

Before I dive into the theory some facts about the operator and the "work queue":

  • The “delaying queue” is never empty in ECK because we schedule reconciliations to ensure that the CA are rotated. It means that there is always at least 1 element in the waiting queue of each of the controllers (apm,kibana,elasticsearch) set to at least 365 days. The requeue is set during every reconciliation loop, first time to 365 days, later according to this function:

https://github.com/elastic/cloud-on-k8s/blob/fb2e693949ec774be79e551637fd842e8c2af9d8/pkg/controller/common/certificates/expiration.go#L27-L37

  • An other important thing is that the implementation of the “delaying queue” is using 2 timers :
  • A ticker to wake up every 10 seconds:

https://github.com/kubernetes/client-go/blob/172b42569ccacc5c33a885060fc0b7eab863d81e/util/workqueue/delaying_queue.go#L169-L172

  1. time.After(d Duration) _(used indirectly through After(..))_ to schedule a delayed event at the right time.
    It is important to note the following comment in the godoc: “The underlying Timer is not recovered by the garbage collector until the timer fires.”

Knowing that here is the part that seems suspicious to me:

// waitingLoop runs until the workqueue is shutdown and keeps a check on the list of items to be added.
func (q *delayingType) waitingLoop() {
/* .... */
    for {
/* .... */
        if waitingForQueue.Len() > 0 {
            entry := waitingForQueue.Peek().(*waitFor)
            nextReadyAt = q.clock.After(entry.readyAt.Sub(now)) // <=== Here
        }
        select {
        case <-q.stopCh:
            return

        case <-q.heartbeat.C():
            // continue the loop, which will add ready items

        case <-nextReadyAt:
            // continue the loop, which will add ready items
/* ... */

A loop is scheduled at least every 10 seconds because of the heartbeat (case <-q.heartbeat.C()), it also means that a new nextReadyAt _(and an underlying timer)_ is created at least every 10 seconds.

Unless I'm missing something it means that at least every 10 seconds we are creating timers that will start to expire _(and be removed from memory)_ in 365 days...

Excellent analysis @barkbay! I think this is spot on. Just one addition we are also scheduling long running requeues in the license controller to re-issue new licenses after the expiry of the current ones.

https://github.com/elastic/cloud-on-k8s/blob/25caaf104217d387f47e4d588b0550476c7c4957/pkg/controller/license/license_controller.go#L74-L89

I am thinking we can work around this issue by managing long running re-queues in a dedicated scheduler structure (managed by ECK not go-client's workqueue) that correctly handles a single timer per key (typically namespaced name + the controller that initiated the timer) stopping existing timers before creating new ones. We could then inject an event into the correct workqueue just in time once a timer fires.

Unassigning myself as I am out the next couple of days and we are probably going with a simpler solution of not scheduling long running re-queues.

Let's implement an easy workaround we discussed: schedule a requeue in 10 hours instead of 365 days. This should mitigate the memory leak problem (timers still accumulate but are garbage-collected after 10 hours).

Let's not forget the license controller where something similar happens (license_expiry - safetyMargin) which can be years in the future

I'm not sure either is necessary since this works for now, and when we land a fix upstream (or someone else does) we can revisit this. But there were two other options we discussed out of band I thought might be worth noting here:

1) the default resync period is already 10h, so all items in the informer cache get flushed to the event handler. We could even just not requeue at all (at least assuming longer lived certificates) and instead rely on requeuing. This doesn't feel great, but is an option. Ref code and issue

2) I was curious how cert-manager handles this since it seemed like they would have very similar problems. If I'm reading it right, cert-manager wraps the workqueue and forgets items if they're already present before re-adding them (which should clear out the old timer?). Which is a fancier version of our old approach and might be worth considering. Reference:
https://github.com/jetstack/cert-manager/blob/master/pkg/scheduler/scheduler.go#L65
https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/controller.go#L122

Let's keep this open until we correctly observe that the memory leak is gone (or at least significantly contained).

I'm keeping the operator running since a few hours and it looks like it still leaking memory:

image

Memory usage seem to have stabilised over the last 24 hours:

image

Not sure yet the memory leak is fixed though. Also I realized I should probably generate some reconciliations event (eg. patch ES annotations with some randomness every X seconds), since the memory leak we observed was directly related to reconciliations happening.

I setup a k8s job that patches the Elasticsearch resource annotation every second, to trigger reconciliations:

apiVersion: batch/v1
kind: Job
metadata:
  name: generate-reconciliations
  namespace: elastic-system
spec:
  template:
    metadata:
      name: generate-reconciliations
    spec:
      serviceAccount: elastic-operator
      containers:
        - name: generate-reconciliations
          image: bitnami/kubectl:latest
          command:
          - bash
          args:
          - -c
          - while true; do kubectl patch elasticsearch monitor -n beats -p "{\"metadata\":{\"annotations\":{\"date\":\"`date +'%s'`\"}}}" --type merge; sleep 1; done
      restartPolicy: Never

And restarted the operator to use a recent nightly build that includes the memory leak fix.

I expect to see memory growing for the first 10 hours, then being approximately constant for the remaining timeframe.

Finally observed what I wanted to! 🎉

image

This is when triggering around 1 reconciliation per second with the above job.
The memory usage of the container slowly grows for the first 10 hours, since we're creating one new timer per second. Then after the first 10 hours, old timers start expiring, along with new ones being created. Leading to memory usage remaining stable (we always have approximately the same number of timers in memory since we keep doing 1 reconciliation per second).

In this scenario we seem to be in the 20-30MB range additional stable memory usage.
It seems to properly contain the leak, until we depend on k8s v1.17 that includes https://github.com/kubernetes/kubernetes/pull/83945.

Hey I'm curious what tools you used to collect the performance data and view the graphs? I haven't done much performance monitoring in K8s but am wanting to learn.

Hey @ljdelight, we are running Metricbeats on Kubernetes, which pushes metrics into the configured Elasticsearch cluster, that we visualize with Kibana.

What's your process when you "collect a heap profile"? Like, do you have profiling code in the binary that writes the heap data on some interval, or do you pprof the binary (which needs specific compile flags?). I'm thinking of approaches and I imagine it's tricky in K8s if there's no pprof in the container.

Thank you so much for the replies @sebgl and @anyasabo, i really appreciate it

Was this page helpful?
0 / 5 - 0 ratings

Related issues

sebgl picture sebgl  ·  5Comments

SebastianCaceresUltra picture SebastianCaceresUltra  ·  3Comments

pebrc picture pebrc  ·  3Comments

barkbay picture barkbay  ·  4Comments

spencergilbert picture spencergilbert  ·  3Comments