Cloud-on-k8s: Scale testing ECK

Created on 7 Feb 2019 · 6Comments · Source: elastic/cloud-on-k8s

Things we're looking to answer:

How many clusters can the operator+k8s handle?
What are the main bottlenecks, and how can we work past them?
Speed/performance/availability/stability of PVs
Network performance

>test v1.0.0

Source

nkvoll

Most helpful comment

Kubernetes version: 1.15.4-gke.18
ECK commit: b559ec387bc30c1be3011b00722ef98cc5a683b7

As to be expected, the operator is bound mainly by external factors such as resource limits, provider quotas and limitations of Kubernetes itself. I managed to spawn 218 Elasticsearch clusters (1 master + 5 data nodes) and 218 Kibana instances before the time to attach persistent volumes became a bottleneck and clusters were taking too long to reach green status. At this point, the number of objects in the cluster were as follows:

436 Statefulsets
218 Deployments
1526 Pods
872 Services
3924 Secrets
436 Configmaps

Resource usage of the operator was as follows:

It should be noted that the RAM usage only increases when new resources are created. When no new create operations are being performed, memory usage remains fairly constant even when the underlying resources are being modified regularly. Some spikes can be observed when a large number of pods disappear (e.g. when a node dies), but they return to normal quickly.

CPU and network usage remain fairly flat and does not seem to grow in proportion to the number of resources being managed. RAM usage is proportional to the number of resources being managed (268 MiB for 436 resources = ~0.61 MiB per resource). This proportion did not seem to vary by the number of Elasticsearch nodes created per resource and can be considered reasonably stable.

Number of requests to the API server remained fairly low at around 100 RPS and the 99th percentile latency below 150ms. The API server was using 1.32 GiB of RAM and CPU usage was under 0.29.

Based on these observations, I think it is fair to conclude that the capacity of the operator is mostly bound by the resources allocated to it and the limitations of the underlying Kubernetes installation. Since the reconciliation logic is relatively simple and does not require locks nor complex state management, it should be able to handle a very large number of resources effortlessly if all other external factors are controlled. One potential bottleneck that could manifest itself at a sufficiently large scale is the access to the Elasticsearch API of each managed cluster. However, it is unlikely that this would be an immediate concern in the short term.

charith-elastic on 29 Nov 2019

❤4 👍1

All 6 comments

While ideally the tests described here are fully automated and repeatable we can run a first iteration of scale testing for the 1.0 release which does not attempt to automated everything right away if it helps reducing the effort involved in this.

pebrc on 18 Oct 2019

For big clusters you typically start running into timeouts on the API server for operations like trying to list all pods. (We've had this in our implementation ;)

otrosien on 14 Nov 2019

As this is not a typical load testing use case, I was not aware of any pre-existing tools and techniques to test a Kubernetes operator. I had to do some experiments to understand the nature of the problem and associated issues.

Unsurprisingly, Kubernetes itself is the main factor in any attempt to understand the behaviour of the operator under load. As the operator delegates most of its work to built-in Kubernetes controllers like the Statefulset controller, it is effectively bound by the capacity of those systems. Another limiting factor is the quotas imposed by the cloud provider which prevents creation of resources such as persistent volumes beyond a certain limit.

During experimental testing, resource usage of the operator appeared to be roughly proportional to the number of resources being managed regardless of cluster size (no significant difference between single node clusters vs. 30-node clusters). The main bottleneck seemed to be the number of secrets created, which seem to trigger a volume mounting error in the Kubelet: MountVolume.SetUp failed for volume "elastic-internal-http-certificates" : couldn't propagate object cache: timed out waiting for the condition (https://github.com/kubernetes/kubernetes/issues/70044). No significant load on the API server was observed apart from an increase in Goroutine count. Response times for API calls remained fairly constant throughout.

Given these initial findings, it appears that the behaviour of the operator depends mainly on external factors rather than intrinsic ones. The question, then, is to determine what we want to achieve from the scale testing effort. Do we want to establish a baseline such as "the operator requires X mb of RAM for each managed resource" or make a sweeping statement like "the operator can manage N Elasticsearch clusters". The former is measurable and fairly environment-agnostic while the latter is subjective and depends on many external factors (Kubernetes versions, available resources and their saturation, cluster topology, overhead from service meshes and network overlays, other operators and applications running in the cluster etc.)

Any ideas or suggestions are welcome.

charith-elastic on 27 Nov 2019

Kubernetes version: 1.15.4-gke.18
ECK commit: b559ec387bc30c1be3011b00722ef98cc5a683b7

436 Statefulsets
218 Deployments
1526 Pods
872 Services
3924 Secrets
436 Configmaps

Resource usage of the operator was as follows:

Number of requests to the API server remained fairly low at around 100 RPS and the 99th percentile latency below 150ms. The API server was using 1.32 GiB of RAM and CPU usage was under 0.29.

charith-elastic on 29 Nov 2019

❤4 👍1

I left the operator running over the weekend, managing 50 Elasticsearch clusters and 50 Kibana instances. Chaoskube was configured to randomly kill a pod every 10 minutes. Additionally, as the pods were scheduled on to a node pool consisting of GCP preemptible nodes, Kubernetes nodes were automatically cycled every 24 hours as well.

All 100 resources were in green state after the weekend. There was a sudden jump in used memory from 83 MiB to 112 MiB -- presumably due to multiple nodes getting recycled -- but for over 48 hours the total used memory only increased by 2 MiB.

Average heap usage over this period was 51 MiB.

From looking at the heap profile, it appears that most of the heap allocations can be attributed to the TLS connection establishment by the operator's Elasticsearch client and the Kubernetes API client. For large deployments, we may want to reconsider this approach and investigate the feasibility of long-lived connections and techniques like TLS session resumption to reduce the overhead of communication.

charith-elastic on 2 Dec 2019

❤4

Managed to run 514 Elasticsearch clusters with the operator before the time to detect the health of clusters became too long. From a quick look at the observer code, it appears that for each cluster, 3 goroutines are spawned every 10 seconds to retrieve the cluster info, cluster health and licence. Since these are network calls, at a sufficiently large scale the overhead seems to compound and become more obvious. Unfortunately, the client is not instrumented so there are no metrics to illustrate this point. Profile data as well the following trace summary is available for anyone interested in digging deeper.

Goroutines:
net.(*Resolver).goLookupIPCNAMEOrder.func3.1 N=1544
net/http.(*persistConn).addTLS.func2 N=259
net/http.(*persistConn).readLoop N=775
net/http.(*Transport).dialConnFor N=259
net/http.(*persistConn).writeLoop N=774
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.RetrieveState.func3 N=259
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.RetrieveState.func1 N=257
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.RetrieveState.func2 N=258
internal/singleflight.(*Group).doCall N=258
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.(*Observer).runPeriodically N=259
runtime.timerproc N=2
k8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive N=2
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/observer.(*Manager).notifyListeners.func1 N=258
net.(*netFD).connect.func2 N=258
golang.org/x/net/http2.(*ClientConn).readLoop N=1
runtime/trace.Start.func1 N=1
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1 N=2
k8s.io/client-go/tools/cache.(*sharedIndexInformer).Run N=2
k8s.io/client-go/util/workqueue.(*Type).updateUnfinishedWorkLoop N=8
github.com/google/gops/agent.listen N=1
k8s.io/client-go/util/workqueue.(*delayingType).waitingLoop N=8
k8s.io/klog.(*loggingT).flushDaemon N=1
N=1965

In order to support very large deployments, we may want to reconsider the observation strategy and investigate options such as healthcheck sidecars that can reduce the workload of the operator.

Apart from the noticeable delay in detecting the health of clusters, everything else appeared to be normal during the test. The operator was managing the 514 clusters for over 12 hours and the memory usage during this period was 438 MiB with about 240 MiB allocated on the heap. 3291 goroutines were active. API Server was using 1.52 GiB of memory and had 9517 goroutines.

514 clusters amounts to: