Currently, we:
With no default limits, the Elasticsearch pod can be evicted from a k8s host if the host is out of memory. For stateful workloads such as Elasticsearch, this can be pretty bad.
We discussed 2 approaches to solve this:
Just enforce 2GB memory (requests and limits) for the ES pod. We already do this for requests, so that's just also adding it for limits.
We would remove our default memory request, and ask the user to provide one. Probably changing our quickstart example from:
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
name: quickstart
spec:
version: 7.2.0
nodes:
- nodeCount: 1
config:
node.master: true
node.data: true
node.ingest: true
EOF
to:
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
name: quickstart
spec:
version: 7.2.0
nodes:
- nodeCount: 1
config:
node.master: true
node.data: true
node.ingest: true
podTemplate:
spec:
containers:
- name: elasticsearch
env:
- name: ES_JAVA_OPTS
value: -Xms1g -Xmx1g
resources:
requests:
memory: 2Gi
limits:
memory: 2Gi
EOF
I would be in favor of option 1.
Option 2 makes it more explicit to the user that these should be overridden (since users probably don't read the doc), but feels much more verbose for a quickstart.
Related to #236, #815, #819, #1097 and #1029.
I don't think if we expand the quickstart example, we would also need to remove the default requests (or limits). Personally I think I'm okay with both.
As far as I know, the only performance downside of adding a default limit is that the pod will not be able to use >2Gi for the cache. But that's a relatively small downside and I am okay with it.
Another downside: other resources that use pod templates (e.g. deployments, ssets) just create pods and don't set defaults, so you just end up with regular pod defaults. So this is a little bit of a departure from "normal" behavior for types that create pods. That said, we know a little bit more about our use case here than those more generic types, so I think it is justified.
But just to spell out the upside of a default limit:
Having the request == limits means it will run as a Guaranteed QoS and should be in the last group of pods to be evicted in the event of resource pressure.
For expanding the quickstart example, I think starting people closer to a "Production" configuration outweighs the extra verbosity. If people want to change it in the future, they don't need to come back to our documentation to look up how to set resource requests, they can just kubectl edit elasticsearch quickstart and be done. But having a default limit also gets a lot of the same benefit, so I don't feel too strongly. Curious what other people's thoughts are.
The main concern I have with adding some default behind the scene is that it could break the user experience when there are some Limit Ranges applied in the cluster _(must confess here that maybe I'm more sensitive to that because I have used them to manage some projects in my previous shop)_
tbh I think that applying an empty limits as workaround is a little bit odd:
yaml
podTemplate:
spec:
containers:
- name: elasticsearch
resources:
limits: {}
I don't think that expanding the quickstart example is really that bad and I agree with @anyasabo that it's closer to a "Production" configuration, it is a "quick start" and a "good start" 馃槃
What should be improved imho is that the user should not have to manage the memory settings in 2 different places, Xmx could be calculated automatically, see https://github.com/elastic/cloud-on-k8s/pull/1029#pullrequestreview-247393459
It would save 3 lines :)
So I would vote for 2 but in the mean time check if the Docker image could be improved to automatically adjust the JVM settings depending on the control groups applied to the container.
~We decided to go with option 2 for now.~ This decision did not stand.
I'm wondering if it could be worth adding an optional extra field to the ES CRD. See memory below:
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
name: quickstart
spec:
version: 7.2.0
nodes:
- name: master-nodes
memory: 2Gi
nodeCount: 3
- name: data-nodes
memory: 32Gi
nodeCount: 3
EOF
Behaviour:
memory are set, one gets the priority over the other (priority to memory is probably simpler?).Benefits:
Concerns:
Note this does not prevent to move forward with Option 2 above as a first step.
Hey,
I'm kind of late to the game here, but wanted to voice my opinion on this.
While I understand we need a good solution here, I don't think I like option 2. that we've decided on, because:
Maybe we should we have ECK production checklist similar to the one ES has? It's also possible that I'm over sensitive on the config looks and that it doesn't matter/is obvious for someone that is already familiar with ES :)
I really like the suggestion @sebgl made. It's clean, incorporates the logic behind recommended ratios, while still allows to be more precise/verbose using the pod template.
Trying summarise the discussion and the arguments:
{} to turn off defaults in case of limit ranger use, because that is already the case with our current implementation and documented as such]2 a. remove all defaulting and encourage users to set resource requests and limits and JVM options themselves
2 b. introduce a new attribute "memory" (in addition to removing the defaults)
spec:
version: 7.2.0
nodes:
- name: data-nodes
memory: 2Gi
nodeCount: 3
memory and podTemplate resources if one wants to specify CPU. Or one has to not use memory and rely on resources for both CPU and memory and set JVM options separately as welltl,dr
I would be in favor of option 1.
+1 for now (and let's maybe discuss our approach to defaults)
I feel like that these defaults are mainly targeting the 'getting started experience' and will never be able to provide a production grade configuration.
Thoughts:
1) A default of 1gb heap (JVM) matches Elasticsearch's other packaging (rpm, etc)
2) A default memory limit of 2gb matches our recommendation to leave half the memory for fs cache
3) 1gb JVM heap is unsuitable for most production use cases, so users will most likely need to change this anyway.
If the failure scenarios for these defaults on k8s (memory limit 2gb) are obvious kinds of failures, such as a pod fails to schedule, etc, then I think it makes sense.
Having no memory limit by default seems dangerous and asking for a PoC to start failing unpredictably due to pod evictions, etc, especially with scenarios like #1716.
Each time I've tried to abstract things in Kubernetes, it feels like doing it wrong. Kubernetes yaml is, typically, painfully verbose, even when using templating tools like Helm. I've experimented with Jsonnet to help abstract things ("memory 2g" would set memory request+limit to 2g and JVM Heap to half that), and that helps quite a bit, I feel there's little we can do automatically in a Kubernetes resource definition.
When I look at the helm chart, there's a vast array of configurations such that abstraction isn't even really possible with all the templating values choices it supports.
While I'd love to set "cpu X memory Y" and compute the rest, I don't know if that really aligns with how users expect Kubernetes operators to behave (I have no experience there). From helm charts, I find abstractions like this are uncommon, which is frustrating for me as a user since I kind of revel in such abstractions and want to build them when they don't exist.
Focusing on abstractions, here's what I had experimented with in jsonnet:
local epsilon = import "epsilon.libjsonnet";
local name = "demo";
local resources = {
data: { cpu: 2, heap: "4g", memory: "6Gi", disk: "100Gi" },
master: { cpu: 2, heap: "4g", memory: "6Gi", disk: "30Gi" }
};
epsilon.values("demo", resources)
This would create a cluster with the default replica count (statefulset, replica count 1). This also included an abstraction for disk (which created a volumeClaimTemplate as appropriate).
As much as I want/need abstractions (to minimize mistakes), it's possible a Kubernetes resource is the wrong level to attempt such abstractions given there are so many different Kubernetes environments with weird, wild, and unforeseen configurations under which abstractions will fail.
My current leaning is, therefore, to set defaults (1gb JVM heap, 2GB memory request+limit, some appropriate CPU if needed, etc).
I feel like that these defaults are mainly targeting the 'getting started experience' and will never be able to provide a production grade configuration.
That maybe helps understanding that the priority of fixing this is maybe not as high as one might think. Having said that, if we can give a better experience in the 'quickstart' case where we don't have to explain why nodes get evicted we should try.
Doesn't not setting any defaults, but adding requests/limits in the quickstart accomplish the same goal but with less downside (e.g. having to read how to set it in the docs rather than starting off with an example you can just edit right there, us having to maintain defaulting code, interaction with limitranges)?
Having no memory limit by default seems dangerous and asking for a PoC to start failing unpredictably due to pod evictions, etc, especially with scenarios like #1716.
That's true, but it's also the case for any native resources you spin up too.
That said, we have knowledge of what our application requires (which appsv1/Deployments dont really) so I can see the argument for adding the defaults. Overall I think I agree with setting the 2Gi request/limit as default, but just wanted to bring up the (somewhat persuasive imo) counter.
We had some more discussions offline and the decision is
podTemplate.spec.containers[].resources themselves or by setting it to {}https://github.com/elastic/cloud-on-k8s/pull/1810 applies the default 2Gi memory limit for Elasticsearch.
However, since we don't apply any CPU requirements, the pod still ends up in QoS Burstable. Which means it will be evicted before Pods with QoS Guaranteed.
Picking a default CPU value (eg. 1000m) feels a bit more complicated, since it will effectively slow down the ES Pod compared to not setting anything.
Most helpful comment
Thoughts:
1) A default of 1gb heap (JVM) matches Elasticsearch's other packaging (rpm, etc)
2) A default memory limit of 2gb matches our recommendation to leave half the memory for fs cache
3) 1gb JVM heap is unsuitable for most production use cases, so users will most likely need to change this anyway.
If the failure scenarios for these defaults on k8s (memory limit 2gb) are obvious kinds of failures, such as a pod fails to schedule, etc, then I think it makes sense.
Having no memory limit by default seems dangerous and asking for a PoC to start failing unpredictably due to pod evictions, etc, especially with scenarios like #1716.
Each time I've tried to abstract things in Kubernetes, it feels like doing it wrong. Kubernetes yaml is, typically, painfully verbose, even when using templating tools like Helm. I've experimented with Jsonnet to help abstract things ("memory 2g" would set memory request+limit to 2g and JVM Heap to half that), and that helps quite a bit, I feel there's little we can do automatically in a Kubernetes resource definition.
When I look at the helm chart, there's a vast array of configurations such that abstraction isn't even really possible with all the templating values choices it supports.
While I'd love to set "cpu X memory Y" and compute the rest, I don't know if that really aligns with how users expect Kubernetes operators to behave (I have no experience there). From helm charts, I find abstractions like this are uncommon, which is frustrating for me as a user since I kind of revel in such abstractions and want to build them when they don't exist.
Focusing on abstractions, here's what I had experimented with in jsonnet:
This would create a cluster with the default replica count (statefulset, replica count 1). This also included an abstraction for disk (which created a volumeClaimTemplate as appropriate).
As much as I want/need abstractions (to minimize mistakes), it's possible a Kubernetes resource is the wrong level to attempt such abstractions given there are so many different Kubernetes environments with weird, wild, and unforeseen configurations under which abstractions will fail.
My current leaning is, therefore, to set defaults (1gb JVM heap, 2GB memory request+limit, some appropriate CPU if needed, etc).