We think we are giving this attribute too much of prominence given that it is a hack to enable a smooth getting started experience.
Part of https://github.com/elastic/cloud-on-k8s/issues/1141
kind: Elasticsearch
metadata:
name: elasticsearch-sample
spec:
version: 7.2.0
unsafe: # tweaks, hacks,
setVmMaxMapCount: true
nodeSets:
- name: default
count: 3
Just using PodSecurityContext does not work because of the sysctl we want to use is typically not whitelisted thereby defeating the purpose of this exercise.
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
name: elasticsearch-sample
spec:
version: 7.3.0
nodeSets:
- name: default
podTemplate:
spec:
securityContext:
sysctls:
- name: vm.max_map_count
value: "262144"
count: 1
Looks like on OpenShift you can use annotations as well, so there is some precedence
apiVersion: v1
kind: Pod
metadata:
name: sysctl-example
annotations:
security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1
security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3
spec:
This has the downside of creating an implicit API
Promoting tweaking the host in an init container feels a little rough (but can totally be done, IF you accept the security risks associated with it), so I'm not sure we should be strongly advocating it. This is an ES feature (both memory mapped files -- a lot of them, and bootstrap checks -- yay!), which can be disabled via config (https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html#allow-mmap):
You can restrict the use of the mmapfs and the related hybridfs store type via the setting node.store.allow_mmap. This is a boolean setting indicating whether or not memory-mapping is allowed. The default is to allow it. This setting is useful, for example, if you are in an environment where you can not control the ability to create a lot of memory maps so you need disable the ability to use memory-mapping.
I think we can make the argument that we can instruct our users to do either 1) set the sysctl setting in one of several ways, or 2) disable mmap because if you are in an environment where you can not control the ability to create a lot of memory maps. The description seem to fit.
Thus, I propose the following:
Remove setVmMaxMapCount from the spec. It's promoting bad behavior from containers.
Provide something ala this as a quickstart:
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
name: quickstart
spec:
version: 7.2.0
nodes:
- name: default
nodeCount: 1
config:
node.master: true
node.data: true
node.ingest: true
## Hosts running ES should set vm.max_map_count according to
## https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
##
## If you're not able to configure this, node.store.allow_mmap can be set to false to
## disable memory mapped files to avoid this requirement, at the cost of performance
## (see https://www.elastic.co/guide/en/elasticsearch/reference/current/_maximum_map_count_check.html)
#node.store.allow_mmap: false
EOF
We should probably discuss within the team whether we'd like to have this uncommented by default (for the quickstart only), and if we'd like to further document this (e.g add a section to "troubleshooting" that explains it further and allows for perhaps easier discovery when running into the issue outside of the quickstart).
A follow-up question should go to the ES team whether they are fine with this as the quickstart contents, or whether there are better ways we can/should deal with this. (e.g potentially allowing for a lower number of mem-mapped files for small clusters in the bootstrap-check?)
Promoting tweaking the host in an init container feels a little rough
Agreed. However, this has been in my experience the only way to make host-level changes on GKE. To phrase this another way, 100% of GKE users[1] are likely to need this sysctl vm.max_map_count setting.
In past explorations, I've found folks requesting things like a "DaemonJob" which would act like a DaemonSet (schedule on all nodes) but otherwise act like a Job (run once). I don't think anyone's got a solution for a 'daemon job' style thing.
An initContainer with elevated privileges is the shortest path to declare "In order for this pod to be successful, the host needs to be configured a certain way." There's been recent work to move sysctl stuff (as mentioned above in the description) with a PodSecurityPolicy, and maybe we can do something along those lines and then set this in the Elasticsearch pod's securityContext?
[1] I don't know how common it is for someone to run GKE but also do config management on their nodes (like with puppet or ansible). On Infosec, we run GKE's default OS image (Google Container-optimized OS) and it has a number of properties, like the filesystem changes revert on reboot and upgrade, which make it hard or annoying to do this.
I think our default behavior should be the same as our best practices. In this case, do change the vm.max_map_count setting unless we can't find any way to have it by default. Otherwise, I feel that our user experience especially involving support would always open the door on a sour note - a user experiencing a performance issue, and we may blame the lack of correct vm.max_map_count.
Checking the helm chart, it has an optional initContainer to set the sysctl settings.
The helm chart docs are interestingly worded to imply that you are _expected_ to set vm.max_map_count somehow, whether by the initContainer or by some other means.
sysctlInitContainer.enabled | Allows you to disable the sysctlInitContainer if you are setting vm.max_map_count with another method
The way I read this implies that it is not optional (though we know one can choose niofs instead of mmapfs, but our docs warn this has bugs on windows
Some prior discussions for this optional sysctl initContainer on the helm chart, if that helps:
disable mmap because if you are in an environment where you can not control the ability to create a lot of memory maps
I feel like this is waaaaaay below the Kubernetes abstraction layer. A user is thinking about deploying Elasticsearch, and now we're forcing them to think about weird words like "mmap" and "sysctl". I don't think we should hide these details, but I wonder if we could suggest something on the same abstraction level as Kubernetes? Something like:
In order to achieve best filesystem performance, you should set
vm.max_map_countsysctl. You can have ECK do this for you if you install the appropriate PodSecurityPolicy{example pod security policy}
If you can't install this PodSecurityPolicy, you will need to set an appropriate sysctl
vm.max_map_counton all kubelet nodes yourself, or you will need to set [some Elasticsearch CRD setting] to false which causes Elasticsearch to avoid mmap, resulting in slower filesystem performance.
It's possible the operator could detect these situations on its own (Does an appropriate PodSecurityPolicy exist? etc)
It'd be ideal if users didn't have to think about this ;)
However, this has been in my experience the only way to make host-level changes on GKE.
Does something provider specific not work, or am I missing your point?
For example:
gcloud compute ssh $USER:$INSTANCE --project=$GCP_PROJECT --zone=$ZONE --command="sudo sysctl -w vm.max_map_count=262144"
This works, manually. But we have dozens of GKE nodes coming and going week
to week. We would have to build a system for detecting new GKE nodes and
deploying this sysctl change quickly before we could deploy elasticsearch.
It adds another step, which adds complexity and room for error.
On Wed, Sep 18, 2019 at 1:25 AM Peter Brachwitz notifications@github.com
wrote:
However, this has been in my experience the only way to make host-level
changes on GKE.Does something provider specific not work, or am I missing your point?
For example:
gcloud compute ssh $USER:$INSTANCE --project=$GCP_PROJECT --zone=$ZONE
--command="sudo sysctl -w vm.max_map_count=262144"—
You are receiving this because you commented.Reply to this email directly, view it on GitHub
https://github.com/elastic/cloud-on-k8s/issues/1712?email_source=notifications&email_token=AABAF2VU46XLLCW5KQJIFPDQKHQX7A5CNFSM4IVV52JKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD67H6MY#issuecomment-532578099,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABAF2Q3PMZKMX3X6DWWEO3QKHQX7ANCNFSM4IVV52JA
.
@jordansissel you can configure the GKE nodes bootstrap and add the sysctl setting there, so it happens e.g before the kubelet even starts. This makes the most sense to me: configuring host level settings on the host, at bootstrap.
This is a host-level sysctl (not namespaced), so one container setting it will affect all others on the host, so setting it in an init-container and hoping nothing overwrites it with something incompatible doesn't feel like an obvious/perfect solution. Assuming possible multi-tenancy/workloads here.
Nothing would prevent you from including such an init container on your own though if you accept the tradeoff (a few lines of YAML) is sufficient:
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
name: quickstart
spec:
version: 7.2.0
nodes:
- name: default
nodeCount: 1
podTemplate:
spec:
initContainers:
- name: sysctl
securityContext:
privileged: true
command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
In the end, this is going to have to be a trade-off between ease of use, security, isolation and we need to have this discussion in order to pick the default behavior that we believe makes the most sense. There's different kinds of users that will have different expectations here, and it looks like all options currently on the table seems very odd to at least one of the groups?
Also to note, when a GKE node reboots, it resets all changes. So we would
have to make sure this is detected and sysctl run before elasticsearch
would come back online.
The best way to do this on GKE is with k8s resources (daemonset or
initContainer)
On Wed, Sep 18, 2019 at 7:12 AM Jordan Sissel notifications@github.com
wrote:
This works, manually. But we have dozens of GKE nodes coming and going week
to week. We would have to build a system for detecting new GKE nodes and
deploying this sysctl change quickly before we could deploy elasticsearch.
It adds another step, which adds complexity and room for error.On Wed, Sep 18, 2019 at 1:25 AM Peter Brachwitz notifications@github.com
wrote:However, this has been in my experience the only way to make host-level
changes on GKE.Does something provider specific not work, or am I missing your point?
For example:
gcloud compute ssh $USER:$INSTANCE --project=$GCP_PROJECT --zone=$ZONE
--command="sudo sysctl -w vm.max_map_count=262144"—
You are receiving this because you commented.Reply to this email directly, view it on GitHub
<
https://github.com/elastic/cloud-on-k8s/issues/1712?email_source=notifications&email_token=AABAF2VU46XLLCW5KQJIFPDQKHQX7A5CNFSM4IVV52JKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD67H6MY#issuecomment-532578099
,
or mute the thread
<
https://github.com/notifications/unsubscribe-auth/AABAF2Q3PMZKMX3X6DWWEO3QKHQX7ANCNFSM4IVV52JA.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/elastic/cloud-on-k8s/issues/1712?email_source=notifications&email_token=AABAF2VJC53Q2ESLBBAYXMTQKIZN5A5CNFSM4IVV52JKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7AGQ4Y#issuecomment-532703347,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABAF2V5KPH6FX5QO4K65DDQKIZN5ANCNFSM4IVV52JA
.
you can configure the GKE nodes bootstrap and add the sysctl setting there
How? I'm unable to find any boot/configuration settings for Google Container OS without building my own image.
this is going to have to be a trade-off between ease of use, security, isolation
Agreed. If we can still use a custom initContainer or securityContext (assuming a PSP will work for this), that should help us continue upgrading ECK without tons of development effort (ie; we don't have to build custom GKE node images, etc).
Is there still a "quick start" getting started experience without this sysctl hack? if it's removed, Elasticsearch won't start, unless we disable allow_mmap by default, which is not suitable for production and will be blamed quickly in support scenarios[1].
[1] when seeking advice on our own ECK clusters, the first reaction is _always_ "Stop using Google Persistent Disk and use local disks" even before the situation is evaluated to form a hypothesis. If we disable mmap by default, I believe we'll open the same scenario for ECK users, an unevaluated "Use hybridfs instead of niofs" presented when any performance concerns are raised. This will provide a bad experience for users who may not have chosen this outcome.
Everyone seems to have to set sysctl their own special way: initContainer, securityContext w/ an PodSecurityPolicy, by managing the OS the kubelet runs on, etc. This feels weird. Can we bring this problem upstream to Kubernetes?
Can we revisit with the Elasticsearch team to allow Elasticsearch to run with the default vm.max_map_count? The default on GKE is 65536, and our minimum "required" is 4x that. The default on Ubuntu 18.04 seems to be 65530. Is 262144 really some magic number and 65536 is wrong?
How? I'm unable to find any boot/configuration settings for Google Container OS without building my own image.
This was my understanding as well and what led me to be okay with keeping the init container as an option. It appears startup scripts are still blacklisted:
https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1/NodeConfig
And while people running their own k8s will likely be building their own images, I really doubt that will be the case for most people on managed k8s platforms.
Can we revisit with the Elasticsearch team to allow Elasticsearch to run with the default vm.max_map_count
Found a recent discussion where this was rejected: https://github.com/elastic/elasticsearch/issues/45815
Just in case I can't make meeting tomorrow, here is my overall thinking on this:
I do think it is useful to have this flag, given that it is a pain to configure this in a hosted environment like GKE (and potentially also AKS, EKS).
No strong opinion on if it defaults to true or false. I think I lean towards false because I don't think I want to encourage privileged pods that change the host by default. I can see the other argument that it is helpful to default to true from a quickstart perspective though, but I don't think that outweighs encouraging an overall not great practice of allowing all privileged pods.
I am in favor of moving it to a lower level. I don't particularly care what key we put it under. experimental, tweaks, and unsafe all work for me, and probably prefer them in roughly that order.
I don't know enough about the trade offs to know if we can encourage niofs by default (and so avoid the mmap bootstrap check). Even if we do, I don't think that will change my mind on still wanting it to be an option but behind a key like experimental etc, because there may still be relevant use cases for mmapfs that would otherwise be a pain to use.
I'm thinking that for a few reasons, maybe we should default node.store.allow_mmap to false. Here's my reasoning:
1) Allowing mmap requires vm.max_map_count set correctly. This requires one of these solutions:
vm.max_map_count (due to policy or whatever, preventing this setting)On the quickstart note, I feel the best way we can be sure everyone is successful is to set allow_mmap to false. We can also write a "going to production" document. We have such a thing for Kibana, and Google has one for GKE.
Thoughts? I'll ping #elasticsearch and see if they have recommendations here.
I could definitely be persuaded. The docs I saw didn't make it clear what the trade off of running only niofs was, just that one could do it. I'm assuming there's a reason for the hybridfs setting. If there's no real downside, the portability benefits are compelling.
That said I don't think it changes my opinion on the setVmMaxMapCount feature -- we can still maintain it but put it behind an experimental flag of some sort.
On the quickstart note, I feel the best way we can be sure everyone is successful is to set allow_mmap to false. We can also write a "going to production" document. We have such a thing for Kibana, and Google has one for GKE.
What about keeping the Elasticsearch default (node.store.allow_mmap: true) and just disable it in the quickstart example (as suggested in https://github.com/elastic/cloud-on-k8s/issues/1712#issuecomment-531232298)
Reading this issue it doesn't seem like any of us feel comfortable making the recommendation to do that just yet though, and would need to discuss with the ES team. It seems like that would affect a bunch of people's opinions on the setVmMaxMapCount flag in ECK , so I'm in favor of tabling this until we get an opinion from them.
so I'm in favor of tabling this until we get an opinion from them.
I think Njals suggestion to turn off allow_mmap for the quickstart resulted from a conversation with @jasontedor and I think the desire to keep the defaults the same across all platforms was also expressed.
That's correct, @nkvoll and I discussed this a few weeks ago, and disabling node.store.allow_mmap is the path that is least likely to give users a bad experience. Yes, there will be a performance difference, in a lot of use cases it won't be noticeable, especially for a quickstart experience. Keeping the defaults the same is also a goal, we don't really want users to have a different experience depending on how they deploy Elasticsearch, so I'm supportive of only making this for the quickstart.
Ah I must have misread then. Given that, I think I'm good with your suggestion here:
What about keeping the Elasticsearch default (node.store.allow_mmap: true) and just disable it in the quickstart example (as suggested in #1712 (comment))
Though I think it may still be useful to keep the functionality in ECK but buried under experimental so people who are okay with it can still use it.
@jasontedor just to make sure I understand, your opinion is that for the quickstart we should disable allow_mmap, but we still should have documentation on how to enable it and strongly recommend that it be enabled in production? Is that correct?
After some more discussions offline we decided to
setVmMaxMapCount boolean from the CRD. There is still some disagreement as to what the best approach is for the documentation part.
The two suggestions are at the moment:
allow_mmap
My vote: the example has allow_mmap:false with a link to a more detailed guide on the trade offs and how to set max_map_count in a privileged init container.
My vote:
Option 1 looks simpler, especially if we document it as in @pebrc screenshot, and provide a dedicated documentation page.
Option 2 looks way more verbose and advanced, but should definitely be documented in that dedicated doc page.
(edit: @anyasabo says it all in a single sentence 😄 )
Most helpful comment
My vote: the example has
allow_mmap:falsewith a link to a more detailed guide on the trade offs and how to setmax_map_countin a privileged init container.