Cloud-on-k8s: Move setVMMaxMapCount from top-level in spec

Created on 11 Sep 2019 · 22Comments · Source: elastic/cloud-on-k8s

We think we are giving this attribute too much of prominence given that it is a hack to enable a smooth getting started experience.

Part of https://github.com/elastic/cloud-on-k8s/issues/1141

Move the existing attribute one level down

kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 7.2.0
  unsafe: # tweaks, hacks, 
    setVmMaxMapCount: true 
  nodeSets:
  - name: default
    count: 3

Why not just securityContext?

Just using PodSecurityContext does not work because of the sysctl we want to use is typically not whitelisted thereby defeating the purpose of this exercise.

apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 7.3.0  
  nodeSets:
  - name: default
    podTemplate:
      spec:
        securityContext:
          sysctls:
          - name: vm.max_map_count
            value: "262144"
   count: 1

Annotations

Looks like on OpenShift you can use annotations as well, so there is some precedence

apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
  annotations:
    security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1
    security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3
spec:

This has the downside of creating an implicit API

discuss v1.0.0-beta1

Source

pebrc

Most helpful comment

My vote: the example has allow_mmap:false with a link to a more detailed guide on the trade offs and how to set max_map_count in a privileged init container.

anyasabo on 1 Oct 2019

👍3

All 22 comments

Promoting tweaking the host in an init container feels a little rough (but can totally be done, IF you accept the security risks associated with it), so I'm not sure we should be strongly advocating it. This is an ES feature (both memory mapped files -- a lot of them, and bootstrap checks -- yay!), which can be disabled via config (https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html#allow-mmap):

You can restrict the use of the mmapfs and the related hybridfs store type via the setting node.store.allow_mmap. This is a boolean setting indicating whether or not memory-mapping is allowed. The default is to allow it. This setting is useful, for example, if you are in an environment where you can not control the ability to create a lot of memory maps so you need disable the ability to use memory-mapping.

I think we can make the argument that we can instruct our users to do either 1) set the sysctl setting in one of several ways, or 2) disable mmap because if you are in an environment where you can not control the ability to create a lot of memory maps. The description seem to fit.

Thus, I propose the following:

Remove setVmMaxMapCount from the spec. It's promoting bad behavior from containers.
Provide something ala this as a quickstart:

cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.2.0
  nodes:
  - name: default
    nodeCount: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      ## Hosts running ES should set vm.max_map_count according to
      ## https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
      ##
      ## If you're not able to configure this, node.store.allow_mmap can be set to false to
      ## disable memory mapped files to avoid this requirement, at the cost of performance
      ## (see https://www.elastic.co/guide/en/elasticsearch/reference/current/_maximum_map_count_check.html)
      #node.store.allow_mmap: false
EOF

We should probably discuss within the team whether we'd like to have this uncommented by default (for the quickstart only), and if we'd like to further document this (e.g add a section to "troubleshooting" that explains it further and allows for perhaps easier discovery when running into the issue outside of the quickstart).

A follow-up question should go to the ES team whether they are fine with this as the quickstart contents, or whether there are better ways we can/should deal with this. (e.g potentially allowing for a lower number of mem-mapped files for small clusters in the bootstrap-check?)

nkvoll on 13 Sep 2019

👍1

Promoting tweaking the host in an init container feels a little rough

Agreed. However, this has been in my experience the only way to make host-level changes on GKE. To phrase this another way, 100% of GKE users[1] are likely to need this sysctl vm.max_map_count setting.

In past explorations, I've found folks requesting things like a "DaemonJob" which would act like a DaemonSet (schedule on all nodes) but otherwise act like a Job (run once). I don't think anyone's got a solution for a 'daemon job' style thing.

An initContainer with elevated privileges is the shortest path to declare "In order for this pod to be successful, the host needs to be configured a certain way." There's been recent work to move sysctl stuff (as mentioned above in the description) with a PodSecurityPolicy, and maybe we can do something along those lines and then set this in the Elasticsearch pod's securityContext?

[1] I don't know how common it is for someone to run GKE but also do config management on their nodes (like with puppet or ansible). On Infosec, we run GKE's default OS image (Google Container-optimized OS) and it has a number of properties, like the filesystem changes revert on reboot and upgrade, which make it hard or annoying to do this.

I think our default behavior should be the same as our best practices. In this case, do change the vm.max_map_count setting unless we can't find any way to have it by default. Otherwise, I feel that our user experience especially involving support would always open the door on a sour note - a user experiencing a performance issue, and we may blame the lack of correct vm.max_map_count.

jordansissel on 18 Sep 2019

Checking the helm chart, it has an optional initContainer to set the sysctl settings.

The helm chart docs are interestingly worded to imply that you are _expected_ to set vm.max_map_count somehow, whether by the initContainer or by some other means.

sysctlInitContainer.enabled | Allows you to disable the sysctlInitContainer if you are setting vm.max_map_count with another method

The way I read this implies that it is not optional (though we know one can choose niofs instead of mmapfs, but our docs warn this has bugs on windows

Some prior discussions for this optional sysctl initContainer on the helm chart, if that helps:

jordansissel on 18 Sep 2019

👀1

disable mmap because if you are in an environment where you can not control the ability to create a lot of memory maps

I feel like this is waaaaaay below the Kubernetes abstraction layer. A user is thinking about deploying Elasticsearch, and now we're forcing them to think about weird words like "mmap" and "sysctl". I don't think we should hide these details, but I wonder if we could suggest something on the same abstraction level as Kubernetes? Something like:

In order to achieve best filesystem performance, you should set vm.max_map_count sysctl. You can have ECK do this for you if you install the appropriate PodSecurityPolicy

{example pod security policy}

If you can't install this PodSecurityPolicy, you will need to set an appropriate sysctl vm.max_map_count on all kubelet nodes yourself, or you will need to set [some Elasticsearch CRD setting] to false which causes Elasticsearch to avoid mmap, resulting in slower filesystem performance.

It's possible the operator could detect these situations on its own (Does an appropriate PodSecurityPolicy exist? etc)

It'd be ideal if users didn't have to think about this ;)

jordansissel on 18 Sep 2019

However, this has been in my experience the only way to make host-level changes on GKE.

Does something provider specific not work, or am I missing your point?
For example:
gcloud compute ssh $USER:$INSTANCE --project=$GCP_PROJECT --zone=$ZONE --command="sudo sysctl -w vm.max_map_count=262144"

pebrc on 18 Sep 2019

This works, manually. But we have dozens of GKE nodes coming and going week
to week. We would have to build a system for detecting new GKE nodes and
deploying this sysctl change quickly before we could deploy elasticsearch.
It adds another step, which adds complexity and room for error.

On Wed, Sep 18, 2019 at 1:25 AM Peter Brachwitz notifications@github.com
wrote:

However, this has been in my experience the only way to make host-level
changes on GKE.

Does something provider specific not work, or am I missing your point?
For example:
gcloud compute ssh $USER:$INSTANCE --project=$GCP_PROJECT --zone=$ZONE
--command="sudo sysctl -w vm.max_map_count=262144"

—
You are receiving this because you commented.

Reply to this email directly, view it on GitHub
https://github.com/elastic/cloud-on-k8s/issues/1712?email_source=notifications&email_token=AABAF2VU46XLLCW5KQJIFPDQKHQX7A5CNFSM4IVV52JKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD67H6MY#issuecomment-532578099,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABAF2Q3PMZKMX3X6DWWEO3QKHQX7ANCNFSM4IVV52JA
.

jordansissel on 18 Sep 2019

👍1

@jordansissel you can configure the GKE nodes bootstrap and add the sysctl setting there, so it happens e.g before the kubelet even starts. This makes the most sense to me: configuring host level settings on the host, at bootstrap.

This is a host-level sysctl (not namespaced), so one container setting it will affect all others on the host, so setting it in an init-container and hoping nothing overwrites it with something incompatible doesn't feel like an obvious/perfect solution. Assuming possible multi-tenancy/workloads here.

Nothing would prevent you from including such an init container on your own though if you accept the tradeoff (a few lines of YAML) is sufficient:

apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.2.0
  nodes:
  - name: default
    nodeCount: 1
    podTemplate:
      spec:
        initContainers:
        - name: sysctl
          securityContext:
            privileged: true
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']

In the end, this is going to have to be a trade-off between ease of use, security, isolation and we need to have this discussion in order to pick the default behavior that we believe makes the most sense. There's different kinds of users that will have different expectations here, and it looks like all options currently on the table seems very odd to at least one of the groups?

nkvoll on 18 Sep 2019

👍3

Also to note, when a GKE node reboots, it resets all changes. So we would
have to make sure this is detected and sysctl run before elasticsearch
would come back online.

The best way to do this on GKE is with k8s resources (daemonset or
initContainer)

On Wed, Sep 18, 2019 at 7:12 AM Jordan Sissel notifications@github.com
wrote:

This works, manually. But we have dozens of GKE nodes coming and going week
to week. We would have to build a system for detecting new GKE nodes and
deploying this sysctl change quickly before we could deploy elasticsearch.
It adds another step, which adds complexity and room for error.

On Wed, Sep 18, 2019 at 1:25 AM Peter Brachwitz notifications@github.com
wrote:

However, this has been in my experience the only way to make host-level
changes on GKE.

Does something provider specific not work, or am I missing your point?
For example:
gcloud compute ssh $USER:$INSTANCE --project=$GCP_PROJECT --zone=$ZONE
--command="sudo sysctl -w vm.max_map_count=262144"

—
You are receiving this because you commented.

Reply to this email directly, view it on GitHub
<
https://github.com/elastic/cloud-on-k8s/issues/1712?email_source=notifications&email_token=AABAF2VU46XLLCW5KQJIFPDQKHQX7A5CNFSM4IVV52JKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD67H6MY#issuecomment-532578099
,
or mute the thread
<
https://github.com/notifications/unsubscribe-auth/AABAF2Q3PMZKMX3X6DWWEO3QKHQX7ANCNFSM4IVV52JA

.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/elastic/cloud-on-k8s/issues/1712?email_source=notifications&email_token=AABAF2VJC53Q2ESLBBAYXMTQKIZN5A5CNFSM4IVV52JKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7AGQ4Y#issuecomment-532703347,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABAF2V5KPH6FX5QO4K65DDQKIZN5ANCNFSM4IVV52JA
.

jordansissel on 18 Sep 2019

you can configure the GKE nodes bootstrap and add the sysctl setting there

How? I'm unable to find any boot/configuration settings for Google Container OS without building my own image.

this is going to have to be a trade-off between ease of use, security, isolation

Agreed. If we can still use a custom initContainer or securityContext (assuming a PSP will work for this), that should help us continue upgrading ECK without tons of development effort (ie; we don't have to build custom GKE node images, etc).

Is there still a "quick start" getting started experience without this sysctl hack? if it's removed, Elasticsearch won't start, unless we disable allow_mmap by default, which is not suitable for production and will be blamed quickly in support scenarios[1].

[1] when seeking advice on our own ECK clusters, the first reaction is _always_ "Stop using Google Persistent Disk and use local disks" even before the situation is evaluated to form a hypothesis. If we disable mmap by default, I believe we'll open the same scenario for ECK users, an unevaluated "Use hybridfs instead of niofs" presented when any performance concerns are raised. This will provide a bad experience for users who may not have chosen this outcome.

Everyone seems to have to set sysctl their own special way: initContainer, securityContext w/ an PodSecurityPolicy, by managing the OS the kubelet runs on, etc. This feels weird. Can we bring this problem upstream to Kubernetes?

Can we revisit with the Elasticsearch team to allow Elasticsearch to run with the default vm.max_map_count? The default on GKE is 65536, and our minimum "required" is 4x that. The default on Ubuntu 18.04 seems to be 65530. Is 262144 really some magic number and 65536 is wrong?

jordansissel on 18 Sep 2019

How? I'm unable to find any boot/configuration settings for Google Container OS without building my own image.

This was my understanding as well and what led me to be okay with keeping the init container as an option. It appears startup scripts are still blacklisted:

https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1/NodeConfig

And while people running their own k8s will likely be building their own images, I really doubt that will be the case for most people on managed k8s platforms.

anyasabo on 18 Sep 2019

Can we revisit with the Elasticsearch team to allow Elasticsearch to run with the default vm.max_map_count

Found a recent discussion where this was rejected: https://github.com/elastic/elasticsearch/issues/45815

jordansissel on 18 Sep 2019

Just in case I can't make meeting tomorrow, here is my overall thinking on this:

I do think it is useful to have this flag, given that it is a pain to configure this in a hosted environment like GKE (and potentially also AKS, EKS).
No strong opinion on if it defaults to true or false. I think I lean towards false because I don't think I want to encourage privileged pods that change the host by default. I can see the other argument that it is helpful to default to true from a quickstart perspective though, but I don't think that outweighs encouraging an overall not great practice of allowing all privileged pods.
I am in favor of moving it to a lower level. I don't particularly care what key we put it under. experimental, tweaks, and unsafe all work for me, and probably prefer them in roughly that order.
I don't know enough about the trade offs to know if we can encourage niofs by default (and so avoid the mmap bootstrap check). Even if we do, I don't think that will change my mind on still wanting it to be an option but behind a key like experimental etc, because there may still be relevant use cases for mmapfs that would otherwise be a pain to use.

anyasabo on 23 Sep 2019

I'm thinking that for a few reasons, maybe we should default node.store.allow_mmap to false. Here's my reasoning:

1) Allowing mmap requires vm.max_map_count set correctly. This requires one of these solutions:

k8s node administrator activity (outside ECK)
alternately, allow pods to set sysctls
alternately, use a privileged initContainer to set sysctl
2) There are too many ways for folks to set this sysctl, making it very difficult to have a "quickstart" under this scenario
3) Some users may not have any opportunity to set vm.max_map_count (due to policy or whatever, preventing this setting)

On the quickstart note, I feel the best way we can be sure everyone is successful is to set allow_mmap to false. We can also write a "going to production" document. We have such a thing for Kibana, and Google has one for GKE.

Thoughts? I'll ping #elasticsearch and see if they have recommendations here.

jordansissel on 23 Sep 2019

I could definitely be persuaded. The docs I saw didn't make it clear what the trade off of running only niofs was, just that one could do it. I'm assuming there's a reason for the hybridfs setting. If there's no real downside, the portability benefits are compelling.

That said I don't think it changes my opinion on the setVmMaxMapCount feature -- we can still maintain it but put it behind an experimental flag of some sort.

anyasabo on 23 Sep 2019

On the quickstart note, I feel the best way we can be sure everyone is successful is to set allow_mmap to false. We can also write a "going to production" document. We have such a thing for Kibana, and Google has one for GKE.

What about keeping the Elasticsearch default (node.store.allow_mmap: true) and just disable it in the quickstart example (as suggested in https://github.com/elastic/cloud-on-k8s/issues/1712#issuecomment-531232298)

same defaults everywhere (ECK, outside of ECK)
quickstart will "just work"
we can remove the setting from the CRD
we can explain the reason for disabling in a footnote of the quickstart guide (not great I know)
we add a section about production configuration outlining the various ways to configure the kernel for Elasticsearch (privileged initContainer, pod security context, daemonset, k8s admin)

pebrc on 23 Sep 2019

Reading this issue it doesn't seem like any of us feel comfortable making the recommendation to do that just yet though, and would need to discuss with the ES team. It seems like that would affect a bunch of people's opinions on the setVmMaxMapCount flag in ECK , so I'm in favor of tabling this until we get an opinion from them.

anyasabo on 23 Sep 2019

so I'm in favor of tabling this until we get an opinion from them.

I think Njals suggestion to turn off allow_mmap for the quickstart resulted from a conversation with @jasontedor and I think the desire to keep the defaults the same across all platforms was also expressed.

pebrc on 23 Sep 2019

That's correct, @nkvoll and I discussed this a few weeks ago, and disabling node.store.allow_mmap is the path that is least likely to give users a bad experience. Yes, there will be a performance difference, in a lot of use cases it won't be noticeable, especially for a quickstart experience. Keeping the defaults the same is also a goal, we don't really want users to have a different experience depending on how they deploy Elasticsearch, so I'm supportive of only making this for the quickstart.

jasontedor on 23 Sep 2019

Ah I must have misread then. Given that, I think I'm good with your suggestion here:

What about keeping the Elasticsearch default (node.store.allow_mmap: true) and just disable it in the quickstart example (as suggested in #1712 (comment))

Though I think it may still be useful to keep the functionality in ECK but buried under experimental so people who are okay with it can still use it.

@jasontedor just to make sure I understand, your opinion is that for the quickstart we should disable allow_mmap, but we still should have documentation on how to enable it and strongly recommend that it be enabled in production? Is that correct?

anyasabo on 23 Sep 2019

After some more discussions offline we decided to

remove the setVmMaxMapCount boolean from the CRD.
adjust the documentation to ensure our quickstart still gives a good copy/pasteable experience.
document the various ways to set sysctl settings on k8s

There is still some disagreement as to what the best approach is for the documentation part.

The two suggestions are at the moment:

just use allow_mmap

Screenshot 2019-09-24 at 22 06 06

use an explicit init container
```yaml
cat <
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
name: quickstart
spec:
version: 7.2.0
nodes:
- name: default
  
  nodeCount: 1
  
  config:
  
  node.master: true
  
  node.data: true
  
  node.ingest: true
  
  ## Hosts running ES should set vm.max_map_count according to
  
  ## https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
  
  ##
  
  ## If you're not able to configure this, node.store.allow_mmap can be set to false to
  
  ## disable memory mapped files to avoid this requirement, at the cost of performance
  
  ## (see https://www.elastic.co/guide/en/elasticsearch/reference/current/_maximum_map_count_check.html)
  
  #node.store.allow_mmap: false
  
  podTemplate:
  
  spec:
  
  initContainers:
  - name: sysctl
    
    securityContext:
    
    privileged: true
    
    command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
    
    EOF

pebrc on 24 Sep 2019

My vote: the example has allow_mmap:false with a link to a more detailed guide on the trade offs and how to set max_map_count in a privileged init container.

anyasabo on 1 Oct 2019

👍3

My vote:
Option 1 looks simpler, especially if we document it as in @pebrc screenshot, and provide a dedicated documentation page.
Option 2 looks way more verbose and advanced, but should definitely be documented in that dedicated doc page.

(edit: @anyasabo says it all in a single sentence 😄 )