Cloud-on-k8s: Elasticsearch 8.0.0-SNAPSHOT fails at startup due to volume permissions

Created on 31 Mar 2020 · 17 Comments · Source: elastic/cloud-on-k8s

Exception raised at startup:

["org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [/usr/share/elasticsearch/data]; maybe these locations are not writable or multiple nodes were started on the same data path?",

I think that's because the Docker image runs with user elasticsearch by default, whereas it was using user root before (even though the elasticsearch process itself runs as user elasticsearch):

⟩ docker run -ti docker.elastic.co/elasticsearch/elasticsearch:7.6.0 id
uid=0(root) gid=0(root) groups=0(root)
⟩ docker run -ti docker.elastic.co/elasticsearch/elasticsearch:8.0.0-SNAPSHOT id
uid=1000(elasticsearch) gid=0(root) groups=0(root)

>bug

sebgl

All 17 comments

We run an init container to change the owner of the data volume to elasticsearch, but only if the init container runs with the root user:

    # chown the data and logs volume to the elasticsearch user
    # only done when running as root, other cases should be handled
    # with a proper security context
    chown_start=$(date +%s)
    if [[ $EUID -eq 0 ]]; then
        {{range .ChownToElasticsearch}}
            echo "chowning {{.}} to elasticsearch:elasticsearch"
            chown -v elasticsearch:elasticsearch {{.}}
        {{end}}
    fi

In 8.0.0-SNAPSHOT the init container runs with the elasticsearch user, and hence does not have permission to chown the volume.
If I comment out the if condition above, the init container fails with:

chowning /usr/share/elasticsearch/data to elasticsearch:elasticsearch
chown: changing ownership of '/usr/share/elasticsearch/data': Operation not permitted
failed to change ownership of '/usr/share/elasticsearch/data' from root:root to elasticsearch:elasticsearch

sebgl on 31 Mar 2020

pebrc on 31 Mar 2020

I think the way we currently deal with volume permissions is not great: we run an init container to chown the mounted volumes (which are writable by user root) so they belong to the elasticsearch user instead.

I think this would be better handled with securityContext.fsGroup in the pod spec.
This modified podTemplate allows files to be written in the mounted volume by a user with group ID 1000, and works fine with Elasticsearch 8.0.0-SNAPSHOT:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 8.0.0-SNAPSHOT
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        securityContext:
          fsGroup: 1000

I think the right thing to do is replace our custom init container chown mechanism with a default fsGroup that works out of the box.

However I'm not exactly sure of the implications on Openshift. This documentation gives more details.

sebgl on 31 Mar 2020

The difference in Elasticsearch behaviour comes from https://github.com/elastic/elasticsearch/pull/50277, where tini was added to the image as process manager. It does not default to using the root user, which is a good thing IMO.

I've done some tests regarding fsGroup on Openshift 3.11 (using minishift).

First, there is no problem running 8.0.0-SNAPSHOT on Openshift. Openshift changes the default user the container runs with to an arbitrary one for security reasons (in my example: UID 1000140000). It also ensures this arbitrary user is a member of the root group (but is not the root user), which gives it write access to our mounted volume.

In "regular" Kubernetes, by contrast, the elasticsearch user in the container cannot write to the mounted volumes owned by root. To fix this, we can set fsGroup: 1000 in the pod spec, which allows the user with UID 1000 to write in mounted volumes.
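To see whether the user a container runs as can actually write to a mounted path, a small probe like the one below could be run inside the container (for instance through kubectl exec). This is only a sketch: the function name check_writable is made up for illustration and is not part of ECK, and the path used in the example is the data path from the exception above.

```shell
# Hypothetical helper (not part of ECK): report whether the current user
# can create a file in the given directory.
check_writable() {
  local dir="$1"
  local probe="$dir/.write-probe.$$"
  # The redirection runs in a subshell so that a failure cannot abort the caller.
  if ( : > "$probe" ) 2>/dev/null; then
    rm -f "$probe"
    echo "writable"
  else
    echo "not writable"
  fi
}

# Example: the Elasticsearch data path from the exception above.
check_writable /usr/share/elasticsearch/data
```

Creating a probe file is used here instead of test -w because it exercises the same operation Elasticsearch needs when it takes the node lock.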

Setting fsGroup: 1000 on Openshift leads to the Pod not being created at all:

create Pod elasticsearch-sample-es-default-0 in StatefulSet elasticsearch-sample-es-default failed error: pods "elasticsearch-sample-es-default-0" is forbidden: unable to validate against any security context constraint: [fsGroup: Invalid value: []int64{1000}: 1000 is not an allowed group]

One solution to this, detailed in the Openshift docs, is to not use the default restricted SCC, but to create a custom one where group 1000 is allowed (or part of a range that is allowed).

tl;dr:

- we can use fsGroup on most k8s distributions
- we cannot use fsGroup on Openshift unless using a custom SCC

Let's see if we can find a common solution here. In any case, changing permissions in the init container does not feel like the right thing to do.

sebgl on 1 Apr 2020

> we can use fsGroup on most k8s distributions

I think it will work as long as the cluster is not secured. If there is a PSP that restricts the range for the fsGroup _(which is what I would expect on production clusters)_, chances are it will fail the same way, I guess.

barkbay on 1 Apr 2020

You're right @barkbay, this goes beyond the scope of Openshift vs. not Openshift.

The question is more: is there a PSP (or SCC on Openshift) or not?

If I understand this doc correctly, setting an fsGroup automatically makes all processes running in the container (which may or may not be using a runAsUser range) part of the fsGroup supplementary group. So as long as the default PSP specifies an fsGroup (can be a range), the assigned arbitrary user will be able to write in the mounted volumes. So we should just do nothing in this situation.
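The supplementary group membership described above can be checked from inside a running container. The snippet below is a minimal sketch: in_group is a hypothetical helper name, and 1000 is simply the fsGroup value discussed in this thread.

```shell
# Hypothetical helper: succeed if the current process belongs to the given
# numeric group ID, either as its primary or as a supplementary group.
in_group() {
  local wanted="$1" gid
  # `id -G` prints all numeric group IDs of the current process.
  for gid in $(id -G); do
    if [ "$gid" = "$wanted" ]; then
      return 0
    fi
  done
  return 1
}

# 1000 is the fsGroup value discussed in this thread.
if in_group 1000; then
  echo "process is a member of group 1000"
else
  echo "process is NOT a member of group 1000"
fi
```

When a PSP/SCC assigns its own fsGroup, the same check with that group ID should succeed instead.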

When no PSP/SCC is enforced, we probably need to set fsGroup: 1000.

Should we default to one or the other, or try to auto-detect what's best? Users can still override the securityContext in the podTemplate, but picking a default seems hard :(

sebgl on 1 Apr 2020

Assuming we want to rely on securityContext.fsGroup, we can write a dedicated documentation page that explains:

- what is the default applied by ECK
- how to remove that default (if there's one) if you don't want any fsGroup set, because you already rely on a PSP/SCC
- how to override the podTemplate to set your own fsGroup

Regarding ECK defaults, we have several options.

1. Don't set a default `securityContext.fsGroup`

If we don't set a default value, it is likely that:

- Vanilla K8s users with no PSP set will run into trouble
- Openshift users running with the default SCC will not have any problem
- Vanilla K8s users or Openshift users with a custom PSP/SCC may run into trouble if that PSP/SCC does not set the fsGroup

The first point seems quite representative of a quickstart experience, so we probably have to adapt the quickstart to explain what fsGroup is and why it matters.

cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.6.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
    # allow the elasticsearch user in the container to write on mounted volumes
    # remove this if you already have an SCC (Openshift) or PSP (Kubernetes) set
    podTemplate:
      spec:
        securityContext:
          fsGroup: 1000
EOF

It's important to note that this example does not "just work": it will probably work on a basic Kubernetes setup, but not on a basic Openshift setup. Users have to understand they need to remove or comment out some lines in the yaml.

I think if we go down this path we also have to adapt other examples in the rest of the documentation, and also adapt the recipes we have in the GitHub repository.

2. Set a default `securityContext.fsGroup: 1000`

If we set a default value, it is likely that:

- Vanilla K8s users with no PSP set will not have any problem
- Openshift users running with the default SCC will run into trouble
- Vanilla K8s users or Openshift users with a custom PSP/SCC may run into trouble if that PSP/SCC conflicts with the default securityContext

We probably need to adapt the quickstart example to mention the securityContext, especially for Openshift users.

We can either mention it explicitly in the quickstart (and other examples):

cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.6.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
    # # if a PSP (Kubernetes) or SCC (Openshift) is set, ensure ECK does not set any custom
    # # securityContext, by uncommenting the lines below. This is likely to be the case on Openshift.
    # podTemplate:
    #   spec:
    #     securityContext: {}

EOF

Or add a note about it:

cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.6.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
EOF

NOTE: ECK sets a default security context to allow the `elasticsearch` user to access mounted volumes. On Openshift, this conflicts with the default `restricted` SCC. It may also conflict with any custom PSP or SCC configured on your Kubernetes cluster. If that is the case you can disable the default pod security context. See [this page](link) for more details.

We also need to highlight this on the documentation page dedicated to OpenShift.

3. Attempt to set the best default depending on the environment

We could attempt the following:

If we detect ECK is deployed on Openshift, don't set a default pod securityContext.
Otherwise, set a default securityContext.fsGroup: 1000.

Detecting Openshift can probably be done in various ways (an example), but there does not seem to be a documented robust way of doing it.
This has to be done in agreement with the existing RBAC permissions. Requiring additional RBAC permissions for this call feels wrong.
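One possible best-effort check, sketched below, is to look for an Openshift-specific API group in the API discovery output. This rests on assumptions to verify: that Openshift clusters expose a security.openshift.io group (the one serving SCCs), and that API discovery is readable with the operator's existing permissions. The helper name looks_like_openshift is invented for this example.

```shell
# Hypothetical helper: read API group/versions on stdin, one per line, in the
# format printed by `kubectl api-versions`, and succeed if an Openshift
# security API group is listed.
looks_like_openshift() {
  grep -q '^security\.openshift\.io/'
}

# Usage against a live cluster (requires kubectl and a valid kubeconfig):
#   if kubectl api-versions | looks_like_openshift; then
#     echo "probably Openshift: do not set a default fsGroup"
#   fi
```

A match only says that the API group exists, not which SCC will apply to the Elasticsearch Pods, so this stays a heuristic.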

It is also an option to list the existing PSP/SCC in the cluster, and not set the securityContext if the returned list is not empty, but that would require an explicit RBAC permission that may differ in K8s vs. Openshift.

It is also possible to pass an explicit flag to the operator so it never sets a default securityContext, which would have to be documented especially for Openshift.

In any case, we probably still need to add the following to the quickstart:

NOTE: ECK sets a default pod `securityContext` to allow the `elasticsearch` user to access mounted volumes, unless deployed on Openshift. This may conflict with existing PSP and SCC. See [this page](link) for more information on how to disable it.

sebgl on 3 Apr 2020

My inclination would be for option 3. It makes things tricky since there is an implicit/non-robust decision being made by the operator, but it gives the easiest quickstart experience with less overhead in the documentation.
I think it is worth trying to find the best way to do a best-effort Openshift detection on operator startup, with limited RBAC permission.

If that ends up being too complicated, my second choice would be option 2. In short: favor a quickstart experience on an unsecured k8s cluster, and try to redirect other users (including Openshift users) to a dedicated doc page about disabling the securityContext.

sebgl on 3 Apr 2020

A few things we discussed today with @nkvoll @pebrc @anyasabo. No decision reached yet:

- Elasticsearch moving to not use root as a default user feels like a good thing. We don't want to revert that decision.
- Auto-detecting if we're running on Openshift (option 3 above) feels wrong, and will be hard to do reliably.
- Option 1 above (don't set a default fsGroup, specify one in the quickstart) makes the quickstart quite verbose. Also, if we remove the current init container chown process, it will break existing clusters on ECK upgrade.
- Option 2 above (default to fsGroup: 1000, can be overridden to empty in the podTemplate) is tempting since Openshift users already have to go through a specific documentation page. In the quickstart we would just add a note with a link to a documentation page dedicated to that setting. Existing Openshift clusters and/or secured k8s clusters will likely be broken on ECK upgrade.
- We could make this a setting at the operator level, either as a boolean flag (--set-default-elasticsearch-fsgroup) or as a more complete parameter ({securityContext: fsGroup: 1000}). The latter probably requires a dedicated configuration file. Also we may want to extend this setting to other resources (Kibana, APMServer, etc.) for consistency.
- If we end up picking option 1 or 2, existing clusters will be rolled out with the new settings. The rolling upgrade will fail on the first Pod if the securityContext conflicts with the default SCC/PSP. To avoid this situation, we may want to synchronise that change with Elasticsearch 8.0.0? It feels complicated to understand the ECK version default/ES version mapping though.

sebgl on 7 Apr 2020

I think I'm leaning towards the following:

- Add a boolean flag to the operator arguments: --set-default-fsgroup=true. Why boolean and not an int (e.g. 1000)? Because that value may differ for each resource managed by ECK, so it would then need to be a per-Kind flag, or a more complex yaml configuration.
- Set this flag to true by default in ECK manifests. Add a dedicated doc page explaining this flag, and how its behaviour can also be overridden in the podTemplate, which then takes priority. Specify in the Openshift docs that this flag should most likely be overridden to false.
- The above implicitly means that we optimize our defaults for non-secured k8s setups, and not for Openshift and k8s clusters with PSPs. We can eventually decide to provide different manifests for Openshift.
- Attempt a best-effort Pod creation dry-run to better surface the error if an incompatible PSP/SCC exists in operator logs and events attached to the resource.
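To surface a clearer error from a failed (dry-run) Pod creation, the returned message could be classified along these lines. This is a sketch under assumptions: the patterns are derived only from the Openshift SCC error quoted earlier in this thread, a PSP rejection may be worded differently, and classify_pod_error is a hypothetical name.

```shell
# Hypothetical helper: classify a Pod creation error message.
# Prints "fsgroup-rejected" when the message looks like a PSP/SCC refusing
# the fsGroup, and "other" for anything else.
classify_pod_error() {
  case "$1" in
    *fsGroup*"Invalid value"*|*fsGroup*"not an allowed group"*)
      echo "fsgroup-rejected" ;;
    *)
      echo "other" ;;
  esac
}

# Possible usage with a server-side dry-run (needs a recent kubectl and a
# cluster that supports dry-run; pod.yaml is a placeholder file name):
#   err="$(kubectl apply --dry-run=server -f pod.yaml 2>&1)" || classify_pod_error "$err"
```

Matching on message text is brittle, so a result of "other" should not be read as proof that the fsGroup is accepted.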

I'm not sure whether this should apply to all stack versions, or only apply to 8.0+ so we don't break compatibility with existing pre-8.0 clusters.

sebgl on 27 Apr 2020

I think I agree Seb. I'm less sure on only implementing it for 8.0+. It's nice because it is "simple" from a user experience -- if an existing user with PSPs upgrades to the version of ECK that includes the automatic fsGroup setting, it only fails for new 8.x clusters, or fails on the first pod during a rolling upgrade to 8.x. So the impact is minimal and should be relatively easy to notice by users.

I think it's less nice because of the complexity involved. We're arguably not doing the "right thing" now in <8.0 by using the init container instead of the native feature that does what we want it to do. There are even more differences between 6.x/7.x and 8.x that aren't really related to actual Elasticsearch and are more related to how Elasticsearch is packaged. Being consistent wherever we can is nice just to minimize mental load both for us and our users (who have to keep a mental map of what behavior we default to across different ES versions).

Downsides of defaulting fsGroups for all versions:

- existing users with PSPs/SCCs who blindly upgrade to a new ECK version without flipping the toggle will have their existing clusters broken as the pods cannot start
  - this is mitigated by the rolling upgrade process -- only one pod will go unavailable (by default). That said, users who did not read the upgrade notes may not notice that one pod is unavailable, since from their perspective they did not change anything in Elasticsearch
  - We can also mitigate this in K8s 1.13+ with the dry run as mentioned. Even if they still have the fsGroup toggle enabled, if we can detect that a pod cannot be created because of the fsGroup (I'm not sure what kind of feedback it gives you), then we can proceed as if the user did switch the fsGroup toggle off. Users on k8s <1.13 would not have this mitigation and would experience the single pod going down.

Overall I think I'm okay with making this change for all ES versions, but could still be persuaded otherwise.

anyasabo on 27 Apr 2020

👍 on using the flag @sebgl proposed.
Arguments for using the fsGroup mechanism only as of ES 8.0+, imo:

- less potential for disruption on existing clusters
- a major version upgrade is usually something users plan to a certain extent, and chances are higher that they realise/read about the necessity to have ECK configured correctly before moving forward. A minor version upgrade or an ECK upgrade might be taken more lightly and lead to surprises if suddenly all clusters are stuck in a rolling upgrade

Side-note: I am still a bit worried about the number of flags we add to the binary (17 atm) and still think we should consider a configuration file. But maybe not for the ones that are feature toggles like this one but for the configuration values like cert validity and such.

pebrc on 29 Apr 2020

If we use a flag I think we will have to make a choice regarding the operator hub:

Either:

- the flag is not set and the upgrade from the operator hub will break the experience for the non-(openshift|psp) users
- we do the opposite and it breaks the experience for Openshift users

barkbay on 30 Jun 2020

A few things we discussed with the team:

- There seems to be no way around making life slightly harder for _some_ users (Openshift users, or vanilla k8s users, or k8s PSP users).
- Upgrading ECK should not break existing clusters. Binding the fsGroup change to stack version 8.0.0 seems to be a reasonable way to alleviate this concern.
- It feels more natural to change the value of an operator-level setting in a configMap rather than changing the operator binary args in the StatefulSet spec.
- We could add a note in the quickstart docs, right after installing ECK, about how the setting needs to be changed for some users (Openshift and k8s with PSP enabled). So far we have optimized ECK for a smooth quickstart experience on vanilla k8s clusters with no particular security enforcement.
- We could decide to _force_ users to make an explicit choice for that setting (set-default-fsgroup=true|false). If not set, we would error out in the reconciliation of Elasticsearch 8.0.0. The downside of this approach is that people need to care about and understand this fsGroup thing when installing ECK for the first time (or upgrading). It moves us away from a very simple quickstart experience, which would be a big loss.
- We can try to surface any error coming from the fsGroup setting not being set to the right value. If an SCC/PSP enforces fsGroup to not be 1000, the Pod creation will fail, and an error will be reported in the StatefulSet's events. Unfortunately that's not easy to discover, even from within the operator. Pod creation dry-run can help grab a better error message, but the feature is not available on all k8s environments. We could also try to detect, at ECK startup, if we're running on Openshift. If that's the case, and set-default-fsgroup=true, we can output an explicit warning.

sebgl on 1 Jul 2020

What we agreed on with the team (basically summarizes the discussion above):

- operator-level --set-default-fsgroup=true|false flag (defaults: true, also in operator hub) - https://github.com/elastic/cloud-on-k8s/pull/3342
- only set the Elasticsearch fsGroup starting 8.0.0
- marked as breaking change, since it breaks expectations for Openshift users starting 8.0.0
- dedicated documentation page to explain how to disable this for Openshift & non-PSP users, linked from the quickstart
- as part of ECK 1.3 so we release way before 8.0.0 (new Openshift users will likely notice in the quickstart while deploying a new 7.x)
- emit warnings if we suspect the setting is wrongly set
- best-effort autodetect openshift at ECK startup
- double-check pod creation dry-run fails if available
- eventually a setting available in a configMap (not only a flag)
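The first two points can be summarised as a small decision function. The sketch below only illustrates the agreed rule and is not the operator's actual code: default_fs_group and its arguments are invented for this example, while the flag values, the 8.0.0 threshold and the value 1000 come from the discussion above.

```shell
# Hypothetical helper mirroring the agreed behaviour. Arguments:
#   $1: value of the operator flag ("true" or "false")
#   $2: Elasticsearch version, e.g. "7.6.2" or "8.0.0-SNAPSHOT"
# Prints the fsGroup to apply by default, or nothing if none should be set.
default_fs_group() {
  local flag="$1" version="$2" major
  major="${version%%.*}"        # text before the first dot
  case "$major" in
    ''|*[!0-9]*) return 1 ;;    # not a number: refuse to guess
  esac
  if [ "$flag" = "true" ] && [ "$major" -ge 8 ]; then
    echo 1000
  fi
}

# Example calls:
#   default_fs_group true 8.0.0-SNAPSHOT   # prints 1000
#   default_fs_group true 7.6.2            # prints nothing
#   default_fs_group false 8.0.0           # prints nothing
```

As noted earlier in the thread, a securityContext set explicitly in the podTemplate would still take priority over this default.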

sebgl on 28 Jul 2020

- [x] operator-level --set-default-fsgroup=true|false flag (defaults: true, also in operator hub) - #3342
- [x] only set the Elasticsearch fsGroup starting 8.0.0
- [x] marked as breaking change, since it breaks expectations for Openshift users starting 8.0.0
- [ ] dedicated documentation page to explain how to disable this for Openshift & non-PSP users, linked from the quickstart
  as part of ECK 1.3 so we release way before 8.0.0 (new Openshift users will likely notice in the quickstart while deploying a new 7.x)
- [ ] emit warnings if we suspect the setting is wrongly set
- [ ] best-effort autodetect openshift at ECK startup
- [ ] double-check pod creation dry-run fails if available
- [x] [eventually](https://github.com/elastic/cloud-on-k8s/issues/3401) a setting available in a configMap (not only a flag)

@sebgl just trying to figure out where we stand with this issue. I think I ticked all the right boxes.

pebrc on 30 Nov 2020

@pebrc yes! Unassigning myself here since not really working on this at the moment.

sebgl on 30 Nov 2020
