Exception raised at startup:
["org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [/usr/share/elasticsearch/data]; maybe these locations are not writable or multiple nodes were started on the same data path?",
I think that's because the Docker image now runs as user elasticsearch by default, whereas it previously ran as user root (even though the elasticsearch process itself already ran as user elasticsearch):
⟩ docker run -ti docker.elastic.co/elasticsearch/elasticsearch:7.6.0 id
uid=0(root) gid=0(root) groups=0(root)
⟩ docker run -ti docker.elastic.co/elasticsearch/elasticsearch:8.0.0-SNAPSHOT id
uid=1000(elasticsearch) gid=0(root) groups=0(root)
We run an init container to change the owner of the data volume to elasticsearch, but only if the init container runs with the root user:
# chown the data and logs volume to the elasticsearch user
# only done when running as root, other cases should be handled
# with a proper security context
chown_start=$(date +%s)
if [[ $EUID -eq 0 ]]; then
{{range .ChownToElasticsearch}}
echo "chowning {{.}} to elasticsearch:elasticsearch"
chown -v elasticsearch:elasticsearch {{.}}
{{end}}
fi
In 8.0.0-SNAPSHOT the init container runs with the elasticsearch user, hence does not have permission to chown the volume.
If I comment out the if condition above, the init container fails with:
chowning /usr/share/elasticsearch/data to elasticsearch:elasticsearch
chown: changing ownership of '/usr/share/elasticsearch/data': Operation not permitted
failed to change ownership of '/usr/share/elasticsearch/data' from root:root to elasticsearch:elasticsearch
I think the way we currently deal with volume permissions is not great: we run an init container to chown the mounted volumes (writable only by user root) so they belong to the elasticsearch user instead.
I think this would be better handled with securityContext.fsGroup in the pod spec.
This modified podTemplate allows files to be written in the mounted volume by a user with group ID 1000, and works fine with Elasticsearch 8.0.0-SNAPSHOT:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 8.0.0-SNAPSHOT
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        securityContext:
          fsGroup: 1000
I think the right thing to do is replace our custom init container chown mechanism by a default fsGroup that works out of the box.
However I'm not exactly sure of the implications on Openshift. This documentation gives more details.
The difference in Elasticsearch behaviour comes from https://github.com/elastic/elasticsearch/pull/50277, where tini was added to the image as process manager. It does not default to using the root user, which is a good thing IMO.
I've done some tests regarding fsGroup on Openshift 3.11 (using minishift).
First, there is no problem running 8.0.0-SNAPSHOT on Openshift. Openshift changes the default user the container runs with to an arbitrary one for security reasons (in my example: UID 1000140000). It also ensures this arbitrary user is a member of the root group (but is not the root user), which gives it write access to our mounted volume.
In "regular" Kubernetes, by contrast, the elasticsearch user in the container cannot write to the mounted volumes owned by root. To fix this, we can set fsGroup: 1000 in the pod spec: the mounted volumes then become writable by group 1000, which is added to the supplementary groups of the container processes, so the user with UID 1000 can write to them.
Setting fsGroup: 1000 on Openshift leads to the Pod not being created at all:
create Pod elasticsearch-sample-es-default-0 in StatefulSet elasticsearch-sample-es-default failed error: pods "elasticsearch-sample-es-default-0" is forbidden: unable to validate against any security context constraint: [fsGroup: Invalid value: []int64{1000}: 1000 is not an allowed group]
One solution to this, detailed in the Openshift docs, is to not use the default restricted SCC, but to create a custom one where group 1000 is allowed (or part of a range that is allowed).
tl;dr:
- we can use fsGroup on most k8s distributions
- we cannot use fsGroup on Openshift unless using a custom SCC

Let's see if we can find a common solution here. In any case, changing permissions in the init container does not feel like the right thing to do.
Related k8s issue: https://github.com/kubernetes/kubernetes/issues/2630.
> we can use fsGroup on most k8s distributions
I think it will work as long as the cluster is not secured. If there is a PSP that restricts the range for the fsGroup _(which is what I would expect on production clusters)_, chances are it will fail the same way I guess.
You're right @barkbay this goes beyond the scope of Openshift vs. not Openshift.
The question is more: is there a PSP (or SCC on Openshift) or not?
IIUC from this doc, setting an fsGroup automatically makes all processes running in the container (which may or may not be using a runAsUser range) part of the fsGroup supplementary group. So as long as the default PSP specifies an fsGroup (can be a range), the assigned arbitrary user will be able to write in the mounted volumes. So we should just do nothing in this situation.
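One way to check this on a running Pod is to look at the numeric groups of the container process, for example the output of `id -G` obtained through `kubectl exec`. The snippet below is only a sketch, assuming bash; the helper name `has_group` is made up for illustration and is not part of ECK or Kubernetes.

```shell
#!/usr/bin/env bash
# has_group GID "GROUP LIST": succeeds if GID appears in a space-separated
# list of numeric group IDs, such as the output of `id -G`.
# Hypothetical helper, for illustration only.
has_group() {
  local wanted="$1" list="$2" gid
  # Intentionally unquoted: split the list on whitespace.
  for gid in $list; do
    if [[ "$gid" == "$wanted" ]]; then
      return 0
    fi
  done
  return 1
}

# Example usage against a running Pod (pod name taken from the error above):
#   if has_group 1000 "$(kubectl exec elasticsearch-sample-es-default-0 -- id -G)"; then
#     echo "container process is a member of group 1000"
#   fi
```

If group 1000 shows up in that list while the process runs with an arbitrary UID, the fsGroup was applied as a supplementary group.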
When no PSP/SCC is enforced, we probably need to set fsGroup: 1000.
Should we default to one or the other, or try to auto-detect what's best? Users can still override the securityContext in the podTemplate, but picking a default seems hard :(
Assuming we want to rely on securityContext.fsGroup, we can write a dedicated documentation page that explains:
Regarding ECK defaults, we have several options.
Option 1: no default securityContext.fsGroup
If we don't set a default value, it is likely that:
… fsGroup

The first point seems quite representative of a quickstart experience, so we probably have to adapt the quickstart to explain what fsGroup is and why it matters.
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.6.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
    # allow the elasticsearch user in the container to write on mounted volumes
    # remove this if you already have an SCC (Openshift) or PSP (Kubernetes) set
    podTemplate:
      spec:
        securityContext:
          fsGroup: 1000
EOF
It's important to notice this example does not "just work": it will probably work on a basic Kubernetes setup, but not on a basic Openshift setup. Users have to understand they need to remove or comment out some lines in the yaml.
I think if we go down this path we also have to adapt other examples in the rest of the documentation, and also adapt the recipes we have in the GitHub repository.
Option 2: default securityContext.fsGroup: 1000
If we set a default value, it is likely that:
… securityContext

We probably need to adapt the quickstart example to mention the securityContext, especially for Openshift users.
We can either mention it explicitly in the quickstart (and other examples):
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.6.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
    # # if a PSP (Kubernetes) or SCC (Openshift) is set, ensure ECK does not set any custom
    # # securityContext, by uncommenting this line. This is likely to be the case on Openshift.
    # podTemplate:
    #   spec:
    #     securityContext: {}
EOF
Or add a note about it:
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.6.2
  nodeSets:
  - name: default
    count: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      node.store.allow_mmap: false
EOF
NOTE: ECK sets a default security context to allow the `elasticsearch` user to access mounted volumes. On Openshift, this conflicts with the default `restricted` SCC. It may also conflict with any custom PSP or SCC configured on your Kubernetes cluster. If that is the case you can disable the default pod security context. See [this page](link) for more details.
We also need to highlight this on the documentation page dedicated to OpenShift.
Option 3: best-effort Openshift detection
We could attempt the following:
- if Openshift is detected, do not set any default securityContext.
- otherwise, set securityContext.fsGroup: 1000.

Detecting Openshift can probably be done in various ways (an example), but there does not seem to be a documented robust way of doing it.
This has to be done in agreement with the existing RBAC permissions. Requiring additional RBAC permissions for this call feels wrong.
It is also an option to list the existing PSP/SCC in the cluster, and not set the securityContext if the returned list is not empty, but that would require an explicit RBAC permission that may differ in K8s vs. Openshift.
It is also possible to pass an explicit flag to the operator so it never sets a default securityContext, which would have to be documented especially for Openshift.
In any case, we probably still need to add the following to the quickstart:
NOTE: ECK sets a default pod `securityContext` to allow the `elasticsearch` user to access mounted volumes, unless deployed on Openshift. This may conflict with existing PSP and SCC. See [this page](link) for more information on how to disable it.
My inclination would be for option 3. It makes things tricky since there is an implicit/non-robust decision being made by the operator, but it gives the easiest quickstart experience with less overhead in the documentation.
I think it is worth trying to find the best way to do a best-effort Openshift detection on operator startup, with limited RBAC permission.
If that ends up being too complicated, my second choice would be option 2. In short: favor a quickstart experience on an unsecured k8s cluster, and try to redirect other users (including Openshift users) to a dedicated doc page about disabling the securityContext.
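For illustration, a best-effort detection could look for an Openshift-specific API group in the list of API versions served by the cluster. This is only a sketch from the command line, not what the operator does: the function name `looks_like_openshift` is hypothetical, and treating the `security.openshift.io` group (the one that serves SCCs) as the marker for Openshift is an assumption.

```shell
#!/usr/bin/env bash
# looks_like_openshift: reads API group/version lines on stdin (the format
# printed by `kubectl api-versions`) and succeeds if an Openshift-specific
# API group is present.
# Best effort only: the group name checked here is an assumption about what
# identifies Openshift, and the function name is hypothetical.
looks_like_openshift() {
  grep '^security\.openshift\.io/' > /dev/null
}

# Example usage against a live cluster:
#   if kubectl api-versions | looks_like_openshift; then
#     echo "Openshift detected: leaving the pod securityContext unset"
#   fi
```

Listing API versions relies on API discovery rather than on reading PSP or SCC objects, which is the reason for picking it in this sketch.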
A few things we discussed today with @nkvoll @pebrc @anyasabo. No decision reached yet:
- Not using root as a default user feels like a good thing. We don't want to revert that decision.
- Option 2 (default fsGroup: 1000, can be overridden to empty in the podTemplate) is tempting since Openshift users already have to go through a specific documentation page. In the quickstart we would just add a note with a link to a documentation page dedicated to that setting. Existing Openshift clusters and/or secured k8s clusters will likely be broken on ECK upgrade.
- … --set-default-elasticsearch-fsgroup), or as a more complete parameter ({securityContext: fsGroup: 1000}). The last one probably requires a dedicated configuration file. Also we may want to extend this setting to other resources (Kibana, APMServer, etc.) for consistency.

I think I'm leaning towards the following:
- … --set-default-fsgroup=true. Why boolean and not an int (e.g. 1000)? Because that value may differ for each resource managed by ECK, so it would then need to be a per-Kind flag. Or a more complex yaml configuration.
- … true by default in ECK manifests. Add a dedicated doc page explaining this flag, and how its behaviour can also be overridden in the podTemplate, which then takes priority. Specify in the Openshift docs that this flag should most likely be overridden to false.

I'm not sure whether this should apply to all stack versions, or only apply to 8.0+ so we don't break compatibility with existing pre-8.0 clusters.
I think I agree Seb. I'm less sure on only implementing it for 8.0+. It's nice because it is "simple" from a user experience -- if an existing user with PSPs upgrades to the version of ECK that includes the automatic fsGroup setting, it only fails for new 8.x clusters, or fails on the first pod during a rolling upgrade to 8.x. So the impact is minimal and should be relatively easy to notice by users.
I think it's less nice because of the complexity involved. We're arguably not doing the "right thing" now in <8.0 by using the init container instead of the native feature that does what we want it to do. There are even more differences between 6.x/7.x and 8.x that aren't really related to actual Elasticsearch and are more related to how Elasticsearch is packaged. Being consistent wherever we can is nice just to minimize mental load both for us and our users (who have to keep a mental map of what behavior we default to across different ES versions).
Downsides of defaulting fsGroups for all versions:
Overall I think I'm okay with making this change for all ES versions, but could still be persuaded otherwise.
👍 on using the flag @sebgl proposed.
Arguments for using the fsGroup mechanism only as of ES 8.0+ imo:
Side-note: I am still a bit worried about the number of flags we add to the binary (17 atm) and still think we should consider a configuration file. But maybe not for the ones that are feature toggles like this one but for the configuration values like cert validity and such.
If we use a flag I think we will have to make a choice regarding the operator hub:
Either :
A few things we discussed with the team:
- … fsGroup change to stack version 8.0.0 seems to be a reasonable way to alleviate this concern.
- … set-default-fsgroup=true|false). If not set, we would error-out in the reconciliation of Elasticsearch 8.0.0. The downside of this approach is that people need to care and understand this fsGroup thing when installing ECK for the first time (or upgrading). It moves us away from a very simple quickstart experience, which would be a big loss.
- … fsGroup setting not being set to the right value. If an SCC/PSP enforces fsGroup to not be 1000, the Pod creation will fail, and an error will be reported in the StatefulSets events. Unfortunately that's not easy to discover, even from within the operator. Pod creation dry-run can help grab a better error message, but the feature is not available on all k8s environments. We could also try to detect, at ECK startup, if we're running on Openshift. If that's the case, and set-default-fsgroup=true, we can output an explicit warning.

What we agreed on with the team (basically summarizes the discussion above):
- [x] operator-level --set-default-fsgroup=true|false flag (defaults: true, also in operator hub) - https://github.com/elastic/cloud-on-k8s/pull/3342
- [x] only set the Elasticsearch fsGroup starting 8.0.0
- [x] marked as breaking change, since it breaks expectations for Openshift users starting 8.0.0
- [ ] dedicated documentation page to explain how to disable this for Openshift & non-PSP users, linked from the quickstart
as part of ECK 1.3 so we release way before 8.0.0 (new Openshift users will likely notice in the quickstart while deploying a new 7.x)
- [ ] emit warnings if we suspect the setting is wrongly set
- [ ] best-effort autodetect openshift at ECK startup
- [ ] double-check pod creation dry-run fails if available
- [x] [eventually](https://github.com/elastic/cloud-on-k8s/issues/3401) a setting available in a configMap (not only a flag)
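
The first two items of the checklist boil down to a small decision rule. The snippet below restates it as a shell sketch so the rule can be tried out by hand; it is not the operator's implementation, and the function name `default_fsgroup_for` is made up for illustration. A securityContext set explicitly in the podTemplate would still take priority over whatever this returns.

```shell
#!/usr/bin/env bash
# default_fsgroup_for FLAG VERSION: prints the fsGroup to apply by default,
# or prints nothing if no default should be applied.
#   FLAG     value of the operator-level --set-default-fsgroup flag (true|false)
#   VERSION  Elasticsearch version, e.g. 7.6.2 or 8.0.0-SNAPSHOT
# Hypothetical helper that only mirrors the rules in the checklist above.
default_fsgroup_for() {
  local flag="$1" version="$2"
  local major="${version%%.*}"   # text before the first dot
  # Only default the fsGroup when the flag is on and the major version is 8+.
  if [[ "$flag" == "true" && "$major" =~ ^[0-9]+$ ]] && (( 10#$major >= 8 )); then
    echo 1000
  fi
}

# Example usage:
#   default_fsgroup_for true 8.0.0-SNAPSHOT
#   default_fsgroup_for true 7.6.2
```

With the flag left at its default of true, only 8.x clusters get the fsGroup, so existing 7.x clusters keep their current behaviour.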
@sebgl just trying to figure out where we stand with this issue. I think I ticked all the right boxes.
@pebrc yes! Unassigning myself here since not really working on this at the moment.