The emptyDir volumeMount is owned by root:root with permissions set to 750
hostDir is the same but with 755 permissions
Containers running with a non-root USER can't access the volumes
Related discussion at https://groups.google.com/forum/#!topic/google-containers/D5NdjKFs6Cc
and Docker issue https://github.com/docker/docker/issues/9360
hostDir should get the same permissions as the existing host entry, though I am not sure we ensure a host directory exists before using hostDir
Part of the problem here is that different containers can run as
different users in the same pod - which user do we create the volume
with? What we really need is a way to tell docker to add supplemental
group IDs when launching a container, so we can assign all containers
in a pod to a common group.
Would it be reasonable to add user and/or permissions option to volumeMounts or emptyDir to explicitly force it?
I don't think that we want that in the API long-term, so I'd rather apply a
hidden heuristic like "chown to the USER of the first container that mounts
the volume" or even "ensure that all VolumeMounts for an emptyDir Volume
have the same USER, else error". Do you think such heuristics would hold?
That sounds good to me
This is a good starter project
Background
Inside a docker container, the primary process is launched as root by default. And, currently, docker containers cannot be run without root privileges (once docker supports user namespaces, a process inside a container can run as root, and the container root user could actually be mapped to a normal, non-privileged user outside the container). However, even today, a process inside a docker container can run under a non-privileged user: the Docker image can create new users and then force docker to launch the entry point process as that user instead of root (as long as that user exists within the container image).
When an external volume is mounted, its ownership is set to root (UID 0), so unless the process inside the container is launched as root, it won't have permission to access the mounted directory.
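For illustration, a minimal pod sketch of the failure mode (the image name is an assumption; imagine its Dockerfile declares a non-root USER). The emptyDir below is created root-owned on the host, so the non-root process cannot write to /data:
apiVersion: v1
kind: Pod
metadata:
  name: nonroot-volume-demo        # illustrative name
spec:
  containers:
  - name: app
    image: example/nonroot-app     # assumed image whose Dockerfile sets USER to a non-root account
    volumeMounts:
    - name: data
      mountPath: /data             # directory is owned by root:root, so writes fail for the non-root USER
  volumes:
  - name: data
    emptyDir: {}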
Proposed Workarounds on the Kubernetes side
Both approaches feel to me like they are breaking a layer of abstraction by having Kubernetes reach into the container to figure out what user the main process would start as, and doing something outside the container with that information. I feel like the right approach would be for the containers themselves to CHOWN any "mounted volumes" during setup (after creating and setting user).
Thoughts?
@thockin, after talking to some folks, I think @carlossg's approach of explicitly specifying the user in the API would be the cleanest workaround. I don't think we can apply "hidden heuristics" without doing icky violation of abstractions (like reaching into a container to figure out what username to use and then mounting the container's /etc/passwd file to figure out the associated UID).
Proposal to modify the API:
- Extend the API for EmptyDir, GitRepo, and GCEPersistentDisk volumes to optionally specify an unsigned integer UID.
- If the UID is specified, the host will change the owner of the directory to that UID and set the permissions to 750 (User: rwx, Group: r-x, World: ---) when the volume directory is created.
- If the UID is not specified, the host will not change the owner, but will set the permissions to 757 (User: rwx, Group: r-x, World: rwx), i.e. world writable, when the volume directory is created.
- HostDir volumes would be left untouched, since those directories are not created by Kubernetes.
- Require UID instead of username string so there are no problems if the user does exist on the host machine (issue 2.ii above).
Thoughts?
CC: @bgrant0607, @dchen1107, @lavalamp
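As a concrete sketch of the shape this proposal could take in the volume spec (the uid field below is hypothetical and was never adopted; the other names are illustrative):
volumes:
- name: scratch
  emptyDir:
    uid: 1000                  # hypothetical field: host would chown the directory to UID 1000 and set mode 750
- name: disk
  gcePersistentDisk:
    pdName: my-data-disk       # illustrative disk name
    fsType: ext4
    uid: 1000                  # hypothetical field with the same semantics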
I think adding UID to volumes is a hack and redundant. I'd rather we do
the right thing and get Docker to support supplemental group IDs.
https://github.com/docker/docker/issues/9360
@saad-ali I think HostDir should not be left untouched. Let's consider this: Hadoop on restart, restores the blocks from the directory it stores data in. If we use emptyDir, the container which restarted will get another directory and the previous data will be lost. And Hadoop requires the permissions and ownership of directory to be set to the user starting Hadoop (hdfs). If HostDir is not allowed to change permissions as per user, then similar use cases to this cannot be achieved. Please, comment.
Define restart? Do you mean the container crashed and came back, or do you
mean the machine rebooted and a new pod was scheduled and expects to be
able to reclaim the disk space used by the previous pod? Or something else?
@thockin Restart could be anything. It could be after pod failure or container failure. Or the container could be restarted, after changing some configurations (Hadoop needs to be restarted after changes in configs). Does that answer?
This document mentions that when a pod is unbound, the emptyDir is deleted. In use case of Hadoop, the data might be essential and might be required when another pod of Hadoop comes back (or the container restarts). So, HostDir must be used to persist data even the pod is unbound. But Hadoop requires permissions to be set for the user for the data directory. Hope this explains.
With docker/libcontainer/pull/322, docker containers now allow specifying AdditionalGroups (supplementary group GIDs). So an updated proposal to handle shared volumes amongst different containers in a pod:
For EmptyDir, GitRepo, or GCEPersistentDisk volumes for a new pod, Kubelet will:
- Create the volume directory owned by a common pod group and set its permissions to 770 (User: rwx, Group: rwx, World: ---).
- Pass that group to every container in the pod via AdditionalGroups in the docker container configs.
This depends on docker passing AdditionalGroups through to libcontainer (https://github.com/docker/docker/issues/9360) and on fsouza/go-dockerclient adding support for AdditionalGroups.
For HostDir volumes for a new pod, Kubelet will:

There's an important distinction between a container restarting and a pod being removed. When a container restarts, the data in a normal emptyDir volume is safe. When a pod is removed, it should be GONE. Leaving host data and expecting it to be there at some later point in time is awkward at best.
All of this is more complicated as soon as user namespaces land.
If I'm understanding the current proposal correctly, I think this is going to create surprising behavior for a certain class of applications.
Many older applications which bind to low (privileged) ports start first as root, then immediately drop privileges to some other user. In such a scenario, the container must be configured to start the application as root, and so the original user (root) would have access to the volume. Once the application calls setuid(2)/seteuid(2), though, it won't have access anymore. Now the only way to get access to that directory is to modify the container to chown the volume before starting the application itself. This is the situation I'm currently in.
Due to this, I'd like to voice another opinion in favor of extending the API to allow explicitly specifying UID and GID, as I don't think the current proposal covers all possible (reasonable) use cases.
At a minimum the emptydir should use the UID/GID/Labels of the security context (if specified).
I think adding UID to volumes is a hack and redundant. I'd rather we do the right thing and get Docker to support supplemental group IDs.
+1, but also:
At a minimum the emptydir should use the UID/GID/Labels of the security context (if specified).
Now that we have security context API in place, I think we should make emptyDir work with the security context of the containers in a pod. One little wrinkle to iron out about this is that volumes are pod-scoped while security contexts are container scoped. I think you will have to look at which containers have the volume mounted and what their security contexts are. If there's a single container that mounts an emptyDir, it's easy -- use the security context of that container. If there are multiple, it gets dicey:
I think I will probably start prototyping this concentrating on the simple case where there's a single container. I will use the security context of the first container that mounts the volume in the pod spec at first and we can change the strategy for determining the context to use as discussion goes.
@smarterclayton @thockin @erictune @pweil-
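For concreteness, a sketch of that wrinkle (image names and UIDs are illustrative): one emptyDir mounted by two containers whose security contexts request different users, so there is no single obvious owner for the volume.
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-wrinkle      # illustrative name
spec:
  containers:
  - name: writer
    image: example/writer          # assumed image
    securityContext:
      runAsUser: 1000              # this container would want the volume owned by 1000
    volumeMounts:
    - name: shared
      mountPath: /shared
  - name: reader
    image: example/reader          # assumed image
    securityContext:
      runAsUser: 2000              # this one runs as a different UID entirely
    volumeMounts:
    - name: shared
      mountPath: /shared
  volumes:
  - name: shared
    emptyDir: {}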
I don't really like the heuristics here. I acknowledge that I suggested a
heuristic but that was half a year ago :)
Other than Docker support not being done yet, why can't we do something
like:
Net result should be that all containers in the pod can access all volumes
in the pod without restriction, regardless of what UID each container is
using.
I don't know anything about SELinux labels, might be a problem. I assert that all volumes in a pod should be available to all containers.
Next we have to define "all volumes". emptyDir is pretty obvious. What happens to hostPath mounts that did not exist and were created for this use? Seems reasonable. What about hostPath mounts that existed before this use - no way can we change those. What about things like PDs? Do we run a recursive chown/chgrp/chmod? Blech.
@mrunalp for docker support status on supplemental groups.
Waiting for https://github.com/docker/libcontainer/pull/603 to be merged.
I assert that all volumes in a pod should be available to all containers
Spent a lot of time thinking about this last night, and I agree with you on this point.
I don't know if that means we need a pod-level security context or
something, though :)
Although - I may collocate two containers and not want them to share contents (db and web logs) but share a work for.
I don't find the case of two containers in a pod needing different access
control to a volume to be very compelling. The alternative is to spec a
full security context for volumes and then force the complexity back onto
API users.
The argument here seems to be that you don't need intra pod security more complex than "don't mount the same volume into different contexts". I'm ok with a single security context for the pod - just pointing out that if you want to have complex, secure pods, you may want to use user isolation between the containers to secure disk contents.
Hrm.
Although the pod security context is unlikely to work for most real containers once user namespaces land - the UID a container runs as (in user namespaces) is really tied to the container, not the pod. So either that has to be a default security context at the pod level (overridable) or it's instead the volume security context.
We should share user namespace across the pod.
The user namespace of the two containers should probably be in the same range. But the UID of container A and B are not required to be ==, and in many cases you don't want them to be trivially ==, because you may want to "read" the volume but not write it.
@smarterclayton
The user namespace of the two containers should probably be in the same range. But the UID of container A and B are not required to be ==, and in many cases you don't want them to be trivially ==, because you may want to "read" the volume but not write it.
Do you think we could infer this by whether the readOnly flag is set on the VolumeMount?
Some thoughts...
- When mounting a volume for a container it could inherit the SC of the container if it has nothing set.
- For a complex case we could spec out the SC for the volume to support ranges of SELinux labels as mentioned before and in this case it would not inherit the SC of the volume.
- For a predefined volume SCs the container's SC would need to be allocated in a manner consistent with the desired security (ie. volume has range s0:c1.c10 and container has s0:c1,c2, GID that has only read, etc) to facilitate custom, complex approaches with fine grained access.
When mounting a volume for a container it could inherit the SC of the container if it has nothing set.
Hrm - that means a pod with one container would behave differently from a pod with two containers?
Hrm - that means a pod with one container would behave differently from a pod with two containers?
And this is where some of the complexity comes in from being flexible. It is a valid use case for a single pod to have different security contexts for every container in the pod. And likewise, in OpenShift, if the containers make SC requests, each container may validate against different SCCs. In that case, it seems like a predefined SC on the volume should be used with container SCs that comply. Inheriting is simply an ease of use feature.
Another idea to throw against the wall - being able to go from a pre-defined SC on a volume (inherited or not) to a RunInRange policy and validating that all container SCs that request the volume will have some sort of access.
Security aside -- can we agree that a PR that relaxes the mode is a Good Thing? I would like it if non-root uids could use volumes and junk.
@thockin @smarterclayton
I don't know what "relaxes the mode" means - can you be more concrete?
My bad. I meant, make emptyDir 0777 instead of 0700.
At a minimum for us (Paul) we could use the sc of the first container that mounts the volume until we get a broader solution. Users can then control ordering and we'll at least be unbroken.
@smarterclayton that's what I was going to next. We still need to fix the
non-root UID case when SELinux isn't in play -- which I think we need the
0777 mode to do until we have supplemental groups in docker.
I do agree that a separate security context for the pod volumes (individual or group) is probably necessary.
I do agree that a separate security context for the pod volumes (individual or group) is probably necessary.
I think so too after having my brain thoroughly pretzelized while thinking through all the permutations of what you would have to handle without one. :curly_loop: :curly_loop: :curly_loop:
For comparison, we are intentionally NOT this flexible for things like
RestartPolicy - defining sane semantics for this sort of edge case is just
not worth the effort. Is it really worth the effort for security context?
Can't we make it 0770 and set the group ID for it and every container in
the pod? It's coarse but better than 0777 and closer to where we should go
(IMO supplemental GIDs). As far as I see, Security Context does not yet
allow a container to set GID...
I could live with this iff it comes with a giant TODO and docs in the right
places.
I really really want to lean on simple assumptions intra-pod. Adding SC on
volumes is not simpler.
Volumes plural, not volumes singular. I'm pretty sure different uids per container is absolutely valid, so we can't guess a uid based on containers alone and be predictable. So we either need a pod default sc or a subset of sc applied to all volumes. If the uid on the directory is wrong for that sc that has other security implications. And labels _have_ to match or you get nothing.
So the two options seem to be:
Pod level default sc kind of makes sense, while the first option also makes sense but is order dependent and somewhat implicit.
I agree group is generally useful. It wouldn't work for labels though.
Yeah, I don't know selinux at all (it has only ever given me problems that
I don't know how to solve - same as here).
Treat it like 700 - if labels are different you can't see it, if they are you can.
In this context we're probably going to stay simple and say every volume has a single label, and every container has a single label, and we want everything in the pod to have the same label in 99% of cases. Eventually we may want a container to have a different label (which means it can access anything outside of its label, period).
Hrm
It doesn't look like docker exposes control of the group at all yet. Am I missing something, @thockin? Would the kubelet have to setgid on the container process? Sounds racy to me.
Multiple GH fail, my bad!
Hmm, I thought docker allowed setting GID. Damn.
@thockin :'-/
@thockin @pmorie The syntax for setting gid is --user "uid:gid". For e.g.
docker run -it --rm --user "1:777" busybox sh
@thockin @pmorie The syntax for setting gid is --user "uid:gid". For e.g.
@pmorie @smarterclayton - we probably want to patch this in to SCs and SCCs then.
Yes we do.
@pweil-
Docker's syntax is atrocious, please don't copy it.
@thockin agree, we need a distinct gid field in security context
Docker's syntax is atrocious, please don't copy it.
Definitely not, we'd add a RunAsGroup int64 field.
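Something along these lines in the container security context (a sketch only; runAsGroup is the proposed field and did not exist in the API at the time of this discussion):
securityContext:
  runAsUser: 1000     # existing field
  runAsGroup: 1000    # proposed field: primary GID for the container process, replacing docker's "uid:gid" syntax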
Depending on requirements, you might want to store user and group as strings. They are looked up by default in the passwd and group files. If they aren't found and are numeric then they are converted to uid/gid.
So I know it's a wild unreasonable idea, but from a purely correctness technical and layering PoV it seems to me like the right solution would be to just forbid/ignore the USER line in any docker file and make people set the uid of a container somewhere in the kube pod/container declaration. (although that still obviously suffers from the insanity of having to look at /etc/passwd inside the container to interpret the uid of something outside the container if we make this a string)
More seriously though, there doesn't seem to be ANY good solution until we get to user namespaces. Then we can generate a random uid/gid on the outside, chown the emptyDir directory to the uid/gid we picked and just let the container on the inside run as whatever uid/gid it wants. Nothing until we get to that point is anything but an ugly hack.
A relatedish point, just FYI: today (plain) docker is picking a random selinux context for every container. Docker recently accepted a new option -v /source:/dest:rw,z. The z portion is new. It means to do the equivalent of chown -R except it sets the selinux label on all files in the volume to the randomly generated label. They do not have an option to actually chown -R and set uid/gid but I know some people want that as well...
Personally, I think putting a security context on the volume and foisting the complexity on the user is the only 'clean' option unless we just ignore it entirely until user namespaces are a reality. One big reason I think this instead of jumping on @thockin's supplemental gid solution is that there is no selinux analog. selinux is used similarly to uids. You get one. And either it exactly matches or it doesn't. It is not like groups. You can't have more than 1 label. Even with user namespaces, this part is still going to be a PITA...
I agree - we plan to reject images with non-numeric usernames in some modes. You can't even trust /etc/passwd anyway, so you're just compensating for lazy image authors.
People need to write images to a known uid or work on any uid. But all the prep work to get to user namespaces can be done now. And user namespaces don't solve the ownership problem, because containers can have multiple users, so you still don't know which uid to map to unless the image tells you with a numeric uid.
This topic is relevant to @jmccormick2001's interests
@thockin @eparis Is there merit in peeling off a separate issue to discuss this problem for NFS and other !emptyDir volumes?
yes
What is the Kubernetes team's recommended workaround for this for the time being? The options I see are as follows:
- Run as root
- Run as root initially and have an entrypoint wrapper that chowns specified directories before dropping permissions (?? would this even work?)
Neither of them seems especially palatable.
I see that a fix was made to chmod emptyDirs to 777 but this isn't in 1.0.x (which is what Google Container Engine, my preferred deployment target, is using.) So it looks like I have to dig in and use one of the above solutions for now.
kubernetes v1.1 will be out soon with better support for this case!
@pmorie
@joshk0 we perform a chown from the containers' entry point (a bash script) and fall back to running our main process as a normal user using runuser or gosu.
Not ideal but it's the only way to do it on v1.0.
@joshk0 @antoineco We've just introduced API changes that will allow you to specify a supplemental group that owns emptyDir and its derivatives and some block device volumes: #15352
Paul,
Do we have a roadmap doc for the evolution of this? We've talked about
FSGid being auto-allocated at admission, then maybe being coupled to PVs
eventually. It would be nice to be able to see that plan all laid out.
@thockin I would love to write one up once I am freed up from finishing stuff for 3.1... I'll make an issue for it.
@pmorie looks good in terms of usability! Thanks a lot for working on this.
Will it also be possible to set a more fine-grained mode on the mountpoint? Right now RW volumes are mounted with 0777 (1.1-beta.1). I have a use case where this causes minor issues:
❯ kubectl exec mypod -c logrotate -- ls -ld /var/log/containers/rails/
drwxrwxrwx 2 9999 root 4096 Nov 6 14:35 /var/log/containers/rails/
❯ kubectl exec mypod -c logrotate -- logrotate /etc/logrotate.conf
error: skipping "/var/log/containers/rails/production.log" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to tell logrotate which user/group should be used for rotation.
@antoineco Are you using an emptyDir for that volume? I think I will create an issue to change the behavior so that emptyDirs are chmod a-rwx if FSGroup is specified.
Yes, it's an emptyDir volume.
@antoineco Okay, I am packing for a trip next week today, but I will make an issue for this and tag you into it. Do the semantics I mentioned work for you?
(Did I hear "KubeCon"? 😄)
What you suggested would work, but you probably meant o-rwx, which would set the mode and owner as follows: 0770 root:FSGroup. Or did I get everything wrong?
@antoineco You did :)
I think I articulated things the wrong way -- what I meant was that emptyDir should be 0770 root:fsgroup g+s.
@antoineco You suggested as a workaround for 1.0.X to perform a chown from the container's entrypoint. Do you have an example of that?
I can't seem to be able to see the mounted volume when I execute such command as the entrypoint. Is the volume mounted at a later stage?
@marcolenzo Nothing fancy, I use a tiny bash script as my entrypoint: ENTRYPOINT ["/run.sh"]
This script sets the correct permissions on the shared volume(s) and then starts my service. Example:
#!/bin/sh
# reset permissions on log volumes
vol=/var/log/nginx
if [ -d "$vol" -a "$(stat -c '%U' "$vol" 2>/dev/null)" = "root" ]; then
chown app "$vol"
chmod o-rwx "$vol"
fi
# startup
echo "+-- starting nginx..."
exec nginx "$@"
I noticed that the PR for docker/docker#9360 was merged: docker/docker#10717
Does this mean the group ID solution to this problem can now be implemented? I have had several cases where I had to do what @antoineco has had to do, i.e. create little mini startup scripts that keep me from being able to use many 3rd party Docker images as-is.
@charles-crain There are two ways you can work with supplemental groups via the API now:
Does that help? Let me know if you need more information.
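For anyone landing on this later, a minimal sketch of where those fields live in a pod spec (the names and group IDs below are placeholders):
```
apiVersion: v1
kind: Pod
metadata:
  name: supplemental-groups-demo   # hypothetical name
spec:
  securityContext:
    supplementalGroups: [2000]     # extra GIDs added to every container process
    fsGroup: 1000                  # GID that owns supported volumes (e.g. emptyDir)
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "id && sleep 3600"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir: {}
```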
@pmorie From which version are pod.spec.securityContext.supplementalGroups and pod.spec.securityContext.fsGroup supported? Does it also apply to non-existing hostPaths that are created when the POD starts for the first time?
@pmorie Looking through the commit logs it looks like fsGroup isn't supported until 1.2.0-alpha3 yes?
It seems to me this issue is only caused by Kubernetes relying on bind mounts. When using docker volumes (i.e. created using a volume driver, the docker volume command, or an implicit VOLUME in the Dockerfile) the volume is initialized with content from the docker image, and can be chowned to match image requirements. Can't Kubernetes use the volume API to do the same?
@thockin, what's the final decision on this issue? I hit the same problem when using Jenkins & GlusterFS: Jenkins runs as UID 1000 and the GlusterFS volume is mounted as UID 0, so Jenkins cannot access the FS :(.
[root@cdemo01 jenkins]# kc logs jenkins-3tozq
touch: cannot touch ‘/var/jenkins_home/copy_reference_file.log’: Permission denied
Can not write to /var/jenkins_home/copy_reference_file.log. Wrong volume permissions?
@andrejvanderzee @charles-crain my sincere apologies, I somehow missed the tags on this thread in the torrent of github alerts. FSGroup is supported as of 1.2.0.
Does it also apply to non-existing hostPaths that are created when the POD starts for the first time?
Host paths do not support FSGroup or SELinux relabeling, because those could provide an escalation path for a pod to take over a host. However, that said, host paths are not created if they do not exist. Could you be thinking of empty dir? Empty dir volumes do support both FSGroup and SELinux relabel.
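A quick way to check this behaviour on an emptyDir (assuming the sketch pod above, or any pod with fsGroup set):
```
# Ownership of the emptyDir mount should reflect the fsGroup GID
kubectl exec supplemental-groups-demo -- ls -ld /scratch
# expected shape (illustrative): drwxrwsrwx 2 root 1000 ... /scratch

# The container process should also list the fsGroup as a supplemental group
kubectl exec supplemental-groups-demo -- id
```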
@k82 The glusterfs plugin does not support FSGroup at this time.
I am positive we have an issue for controlling the UID of a volume, but I will need to find it.
IIUC emptyDir volumes are created for a Pod but also destroyed when the Pod is deleted, so they can't be used for persistent data (typically jenkins_home in @k82's scenario).
Currently I'm using 0777 as a workaround for a demo environment, but a better solution is necessary for production :).
So my reading is that this issue is largely mitigated, though some plugins, specifically shared ones like glusterfs, do not support it well. Paul, is there a writeup / walkthrough of how to use FSGroup?
There is doc in the form of the proposal, and some doc for security context, but this could probably use better examples.
Here’s an example for how to use the fsgroup directive:
For the Docker container defined in https://github.com/robustirc/robustirc/blob/master/Dockerfile (which specifies RUN echo 'nobody:x:99:99:nobody:/:/bin/sh' >> /etc/passwd and USER nobody), the following modification to my kubernetes replicationcontroller config was necessary:
--- a/robustirc-node-1.rc.yaml 2016-07-11 22:04:31.795710444 +0200
+++ b/robustirc-node-1.rc.yaml 2016-07-11 22:04:37.815678489 +0200
@@ -14,6 +14,10 @@
spec:
restartPolicy: Always
dnsPolicy: ClusterFirst
+ securityContext:
+ # Specify fsGroup so that the persistent volume is writable for the
+ # non-privileged uid/gid 99, which is used in robustirc’s Dockerfile.
+ fsGroup: 99
containers:
- name: robustirc
image: robustirc/robustirc
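For readers who don't want to parse the diff, the relevant fragment of the resulting pod template looks roughly like this (a sketch; the fsGroup value matches the Dockerfile quoted above):
```
spec:
  template:
    spec:
      securityContext:
        # uid/gid 99 is the "nobody" user created in the robustirc Dockerfile
        fsGroup: 99
      containers:
      - name: robustirc
        image: robustirc/robustirc
```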
What is the current best workaround for 1.4?
@stapelberg I've tried exactly like you explained with k8s 1.4.6 on a hostPath volume but I still see that root owns the mounted directory with 755 permissions. Any additional pointers on how to debug this?
Also, does it depend on the existence of the given user/group on the actual node/host where the container is scheduled?
Ahh.. Not sure if I'm looking at the correct code but seems like setUp method doesn't do anything for hostPath volume driver. See https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/host_path/host_path.go#L198
@thockin @saad-ali Can you please confirm? Seems like you guys are maintaining hostPath volume driver.
Since fsGroup isn't available for Gluster yet, is there a recommended workaround other than running the container as root?
FsGroup is not supported by hostPath; you do not want the kubelet making such permission changes on the host.
@pmorie Why not give the kubelet permission? Otherwise, the hand that is forced is to give the container full access as root to do anything on the host. After all, it would have to be specified by fsGroup in the manifest.
I agree with @haf regarding his "hand forcing" comment.
As an example, the kube-aws tool allows users to configure an auto-attached NFS drive on all of the workers during cluster provisioning. So, by default, every single worker has an AWS EFS drive mounted at /efs when it comes online (it uses cloud-init under the hood to accomplish this). Admins can then create shared storage PVs by pointing the hostPath option at that NFS mount.
This type of hybrid setup is very simple to set up, works well, and I would think a lot of people would like to take advantage of it for those reasons. Isn't that a good reason for hostPath to support the FsGroup option?
@joan38 @rushtehrani I use init-containers to chown the volume:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: jenkins
  name: jenkins
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: jenkins
      annotations:
        pod.alpha.kubernetes.io/init-containers: '[
          {
            "name": "jenkins-init",
            "image": "busybox",
            "imagePullPolicy": "IfNotPresent",
            "command": ["sh", "-c", "chown -R 1234:1234 /jenkins"],
            "volumeMounts": [
              {
                "name": "jenkins-home",
                "mountPath": "/jenkins"
              }
            ]
          }
        ]'
    spec:
      containers:
      - name: jenkins
        image: 'amarruedo/jenkins:v2.32.1'
        ..........
      volumes:
      - name: jenkins-home
        vsphereVolume:
          volumePath: "[FAS03_IDVMS] volumes/jenkins"
          fsType: ext4
I define the UID and GID in my containers to a known value so I can chown in the init-container. I've used this approach with hostPath volumes as well with no problems. When FSGroup becomes available for vsphere volumes I'll use that.
Hope this helps.
As with https://github.com/kubernetes/kubernetes/pull/39438#issuecomment-275459427, maybe we can only set FSGroup if the hostpath is being created?
After reading this full thread I am not sure what the resolution was. I am having a similar problem where I am on a Mac using VirtualBox and trying to start a MySQL pod with a volume mount into /data (I have also tried /Users but it has the same behaviour). When minikube creates the directories they are created with root ownership and restrictive write permissions. MySQL is not able to write to them and so my pods crash. If I minikube ssh and chmod -R 777 /data then the pods start and work correctly. What should I be doing differently so I can start this pod without having to modify the permissions of the data directory?
@justechn you may want to check https://kubernetes.io/docs/concepts/policy/security-context/ and https://kubernetes.io/docs/api-reference/v1.6/#podsecuritycontext-v1-core and set the value of fsGroup to whatever GID your mysql daemon is running with.
fsGroup: Volumes which support ownership management are modified to be owned and writable by the GID specified in fsGroup.
@antoineco thanks for the tip. I must be doing something wrong because it is not working for me. I logged into the pod and ran id mysql and got back uid=999(mysql) gid=999(mysql) groups=999(mysql). So I added fsGroup with 999 to my spec and restarted, but nothing changed. The directories are still owned by root.
spec: {
  containers: [
    {
      name: 'percona',
      image: 'percona:5.6',
      imagePullPolicy: 'Always',
      env: [
        { name: 'MYSQL_ROOT_PASSWORD', value: secrets['system-mysql-root-password'] },
        { name: 'MYSQL_OPS_USER', value: variables['system-mysql-ops-user'] },
        { name: 'MYSQL_OPS_PASSWORD', value: secrets['system-mysql-ops-password'] },
        { name: 'MYSQL_APP_USER', value: variables['system-mysql-app-user'] },
        { name: 'MYSQL_APP_PASSWORD', value: secrets['system-mysql-app-password'] },
      ],
      ports: [
        { containerPort: 3306, protocol: 'TCP' },
      ],
      volumeMounts: [
        { name: 'data', mountPath: '/var/lib/mysql' },
        { name: 'conf', mountPath: '/etc/mysql/conf.d' },
      ],
    },
  ],
  securityContext: {
    fsGroup: 999,
  },
  volumes: [
    { name: 'data', hostPath: { path: '/data/db/db' } },
    { name: 'conf', hostPath: { path: '/data/db/db-conf' } },
  ],
},
drwxr-xr-x 2 root root 4096 May 9 18:14 db
drwxr-xr-x 2 root root 4096 May 9 18:14 db-conf
drwxr-xr-x 2 root root 4096 May 9 18:14 shared
drwxr-xr-x 2 root root 4096 May 9 18:14 shared-conf
drwxr-xr-x 2 root root 4096 May 9 18:14 shard-1
drwxr-xr-x 2 root root 4096 May 9 18:14 shard-1-conf
You may have to adjust the Pod Security Policy as well:
https://kubernetes.io/docs/concepts/policy/pod-security-policy/
What's the progress on this?
How can we access PV if using non-root users?
ZK does not run on minikube because of this: https://github.com/kubernetes/charts/issues/976
FYI - for those coming to this and using the workaround by @amarruedo, it will need to be updated to the new v1.6+ syntax for initContainers, which looks like the following:
initContainers:
- name: volume-mount-hack
  image: busybox
  command: ["sh", "-c", "chown -R 1000:100 /usr/share/elasticsearch/data"]
  volumeMounts:
  - name: data
    mountPath: /usr/share/elasticsearch/data
Regardless, it does the trick, but it will be nice when this is natively supported.
Thank you @antoineco for the suggestions that brought many hours of searching to an end. It seems that this issue is focused on host directories, but there weren't any hints for "Using Persistent Volumes as non-root user" within the Persistent Volumes Using section.
I was successful with a simple addition to the Pod.spec using v1.7 and a Dynamically Provisioned AWS Persistent Volume for a StatefulSet. I did not require a Pod Security Policy:
# Allow non-root user to access PersistentVolume
securityContext:
  fsGroup: 1000
Could we please have more documentation than the brief mention in the API Reference? It's ambiguous to know that some volume types will work without knowing which ones.
fsGroup: A special supplemental group that applies to all containers in a pod. Some volume types allow the Kubelet to change the ownership of that volume to be owned by the pod: 1. The owning GID will be the FSGroup 2. The setgid bit is set (new files created in the volume will be owned by FSGroup) 3. The permission bits are OR'd with rw-rw---- If unset, the Kubelet will not modify the ownership and permissions of any volume.
I tried to mount an azurefile persistent volume for Jenkins and had the permission denied problem, because the jenkins_home directory was only rwx for root, while Jenkins runs as the jenkins user.
I could work around it using securityContext/runAsUser: 0, but it would be better to inherit the existing directory rights or to provide a chmod property.
cc @kubernetes/sig-storage-feature-requests @kubernetes/sig-node-feature-requests
@Guillaume-Mayer - wouldn't a postStart lifecycle hook to chmod the files work (alternatively an init container if execution needs to be completed before the entrypoint)?
@so0k That won't work if the runAsUser and allowPrivilegeEscalation settings prevent the root user from being used.
I have a very similar use case where I would need to change the owner of a file that is mounted (from a secret in my case).
I have a mongodb-cluster in k8s which uses a special cluster.key file for cluster authorization. That file is stored in a secret; we have a client where running images as root is forbidden. Our pod has a securityContext set with a runAsUser: 1000 directive. MongoDB itself forbids the file being accessible by anyone but the owner; it will reject startup if the file is readable by group or other.
Since the owner is root, and I cannot run a chown as non-root on that file, neither changing the permissions works, nor (since there is no k8s support) changing the owner of the file.
I am currently working around this by injecting the key as an environment variable into a busybox init container, which in turn mounts an emptyDir and writes it there. The secret is then not mounted as a file anymore. It's quite ugly, and if there is a chance to get rid of it I'd be in.
The fact that so many of the Docs advise and caution the user against running containers as root, and that this issue is now 3 years old astounds me. This should at least be explained in much greater detail in the Docs.
@saad-ali
Hi!
I ended up with the below initContainers config for giving the node-red-docker container, which runs as a non-privileged user, access to an externally created disk. After trying a lot of things, it seemed that "runAsUser" 0 (root) did the trick.
Cheers
-jo
initContainers:
- name: volume-mount-hack
  image: nodered/node-red-docker:slim
  command:
  - sh
  - -c
  - 'chmod -R a+rwx /data'
  volumeMounts:
  - name: picturl-persistent-storage
    mountPath: /data
  securityContext:
    runAsUser: 0
Many older applications which bind to low (privileged) ports start first as root, then immediately drop privileges to some other user. In such a scenario, the container must be configured to start the application as root, and so the original user (root) would have access to the volume. Once the application calls setuid(2)/seteuid(2) though, it won't have access anymore.
@eatnumber1 Can you elaborate a bit more on why we will have this issue with the supplementary group solution mentioned in this thread? IIUC, setuid(2)/seteuid(2) will not change the supplementary groups of the calling process, so as long as the application is in a group which has access to the volume, it should not have problems accessing the volume, right?
It looks like I was mistaken and calling setgid(2) doesn't change supplementary groups (which I had thought it did).
Looking around, it seems like at least nginx drops supplementary groups explicitly here (and otherwise would be a minor security vulnerability). I'd be surprised if any well-written privilege-dropping application doesn't drop supplementary groups.
Thanks @eatnumber1! So Nginx initially runs as root, and later resets the uid, gid and supplementary groups to what is configured in nginx.conf. Then I think with the pod security context we can set fsGroup to the group configured in nginx.conf; in this way, even after Nginx resets its supplementary groups, it can still access the volume. Right?
From a cursory reading about pod security contexts, it seems like it would.
I haven't used them though (note that my original comment on this bug is
multiple years old).
Besides supplementary groups, I think POSIX ACLs could be another solution to this issue; I mean we could add an ACL entry to grant rwx permission to the pod/container user on the volume. But I do not see POSIX ACLs mentioned in this thread; are there any drawbacks?
cc @thockin @saad-ali
I don't know that nginx clearing supplementary groups prevents any
vulnerability in this case? It is specifically defeating a well-understood
mechanism. Can we fix nginx?
As for ACL or other mechanisms, I don't object to them, I just have less
context on them.
It is specifically defeating a well-understood mechanism.
@thockin Can you please elaborate a bit on this? And why do we need to fix Nginx?
We explicitly set up supplemental groups so we can do things like volumes
and per-volume accounting. It is 100% intentional and then nginx drops
supplemental groups in the name of security. Breaking valid use cases.
In a non-containerized world, if nginx didn't drop supplemental groups, a remote code execution vulnerability in nginx could leak undesired privileges to remote attackers via its supplemental groups. I therefore don't think you'll ever get the nginx developers to be willing to stop doing that. Even if you do manage to convince them to, dropping supplemental groups is the standard practice, and you'd have to convince every developer of every privilege dropping application to do the same. Apache does the same exact thing here.
Furthermore, even if you pick another obscure Linux access control mechanism to use instead (for example, fsuid), it is _intentional_ that every possible type of privilege is dropped, so it would be a security vulnerability if applications didn't drop that privilege as well. That _is_ the security model here.
In a non-containerized world, the only way to grant privileges to the application after it drops privileges is to grant privileges to the user/group/etc that the application switches _to_. Hence my original (3 year old) comment about supporting UID and GID explicitly, which would allow the user to specify the UID or GID that the application is going to switch to.
Looking at the documentation for PodSecurityContext, it says this about fsGroup:
A special supplemental group that applies to all containers in a pod. Some volume types allow the Kubelet to change the ownership of that volume to be owned by the pod:
- The owning GID will be the FSGroup
- The setgid bit is set (new files created in the volume will be owned by FSGroup)
- The permission bits are OR'd with rw-rw---- If unset, the Kubelet will not modify the ownership and permissions of any volume.
As far as I'm aware, these actions should be sufficient to allow the resulting unprivileged user after a privilege drop to access the volume successfully. (caveat, I haven't tested it)
Yes, so I think setting fsGroup to the group configured in nginx.conf will let Nginx access the volume even after a privilege drop, and will also make volume accounting work.
But I have another question: besides the fsGroup in the pod security context, users can also set fsGroup in the container security context, so if a pod has multiple containers in it and each container has its own fsGroup, how can we make sure all of these containers can access the volume (since a volume can only be owned by a single group rather than multiple)?
@qianzhangxa if multiple containers need access to that volume, you will need to make sure all containers request the same fsGroup in the container-level security context, or better, just set it at the pod level.
@tallclair FYI I believe we can close this issue
/sig auth
None of the solutions suggested are working for me.
YML:
apiVersion: apps/v1beta1 # for versions before 1.8.0 use apps/v1beta1
kind: Deployment
metadata:
  labels:
    tier: frontend
spec:
  selector:
    matchLabels:
      tier: frontend
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      securityContext:
        fsGroup: 1000
        runAsUser: 0
      initContainers:
      - image: some-sftp-container
        name: sftp-mount-permission-fix
        command: ["sh", "-c", "chown -R <user> /mnt/permission-fix"]
        volumeMounts:
        - name: azure
          mountPath: /mnt/permission-fix
      containers:
      - image: some-sftp-container
        name: sftp-container
        ports:
        - containerPort: 22
          name: port_22
        volumeMounts:
        - name: azure
          mountPath: /home/<user>/data
      volumes:
      - name: azure
        azureFile:
          secretName: azure-secret
          shareName: sftp-share
          readOnly: false
Once the Pod is ready and I exec into the container and check the dirs, nothing has happened:
root@container:/# cd /home/<user>
root@container:/home/<user># ls -als
total 8
4 drwxr-xr-x 3 root root 4096 Apr 24 18:45 .
4 drwxr-xr-x 1 root root 4096 Apr 24 18:45 ..
0 drwxr-xr-x 2 root root 0 Apr 22 21:32 data
root@container:/home/<user># cd data
root@container:/home/<user>/data# ls -als
total 1
1 -rwxr-xr-x 1 root root 898 Apr 24 08:55 fix.sh
0 -rwxr-xr-x 1 root root 0 Apr 22 22:27 test.json
root@container:/home/<user>/data#
At some point I also had the runAsUser: 0 on the container itself. But that didn't work either. Any help would be much appreciated.
Also running a chown afterwards didn't work
@eatnumber1 if a group is in your supplemental groups, shouldn't you assume that it was intended that you have access to that group's resources? Dropping supplemental groups is saying "I know you told me I need this, but I don't want it" and then later complaining that you don't have it.
Regardless, I am now thoroughly lost as to what this bug means - there are too many followups that don't seem to be quite the same.
Can someone summarize for me? Or better, post a full repro with non-pretend image names?
@thockin IIUC, Nginx is not just dropping the supplementary groups, it is actually resetting them to what is configured in nginx.conf by calling initgroups.
This worked for me.. part of the script.
```
spec:
  containers:
  - name: jenkins
    image: jenkins/jenkins
    ports:
    - containerPort: 50000
    - containerPort: 8080
    volumeMounts:
    - mountPath: /var/jenkins_home
      name: jenkins-home
  securityContext:
    fsGroup: 1000
    runAsUser: 0
```
The solutions aren't ideal: now your containers are running as root, which is against the security standards that k8s tries to get its users to impose.
It would be great if persistent volumes could be created with securityContext in mind, i.e.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: redis-data-pv
  namespace: data
  labels:
    app: redis
spec:
  securityContext:
    runAsUser: 65534
    fsGroup: 65534
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  claimRef:
    namespace: data
    name: redis-data
  hostPath:
    path: "/data"
As a workaround, I use a postStart lifecycle hook to chown the volume data to the correct permissions. This may not work for all applications, because the postStart lifecycle hook may run too late, but it's more secure than running the container as root and then fixing permissions and dropping root (or using gosu) in the entrypoint script.
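A rough sketch of that kind of postStart hook (image name, UID/GID, and mount path are assumptions; note the caveats above about the hook possibly running too late and needing enough privilege to chown):
```
containers:
- name: app
  image: my-app:latest                 # hypothetical image
  volumeMounts:
  - name: data
    mountPath: /data
  lifecycle:
    postStart:
      exec:
        command: ["sh", "-c", "chown -R 1000:1000 /data"]
```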
@robbyt commented
As a workaround, I use a postStart lifecycle hook to chown the volume data to the correct permissions. This may not work for all applications, because the postStart lifecycle hook may run too late, but it's more secure than running the container as root and then fixing permissions and dropping root (or using gosu) in the entrypoint script.
We use initContainer; can a lifecycle hook have a different securityContext than the container itself?
It's sad to see that, after having to do research on this again, @chicocvenancio's option (which I use as well) is still apparently the only way to achieve this.
I understand where the problem is coming from and why we are so reluctant to change this; however, especially for Secret volumes, changing the UID of volumes can be essential.
Here is an example from the PostgreSQL world: mount a TLS client cert for your application with a secret volume. As recommended everywhere, you don't run your container as root. However, the postgres connection library will instantaneously complain that the key is world readable. "No problem" you think and you change the mode / default mode to match the _demanded_ 0600 (which is very reasonable to demand that as a client library). However, now this won't work either, because now root is the only user which can read this file.
The point I'm trying to make with this example is: groups don't come to the rescue here.
Now PostgreSQL is definitely a standard database and a product that a lot of people use. And asking for mounting client certs in a way with Kubernetes that do not require an initContainer as a workaround is not too much to ask imho.
So please, let's find some middle ground on this issue, and not just close it. :pray:
I'm trying to mount a ssh-key to user's .ssh directory with defaultMode 0400 so the application can ssh without a password. But that doesn't work if the secret is mounted as owned by root. Can you explain again how this can be solved using fsGroup or some other such mechanism?
I don't see a solution if PodSecurityPolicy is enabled so applications cannot run as root. Please advise.
I am still hopelessly confused about this bug. There seems to be about 6 things being reported that all fail the same way but are different for different reasons.
Can someone explain, top-to-bottom the issue (or issues) in a way that I can follow without having to re-read the whole thread?
Keep in mind that Volumes are defined as a Pod-scope construct, and 2 different containers may run as 2 different UIDs. Using group perms is ideal for this, but if it is really not meeting needs, then let's fix it. But i need to understand it first.
@saad-ali for your radar
@thockin My use-case is very simple. I'm injecting a secret (ssh key) into a container that is not running as root. The ssh key in /home/username/.ssh must have 400 permission which I can do, but must also be owned by the UID, or it won't work. I don't want to give this pod any root privilege of any sorts, so an init container that modifies the UID of the file does not work for me. How do I do it, other than including the ssh-key in the image?
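For context, this is roughly as far as the current API goes for that case: defaultMode can produce 0400 files, but there is no field to set the owner, so the key still ends up owned by root (the names below are placeholders):
```
containers:
- name: app
  image: my-app:latest          # hypothetical image running as a non-root user
  volumeMounts:
  - name: ssh-key
    mountPath: /home/username/.ssh
    readOnly: true
volumes:
- name: ssh-key
  secret:
    secretName: my-ssh-key      # hypothetical Secret holding the private key
    defaultMode: 0400           # 0400 in YAML (256 in JSON); ownership remains root
```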
Right. That seems clear. If we fixed this, does it apply to every use case herein? Or is there more?
I could see adding a user field in various places.
e.g. start with this. Implement it and add some tests. I need a volunteer to carry the football. It's not a ton of effort, honestly, probably less than a day, but I won't have time to spend on it any time soon.
Anyone?
diff --git a/staging/src/k8s.io/api/core/v1/types.go b/staging/src/k8s.io/api/core/v1/types.go
index 99159ee75a..e98c035528 100644
--- a/staging/src/k8s.io/api/core/v1/types.go
+++ b/staging/src/k8s.io/api/core/v1/types.go
@@ -1048,6 +1048,9 @@ type SecretVolumeSource struct {
 	// mode, like fsGroup, and the result can be other mode bits set.
 	// +optional
 	DefaultMode *int32 `json:"defaultMode,omitempty" protobuf:"bytes,3,opt,name=defaultMode"`
+	// Optional: user ID to use on created files by default. Default is implementation-defined.
+	// +optional
+	DefaultUser *int64 `json:"defaultUser,omitempty" protobuf:"varint,4,opt,name=defaultUser"`
 	// Specify whether the Secret or it's keys must be defined
 	// +optional
 	Optional *bool `json:"optional,omitempty" protobuf:"varint,4,opt,name=optional"`
@@ -1474,6 +1477,9 @@ type ConfigMapVolumeSource struct {
 	// mode, like fsGroup, and the result can be other mode bits set.
 	// +optional
 	DefaultMode *int32 `json:"defaultMode,omitempty" protobuf:"varint,3,opt,name=defaultMode"`
+	// Optional: user ID to use on created files by default. Default is implementation-defined.
+	// +optional
+	DefaultUser *int64 `json:"defaultUser,omitempty" protobuf:"varint,4,opt,name=defaultUser"`
 	// Specify whether the ConfigMap or it's keys must be defined
 	// +optional
 	Optional *bool `json:"optional,omitempty" protobuf:"varint,4,opt,name=optional"`
@@ -1541,6 +1547,9 @@ type ProjectedVolumeSource struct {
 	// mode, like fsGroup, and the result can be other mode bits set.
 	// +optional
 	DefaultMode *int32 `json:"defaultMode,omitempty" protobuf:"varint,2,opt,name=defaultMode"`
+	// Optional: user ID to use on created files by default. Default is implementation-defined.
+	// +optional
+	DefaultUser *int64 `json:"defaultUser,omitempty" protobuf:"varint,4,opt,name=defaultUser"`
 }
 // Projection that may be projected along with other supported volume types
@@ -1581,6 +1590,9 @@ type KeyToPath struct {
 	// mode, like fsGroup, and the result can be other mode bits set.
 	// +optional
 	Mode *int32 `json:"mode,omitempty" protobuf:"varint,3,opt,name=mode"`
+	// Optional: user ID to use on this file.
+	// +optional
+	User *int64 `json:"User,omitempty" protobuf:"varint,4,opt,name=User"`
 }
 // Local represents directly-attached storage with node affinity (Beta feature)
@@ -5080,6 +5092,9 @@ type DownwardAPIVolumeSource struct {
 	// mode, like fsGroup, and the result can be other mode bits set.
 	// +optional
 	DefaultMode *int32 `json:"defaultMode,omitempty" protobuf:"varint,2,opt,name=defaultMode"`
+	// Optional: user ID to use on created files by default. Default is implementation-defined.
+	// +optional
+	DefaultUser *int64 `json:"defaultUser,omitempty" protobuf:"varint,4,opt,name=defaultUser"`
 }
 const (
@@ -5103,6 +5118,9 @@ type DownwardAPIVolumeFile struct {
 	// mode, like fsGroup, and the result can be other mode bits set.
 	// +optional
 	Mode *int32 `json:"mode,omitempty" protobuf:"varint,4,opt,name=mode"`
+	// Optional: user ID to use on this file.
+	// +optional
+	User *int64 `json:"User,omitempty" protobuf:"varint,5,opt,name=User"`
 }
 // Represents downward API info for projecting into a projected volume.
@vikaschoudhary16 @derekwaynecarr this has some overlap / implications for user-namespace mapping.
@rezroo a workaround could be to simply make a copy of the ssh key in an init container; that way you'll be able to control who owns the file, right? Provided the init container runs as the same user that needs to read the ssh key later. It's a little gross, but "should" work I think.
@thockin another use-case: I'm trying to run an ELK StatefulSet. The pod has an Elasticsearch container running as non-root. I'm using a volumeClaimTemplate to hold the Elasticsearch data. The container is unable to write to the volume though, as it is not running as root. K8s v1.9. The pod has multiple containers and I don't want to use the same fsGroup for all of them.
@pearj that's exactly the workaround that everybody uses ... and as the name says: it's a workaround, and should get addressed :) ... However, there is also a problem with this workaround: updated secrets will eventually get updated in mounted volumes which will make it possible to act on a file change in the running pod; you will miss out on this update when you copy it from an init container.
@pearj @mheese This work around wouldn't work for me anyway - because our PodSecurityPolicy doesn't allow containers to run as root - normal or init containers - doesn't matter - no one can access a secret owned by root as far as I can tell.
Yet another use case for this: I'm working on using XFS quotas (obviously, if XFS is in use) for ephemeral storage. The current enforcement mechanism for ephemeral storage is to run du periodically; in addition to being slow and rather coarse granularity, it can be faked out completely (create a file, keep a file descriptor open on it, and delete it). I intend to use quotas for two purposes:
Hard cap usage across all containers of a pod.
Retrieve the per-volume storage consumption without having to run du (which can bog down).
I can't use one quota for both purposes. The hard cap applies to all emptydir volumes, the writable layer, and logs, but a quota used for that purpose can't be used to retrieve storage used for each volume. So what I'd like to do is use project quotas in a non-enforcing way to retrieve per-volume storage consumption and either user or group quotas to implement the hard cap. To do that requires that each pod have a unique UID or single unique GID (probably a unique UID would be best, since there may be reasons why a pod needs to be in multiple groups).
(As regards group and project IDs being documented as mutually exclusive with XFS, that is in fact no longer the case, as I've verified. I've asked some XFS people about it, and they confirmed that the documentation is out of date and needs to be fixed; this restriction was lifted about 5 years ago.)
@robbyt please tell how you managed to chown with postStart? My container runs as a non-root user, so postStart still uses non-root permissions and can't change ownership:
chown: /home/user/: Operation not permitted, message: "chown: /home/user/: Permission denied\nchown: /home/user/: Operation not permitted
Same problem here: we have some Dockerized Tomcats that run our web application, and we use JMX to monitor them. We want to serve the jmxremote user and jmxremote password as secrets, but Tomcat, which obviously doesn't run as root, wants the JMX files to be readable only by the user that runs Tomcat.
Addendum: we have many Tomcats, and want to run each of them as a different user.
the same problem!
For now, the hack that works is to set the user to root at the end of your Dockerfile and set a custom entrypoint script. Chown the volume in your custom entrypoint script, then use gosu to run the default entrypoint script as the default user. The thing I hate about this is that I have to do it for every single image that uses a volume in Kubernetes. Totally lame. Please provide a UID/GID option on the volume mount or volume claim config.
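A sketch of that hack, assuming the image's unprivileged user is called app, the volume is mounted at /data, gosu is installed, and the image's original entrypoint is /docker-entrypoint.sh:
```
#!/bin/sh
# custom-entrypoint.sh - runs as root (the Dockerfile ends with USER root),
# fixes ownership of the mounted volume, then drops privileges.
set -e
chown -R app:app /data                     # assumed user/group and mount path
exec gosu app /docker-entrypoint.sh "$@"   # re-exec the original entrypoint as the app user
```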
That hack doesn’t work if you want to run a secure Kubernetes cluster with PodSecurityPolicies applied to enforce pods to run as a non-root user.
True. All hacks have their downsides. It's either that or logging in as root after the volume is created and chowning the directory manually. Not sure really which is worse :-D. Can't believe this is even a thing.
@thockin from what I gather following this issue now since nearly 4 years, the solution that everybody wants is to be able to set a uid and gid for a volume - in particular secret volumes (but not only those). On Jul 6 you posted a starting point for a possible solution to this. If this is a supported path from the maintainers, I'd finally start and try to solve this problem.
@mheese i'd say go for it.
@mheese I'll collab on a PR if you want?
@mheese It seems gid is taken from the SecurityContext, so I guess, for fast relief, a uid implementation would be enough. Also because gid had more second guesses in the discussion.
same issue with psql here
@michalpiasecki1 Look how I solved it with a PostStart hook: https://github.com/xoe-labs/odoo-operator/blob/1be88b67d4ded5c4a0aea6e26b711241f0d09f89/pkg/controller/odoocluster/odoocluster_controller.go#L579-L586
Running into the same issue. Is there any recommended solution for this?
@blaggacao: thanks for the hint, however I found another workaround.
@debianmaster: I would recommend securityContext and fsGroup as described in https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
I ended up with certificates owned by root and group postgres, with permissions 440.
@michalpiasecki1 can you give more info about how you resolved this for postgres?
I have the server.crt and server.key files stored in a k8s secret pg-certs-secret and I want to mount them into my container running postgres:9.6. I have this set up with:
containers:
- name: pg
  image: postgres:9.6
  ...
  volumeMounts:
  - name: pg-certs
    mountPath: "/etc/certs/"
    readOnly: true
  args: ["-c", "ssl=on", "-c", "ssl_cert_file=/etc/certs/server.crt", "-c", "ssl_key_file=/etc/certs/server.key"]
volumes:
- name: pg-certs
  secret:
    secretName: pg-certs-secret
    defaultMode: 384
But deploying this, the container dies with the error FATAL: could not load server certificate file "/etc/certs/pg_server.crt": Permission denied
I assume this is because the certs are loaded so that they are owned by root, when they need to be owned by postgres. It's not clear from the docs etc. what I should do to change ownership short of creating a custom Docker image, which I'd rather not do. The securityContext and fsGroup you suggested seem like they could work, but I would appreciate it if you would share more info about how exactly you achieved this.
Also worth noting: I added defaultMode: 384 to ensure the files were added with 0600 file permissions. Before I added that, the container died with the error:
FATAL: private key file "/etc/certs/pg_server.key" has group or world access
DETAIL: File must have permissions u=rw (0600) or less if owned by the database user, or permissions u=rw,g=r (0640) or less if owned by root.
For reference, I just figured this out and it worked when I added
securityContext:
  fsGroup: 999
to the spec.
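Putting the pieces of this exchange together, the working combination for the postgres certificate case looks roughly like this (a sketch; 999 is the postgres GID in the image used above, so confirm it with id -g inside the container, and double-check the resulting file modes since fsGroup also adjusts group ownership on the mounted files):
```
spec:
  securityContext:
    fsGroup: 999                  # GID of the postgres user in the image
  containers:
  - name: pg
    image: postgres:9.6
    args: ["-c", "ssl=on", "-c", "ssl_cert_file=/etc/certs/server.crt", "-c", "ssl_key_file=/etc/certs/server.key"]
    volumeMounts:
    - name: pg-certs
      mountPath: "/etc/certs/"
      readOnly: true
  volumes:
  - name: pg-certs
    secret:
      secretName: pg-certs-secret
      defaultMode: 384            # 0600, as noted above
```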
I have the same problem: #72085. Can anyone help me?
@izgeri I tried the link but it's not working, can you help me?
Is there any chance of fixing this issue? Are the Kubernetes folks working on a solution?
This issue has not been a problem for us for a very long time. We set the "fsGroup" in the pod's security context to match the group ID of the user that runs the main Docker entry point, and any volumes in the pod become accessible to that container's main process:
https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
Note that the proper group ID will vary depending on how the Docker container is created and run. I usually ascertain it by kubectl exec-ing into a shell in the pod and typing id -g.
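For example (pod and container names are placeholders):
```
# Primary GID of the container's user; use this value as fsGroup
kubectl exec -it my-pod -c my-container -- id -g
```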
@charles-crain: Your suggestion works really well for most cases.
Here's another case that's not covered:
If the container starts as root but uses a tool such as gosu to become another user (for some processes), then locking the container into only one group with fsGroup will prevent cases such as "I want my non-root user to have access to SSH keys mounted into its ~/.ssh directory, while my root user also has access to other mounts".
One example of this: "a DinD container where dockerd must start as root, but subsequent containers are run by a non-root user".
Hi there @charles-crain, I am facing a very interesting issue that matches the topic of this thread. It seems fsGroup does not work for all cases.
Here is an example deployment, a test nginx deployment where I am trying to mount NFS and additionally mount an empty directory, just to compare.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-test
  labels:
    app: nginx-test
spec:
  selector:
    matchLabels:
      app: nginx-test
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: nginx-test
    spec:
      securityContext:
        fsGroup: 2000
      volumes:
      - name: nfs-volume
        nfs:
          server: # nfs with no_root_squash
          path: /nfs
      - name: test-fs-group
        emptyDir: {}
      containers:
      - image: nginx
        name: nginx-test
        imagePullPolicy: Always
        volumeMounts:
        - name: nfs-volume
          mountPath: /var/local/test5
        - name: test-fs-group
          mountPath: /var/local/test6
When I exec bash into the pod's nginx container, the GID is applied only to the emptyDir, and not to the dir mounted for NFS. NFS is configured with no_root_squash for testing purposes, and the process in my container runs as a non-root user, so that is the problem. It can be solved via chown, however I am trying to achieve it with a native solution.
I also face the exact same issue described above.
This issue has been open for like 5 years. No one from Kubernetes is interested in it, and it may be for a reason, valid or not. There were a number of valid solutions to this simple problem but none of them were implemented.
Not sure why this issue doesn't just get closed.
@mheese As you commented here https://github.com/kubernetes/kubernetes/issues/2630#issuecomment-424876108 about setting a uid and gid for a volume, are you still working on it? Thanks!
I also encountered this issue. Is there any plan to create a viable solution for it? Local persistent volumes can't replace all use-cases of hostPath volume
Same here.
@jingxu97 I haven't given it a try yet because I don't really feel that there is a consensus that this is what should be done.
Let me come up with a detailed proposal and post it here when ready.
Ok
For reference, I just figured this out and it worked when I added
securityContext: fsGroup: 999
to the spec.

For postgres:11.1-alpine use this:
securityContext:
  fsGroup: 70
I can only hope that the Kubernetes members prioritize this issue. IMO it's really a blocker, especially from a security point of view, and it's becoming a vulnerability risk :'(
long-term-issue (note to self)
I'm hitting this in context of cert-manager managed secrets. I was using an initContainer to copy the certs to the right place and update the permissions. Cert-manager needs to update the secrets in place so that trick won't work. I'll explore the fsGroup workaround.
@incubus8
I am trying to working on this issue. Could you please describe your use case and what kind of behavior you would expect? Thanks!
@jingxu97 I can offer two examples. The Prometheus docker image starts the prometheus service as user nobody (uid 65534) and the Grafana docker image starts grafana as uid=472 (https://grafana.com/docs/installation/docker/).
Both of these fail to create directories when they first start up because of these permissions. I've worked around this in my setup with an initContainer that creates the required directories and chowns them appropriately.
@ford-prefect if you set fsGroup in PodSecurityContext, and runAsUser, wouldn't those services have the permission to write?
No, because the permissions are set for the pod, not for the volume that was created independently. It would be great if PodSecurityContext could in fact alter the permissions of the volumes, or at least fail to mount and throw an error.
@ekhaydarov, in the current setVolumeOwnership function, if fsGroup is provided, the volume will have rw-rw---- permissions, so the group has rw permission. And when the container is started, it will set up the supplemental group so that it can read and write to the volume. Anything I am missing?
@jingxu97 this is not always a solution. For example: we use secrets for jmxremote.password and jmxremote.user, which are needed for JMX monitoring of Java applications. Java requires that those files belong to the user that runs the application and that they have permissions 400, so for now there is no way to use secrets this way in Rancher 2.x.
I was perplexed to see that fsGroup was an option and fsUser was not.
Also, the permissions/mode portion of this is confusing. We should make it clearer how volumes like EmptyDir get their default mode or allow the user to set it explicitly, as this is a pretty normal unix admin task.
If root is the only user that can ever own your volume (aside from using an initContainer to chmod it at runtime), the API encourages usage of root for an application's user which is a weak security practice.
@jingxu97 What do you think?
@stealthybox, thank you for the feedback. I am currently working on a proposal for API on volume ownership and permission and will share with the community soon. Feedback/comments are welcome then.
Hi.
Is there any news about this issue?
Why does pv.beta.kubernetes.io/gid not work for the local host path provisioner?
Hey,
I am encountering this as well, I'd appreciate some news :).
this has been my workaround so far:
initContainers:
- name: init
  image: busybox:latest
  command: ['/bin/chown', 'nobody:nogroup', '/<my dir>']
  volumeMounts:
  - name: data
    mountPath: /<my dir>
The workarounds with chowning do not work for read-only volumes, such as secret mounts, unfortunately.
I would need this as well (pretty urgently), because we have software not starting due to permissions not being able to be anything other than 0600. If we could mount the volume under a specific UID, my (and others') problem would be solved.
You can run a job as part of your deployment to update the volume permissions and use a ready state to check for write permission as a workaround. Or you can use fsGroup to specify the group for the volume and add the application user to the group that owns the volume. Option 2 seems cleaner to me. I used to use option 1 but now I use option 2.
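A sketch of option 2 (the UID/GID, image, and claim name are assumptions; the fsGroup GID is also added as a supplemental group of the container processes, which is what gives the application user access):
```
spec:
  securityContext:
    runAsUser: 1000     # assumed non-root application user
    fsGroup: 2000       # group that will own the volume
  containers:
  - name: app
    image: my-app:latest            # hypothetical image
    volumeMounts:
    - name: data
      mountPath: /var/lib/app       # assumed data path
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: app-data           # hypothetical claim
```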
Note that if Kubernetes did support an fsUser option, then you'd trip over https://github.com/kubernetes/kubernetes/issues/57923, where all files within the mounted secret would be given 0440 permission (or 0660 for writeable mounts) and would ignore any other configuration.
@woodcockjosh fsGroup doesn't cover the use case of security-sensitive software such as Vault trying to run as vault:vault and loading a private key file requiring permissions equal to or less than 0600. @wjam fsUser would be ideal if we could get 0400 permissions set as well (for things like private key files).
We hit this trying to configure Vault to authenticate to a PostgreSQL DB with certificates. The underlying Go library hard fails if the permission bits differ (https://github.com/lib/pq/blob/90697d60dd844d5ef6ff15135d0203f65d2f53b8/ssl_permissions.go#L17).
@jingxu97: Is there any news on that? We still have the PV ownership problem in our clusters with strict security policies.
This article looks like it works. I didn't test it, but I'll test it on Monday; if anyone can do it before then, please let us know.
The detail is here
Data persistence is configured using persistent volumes. Due to the fact that Kubernetes mounts these volumes with the root user as the owner, the non-root containers don't have permissions to write to the persistent directory.
The following are some things we can do to solve these permission issues:
Use an init-container to change the permissions of the volume before mounting it in the non-root container. Example:
```
spec:
  initContainers:
  - name: volume-permissions
    image: busybox
    command: ['sh', '-c', 'chmod -R g+rwX /bitnami']
    volumeMounts:
    - mountPath: /bitnami
      name: nginx-data
  containers:
  - image: bitnami/nginx:latest
    name: nginx
    volumeMounts:
    - mountPath: /bitnami
      name: nginx-data
```
Use Pod Security Policies to specify the user ID and the FSGroup that will own the pod volumes. (Recommended)
```
spec:
  securityContext:
    runAsUser: 1001
    fsGroup: 1001
  containers:
  - image: bitnami/nginx:latest
    name: nginx
    volumeMounts:
    - mountPath: /bitnami
      name: nginx-data
```
Hi,
I've seen all around the Internet the workaround with that weak initContainer running as root.
I've also been struggling with fsGroup, which applies only at the scope of the pod, not to each container in a pod, which is [also] a shame.
I just built a custom image (nonroot-initContainer) based on alpine, with sudo installed and a custom /etc/sudoers giving my non-root user full power to apply the chmod actions. Unfortunately, I'm hitting another wall with:
sudo: effective uid is not 0, is /usr/bin/sudo on a file system with the 'nosuid' \
option set or an NFS file system without root privileges?
Since I'm not willing to create a less secure PodSecurityPolicy for that deployment, any news on this issue would be very welcome for people having to stay compliant with security best practices.
Thanks in advance!
Is there fsGroup for Kubernetes Deployment files?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
👍
Is this still an issue? I've done some tests (Minikube 1.14, 1.15, 1.19 and EKS 1.14) and the permissions on the emptyDir volume are 777 as intended:
apiVersion: v1
kind: Pod
metadata:
  name: debug
  namespace: default
spec:
  containers:
  - image: python:2.7.18-slim
    command: [ "tail", "-f" ]
    imagePullPolicy: Always
    name: debug
    volumeMounts:
    - mountPath: /var/log/test-dir
      name: sample-volume
  volumes:
  - emptyDir:
      sizeLimit: 10M
    name: sample-volume
Writing files in the dir, works with any user as expected.