The emptyDir volumeMount is owned by root:root with permissions set to 750
hostDir is the same but with 755 permissions
Containers running with a non-root USER can't access the volumes
Related discussion at https://groups.google.com/forum/#!topic/google-containers/D5NdjKFs6Cc
and Docker issue https://github.com/docker/docker/issues/9360
hostDir should get the same permissions as the existing host entry, though I am not sure we ensure a host directory exists before using hostDir
Part of the problem here is that different containers can run as
different users in the same pod - which user do we create the volume
with? What we really need is a way to tell docker to add supplemental
group IDs when launching a container, so we can assign all containers
in a pod to a common group.
Would it be reasonable to add user and/or permissions option to volumeMounts or emptyDir to explicitly force it?
I don't think that we want that in the API long-term, so I'd rather apply a
hidden heuristic like "chown to the USER of the first container that mounts
the volume" or even "ensure that all VolumeMounts for an emptyDir Volume
have the same USER, else error". Do you think such heuristics would hold?
That sounds good to me
This is a good starter project
Background
Inside a docker container, the primary process is launched as root by default. And, currently, docker containers cannot be run without root privileges (once docker supports user namespaces, a process inside a container can run as root, and the container root user could actually be mapped to a normal, non-privileged user outside the container). However, even today, a process inside a docker container can run under a non-privileged user: the Docker image can create new users and then force docker to launch the entry point process as that user instead of root (as long as that user exists within the container image).
When an external volume is mounted, its ownership is set to root (UID 0), so unless the process inside the container is launched as root, it won't have permission to access the mounted directory.
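For illustration, a minimal pod sketch of the failure mode (the image name is an assumption; imagine its Dockerfile declares a non-root USER). The emptyDir below is created root-owned on the host, so the non-root process cannot write to /data:
apiVersion: v1
kind: Pod
metadata:
  name: nonroot-volume-demo        # illustrative name
spec:
  containers:
  - name: app
    image: example/nonroot-app     # assumed image whose Dockerfile sets USER to a non-root account
    volumeMounts:
    - name: data
      mountPath: /data             # directory is owned by root:root, so writes fail for the non-root USER
  volumes:
  - name: data
    emptyDir: {}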
Proposed Workarounds on the Kubernetes side
Both approaches feel to me like they are breaking a layer of abstraction by having Kubernetes reach into the container to figure out what user the main process would start as, and doing something outside the container with that information. I feel like the right approach would be for the containers themselves to CHOWN any "mounted volumes" during setup (after creating and setting user).
Thoughts?
@thockin, after talking to some folks, I think @carlossg's approach of explicitly specifying the user in the API would be the cleanest workaround. I don't think we can apply "hidden heuristics" without doing icky violation of abstractions (like reaching into a container to figure out what username to use and then mounting the container's /etc/passwd file to figure out the associated UID).
Proposal to modify the API:
- Extend the API for EmptyDir, GitRepo, and GCEPersistentDisk volumes to optionally specify an unsigned integer UID.
- If the UID is specified, the host will change the owner of the directory to that UID and set the permissions to 750 (User: rwx, Group: r-x, World: ---) when the volume directory is created.
- If the UID is not specified, the host will not change the owner, but will set the permissions to 757 (User: rwx, Group: r-x, World: rwx), i.e. world writable, when the volume directory is created.
- HostDir volumes would be left untouched, since those directories are not created by Kubernetes.
- Require UID instead of username string so there are no problems if the user does exist on the host machine (issue 2.ii above).
Thoughts?
CC: @bgrant0607, @dchen1107, @lavalamp
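As a concrete sketch of the shape this proposal could take in the volume spec (the uid field below is hypothetical and was never adopted; the other names are illustrative):
volumes:
- name: scratch
  emptyDir:
    uid: 1000                  # hypothetical field: host would chown the directory to UID 1000 and set mode 750
- name: disk
  gcePersistentDisk:
    pdName: my-data-disk       # illustrative disk name
    fsType: ext4
    uid: 1000                  # hypothetical field with the same semantics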
I think adding UID to volumes is a hack and redundant. I'd rather we do
the right thing and get Docker to support supplemental group IDs.
https://github.com/docker/docker/issues/9360
@saad-ali I think HostDir should not be left untouched. Let's consider this: Hadoop on restart, restores the blocks from the directory it stores data in. If we use emptyDir, the container which restarted will get another directory and the previous data will be lost. And Hadoop requires the permissions and ownership of directory to be set to the user starting Hadoop (hdfs). If HostDir is not allowed to change permissions as per user, then similar use cases to this cannot be achieved. Please, comment.
Define restart? Do you mean the container crashed and came back, or do you
mean the machine rebooted and a new pod was scheduled and expects to be
able to reclaim the disk space used by the previous pod? Or something else?
@thockin Restart could be anything. It could be after pod failure or container failure. Or the container could be restarted, after changing some configurations (Hadoop needs to be restarted after changes in configs). Does that answer?
This document mentions that when a pod is unbound, the emptyDir is deleted. In use case of Hadoop, the data might be essential and might be required when another pod of Hadoop comes back (or the container restarts). So, HostDir must be used to persist data even the pod is unbound. But Hadoop requires permissions to be set for the user for the data directory. Hope this explains.
With docker/libcontainer/pull/322, docker containers now allow specifying AdditionalGroups (supplementary group GIDs). So an updated proposal to handle shared volumes amongst different containers in a pod:
For EmptyDir, GitRepo, or GCEPersistentDisk volumes for a new pod, Kubelet will:
- Create the volume directory owned by a common pod group and set its permissions to 770 (User: rwx, Group: rwx, World: ---).
- Pass that group to every container in the pod via AdditionalGroups in the docker container configs.
This depends on docker passing AdditionalGroups through to libcontainer (https://github.com/docker/docker/issues/9360) and on fsouza/go-dockerclient adding support for AdditionalGroups.
For HostDir volumes for a new pod, Kubelet will:

There's an important distinction between a container restarting and a pod being removed. When a container restarts, the data in a normal emptyDir volume is safe. When a pod is removed, it should be GONE. Leaving host data and expecting it to be there at some later point in time is awkward at best.
All of this is more complicated as soon as user namespaces land.
If I'm understanding the current proposal correctly, I think this is going to create surprising behavior for a certain class of applications.
Many older applications which bind to low (privileged) ports start first as root, then immediately drop privileges to some other user. In such a scenario, the container must be configured to start the application as root, and so the original user (root) would have access to the volume. Once the application calls setuid(2)/seteuid(2), though, it won't have access anymore. Now the only way to get access to that directory is to modify the container to chown the volume before starting the application itself. This is the situation I'm currently in.
Due to this, I'd like to voice another opinion in favor of extending the API to allow explicitly specifying UID and GID, as I don't think the current proposal covers all possible (reasonable) use cases.
At a minimum the emptydir should use the UID/GID/Labels of the security context (if specified).
I think adding UID to volumes is a hack and redundant. I'd rather we do the right thing and get Docker to support supplemental group IDs.
+1, but also:
At a minimum the emptydir should use the UID/GID/Labels of the security context (if specified).
Now that we have security context API in place, I think we should make emptyDir work with the security context of the containers in a pod. One little wrinkle to iron out about this is that volumes are pod-scoped while security contexts are container scoped. I think you will have to look at which containers have the volume mounted and what their security contexts are. If there's a single container that mounts an emptyDir, it's easy -- use the security context of that container. If there are multiple, it gets dicey:
I think I will probably start prototyping this concentrating on the simple case where there's a single container. I will use the security context of the first container that mounts the volume in the pod spec at first and we can change the strategy for determining the context to use as discussion goes.
@smarterclayton @thockin @erictune @pweil-
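For concreteness, a sketch of that wrinkle (image names and UIDs are illustrative): one emptyDir mounted by two containers whose security contexts request different users, so there is no single obvious owner for the volume.
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-wrinkle      # illustrative name
spec:
  containers:
  - name: writer
    image: example/writer          # assumed image
    securityContext:
      runAsUser: 1000              # this container would want the volume owned by 1000
    volumeMounts:
    - name: shared
      mountPath: /shared
  - name: reader
    image: example/reader          # assumed image
    securityContext:
      runAsUser: 2000              # this one runs as a different UID entirely
    volumeMounts:
    - name: shared
      mountPath: /shared
  volumes:
  - name: shared
    emptyDir: {}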
I don't really like the heuristics here. I acknowledge that I suggested a
heuristic but that was half a year ago :)
Other than Docker support not being done yet, why can't we do something
like:
Net result should be that all containers in the pod can access all volumes
in the pod without restriction, regardless of what UID each container is
using.
I don't know anything about SELinux labels, might be a problem. I assert that all volumes in a pod should be available to all containers.
Next we have to define "all volumes". emptyDir is pretty obvious. What happens to hostPath mounts that did not exist and were created for this use? Seems reasonable. What about hostPath mounts that existed before this use - no way can we change those. What about things like PDs? Do we run a recursive chown/chgrp/chmod? Blech.
@mrunalp for docker support status on supplemental groups.
Waiting for https://github.com/docker/libcontainer/pull/603 to be merged.
I assert that all volumes in a pod should be available to all containers
Spent a lot of time thinking about this last night, and I agree with you on this point.
I don't know if that means we need a pod-level security context or
something, though :)
Although - I may collocate two containers and not want them to share contents (db and web logs) but share a work for.
I don't find the case of two containers in a pod needing different access
control to a volume to be very compelling. The alternative is to spec a
full security context for volumes and then force the complexity back onto
API users.
The argument here seems to be that you don't need intra pod security more complex than "don't mount the same volume into different contexts". I'm ok with a single security context for the pod - just pointing out that if you want to have complex, secure pods, you may want to use user isolation between the containers to secure disk contents.
Hrm.
Although the pod security context is unlikely to work for most real containers once user namespaces land - the UID a container runs as (in user namespaces) is really tied to the container, not the pod. So either that has to be a default security context at the pod level (overridable) or it's instead the volume security context.
We should share user namespace across the pod.
The user namespace of the two containers should probably be in the same range. But the UID of container A and B are not required to be ==, and in many cases you don't want them to be trivially ==, because you may want to "read" the volume but not write it.
@smarterclayton
The user namespace of the two containers should probably be in the same range. But the UID of container A and B are not required to be ==, and in many cases you don't want them to be trivially ==, because you may want to "read" the volume but not write it.
Do you think we could infer this by whether the readOnly flag is set on the VolumeMount?
Some thoughts...
- When mounting a volume for a container it could inherit the SC of the container if it has nothing set.
- For a complex case we could spec out the SC for the volume to support ranges of SELinux labels as mentioned before and in this case it would not inherit the SC of the volume.
- For a predefined volume SCs the container's SC would need to be allocated in a manner consistent with the desired security (ie. volume has range s0:c1.c10 and container has s0:c1,c2, GID that has only read, etc) to facilitate custom, complex approaches with fine grained access.
When mounting a volume for a container it could inherit the SC of the container if it has nothing set.
Hrm - that means a pod with one container would behave differently from a pod with two containers?
Hrm - that means a pod with one container would behave differently from a pod with two containers?
And this is where some of the complexity comes in from being flexible. It is a valid use case for a single pod to have different security contexts for every container in the pod. And likewise, in OpenShift, if the containers make SC requests, each container may validate against different SCCs. In that case, it seems like a predefined SC on the volume should be used with container SCs that comply. Inheriting is simply an ease of use feature.
Another idea to throw against the wall - being able to go from a pre-defined SC on a volume (inherited or not) to a RunInRange policy and validating that all container SCs that request the volume will have some sort of access.
Security aside -- can we agree that a PR that relaxes the mode is a Good Thing? I would like it if non-root uids could use volumes and junk.
@thockin @smarterclayton
I don't know what "relaxes the mode" means - can you be more concrete?
My bad. I meant, make emptyDir 0777 instead of 0700.
At a minimum for us (Paul) we could use the sc of the first container that mounts the volume until we get a broader solution. Users can then control ordering and we'll at least be unbroken.
@smarterclayton that's what I was going to next. We still need to fix the
non-root UID case when SELinux isn't in play -- which I think we need the
0777 mode to do until we have supplemental groups in docker.
I do agree that a separate security context for the pod volumes (individual or group) is probably necessary.
I do agree that a separate security context for the pod volumes (individual or group) is probably necessary.
I think so too after having my brain thoroughly pretzelized while thinking through all the permutations of what you would have to handle without one. :curly_loop: :curly_loop: :curly_loop:
For comparison, we are intentionally NOT this flexible for things like
RestartPolicy - defining sane semantics for this sort of edge case is just
not worth the effort. Is it really worth the effort for security context?
Can't we make it 0770 and set the group ID for it and every container in
the pod? It's coarse but better than 0777 and closer to where we should go
(IMO supplemental GIDs). As far as I see, Security Context does not yet
allow a container to set GID...
I could live with this iff it comes with a giant TODO and docs in the right
places.
I really really want to lean on simple assumptions intra-pod. Adding SC on
volumes is not simpler.
Volumes plural, not volumes singular. I'm pretty sure different uids per container is absolutely valid, so we can't guess a uid based on containers alone and be predictable. So we either need a pod default sc or a subset of sc applied to all volumes. If the uid on the directory is wrong for that sc that has other security implications. And labels _have_ to match or you get nothing.
So the two options seem to be:
Pod level default sc kind of makes sense, while the first option also makes sense but is order dependent and somewhat implicit.
I agree group is generally useful. It wouldn't work for labels though.
Yeah, I don't know selinux at all (it has only ever given me problems that
I don't know how to solve - same as here).
Treat it like 700 - if labels are different you can't see it, if they are you can.
In this context we're probably going to stay simple and say every volume has a single label, and every container has a single label, and we want everything in the pod to have the same label in 99% of cases. Eventually we may want a container to have a different label (which means it can access anything outside of its label, period).
Hrm
It doesn't look like docker exposes control of the group at all yet. Am I missing something, @thockin? Would the kubelet have to setgid on the container process? Sounds racy to me.
Multiple GH fail, my bad!
Hmm, I thought docker allowed setting GID. Damn.
@thockin :'-/
@thockin @pmorie The syntax for setting gid is --user "uid:gid". For e.g.
docker run -it --rm --user "1:777" busybox sh
@thockin @pmorie The syntax for setting gid is --user "uid:gid". For e.g.
@pmorie @smarterclayton - we probably want to patch this in to SCs and SCCs then.
Yes we do.
@pweil-
Docker's syntax is atrocious, please don't copy it.
@thockin agree, we need a distinct gid field in security context
Docker's syntax is atrocious, please don't copy it.
Definitely not, we'd add a RunAsGroup int64 field.
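Something along these lines in the container security context (a sketch only; runAsGroup is the proposed field and did not exist in the API at the time of this discussion):
securityContext:
  runAsUser: 1000     # existing field
  runAsGroup: 1000    # proposed field: primary GID for the container process, replacing docker's "uid:gid" syntax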
Depending on requirements, you might want to store user and group as strings. They are looked up by default in the passwd and group files. If they aren't found and are numeric then they are converted to uid/gid.
So I know it's a wild unreasonable idea, but from a purely correctness technical and layering PoV it seems to me like the right solution would be to just forbid/ignore the USER line in any docker file and make people set the uid of a container somewhere in the kube pod/container declaration. (although that still obviously suffers from the insanity of having to look at /etc/passwd inside the container to interpret the uid of something outside the container if we make this a string)
More seriously though, there doesn't seem to be ANY good solution until we get to user namespaces. Then we can generate a random uid/gid on the outside, chown the emptyDir directory to the uid/gid we picked and just let the container on the inside run as whatever uid/gid it wants. Nothing until we get to that point is anything but an ugly hack.
A relatedish point, just FYI: today (plain) docker is picking a random selinux context for every container. Docker recently accepted a new option -v /source:/dest:rw,z. The z portion is new. It means to do the equivalent of chown -R except it sets the selinux label on all files in the volume to the randomly generated label. They do not have an option to actually chown -R and set uid/gid but I know some people want that as well...
Personally, I think putting a security context on the volume and foisting the complexity on the user is the only 'clean' option unless we just ignore it entirely until user namespaces are a reality. One big reason I think this instead of jumping on @thockin's supplemental gid solution is that there is no selinux analog. selinux is used similarly to uids. You get one. And either it exactly matches or it doesn't. It is not like groups. You can't have more than 1 label. Even with user namespaces, this part is still going to be a PITA...
I agree - we plan to reject images with non-numeric usernames in some modes. You can't even trust /etc/passwd anyway, so you're just compensating for lazy image authors.
People need to write images to a known uid or work on any uid. But all the prep work to get to user namespaces can be done now. And user namespaces don't solve the ownership problem, because containers can have multiple users, so you still don't know which uid to map to unless the image tells you with a numeric uid.
This topic is relevant to @jmccormick2001's interests
@thockin @eparis Is there merit in peeling off a separate issue to discuss this problem for NFS and other !emptyDir volumes?
yes
What is the Kubernetes team's recommended workaround for this for the time being? The options I see are as follows:
- Run as root
- Run as root initially and have an entrypoint wrapper that chowns specified directories before dropping permissions (?? would this even work?)
Neither of them seems especially palatable.
I see that a fix was made to chmod emptyDirs to 777 but this isn't in 1.0.x (which is what Google Container Engine, my preferred deployment target, is using.) So it looks like I have to dig in and use one of the above solutions for now.
kubernetes v1.1 will be out soon with better support for this case!
@pmorie
@joshk0 we perform a chown from the containers' entry point (a bash script) and fall back to running our main process as a normal user using runuser or gosu.
Not ideal but it's the only way to do it on v1.0.
@joshk0 @antoineco We've just introduced API changes that will allow you to specify a supplemental group that owns emptyDir and its derivatives and some block device volumes: #15352
Paul,
Do we have a roadmap doc for the evolution of this? We've talked about
FSGid being auto-allocated at admission, then maybe being coupled to PVs
eventually. It would be nice to be able to see that plan all laid out.
@thockin I would love to write one up once I am freed up from finishing stuff for 3.1... I'll make an issue for it.
@pmorie looks good in terms of usability! Thanks a lot for working on this.
Will it also be possible to set a more fine-grained mode on the mountpoint? Right now RW volumes are mounted with 0777 (1.1-beta.1). I have a use case where this causes minor issues:
❯ kubectl exec mypod -c logrotate -- ls -ld /var/log/containers/rails/
drwxrwxrwx 2 9999 root 4096 Nov 6 14:35 /var/log/containers/rails/
❯ kubectl exec mypod -c logrotate -- logrotate /etc/logrotate.conf
error: skipping "/var/log/containers/rails/production.log" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to tell logrotate which user/group should be used for rotation.
@antoineco Are you using an emptyDir for that volume? I think I will create an issue to change the behavior so that emptyDirs are chmod a-rwx if FSGroup is specified.
Yes, it's an emptyDir volume.
@antoineco Okay, I am packing for a trip next week today, but I will make an issue for this and tag you into it. Do the semantics I mentioned work for you?
(Did I hear "KubeCon"? 😄)
What you suggested would work, but you probably meant o-rwx, which would set the mode and owner as follows: 0770 root:FSGroup. Or did I get everything wrong?
@antoineco You did :)
I think I articulated things the wrong way -- what I meant was that emptyDir should be 0770 root:fsgroup g+s.
@antoineco You suggested as a workaround for 1.0.X to perform a chown from the container's entrypoint. Do you have an example of that?
I can't seem to be able to see the mounted volume when I execute such command as the entrypoint. Is the volume mounted at a later stage?
@marcolenzo Nothing fancy, I use a tiny bash script as my entrypoint: ENTRYPOINT ["/run.sh"]
This script sets the correct permissions on the shared volume(s) and then starts my service. Example:
#!/bin/sh
# reset permissions on log volumes
vol=/var/log/nginx
if [ -d "$vol" -a "$(stat -c '%U' "$vol" 2>/dev/null)" = "root" ]; then
chown app "$vol"
chmod o-rwx "$vol"
fi
# startup
echo "+-- starting nginx..."
exec nginx "$@"
I noticed that the PR for docker/docker#9360 was merged: docker/docker#10717
Does this mean the group ID solution to this problem can now be implemented? I have had several cases where I had to do what @antoineco has had to do, i.e. create little mini startup scripts that keep me from being able to use many 3rd party Docker images as-is.
@charles-crain There are two ways you can work with supplemental groups via the API now:
Does that help? Let me know if you need more information.
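For anyone landing on this later, a minimal sketch of where those fields live in a pod spec (the names and group IDs below are placeholders):
```
apiVersion: v1
kind: Pod
metadata:
  name: supplemental-groups-demo   # hypothetical name
spec:
  securityContext:
    supplementalGroups: [2000]     # extra GIDs added to every container process
    fsGroup: 1000                  # GID that owns supported volumes (e.g. emptyDir)
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "id && sleep 3600"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir: {}
```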
@pmorie From which version are pod.spec.securityContext.supplementalGroups and pod.spec.securityContext.fsGroup supported? Does it also apply to non-existing hostPaths that are created when the POD starts for the first time?
@pmorie Looking through the commit logs it looks like fsGroup isn't supported until 1.2.0-alpha3 yes?
It seems to me this issue is only caused by Kubernetes relying on bind mounts. When using docker volumes (i.e. created using a volume driver, the docker volume command, or an implicit VOLUME in the Dockerfile) the volume is initialized with content from the docker image, and can be chowned to match image requirements. Can't Kubernetes use the volume API to do the same?
@thockin, what's the final decision on this issue? I hit the same problem when using Jenkins & GlusterFS: Jenkins runs as UID 1000 and the GlusterFS volume is mounted as UID 0, so Jenkins cannot access the FS :(.
[root@cdemo01 jenkins]# kc logs jenkins-3tozq
touch: cannot touch ‘/var/jenkins_home/copy_reference_file.log’: Permission denied
Can not write to /var/jenkins_home/copy_reference_file.log. Wrong volume permissions?
@andrejvanderzee @charles-crain my sincere apologies, I somehow missed the tags on this thread in the torrent of github alerts. FSGroup is supported as of 1.2.0.
Does it also apply to non-existing hostPaths that are created when the POD starts for the first time?
Host paths do not support FSGroup or SELinux relabeling, because those could provide an escalation path for a pod to take over a host. However, that said, host paths are not created if they do not exist. Could you be thinking of empty dir? Empty dir volumes do support both FSGroup and SELinux relabel.
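A quick way to check this behaviour on an emptyDir (assuming the sketch pod above, or any pod with fsGroup set):
```
# Ownership of the emptyDir mount should reflect the fsGroup GID
kubectl exec supplemental-groups-demo -- ls -ld /scratch
# expected shape (illustrative): drwxrwsrwx 2 root 1000 ... /scratch

# The container process should also list the fsGroup as a supplemental group
kubectl exec supplemental-groups-demo -- id
```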
@k82 The glusterfs plugin does not support FSGroup at this time.
I am positive we have an issue for controlling the UID of a volume, but I will need to find it.
IIUC emptyDir volumes are created for a Pod but also destroyed when the Pod is deleted, so they can't be used for persistent data (typically jenkins_home in @k82's scenario).
Currently I'm using 0777 as a workaround for a demo environment, but a better solution is necessary for production :).
So my reading is that this issue is largely mitigated, though some plugins, specifically shared ones like glusterfs, do not support it well. Paul, is there a writeup / walkthrough of how to use FSGroup?
There is doc in the form of the proposal, and some doc for security context, but this could probably use better examples.
Here’s an example for how to use the fsgroup directive:
For the Docker container defined in https://github.com/robustirc/robustirc/blob/master/Dockerfile (which specifies RUN echo 'nobody:x:99:99:nobody:/:/bin/sh' >> /etc/passwd and USER nobody), the following modification to my kubernetes replicationcontroller config was necessary:
--- a/robustirc-node-1.rc.yaml 2016-07-11 22:04:31.795710444 +0200
+++ b/robustirc-node-1.rc.yaml 2016-07-11 22:04:37.815678489 +0200
@@ -14,6 +14,10 @@
spec:
restartPolicy: Always
dnsPolicy: ClusterFirst
+ securityContext:
+ # Specify fsGroup so that the persistent volume is writable for the
+ # non-privileged uid/gid 99, which is used in robustirc’s Dockerfile.
+ fsGroup: 99
containers:
- name: robustirc
image: robustirc/robustirc
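For readers who don't want to parse the diff, the relevant fragment of the resulting pod template looks roughly like this (a sketch; the fsGroup value matches the Dockerfile quoted above):
```
spec:
  template:
    spec:
      securityContext:
        # uid/gid 99 is the "nobody" user created in the robustirc Dockerfile
        fsGroup: 99
      containers:
      - name: robustirc
        image: robustirc/robustirc
```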
What is the current best workaround for 1.4?
@stapelberg I've tried exactly like you explained with k8s 1.4.6 on a hostPath volume but I still see that root owns the mounted directory with 755 permissions. Any additional pointers on how to debug this?
Also, does it depend on the existence of the given user/group on the actual node/host where the container is scheduled?
Ahh.. Not sure if I'm looking at the correct code but seems like setUp method doesn't do anything for hostPath volume driver. See https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/host_path/host_path.go#L198
@thockin @saad-ali Can you please confirm? Seems like you guys are maintaining hostPath volume driver.
Since fsGroup isn't available for Gluster yet, is there a recommended workaround other than running the container as root?
FsGroup is not supported by hostPath; you do not want the kubelet making such permission changes on the host.
@pmorie Why not give the kubelet permission? Otherwise, the hand that is forced is to give the container full access as root to do anything on the host. After all, it would have to be specified by fsGroup in the manifest.
I agree with @haf regarding his "hand forcing" comment.
As an example, the kube-aws tool allows users to configure an auto-attached NFS drive on all of the workers during cluster provisioning. So, by default, every single worker has an AWS EFS drive mounted at /efs when it comes online (it uses cloud-init under the hood to accomplish this). Admins can then create shared storage PVs by pointing the hostPath option at that NFS mount.
This type of hybrid setup is very simple to set up, works well, and I would think a lot of people would like to take advantage of it for those reasons. Isn't that a good reason for hostPath to support the FsGroup option?
@joan38 @rushtehrani I use init-containers to chown the volume:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: jenkins
  name: jenkins
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: jenkins
      annotations:
        pod.alpha.kubernetes.io/init-containers: '[
          {
            "name": "jenkins-init",
            "image": "busybox",
            "imagePullPolicy": "IfNotPresent",
            "command": ["sh", "-c", "chown -R 1234:1234 /jenkins"],
            "volumeMounts": [
              {
                "name": "jenkins-home",
                "mountPath": "/jenkins"
              }
            ]
          }
        ]'
    spec:
      containers:
      - name: jenkins
        image: 'amarruedo/jenkins:v2.32.1'
        ..........
      volumes:
      - name: jenkins-home
        vsphereVolume:
          volumePath: "[FAS03_IDVMS] volumes/jenkins"
          fsType: ext4
I define the UID and GID in my containers to a known value so I can chown in the init-container. I've used this approach with hostPath volumes as well with no problems. When FSGroup becomes available for vsphere volumes I'll use that.
Hope this helps.
As with https://github.com/kubernetes/kubernetes/pull/39438#issuecomment-275459427, maybe we can only set FSGroup if the hostpath is being created?
After reading this full thread I am not sure what the resolution was. I am having a similar problem where I am on a Mac using VirtualBox and trying to start a MySQL pod with a volume mount into /data (I have also tried /Users but it has the same behaviour). When minikube creates the directories they are created with root ownership and restrictive write permissions. MySQL is not able to write to them and so my pods crash. If I minikube ssh and chmod -R 777 /data then the pods start and work correctly. What should I be doing differently so I can start this pod without having to modify the permissions of the data directory?
@justechn you may want to check https://kubernetes.io/docs/concepts/policy/security-context/ and https://kubernetes.io/docs/api-reference/v1.6/#podsecuritycontext-v1-core and set the value of fsGroup to whatever GID your mysql daemon is running with.
fsGroup: Volumes which support ownership management are modified to be owned and writable by the GID specified in fsGroup.
@antoineco thanks for the tip. I must be doing something wrong because it is not working for me. I logged into the pod and ran id mysql and got back uid=999(mysql) gid=999(mysql) groups=999(mysql). So I added fsGroup with 999 to my spec and restarted, but nothing changed. The directories are still owned by root.
spec: {
  containers: [
    {
      name: 'percona',
      image: 'percona:5.6',
      imagePullPolicy: 'Always',
      env: [
        { name: 'MYSQL_ROOT_PASSWORD', value: secrets['system-mysql-root-password'] },
        { name: 'MYSQL_OPS_USER', value: variables['system-mysql-ops-user'] },
        { name: 'MYSQL_OPS_PASSWORD', value: secrets['system-mysql-ops-password'] },
        { name: 'MYSQL_APP_USER', value: variables['system-mysql-app-user'] },
        { name: 'MYSQL_APP_PASSWORD', value: secrets['system-mysql-app-password'] },
      ],
      ports: [
        { containerPort: 3306, protocol: 'TCP' },
      ],
      volumeMounts: [
        { name: 'data', mountPath: '/var/lib/mysql' },
        { name: 'conf', mountPath: '/etc/mysql/conf.d' },
      ],
    },
  ],
  securityContext: {
    fsGroup: 999,
  },
  volumes: [
    { name: 'data', hostPath: { path: '/data/db/db' } },
    { name: 'conf', hostPath: { path: '/data/db/db-conf' } },
  ],
},
drwxr-xr-x 2 root root 4096 May 9 18:14 db
drwxr-xr-x 2 root root 4096 May 9 18:14 db-conf
drwxr-xr-x 2 root root 4096 May 9 18:14 shared
drwxr-xr-x 2 root root 4096 May 9 18:14 shared-conf
drwxr-xr-x 2 root root 4096 May 9 18:14 shard-1
drwxr-xr-x 2 root root 4096 May 9 18:14 shard-1-conf
You may have to adjust the Pod Security Policy as well:
https://kubernetes.io/docs/concepts/policy/pod-security-policy/
What's the progress on this?
How can we access PV if using non-root users?
ZK does not run on minikube because of this: https://github.com/kubernetes/charts/issues/976
FYI - for those coming to this and using the workaround by @amarruedo, it will need to be updated to the new v1.6+ syntax for initContainers, which looks like the following:
initContainers:
- name: volume-mount-hack
  image: busybox
  command: ["sh", "-c", "chown -R 1000:100 /usr/share/elasticsearch/data"]
  volumeMounts:
  - name: data
    mountPath: /usr/share/elasticsearch/data
Regardless, it does the trick, but it will be nice when this is natively supported.
Thank you @antoineco for the suggestions that brought many hours of searching to an end. It seems that this issue is focused on host directories, but there weren't any hints for "Using Persistent Volumes as non-root user" within the Persistent Volumes Using section.
I was successful with a simple addition to the Pod.spec using v1.7 and a Dynamically Provisioned AWS Persistent Volume for a StatefulSet. I did not require a Pod Security Policy:
# Allow non-root user to access PersistentVolume
securityContext:
  fsGroup: 1000
Could we please have more documentation than the brief mention in the API Reference? It's ambiguous to know that some volume types will work without knowing which ones.
fsGroup: A special supplemental group that applies to all containers in a pod. Some volume types allow the Kubelet to change the ownership of that volume to be owned by the pod: 1. The owning GID will be the FSGroup 2. The setgid bit is set (new files created in the volume will be owned by FSGroup) 3. The permission bits are OR'd with rw-rw---- If unset, the Kubelet will not modify the ownership and permissions of any volume.
I tried to mount an azurefile persistent volume for Jenkins and had the permission denied problem, because the jenkins_home directory was only rwx for root, while Jenkins runs as the jenkins user.
I could work around it using securityContext/runAsUser: 0, but it would be better to inherit the existing directory rights or to provide a chmod property.
cc @kubernetes/sig-storage-feature-requests @kubernetes/sig-node-feature-requests
@Guillaume-Mayer - wouldn't a postStart lifecycle hook to chmod the files work (alternatively an init container if execution needs to be completed before the entrypoint)?
@so0k That won't work if the runAsUser and allowPrivilegeEscalation settings prevent the root user from being used.
I have a very similar use case where I would need to change the owner of a file that is mounted (from a secret in my case).
I have a mongodb-cluster in k8s which uses a special cluster.key file for cluster authorization. That file is stored in a secret; we have a client where running images as root is forbidden. Our pod has a securityContext set with a runAsUser: 1000 directive. MongoDB itself forbids the file being accessible by anyone but the owner; it will reject startup if the file is readable by group or other.
Since the owner is root, and I cannot run a chown as non-root on that file, neither changing the permissions works, nor (since there is no k8s support) changing the owner of the file.
I am currently working around this by injecting the key as an environment variable into a busybox init container, which in turn mounts an emptyDir and writes it there. The secret is then not mounted as a file anymore. It's quite ugly, and if there is a chance to get rid of it I'd be in.
The fact that so many of the Docs advise and caution the user against running containers as root, and that this issue is now 3 years old astounds me. This should at least be explained in much greater detail in the Docs.
@saad-ali
Hi!
I ended up with the below initContainers config for giving the node-red-docker container, which runs as a non-privileged user, access to an externally created disk. After trying a lot of things, it seemed that "runAsUser" 0 (root) did the trick.
Cheers
-jo
initContainers:
- name: volume-mount-hack
  image: nodered/node-red-docker:slim
  command:
  - sh
  - -c
  - 'chmod -R a+rwx /data'
  volumeMounts:
  - name: picturl-persistent-storage
    mountPath: /data
  securityContext:
    runAsUser: 0
Many older applications which bind to low (privileged) ports start first as root, then immediately drop privileges to some other user. In such a scenario, the container must be configured to start the application as root, and so the original user (root) would have access to the volume. Once the application calls setuid(2)/seteuid(2) though, it won't have access anymore.
@eatnumber1 Can you elaborate a bit more on why we will have this issue with the supplementary group solution mentioned in this thread? IIUC, setuid(2)/seteuid(2) will not change the supplementary groups of the calling process, so as long as the application is in a group which has access to the volume, it should not have problems accessing the volume, right?
It looks like I was mistaken and calling setgid(2) doesn't change supplementary groups (which I had thought it did).
Looking around, it seems like at least nginx drops supplementary groups explicitly here (and otherwise would be a minor security vulnerability). I'd be surprised if any well-written privilege-dropping application doesn't drop supplementary groups.
Thanks @eatnumber1! So Nginx initially runs as root, and later resets the uid, gid and supplementary groups to what is configured in nginx.conf. Then I think with the pod security context we can set fsGroup to the group configured in nginx.conf; in this way, even after Nginx resets its supplementary groups, it can still access the volume. Right?
From a cursory reading about pod security contexts, it seems like it would.
I haven't used them though (note that my original comment on this bug is
multiple years old).
Besides supplementary groups, I think POSIX ACLs could be another solution to this issue; I mean we could add an ACL entry to grant rwx permission to the pod/container user on the volume. But I do not see POSIX ACLs mentioned in this thread; are there any drawbacks?
cc @thockin @saad-ali
I don't know that nginx clearing supplementary groups prevents any
vulnerability in this case? It is specifically defeating a well-understood
mechanism. Can we fix nginx?
As for ACL or other mechanisms, I don't object to them, I just have less
context on them.
It is specifically defeating a well-understood mechanism.
@thockin Can you please elaborate a bit on this? And why do we need to fix Nginx?
We explicitly set up supplemental groups so we can do things like volumes
and per-volume accounting. It is 100% intentional and then nginx drops
supplemental groups in the name of security. Breaking valid use cases.
In a non-containerized world, if nginx didn't drop supplemental groups, a remote code execution vulnerability in nginx could leak undesired privileges to remote attackers via its supplemental groups. I therefore don't think you'll ever get the nginx developers to be willing to stop doing that. Even if you do manage to convince them to, dropping supplemental groups is the standard practice, and you'd have to convince every developer of every privilege dropping application to do the same. Apache does the same exact thing here.
Furthermore, even if you pick another obscure Linux access control mechanism to use instead (for example, fsuid), it is _intentional_ that every possible type of privilege is dropped, so it would be a security vulnerability if applications didn't drop that privilege as well. That _is_ the security model here.
In a non-containerized world, the only way to grant privileges to the application after it drops privileges is to grant privileges to the user/group/etc that the application switches _to_. Hence my original (3 year old) comment about supporting UID and GID explicitly, which would allow the user to specify the UID or GID that the application is going to switch to.
Looking at the documentation for PodSecurityContext, it says this about fsGroup:
A special supplemental group that applies to all containers in a pod. Some volume types allow the Kubelet to change the ownership of that volume to be owned by the pod:
- The owning GID will be the FSGroup
- The setgid bit is set (new files created in the volume will be owned by FSGroup)
- The permission bits are OR'd with rw-rw---- If unset, the Kubelet will not modify the ownership and permissions of any volume.
As far as I'm aware, these actions should be sufficient to allow the resulting unprivileged user after a privilege drop to access the volume successfully. (caveat, I haven't tested it)
Yes, so I think setting fsGroup to the group configured in nginx.conf will let Nginx access the volume even after a privilege drop, and will also make volume accounting work.
But I have another question: besides the fsGroup in the pod security context, users can also set fsGroup in the container security context, so if a pod has multiple containers in it and each container has its own fsGroup, how can we make sure all of these containers can access the volume (since a volume can only be owned by a single group rather than multiple)?
@qianzhangxa if multiple containers need access to that volume, you will need to make sure all containers request the same fsGroup in the container-level security context, or better, just set it at the pod level.
@tallclair FYI I believe we can close this issue
/sig auth
None of the solutions suggested are working for me.
YML:
apiVersion: apps/v1beta1 # for versions before 1.8.0 use apps/v1beta1
kind: Deployment
metadata:
  labels:
    tier: frontend
spec:
  selector:
    matchLabels:
      tier: frontend
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      securityContext:
        fsGroup: 1000
        runAsUser: 0
      initContainers:
      - image: some-sftp-container
        name: sftp-mount-permission-fix
        command: ["sh", "-c", "chown -R <user> /mnt/permission-fix"]
        volumeMounts:
        - name: azure
          mountPath: /mnt/permission-fix
      containers:
      - image: some-sftp-container
        name: sftp-container
        ports:
        - containerPort: 22
          name: port_22
        volumeMounts:
        - name: azure
          mountPath: /home/<user>/data
      volumes:
      - name: azure
        azureFile:
          secretName: azure-secret
          shareName: sftp-share
          readOnly: false
Once the Pod is ready and I exec into the container and check the dirs, nothing has happened:
root@container:/# cd /home/<user>
root@container:/home/<user># ls -als
total 8
4 drwxr-xr-x 3 root root 4096 Apr 24 18:45 .
4 drwxr-xr-x 1 root root 4096 Apr 24 18:45 ..
0 drwxr-xr-x 2 root root 0 Apr 22 21:32 data
root@container:/home/<user># cd data
root@container:/home/<user>/data# ls -als
total 1
1 -rwxr-xr-x 1 root root 898 Apr 24 08:55 fix.sh
0 -rwxr-xr-x 1 root root 0 Apr 22 22:27 test.json
root@container:/home/<user>/data#
At some point I also had the runAsUser: 0 on the container itself. But that didn't work either. Any help would be much appreciated.
Also running a chown afterwards didn't work
@eatnumber1 if a group is in your supplemental groups, shouldn't you assume that it was intended that you have access to that group's resources? Dropping supplemental groups is saying "I know you told me I need this, but I don't want it" and then later complaining that you don't have it.
Regardless, I am now thoroughly lost as to what this bug means - there are too many followups that don't seem to be quite the same.
Can someone summarize for me? Or better, post a full repro with non-pretend image names?
@thockin IIUC, Nginx is not just dropping the supplementary groups, it is actually resetting them to what is configured in nginx.conf by calling initgroups.
This worked for me.. part of the script.
```
spec:
  containers:
  - name: jenkins
    image: jenkins/jenkins
    ports:
    - containerPort: 50000
    - containerPort: 8080
    volumeMounts:
    - mountPath: /var/jenkins_home
      name: jenkins-home
  securityContext:
    fsGroup: 1000
    runAsUser: 0
```
The solutions aren't ideal: now your containers are running as root, which is against the security standards that k8s tries to get its users to impose.
It would be great if persistent volumes could be created with securityContext in mind, i.e.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: redis-data-pv
  namespace: data
  labels:
    app: redis
spec:
  securityContext:
    runAsUser: 65534
    fsGroup: 65534
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  claimRef:
    namespace: data
    name: redis-data
  hostPath:
    path: "/data"
As a workaround, I use a postStart lifecycle hook to chown the volume data to the correct permissions. This may not work for all applications, because the postStart lifecycle hook may run too late, but it's more secure than running the container as root and then fixing permissions and dropping root (or using gosu) in the entrypoint script.
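A rough sketch of that kind of postStart hook (image name, UID/GID, and mount path are assumptions; note the caveats above about the hook possibly running too late and needing enough privilege to chown):
```
containers:
- name: app
  image: my-app:latest                 # hypothetical image
  volumeMounts:
  - name: data
    mountPath: /data
  lifecycle:
    postStart:
      exec:
        command: ["sh", "-c", "chown -R 1000:1000 /data"]
```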
@robbyt commented
As a workaround, I use a postStart lifecycle hook to chown the volume data to the correct permissions. This may not work for all applications, because the postStart lifecycle hook may run too late, but it's more secure than running the container as root and then fixing permissions and dropping root (or using gosu) in the entrypoint script.
We use initContainer; can a lifecycle hook have a different securityContext than the container itself?
It's sad to see that, after having to do research on this again, @chicocvenancio's option (which I use as well) is still apparently the only way to achieve this.
I understand where the problem is coming from and why we are so reluctant to change this; however, especially for Secret volumes, changing the UID of volumes can be essential.
Here is an example from the PostgreSQL world: mount a TLS client cert for your application with a secret volume. As recommended everywhere, you don't run your container as root. However, the postgres connection library will instantaneously complain that the key is world readable. "No problem" you think and you change the mode / default mode to match the _demanded_ 0600 (which is very reasonable to demand that as a client library). However, now this won't work either, because now root is the only user which can read this file.
The point I'm trying to make with this example is: groups don't come to the rescue here.
Now PostgreSQL is definitely a standard database and a product that a lot of people use. And asking for mounting client certs in a way with Kubernetes that do not require an initContainer as a workaround is not too much to ask imho.
So please, let's find some middle ground on this issue, and not just close it. :pray:
I'm trying to mount a ssh-key to user's .ssh directory with defaultMode 0400 so the application can ssh without a password. But that doesn't work if the secret is mounted as owned by root. Can you explain again how this can be solved using fsGroup or some other such mechanism?
I don't see a solution if PodSecurityPolicy is enabled so applications cannot run as root. Please advise.
I am still hopelessly confused about this bug. There seems to be about 6 things being reported that all fail the same way but are different for different reasons.
Can someone explain, top-to-bottom the issue (or issues) in a way that I can follow without having to re-read the whole thread?
Keep in mind that Volumes are defined as a Pod-scope construct, and 2 different containers may run as 2 different UIDs. Using group perms is ideal for this, but if it is really not meeting needs, then let's fix it. But i need to understand it first.
@saad-ali for your radar
@thockin My use-case is very simple. I'm injecting a secret (ssh key) into a container that is not running as root. The ssh key in /home/username/.ssh must have 400 permission which I can do, but must also be owned by the UID, or it won't work. I don't want to give this pod any root privilege of any sorts, so an init container that modifies the UID of the file does not work for me. How do I do it, other than including the ssh-key in the image?
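For context, this is roughly as far as the current API goes for that case: defaultMode can produce 0400 files, but there is no field to set the owner, so the key still ends up owned by root (the names below are placeholders):
```
containers:
- name: app
  image: my-app:latest          # hypothetical image running as a non-root user
  volumeMounts:
  - name: ssh-key
    mountPath: /home/username/.ssh
    readOnly: true
volumes:
- name: ssh-key
  secret:
    secretName: my-ssh-key      # hypothetical Secret holding the private key
    defaultMode: 0400           # 0400 in YAML (256 in JSON); ownership remains root
```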
Right. That seems clear. If we fixed this, does it apply to every use case herein? Or is there more?
I could see adding a user field in various places.
e.g. start with this. Implement it and add some tests. I need a volunteer to carry the football. It's not a ton of effort, honestly, probably less than a day, but I won't have time to spend on it any time soon.
Anyone?
diff --git a/staging/src/k8s.io/api/core/v1/types.go b/staging/src/k8s.io/api/core/v1/types.go
index 99159ee75a..e98c035528 100644
--- a/staging/src/k8s.io/api/core/v1/types.go
+++ b/staging/src/k8s.io/api/core/v1/types.go
@@ -1048,6 +1048,9 @@ type SecretVolumeSource struct {
 	// mode, like fsGroup, and the result can be other mode bits set.
 	// +optional
 	DefaultMode *int32 `json:"defaultMode,omitempty" protobuf:"bytes,3,opt,name=defaultMode"`
+	// Optional: user ID to use on created files by default. Default is implementation-defined.
+	// +optional
+	DefaultUser *int64 `json:"defaultUser,omitempty" protobuf:"varint,4,opt,name=defaultUser"`
 	// Specify whether the Secret or it's keys must be defined
 	// +optional
 	Optional *bool `json:"optional,omitempty" protobuf:"varint,4,opt,name=optional"`
@@ -1474,6 +1477,9 @@ type ConfigMapVolumeSource struct {
 	// mode, like fsGroup, and the result can be other mode bits set.
 	// +optional
 	DefaultMode *int32 `json:"defaultMode,omitempty" protobuf:"varint,3,opt,name=defaultMode"`
+	// Optional: user ID to use on created files by default. Default is implementation-defined.
+	// +optional
+	DefaultUser *int64 `json:"defaultUser,omitempty" protobuf:"varint,4,opt,name=defaultUser"`
 	// Specify whether the ConfigMap or it's keys must be defined
 	// +optional
 	Optional *bool `json:"optional,omitempty" protobuf:"varint,4,opt,name=optional"`
@@ -1541,6 +1547,9 @@ type ProjectedVolumeSource struct {
 	// mode, like fsGroup, and the result can be other mode bits set.
 	// +optional
 	DefaultMode *int32 `json:"defaultMode,omitempty" protobuf:"varint,2,opt,name=defaultMode"`
+	// Optional: user ID to use on created files by default. Default is implementation-defined.
+	// +optional
+	DefaultUser *int64 `json:"defaultUser,omitempty" protobuf:"varint,4,opt,name=defaultUser"`
 }
 // Projection that may be projected along with other supported volume types
@@ -1581,6 +1590,9 @@ type KeyToPath struct {
 	// mode, like fsGroup, and the result can be other mode bits set.
 	// +optional
 	Mode *int32 `json:"mode,omitempty" protobuf:"varint,3,opt,name=mode"`
+	// Optional: user ID to use on this file.
+	// +optional
+	User *int64 `json:"User,omitempty" protobuf:"varint,4,opt,name=User"`
 }
 // Local represents directly-attached storage with node affinity (Beta feature)
@@ -5080,6 +5092,9 @@ type DownwardAPIVolumeSource struct {
 	// mode, like fsGroup, and the result can be other mode bits set.
 	// +optional
 	DefaultMode *int32 `json:"defaultMode,omitempty" protobuf:"varint,2,opt,name=defaultMode"`
+	// Optional: user ID to use on created files by default. Default is implementation-defined.
+	// +optional
+	DefaultUser *int64 `json:"defaultUser,omitempty" protobuf:"varint,4,opt,name=defaultUser"`
 }
 const (
@@ -5103,6 +5118,9 @@ type DownwardAPIVolumeFile struct {
 	// mode, like fsGroup, and the result can be other mode bits set.
 	// +optional
 	Mode *int32 `json:"mode,omitempty" protobuf:"varint,4,opt,name=mode"`
+	// Optional: user ID to use on this file.
+	// +optional
+	User *int64 `json:"User,omitempty" protobuf:"varint,5,opt,name=User"`
 }
 // Represents downward API info for projecting into a projected volume.
@vikaschoudhary16 @derekwaynecarr this has some overlap / implications for user-namespace mapping.
@rezroo a workaround could be to simply make a copy of the ssh key in an init container; that way you'll be able to control who owns the file, right? Provided the init container runs as the same user that needs to read the ssh key later. It's a little gross, but "should" work I think.
@thockin another use-case: I'm trying to run an ELK StatefulSet. The pod has an Elasticsearch container running as non-root. I'm using a volumeClaimTemplate to hold the Elasticsearch data. The container is unable to write to the volume though, as it is not running as root. K8s v1.9. The pod has multiple containers and I don't want to use the same fsGroup for all of them.
@pearj that's exactly the workaround that everybody uses ... and as the name says: it's a workaround, and should get addressed :) ... However, there is also a problem with this workaround: updated secrets will eventually get updated in mounted volumes which will make it possible to act on a file change in the running pod; you will miss out on this update when you copy it from an init container.
@pearj @mheese This work around wouldn't work for me anyway - because our PodSecurityPolicy doesn't allow containers to run as root - normal or init containers - doesn't matter - no one can access a secret owned by root as far as I can tell.
Yet another use case for this: I'm working on using XFS quotas (obviously, if XFS is in use) for ephemeral storage. The current enforcement mechanism for ephemeral storage is to run du periodically; in addition to being slow and rather coarse granularity, it can be faked out completely (create a file, keep a file descriptor open on it, and delete it). I intend to use quotas for two purposes:
Hard cap usage across all containers of a pod.
Retrieve the per-volume storage consumption without having to run du (which can bog down).
I can't use one quota for both purposes. The hard cap applies to all emptydir volumes, the writable layer, and logs, but a quota used for that purpose can't be used to retrieve storage used for each volume. So what I'd like to do is use project quotas in a non-enforcing way to retrieve per-volume storage consumption and either user or group quotas to implement the hard cap. To do that requires that each pod have a unique UID or single unique GID (probably a unique UID would be best, since there may be reasons why a pod needs to be in multiple groups).
(As regards group and project IDs being documented as mutually exclusive with XFS, that is in fact no longer the case, as I've verified. I've asked some XFS people about it, and they confirmed that the documentation is out of date and needs to be fixed; this restriction was lifted about 5 years ago.)
@robbyt please tell how you managed to chown with postStart? My container runs as a non-root user, so postStart still uses non-root permissions and can't change ownership:
chown: /home/user/: Operation not permitted, message: "chown: /home/user/: Permission denied\nchown: /home/user/: Operation not permitted
Same problem here: we have some Dockerized Tomcats that run our web application, and we use JMX to monitor them. We want to serve the jmxremote user and jmxremote password as secrets, but Tomcat, which obviously doesn't run as root, wants the JMX files to be readable only by the user that runs Tomcat.
Addendum: we have many Tomcats, and want to run each of them as a different user.
the same problem!
For now, the hack that works is to set the user to root at the end of your Dockerfile and set a custom entrypoint script. Chown the volume in your custom entrypoint script, then use gosu to run the default entrypoint script as the default user. The thing I hate about this is that I have to do it for every single image that uses a volume in Kubernetes. Totally lame. Please provide a UID/GID option on the volume mount or volume claim config.
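A sketch of that hack, assuming the image's unprivileged user is called app, the volume is mounted at /data, gosu is installed, and the image's original entrypoint is /docker-entrypoint.sh:
```
#!/bin/sh
# custom-entrypoint.sh - runs as root (the Dockerfile ends with USER root),
# fixes ownership of the mounted volume, then drops privileges.
set -e
chown -R app:app /data                     # assumed user/group and mount path
exec gosu app /docker-entrypoint.sh "$@"   # re-exec the original entrypoint as the app user
```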
That hack doesn’t work if you want to run a secure Kubernetes cluster with PodSecurityPolicies applied to enforce pods to run as a non-root user.
True. All hacks have their downsides. It's either that or logging in as root after the volume is created and chowning the directory manually. Not sure really which is worse :-D. Can't believe this is even a thing.
@thockin from what I gather following this issue now since nearly 4 years, the solution that everybody wants is to be able to set a uid and gid for a volume - in particular secret volumes (but not only those). On Jul 6 you posted a starting point for a possible solution to this. If this is a supported path from the maintainers, I'd finally start and try to solve this problem.
@mheese i'd say go for it.
@mheese I'll collab on a PR if you want?
@mheese It seems gid is taken from the SecurityContext, so I guess, for fast relief, a uid implementation would be enough. Also because gid had more second guesses in the discussion.
same issue with psql here
@michalpiasecki1 Look how I solved it with a PostStart hook: https://github.com/xoe-labs/odoo-operator/blob/1be88b67d4ded5c4a0aea6e26b711241f0d09f89/pkg/controller/odoocluster/odoocluster_controller.go#L579-L586
Running into the same issue. Is there any recommended solution for this?
@blaggacao: thanks for the hint, however I found another workaround.
@debianmaster: I would recommend securityContext and fsGroup as described in https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
I ended up with certificates owned by root and group postgres, with permissions 440.
@michalpiasecki1 can you give more info about how you resolved this for postgres?
I have the server.crt and server.key files stored in a k8s secret pg-certs-secret and I want to mount them into my container running postgres:9.6. I have this set up with:
containers:
- name: pg
  image: postgres:9.6
  ...
  volumeMounts:
  - name: pg-certs
    mountPath: "/etc/certs/"
    readOnly: true
  args: ["-c", "ssl=on", "-c", "ssl_cert_file=/etc/certs/server.crt", "-c", "ssl_key_file=/etc/certs/server.key"]
volumes:
- name: pg-certs
  secret:
    secretName: pg-certs-secret
    defaultMode: 384
But deploying this, the container dies with the error FATAL: could not load server certificate file "/etc/certs/pg_server.crt": Permission denied
I assume this is because the certs are loaded so that they are owned by root, when they need to be owned by postgres. It's not clear from the docs etc. what I should do to change ownership short of creating a custom Docker image, which I'd rather not do. The securityContext and fsGroup you suggested seem like they could work, but I would appreciate it if you would share more info about how exactly you achieved this.
Also worth noting: I added defaultMode: 384 to ensure the files were added with 0600 file permissions. Before I added that, the container died with the error:
FATAL: private key file "/etc/certs/pg_server.key" has group or world access
DETAIL: File must have permissions u=rw (0600) or less if owned by the database user, or permissions u=rw,g=r (0640) or less if owned by root.
For reference, I just figured this out and it worked when I added
securityContext:
  fsGroup: 999
to the spec.
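Putting the pieces of this exchange together, the working combination for the postgres certificate case looks roughly like this (a sketch; 999 is the postgres GID in the image used above, so confirm it with id -g inside the container, and double-check the resulting file modes since fsGroup also adjusts group ownership on the mounted files):
```
spec:
  securityContext:
    fsGroup: 999                  # GID of the postgres user in the image
  containers:
  - name: pg
    image: postgres:9.6
    args: ["-c", "ssl=on", "-c", "ssl_cert_file=/etc/certs/server.crt", "-c", "ssl_key_file=/etc/certs/server.key"]
    volumeMounts:
    - name: pg-certs
      mountPath: "/etc/certs/"
      readOnly: true
  volumes:
  - name: pg-certs
    secret:
      secretName: pg-certs-secret
      defaultMode: 384            # 0600, as noted above
```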
I have the same problem: #72085. Can anyone help me?
@izgeri I tried the link but it's not working, can you help me?
Is there any chance of fixing this issue? Are the Kubernetes folks working on a solution?
This issue has not been a problem for us for a very long time. We set the "fsGroup" in the pod's security context to match the group ID of the user that runs the main Docker entry point, and any volumes in the pod become accessible to that container's main process:
https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
Note that the proper group ID will vary depending on how the Docker container is created and run. I usually ascertain it by kubectl exec-ing into a shell in the pod and typing id -g.
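For example (pod and container names are placeholders):
```
# Primary GID of the container's user; use this value as fsGroup
kubectl exec -it my-pod -c my-container -- id -g
```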
@charles-crain: Your suggestion works really well for most cases.
Here's another case that's not covered:
If the container starts as root but uses a tool such as gosu to become another user (for some processes), then locking the container into only one group with fsGroup will prevent cases such as "I want my non-root user to have access to SSH keys mounted into its ~/.ssh directory, while my root user also has access to other mounts".
One example of this: "a DinD container where dockerd must start as root, but subsequent containers are run by a non-root user".
Hi there @charles-crain, I am facing a very interesting issue that matches the topic of this thread. It seems fsGroup does not work for all cases.
Here is an example deployment, a test nginx deployment where I am trying to mount NFS and additionally mount an empty directory, just to compare.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-test
  labels:
    app: nginx-test
spec:
  selector:
    matchLabels:
      app: nginx-test
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: nginx-test
    spec:
      securityContext:
        fsGroup: 2000
      volumes:
      - name: nfs-volume
        nfs:
          server: # nfs with no_root_squash
          path: /nfs
      - name: test-fs-group
        emptyDir: {}
      containers:
      - image: nginx
        name: nginx-test
        imagePullPolicy: Always
        volumeMounts:
        - name: nfs-volume
          mountPath: /var/local/test5
        - name: test-fs-group
          mountPath: /var/local/test6
When I exec bash into the pod's nginx container, the GID is applied only to the emptyDir, and not to the dir mounted for NFS. NFS is configured with no_root_squash for testing purposes, and the process in my container runs as a non-root user, so that is the problem. It can be solved via chown, however I am trying to achieve it with a native solution.
I also face the exact same issue described above.
This issue has been open for like 5 years. No one from Kubernetes is interested in it, and it may be for a reason, valid or not. There were a number of valid solutions to this simple problem but none of them were implemented.
Not sure why this issue doesn't just get closed.
@mheese As you commented here https://github.com/kubernetes/kubernetes/issues/2630#issuecomment-424876108 about setting a uid and gid for a volume, are you still working on it? Thanks!
I also encountered this issue. Is there any plan to create a viable solution for it? Local persistent volumes can't replace all use-cases of hostPath volume
Same here.
@jingxu97 I haven't given it a try yet because I don't really feel that there is a consensus that this is what should be done.
Let me come up with a detailed proposal and post it here when ready.
Ok
For reference, I just figured this out and it worked when I added
securityContext: fsGroup: 999
to the spec.

For postgres:11.1-alpine use this:
securityContext:
  fsGroup: 70
I can only hope that the Kubernetes members prioritize this issue. IMO it's really a blocker, especially from a security point of view, and it's becoming a vulnerability risk :'(
long-term-issue (note to self)
I'm hitting this in context of cert-manager managed secrets. I was using an initContainer to copy the certs to the right place and update the permissions. Cert-manager needs to update the secrets in place so that trick won't work. I'll explore the fsGroup workaround.
@incubus8
I am trying to working on this issue. Could you please describe your use case and what kind of behavior you would expect? Thanks!
@jingxu97 I can offer two examples. The Prometheus docker image starts the prometheus service as user nobody (uid 65534) and the Grafana docker image starts grafana as uid=472 (https://grafana.com/docs/installation/docker/).
Both of these fail to create directories when they first start up because of these permissions. I've worked around this in my setup with an initContainer that creates the required directories and chowns them appropriately.
@ford-prefect if you set fsGroup in PodSecurityContext, and runAsUser, wouldn't those services have the permission to write?
No, because the permissions are set for the pod, not for the volume that was created independently. It would be great if PodSecurityContext could in fact alter the permissions of the volumes, or at least fail to mount and throw an error.
@ekhaydarov, in the current setVolumeOwnership function, if fsGroup is provided, the volume will have rw-rw---- permissions, so the group has rw permission. And when the container is started, it will set up the supplemental group so that it can read and write to the volume. Anything I am missing?
@jingxu97 this is not always a solution. For example: we use secrets for jmxremote.password and jmxremote.user, which are needed for JMX monitoring of Java applications. Java requires that those files belong to the user that runs the application and that they have permissions 400, so for now there is no way to use secrets this way in Rancher 2.x.
I was perplexed to see that fsGroup was an option and fsUser was not.
Also, the permissions/mode portion of this is confusing. We should make it clearer how volumes like EmptyDir get their default mode or allow the user to set it explicitly, as this is a pretty normal unix admin task.
If root is the only user that can ever own your volume (aside from using an initContainer to chmod it at runtime), the API encourages usage of root for an application's user which is a weak security practice.
@jingxu97 What do you think?
@stealthybox, thank you for the feedback. I am currently working on a proposal for API on volume ownership and permission and will share with the community soon. Feedback/comments are welcome then.
Hi.
Is there any news about this issue?
Why does pv.beta.kubernetes.io/gid not work for the local host path provisioner?
Hey,
I am encountering this as well, I'd appreciate some news :).
this has been my workaround so far:
initContainers:
- name: init
  image: busybox:latest
  command: ['/bin/chown', 'nobody:nogroup', '/<my dir>']
  volumeMounts:
  - name: data
    mountPath: /<my dir>
The workarounds with chowning do not work for read-only volumes, such as secret mounts, unfortunately.
I would need this as well (pretty urgently), because we have software not starting due to permissions not being able to be anything other than 0600. If we could mount the volume under a specific UID, my (and others') problem would be solved.
You can run a job as part of your deployment to update the volume permissions and use a ready state to check for write permission as a workaround. Or you can use fsGroup to specify the group for the volume and add the application user to the group that owns the volume. Option 2 seems cleaner to me. I used to use option 1 but now I use option 2.
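A sketch of option 2 (the UID/GID, image, and claim name are assumptions; the fsGroup GID is also added as a supplemental group of the container processes, which is what gives the application user access):
```
spec:
  securityContext:
    runAsUser: 1000     # assumed non-root application user
    fsGroup: 2000       # group that will own the volume
  containers:
  - name: app
    image: my-app:latest            # hypothetical image
    volumeMounts:
    - name: data
      mountPath: /var/lib/app       # assumed data path
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: app-data           # hypothetical claim
```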
Note that if Kubernetes did support an fsUser option, then you'd trip over https://github.com/kubernetes/kubernetes/issues/57923, where all files within the mounted secret would be given 0440 permission (or 0660 for writeable mounts) and would ignore any other configuration.
@woodcockjosh fsGroup doesn't cover the use case of security-sensitive software such as Vault trying to run as vault:vault and loading a private key file requiring permissions equal to or less than 0600. @wjam fsUser would be ideal if we could get 0400 permissions set as well (for things like private key files).
We hit this trying to configure Vault to authenticate to a PostgreSQL DB with certificates. The underlying Go library hard fails if the permission bits differ (https://github.com/lib/pq/blob/90697d60dd844d5ef6ff15135d0203f65d2f53b8/ssl_permissions.go#L17).
@jingxu97: Is there any news on that? We still have the PV ownership problem in our clusters with strict security policies.
This article looks like it works. I didn't test it, but I'll test it on Monday; if anyone can do it before then, please let us know.
The detail is here
Data persistence is configured using persistent volumes. Due to the fact that Kubernetes mounts these volumes with the root user as the owner, the non-root containers don't have permissions to write to the persistent directory.
The following are some things we can do to solve these permission issues:
Use an init-container to change the permissions of the volume before mounting it in the non-root container. Example:
```
spec:
  initContainers:
  - name: volume-permissions
    image: busybox
    command: ['sh', '-c', 'chmod -R g+rwX /bitnami']
    volumeMounts:
    - mountPath: /bitnami
      name: nginx-data
  containers:
  - image: bitnami/nginx:latest
    name: nginx
    volumeMounts:
    - mountPath: /bitnami
      name: nginx-data
```
Use Pod Security Policies to specify the user ID and the FSGroup that will own the pod volumes. (Recommended)
```
spec:
  securityContext:
    runAsUser: 1001
    fsGroup: 1001
  containers:
  - image: bitnami/nginx:latest
    name: nginx
    volumeMounts:
    - mountPath: /bitnami
      name: nginx-data
```
Hi,
I've seen all around the Internet the workaround with that weak initContainer running as root.
I've also been struggling with fsGroup, which applies only at the scope of the pod, not to each container in a pod, which is [also] a shame.
I just built a custom image (nonroot-initContainer) based on alpine, with sudo installed and a custom /etc/sudoers giving my non-root user full power to apply the chmod actions. Unfortunately, I'm hitting another wall with:
sudo: effective uid is not 0, is /usr/bin/sudo on a file system with the 'nosuid' \
option set or an NFS file system without root privileges?
Since I'm not willing to create a less secure PodSecurityPolicy for that deployment, any news on this issue would be very welcome for people having to stay compliant with security best practices.
Thanks in advance!
Is there fsGroup for Kubernetes Deployment files?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
👍
Is this still an issue? I've done some tests (Minikube 1.14, 1.15, 1.19 and EKS 1.14) and the permissions on the emptyDir volume are 777 as intended:
apiVersion: v1
kind: Pod
metadata:
  name: debug
  namespace: default
spec:
  containers:
  - image: python:2.7.18-slim
    command: [ "tail", "-f" ]
    imagePullPolicy: Always
    name: debug
    volumeMounts:
    - mountPath: /var/log/test-dir
      name: sample-volume
  volumes:
  - emptyDir:
      sizeLimit: 10M
    name: sample-volume
Writing files in the dir, works with any user as expected.