Kubeadm: Document kubeadm usage with SELinux

Created on 29 May 2017 · 77 comments · Source: kubernetes/kubeadm

Is this a BUG REPORT or FEATURE REQUEST?

Choose one: BUG REPORT or FEATURE REQUEST

COMMUNITY REQUEST

Versions

All

We need e2e tests that ensure kubeadm works with SELinux on CentOS/Fedora (https://github.com/kubernetes/kubeadm/issues/215) and CoreOS (https://github.com/kubernetes/kubeadm/issues/269)

We might be able to add a job for it on kubernetes-anywhere, @pipejakob?

IIUC kubeadm is broken with SELinux enabled right now. The problem is that we don't have anyone (AFAIK) very experienced with SELinux in the kubeadm team (at least nobody has had time to look into it yet)

AFAIK, the problem is often when mounting hostPath volumes...

To get closer to production readiness, we should fix this and add a testing suite for it.
We should also work with CNI network providers to make sure they adopt the right SELinux policies as well.

Anyone want to take ownership here? I'm not very experienced with SELinux, so I'm probably gonna focus on other things.

@dgoodwin @aaronlevy @coeki @rhatdan @philips @bboreham @mikedanese @pipejakob

area/ecosystem help wanted kind/documentation lifecycle/frozen priority/backlog

Most helpful comment

I don't work with kubeadm but would be very willing to help whoever takes this on.

All 77 comments

@pipejakob I added the kind/postmortem label as it's in the same theme: we broke SELinux users again without noticing it...

I don't work with kubeadm but would be very willing to help whoever takes this on.

@rhatdan Great! What I'm looking for is people who are familiar with SELinux and willing to help.
I might be able to coordinate the work though.

A rough todo list would look like:

  • Make kubeadm work with SELinux enabled in v1.7
  • Make an e2e suite of CentOS/Fedora nodes that will notify us if there is a regression.
  • Look into the CoreOS issue and how the SELinux setup between CentOS and CoreOS differs.

@rhatdan Let's first try to get it working in v1.7; that can be done in #215

@timothysc

I will take it for now, since I raised it. I'll have some updates soon, @rhatdan, please advise me ;)

@luxas , @jasonbrooks - does this still exist in fedora?

I think folks have patched policies on other channels.

/cc @eparis

@timothysc I haven't tried w/ 1.7 yet, but w/ 1.6, CentOS worked w/ selinux but Fedora 25 didn't. I'll test w/ 1.7

for reference, I just ran kubeadm 1.7 on f26 in permissive mode, and these are the denials I got:

[root@fedora-1 ~]# ausearch -m avc -ts recent
----
time->Tue Jul 11 13:03:50 2017
type=AVC msg=audit(1499792630.959:321): avc:  denied  { read } for  pid=2885 comm="kube-apiserver" name="apiserver.crt" dev="dm-0" ino=16820634 scontext=system_u:system_r:container_t:s0:c171,c581 tcontext=unconfined_u:object_r:cert_t:s0 tclass=file permissive=1
----
time->Tue Jul 11 13:03:50 2017
type=AVC msg=audit(1499792630.959:322): avc:  denied  { open } for  pid=2885 comm="kube-apiserver" path="/etc/kubernetes/pki/apiserver.crt" dev="dm-0" ino=16820634 scontext=system_u:system_r:container_t:s0:c171,c581 tcontext=unconfined_u:object_r:cert_t:s0 tclass=file permissive=1
----
time->Tue Jul 11 13:04:18 2017
type=AVC msg=audit(1499792658.917:331): avc:  denied  { read } for  pid=2945 comm="kube-controller" name="sa.key" dev="dm-0" ino=16820637 scontext=system_u:system_r:container_t:s0:c755,c834 tcontext=unconfined_u:object_r:cert_t:s0 tclass=file permissive=1
----
time->Tue Jul 11 13:04:18 2017
type=AVC msg=audit(1499792658.917:332): avc:  denied  { open } for  pid=2945 comm="kube-controller" path="/etc/kubernetes/pki/sa.key" dev="dm-0" ino=16820637 scontext=system_u:system_r:container_t:s0:c755,c834 tcontext=unconfined_u:object_r:cert_t:s0 tclass=file permissive=1

On CentOS 7, same thing, no denials.

You are volume mounting content from the host into a container. If you want an SELinux confined process inside the container to be able to read the content, it has to have an SELinux label that the container is allowed to read.

Mounting the object with :Z or :z would fix the issue. Note either of these would allow the container to write these objects. If you want to allow the container to read without writing then you could change the content on the host to something like container_share_t.
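
For illustration, a rough sketch of those two approaches with plain docker and chcon (the fedora image and the paths are only examples; kubeadm's static pods are not started via docker run directly):

# ":z" (shared) or ":Z" (private) asks the runtime to relabel the host path;
# as noted above, either label also permits writes from the container.
docker run --rm -v /etc/kubernetes/pki:/pki:ro,z fedora:latest ls -lZ /pki

# Read-only alternative: relabel the host content once so confined containers
# can read it without being allowed to write it.
chcon -R -t container_share_t /etc/kubernetes/pki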

https://github.com/kubernetes/kubernetes/pull/48607 will also help here as it starts making mounting everything but etcd read-only...

@luxas @jasonbrooks - someone want to tinker with adjusting the manifests ( https://kubernetes.io/docs/tasks/configure-pod-container/security-context/ ) ?

To me it's unclear which policies kubeadm should apply for:

  • read only mounts
  • read write mounts

such that they work on CentOS, Fedora _and_ CoreOS


@rhatdan It looks like :Z is only used if the pod provides an selinux label. In my initial tests, container_runtime_t seems to work -- would that be an appropriate label? And then, I'm assuming in a system w/o selinux, this would just be ignored?

Yes, it will be ignored by non-SELinux systems. Running an app as container_runtime_t basically provides no SELinux confinement, since it is supposed to be the label of container runtimes like docker and CRI-O. If you are running the kubelet as this, that is probably fairly accurate.

Right now, we're running the etcd container as spc_t -- would it be better to run that one as container_runtime_t too?

It looks like this does it:

diff --git a/cmd/kubeadm/app/master/manifests.go b/cmd/kubeadm/app/master/manifests.go
index 55fe560c46..228f935cdd 100644
--- a/cmd/kubeadm/app/master/manifests.go
+++ b/cmd/kubeadm/app/master/manifests.go
@@ -96,6 +96,7 @@ func WriteStaticPodManifests(cfg *kubeadmapi.MasterConfiguration) error {
                        LivenessProbe: componentProbe(int(cfg.API.BindPort), "/healthz", api.URISchemeHTTPS),
                        Resources:     componentResources("250m"),
                        Env:           getProxyEnvVars(),
+                        SecurityContext: &api.SecurityContext{SELinuxOptions: &api.SELinuxOptions{Type: "container_runtime_t",}},
                }, volumes...),
                kubeControllerManager: componentPod(api.Container{
                        Name:          kubeControllerManager,
@@ -105,6 +106,7 @@ func WriteStaticPodManifests(cfg *kubeadmapi.MasterConfiguration) error {
                        LivenessProbe: componentProbe(10252, "/healthz", api.URISchemeHTTP),
                        Resources:     componentResources("200m"),
                        Env:           getProxyEnvVars(),
+                        SecurityContext: &api.SecurityContext{SELinuxOptions: &api.SELinuxOptions{Type: "container_runtime_t",}},
                }, volumes...),
                kubeScheduler: componentPod(api.Container{
                        Name:          kubeScheduler,

Would this be something to submit as PRs to the 1.7 branch and to master, or just to master? The source moved around a bit in master, the patch above is to the 1.7 branch.

I would actually prefer that it run as spc_t, or as a confined domain (container_t). etcd should easily be able to be confined by SELinux.

I think spc_t should work. I tried w/ container_t and that didn't work. audit2allow says it needs:

allow container_t cert_t:file { open read };

Could we relabel the certs directory with container_file_t or container_share_t? Then it would work.

kubeadm creates an /etc/kubernetes/pki dir when you run kubeadm init, but when you kubeadm reset, it only empties that dir. If we created the pki dir when the rpm is installed, we could do the labeling at that point, by modding the spec file.

For etcd, the container would need allow container_t container_var_lib_t:file { create lock open read unlink write }; for /var/lib/etcd on the host.

I'm trying to figure out if it's legitimate to chcon directories in the rpm spec file -- I see many instances of it on github (https://github.com/search?l=&p=1&q=chcon+extension%3Aspec) but I can't tell whether that's considered good packaging practice or not. We could either change kubeadm to run the components as spc_t (unconfined), or we could leave kubeadm alone and chcon the pki dir.

I am not as familiar with the Kubernetes architecture as I should be. But are we talking about different containers or the same container, a kubeadm container versus an etcd container? The management container, which can launch other containers as "privileged", should be running as spc_t, since confining it buys us nothing. A service that just listens on the network and hands out data, on the other hand, could be run with more confinement.

kubeadm is distributed as a deb or rpm package, and it depends on the kubelet and the cni packages (and on docker). You start the kubelet, and then you run kubeadm, which creates manifests for etcd, apiserver, controller manager, scheduler and proxy, and those all run as containers. That's the main way to run kubeadm, as described here: https://kubernetes.io/docs/setup/independent/install-kubeadm/

I have experimented with running kubeadm as a system container, as well: http://www.projectatomic.io/blog/2017/05/testing-system-containerized-kubeadm/

OK, that is kind of what I thought. If we split all of these services into different system containers or orchestrated containers, some can probably run confined and some need to run with full privs. kubeadm, as a tool for an administrator, should be run with full privs: spc_t if it runs inside of a container; if it runs outside, it would run as the administrator's label.

If all of these services are running in the same container, then they would have to probably run as privileged.

They're all running in separate containers. They can run as container_t, but apiserver and controller manager need to open and read cert_t, and etcd needs access to container_var_lib_t.

We can create the /etc/kubernetes/pki and /var/lib/etcd dirs and set their contexts to container_share_t in the spec file for the kubeadm rpm, or we can make the apiserver and controller manager containers run as spc_t (like the etcd container does now) and have it just work, but without confinement, or maybe make some sort of custom policy or something like that.

What do you think, @rhatdan

As @jasonbrooks describes, we have a few options here. But that's not the main thing.

The main thing is, where do we store the secrets... I thought the consensus was to store the CA and stuff in kubernetes secrets, so then only spc_t is needed for etcd

@luxas @timothysc @jbeda

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

/lifecycle frozen

/cc @detiber

Hi, I still have a problem with SELinux when I run kubeadm init

audit2allow -a -w
type=AVC msg=audit(1522929610.297:136): avc:  denied  { write } for  pid=2817 comm="etcd" name="etcd" dev="dm-0" ino=67228425 scontext=system_u:system_r:svirt_lxc_net_t:s0:c430,c632 tcontext=system_u:object_r:var_lib_t:s0 tclass=dir

Versions:

  • kubeadm 1.9.3
  • CentOS 7.4

Looks like a directory in /var/lib/etcd is volume mounted into a container without a correct SELinux label on it.
Mounting this with the equivalent of :Z will fix that, or:
chcon -R -v -t svirt_sandbox_file_t /var/lib/etcd

And then it should work.

/assign @detiber

I suspect we can handle this by setting a security context on the static pod definitions where needed (and only conditionally based on whether selinux is enabled on the host).

I believe the container runtimes will ignore the security context if SELinux is not enabled on the host.
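
As a hypothetical sketch of such a host-side check (not what kubeadm does today), something like this could gate whether a securityContext gets injected into the generated manifests:

# Only add seLinuxOptions when the host actually has SELinux enabled.
if selinuxenabled 2>/dev/null; then
  echo "SELinux enabled (mode: $(getenforce)); inject securityContext into static pod manifests"
else
  echo "SELinux not enabled; generate manifests unchanged"
fi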

ref #1026 #1082

While trying to reproduce the issues with the latest CentOS, I noticed that the API server cannot load its certificate if SELinux is set to enforcing mode.

Here is my comment:
https://github.com/kubernetes/kubeadm/issues/1082#issuecomment-416991032

AVC Messages?

ausearch -m avc -ts recent

@rhatdan

time->Fri Aug 31 11:47:18 2018
type=PROCTITLE msg=audit(1535705238.732:281): proctitle=6B7562652D617069736572766572002D2D617574686F72697A6174696F6E2D6D6F64653D4E6F64652C52424143002D2D6164766572746973652D616464726573733D3139322E3136382E3231372E313333002D2D616C6C6F772D70726976696C656765643D74727565002D2D636C69656E742D63612D66696C653D2F6574632F
type=SYSCALL msg=audit(1535705238.732:281): arch=c000003e syscall=257 success=no exit=-13 a0=ffffffffffffff9c a1=c420427a10 a2=80000 a3=0 items=0 ppid=4525 pid=4541 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="kube-apiserver" exe="/usr/local/bin/kube-apiserver" subj=system_u:system_r:container_t:s0:c224,c932 key=(null)
type=AVC msg=audit(1535705238.732:281): avc:  denied  { read } for  pid=4541 comm="kube-apiserver" name="apiserver.crt" dev="dm-0" ino=604382 scontext=system_u:system_r:container_t:s0:c224,c932 tcontext=unconfined_u:object_r:cert_t:s0 tclass=file
----
time->Fri Aug 31 11:47:26 2018
type=PROCTITLE msg=audit(1535705246.653:285): proctitle=65746364002D2D6164766572746973652D636C69656E742D75726C733D68747470733A2F2F3132372E302E302E313A32333739002D2D636572742D66696C653D2F6574632F6B756265726E657465732F706B692F657463642F7365727665722E637274002D2D636C69656E742D636572742D617574683D74727565002D2D6461
type=SYSCALL msg=audit(1535705246.653:285): arch=c000003e syscall=257 success=no exit=-13 a0=ffffffffffffff9c a1=c420195d70 a2=80000 a3=0 items=0 ppid=4594 pid=4609 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="etcd" exe="/usr/local/bin/etcd" subj=system_u:system_r:container_t:s0:c315,c1002 key=(null)
type=AVC msg=audit(1535705246.653:285): avc:  denied  { read } for  pid=4609 comm="etcd" name="peer.crt" dev="dm-0" ino=102172270 scontext=system_u:system_r:container_t:s0:c315,c1002 tcontext=unconfined_u:object_r:cert_t:s0 tclass=file
----
time->Fri Aug 31 11:52:29 2018
type=PROCTITLE msg=audit(1535705549.708:291): proctitle=6B7562652D617069736572766572002D2D617574686F72697A6174696F6E2D6D6F64653D4E6F64652C52424143002D2D6164766572746973652D616464726573733D3139322E3136382E3231372E313333002D2D616C6C6F772D70726976696C656765643D74727565002D2D636C69656E742D63612D66696C653D2F6574632F
type=SYSCALL msg=audit(1535705549.708:291): arch=c000003e syscall=257 success=no exit=-13 a0=ffffffffffffff9c a1=c4205c5800 a2=80000 a3=0 items=0 ppid=4839 pid=4855 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="kube-apiserver" exe="/usr/local/bin/kube-apiserver" subj=system_u:system_r:container_t:s0:c224,c932 key=(null)
type=AVC msg=audit(1535705549.708:291): avc:  denied  { read } for  pid=4855 comm="kube-apiserver" name="apiserver.crt" dev="dm-0" ino=604382 scontext=system_u:system_r:container_t:s0:c224,c932 tcontext=unconfined_u:object_r:cert_t:s0 tclass=file
----
time->Fri Aug 31 11:52:36 2018
type=PROCTITLE msg=audit(1535705556.661:295): proctitle=65746364002D2D6164766572746973652D636C69656E742D75726C733D68747470733A2F2F3132372E302E302E313A32333739002D2D636572742D66696C653D2F6574632F6B756265726E657465732F706B692F657463642F7365727665722E637274002D2D636C69656E742D636572742D617574683D74727565002D2D6461
type=SYSCALL msg=audit(1535705556.661:295): arch=c000003e syscall=257 success=no exit=-13 a0=ffffffffffffff9c a1=c420195d70 a2=80000 a3=0 items=0 ppid=4907 pid=4922 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="etcd" exe="/usr/local/bin/etcd" subj=system_u:system_r:container_t:s0:c315,c1002 key=(null)
type=AVC msg=audit(1535705556.661:295): avc:  denied  { read } for  pid=4922 comm="etcd" name="peer.crt" dev="dm-0" ino=102172270 scontext=system_u:system_r:container_t:s0:c315,c1002 tcontext=unconfined_u:object_r:cert_t:s0 tclass=file

You are volume mounting some content in /etc/pki into the container?

Yes, in /etc/kubernetes/pki. That's how kubeadm works.

If no other confined processes are reading these files other then containers, then you could mount it using :z, or chcon -t container_share_t -R /etc/kubernetes/pki

Thanks @rhatdan I'll try that next week.

We also have kubelet itself reading certs, which isn't containerised. Anything else needed for that @rhatdan?

What label does the kubelet run with?

ps -eZ | grep kubelet

system_u:system_r:unconfined_service_t:s0 31110 ? 00:00:01 kubelet

Ah, ok this makes sense now. Also linking https://bugzilla.redhat.com/show_bug.cgi?id=1546160 for other travellers.

Thanks.

Since it is running as an unconfined service, it would not have any issue reading content labeled container_file_t.

Thanks Dan.

So, https://github.com/kubernetes/kubernetes/pull/68448 gets kubeadm-initialised nodes working on Fedora 28 with Docker+SELinux, with containers confined to container_t at least, with one additional manual command.

I have a number of questions about what we _should_ be doing though (which should be postponed until after 1.12):

  • Should we get changes made to container-selinux upstream or ship policies in our kubeadm package

    • We have caveats around the fact that both the etcd data directory and certificate directory are configurable at run-time.

  • The PR uses opencontainers/selinux to just write the extended attributes on certs and the etcd data dir.

    • Should this actually be applied with semanage fcontext + restorecon ?

  • Should we apply PodSecurityPolicies, particularly to the certificates directory, such that each of the core components then have private shares.

    • Would require further split out of certificates into directories per component, at least for the private key.

  • I had to manually apply container_file_t to /opt/cni/bin as the current practice of most CNI plugins is to mount it and then write their plugins into that host directory.

    • /opt/cni/bin is not a kubeadm concern, so haven't touched it here.

    • Environments interested in SELinux will probably not want /opt/cni/bin having things that didn't come from an RPM.

  • We need the e2e
Should we get changes made to container-selinux upstream or ship policies in our kubeadm package

It would probably be best to get them into container-selinux, to at least have me review them.

    We have caveats around the fact that both the etcd data directory and certificate directory are configurable at run-time.

The PR uses opencontainers/selinux to just write the extended attributes on certs and the etcd data dir.

    Should this actually be applied with semanage fcontext + restorecon ?

Yes, that would be best, although if everyone agrees on this, we could get them into the upstream package. Should these files be shared read-only or read/write?
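
For reference, a minimal sketch of the semanage fcontext + restorecon approach, assuming the default kubeadm paths (substitute container_share_t for container_file_t if the content should stay read-only to containers):

# Record persistent file-context rules, then apply them to what is on disk.
semanage fcontext -a -t container_file_t "/etc/kubernetes/pki(/.*)?"
semanage fcontext -a -t container_file_t "/var/lib/etcd(/.*)?"
restorecon -R -v /etc/kubernetes/pki /var/lib/etcd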

Should we apply PodSecurityPolicies, particularly to the certificates directory, such that each of the core components then have private shares.

    Would require further split out of certificates into directories per component, at least for the private key.

I have no idea.

I had to manually apply container_file_t to /opt/cni/bin as the current practice of most CNI plugins is to mount it and then write their plugins into that host directory.

A container_t process should be able to read/execute the default label on these files/directories. Do you want to allow containers to write to this directory?

$ matchpathcon /opt/cni/bin
/opt/cni/bin    system_u:object_r:bin_t:s0
$ sesearch -A -s container_t -t bin_t -c file
allow domain base_ro_file_type:file { getattr ioctl lock open read };
allow svirt_sandbox_domain exec_type:file { entrypoint execute execute_no_trans getattr ioctl lock map open read };

Thanks. I'll have a look at what to do for container-selinux next week.

A container_t process should be able to read/execute the default label on these files/directories. Do you want to allow containers to write to this directory?

Unfortunately, yes. The way most CNI plugins work now in their default manifests, they use an init container to download their CNI plugin and store it in /opt/cni/bin for kubelet to then use, even though our CNI RPM may already have them so that they can match the CNI plugin version being used with the rest of their control planes.

On the flip side, I think I can narrow write access down to just etcd for the data directory, and kubelet remaining unconfined can do its certificate rotation as normal.

The fact that the CNI RPM puts stuff in /opt is problematic for Atomic Hosts anyway, so maybe we need to address CentOS / Fedora support more widely?

@timothysc do you know if this work was / is scheduled for the 1.15 cycle?

@DanyC97 it's not

@rcythr made some recent discoveries WRT selinux support here:
https://github.com/kubernetes/kubeadm/issues/1654

we are leaning towards documenting ways of how to enable selinux in our setup docs.

Brief summary of the problem:

On selinux systems all files and processes are marked with an selinux label, and policies are installed to determine which label types can interact with one another. Docker and other runtimes which support selinux invoke their containers with a context like system_u:system_r:container_t:s0:... There are certain rules about what a container_t process can do, and there are two problematic ones I am aware of right now:

  1. it cannot interact with files of type cert_t (i.e. certificates and keys)
  2. it cannot write to /var/lib/ directories

Normally, when working directly with docker you could mount a volume with the :z or :Z flags, which will automatically relabel the files before mounting so the container can read them. In this case, I can't assume that docker is in use since other runtimes must be supported.

My initial workaround
I relabeled the _/etc/kubernetes_ directory and _/var/lib/etcd/_ directories with a type which all containers of type container_t can read and write (svirt_sandbox_file_t). This mostly works, but not all of the mounted directories can be relabeled like this without breaking the system (e.g. _/etc/pki_).

The best solution I have right now
I noticed that because kubelet is not containerized, it runs with type system_u:system_r:unconfined_service_t:s0, so I decided to find a container-selinux equivalent (which turns out to be spc_t). I added the following to the manifest files for all 4 pods:

securityContext:
  seLinuxOptions:
    user: "system_u"
    role: "system_r"
    type: "spc_t"
    level: "s0"

This solution is great because it relaxes selinux enough to allow these systems to work, but does not impact non-selinux systems at all. A similar (but worse) solution would be to set "privileged: true", which would work for the exact same reason, but would diminish security for all users (including selinux ones due to increased capabilities provided to the privileged containers).

However, this solution is not _perfect_ because it more or less fixes the problem by disabling the benefits of selinux for these service containers. It's basically the rough equivalent of calling setenforce 0 for just these 4 containers.

thanks for the detailed investigation @rcythr

However, this solution is not perfect because it more or less fixes the problem by disabling the benefits of selinux for these service containers. It's basically the rough equivalent of calling setenforce 0 for just these 4 containers.

it's still much better than turning off selinux for the node completely.
this solution allows the control plane and etcd to deploy with selinux not interfering with their operation; however, what about user workloads?

i guess we'd still need to document the securityContext additions.

@detiber @rosti PTAL.

PS. also i'm wondering if there are still remaining gotchas here.

User workloads will get the usual type system_u:system_r:container_t:s0:.... Note: the ... is necessary because of MCS.

The most notable gotcha I'm aware of right now is that in my experience some of the pod network plugins will fall flat on their face the moment they try to start with selinux enabled. I have a similar issue open with calico to resolve this problem. It's a rather brute force fix, but it should work. https://github.com/projectcalico/calico/issues/2704

Similarly, user applications may have been built without selinux in mind, and will break. An easy way to test would be to temporarily run the pod/container that's breaking with "privileged: true" and see if it suddenly starts working fine. Then they can raise an issue/PR with the developer of the broken container to ask them, nicely, to support selinux.
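
A throwaway test pod for that check might look like this (the image name is a placeholder for whatever workload is failing; delete the pod afterwards):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: selinux-check
spec:
  containers:
  - name: app
    image: registry.example.com/failing-workload:latest
    securityContext:
      privileged: true
EOF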

CNI plugins failing can be a problem. the most popular ones we got from a survey were Flannel, Calico, Weave and the rest of the results were quite spread.

have you tried weave or flannel with selinux?

but in any case, we have to document the selinux case for both user workloads and CNI.

going back to the proposal for the kubeadm manifests.
i forgot to mention that we have plans to run these containers as non-root, quite possibly in the 1.16 timeframe, would that change the selinux behavior?

cc @randomvariable

CNI plugins failing can be a problem. the most popular ones we got from a survey were Flannel, Calico, Weave and the rest of the results were quite spread.

have you tried weave or flannel with selinux?

No, but I will give it a try and see whether it works or not.

but in any case, we have to document the selinux case for both user workloads and CNI.

Agreed. A warning about what might happen if it's left in enforcing is certainly helpful. Users who want to keep it in enforcing probably won't care if some userspace applications fail -- this is most likely the same group that will require non-root containers (myself included). CNI is more problematic, but I suspect most CNI plugins will be eager to accept selinux patches ;)

going back to the proposal for the kubeadm manifests.
i forgot to mention that we have plans to run these containers as non-root, quite possibly in the 1.16 timeframe, would that change the selinux behavior?

I just did a quick test using bitnami/nginx which i know runs as non-root.

system_u:system_r:container_t:s0:c615,c897 1001 18714 0.5  0.0 26860 3280 ?    Ss   18:47   0:00 nginx: master process /opt/bitnami/nginx/sbin/nginx -c /opt/bitnami/nginx/conf/nginx.conf -g daemon off;

The selinux context is the same as if it were running as root; however, we should be careful because I suspect that this could change in the future. Right now the only difference is it's pid 1001 instead of 0. As a result, changing these containers to run as non-root shouldn't break this fix as spc_t's privileges are a strict superset of container_t's privileges.

but I suspect most CNI plugins will be eager to accept selinux patches ;)

yes, hopefully.

The selinux context is the same as if it were running as root; however, we should be careful because I suspect that this could change in the future. Right now the only difference is it's pid 1001 instead of 0. As a result, changing these containers to run as non-root shouldn't break this fix as spc_t's privileges are a strict superset of container_t's privileges.

ok, thanks. that's good to know.

Here's the patch I cooked up so far. Seems to generate the right manifests. https://github.com/rcythr/kubernetes/commit/fb6c0248488af2b3a54ceb638e29db84ae0530bf

I'll create a PR once I've tested a couple of other things and after I've signed the CLA -- I currently have a support ticket in with CNCF due to a mail issue.

The three things I'm currently wanting to test:

  1. I want to look into incorporating some of the changes by @randomvariable to more tightly confine some of the containers, where possible. I believe his change will allow us to tightly confine kube-scheduler and etcd; however, because we cannot relabel /etc/ssl/certs or /etc/pki without breaking the system, we cannot confine kube-apiserver or kube-controller-manager tighter than spc_t. We either need to create a custom type for them, stop using these two system directories, or just live with the spc_t type.
  2. Looking over the history of this ticket, I noticed some chatter about issues with spc_t on CoreOS. I want to do some testing to see if that's still a problem, or what was going on there.
  3. While I believe this is only tangentially related to PR, I want to test flannel and weave net under selinux enforcing.

Thanks for taking this on @rcythr .

Could be wrong, but I don't think I've had issues with CNIs once they've had the privileged flag on.

Just as a dump of stuff I've had to do, at least for Fedora 28 (maybe some of this is now incorporated into container-selinux):

Create the following modules:

Allows Cilium and Weave Scope to work

module container_bpf 1.0;

require {
  type spc_t;
  type container_runtime_t;
  class bpf { map_create prog_load prog_run map_write map_read };
}

#============= spc_t ==============
allow container_runtime_t self:bpf { map_create prog_load prog_run map_write map_read };
allow container_runtime_t self:bpf map_create;
allow container_runtime_t spc_t:bpf { map_read map_write };
allow spc_t container_runtime_t:bpf { map_create prog_load prog_run map_write map_read };
allow spc_t self:bpf { map_create prog_load prog_run map_write map_read };

Allow containers to load certificates

module container_cert 1.0;

require {
  type container_t;
  type cert_t;
  class file { open read };
  class dir { read };
  class lnk_file { read };
}

allow container_t cert_t:file { open read };
allow container_t cert_t:lnk_file { read };
allow container_t cert_t:dir { read };
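
These can be compiled and loaded with the standard SELinux toolchain; a minimal sketch, assuming the two sources above are saved locally as container_bpf.te and container_cert.te:

# Compile each .te source into a policy package and install both modules.
checkmodule -M -m -o container_bpf.mod container_bpf.te
semodule_package -o container_bpf.pp -m container_bpf.mod
checkmodule -M -m -o container_cert.mod container_cert.te
semodule_package -o container_cert.pp -m container_cert.mod
semodule -i container_bpf.pp container_cert.pp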

The thing that concerns me the most is the testing situation here. Currently we have no way to automatically perform a smoke test of SELinux support with kubeadm and k8s as a whole. kind/kinder base their node images on ubuntu 19.04. This, of course, does not have SELinux enabled in it by default.
We need the ability to spin up a testing cluster with kinder on a Fedora/CentOS base node image with SELinux enabled. Once we can do that, we can maintain a working state of SELinux support through the k8s test-grid.

If we don't handle the testing situation properly, we might end up in a "works today, but might not work tomorrow" situation and angry users over claimed SELinux support in our documentation.

100% agreed.

I'm tempted to say we hold off until CentOS 8 goes GA so we have a consistent baseline wrt the kernel version across distros.

If we don't handle the testing situation properly, we might end up in a "works today, but might not work tomorrow" situation and angry users over claimed SELinux support in our documentation.

if selinux is something that strictly requires e2e tests, we cannot support it today.

problem is that the ecosystem has so many distros and flavors that we cannot test them all.
if we don't want to maintain code for selinux in kubeadm, we can still have a guide in our setup with a disclaimer of "may not work", which was my initial proposal.

If we sort out #1379, this would take us a long way towards enabling users who want to have a stricter SELinux setup.

I will gather instructions on how you can do SELinux in an unsupported fashion today - either as a blogpost or a doc that can go on docs.k8s.io.

Additionally, @TheFoxAtWork did raise SELinux at CNCF SIG Security this week, so wondering if there's broader interest in getting this working.

problem is that the ecosystem has so many distros and flavors that we cannot test them all.

One thing that helps here is that the container-selinux package is shared across all of the distros, so it is maybe OK to add only one. The suggestion is to add CentOS 8 because it'll be on Linux 4.18, which is slightly ahead of Ubuntu 18.04 but not by enough to start exhibiting other bugs. Amazon Linux 2 is also an option if we're testing on it elsewhere.

problem is that the ecosystem has so many distros and flavors that we cannot test them all.

One thing that helps here is that the container-selinux package is shared across all of the distros, so it is maybe OK to add only one. The suggestion is to add CentOS 8 because it'll be on Linux 4.18, which is slightly ahead of Ubuntu 18.04 but not by enough to start exhibiting other bugs. Amazon Linux 2 is also an option if we're testing on it elsewhere.

i guess my point was more about the fact that selinux is not something that is really supported on Ubuntu and AppArmor is the alternative for it.

https://security.stackexchange.com/a/141716

Now practically SELinux works better with Fedora and RHEL, as it comes pre-shipped, while AA works better on Ubuntu and SUSE, which means it would be better to learn how to use SELinux on the former distros than going through the hassle of making AA work on them, and vice versa.

this is the distro flavor mess that i don't want to get kubeadm into.

the kubeadm survey told us that 65% of our users use Ubuntu, so technically we should be prioritizing apparmor. has anyone tried kubeadm with apparmor?
xref https://kubernetes.io/docs/tutorials/clusters/apparmor/

If we sort out #1379, this would take us a long way towards enabling users who want to have a stricter SELinux setup.

yes, it feels to me we should just document some basic details and punt the rest to #1379.
but i'm also seeing demand for static pod configuration enhancements in the next v1betaX, because RN we cannot persist securitycontext modifications after upgrade.

has anyone tried kubeadm with apparmor?

AppArmor is much more lightweight than SELinux and has a different security model. It's pretty much on by default these days for Docker on Ubuntu.

this is the distro flavor mess that i don't want to get kubeadm into.

Given AppArmor can't be used on CentOS and equivalents (other than AL2 which supports both), we're already there in saying some percentage of users can't make use of a Linux Security Module in a supported fashion.

this is the distro flavor mess that i don't want to get kubeadm into.

I definitely understand the desire to not get into the mess of distros and options -- it's a combinatorial explosion of test configurations. Personally, I believe if kubeadm was at least _compatible_ with selinux it would have a larger share of non-ubuntu users, but I have no proof of that beyond the fact I'm one of those people. However, if the only distro/cri combination that's tested is ubuntu with docker, then that's really the only supported distro/cri.

If you don't want to support other configurations that's your choice, but at least be clear about that in the documentation and close this issue now. Telling centos/rhel/fedora users to disable selinux for their entire system because figuring out (and testing) the policy for an application is annoying is equivalent to telling them to disable their firewall because figuring out (and testing) the rules is annoying.

@randomvariable

AppArmor is much more lightweight than SELinux and has a different security model. It's pretty much on by default these days for Docker on Ubuntu.

actually, i think it's already running in the prow/kubekins image.

@rcythr

If the only distro/cri combination that's tested is ubuntu with docker, then that's really the only supported distro/cri.

currently we are testing containerd and docker on Ubuntu.

Telling centos/rhel/fedora users to disable selinux for their entire system because figuring out (and testing) the policy for an application is annoying is equivalent to telling them to disable their firewall because figuring out (and testing) the rules is annoying.

we need help with the selinux details. we already tell "CentOS, RHEL or Fedora" users to disable selinux completely:
https://github.com/kubernetes/website/blob/master/content/en/docs/setup/production-environment/tools/kubeadm/install-kubeadm.md#installing-kubeadm-kubelet-and-kubectl

this isn't desired and i still think we should have a document or a paragraph with some guiding steps.

Filed https://github.com/containerd/cri/issues/1195 to log the fact that there's still work to be done to complete SELinux support in ContainerD.

  1. I want to look into incorporating some of the changes by @randomvariable to more
    tightly confine some of the containers, where possible. I believe his change will allow us
    to tightly confine kube-scheduler and etcd; however, because we cannot relabel
    /etc/ssl/certs or /etc/pki without breaking the system, we cannot confine kube-apiserver
    or kube-controller-manager tighter than spc_t. We either need to create a custom type
    for them, stop using these two system directories, or just live with the spc_t type.

Not using system directories or creating custom types for components needing access to these system directories would allow us to further tighten the various components and avoid using spc_t type completely. This is clearly the best solution IMHO.

  1. I want to look into incorporating some of the changes by @randomvariable to more
    tightly confine some of the containers, where possible. I believe his change will allow us
    to tightly confine kube-scheduler and etcd; however, because we cannot relabel
    /etc/ssl/certs or /etc/pki without breaking the system, we cannot confine kube-apiserver
    or kube-controller-manager tighter than spc_t. We either need to create a custom type
    for them, stop using these two system directories, or just live with the spc_t type.

Not using system directories or creating custom types for components needing access to these system directories would allow us to further tighten the various components and avoid using spc_t type completely. This is clearly the best solution IMHO.

I agree. That's why I wanted to test it out and get it working.

Based on feedback from @neolit123, I doubt we'll see built-in changes to the code to automatically handle selinux compatibility anytime soon. Instead I'm going to make a doc page that describes how to use kubeadm on systems with selinux. It'll be a few steps longer than the usual kubeadm init process, but it should help anyone who wants to use kubeadm on selinux systems immediately.

On that page I'll present three options:

  1. Disable selinux: This is the least confined option, but the most supported. It may be necessary for some CNI plugins or user workloads.
  2. Keep using the system directories /etc/pki and /etc/ssl/certs and use spc_t to avoid problems.
  3. Custom directories, and chcon relabeling and pretty tight confinement.

Hi, just wanted to comment that we have been running with selinux enabled for almost a year together with kubeadm and kubernetes+docker+calico on CentOS 7. Most workloads have no issues; the only issues I can recall were with Concourse.

However, we would like to do better enforcement at some point. Currently we run
'chcon -Rt container_file_t' on the directories below and make sure they are created before running kubeadm init (including /etc/kubernetes/pki/etcd):

/var/lib/etcd (etcd datadir)
/etc/kubernetes/pki (`certificatesDir` in kubeadm.conf)
/etc/cni/net.d
/opt/cni/bin/
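
Put together, that pre-init step might look roughly like this (a sketch; adjust the list if certificatesDir or the etcd data dir are customised in kubeadm.conf):

# Create the directories and relabel them before running kubeadm init.
for dir in /var/lib/etcd /etc/kubernetes/pki /etc/kubernetes/pki/etcd /etc/cni/net.d /opt/cni/bin; do
  mkdir -p "$dir"
  chcon -R -t container_file_t "$dir"
done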

@qpehedm If you could apply a JSONPatch to the static pod manifests, and also make sure kubeadm runs restorecon for each file/directory it writes, would that be sufficient to allow you to set stricter confinement?

@randomvariable Yes, adding a securityContext with spc_t for the k8s static pods as suggested is likely better than our current solution. I suppose the same needs to be done for CNI plugins or other infrastructure containers. Even better would be dedicated labels for this purpose, so that they only get permissions for the specific files needed?

i'm going to close this ticket and here is my rationale, explained with bullet points:

/close

@neolit123: Closing this issue.

In response to this:

i'm going to close this ticket and here is my rationale, explained with bullet points:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
