Kubeadm: Check for /sys/fs/cgroup/cpu/cpu.cfs_quota_us / CONFIG_CFS_BANDWIDTH on init

Created on 26 Oct 2020 · 16 comments · Source: kubernetes/kubeadm

What keywords did you search in kubeadm issues before filing this one?

cfs_quota_us aarch64 centos8 cri-o crio

Oct 24 01:00:44 chaos kubelet[14438]: E1024 01:00:44.465714   14438 kuberuntime_manager.go:804] container &Container{Name:etcd,Image:k8s.gcr.io/etcd:3.4.13-0,Command:[etcd --advertise-client-urls=https://192.168.1.70:2379 --cert-file=/et>
Oct 24 01:00:44 chaos kubelet[14438]: time="2020-10-24T01:00:44Z" level=error msg="container_linux.go:349: starting container process caused \"process_linux.go:449: container init caused \\\"process_linux.go:415: setting cgroup config fo>
Oct 24 01:00:44 chaos kubelet[14438]: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:415: setting cgroup config for procHooks process caused \\\"failed to write \>
Oct 24 01:00:44 chaos kubelet[14438]: E1024 01:00:44.465863   14438 pod_workers.go:191] Error syncing pod e0f6df37808b0e77cc354267a4dc6b38 ("etcd-chaos_kube-system(e0f6df37808b0e77cc354267a4dc6b38)"), skipping: failed to "StartContainer">

Is this a BUG REPORT or FEATURE REQUEST?

FEATURE REQUEST

As seen in:
https://github.com/cri-o/cri-o/issues/4307
https://bugs.centos.org/view.php?id=17813

kubeadm should check for the existence of /sys/fs/cgroup/cpu/cpu.cfs_quota_us. Its absence is a sign that the kernel was compiled without CONFIG_CFS_BANDWIDTH, which results in the errors above and a failed kubeadm init.
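As an illustration only (not kubeadm's actual preflight code), a minimal standalone Go sketch of such a check might look like the following; the function name and messages are made up, and the path only covers cgroup v1 (on cgroup v2 the equivalent interface is cpu.max):

```go
package main

import (
	"fmt"
	"os"
)

// checkCFSQuotaFile is a hypothetical preflight check: it only verifies that the
// cgroup v1 CPU controller exposes cpu.cfs_quota_us, which is missing when the
// kernel was built without CONFIG_CFS_BANDWIDTH.
func checkCFSQuotaFile() error {
	const path = "/sys/fs/cgroup/cpu/cpu.cfs_quota_us"
	if _, err := os.Stat(path); os.IsNotExist(err) {
		return fmt.Errorf("%s not found: the kernel was likely built without CONFIG_CFS_BANDWIDTH, "+
			"so CPU CFS quota enforcement will fail", path)
	} else if err != nil {
		return fmt.Errorf("could not stat %s: %v", path, err)
	}
	return nil
}

func main() {
	if err := checkCFSQuotaFile(); err != nil {
		fmt.Fprintln(os.Stderr, "WARNING:", err)
	}
}
```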

Versions

kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:47:53Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/arm64"}

kind/bug kind/feature priority/backlog sig/node


All 16 comments

kubeadm's system validators reside here:
https://github.com/kubernetes/system-validators/

kubeadm should check for the existence of /sys/fs/cgroup/cpu/cpu.cfs_quota_us. Its absence is a sign that the kernel was compiled without CONFIG_CFS_BANDWIDTH, which results in the errors above and a failed kubeadm init.

@SergeyKanzhelev @KentaTada @odinuge
what do you think?

/sig node
/priority awaiting-more-evidence

I think we should add CONFIG_CFS_BANDWIDTH and CONFIG_FAIR_GROUP_SCHED as optional kernel configs in https://github.com/kubernetes/system-validators/blob/6069d2dca63940eeee5482fa739ee3806d77b896/validators/types_unix.go#L52.

The above error is caused by runc because runc does not check for the existence of /sys/fs/cgroup/cpu/cpu.cfs_quota_us.

On runc's side, CONFIG_CFS_BANDWIDTH, which creates /sys/fs/cgroup/cpu/cpu.cfs_quota_us, is listed as one of the optional kernel configs:
https://github.com/opencontainers/runc/blob/ff9852c4de0596bee80927ee1cf7a7218a396bf5/script/check-config.sh

In addition, the runtime-spec defines this value as optional:
https://github.com/opencontainers/runtime-spec/blob/v1.0.2/config-linux.md#cpu

In other words, /sys/fs/cgroup/cpu/cpu.cfs_quota_us is just an option and not strictly required by the container-runtime specification.
The container runtime should check for the existence of /sys/fs/cgroup/cpu/cpu.cfs_quota_us and not write to it if the file does not exist.

So, I think that the proper fixes are

  • to add CONFIG_CFS_BANDWIDTH and CONFIG_FAIR_GROUP_SCHED as optional kernel configs in https://github.com/kubernetes/system-validators.
  • to modify runc to check for the existence of /sys/fs/cgroup/cpu/cpu.cfs_quota_us and not write to it if the file does not exist.

I think this is just the tip of the iceberg, so I want to update the kernel config checklist in https://github.com/kubernetes/system-validators/blob/6069d2dca63940eeee5482fa739ee3806d77b896/validators/types_unix.go
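For comparison, here is a rough standalone sketch (not the system-validators implementation) of the kind of kernel-config inspection being discussed, similar in spirit to runc's check-config.sh: read the kernel config and report whether CONFIG_CFS_BANDWIDTH and CONFIG_FAIR_GROUP_SCHED are enabled. The file locations and parsing below are simplified assumptions:

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// readKernelConfig returns the kernel config as text, trying /proc/config.gz
// first and falling back to the first /boot/config-* file it finds.
func readKernelConfig() (string, error) {
	if f, err := os.Open("/proc/config.gz"); err == nil {
		defer f.Close()
		gz, err := gzip.NewReader(f)
		if err != nil {
			return "", err
		}
		defer gz.Close()
		data, err := io.ReadAll(gz)
		return string(data), err
	}
	matches, _ := filepath.Glob("/boot/config-*")
	if len(matches) == 0 {
		return "", fmt.Errorf("no kernel config found under /proc or /boot")
	}
	data, err := os.ReadFile(matches[0])
	return string(data), err
}

func main() {
	config, err := readKernelConfig()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Report the two configs discussed in this thread.
	for _, opt := range []string{"CONFIG_CFS_BANDWIDTH", "CONFIG_FAIR_GROUP_SCHED"} {
		enabled := strings.Contains(config, opt+"=y")
		fmt.Printf("%s enabled: %v\n", opt, enabled)
	}
}
```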

Thanks @KentaTada! I do agree with both proposed changes. I am surprised that runc is not checking for something under /cpu before writing to it. This could be fixed in a newer version of runc.

With respect to the optional config in system-validators, i am more tentative about the change because i think it can be avoided by just relying on runc to fail at runtime? But if you send the PR i will ask for a sig node reviewer and we can merge it.

@neolit123
I think that we should update the checklist of system-validators because users can get information about the error in advance. In particular, it helps when using a new, immature container runtime.
So, I'll create the PR for system-validators after I consider what checks are needed.
Could you review it?

Thanks!

I agree that we should check those kernel configs. They are currently required in order to run Kubernetes with the default config, although CFS quota enforcement can be disabled with the kubelet flag --cpu-cfs-quota=false. Have you tried running with that, @kallisti5? Off the top of my head, I think we still write (or try to write) the cpu quota at the pod level no matter what the flag is set to, and in that case we should probably fix that as well.

When it comes to runc, I think the current behavior is reasonable. When we provide a cpu cfs quota and period to a CRI, I think it is reasonable that the container runtime tries to enforce it and then fails if support is lacking. Failing silently in this case feels strange to me, or do you have other thoughts?

I commented in the linked runc issue, but yes, generally speaking our stance is that we don't silently ignore settings, because basically all container runtime settings are related to security in some form and ignoring explicit requests from a user is not a good idea. As far as I can see, the kubelet seems to have some default cpu cgroup settings which runc is trying to faithfully apply, and maybe the auto-detection should be done there (or as part of the setup for a node?).

Yeah, that sounds correct @cyphar. If the node doesn't support cfs, I think we should do the discovery and changes in Kubernetes to avoid the problem altogether. And yeah, the kubelet will set cfs settings (and try to apply them) for ~all pods via libcontainer and for all containers via cri/dockershim.

The question is probably then whether we want to discover it and silently disable cfs quotas, or whether we should require users to set the config flag to disable it (and fail the kubelet if it isn't supported and the flag is not set), behaving the same way as we do with swap today?
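For illustration, the "fail fast, like swap" option could look roughly like the sketch below at kubelet startup. This is hypothetical code, not the kubelet's actual implementation; cpuCFSQuota stands for the value of --cpu-cfs-quota, and the check only covers the cgroup v1 path:

```go
package main

import (
	"fmt"
	"os"
)

// validateCFSQuotaSupport is a hypothetical startup check, sketched in the
// spirit of the existing fail-on-swap behavior: if CPU CFS quota enforcement is
// enabled but the kernel exposes no cpu.cfs_quota_us (cgroup v1), refuse to start.
func validateCFSQuotaSupport(cpuCFSQuota bool) error {
	if !cpuCFSQuota {
		// User opted out via --cpu-cfs-quota=false; nothing to verify.
		return nil
	}
	if _, err := os.Stat("/sys/fs/cgroup/cpu/cpu.cfs_quota_us"); os.IsNotExist(err) {
		return fmt.Errorf("CPU CFS quota enforcement is enabled but the kernel lacks " +
			"CFS bandwidth support; rebuild with CONFIG_CFS_BANDWIDTH=y or set --cpu-cfs-quota=false")
	}
	return nil
}

func main() {
	if err := validateCFSQuotaSupport(true); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```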

@KentaTada

I think that we should update the checklist of system-validators because users can get information about the error in advance. In particular, it helps when using a new, immature container runtime.
So, I'll create the PR for system-validators after I consider what checks are needed.

SGTM, please make sure the PR closes this issue (with a "closes ..." reference).

Could you review it?

sure.

@odinuge

The question is probably then if we want to discover it and silently disable cfs quotas, or if we should require users to set the config flag to disable it (and fail kubelet if it isn't supported with the flag not set), and behave in the same way as we do with swap today?

if swap is enabled on a node where the kubelet runs, it fails fast and provides a somewhat descriptive message for a user action.
i think we should do the same for this case. based on the logs by @kallisti5 it doesn't seem the kubelet is doing that today.

@odinuge would you like to help by logging an issue in kubernetes/kubernetes with a /sig node tag? i could, but i don't think i understand the underlying reasons well enough.

if swap is enabled on a node where the kubelet runs, it fails fast and provides a somewhat descriptive message for a user action.
i think we should do the same for this case. based on the logs by @kallisti5 it doesn't seem the kubelet is doing that today.

I agree that it would be the best approach, so that sounds good!

@odinuge would you like to help by logging an issue in kubernetes/kubernetes with a /sig node tag? i could, but i don't think i understand the underlying reasons well enough.

Yeah, I can make a sig-node issue about the details around --cpu-cfs-quota in case it behaves unexpectedly, and the fact that we should/can verify it and fail fast with a warning if things are in a bad state. It would be nice if @kallisti5 could verify whether things work as expected when running with --cpu-cfs-quota=false.

Edit:

For reference https://github.com/torvalds/linux/blob/master/kernel/sched/core.c#L8183-L8205

  • CONFIG_FAIR_GROUP_SCHED should be set as required, since there is currently no way to disable using it in Kubernetes
  • CONFIG_CFS_BANDWIDTH should be set as optional, as long as --cpu-cfs-quota=false actually works when CONFIG_CFS_BANDWIDTH=n (a sketch of the corresponding entries follows below)
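Loosely, the corresponding entries could look like the sketch below. The KernelConfig struct mirrors the one in kubernetes/system-validators (field names assumed from the public repo), the variable names are made up, and the config names are listed without the CONFIG_ prefix in the style of the existing lists:

```go
// Package validators is illustrative only; this is not the actual
// kubernetes/system-validators source.
package validators

// KernelConfig mirrors the struct used by kubernetes/system-validators
// (see validators/types_unix.go in that repo; field names assumed here).
type KernelConfig struct {
	Name        string
	Aliases     []string
	Description string
}

// Hypothetical additions: FAIR_GROUP_SCHED as required, CFS_BANDWIDTH as optional.
var (
	proposedRequired = []KernelConfig{
		{Name: "FAIR_GROUP_SCHED", Description: "Required for CPU cgroup scheduling; cannot currently be disabled in Kubernetes."},
	}
	proposedOptional = []KernelConfig{
		{Name: "CFS_BANDWIDTH", Description: "Required if CPU CFS quota enforcement is enabled (the default); can be avoided with --cpu-cfs-quota=false."},
	}
)
```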

@kallisti5
I think we could resolve this issue in https://github.com/kubernetes/system-validators/pull/19.
Could you close this ticket if you don't have any questions?

if we want to make the changes apply in 1.20 we still have to push a new release/tag of the system-validators library and re-vendor it in kubernetes/kubernetes.

@KentaTada i see you sent another PR that adds CGROUP_PIDS as required and CGROUP_HUGETLB as optional:
https://github.com/kubernetes/system-validators/pull/20
that SGTM; @odinuge do you agree?

both these PRs introduce new required configuration options (FAIR_GROUP_SCHED and CGROUP_PIDS), which can trip up existing setups:
https://github.com/kubernetes/system-validators/pull/19
https://github.com/kubernetes/system-validators/pull/20
how common are these options and do you have any concerns about that?
i'm less concerned about CGROUP_PIDS, for some reason.

one good aspect is that users are allowed to skip the kernel validation entirely (at least on the kubeadm side).

if we want to make the changes apply in 1.20 we still have to push a new release/tag of the system-validators library and re-vendor it in kubernetes/kubernetes.

Nice! Would be nice to get them into v1.20!

how common are these options and do you have any concerns about that?

Running with a configuration that fails the new validation has never been supported (or worked), so I think it should be good to go. The only places they might be disabled, from my experience, are Raspberry Pi-like devices used for prototyping.

Kubeadm actually has a fair number of RPI users. Still, it feels right to include these in the kernel validation. Thanks for confirming.

I will send the vendor update later today.

v1.3.0 tag pushed.

PR for the vendor update for 1.20 is here:
https://github.com/kubernetes/kubernetes/pull/96378

kubernetes/kubernetes#96378

It didn't get approved in time for 1.20, so it has to be moved to 1.21.
