Kops: Kops default images should follow cgroup hierarchy best practices

Created on 3 Nov 2017 · 18 comments · Source: kubernetes/kops

This document outlines the recommended cgroup hierarchy. If we modify the default base image to match this it would be much more intuitive for people to enable Resource Reservations, etc.

https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-allocatable.md#recommended-cgroups-setup

lifecycle/rotten

All 18 comments

@blakebarnett awesome find!

How do other operating systems handle this? Do we need a different process for RHEL? CoreOS?

/cc @gambol99

I know nada about CoreOS

We should put this in nodeup

As a workaround with the default Debian images, would it be sufficient to use system.slice as the system-reserved cgroup and use hooks to create a dedicated cgroup for kube-reserved?

i.e. podruntime.slice:

# /usr/lib/systemd/system/podruntime.slice
[Unit]
Description=Limited resources slice for Kubernetes services
Documentation=man:systemd.special(7)
DefaultDependencies=no
Before=slices.target
Requires=-.slice
After=-.slice

EDIT: I guess it would also be necessary to add drop-ins for kubelet.service and docker.service that set Slice=podruntime.slice, to match the recommended cgroup hierarchy?

# /etc/systemd/system/docker.service.d/set-slice.conf
[Service]
Slice=podruntime.slice

# /etc/systemd/system/kubelet.service.d/set-slice.conf
[Service]
Slice=podruntime.slice

Is it also necessary to set --cgroup-parent on Docker?
I assume it would need to be added to https://godoc.org/k8s.io/kops/pkg/apis/kops#DockerConfig, or should this be controlled through KubeletConfigSpec.CgroupRoot?
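
For what it's worth, as far as I understand --cgroup-parent controls where Docker places containers, while the Slice= drop-ins above only move dockerd (and the kubelet) themselves. A quick way to check where the daemons actually ended up (a sketch, assuming the process names are dockerd and kubelet):

# Show the cgroups each daemon currently belongs to; paths under
# /podruntime.slice indicate the drop-ins (or kubeletCgroups/runtimeCgroups) took effect.
cat /proc/$(pidof dockerd)/cgroup
cat /proc/$(pidof kubelet)/cgroup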

Just wanted to share back my finalised cluster manifest snippet for kops 1.8 / Kubernetes 1.7.12, running m4.large masters and m4.xlarge nodes. Would love feedback if I did something wrong.

Note: Adjust memory reservations according to your node size or your cluster won't come up - I spent way too long on this as I was testing on machines that were too small... :rage4:

clusterSpec - snippet:

...
  # fileAssets only works with kops >=1.8.0
  fileAssets:
  - name: podruntime-slice
    # if no roles specified, defaults to all roles
    path: /usr/lib/systemd/system/podruntime.slice
    content: |
      [Unit]
      Description=Limited resources slice for Kubernetes services
      Documentation=man:systemd.special(7)
      DefaultDependencies=no
      Before=slices.target
      Requires=-.slice
      After=-.slice
  masterKubelet:
    kubeletCgroups: "/podruntime.slice"
    runtimeCgroups: "/podruntime.slice"
  kubelet:
    kubeletCgroups: "/podruntime.slice"
    runtimeCgroups: "/podruntime.slice"
    # Comma-delimited list of hard eviction expressions.  For example, 'memory.available<300Mi'
    # Eviction signals: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#eviction-signals
    evictionHard: memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5%
    # Resource reservation for kubernetes system daemons like the kubelet, container runtime, node problem detector, etc. (map[string]string)
    # https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#kube-reserved
    kubeReserved:
      cpu: 750m
      memory: 1.3Gi
    # Absolute name of the top level cgroup that is used to manage kubernetes components for which compute resources were reserved
    kubeReservedCgroup: "/podruntime.slice"
    # not enforcing system-reserved due to issue with /system.slice
    enforceNodeAllocatable: "pods,kube-reserved"
...

Note: This snippet is not setting system-reserved because:

  1. it is advised to only do this when you really know what you're doing (i.e. you have monitored resource usage for long enough)
  2. I was getting an error on kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-12-02 (see below)

Otherwise, set system-reserved as follows:

... <under kubelet key>
    # Capture resource reservation for OS system daemons like sshd, udev, etc. (map[string]string)
    systemReserved:
      cpu: 250m
      memory: 750Mi
    # Absolute name of the top level cgroup that is used to manage non-kubernetes components for which compute resources were reserved
    systemReservedCgroup: "/system.slice"

    # and don't forget to change the above enforceNodeAllocatable into this:
    enforceNodeAllocatable: "pods,system-reserved,kube-reserved"
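
With either variant applied, a quick sanity check is to compare the node's Capacity and Allocatable, which should differ by roughly kubeReserved + systemReserved + the hard eviction threshold (a sketch; <node-name> is a placeholder):

# Allocatable is approximately Capacity minus kubeReserved, systemReserved
# and the evictionHard memory threshold.
kubectl describe node <node-name> | grep -A 6 -E 'Capacity|Allocatable'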


podruntime.slice works fine:

container_manager_linux.go:108] Configure resource-only container "/podruntime.slice" with memory limit: 0

results in

admin@ip-10-0-0-1:~$ sudo find /sys/fs/cgroup -name "podruntime.slice"
/sys/fs/cgroup/pids/podruntime.slice
/sys/fs/cgroup/perf_event/podruntime.slice
/sys/fs/cgroup/net_cls,net_prio/podruntime.slice
/sys/fs/cgroup/freezer/podruntime.slice
/sys/fs/cgroup/devices/podruntime.slice
/sys/fs/cgroup/memory/podruntime.slice
/sys/fs/cgroup/blkio/podruntime.slice
/sys/fs/cgroup/cpu,cpuacct/podruntime.slice
/sys/fs/cgroup/cpuset/podruntime.slice
/sys/fs/cgroup/systemd/podruntime.slice

and for system.slice:

[Failed to start ContainerManager Failed to enforce System Reserved Cgroup Limits on "/system.slice": "/system.slice" cgroup does not exist]
admin@ip-10-0-0-1:~$ sudo find /sys/fs/cgroup -name "system.slice"
/sys/fs/cgroup/systemd/system.slice

I tried overwriting /etc/systemd/system.conf through fileAssets to change

#JoinControllers=cpu,cpuacct net_cls,net_prio

into

JoinControllers=cpu,cpuacct,cpuset,net_cls,net_prio,memory

but I assume systemd has to be reloaded after that?
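
As far as I know, JoinControllers= is only honored when systemd mounts the cgroup controllers during early boot, so a plain daemon-reload will not pick it up; a reboot is probably needed. After that, a quick way to check which hierarchies are actually mounted (a sketch):

# List the mounted cgroup v1 hierarchies; the controllers named in
# JoinControllers should show up mounted together.
findmnt -t cgroup
cat /proc/cgroups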

Just a heads-up for people considering the above: after adding the cgroup changes to my manifest and restarting my masters, a lot of nodes went NotReady until their kubelet was restarted, I guess due to a mismatch between where the masters expect docker and the kubelet to be running and where the nodes actually run them.
All in all, I would not recommend rolling this out on a running cluster where uptime is important...
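
If you do roll this out anyway, restarting the kubelet on each affected node (or replacing the nodes through the usual kops workflow) cleared the NotReady state for me; a minimal sketch:

# On an affected node:
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Or from the workstation, roll the nodes entirely:
kops update cluster --yes
kops rolling-update cluster --yes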

@so0k As part of trying to solve my "SystemOOM" issues, I tried something similar to your config above and still got "SystemOOM" on my nodes.

I was wondering whether the block below is also required?

# /etc/systemd/system/docker.service.d/set-slice.conf
[Service]
Slice=my_slice.slice

# /etc/systemd/system/kubelet.service.d/set-slice.conf
[Service]
Slice=my_slice.slice
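
One way to tell whether the drop-ins took effect (and what is actually running inside the slice) is to ask systemd directly; a hedged sketch:

# Which slice is each service assigned to?
systemctl show -p Slice docker.service kubelet.service

# Walk the cgroup tree and look for docker.service / kubelet.service
# under the slice you configured:
systemd-cgls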

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

/remove-lifecycle rotten

Still relevant I believe?

Do we have any suggestions on how to fix the system.slice "cgroup does not exist" error using the kops config?

Should be fixed in the AMI - https://github.com/kubernetes/kube-deploy/issues/479 (not saying it is already fixed, just that the AMI needs to be updated)

Gentle ping

Any recommended values for kubeReserved cpu and memory? Or would this be included in the AMIs?

Depends on your instance type...

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
