kubeadm 🚀 - Configurable sandbox (pause) image

According to the kubeadm charter, we expect that users have configured their container runtimes properly to be able to be used in a Kubernetes cluster. As such, kubeadm refrains from having to start, stop, configure or in any way try to change the configuration and state of the container runtime itself.
However, there is one notable exception to this and that is Docker itself. That was caused by its sheer popularity and the fact that SIG Node is maintaining the large "docker shim" component as part the kubelet.
At this point, kubeadm supplies the kubelet with the --pod-infra-container-image command line option to override the pause image. This makes sure the pause image can be pre-pulled before use and air-gaped environments don't try to access the Internet. However, this works only for Docker, as this option is used solely by the docker shim.
Other container runtimes, like containerd and CRI-O, have their separate options for sandbox image configuration. They are not propagated by the kubelet and that's probably because the kubelet cannot have code to modify all the container runtimes out there. Neither does kubeadm at that matter. In addition to that, CRI does not have a sandbox image override option.
Hence, I don't think that we should expose a config option in the kubeadm config for pause image override, because that option is going to be used only for image pre-pulling and the docker shim.
The is a lot of uncertainty about the latter and what path it takes on and the former is completely out of scope if kubeadm cannot ensure that the pre-pulled image is going to be used by the container runtime.

rosti on 6 Feb 2020

/cc @neolit123 @fabriziopandini @ereslibre @timothysc

rosti on 6 Feb 2020

I agree that kubeadm should not enter in the business of configuring CRIs, unless this is not done through the kubelet.
So the question here are:

should we ask to sig-node what are the plans for the --pod-infra-container-image wrt to other CRIs
should we, as a kubeadm team, provide support for setting --pod-infra-container-image via config only for docker, that as far as we know, it is the most used CRI

fabriziopandini on 9 Feb 2020

👍1

I also agree that configuring the CRI is out of the scope for kubeadm.

Adding the PauseImage via config when only one runtime will honor this setting does not look to me like a good abstraction for our config. Maybe it's the most used one, but I'd rather prefer to add documentation instead when users are relying on Docker and require this very specific feature.

In general, we can add to the documentation that if you want to configure the pause image, you have several options:

No specific pause image needs
- Docker: we will default it to ImageRepository/pause/3.1, whatever ImageRepository is based on the configuration, and whatever pause image version the current kubeadm version has in its constants.
- Other container runtimes: we do not attempt to configure, their default configuration is what rules -- check what their settings are on your current distribution.
Need to set a specific pause image
- All container runtimes: provide documentation on how to configure the container runtime:
- Docker: how to make sure that kubeadm won't override with the dynamic env file? systemd drop-in?
- Other container runtimes: check their specific documentation on how to set the pause image before calling to kubeadm.

ereslibre on 10 Feb 2020

👍1

should we ask to sig-node what are the plans for the --pod-infra-container-image wrt to other CRIs

This command line flag is tightly coupled to the docker shim. So I suspect that its long term life is going to be connected to that (rather than the kubelet itself).
Is the docker shim moving out? Is kubelet getting in on the business of configuring more than just Docker? I do think that the answers to those questions can only be given by SIG Node.

should we, as a kubeadm team, provide support for setting --pod-infra-container-image via config only for docker, that as far as we know, it is the most used CRI

We already do that through kubelet's docker shim/command line.

Also, I don't see a clear user benefit here (in kubeadm managing the sandbox image). Image pre-pull for air-gaped env? - That can be done by users way before the k8s packages are installed.

rosti on 10 Feb 2020

Is there any way today to tell kubeadm today "hey actually I'm using this pause image"?

While I don't need kubeadm to configure the CRI I'm aware of what image the CRI implementation is using and I'd like kubelet flag and the image pull to match (or just no image pulling, which seems to be possible in 1.13+ with --skip-phases=preflight).

Not configuring this properly in kubeadm indeed breaks airgapping.

Also worth noting: IIRC kubelet has an image GC blacklist internally of the pod sandbox image specifically, so you probably want kubelet to know about this until we get a more flexible whitelis t.. (at some point I want to revive an old KEP for a configurable whitelist)

Other than the image pulling (which is never desirable in our case..) and the image GC thing, in practice the CRI implementation implements the pod sandbox and it should be an implementation detail what this image is without any kubeadm involvement, unless kubeadm wants to start trying to manage CRI implementations.

BenTheElder on 22 Apr 2020

Is the docker shim moving out? Is kubelet getting in on the business of configuring more than just Docker? I do think that the answers to those questions can only be given by SIG Node.

TBD, though there is provisional support now for building kubelet w/o dockershim, I think eventually it will move out.

Also, I don't see a clear user benefit here (in kubeadm managing the sandbox image). Image pre-pull for air-gaped env? - That can be done by users way before the k8s packages are installed.

IMHO it's mildly useful to make sure the flag aligns, but it would be just as good for kind at least to just not have kubeadm manage the pause image at all and leave us to supply this flag.

Probably the main reason this is a thing is for people configuring ImageRepository to use a mirror consistently..?

IMHO this is a somewhat questionable feature to begin with though, users could instead configure docker / containerd / cri-o to use a mirror for k8s.gcr.io at that layer.

BenTheElder on 22 Apr 2020

The other problem with NOT having a structured field for this in kubeadm config is that kubeadm also doesn't have a structured way to discover what will be used, so you can't automate aligning these cleanly when using CRI instead of dockershim.

At best you can do kubeadm config images list | grep pause and then insert that into the relevant implementation config which is pretty gross and assumes that we'll always have the substring pause in just that one image.

It's also entirely possible to implement the CRI spec without the pause image or a totally custom pause image for that matter.

BenTheElder on 22 Apr 2020

Probably the main reason this is a thing is for people configuring ImageRepository to use a mirror consistently..?

users (especially in China) have air-gap local registries where they rely on kubeadm config images to tell them what they need to prepull for a working cluster. by doing that they are also prepulling the sandbox container.

from there CR docs need to tell them where to apply this image in the CR config. from my outdated investigation this is not exactly well documented by the CRI providers.

70% of the kubeadm users use dockershim and "pause" is managed for them (minus upgrades)
for the rest (30%) CRI mostly works, except that we are still getting the occasional k/kubeadm k/k issues about the sandbox not being configured properly.

At best you can do kubeadm config images list | grep pause and then insert that into the relevant implementation config which is pretty gross and assumes that we'll always have the substring pause in just that one image.

1.18 kubeadm has structured output for kubeadm config images list, IIRC.
but i agree that using kubeadm to tell CRI config what is the sandbox feels backwards.

While I don't need kubeadm to configure the CRI I'm aware of what image the CRI implementation is using and I'd like kubelet flag and the image pull to match (or just no image pulling, which seems to be possible in 1.13+ with --skip-phases=preflight).

--skip-phases=preflight is viable for skipping prepull.
one potential improvement is adding ImagePullPolicy to kubeadm's config.

i think the dockershim changes will tell us what is going to happen in terms of the sandbox image. the two options are:

like @rosti is saying above - stop managing pause and document it
- easier for us
- more churn for new users
- users manage pause upgrades too
continue managing pause in kubeadm and pass the relevant image as a dockershim flag:
- complicated for us (it already is + dockershim will be a new service)
- dockershim's future state of flags/componentconfig is unknown
- upgrading pause needs at least docs updates

neolit123 on 22 Apr 2020

👍1

users (especially in China) have air-gap local registries where they rely on kubeadm config images to tell them what they need to prepull for a working cluster. by doing that they are also prepulling the sandbox container.

... precisely? but only if kubeadm is managing this image / flag. Do you have any additional reasons for managing this image in kubeadm?

from there CR docs need to tell them where to apply this image in the CR config. from my outdated investigation this is not exactly well documented by the CRI providers.

It is abnormal to override this in a container runtime implementing pods because the sandbox image is part of the pods. However it's not difficult to find the pod sandbox image option in containerd CRI plugin config, it's no less well documented than anything else.

https://github.com/containerd/cri/blob/master/docs/config.md#cri-plugin-config-guide

sandbox_image

70% of the kubeadm users use dockershim and "pause" is managed for them (minus upgrades)
for the rest (30%) CRI mostly works, except that we are still getting the occasional k/kubeadm k/k issues about the sandbox not being configured properly.

I don't know what percentage kind users are but the issue here is that there isn't not only does this not work properly it doesn't functionally make sense.

Even for the dockershim users kubeadm only needs to be specifying this if you're overriding the image registry. Otherwise kubelet defaults this just fine, and the image pulling for pause is of questionable value.

1.18 kubeadm has structured output for kubeadm config images list, IIRC.

Unfortunately this output while "structured" contains no more data, it's still just a grab bag list of images.

i think the dockershim changes will tell us what is going to happen in terms of the sandbox image.

I see no related details there -- it's still just in dockershim mode the pod is implemented by us and we specify this and in CRI mode it's only relevant to the kubelet imageGC.

like @rosti is saying above - stop managing pause and document it

As I said if you don't manage this, usage of imageRepository will become problematic, users will wind up reimplementing imageRepository for this image anyhow.

The entire imageRepository concept could be obviated by setting a comparable mirror at the runtime level though.

If users override nothing "manage pause upgrades" is a no-op except when airgapped. The new kubelet / dockershim will have a new default pause image.

External CRI implementations upgrade on their own schedule and aren't tied to kubernetes versions.

BenTheElder on 22 Apr 2020

... precisely? but only if kubeadm is managing this image / flag. Do you have any additional reasons for managing this image in kubeadm?

only prepull / air-gap, i'd say.

Is there any way today to tell kubeadm today "hey actually I'm using this pause image"?

is the request on the kind side related to?
https://github.com/kubernetes-sigs/kind/issues/1471

~isn't --ignore-preflight-errors=all (which kind is doing AFAIK) sufficient to stop kubeadm from pulling any images during init/join?~

actually it has to be --skip-phases=preflight because pre-flight still pulls. so isn't that sufficient of a workaround?

As I said if you don't manage this, usage of imageRepository will become problematic, users will wind up reimplementing imageRepository for this image anyhow.

that is a side effect, but we may have to tolerate this.

neolit123 on 22 Apr 2020

if we end up adding https://github.com/kubernetes/kubeadm/issues/524 with support for a pull-policy of Never this can also skip pre-pull completely, but this will not align it with the k8s pull polices.

alternatively we can break down preflight into sub-phases and this way one can skip ..=preflight/prepull

neolit123 on 22 Apr 2020

the kind issue is just what brought me here, we'll need to employ workarounds for older k8s versions anyhow, but I dont think this is a good state to be in going forward with CRI becoming more common (e.g. CAP*)

actually it has to be --skip-phases=preflight because pre-flight still pulls. so isn't that sufficient of a workaround?

yeah, which sadly 1.11 / 1.12 don't appear to have.

that still leaves us doing substring matches on the images which I guess works but feels wrong.

that is a side effect, but we may have to tolerate this.

yeah, again I might suggest documenting the simpler way: https://docs.docker.com/registry/recipes/mirror/
https://github.com/containerd/cri/blob/master/docs/registry.md
https://github.com/containers/image/blob/master/docs/containers-registries.conf.5.md (cri-o)

BenTheElder on 22 Apr 2020

the kind issue is just what brought me here, we'll need to employ workarounds for older k8s versions anyhow, but I dont think this is a good state to be in going forward with CRI becoming more common (e.g. CAP*)

yeah, which sadly 1.11 / 1.12 don't appear to have.

it's hard to maintain such a big skew. i'd drop support for these older versions at this point.
with the current skew (without LTS KEPs etc) we try to close <1.16 tickets in k/k and k/kubeadm unless the user wants to confirm the bug is present in the skew or if a developer has the time to check.

related issue to the kind issue, that was logged today; the user is asking why they have two pause images:
https://github.com/kubernetes/kubernetes/issues/90326

i see this as a potential miss configuration of containerd. yet this combined with the kind issue maybe signals that it's time kubeadm to stop managing the sandbox image.

neolit123 on 22 Apr 2020

i see this as a potential miss configuration of containerd

er, the pause image is not kubernetes version specific unless we're talking about dockershim because that is itself a partial container runtime. containerd, cri-o etc. should use whatever pod sandbox image meets their needs.

for example, using the latest containerd with any kubernetes version works fine, but will use the 3.2 image, while older kubeadm will assume the 3.1 image.

you can make containerd / the CRI plugin use whatever pause image kubeadm expects, but there's not a good reason to except for the double image pulling.

FTR we are currently rewriting containerd config to match roughly kubeadm config images list | grep pause as a minimum fix for now, but may not do this in the future.

Ideally we should use defer to the CR for podsandbox management, which probably doesn't need to involve kubeadm except that's it's going to be best practice to prevent imageGC of the sandbox image since all pods need it. (currently this is done via the kubelet sandbox image flag even if not using dockershim, imho it should be an image GC denylist in the future)

BenTheElder on 26 Apr 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 24 Nov 2020

/remove-lifecycle stale

fabriziopandini on 25 Nov 2020

bboreham on 3 Feb 2021

minor update here.

with the dockershim extraction supposedly happening in the next few releases, kubeadm no longer has to manage pause.
we are going to stop passing --pod-infra-container-image to the kubelet (dockershim).

but kubeadm continuing to manage the pause image will allow users to see the default pause bundled with a k8s version. this is a good idea in general, related to security fixes in pause, although it doesn't change often.

the API change requested here (which can land in v1beta3) still makes sense for pre-pull - e.g. users might choose to use the GCR pause, but configure an image repository for everything else.

neolit123 on 3 Feb 2021

👍1

we are going to stop passing --pod-infra-container-image to the kubelet (dockershim).

One side effect of this flag is kubelet will never try to GC this image ever (and only this image), which may be desirable.

Kind now avoids the preflight / pull, but I still think it makes sense for other users.

I've also seen chitchat about CRI implemented without the pause image entirely / your CRI implementation will have a tested default independent of the Kubernetes release cycle, so it may make sense to decouple this entirely.

At the very least making it configurable in kubeadm would make it easier to align both on the CRI preferred version.

BenTheElder on 3 Feb 2021

One side effect of this flag is kubelet will never try to GC this image ever (and only this image), which may be desirable.

that's a good point. if the kubelet actually guarantees that via this flag we should be passing it from kubeadm for all CRs.
this reminded me of this bug:
"Kubelet garbage-collects the pause image, which might be unpullable in ContainerD"
https://github.com/kubernetes/kubernetes/issues/81756

neolit123 on 3 Feb 2021

👍1

Yeah that's it exactly, I investigated the sources previously and found the behavior described in that issue: the flag is not gated only to dockershim as documented and imageGC at least will skip this image. It also has a default.

I think the better approach would be a way to tell kubelet "don't ever delete these images" but I don't know if anyone has the bandwidth to get something like that through (technically easy, but features take a lot of effort to get approved). It could also arguably be done in the CRI instead (rejecting deletes?)

BenTheElder on 4 Feb 2021

In https://github.com/kubernetes/kubernetes/issues/97438#issuecomment-749421571, containerd uses its configuration in /etc/containerd/config.toml of sandbox_image. Should we check it during installation(configure it or warning if neeed) or before GC the pause image?

pacoxu on 4 Feb 2021

cc @adisky

dims on 4 Feb 2021

As I can understand there are two things here -- for which same flag pod-infra-container-image is used

Overriding the pause image for container runtime with the kubelet flag
Skipping Eviction of pause image by kubelet GC.

[1] applies only to docker-shim, [2] applies to all container runtimes. (possibly missed/side-effect)

I am not sure but AFAIU we do not want [1] for all container runtimes but we are concerned about [2].
To solve [2], we have following options

Introduce a new configurable option to specify pause image for kubelet. As pod-infra-container-image will go away soon.
Suggested by @BenTheElder above It could also arguably be done in the CRI instead (rejecting deletes?)

adisky on 4 Feb 2021

@pacoxu

In kubernetes/kubernetes#97438 (comment), containerd uses its configuration in /etc/containerd/config.toml of sandbox_image . Should we check it during installation(configure it or warning if neeed) or before GC the pause image?

i don't think the kubelet would like to check the config.toml.

neolit123 on 4 Feb 2021

@adisky @BenTheElder

similar to the cgroup driver problem, instead of passing a flag to the kubelet for feature X, the kubelet should be able to ask about the configuration at the CRI socket.

i can see that crictl info already shows the full config for containerd (inc cgroup driver and sandbox_image), but that doesn't work for cri-o and dockershim.

neolit123 on 4 Feb 2021

@neolit123 Thanks for the pointer, I too tried containerd returns "sandboxImage":"k8s.gcr.io/pause:3.2" in Info of StatusResponse https://github.com/kubernetes/cri-api/blob/master/pkg/apis/runtime/v1/api.pb.go#L6054, which is not a standard. How kubelet will know in what parameter container runtime is passing the pause image info?

adisky on 4 Feb 2021

it has to be standardized eventually, until then the kubelet may have to accept the image to not be GCed.

neolit123 on 4 Feb 2021

any objections to starting to pass --pod-infra-container-image to the kubelet for all container runtimes (not only docker) in 1.21?

as discussed above this flag tells the kubelet image GC to not remove it.

code in kubeadm is here:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/kubelet/flags.go#L82-L83

neolit123 on 25 Feb 2021

I still think there should be a more flexible way to not evict critical images, but short of that any configured "don't evict this pause image" autodetected from CRI SGTM.

The desync between dockershim's / kubeadm's preferred pause image and CRI has caused us to avoid preflight entirely (so as not to break airgapping). GC we have to blanket disable for now anyhow though.

BenTheElder on 25 Feb 2021

👍1

any objections to starting to pass --pod-infra-container-image to the kubelet for all container runtimes (not only docker) in 1.21?

here is the PR:
https://github.com/kubernetes/kubernetes/pull/99476

neolit123 on 26 Feb 2021

Kubeadm: Configurable sandbox (pause) image

All 32 comments

Related issues