Kind: More unstable cluster

Created on 22 Feb 2019 · 13 Comments · Source: kubernetes-sigs/kind

So now I am running kind on our CI (GitLab) and I am noticing more instability than I would hope for. What we do is create a kind cluster, then create a namespace, start a few pods in the namespace, run some jobs inside it (one job generally takes around an hour or so), clean up the namespace, and repeat with another namespace. We do this a few times. I use the Python Kubernetes client.
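Roughly, the loop looks like this (a minimal sketch using the official `kubernetes` Python client; the `run_jobs_in` helper is just a placeholder for what our tests actually do, not real code from our CI):

```python
from kubernetes import client, config

config.load_kube_config()  # kind writes a kubeconfig for the cluster
core = client.CoreV1Api()

def run_suite(namespaces):
    for name in namespaces:
        # Create a fresh namespace for this batch of tests.
        core.create_namespace(
            client.V1Namespace(metadata=client.V1ObjectMeta(name=name)))
        try:
            run_jobs_in(name)  # placeholder: deploy pods, run jobs (~1 hour each)
        finally:
            # Clean up and move on to the next namespace.
            core.delete_namespace(name)
```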

What happens sometimes is that at some point commands do not seem to get through. A typical example is that after creating a namespace, commands start failing with a "default service account does not exist" error message. I added a check to wait for it to be created, but it seems it never is. And this happens only occasionally.
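The check looks roughly like this (a minimal sketch, assuming the official `kubernetes` Python client; the helper name and the 60 second timeout are illustrative, and `core` is a `CoreV1Api` instance):

```python
import time
from kubernetes.client.rest import ApiException

def wait_for_default_service_account(core, namespace, timeout=60):
    """Poll until the 'default' service account exists in the namespace."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            core.read_namespaced_service_account('default', namespace)
            return
        except ApiException as e:
            if e.status != 404:
                raise
        time.sleep(1)
    raise TimeoutError(
        f"default service account did not appear in {namespace} within {timeout}s")
```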

Another example is that I have a watch observing and waiting for some condition (like all pods being ready), and it just dies on me and the connection gets closed.
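The watch in question looks roughly like this (a simplified sketch; the helper name and the 300 second timeout are illustrative, and `core` is a `CoreV1Api` instance):

```python
from kubernetes import watch

def is_ready(pod):
    for cond in (pod.status.conditions or []):
        if cond.type == 'Ready' and cond.status == 'True':
            return True
    return False

def wait_all_pods_ready(core, namespace, label_selector, timeout=300):
    w = watch.Watch()
    for _ in w.stream(core.list_namespaced_pod, namespace,
                      label_selector=label_selector,
                      timeout_seconds=timeout):
        # On every pod event, re-list and check whether everything is Ready.
        pods = core.list_namespaced_pod(namespace,
                                        label_selector=label_selector).items
        if pods and all(is_ready(p) for p in pods):
            w.stop()
            return
    # If the server closes the connection (the failure mode described above),
    # the stream simply ends and we fall through to here.
    raise RuntimeError("watch ended before all pods became ready")
```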

I am attaching kind logs for one such failed CI session.

results-43316.zip

I see errors like:

Error syncing pod b734fcc86501dde5579ce80285c0bf0c ("kube-scheduler-kind-control-plane_kube-system(b734fcc86501dde5579ce80285c0bf0c)"), skipping: failed to "StartContainer" for "kube-scheduler" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=kube-scheduler pod=kube-scheduler-kind-control-plane_kube-system(b734fcc86501dde5579ce80285c0bf0c)"
Labels: kind/bug, kind/support, lifecycle/stale

All 13 comments

hmm while looking into causes I see your comment here :^) https://github.com/kubernetes/kubernetes/issues/66689#issuecomment-463097073

Yes. :-) I have such waiting in place, but it does not really resolve it. I thought it was just a race condition, but in fact it seems the namespace just does not get created properly. I am guessing some core service dies or something.

So I watch all events and print them out. During a CI run I am noticing events like this, unrelated to what I am doing in our tests:

namespace=default, reason=RegisteredNode, message=Node kind-control-plane event: Registered Node kind-control-plane in Controller, for={kind=Node, name=kind-control-plane}

Not sure why a new node would be registered in the middle of a CI run? Maybe because it died before and was now recreated?

Also, I have an example situation which showcases this issue, and I would just like to make sure it is not something I am doing wrong. This is example log output from the CI script I wrote:

[2019-02-22 06:02:34,187] [cmu/simple-ta3] Running tests.
[2019-02-22 06:02:36,425] podpreset.settings.k8s.io/tests-configuration created
[2019-02-22 06:02:36,717] job.batch/simple-ta3-tests created
[2019-02-22 06:02:36,727] [cmu/simple-ta3] Waiting for all pods matching a selector 'controller-uid in (73ff307f-3667-11e9-9ef3-024280e8c710)' to be ready.
[2019-02-22 06:03:36,987] >>> ERROR [cmu/simple-ta3] Exception raised: Waiting timeout: No pods appeared in 60 seconds.

So after the systems are up in their pods, I start tests against them. This is done by creating a job. After I create the job, I use list_namespaced_job to obtain the job description, from which I store (in Python) job.spec.selector.match_labels['controller-uid']. Then I wait for pods with the selector controller-uid in ({job_selectors}), where job_selectors is what I stored above. That should match any pods created to satisfy that job, no? So the issue is that sometimes no such pod appears within 60 seconds after the job was created, and this is why my CI script then complains. I would assume pods should appear within 60 seconds, of course not yet in a ready state, but at least visible by watching list_namespaced_pod with that selector.
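For reference, here is a minimal sketch of that flow (the `wait_for_job_pods` name is just illustrative, not my actual code; `core` and `batch` are `CoreV1Api` and `BatchV1Api` instances, and creation of the job itself is omitted):

```python
from kubernetes import watch

def wait_for_job_pods(core, batch, namespace, job_name, timeout=60):
    # Read the job back to get its controller-uid selector.
    job = batch.read_namespaced_job(job_name, namespace)
    uid = job.spec.selector.match_labels['controller-uid']
    selector = f"controller-uid in ({uid})"

    # Watch for any pod matching that selector to appear.
    w = watch.Watch()
    for event in w.stream(core.list_namespaced_pod, namespace,
                          label_selector=selector,
                          timeout_seconds=timeout):
        if event['type'] == 'ADDED':
            # At least one pod has been created for the job
            # (not necessarily Ready yet).
            w.stop()
            return event['object']
    raise TimeoutError(
        f"No pods appeared for job {job_name} in {timeout} seconds.")
```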

I am assuming some core service has issues running on the kind cluster and this is why no pod appears and why my CI script complains. The question is, which service has issues and why.

@mitar are you reusing the same cluster for all tests, or does each test get its own cluster? If things start hanging, then the first thing to look at is the API server. You should be able to jump into the control plane container and start looking around at the different containers. It would also be useful to track your machine's resources if you deploy kind for a long period of time.

So there is a series of tests inside one CI run, and the cluster is made only once per CI run.

I cannot really jump into the CI run because it is running on GitLab workers. So maybe there is some resource starvation or something, but it would be useful if that were somehow reported, or if core systems were never removed. I would understand, and could more easily debug, if my test pods got killed with "out of resources" by the cluster. But not core pods.

Definitely agree on the core systems never being removed; unfortunately there are some Kubernetes limitations there where core workloads can be rescheduled. Kubernetes is designed to have reserved overhead, but we can't fully do that in kind right now.

I spent some time looking at another user's logs with some similar issues but we haven't pinned it down yet.

We need to debug and solve this as much as we can though, and the tooling for that needs improvement.

there are some Kubernetes limitations there where core workloads can be rescheduled

Rescheduled to where? If this is one node cluster, where would it go? :-)

Rescheduled to where? If this is one node cluster, where would it go? :-)

re-created on the same node.

some of these (like daemonsets) are expected to change I think...

RE: https://github.com/kubernetes-sigs/kind/issues/303, I think these mounts may fix some of the issues with repeated nesting. The other user seemed to have tracked it down to the host systemd killing things in their environment. I can't replicate this yet (not enough details, doesn't happen in my environments so far...).

Hm, would there be anything in the logs if the host kills a container?

You may see a signal handler log in the pod / container logs IIRC; I haven't seen this occur first-hand yet. It would not be normal on a "real" cluster.

Another user with issues: https://github.com/kubernetes-sigs/kind/issues/136#issuecomment-466898626

I will spend some time later this week looking at how to improve debuggability and bring this up in our subproject meeting tomorrow. If it's related to mounting the host cgroups it may be particularly tricky to identify though...

kind has been remarkably stable in the environments I have access to... which is of course wildly unhelpful for identifying causes of instability.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

I think we can close this for now. It looks relatively stable recently.
