When running it locally on my machine, the cluster seems much more unstable than on our CI. The cluster is now created inside a privileged container, but then I am getting strange errors:
$ kubectl cluster-info
Unable to connect to the server: unexpected EOF
$ kubectl cluster-info
Error from server (InternalError): an error on the server ("") has prevented the request from succeeding (get services)
$ kubectl cluster-info
error: the server doesn't have a resource type "services"
Can you provide more details about how you're running this?
Creating the cluster locally should not require anything more than:
kind create

I've tested regularly both on Docker for Mac and on docker-ce on Linux. Currently it doesn't depend on any recent or advanced functionality. Sticking it in another layer of Docker will likely make things less debuggable.
I am running it inside a Docker container to simulate how my CI (GitLab) works. It seems it just dies after some time. I am running it more or less like this.
I am not sure what to inspect when these errors start happening. Can I exec into kind-1-control-plane?
right now docker exec kind-1-control-plane journalctl > logs.txt works, but I'm working on a nice kind command for this soon... stuck on go1.11 upgrade issues for kubernetes/kubernetes right now, xref https://github.com/kubernetes/test-infra/pull/9695
Those errors look like the API server is not actually running, or the networking is not pointing at it correctly. It's not clear from that output what else is going on yet.
docker exec (not the prettiest since the container names are technically not really exposed yet) can get you onto the "node" after which normal debian debugging tools should generally work (ps, journalctl, etc.)
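For example, a rough sketch of that kind of poking around (assuming the control-plane container name kind-1-control-plane from the earlier commands):

# get a shell on the "node" container
docker exec -it kind-1-control-plane bash
# then inspect from inside with the usual tools, e.g.
ps aux | grep kube                                 # which Kubernetes components are running
journalctl -u kubelet --no-pager | tail -n 100     # recent kubelet logs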
OK, it is failing even if I run it directly on my laptop/host.
See log: log.txt
Can you provide more details about your laptop/host? Docker version? Any special network settings?
I think we'll need the API server logs as well. Those will be in a location like /var/log/containers/kube-apiserver-kind-1-control-plane.*.log. Again, I'll be adding a tool to collect these shortly... :grimacing:
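In the meantime, something like this should pull them out by hand (a sketch, using the same container name as above):

docker exec kind-1-control-plane sh -c 'cat /var/log/containers/kube-apiserver-*.log' > apiserver.log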
Docker version 18.06.1-ce, build e68fc7a. Host is Ubuntu 18.04 personal laptop. No fancy configuration or network settings.
Hmm.. I have a very similar setup running kind at home, both on the host docker and in DinD from replicating the gitlab setup myself; I'm not sure what this would be...
Do you mind if we dig into this more once we have something to scoop up the logs? It's a bit hard to pin down otherwise.
I have started a fresh prototype of collecting up debug info today.
Sure. It seems pretty reproducible on my side, so once you get things going, feel free to ping me and I can retry.
Still getting to this. The volume change may(?) help. I expect to get more of the tooling improvements around debugging in tomorrow hopefully...
I tried with yesterday's version and there was not much difference.
BTW, I am noticing it is using aufs inside, not overlay2. Is there a way to control this? It might be the reason.
Edit: It seems go get sigs.k8s.io/kind does not yet give you the version with the volume change.
BTW, I am noticing it is using aufs inside, not overlay2. Is there a way to control this? It might be the reason.
There's not an option now, but we can and should configure this.
Edit: It seems go get sigs.k8s.io/kind does not yet give you the version with the volume change.
it won't if you already have a copy unless you do go get -u sigs.k8s.io/kind, otherwise it should.
Any progress on logging?
I put this on the backburner to make a few more breaking changes prior to putting up a 0.1.0, but I found something else to debug that was causing instability on a few machines I know about: local disk space!
The kubelet can see how much space is left on the disk that the docker graph volume is on and will start evicting everything if it's low.
I expect to have this in soon though, #77 and #75 were some of the remaining changes I wanted to get in.
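A quick way to check whether that is what's biting you (a sketch; docker info's DockerRootDir field reports where docker's data root lives):

# free space on the host filesystem backing docker's data root
df -h "$(docker info --format '{{.DockerRootDir}}')"
# what docker itself is holding on to (images, containers, volumes)
docker system df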
So after an absurd delay (I'm sorry!) I finally filed a PR with an initial implementation, after rewriting it a few times in between other refactors. See #123.
There've been a lot of other changes to kind though.
Thanks. I am currently busy with some other projects, so maybe after this gets merged in I can try to see if it makes it easier to debug issues with kind on my machine (or maybe the issues are even gone with other updates).
If you do get some time, I'll be happy to poke around the logs and see if I can find anything. The implementation also needs some improvement still but it should provide at least some useful information.
No rush if you don't have time though, apologies again for the large delay in getting this out. Getting things very stable and very debuggable is a major priority.
I'm experiencing the same error:
$ kubectl get pods
Unable to connect to the server: EOF
If you let me know what I should do, I can try to debug it.
@danielepolencic kind export logs (possibly with --name to match whatever you supplied when creating the cluster) can help to debug this. I'd guess that this is the workloads being evicted due to disk pressure / memory pressure, which is a common issue. See: https://github.com/kubernetes-sigs/kind/issues/156
if you're on docker for mac / windows, it's common that the docker disk runs out of space, docker system prune can help
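Concretely, something along these lines (a sketch; the --name value is only needed if the cluster wasn't created with the default name, e.g. kind-1 to match the container names earlier in this thread):

# dump node logs (kubelet, API server containers, etc.) into a local directory
kind export logs ./kind-logs
kind export logs --name kind-1 ./kind-logs
# reclaim docker disk space if disk pressure is the culprit
docker system prune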
Fwiw, I struggled with this problem today. Initially, I tried increasing the amount of memory given to Docker (on MacOS) as well as freeing what I thought ought to be enough disk space.
Then, after finding this issue, I ran docker system prune and I was able to create and use a kind cluster. Thanks for the tip @BenTheElder!
I tried the new updated version today and I still have the same issues as when I opened this issue. Creating a cluster on my laptop is unstable. Sometimes it does not even create it. Sometimes it does, but it is not really working.
Attaching an exported log if it helps for when it did create a cluster.
Thanks, looks like eviction thresholds -> evicting API server. I'm going to go poke a SIG-Node expert about thoughts on us just setting the thresholds to the limits.
I cannot determine why cluster creation itself sometimes fails. If I run kind create --loglevel debug cluster it always succeeds; if I run kind create cluster it fails with:
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.13.2)
 ✓ [control-plane] Creating node container
 ✓ [control-plane] Fixing mounts
 ✓ [control-plane] Starting systemd
 ✓ [control-plane] Waiting for docker to be ready
 ✓ [control-plane] Pre-loading images
 ✓ [control-plane] Creating the kubeadm config file
ERRO[12:26:05] failed to remove master taint: exit status 1
 ✗ [control-plane] Starting Kubernetes (this may take a minute)
ERRO[12:26:05] failed to remove master taint: exit status 1
Error: failed to create cluster: failed to remove master taint: exit status 1
Not sure why it works with debug logging (in the sense that it starts the cluster, which is then unstable, as the logs attached above show).
Could you please provide your complete system spec?
but debug vs non-debug could mean memory corruption or hitting some sort of a resource cap.
but debug vs non-debug could mean memory corruption or hitting some sort of a resource cap.
It is very reproducible (tried multiple times).
Docker otherwise runs perfectly.
I think I have pretty standard specs:
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic
Docker version 18.09.1, build 4c52b90
16 GB memory, 4 core Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz. Enough disk space.
Not sure what other specs should be relevant.
This is due to the api server being evicted by kubelet during bringup, which is visible in the kubelet logs.
We just need to tune the eviction limits, had a chat with a SIG-Node expert just now to confirm that this is sane :^)
But why does debug logging influence whether a cluster gets created or not?
We don't override e.g. pod-eviction-timeout for the controller manager from kubeadm because the default (5 minutes) is sufficient for the regular use case.
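For context, a purely illustrative sketch of what overriding that flag would look like as a kubeadm patch; the kubeadm API version here is an assumption and depends on the node image's Kubernetes release:

apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    pod-eviction-timeout: "5m0s"   # the default; shown only to illustrate where the override would go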
try clearing your images and try again.
related?
https://github.com/kubernetes-sigs/kind/issues/156#issuecomment-445088354
I do not think I have disk space issues.
But why does debug logging influence whether a cluster gets created or not?
It shouldn't, but it's racy. I can't tell from that kubelet log which threshold is being crossed, but one of the eviction thresholds is being hit and the API server is being evicted, which prevents bootup.
also @neolit123 system spec is in the log zip. I would guess it's the memory threshold.
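For anyone hitting this, one way to see which threshold is firing is to grep the kubelet journal on the node (same assumed container name as earlier in the thread):

docker exec kind-1-control-plane journalctl -u kubelet --no-pager | grep -iE 'evict|pressure'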
BTW, any ETA on this? Asking so that I can better plan my work (should I wait for this so that I can develop on my laptop, or should I invest time into deploying kind somewhere else so that I can work there).
I just filed the PR. It needs more testing but it should also be possible to do this with a patch targeting the KubeletConfiguration on a recent cluster with:
evictionHard:
  memory.available: "1"
  nodefs.available: "0%"
  nodefs.inodesFree: "0%"
  imagefs.available: "0%"
I can test this out if you help me a bit with how to do so. Do I just do go get -u <path to git branch somehow>?
it should be possible to install with this:
cd "$(go env GOPATH)/src/sigs.k8s.io/kind"
git fetch origin pull/293/head:pr293
git checkout pr293
go install .
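(After go install ., the freshly built kind binary ends up in $(go env GOPATH)/bin, so that directory needs to be on your PATH for the new build to be the one you run.)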
I can confirm that it works for me. Thanks so much!
Things do work, but I am seeing some strange events in my case:
[watch event] namespace=default, reason=Starting, message=Starting kubelet., for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=NodeHasSufficientMemory, message=Node kind-1-control-plane status is now: NodeHasSufficientMemory, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=NodeHasNoDiskPressure, message=Node kind-1-control-plane status is now: NodeHasNoDiskPressure, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=NodeHasSufficientPID, message=Node kind-1-control-plane status is now: NodeHasSufficientPID, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=NodeAllocatableEnforced, message=Updated Node Allocatable limit across pods, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=RegisteredNode, message=Node kind-1-control-plane event: Registered Node kind-1-control-plane in Controller, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=Starting, message=Starting kube-proxy., for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=RegisteredNode, message=Node kind-1-control-plane event: Registered Node kind-1-control-plane in Controller, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=FreeDiskSpaceFailed, message=failed to garbage collect required amount of images. Wanted to free 306389538406 bytes, but freed 0 bytes, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=ImageGCFailed, message=failed to garbage collect required amount of images. Wanted to free 306395731558 bytes, but freed 0 bytes, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=ImageGCFailed, message=failed to garbage collect required amount of images. Wanted to free 306400130662 bytes, but freed 0 bytes, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=RegisteredNode, message=Node kind-1-control-plane event: Registered Node kind-1-control-plane in Controller, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=RegisteredNode, message=Node kind-1-control-plane event: Registered Node kind-1-control-plane in Controller, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=FreeDiskSpaceFailed, message=failed to garbage collect required amount of images. Wanted to free 306529232486 bytes, but freed 0 bytes, for={kind=Node, name=kind-1-control-plane}
Not sure why it wants to free 300 GB of images.
I do have 197.6GB of images and 44.61GB in local volumes on my laptop. Maybe it is trying to empty that? But that is on my host, not inside Docker inside kind.
But that is on my host, not inside Docker inside kind.
Ah, so the kubelet inside docker can see which disk the storage is on and find the resource usage; it can trace that back through the mounts etc.
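You can see the same numbers the kubelet sees by checking disk usage from inside the node container (a sketch; it assumes the node's docker storage lives under /var/lib/docker, which is where the host-backed volume is mounted):

docker exec kind-1-control-plane df -h /var/lib/docker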
Can we disable those attempts at garbage collection?
Yes, we can tune that threshold too.
imageGCHighThresholdPercent: 100 in addition to https://github.com/kubernetes-sigs/kind/pull/293#issuecomment-462619454
# config.yaml
kind: Config
apiVersion: kind.sigs.k8s.io/v1alpha2
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    imageGCHighThresholdPercent: 100
    evictionHard:
      memory.available: "100Mi"
      nodefs.available: "10%"
      nodefs.inodesFree: "5%"
      imagefs.available: "0%"
correct yaml is updated, that role is a single node in a list of nodes, the nodes key was missing ๐
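For reference, applying it is just (assuming the file is saved as config.yaml per the header comment):

kind create cluster --config config.yaml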
This didn't work. But I also tried the following, and it works:
# config.yaml
kind: Config
apiVersion: kind.sigs.k8s.io/v1alpha2
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    imageGCHighThresholdPercent: 100
    evictionHard:
      memory.available: "100Mi"
      nodefs.available: "0%"
      nodefs.inodesFree: "0%"
      imagefs.available: "0%"
Not sure why we would not just leave those values at 0?
yep, those make sense. disk management is not going to work well in kind right now. thanks for testing!
Please re-open or file a new bug if this continues. Apologies for the long time frame on these early issues, we're getting things better spun up now.
No worries. It was a perfect timing for when I needed this working.