When running it locally on my machine, the cluster seems much more unstable than on our CI. The cluster is now created inside a privileged container, but then I am getting strange errors:
$ kubectl cluster-info
Unable to connect to the server: unexpected EOF
$ kubectl cluster-info
Error from server (InternalError): an error on the server ("") has prevented the request from succeeding (get services)
$ kubectl cluster-info
error: the server doesn't have a resource type "services"
Can you provide more details about how you're running this?
Creating the cluster locally should not require anything more than:
kind create

I've tested regularly both on Docker for Mac and on docker-ce on Linux. Currently it doesn't depend on any recent or advanced functionality. Sticking it in another layer of Docker will likely make things less debuggable.
I am running it inside a Docker container to simulate how my CI (GitLab) works. It seems it just dies after some time. I am running it more or less like this.
I am not sure what to inspect when these errors start happening. Can I exec into kind-1-control-plane?
right now docker exec kind-1-control-plane journalctl > logs.txt works, but I'm working on a nice kind command for this soon... stuck on go1.11 upgrade issues for kubernetes/kubernetes right now, xref https://github.com/kubernetes/test-infra/pull/9695
Those errors look like the API server is not actually running, or the networking is not pointing at it correctly. It's not clear from that output what else is going on yet.
docker exec (not the prettiest since the container names are technically not really exposed yet) can get you onto the "node" after which normal debian debugging tools should generally work (ps, journalctl, etc.)
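For example, a rough sketch of that kind of poking around (assuming the control-plane container name kind-1-control-plane from the earlier commands):

# get a shell on the "node" container
docker exec -it kind-1-control-plane bash
# then inspect from inside with the usual tools, e.g.
ps aux | grep kube                                 # which Kubernetes components are running
journalctl -u kubelet --no-pager | tail -n 100     # recent kubelet logs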
OK, it is failing even if I run it directly on my laptop/host.
See log: log.txt
Can you provide more details about your laptop/host? Docker version? Any special network settings?
I think we'll need the API server logs as well. Those will be in a location like /var/log/containers/kube-apiserver-kind-1-control-plane.*.log. Again, I'll be adding a tool to collect these shortly... :grimacing:
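In the meantime, something like this should pull them out by hand (a sketch, using the same container name as above):

docker exec kind-1-control-plane sh -c 'cat /var/log/containers/kube-apiserver-*.log' > apiserver.log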
Docker version 18.06.1-ce, build e68fc7a. Host is Ubuntu 18.04 personal laptop. No fancy configuration or network settings.
Hmm.. I have a very similar setup running kind at home, both on the host docker and in DinD from replicating the gitlab setup myself; I'm not sure what this would be...
Do you mind if we dig into this more once we have something to scoop up the logs? It's a bit hard to pin down otherwise.
I have started a fresh prototype of collecting up debug info today.
Sure. It seems pretty reproducible on my side, so once you get things going, feel free to ping me and I can retry.
Still getting to this. The volume change may(?) help. I expect to get more of the tooling improvements around debugging in tomorrow hopefully...
I tried with yesterday's version and there was not much difference.
BTW, I am noticing it is using aufs inside, not overlay2. Is there a way to control this? It might be the reason.
Edit: It seems go get sigs.k8s.io/kind does not yet give you the version with the volume change.
BTW, I am noticing it is using aufs inside, not overlay2. Is there a way to control this? It might be the reason.
There's not an option now, but we can and should configure this.
Edit: It seems go get sigs.k8s.io/kind does not yet give you the version with the volume change.
it won't if you already have a copy unless you do go get -u sigs.k8s.io/kind, otherwise it should.
Any progress on logging?
I put this on the backburner to make a few more breaking changes prior to putting up a 0.1.0, but I found something else to debug that was causing instability on a few machines I know about: local disk space!
The kubelet can see how much space is left on the disk that the docker graph volume is on and will start evicting everything if it's low.
I expect to have this in soon though, #77 and #75 were some of the remaining changes I wanted to get in.
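A quick way to check whether that is what's biting you (a sketch; docker info's DockerRootDir field reports where docker's data root lives):

# free space on the host filesystem backing docker's data root
df -h "$(docker info --format '{{.DockerRootDir}}')"
# what docker itself is holding on to (images, containers, volumes)
docker system df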
So after an absurd delay (I'm sorry!) I finally filed a PR with an initial implementation, after rewriting it a few times in between other refactors. See #123.
There've been a lot of other changes to kind though.
Thanks. I am currently busy with some other projects, so maybe after this gets merged in I can try to see if it makes it easier to debug issues with kind on my machine (or maybe the issues are even gone with other updates).
If you do get some time, I'll be happy to poke around the logs and see if I can find anything. The implementation also needs some improvement still but it should provide at least some useful information.
No rush if you don't have time though, apologies again for the large delay in getting this out. Getting things very stable and very debuggable is a major priority.
I'm experiencing the same error:
$ kubectl get pods
Unable to connect to the server: EOF
If you let me know what I should do, I can try to debug it.
@danielepolencic kind export logs (possibly with --name to match whatever you supplied when creating the cluster) can help to debug this. I'd guess that this is the workloads being evicted due to disk pressure / memory pressure, which is a common issue. See: https://github.com/kubernetes-sigs/kind/issues/156
if you're on docker for mac / windows, it's common that the docker disk runs out of space, docker system prune can help
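Concretely, something along these lines (a sketch; the --name value is only needed if the cluster wasn't created with the default name, e.g. kind-1 to match the container names earlier in this thread):

# dump node logs (kubelet, API server containers, etc.) into a local directory
kind export logs ./kind-logs
kind export logs --name kind-1 ./kind-logs
# reclaim docker disk space if disk pressure is the culprit
docker system prune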
Fwiw, I struggled with this problem today. Initially, I tried increasing the amount of memory given to Docker (on MacOS) as well as freeing what I thought ought to be enough disk space.
Then, after finding this issue, I ran docker system prune and I was able to create and use a kind cluster. Thanks for the tip @BenTheElder!
I tried the new updated version today and I still have the same issues as when I opened this issue. Creating a cluster on my laptop is unstable. Sometimes it does not even create it. Sometimes it does, but it is not really working.
Attaching an exported log if it helps for when it did create a cluster.
Thanks, looks like eviction thresholds -> evicting API server. I'm going to go poke a SIG-Node expert about thoughts on us just setting the thresholds to the limits.
I cannot determine why cluster creation itself sometimes fails. If I run kind create --loglevel debug cluster it always succeeds; if I run kind create cluster it fails with:
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.13.2)
 ✓ [control-plane] Creating node container
 ✓ [control-plane] Fixing mounts
 ✓ [control-plane] Starting systemd
 ✓ [control-plane] Waiting for docker to be ready
 ✓ [control-plane] Pre-loading images
 ✓ [control-plane] Creating the kubeadm config file
ERRO[12:26:05] failed to remove master taint: exit status 1
 ✗ [control-plane] Starting Kubernetes (this may take a minute)
ERRO[12:26:05] failed to remove master taint: exit status 1
Error: failed to create cluster: failed to remove master taint: exit status 1
Not sure why it works with debug logging (in the sense that it starts the cluster, which is then unstable, as the logs attached above show).
Could you please provide your complete system spec?
but debug vs non-debug could mean memory corruption or hitting some sort of a resource cap.
but debug vs non-debug could mean memory corruption or hitting some sort of a resource cap.
It is very reproducible (tried multiple times).
Docker otherwise runs perfectly.
I think I have pretty standard specs:
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic
Docker version 18.09.1, build 4c52b90
16 GB memory, 4 core Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz. Enough disk space.
Not sure what other specs should be relevant.
This is due to the api server being evicted by kubelet during bringup, which is visible in the kubelet logs.
We just need to tune the eviction limits, had a chat with a SIG-Node expert just now to confirm that this is sane :^)
But why does debug logging influence whether a cluster gets created or not?
We don't override e.g. pod-eviction-timeout for the controller manager from kubeadm because the default (5 minutes) is sufficient for the regular use case.
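For context, a purely illustrative sketch of what overriding that flag would look like as a kubeadm patch; the kubeadm API version here is an assumption and depends on the node image's Kubernetes release:

apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    pod-eviction-timeout: "5m0s"   # the default; shown only to illustrate where the override would go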
try clearing your images and try again.
related?
https://github.com/kubernetes-sigs/kind/issues/156#issuecomment-445088354
I do not think I have disk space issues.
But why does debug logging influence whether a cluster gets created or not?
It shouldn't, but it's racy. I can't tell from that kubelet log which threshold is being crossed, but one of the eviction thresholds is being hit and the API server is being evicted, which prevents bootup.
also @neolit123 system spec is in the log zip. I would guess it's the memory threshold.
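For anyone hitting this, one way to see which threshold is firing is to grep the kubelet journal on the node (same assumed container name as earlier in the thread):

docker exec kind-1-control-plane journalctl -u kubelet --no-pager | grep -iE 'evict|pressure'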
BTW, any ETA on this? Asking so that I can better plan my work (should I wait for this so that I can develop on my laptop, or should I invest time into deploying kind somewhere else so that I can work there).
I just filed the PR. It needs more testing but it should also be possible to do this with a patch targeting the KubeletConfiguration on a recent cluster with:
evictionHard:
  memory.available: "1"
  nodefs.available: "0%"
  nodefs.inodesFree: "0%"
  imagefs.available: "0%"
I can test this out if you help me a bit with how to do so. Do I just do go get -u <path to git branch somehow>?
it should be possible to install with this:
cd "$(go env GOPATH)/src/sigs.k8s.io/kind"
git fetch origin pull/293/head:pr293
git checkout pr293
go install .
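(After go install ., the freshly built kind binary ends up in $(go env GOPATH)/bin, so that directory needs to be on your PATH for the new build to be the one you run.)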
I can confirm that it works for me. Thanks so much!
Things do work, but I am seeing some strange events in my case:
[watch event] namespace=default, reason=Starting, message=Starting kubelet., for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=NodeHasSufficientMemory, message=Node kind-1-control-plane status is now: NodeHasSufficientMemory, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=NodeHasNoDiskPressure, message=Node kind-1-control-plane status is now: NodeHasNoDiskPressure, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=NodeHasSufficientPID, message=Node kind-1-control-plane status is now: NodeHasSufficientPID, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=NodeAllocatableEnforced, message=Updated Node Allocatable limit across pods, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=RegisteredNode, message=Node kind-1-control-plane event: Registered Node kind-1-control-plane in Controller, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=Starting, message=Starting kube-proxy., for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=RegisteredNode, message=Node kind-1-control-plane event: Registered Node kind-1-control-plane in Controller, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=FreeDiskSpaceFailed, message=failed to garbage collect required amount of images. Wanted to free 306389538406 bytes, but freed 0 bytes, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=ImageGCFailed, message=failed to garbage collect required amount of images. Wanted to free 306395731558 bytes, but freed 0 bytes, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=ImageGCFailed, message=failed to garbage collect required amount of images. Wanted to free 306400130662 bytes, but freed 0 bytes, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=RegisteredNode, message=Node kind-1-control-plane event: Registered Node kind-1-control-plane in Controller, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=RegisteredNode, message=Node kind-1-control-plane event: Registered Node kind-1-control-plane in Controller, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=FreeDiskSpaceFailed, message=failed to garbage collect required amount of images. Wanted to free 306529232486 bytes, but freed 0 bytes, for={kind=Node, name=kind-1-control-plane}
Not sure why it wants to free 300 GB of images.
I do have 197.6GB of images and 44.61GB in local volumes on my laptop. Maybe it is trying to empty that? But that is on my host, not inside Docker inside kind.
But that is on my host, not inside Docker inside kind.
Ah, so the kubelet inside docker can see which disk the storage is on and find the resource usage; it can trace that back through the mounts etc.
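You can see the same numbers the kubelet sees by checking disk usage from inside the node container (a sketch; it assumes the node's docker storage lives under /var/lib/docker, which is where the host-backed volume is mounted):

docker exec kind-1-control-plane df -h /var/lib/docker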
Can we disable those attempts at garbage collection?
Yes, we can tune that threshold too.
imageGCHighThresholdPercent: 100 in addition to https://github.com/kubernetes-sigs/kind/pull/293#issuecomment-462619454
# config.yaml
kind: Config
apiVersion: kind.sigs.k8s.io/v1alpha2
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    imageGCHighThresholdPercent: 100
    evictionHard:
      memory.available: "100Mi"
      nodefs.available: "10%"
      nodefs.inodesFree: "5%"
      imagefs.available: "0%"
correct yaml is updated, that role is a single node in a list of nodes, the nodes key was missing ๐
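For reference, applying it is just (assuming the file is saved as config.yaml per the header comment):

kind create cluster --config config.yaml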
This didn't work. But I also tried the following, and it works:
# config.yaml
kind: Config
apiVersion: kind.sigs.k8s.io/v1alpha2
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    imageGCHighThresholdPercent: 100
    evictionHard:
      memory.available: "100Mi"
      nodefs.available: "0%"
      nodefs.inodesFree: "0%"
      imagefs.available: "0%"
Not sure why we would not just leave those values at 0?
yep, those make sense. disk management is not going to work well in kind right now. thanks for testing!
Please re-open or file a new bug if this continues. Apologies for the long time frame on these early issues, we're getting things better spun up now.
No worries. It was a perfect timing for when I needed this working.