Rke: Cannot connect to the Docker daemon when daemon is running

Created on 2 Aug 2018 · 12 Comments · Source: rancher/rke

RKE version:
v0.1.9-rc6

Docker version: (docker version,docker info preferred)
1.12

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Container Linux (CoreOS)

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
Azure

This cluster was provisioned with RKE over 2 months ago. We are attempting to update the nodes and are encountering multiple problems while running RKE.

FATA[0121] Failed to create Certificates deployer container on host [10.18.160.15]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

By this point, RKE had already passed the initial Docker check when dialing the tunnel.

When I log into the machine manually via SSH, I can run any docker command without trouble.
A long-running systemd process is consuming too much CPU, but docker commands still work; they just return slightly slower. RKE should be able to cope with this somehow, but as it stands the failure completely blocks us from updating any of the nodes or skipping past nodes that get stuck.
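
One way to put a number on "slightly slower" is to time a cheap Docker call several times from a shell on the affected host. The sketch below is not part of RKE: `probe_cmd` is a hypothetical helper, and it assumes GNU coreutils `timeout` is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of RKE.
# probe_cmd ATTEMPTS DEADLINE_SECONDS CMD [ARGS...]
# Runs CMD repeatedly, each time under a deadline, and prints how long each
# successful attempt took. Returns non-zero if any attempt failed or was
# cut off by the deadline.
probe_cmd() {
  local attempts=$1 deadline=$2
  local failures=0 i=1 start end
  shift 2
  while [ "$i" -le "$attempts" ]; do
    start=$(date +%s)
    if timeout "$deadline" "$@" >/dev/null 2>&1; then
      end=$(date +%s)
      echo "attempt $i: ok after $((end - start))s"
    else
      echo "attempt $i: failed or exceeded ${deadline}s"
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  [ "$failures" -eq 0 ]
}
```

For example, `probe_cmd 5 10 docker version` makes five attempts with a 10 second deadline each; both numbers are arbitrary.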

kind/bug priority/1 status/more-info


HighwayofLife


Most helpful comment

I am using AWS Linux 2; I wonder if there is a connection here. I will set up a test again tonight and send you the results. I could also provide access to the servers, if that helps. Finally, is there a specific set of logs you would like, and are there any additional commands you would like me to run?

techcto on 6 Sep 2018


All 12 comments

I have this issue also:

FATA[0695] Failed to copy file [/etc/kubernetes/.tmp/kube-ca.pem] from container [cert-fetcher] on host [35.173.177.72]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Thank you @HighwayofLife for opening this up. You just saved me 30 mins. Any advice for a work-around?

techcto on 2 Aug 2018

@HighwayofLife @techcto Can you guys provide the following:

- load average on the nodes
- average number of containers on the node
- do the nodes have multiple mounts as in #650?
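
The three items above can be gathered with a few lines of shell on each node. This is only a sketch: the function names are made up for illustration, `docker ps -q` counts running containers only, and listing mount points that appear more than once in `/proc/mounts` is just one guess at what "multiple mounts" looks like, since the details of #650 are not reproduced here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for collecting node details; names are illustrative.

# Prints the 1, 5 and 15 minute load averages (first three fields of the file).
load_average() {
  local file=${1:-/proc/loadavg}
  awk '{ print "load average (1m 5m 15m): " $1, $2, $3 }' "$file"
}

# Counts non-empty lines on stdin; prints 0 for empty input.
count_lines() {
  grep -c . || true
}

# Prints every mount point that is listed more than once.
duplicate_mounts() {
  local file=${1:-/proc/mounts}
  awk '{ print $2 }' "$file" | sort | uniq -d
}

# Prints all three pieces of information for the current node.
node_report() {
  load_average
  echo "running containers: $(docker ps -q | count_lines)"
  echo "mount points listed more than once:"
  duplicate_mounts
}
```

Running `node_report` on each node and pasting the output into the issue would cover the list; sampling it a few times gives a rough average for the container count.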

moelsayed on 6 Aug 2018

This happens for me after a successful restore. I was going to wait for the next release, as there seem to be a lot of issues around backup / restore. I get slightly different results depending on the version of RKE I use. I will try to download the latest version and test again.

techcto on 7 Aug 2018


I haven't seen this since the fix for the multiple mounts was applied. Our fluentd configuration was generating relatively high load, so we'll have to test this configuration again.

HighwayofLife on 8 Aug 2018

@HighwayofLife @techcto How did it go with 0.1.9 regarding this issue?

moelsayed on 15 Aug 2018

Just wiped servers again:

INFO[0659] Cluster removed successfully
Then ran:
```
# Remove all Docker volumes, then the directories left behind by the old cluster.
docker volume rm $(docker volume ls -q)
cleanupdirs="/var/lib/etcd /etc/kubernetes /etc/cni /opt/cni /var/lib/cni /var/run/calico"
for dir in $cleanupdirs; do echo "Removing $dir"; rm -rf "$dir"; done
```


Then I started fresh with 3 clean m3.medium instances on AWS.

On the first clean install, I got this:

`FATA[0797] Failed to get job complete status: <nil>`

I have been getting this since day one. I thought it went away for a few builds of RKE, but now it seems to be back.

I then ran the install again and got:
```
WARN[0790] Failed to deploy addon execute job [rke-kubedns-addon]: Failed to get job complete status: <nil>
INFO[0790] [addons] Setting up Metrics Server
INFO[0790] [addons] Saving addon ConfigMap to Kubernetes
INFO[0791] [addons] Successfully Saved addon to Kubernetes ConfigMap: rke-metrics-addon
INFO[0791] [addons] Executing deploy job..
WARN[0821] Failed to deploy addon execute job [rke-metrics-addon]: Failed to get job complete status: <nil>
INFO[0821] [ingress] Setting up nginx ingress controller
INFO[0821] [addons] Saving addon ConfigMap to Kubernetes
INFO[0822] [addons] Successfully Saved addon to Kubernetes ConfigMap: rke-ingress-controller
INFO[0822] [addons] Executing deploy job..
WARN[0852] Failed to deploy addon execute job [rke-ingress-controller]: Failed to get job complete status: <nil>
INFO[0852] [addons] Setting up user addons
INFO[0852] [addons] Saving addon ConfigMap to Kubernetes
INFO[0853] [addons] Successfully Saved addon to Kubernetes ConfigMap: rke-user-addon
INFO[0853] [addons] Executing deploy job..
WARN[0883] Failed to deploy addon execute job [rke-user-includes-addons]: Failed to get job complete status: <nil>
INFO[0883] Finished building Kubernetes cluster successfully
```

Even though it deployed successfully and I could log into admin, I still got some odd errors about failing to get the job complete status, which seem to be new.

So now, I am logged into Rancher and all seems good. I clicked on the nodes tab and I see all three nodes in good health.

Now, I want to test the Restore feature. I take a snapshot of one server:

INFO[0000] Starting saving snapshot on etcd hosts
INFO[0000] [dialer] Setup tunnel for host [18.206.174.236]
WARN[0009] Unsupported Docker version found [18.03.1-ce], supported versions are [1.11.x 1.12.x 1.13.x 17.03.x]
INFO[0009] [dialer] Setup tunnel for host [54.175.108.226]
WARN[0013] Unsupported Docker version found [18.03.1-ce], supported versions are [1.11.x 1.12.x 1.13.x 17.03.x]
INFO[0013] [dialer] Setup tunnel for host [35.173.177.72]
WARN[0024] Unsupported Docker version found [18.03.1-ce], supported versions are [1.11.x 1.12.x 1.13.x 17.03.x]
INFO[0024] [etcd] Saving snapshot [etcdsnapshot] on host [18.206.174.236]
INFO[0061] [etcd] Successfully started [etcd-snapshot-once] container on host [18.206.174.236]
INFO[0088] [etcd] Saving snapshot [etcdsnapshot] on host [54.175.108.226]
INFO[0093] [etcd] Successfully started [etcd-snapshot-once] container on host [54.175.108.226]
INFO[0095] [etcd] Saving snapshot [etcdsnapshot] on host [35.173.177.72]
INFO[0139] [etcd] Successfully started [etcd-snapshot-once] container on host [35.173.177.72]
INFO[0182] [certificates] Successfully started [rke-bundle-cert] container on host [54.175.108.226]
INFO[0182] [certificates] successfully saved certificate bundle [/opt/rke/etcd-snapshots/pki.bundle.tar.gz] on host [54.175.108.226]
FATA[0242] Failed to start [rke-bundle-cert] container on host [35.173.177.72]: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Command '['/tmp/bin/rke', 'etcd', 'snapshot-save', '--name', 'etcdsnapshot', '--config', '/tmp/config.yaml']' returned non-zero exit status 1

So now the run died while taking the snapshot, at the certificate bundle step. Is there any other command I can run to clean my servers more thoroughly? I will work on getting a fresh cluster up this week, but if you give me more commands I can try to clean the servers better.

techcto on 22 Aug 2018

Also, it seems like RKE 0.1.9 is running much slower than previous versions. I keep having to adjust the timeout option in the script I am writing. Have there been any timing adjustments?
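
When tuning a timeout like this, it helps to tell "RKE failed" apart from "the wrapper gave up waiting". The following is a minimal sketch, assuming GNU coreutils `timeout`, which reports a run killed at the deadline as exit status 124; `run_with_deadline` is a hypothetical name, not something RKE provides.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, not part of RKE.
# run_with_deadline SECONDS CMD [ARGS...]
# Runs CMD under GNU `timeout` and reports on stderr whether it finished,
# failed on its own, or was killed at the deadline (exit status 124).
run_with_deadline() {
  local seconds=$1 status=0
  shift
  timeout "$seconds" "$@" || status=$?
  case "$status" in
    0)   echo "finished within ${seconds}s" >&2 ;;
    124) echo "timed out after ${seconds}s" >&2 ;;
    *)   echo "failed with exit status $status" >&2 ;;
  esac
  return "$status"
}
```

For example, `run_with_deadline 1200 /tmp/bin/rke etcd snapshot-save --name etcdsnapshot --config /tmp/config.yaml` wraps the command from the log above; the 1200 second deadline is an arbitrary placeholder.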

techcto on 22 Aug 2018

@techcto Can you provide a complete run log with debug enabled, along with load and mount information? I still can't reproduce the slowness issue or the failed connections.

moelsayed on 5 Sep 2018

techcto on 6 Sep 2018


@techcto Thank you.
Just the debug log from the rke run, so we can figure out what takes so long, plus the kubelet logs and the output of mount and df -h. If you have any specific configuration in the cluster.yml file, it would be great to have that as well.
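
A small helper like the following could collect those outputs into one directory for attaching to the issue. It is a sketch, not an official procedure: `capture` and the directory name are hypothetical, and the example commands in the usage note assume that `rke` accepts a global `--debug` flag and that the kubelet runs as a Docker container named `kubelet` on RKE nodes.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of RKE.
# capture OUTDIR NAME CMD [ARGS...]
# Runs CMD, writes its stdout and stderr to OUTDIR/NAME.log, and appends the
# exit status to OUTDIR/summary.txt. Always returns 0 so that one failing
# command does not stop the rest of the collection.
capture() {
  local outdir=$1 name=$2 status=0
  shift 2
  mkdir -p "$outdir"
  # Keep stdout and stderr together so errors appear in context.
  "$@" > "$outdir/$name.log" 2>&1 || status=$?
  echo "$name: exit status $status" >> "$outdir/summary.txt"
  return 0
}
```

For example: `capture rke-report rke-debug rke --debug up --config cluster.yml`, `capture rke-report kubelet docker logs kubelet`, `capture rke-report mount mount` and `capture rke-report df df -h`.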

moelsayed on 6 Sep 2018

@moelsayed @techcto @HighwayofLife I wasn't able to reproduce the issue even with Amazon Linux 2. We already increased the Docker dialer timeout to 50 seconds, so it is probably something on the host that is preventing RKE from connecting to the Docker daemon. Can you please provide the details mentioned in https://github.com/rancher/rke/issues/835#issuecomment-410860427

galal-hussein on 7 Sep 2018

@galal-hussein - I just tested numerous times on a fresh AWS Linux 2 and it did seem to work. I apologize for how long it took me to test. Thank you for testing this!

Long story short, this seems to work. I am not sure whether that is due to updates on the RKE end or just me moving off Lambda to AWS CodeBuild (super cool) to handle the 10+ minutes RKE requires.

Since this seems to work, now would be a good time to share what I have been doing.

I am writing a CloudFormation template to run the HA version of Rancher on AWS with Auto Scaling. It is mainly for hosting my CMS, but I am trying to open source the whole project.

As of now, this works great: https://github.com/techcto/rancher-aws

I invite everyone to try installing Rancher using the above. I also wrote a second project: https://github.com/techcto/rke-runner (I just refactored it, so the docs are out of date), where I took the RKE docs and the steps for healing lost nodes and used Python to script an AWS lifecycle hook that figures out the next step. It gets installed automatically by the previous project.

Finally, in the code submitted above, I have tried to follow RKE instructions and also incorporated the many fixes recommended by SuperSeb to do HA and restore.

If you have thoughts on whether the above is incubator-worthy, feedback on how to improve it, or would like to collaborate, please let me know. Thanks all.

techcto on 12 Sep 2018

