Rancher: Failed to evaluate networking: failed to find plugin \"rancher-bridge\" when metadata container's bridge IP is equal to another hosts docker0 IP (bip)

Created on 1 Apr 2017  ·  102 Comments  ·  Source: rancher/rancher

Rancher Versions:
Server: 1.5.2
healthcheck: 0.2.3
ipsec: rancher/net:holder
network-services: rancher/network-manager:v0.5.3
scheduler: rancher/scheduler:v0.7.5
kubernetes (if applicable):

Docker Version:
1.12.6

OS and where are the hosts located? (cloud, bare metal, etc):
Ubuntu 16.04/Bare Metal/Multiple locations

Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB)
Single node rancher, external DB

Environment Type: (Cattle/Kubernetes/Swarm/Mesos)
Cattle

Steps to Reproduce:
I have about 100 hosts running 4 containers. On 5-6 of them ipsec runs OK; on every other host the ipsec stack fails (and also healthcheck) with "Timeout getting IP". In the network-manager logs I can see:

nsenter: cannot open /proc/26034/ns/ipc: No such file or directory
time="2017-04-01T02:31:01Z" level=error msg="Failed to evaluate network state for 059fae587888cf4a80bc7020e0ab684927783790703d73591edf07fdf0f6e769: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input" 
time="2017-04-01T02:31:01Z" level=error msg="Error processing event &docker.APIEvents{Action:\"start\", Type:\"container\", Actor:docker.APIActor{ID:\"059fae587888cf4a80bc7020e0ab684927783790703d73591edf07fdf0f6e769\", Attributes:map[string]string{\"io.rancher.container.agent_id\":\"58465\", \"io.rancher.project_service.name\":\"healthcheck/healthcheck\", \"io.rancher.service.hash\":\"d0a8fd4061d3b2a8c5782f5563db6df0b25655cb\", \"io.rancher.stack_service.name\":\"healthcheck/healthcheck\", \"io.rancher.cni.network\":\"ipsec\", \"io.rancher.scheduler.global\":\"true\", \"io.rancher.service.requested.host.id\":\"209\", \"io.rancher.stack.name\":\"healthcheck\", \"name\":\"r-healthcheck-healthcheck-28-2576d08e\", \"image\":\"rancher/healthcheck:v0.2.3\", \"io.rancher.container.create_agent\":\"true\", \"io.rancher.container.ip\":\"10.42.112.234/16\", \"io.rancher.container.mac_address\":\"02:4f:71:44:58:1a\", \"io.rancher.service.launch.config\":\"io.rancher.service.primary.launch.config\", \"io.rancher.cni.wait\":\"true\", \"io.rancher.container.system\":\"true\", \"io.rancher.container.uuid\":\"2576d08e-3100-4935-aa34-3ecb73da302a\", \"io.rancher.project.name\":\"healthcheck\", \"io.rancher.service.deployment.unit\":\"8a360a36-5ff6-497f-a7db-92193ecf2415\", \"io.rancher.container.name\":\"healthcheck-healthcheck-28\"}}, Status:\"start\", ID:\"059fae587888cf4a80bc7020e0ab684927783790703d73591edf07fdf0f6e769\", From:\"rancher/healthcheck:v0.2.3\", Time:1491013861, TimeNano:1491013861288487613}. Error: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input" 
Labels: area/metadata, internal, kind/bug, version/1.6

Most helpful comment

We've been struggling all day with ipsec issues on one of the nodes after upgrading a rancher environment running 1.4 to 1.5.6 (in other words, we have several nodes where ipsec was working).

After a lot of headscratching and cussing, we got ipsec working following these simple steps:

  • copy /var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge from one of the working nodes (we were missing this file on the node with ipsec issues).
  • the file is tiny: a one-line script calling nsenter, which needs the PID of the target process it should enter. _This needs to be adjusted per host, and over time_.
  • so, replace the PID given to nsenter's -t option with the PID from ps ax | grep rancher-cni (it should be the sidekick tailing the log file):
    [root@xxx bin]# ps ax | grep rancher-cni
    2754 ?  Ss  0:00 tail ---disable-inotify -F /var/log/rancher-cni.log
  • contents of /var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge after patching:
[root@xxx bin]# cat rancher-bridge 
#!/bin/sh
exec /usr/bin/nsenter -m -u -i -n -p -t 2754 -- $0 "$@"
  • now, restart network-manager and things start rolling! ;)

So, in conclusion, it seems like the process responsible for updating this file is failing, for some reason as yet unknown to us (we haven't had time to look into that yet).

All 102 comments

Can you provide the ipsec log, or try http://rancher-metadata/2015-12-19/self in a container that uses the managed network on the same host where ipsec failed?

Have you tried deleting the network stack and re-running it, then restarting the failed ipsec container?

Maybe it's related to #8368

No container uses the managed network. All 4 are running with network_mode: host. I just deactivated all my hosts in Rancher because it was causing one host (where the scheduler is apparently running) to push 400 Mb/s, and also my Rancher server to sit at 100% CPU at all times with 8 cores and 16 GB of RAM. One of the 4 containers is a healthchecker of mine designed to keep the other 3 alive, so my services keep running even with hosts deactivated in Rancher.

I just wish I could get rid of this ipsec thing and clustering features as I'm not interested in them.

I only picked Rancher because it was the only one supporting the "run one container on each host" feature and allowed you to NOT treat all your machines as a single block. Things seem to be moving in the other direction, and it's giving me all sorts of issues. I can't even disable ipsec at all.

ipsec does not produce any logs, it doesn't even start.

Tried adding a dummy Ubuntu container (with the managed network) that just sleeps and prints something to the console. It fails to start with the same error, saying it failed to bring up the network.

Also, I found this error in the logs of network-manager container:

Error (Couldn't bring up network: failed to find plugin "rancher-bridge" in path [/opt/cni/bin /var/lib/cni/bin /usr/local/sbin /usr/sbin /sbin /usr/local/bin /usr/bin /bin])
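That error is the CNI machinery walking its plugin search path and finding no executable named rancher-bridge in any of those directories. Conceptually, the lookup works like this (a simplified model for illustration, not Rancher's actual code):

```python
import os

def find_plugin(name, search_path):
    """Return the first executable called `name` found in `search_path`,
    else None (a simplified model of how a CNI runtime locates plugins)."""
    for d in search_path:
        candidate = os.path.join(d, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

# With the wrapper script missing from every directory in the path,
# the lookup returns None, which is exactly this failure.
```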

I have the same issue, reproducible with 1.5.2, 1.5.3 and 1.5.4 Error (Couldn't bring up network: failed to find plugin "rancher-bridge" in path [/opt/cni/bin /var/lib/cni/bin /usr/local/sbin /usr/sbin /sbin /usr/local/bin /usr/bin /bin])

I had the same issue because I was running a Rancher host inside VirtualBox on another Rancher host :-S. When I removed the VirtualBox machine from the cluster, the error disappeared.

Are you using a similar config?

same issue happens to us on physical hardware :-(

@asg1612 you can read the ticket and see it's not the same config - at all. No virtualbox involved.

I have about 100 hosts running 4 containers.

Did someone manage some sort of workaround? This is a really annoying issue, as we cannot use more than 1 host for now (out of the 4 we normally use).

me too

Did someone manage some sort of workaround? This is a really annoying issue, as we cannot use more than 1 host for now (out of the 4 we normally use).

Nope... and no one from the devs is responding either. I upgraded to 1.5.4 and 1.5.5 without luck. The bug is still present and it affects 90% of all our servers. I can't pinpoint the fault at all. I checked, re-installed, deleted, upgraded, downgraded... you name it, nothing pops out.

We may have to roll our own custom solution as this is not a production ready setup for us unfortunately.

I've reproduced and solved a similar issue, #8276. Hope this helps.

@snipking I believe that's required when you run the agent on the same host with the server. Can you confirm?

same issue

@snipking
I'm investigating a bit here: there are no binaries in the /opt/cni/bin dir in rancher/network-manager:v0.6.6.

And all of my servers today show messages like the ones in the first post.

OK, this is not funny anymore. Looks like healthcheck and ipsec are constantly failing:
http://i.share.pho.to/dda8108c_o.png

Any Docker container executed from Rancher fails. Doing the same thing from the console works without issues.

Same issue on fresh, clean machine with Ubuntu 16

@ciokan Actually, I've reproduced #8276 in two envs, which means it happens whether or not the agent and server are on the same host. Even one host with the wrong IP in a Rancher env will cause this issue on the other hosts in that env. This may not happen immediately, because you may add a host with the wrong IP to an env which already has hosts working fine, but it can surface after you reboot some of your working hosts.

I'm not sure your problem is the same as mine, but it's worth a try.

@fedya I'm using rancher/network-manager:v0.6.6 and there's only rancher-bridge in /opt/cni/bin. In my case I didn't see any error message related to it.

Still not working
docker 17.03.1

time="2017-04-20T11:12:02Z" level=info msg="Evaluating state from retry" cid=eac88e13f505ba0362c49601f269e758a8cf4bb4d2cec434eccdbce0ca908d6e count=14 
time="2017-04-20T11:12:02Z" level=info msg="CNI up" cid=eac88e13f505ba0362c49601f269e758a8cf4bb4d2cec434eccdbce0ca908d6e networkMode=ipsec 
time="2017-04-20T11:12:02Z" level=error msg="Failed to evaluate networking: failed to find plugin \"rancher-bridge\" in path [/opt/cni/bin /var/lib/cni/bin /usr/local/sbin /usr/sbin /sbin /usr/local/bin /usr/bin /bin]"

Well, let's look into rancher/network-manager:v0.6.6

# docker exec -ti 5b4670639597 bash

root@bongie:/# ls -la /opt/cni/bin/
total 8
drwx------ 2 root root 4096 Apr 20 11:11 .
drwxr-xr-x 4 root root 4096 Apr 20 11:11 ..

Empty dir; there are no binaries where Rancher expects them.

let's find them

root@bongie:/# find / -name "rancher-bridge"
find: '/proc/11335': No such file or directory
/var/lib/docker/overlay/9a46a21caa97b45e05beb39bbcdfa699c6e149e929673ef083e051f429eb93a7/root/opt/cni/bin/rancher-bridge
/var/lib/docker/overlay/7e3f77562c357050381c8bffb4e0b03c3d52b3a2cb1e40da94f9f566b5258a37/root/opt/cni/bin/rancher-bridge
/var/lib/docker/overlay/e7d3d2d6ffe968aaa64c9c7ba2b8277672832aeae646af4d8a6b1321d6e0f45d/root/opt/cni/bin/rancher-bridge
/var/lib/docker/overlay/357c145376e656c58432794cdb28adc6ff9bd6848edbde10b03fa11ab3f2779c/root/opt/cni/bin/rancher-bridge

Well, they're found, but in odd places.

Let's copy the binaries to the right place, /opt/cni/bin,

and now in the logs I see this:

687210, TimeNano:1492687210508494434}. Error: Get http://rancher-metadata/2015-12-19/version: dial tcp: lookup rancher-metadata on 192.168.1.1:53: no such host" 
time="2017-04-20T11:20:43Z" level=info msg="Evaluating state from retry" cid=8d63ff41f5cabb760b6666ee89579e6dbce012c6ac2141220ea77bbe0e1b982d count=1 
time="2017-04-20T11:20:43Z" level=info msg="CNI up" cid=8d63ff41f5cabb760b6666ee89579e6dbce012c6ac2141220ea77bbe0e1b982d networkMode=ipsec 
time="2017-04-20T11:20:47Z" level=error msg="macsync: error syncing MAC addresses for the first tiime: inspecting container: 5b85bfbcad07ca3856f8d6e7747265243898dfc0068fed5664cda577ef5c5d0e: Error: No such container: 5b85bfbcad07ca3856f8d6e7747265243898dfc0068fed5664cda577ef5c5d0e" 

Culprit, from the logs of network-services-network-manager-2:

dial tcp: lookup rancher-metadata on **192.168.1.1:53: no such host**
time="2017-04-20T11:46:47Z" level=info msg="Evaluating state from retry" cid=05eee3c01ca8d0563e50be9bed82e42703b1594fa51a5c017c1492cae99e5d71 count=1

Obviously 192.168.1.1 is my router, and of course rancher-metadata can't be resolved there.
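To confirm from inside a container that the rancher-metadata lookup is falling through to the host's resolver, a quick check like this helps (illustrative diagnostic, not a fix):

```python
import socket

def resolve(name):
    """Return the IPv4 address the local resolver gives for `name`, or None on failure."""
    try:
        return socket.gethostbyname(name)
    except socket.gaierror:
        return None

# Inside an affected container, resolve('rancher-metadata') returns None
# because the query ends up at the upstream resolver (192.168.1.1 here)
# instead of Rancher's internal DNS.
```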

time="2017-04-20T12:12:10Z" level=error msg="Failed to apply cni conf: Error: No such container: f9f200742352744f2f58ea1f70283962d63ed09ddcddf97fbb8940c968fa0550"

Probably related to the sidekick container; it's not working as Rancher expects.

Something similar is happening to me after updating to 1.5.5. I've even tried creating a whole new environment to try and get the ipsec stack to start, but it still exhibits the same error.

"Failed to evaluate networking: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input"

@penguinxr2 yes, same issue, there is no network interface in container.

Same issue here, anyone have any updates?

My issues went away once I switched to Ubuntu 16.04 instead of Debian

We've been struggling all day with ipsec issues on one of the nodes after upgrading a rancher environment running 1.4 to 1.5.6 (in other words, we have several nodes where ipsec was working).

After a lot of headscratching and cussing, we got ipsec working following these simple steps:

  • copy /var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge from one of the working nodes (we were missing this file on the node with ipsec issues).
  • the file is tiny: a one-line script calling nsenter, which needs the PID of the target process it should enter. _This needs to be adjusted per host, and over time_.
  • so, replace the PID given to nsenter's -t option with the PID from ps ax | grep rancher-cni (it should be the sidekick tailing the log file):
    [root@xxx bin]# ps ax | grep rancher-cni
    2754 ?  Ss  0:00 tail ---disable-inotify -F /var/log/rancher-cni.log
  • contents of /var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge after patching:
[root@xxx bin]# cat rancher-bridge 
#!/bin/sh
exec /usr/bin/nsenter -m -u -i -n -p -t 2754 -- $0 "$@"
  • now, restart network-manager and things start rolling! ;)

So, in conclusion, it seems like the process responsible for updating this file is failing, for some reason as yet unknown to us (we haven't had time to look into that yet).
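The patched file above can be regenerated mechanically. A minimal sketch of rendering the wrapper for a given PID (the helper name is ours; the path is the one from the comment above):

```python
BRIDGE_PATH = '/var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge'

def render_bridge_wrapper(pid):
    """Render the one-line nsenter shim that stands in for the rancher-bridge plugin."""
    return '#!/bin/sh\nexec /usr/bin/nsenter -m -u -i -n -p -t %d -- $0 "$@"\n' % pid

# To apply it (run as root on the affected host):
# with open(BRIDGE_PATH, 'w') as f:
#     f.write(render_bridge_wrapper(2754))
# then make the file executable and restart network-manager.
```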

tagging @leodotcloud into this ticket. Expert sleuthing @kaos

@kaos that seems to work. I have to play "catch" since Rancher is constantly changing the PID as it restarts the CNI service, but once I'm fast enough with setting the PID, ipsec starts rolling. With so many servers though (and with many of them being restarted through the day) that's quite a lot to manage, so I'll try to deploy a container that constantly monitors the PIDs and does just that... at least until the Rancher devs fix this.

Indeed, if I was digging in the right place, the plugin-manager updates this stuff every 5 minutes, or when triggered by a change in the metadata. I'm guessing something in there is broken.

Here's my "working" code to fix this:

import time
import subprocess

BRIDGE_PATH = '/var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge'

BRIDGE_TPL = """#!/bin/sh
exec /usr/bin/nsenter -m -u -i -n -p -t {} -- $0 "$@"
"""


def cmd(command):
    try:
        return subprocess.check_output(command, shell=True).decode('utf-8').strip()
    except subprocess.CalledProcessError:
        return ''


def watch():
    # PID of the ipsec-cni sidekick container on this host
    ipsec_pid = cmd("docker ps --filter 'name=ipsec-cni' --format '{{.ID}}' "
                    "| xargs docker inspect --format '{{.State.Pid}}'")

    if ipsec_pid:
        try:
            with open(BRIDGE_PATH) as f:
                bridge_contents = f.read()
        except IOError:
            bridge_contents = ''

        if ipsec_pid not in bridge_contents:
            # Rewrite the nsenter wrapper with the current PID
            # (writing from Python avoids the shell mangling $0 and "$@")
            with open(BRIDGE_PATH, 'w') as f:
                f.write(BRIDGE_TPL.format(int(ipsec_pid)))


if __name__ == "__main__":
    while True:
        watch()
        time.sleep(1)

This one runs inside a container that needs the following volumes:

/usr/bin/docker:/usr/bin/docker
/var/run/docker.sock:/var/run/docker.sock

It constantly monitors the PID and the nsenter file for a mismatch. Not the prettiest, but it works for now.

Also, I made it public if anyone else wants to try it out before official fix:

docker pull quay.io/drsoft/ipsec-fix:latest

@ciokan can you ping me on our users slack? I'd gladly troubleshoot this with you

Let me explain how things work so maybe people can troubleshoot this better. The network-manager container looks at metadata for services of kind "networkDriverService", and then looks for the local container for that service. Once it finds the container, it writes a small nsenter shell script that acts as the CNI plugin. The code is here, and it's pretty straightforward: https://github.com/rancher/plugin-manager/blob/master/binexec/watcher.go. So a couple of things could go wrong:

  1. Metadata isn't working: Look at the docker logs output of metadata and see if anything is fishy
  2. Metadata has the wrong info in it: Go into the network-manager container and run curl -H "Accept: application/json" http://169.254.169.250/latest. Then send me the output: [email protected].

I'm guessing some crufty data hanging around could be causing issues. Sorry this issue went unnoticed for so long; the couple of weeks before and after DockerCon are pretty hectic for us. Thanks!
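The selection step described above can be sketched roughly like this (the dict shape is an assumption for illustration, not the exact metadata schema; see watcher.go for the real logic):

```python
def find_network_driver_services(metadata):
    """Pick services of kind 'networkDriverService' out of a metadata dump.

    `metadata` is assumed to be parsed JSON from http://169.254.169.250/latest;
    the field names used here are illustrative."""
    return [s for s in metadata.get('services', [])
            if s.get('kind') == 'networkDriverService']
```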

Any ETA on this? We're seeing this in production too; post-upgrade, the entire environment goes down. Nothing to see in the metadata logs, but healthcheck doesn't seem to work either.

@lkm please try the image I linked above. I have it working in production with 100+ servers. Until there's an official fix, this is the only thing that has worked. It incorporates what @kaos provided earlier.

@ciokan @kaos @lkm @spawnthink @asg1612 @OneB1t @maltyxx @snipking @fedya @tpgxyz @Emil983
Can you please ping me on https://slack.rancher.io so that I can take a look at your setup and debug what's going on?

Also could you please collect the logs and share them with me?
https://github.com/leodotcloud/rancher-logs-collector ==> this will make it easier to collect the logs.

We have the same issue and will help with logs if wanted

@ingardm Thank you. The logs might be more than the allowed limit of email attachment size. Please share them with me on slack.

@leodotcloud which logs are you after specifically? The logs of the r-ipsec-ipsec containers are empty

@ingardm Please use https://github.com/leodotcloud/rancher-logs-collector to collect the logs :)

@leodotcloud @superseb Aight. I've got the logs collected from the environment having problems. Please contact me on slack for details

Thanks for all the digging and fixes; we need to get to the bottom of this. Does anyone have steps to reproduce? I think there has been some log digging and troubleshooting on/offline, but there is no reproducible plan yet. I've tried multiple big clusters running Debian/Ubuntu, rebooting/resetting nodes, host networked/managed, killing/restarting infra containers.

@superseb I can't provide any steps to reproduce it (if I knew exactly what causes it, I would probably avoid it entirely), but I can tell you that it's affecting 95% of all our hosts/servers. I can provide a screen share or even SSH into a non-working host if you want to take a look at it.

@superseb Any news on this? The whole orchestration on Rancher is not working.
healthcheck and ipsec are failing:
healthcheck (Expected state running but got error: Couldn't bring up network: failed to find plugin "rancher-bridge" in path [/opt/cni/bin /var/lib/cni/bin /usr/local/sbin /usr/sbin /sbin /usr/local/bin /usr/bin /bin])

@tpgxyz Not yet, working to reproduce. Will update soon.

@ciokan Ping me on Rancher Users Slack, would love to check at the hosts to locate the problem and get this out of the way for a lot of users.

@superseb Can you provide a step-by-step procedure for collecting all the needed logs that may help you resolve the issue?

@tpgxyz Logging can be collected using: https://github.com/leodotcloud/rancher-logs-collector

As we are trying to reproduce, I would love to find the common denominator in this case. So if you could post the following:

Current Rancher version:
Upgraded from Rancher version:
Infrastructure rancher/network-manager version:
Docker version: (`docker version`,`docker info` preferred)
Operating system and kernel: (`cat /etc/os-release`, `uname -r` preferred)
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

Basically we are running Rancher to manage a de-centralized RPM build system around the globe. Each machine runs a different operating system (Fedora, CentOS, OpenMandriva etc.).
We had to disable Rancher on all machines due to this bug, and run all the needed Docker containers by hand.

Below you can find information about system where rancher server is running:
Current Rancher version: 1.6.0
Upgraded from Rancher version: 1.5.5
Infrastructure rancher/network-manager version:
Docker version: (docker version,docker info preferred)
abf.openmandriva.org@root[0]:[~] # docker version
Client:
Version: 1.10.3
API version: 1.22
Go version: go1.5.1
Git commit: c3959b1
Built:
OS/Arch: linux/amd64

Server:
Version: 1.10.3
API version: 1.22
Go version: go1.5.1
Git commit: c3959b1
Built:
OS/Arch: linux/amd64

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
abf.openmandriva.org@root[0]:[~] # cat /etc/os-release
NAME="OpenMandriva Lx"
VERSION="2014.0 (Phosphorus)"
ID="openmandriva_association"
VERSION_ID=2014.0
BUILD_ID=20160709.21
PRETTY_NAME="OpenMandriva Lx 2014.0 (Phosphorus)"
ANSI_COLOR="1;43"
CPE_NAME="cpe:/o:openmandriva_association:openmandriva_lx:2014.0"
HOME_URL="http://openmandriva.org/"
BUG_REPORT_URL="https://issues.openmandriva.org/"

abf.openmandriva.org@root[0]:[~] # uname -r
4.9.0-server-1omv

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
Bare-metal

I've set up the CLI and run that script; well, there's nothing interesting inside:

rancher_host.log and rancher_ps_c_a.log are empty. wtf?

[tpg@lazur rancher.Qz1X]$ cat rancher_ps_s_a.log
ID TYPE NAME IMAGE STATE SCALE SYSTEM ENDPOINTS DETAIL
1s21 service healthcheck/healthcheck rancher/healthcheck:v0.2.3 inactive 0/1 true
1s22 service network-services/network-manager rancher/network-manager:v0.7.0 unhealthy 0/1 true
1s23 service network-services/metadata rancher/metadata:v0.9.1 unhealthy 0/1 true
1s23 sidekick network-services/metadata/dns rancher/dns:v0.15.0 unhealthy 0/1 true
1s24 networkDriverService ipsec/ipsec rancher/net:holder inactive 0/1 true
1s24 sidekick ipsec/ipsec/router rancher/net:v0.11.2 inactive 0/1 true
1s24 sidekick ipsec/ipsec/cni-driver rancher/net:v0.11.2 inactive 0/1 true
1s25 service scheduler/scheduler rancher/scheduler:v0.7.5 inactive 0/1 true

@superseb I have a fresh environment with this problem, and I can provide full access for debug. 3 host nodes in HA, each running a MySQL container (1 master, 2 slave), each running a Rancher/Server container, each added as a host within Rancher.

  • Current Rancher version: 1.6.0
  • Upgraded from Rancher version: none (fresh install)
  • Infrastructure rancher/network-manager version: v0.7.0
  • Docker version: 17.03.1-ce, build c6d412e
  • Operating system and kernel: (cat /etc/os-release, uname -r preferred)
    [rancher@rancher-ha1 ~]$ cat /etc/os-release
    NAME="RancherOS"
    VERSION=v1.0.1
    ID=rancheros
    ID_LIKE=
    VERSION_ID=v1.0.1
    PRETTY_NAME="RancherOS v1.0.1"
    HOME_URL="http://rancher.com/rancher-os/"
    SUPPORT_URL="https://forums.rancher.com/c/rancher-os"
    BUG_REPORT_URL="https://github.com/rancher/os/issues"
    BUILD_ID=
    [rancher@rancher-ha1 ~]$ uname -r
    4.9.24-rancher
  • Type/provider of hosts: Vultr (DO)

Steps to reproduce:

  1. Provision 3 nodes with iPXE script
  2. On one node, set MySQL master via /home/rancher/mysql/config/master.cnf

    [mysqld]
    bind-address = 0.0.0.0
    server-id = 1
    
    log_bin = /var/log/mysql/mysql-bin.log
    binlog_do_db = rancher
    
  3. On remaining 2 nodes, set MySQL slave via /home/rancher/mysql/config/slave.cnf

    [mysqld]
    bind-address = 0.0.0.0
    server-id = 2 # or 3
    
    log_bin = /var/log/mysql/mysql-bin.log
    binlog_do_db = rancher
    
  4. Start MySQL containers on each node with the following script:
    docker run -td --restart=unless-stopped \
      -v /home/rancher/mysql/config:/etc/mysql/conf.d \
      -v /home/rancher/mysql/data:/var/lib/mysql \
      -p 3306:3306 \
      -e MYSQL_DATABASE=rancher \
      -e MYSQL_USER=rancher \
      -e MYSQL_PASSWORD=userpass \
      -e MYSQL_INITDB_SKIP_TZINFO=yes \
      -e MYSQL_ROOT_PASSWORD=rootpass \
      --name rancher-mysql \
      mysql:5.7
  5. On MySQL master node: set master mode

    docker exec -it rancher-mysql mysql -u root -prootpass
    GRANT REPLICATION SLAVE ON *.* TO 'slave_user'@'%' IDENTIFIED BY 'userpass';
    FLUSH PRIVILEGES;
    USE rancher;
    FLUSH TABLES WITH READ LOCK;
    SHOW MASTER STATUS;
    
    +------------------+----------+--------------+------------------+-------------------+
    | File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
    +------------------+----------+--------------+------------------+-------------------+
    | mysql-bin.000004 |      595 | rancher      |                  |                   |
    +------------------+----------+--------------+------------------+-------------------+
    1 row in set (0.00 sec)
    
    UNLOCK TABLES;
    QUIT;
    
  6. On MySQL slave nodes: set slave mode (using the master IP address)

    docker exec -it rancher-mysql mysql -u root -prootpass
    CHANGE MASTER TO MASTER_HOST='12.34.567.890', MASTER_USER='slave_user', MASTER_PASSWORD='userpass', MASTER_LOG_FILE='mysql-bin.000004', MASTER_LOG_POS=595;
    START SLAVE;
    SHOW SLAVE STATUS\G
  7. Start Rancher on the first node (MySQL master)

    # Ensure db-host is the IP of the master MySQL node
    docker run -td --restart=unless-stopped \
      -p 8080:8080 \
      -p 9345:9345 \
      --name rancher-server \
      rancher/server:v1.6.0 \
      --db-host 12.34.567.890 --db-port 3306 --db-user rancher --db-name rancher --db-pass userpass
  8. Start Rancher on nodes 2 & 3 (repeat last step)
  9. After confirming that Rancher reports all 3 nodes in HA, add each node as a Rancher host
  10. Watch IPSEC/Healthcheck chaos 🍿

Update: Rolled rancher/server back to 1.5.1, wiped DB and reinstalled. Healthcheck and IPSEC are stable.

Just finished a debug session with @superseb (super indeed!), and he managed to identify that one of the host nodes was registering the docker0 bridge IP as its host IP address. This is obviously incorrect, as the docker0 bridge is internal and does not represent the external IP address.

We analyzed the HA 4-host stack as it sat on Rancher 1.5.1, then tore it down and built it back up on Rancher 1.6.0. This time, each host was added with a specified external IP address. Stability has been achieved.

I'm not convinced that this issue thread is related to my problem, but I did experience the symptoms described here. Either way, it might not hurt to confirm that your hosts are reporting the correct IP address.
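As a quick environment-wide sanity check, you can scan the hosts list fetched from the API for agents that registered with a Docker bridge IP. A sketch, assuming you've already fetched the hosts as a list of dicts (the field name agentIpAddress is illustrative, loosely modeled on the v2-beta API; docker0 defaults to 172.17.0.1 unless you've changed bip):

```python
def hosts_with_bridge_ip(hosts, bridge_ips=('172.17.0.1',)):
    """Return hosts whose agent-reported IP is a Docker bridge IP,
    i.e. candidates for the misregistration described in this thread."""
    return [h for h in hosts if h.get('agentIpAddress') in bridge_ips]
```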

Many thanks for your time and effort, @superseb!

This issue started showing up in our environment since last week. We recently upgraded from Rancher 1.5.3 to 1.5.9. Luckily, most environments are created correctly, however some experience the Timeout getting IP address issue.

We are running Rancheros 1.0.1 on VMware. I will try to find some time later today to debug this further.

@superseb or anyone can resolve this issue ?

Found this:
{
"id": "1i33627",
"type": "container",
"links": {
"self": "…/v2-beta/projects/1a5/containers/1i33627",
"account": "…/v2-beta/projects/1a5/containers/1i33627/account",
"credentials": "…/v2-beta/projects/1a5/containers/1i33627/credentials",
"healthcheckInstanceHostMaps": "…/v2-beta/projects/1a5/containers/1i33627/healthcheckinstancehostmaps",
"hosts": "…/v2-beta/projects/1a5/containers/1i33627/hosts",
"instanceLabels": "…/v2-beta/projects/1a5/containers/1i33627/instancelabels",
"instanceLinks": "…/v2-beta/projects/1a5/containers/1i33627/instancelinks",
"instances": "…/v2-beta/projects/1a5/containers/1i33627/instances",
"mounts": "…/v2-beta/projects/1a5/containers/1i33627/mounts",
"ports": "…/v2-beta/projects/1a5/containers/1i33627/ports",
"serviceEvents": "…/v2-beta/projects/1a5/containers/1i33627/serviceevents",
"serviceExposeMaps": "…/v2-beta/projects/1a5/containers/1i33627/serviceexposemaps",
"serviceLogs": "…/v2-beta/projects/1a5/containers/1i33627/servicelogs",
"services": "…/v2-beta/projects/1a5/containers/1i33627/services",
"targetInstanceLinks": "…/v2-beta/projects/1a5/containers/1i33627/targetinstancelinks",
"volumes": "…/v2-beta/projects/1a5/containers/1i33627/volumes",
"stats": "…/v2-beta/projects/1a5/containers/1i33627/stats",
"containerStats": "…/v2-beta/projects/1a5/containers/1i33627/containerstats"
},
"actions": {
"remove": "…/v2-beta/projects/1a5/containers/1i33627/?action=remove"
},
"baseType": "instance",
"name": "network-services-network-manager-3",
"state": "error",
"accountId": "1a5",
"blkioDeviceOptions": null,
"blkioWeight": null,
"build": null,
"capAdd": null,
"capDrop": null,
"cgroupParent": null,
"command": [ 3 items
"plugin-manager",
"--metadata-url",
"http://169.254.169.250/2016-07-29"
],
"cpuCount": null,
"cpuPercent": null,
"cpuPeriod": null,
"cpuQuota": null,
"cpuSet": null,
"cpuSetMems": null,
"cpuShares": null,
"createIndex": 5,
"created": "2017-05-24T19:05:46Z",
"createdTS": 1495652746000,
"dataVolumeMounts": { },
"dataVolumes": [ 8 items
"/var/run/docker.sock:/var/run/docker.sock",
"/var/lib/docker:/var/lib/docker",
"/var/lib/rancher/state:/var/lib/rancher/state",
"/lib/modules:/lib/modules:ro",
"/run:/run:ro",
"/var/run:/var/run:ro",
"rancher-cni-driver:/etc/cni",
"rancher-cni-driver:/opt/cni"
],
"dataVolumesFrom": null,
"deploymentUnitUuid": "ae43589e-e1cd-4c88-bc83-e346a3b2bad5",
"description": null,
"devices": null,
"diskQuota": null,
"dns": null,
"dnsOpt": null,
"dnsSearch": [ 2 items
"network-services.rancher.internal",
"network-manager.network-services.rancher.internal"
],
"domainName": null,
"entryPoint": null,
"environment": {
"DOCKER_BRIDGE": "docker0",
"METADATA_IP": "169.254.169.250"
},
"expose": null,
"externalId": "",
"extraHosts": null,
"firstRunning": null,
"groupAdd": null,
"healthCheck": null,
"healthCmd": null,
"healthInterval": null,
"healthRetries": null,
"healthState": null,
"healthTimeout": null,
"hostId": "1h4",
"hostname": null,
"imageUuid": "docker:rancher/network-manager:v0.7.0",
"instanceTriggeredStop": "stop",
"ioMaximumBandwidth": null,
"ioMaximumIOps": null,
"ip": null,
"ip6": null,
"ipcMode": null,
"isolation": null,
"kernelMemory": null,
"kind": "container",
"labels": {
"io.rancher.service.deployment.unit": "ae43589e-e1cd-4c88-bc83-e346a3b2bad5",
"io.rancher.service.requested.host.id": "4",
"io.rancher.stack_service.name": "network-services/network-manager",
"io.rancher.service.launch.config": "io.rancher.service.primary.launch.config",
"io.rancher.project.name": "network-services",
"io.rancher.project_service.name": "network-services/network-manager",
"io.rancher.stack.name": "network-services",
"io.rancher.service.hash": "f3bf3137ed558867efb7108874cd42cf5baea9a3",
"io.rancher.scheduler.global": "true",
"io.rancher.container.system": "true"
},
"logConfig": {
"type": "logConfig",
"config": {
"max-file": "2",
"max-size": "25m"
},
"driver": "json-file"
},
"lxcConf": null,
"memory": null,
"memoryReservation": null,
"memorySwap": null,
"memorySwappiness": null,
"milliCpuReservation": null,
"mounts": null,
"nativeContainer": false,
"netAlias": null,
"networkContainerId": null,
"networkMode": "host",
"oomKillDisable": null,
"oomScoreAdj": null,
"pidMode": "host",
"pidsLimit": null,
"ports": null,
"primaryIpAddress": null,
"primaryNetworkId": "1n1",
"privileged": true,
"publishAllPorts": false,
"readOnly": false,
"registryCredentialId": null,
"removed": null,
"requestedHostId": "1h4",
"restartPolicy": null,
"secrets": null,
"securityOpt": null,
"serviceId": null,
"serviceIds": [
"1s30"
],
"shmSize": null,
"stackId": null,
"startCount": 0,
"startOnCreate": true,
"stdinOpen": false,
"stopSignal": null,
"storageOpt": null,
"sysctls": null,
"system": true,
"tmpfs": null,
"transitioning": "error",
"transitioningMessage": "Error response from daemon: invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\"process_linux.go:359: container init caused \\\\"rootfs_linux.go:54: mounting \\\\\\\\"/var/lib/docker/containers/f2e9fb35fdcd937a51839746cae0542a80632d120c8892061863e41c344a1814/secrets\\\\\\\\" to rootfs \\\\\\\\"/var/lib/docker/overlay/50978fadaf02d2e29421737ee02290920d62e56904c240bab5e824f44b360077/merged\\\\\\\\" at \\\\\\\\"/var/lib/docker/overlay/50978fadaf02d2e29421737ee02290920d62e56904c240bab5e824f44b360077/merged/run/secrets\\\\\\\\" caused \\\\\\\\"mkdir /var/lib/docker/overlay/50978fadaf02d2e29421737ee02290920d62e56904c240bab5e824f44b360077/merged/run/secrets: read-only file system\\\\\\\\"\\\\"\\"\n\"",
"transitioningProgress": 84,
"tty": false,
"ulimits": null,
"user": null,
"userPorts": null,
"usernsMode": null,
"uts": null,
"uuid": "850e5929-05ad-449d-b27e-f385c5f9d62d",
"version": "0",
"volumeDriver": null,
"workingDir": null

}

Again, output from healthcheck:

{

"id": "1i36522",
"type": "container",
"links": {
    "self": "…/v2-beta/projects/1a5/containers/1i36522",
    "account": "…/v2-beta/projects/1a5/containers/1i36522/account",
    "credentials": "…/v2-beta/projects/1a5/containers/1i36522/credentials",
    "healthcheckInstanceHostMaps": "…/v2-beta/projects/1a5/containers/1i36522/healthcheckinstancehostmaps",
    "hosts": "…/v2-beta/projects/1a5/containers/1i36522/hosts",
    "instanceLabels": "…/v2-beta/projects/1a5/containers/1i36522/instancelabels",
    "instanceLinks": "…/v2-beta/projects/1a5/containers/1i36522/instancelinks",
    "instances": "…/v2-beta/projects/1a5/containers/1i36522/instances",
    "mounts": "…/v2-beta/projects/1a5/containers/1i36522/mounts",
    "ports": "…/v2-beta/projects/1a5/containers/1i36522/ports",
    "serviceEvents": "…/v2-beta/projects/1a5/containers/1i36522/serviceevents",
    "serviceExposeMaps": "…/v2-beta/projects/1a5/containers/1i36522/serviceexposemaps",
    "serviceLogs": "…/v2-beta/projects/1a5/containers/1i36522/servicelogs",
    "services": "…/v2-beta/projects/1a5/containers/1i36522/services",
    "targetInstanceLinks": "…/v2-beta/projects/1a5/containers/1i36522/targetinstancelinks",
    "volumes": "…/v2-beta/projects/1a5/containers/1i36522/volumes",
    "stats": "…/v2-beta/projects/1a5/containers/1i36522/stats",
    "containerStats": "…/v2-beta/projects/1a5/containers/1i36522/containerstats"
},
"actions": {
    "remove": "…/v2-beta/projects/1a5/containers/1i36522/?action=remove"
},
"baseType": "instance",
"name": "healthcheck-healthcheck-2",
"state": "error",
"accountId": "1a5",
"blkioDeviceOptions": null,
"blkioWeight": null,
"build": null,
"capAdd": null,
"capDrop": null,
"cgroupParent": null,
"command": null,
"cpuCount": null,
"cpuPercent": null,
"cpuPeriod": null,
"cpuQuota": null,
"cpuSet": null,
"cpuSetMems": null,
"cpuShares": null,
"createIndex": 700,
"created": "2017-05-29T08:08:00Z",
"createdTS": 1496045280000,
"dataVolumeMounts": { },
"dataVolumes": [
    "/var/lib/rancher/etc:/var/lib/rancher/etc:ro"
],
"dataVolumesFrom": null,
"deploymentUnitUuid": "9ebfbcef-5316-48cd-a544-e0755259f9f1",
"description": null,
"devices": null,
"diskQuota": null,
"dns": [
    "169.254.169.250"
],
"dnsOpt": null,
"dnsSearch": [ 3 items
    "healthcheck.rancher.internal",
    "healthcheck.healthcheck.rancher.internal",
    "rancher.internal"
],
"dockerPorts": [ ],
"domainName": null,
"entryPoint": null,
"environment": null,
"expose": null,
"externalId": "97d2206b6778fe1d71ae0eb64deabf15b2ed289f67293f09ea2acd48afd6a004",
"extraHosts": null,
"firstRunning": null,
"groupAdd": null,
"healthCheck": {
    "type": "instanceHealthCheck",
    "healthyThreshold": 2,
    "interval": 2000,
    "port": 42,
    "responseTimeout": 2000,
    "strategy": "recreate",
    "unhealthyThreshold": 3
},
"healthCmd": null,
"healthInterval": null,
"healthRetries": null,
"healthState": null,
"healthTimeout": null,
"hostId": "1h7",
"hostname": null,
"imageUuid": "docker:rancher/healthcheck:v0.3.1",
"instanceTriggeredStop": "stop",
"internalVolumes": [ ],
"ioMaximumBandwidth": null,
"ioMaximumIOps": null,
"ip": null,
"ip6": null,
"ipcMode": null,
"isolation": null,
"kernelMemory": null,
"kind": "container",
"labels": {
    "io.rancher.service.deployment.unit": "9ebfbcef-5316-48cd-a544-e0755259f9f1",
    "io.rancher.service.requested.host.id": "7",
    "io.rancher.stack_service.name": "healthcheck/healthcheck",
    "io.rancher.service.launch.config": "io.rancher.service.primary.launch.config",
    "io.rancher.project.name": "healthcheck",
    "io.rancher.project_service.name": "healthcheck/healthcheck",
    "io.rancher.container.create_agent": "true",
    "io.rancher.stack.name": "healthcheck",
    "io.rancher.service.hash": "7f8d78bdc5e48b52d1821bdfa68b774c0b74d456",
    "io.rancher.scheduler.global": "true",
    "io.rancher.container.system": "true"
},
"logConfig": {
    "type": "logConfig",
    "config": {
        "max-file": "2",
        "max-size": "25m"
    },
    "driver": "json-file"
},
"lxcConf": null,
"memory": null,
"memoryReservation": null,
"memorySwap": null,
"memorySwappiness": null,
"milliCpuReservation": null,
"mounts": null,
"nativeContainer": false,
"netAlias": null,
"networkContainerId": null,
"networkMode": "managed",
"oomKillDisable": null,
"oomScoreAdj": null,
"pidMode": null,
"pidsLimit": null,
"ports": null,
"primaryIpAddress": null,
"primaryNetworkId": "1n5",
"privileged": false,
"publishAllPorts": false,
"readOnly": false,
"registryCredentialId": null,
"removed": null,
"requestedHostId": "1h7",
"restartPolicy": null,
"secrets": null,
"securityOpt": null,
"serviceId": null,
"serviceIds": [
    "1s26"
],
"shmSize": null,
"stackId": null,
"startCount": 0,
"startOnCreate": true,
"stdinOpen": false,
"stopSignal": null,
"storageOpt": null,
"sysctls": null,
"system": true,
"tmpfs": null,
"transitioning": "error",
"transitioningMessage": "Couldn't bring up network: failed to find plugin \"rancher-bridge\" in path [/opt/cni/bin /var/lib/cni/bin /usr/local/sbin /usr/sbin /sbin /usr/local/bin /usr/bin /bin]",
"transitioningProgress": 84,
"tty": false,
"ulimits": null,
"user": null,
"userPorts": null,
"usernsMode": null,
"uts": null,
"uuid": "ef09b71e-7326-4e7a-b1ca-089015765387",
"version": "0",
"volumeDriver": null,
"workingDir": null

}

time="2017-05-31T11:44:48Z" level=error msg="Failed to evaluate networking: failed to find plugin \"rancher-bridge\" in path [/opt/cni/bin /var/lib/cni/bin /usr/local/sbin /usr/sbin /sbin /usr/local/bin /usr/bin /bin]"

Issue still not resolved.

Had the time to look around @ciokan systems, thanks a lot for that. Great to get an insight into the actual running problem. So as all the logs hint to a network problem, the root cause seems to be metadata returning inconsistent results. So quering 169.254.169.250/latest/self/host/uuid would return different results, and this output is actually used to setup networking. We have a lead and are investigating why this is happening.

@superseb do you have any plans to fix it? Is it fixed in 1.6.1?

@superseb @maciekciolek Hi, looks like 1.6.1 did not fix this issue.

As long as this is open we are working on it; once a fix is available we will definitely share it here so people can try it. Let me get you an update today on our investigation. In the meantime, if you can confirm you are seeing the same issue as described in https://github.com/rancher/rancher/issues/8383#issuecomment-305400907 that would be great.

@superseb How do I confirm https://github.com/rancher/rancher/issues/8383#issuecomment-305400907? Can you please share a step-by-step procedure?

@superseb same issue here. Any update?

@maciekciolek @tpgxyz On the host experiencing issues, you can Execute shell on a container (i.e. network-manager or healthcheck) and run curl http://169.254.169.250/latest/self/host/uuid repeatedly. On an affected host this returns varying results.
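For checking many hosts, the repeated curl can be scripted. A minimal POSIX sh sketch; the endpoint URL is the one from the comment above, while the function names and probe count are made up for illustration:

```shell
#!/bin/sh
# Probe the metadata service N times and report whether the host UUID flaps.
# Override fetch_uuid (or METADATA_URL) to point at a different endpoint.
METADATA_URL="${METADATA_URL:-http://169.254.169.250/latest/self/host/uuid}"

fetch_uuid() { curl -s "$METADATA_URL"; }

probe_uuid_stability() {
    # $1 = number of probes; prints "stable" or "flapping"
    n="$1" i=0 results=""
    while [ "$i" -lt "$n" ]; do
        results="$results $(fetch_uuid)"
        i=$((i + 1))
    done
    # Count distinct answers; more than one means metadata is inconsistent.
    distinct=$(printf '%s\n' $results | sort -u | wc -l)
    if [ "$distinct" -eq 1 ]; then echo stable; else echo flapping; fi
}

# Usage on a suspect host: probe_uuid_stability 20
```

On a healthy host this prints "stable"; on the hosts described in this thread it would print "flapping".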

@superseb I am getting the same results while executing this repeatedly. The only fix which works for me is this script: https://github.com/rancher/rancher/issues/8383#issuecomment-298100234

@superseb @tpgxyz I have just solved the problem on our environment.

In our case the problem was that we have 3 NICs per host, and when I was starting the agent it was auto-detecting the wrong agent host IP (the IP address of a local docker network), which was not accessible from the server and the other hosts. The solution that works for me is just providing an IP address in the 4th step of "Add host" which is accessible from the server and the other hosts.

In other words, you need to manually provide the CATTLE_AGENT_IP env variable while you are starting an agent. It also means that you need to do it even for an agent which runs on the same host as the server.

sudo docker run -e CATTLE_AGENT_IP="192.168.0.30" --rm --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.2.2 http://192.168.0.87:7000/v1/scripts/***

After this fix ipsec, healthcheck, cni erros disappear.

Tested on: 1.5.1 and 1.5.10

@maciekciolek Many thanks! It works now.

@fedya \o/

It works on 1.6.2 as well :1st_place_medal:

@maciekciolek Where is that CATTLE_AGENT_IP supposed to point when having multiple servers? The rancher ip address?

Is it already in the stable branch?

@ciokan as far as I understand, you have one (or more) rancher servers S{x} and many instances on which you want to run rancher agents (workers), let's say I1, I2, I3. If you are adding a new host (rancher agent / worker) which will run on instance I1, you should provide as the value of CATTLE_AGENT_IP the IP address of the instance I1. If you are adding a host which runs on I2, then use the IP address of I2.

The most important thing is that the value of CATTLE_AGENT_IP should be the IP of an interface which is accessible (e.g. on the same network) by the other instances and servers.

@maciekciolek did I understand right that CATTLE_AGENT_IP must be the server IP, not the docker0 interface?

@fedya CATTLE_AGENT_IP must be the IP of the agent which you are starting. The value will be propagated by this agent to the server and other agents and will be used by them to communicate with this agent. So, if you have multiple hosts in the local network, the value should be the IP of the host on which you are starting the agent.

CATTLE_AGENT_IP is not the server IP. The server IP is part of the start command, e.g.:
sudo docker run -e CATTLE_AGENT_IP="192.168.0.30" --rm --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.2.2 http://192.168.0.87:7000/v1/scripts/**

192.168.0.30 - IP of the host on which I am starting the agent
192.168.0.87:7000 - IP and port of the Rancher server

Well, here is how I run a node:

docker run --rm -e CATTLE_AGENT_IP="192.168.1.165" --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.2.2 http://abf.openmandriva.org:8080/v1/scripts/xxxxxxx

CATTLE_AGENT_IP = ip from my node

And I got the same results as above in this thread:

nsenter: failed to execute /opt/cni/bin/rancher-bridge: No such file or directory
time="2017-06-26T14:05:46Z" level=error msg="Failed to evaluate networking: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input" 
time="2017-06-26T14:05:47Z" level=info msg="Evaluating state from retry" cid=63e6008665a6b764541593ecfa9fba465674b8c420714e39752d83fe7392de75 count=11 
time="2017-06-26T14:05:47Z" level=info msg="CNI up" cid=63e6008665a6b764541593ecfa9fba465674b8c420714e39752d83fe7392de75 networkMode=ipsec 
nsenter: failed to execute /opt/cni/bin/rancher-bridge: No such file or directory
time="2017-06-26T14:05:47Z" level=error msg="Failed to evaluate networking: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input" 

Metadata returning inconsistent results (exposed ports of container). This might be user error: it seems that metadata results are not ordered in a strict way, so there is no guarantee that index numbers (as used in the snippet below) stay the same. Note to self: always use resource names when querying metadata.

Rancher v1.6.2
Network Services v0.2.2

root@92e5bd7ca9cf:/# curl rancher-metadata/2015-12-19/stacks/lb-test/services/1/containers/0/ports
root@92e5bd7ca9cf:/# curl rancher-metadata/2015-12-19/stacks/lb-test/services/1/containers/0/ports
0
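To make the "always use resource names" rule concrete, here is a hypothetical helper that builds a name-based metadata URL; the stack/service names and the 2015-12-19 version string are placeholders taken from the snippet above, not a documented Rancher API wrapper:

```shell
#!/bin/sh
# Build a metadata URL that addresses a service by name instead of by a
# positional index, which (per the comment above) is not guaranteed stable.
metadata_service_url() {
    # $1 = stack name, $2 = service name
    echo "http://rancher-metadata/2015-12-19/stacks/$1/services/$2/containers"
}

# Usage (names are placeholders):
# curl -s "$(metadata_service_url lb-test lb)"
```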

yes, still not working

Is there any progress from the dev team regarding this issue? It is almost 4 months old with 71 comments so far. I keep delaying my team from going fully with Rancher due to this bug.

We are too. Very annoying issue.
BTW, I tested it with stable CoreOS; same bug.

Setting CATTLE_AGENT_IP = the IP of my node works for me. I've tested in the cloud, not locally.

I tried with CATTLE_AGENT_IP, no success.

@fedya
Try with the latest version of the agent: rancher/agent:v1.2.5

Also, take a look at this: http://rancher.com/docs/rancher/v1.6/en/hosts/#supported-docker-versions

Currently, I'm using Docker 1.12.* on my rancher server (Ubuntu 16.04) and CoreOS on the agents.

still no success

docker run --rm --privileged -e CATTLE_AGENT_IP=rancher_server_ip_here -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.2.5 http://rancher_server_ip_here:8080/v1/scripts/KEY

 Time:1499874158, TimeNano:1499874158521952432}. Error: failed to find plugin \"rancher-bridge\" in path [/opt/cni/bin /var/lib/cni/bin /usr/local/sbin /usr/sbin /sbin /usr/local/bin /usr/bin /bin]" 
time="2017-07-12T15:42:40Z" level=info msg="Evaluating state from retry" cid=eadbd8bf8bf74677e15e2d7b6d04d744e841ec3cc94ee421fa8291820cfba211 count=10 
time="2017-07-12T15:42:40Z" level=info msg="CNI up" cid=eadbd8bf8bf74677e15e2d7b6d04d744e841ec3cc94ee421fa8291820cfba211 networkMode=ipsec 
time="2017-07-12T15:42:40Z" level=error msg="Failed to evaluate networking: failed to find plugin \"rancher-bridge\" in path [/opt/cni/bin /var/lib/cni/bin /usr/local/sbin /usr/sbin /sbin /usr/local/bin /usr/bin /bin]" 

@fedya
Which docker version are you using on your rancher server and agents?

1.12.6

Are agents able to ping the rancher server? Are you using a firewall?

@fedya: in your commannd.. "CATTLE_AGENT_IP=rancher_server_ip_here" .. that doesn't sound right. CATTLE_AGENT_IP should be IP of the docker host, not the IP of the rancher server .. or did I miss something?

@pasikarkkainen yes I tried with the ip of the client running the agent. Same error: Couldn't bring up network: failed to find plugin "rancher-bridge" in path [/opt/cni/bin /var/lib/cni/bin /usr/local/sbin /usr/sbin /sbin /usr/local/bin /usr/bin /bin]. Docker version doesn't matter as I see the same error on different hosts running the same engine version. Rancher server picks up the correct agent ip address so I don't believe CATTLE_AGENT_IP has anything to do here (tried and tested). I managed to repair 80% of the hosts using my above image but the rest of them refuse to work properly even with it.

I wouldn't be bothered with this bug as it's not affecting my services, but with it present my rancher server stays at 100% on all cores (8 of them). If I deactivate the clients with the bug present, my rancher server CPU goes down to a normal 10% - 20%. With the clients deactivated in rancher server, services work ok, but if a client gets restarted they don't start any more.

This is the second "most discussed" ticket on this project and we still don't have a clue, roadmap, timeline or official answer regarding this bug. Everyone involved here provided everything requested; I even collaborated with the dev team and provided SSH access to multiple boxes including the rancher server. The culprit seems to be tracked, but after that... silence.

We need an official answer as I don't have a single clue regarding how much we're supposed to keep waiting. If it's 1 year I can re-route my team and build something custom. If it's 1-2 months we can probably wait but the lack of feedback generates a lot of frustration within our office and, I'm sure, other teams affected by this thing.

I can see over 1600 tickets being open which must be overwhelming. I'm happy to pay for support knowing this will get the bug fixed. I'm sure others are holding cash in hand for this.

TL;DR: Give us something. We're barking at trees here.

@ciokan @fedya and everyone who is having this problem, can you please ping me on https://slack.rancher.io? I would like to schedule a debug session with you guys. I work in the PST timezone, so please give me a couple of time slots that work for both of us (6 AM PST to 12 AM PST). @superseb @janeczku Can you please coordinate too?

Our QA team puts in a lot of effort testing various scenarios on different operating systems and different rancher versions. Our team also tests rancher/server upgrade scenarios. Along with these efforts, we also have a bunch of tag automation runs triggered on a daily basis. Despite all these efforts we have not been able to hit this issue; hence the delay in root-causing/fixing it.

TL;DR: We are not ignoring anyone or this issue and would like to fix it if we can find a way to reproduce it.

To avoid confusion regarding CATTLE_AGENT_IP, this has to be set to the IP address of the host where the command is being run. And also this IP address should be reachable by all other nodes in the cluster.
For example, in AWS if you have two hosts in two different VPCs and you have used the private IP addresses of these hosts to register them with rancher/server you might have a problem if inter VPC communication is not configured properly.

Setting this variable is not needed for all environments but only on setups/clouds where the auto IP detection doesn't work due to cloud provider/network topology limitations.

Okay, I will do that.
My timezone is MSK, but usually I'm online 14-17 hours per day.

woah... after all this time... my stack is green. I'll have to thank @leodotcloud and @superseb for taking the time to get to the bottom of it. From the looks of it (Leo might step in with more updates) the docker daemon was running on a different IP address than expected, an old one actually. They managed to fix my setup; I can't give a fix right now because I only have bits of it, but they fixed about 12 - 15 machines and I'm sure the team now has an idea and can push a fix soon. It's definitely a hard thing to reproduce in their testing setup; that's why the long wait over here.

Thank you guys!

@ciokan Thank you for taking time to share your setup for debugging.

The error message: Failed to evaluate networking: failed to find plugin \"rancher-bridge\" could be due to a variety of problems. One of the reasons is as below:

Summary:
TL;DR version: if the IP address of the docker0 bridge on any host is the same as the IP address assigned to any container in the cluster, especially the metadata container, this problem surfaces.

Long version:
All of Rancher's infra microservices/containers running on any host of the cluster rely on metadata to fetch the information needed to do their jobs. rancher-metadata answers the queries in an intelligent fashion, giving out information based on the source IP address of the request. So 169.254.169.250/latest/self/container/ will yield different answers for various containers. If the source IP address lookup fails in rancher-metadata, only the parts of the information which are common to all the containers on that host are available while querying. Ex: only 169.254.169.250/latest/self/host is available but not 169.254.169.250/latest/self/container, as we cannot determine where the request is coming from.

When a new host is added to rancher, the first microservice to start is rancher-metadata along with its sidekick rancher-dns. This microservice is on docker's bridge network and gets an IP address from the subnet configured for docker. Example: if the bridge subnet is 172.17.0.0/16, an IP address in the range 172.17.0.1 to 172.17.255.254 is allocated for this container. In general the first IP address of the subnet, i.e. 172.17.0.1, is configured on the bridge, and rancher-metadata most of the time gets 172.17.0.2. While debugging @ciokan's setup, we noticed that some of the rancher-metadata containers in the cluster had an IP address of 172.17.0.1, which was quite unusual, and upon further investigation we found the docker0 bridge had the IP 172.17.42.1.

Now let's understand how network-manager works. The microservice network-manager is responsible for setting up the CNI executable and CNI config so that the IPSec stack can start properly. network-manager queries 169.254.169.250/latest/containers, 169.254.169.250/latest/services and 169.254.169.250/latest/stacks to figure out what containers, services and stacks are running in the cluster. To figure out local containers, it relies on the information returned by 169.254.169.250/latest/self/host. Since network-manager runs on the host network (--net=host), all the requests reaching the metadata container have the IP address of the docker0 bridge as the source IP. This means the source IP seen by metadata is 172.17.0.1 in most use cases, and normally there is no hit for this IP while looking up metadata's internal info. But in @ciokan's setup there were a few containers with this IP address, so the lookup didn't miss and metadata was returning information which was incorrect in this context. Since there were multiple entries with the same lookup key, the information from metadata was flapping, as observed/mentioned in a previous debugging session with a different user.

Workaround:
On hosts where the IP address of the docker0 bridge is not 172.17.0.1, change the docker config and restart docker.

Here are the steps for Ubuntu:

cat > /etc/docker/daemon.json << EOF
{
    "bip": "172.17.0.1/16",
    "fixed-cidr": "172.17.0.0/16"
}
EOF

service docker stop
rm -rf /var/run/docker.sock
ps -ef | grep docker 
# make sure none of the old docker processes are running.
# if yes do kill -9 ____
service docker start
docker ps
ip addr show docker0
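A hedged sketch of a pre-flight check for the root cause described above: compare the docker0 bridge address against the container IPs on the host and flag any collision. The helper function is hypothetical; the commented live invocations use the standard docker/iproute2 CLIs and may need adapting to your hosts:

```shell
#!/bin/sh
# Print any container IP that collides with the docker0 bridge IP.
bridge_ip_collision() {
    # $1 = docker0 IP, remaining args = container IPs
    bridge="$1"; shift
    for ip in "$@"; do
        [ "$ip" = "$bridge" ] && echo "$ip"
    done
    return 0
}

# Live usage (assumes docker CLI and iproute2 on the host):
# bridge=$(ip -4 -o addr show docker0 | awk '{print $4}' | cut -d/ -f1)
# containers=$(docker ps -q | xargs -r docker inspect -f '{{.NetworkSettings.IPAddress}}')
# bridge_ip_collision "$bridge" $containers
```

Any IP it prints is a candidate for the metadata source-IP clash this comment describes.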

Why docker0 bridge has this IP address?
Maybe the host was running an older version of docker, when the default IP used for the bridge was 172.17.42.1, and the docker version was upgraded later? Not sure.
https://github.com/moby/moby/issues/17305

Just got off a call with @fedya ... The reason for his problem was different from the above.

Summary:
Host 1: Running both rancher/server and rancher/agent ... This caused the IP detected for this host in Cattle to be 172.17.0.1.
Host 2: This is the host having problem "Error .... "

When running both server and agent, you need to pass CATTLE_AGENT_IP as the public IP address of the host while registering the host.

Fix:

  • Delete rancher-agent container on Host1
  • Start rancher-agent again on Host1, but this time pass CATTLE_AGENT_IP=PUBLIC_IP_OF_HOST_1
  • Things started fine on both Host 1 and Host 2.

Tonight we fixed this issue on my servers.
Thanks to @leodotcloud for his time.

The main culprit was the rancher agent running on the rancher server; in such a case you need to pass CATTLE_AGENT_IP when you deploy the agent on the server.

My devops team approached me with this behavior yesterday when applying an upgrade to rancher-server 1.6.5

I've double checked CATTLE_AGENT_IP is properly set when adding all of the hosts.

After doing some digging, I found that two of the hosts had docker0 as the 'old' 172.17.42.1
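A quick way to audit a fleet for that legacy address; the helper name is made up for illustration, and the commented live usage assumes iproute2:

```shell
#!/bin/sh
# Succeed if the given docker0 address is the old pre-Docker-1.9 default
# (172.17.42.1) that this thread ties to the rancher-bridge failures.
docker0_is_legacy() {
    [ "$1" = "172.17.42.1" ]
}

# Live usage on each host:
# addr=$(ip -4 -o addr show docker0 | awk '{print $4}' | cut -d/ -f1)
# docker0_is_legacy "$addr" && echo "legacy docker0 address: $addr"
```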

Applying this fix from @leodotcloud seemed to clear up the problems for them.

But there is still one other host that is in a bad state where its IPsec containers will not start up and complains about not finding rancher-bridge on the path.

I have tried the workaround mentioned by @kaos where we write the file ourselves. I just wrote a script to get it to pull the right PID in for me as the stack appears.

Here's what the script looks like if anyone else is interested:

root@o1:/var/lib/docker/volumes/rancher-cni-driver/_data/bin # ls .
patch*  rancher-bridge*

root@o1:/var/lib/docker/volumes/rancher-cni-driver/_data/bin # cat patch
echo "#!/bin/sh" > rancher-bridge
echo "exec /usr/bin/nsenter -m -u -i -n -p -t $(pgrep -f rancher-cni.log) -- \$0 \"\$@\"" >> rancher-bridge
chmod +x rancher-bridge

This seems to change the message away from any mention of rancher-bridge - but it still fails to start up.

I tried the whole process a few times, and then tried to restart network-manager right after running my patch script as well, but I still can't get IPsec to start on this host.

What I can see in the logs leads me to believe there is something screwed up more deeply with the networking (like maybe iptables) on this box. Here is a log snippet from network-manager, and it causes a problem installing the stack properly.

Connectivity isn't an issue, though -- I was able to deactivate and delete the host/remove rancher-agent before it could delete the containers in the IPsec stack, and they DID start up in standalone mode when I stopped and restarted them. I could see it get its 10.42.x.x address and communicate with everything else. BUT, it wasn't part of the ipsec stack, so Rancher would list it as Standalone when I added the host back, and it would try to start new IPsec containers. I've since removed it, but I swear that did work.

I think the root cause is likely with having crufty values somewhere in the metadata as @ibuildthecloud described. I can see that there is a message still getting spammed in the network-manager log for that host:

level=error msg="macsync: error syncing MAC addresses for the first tiime: inspecting container: 4c65b32cb3ee2a4f1d7c0fd9bc27103e9bd6955765db713d084057df5751ad36: Error: No such container: 4c65b32cb3ee2a4f1d7c0fd9bc27103e9bd6955765db713d084057df5751ad36"

How do I flush that out? I've gone so far as to remove all containers (including infrastructure and rancher-agents) and docker system prune -a on everything and bring it back up, but I still see this host complaining about the 4c65b... container.

@leodotcloud helped me troubleshoot our setup. We discovered a veth showing up on the host in question that was being returned if you ran ip addr.

After verifying this particular one looked off and should not be there, we tried removing it. It seems that it was not being cleaned up for whatever reason, and had some values for the host from the "old" setup before the upgrade.

Removing it manually via an ip link delete veth... seemed to fix things on the next go-round; IPSec started up correctly, then healthcheck, etc.

Thanks, @leodotcloud and other users on this thread.

I could pitch in another case of network interface issues related to this that we found.

It was something with the docker0 interface that was apparently causing issues for us on some occasions. It got resolved by stopping docker, wiping all docker state (/var/lib/docker, /var/run/docker, and recreating storage volumes) and, in addition, actually removing the docker0 interface, then starting docker again. It was that last piece, removing the docker0 interface, that fixed it for us.

@kaos do you happen to know what the docker bridge link address was? Just ran into this today where metadata was stealing docker's bridge ip of '172.17.0.1', possibly because post upgrade of docker it came up with a '172.42.x.x' address...

@aemneina IIRC it was 172.17.0.2. It had just incremented one octet by one.

Guys I'm being overwhelmed with this error.
We started the upgrade from version 1.5.5 to rancher 1.6.10.
We followed the recommendations to upgrade network-services before ipsec. After successfully upgrading network-services we started to upgrade the ipsec stack... it started failing on the first host with the following error:

The Rancher managed network does not work, with the error 'Couldn't bring up network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input'

@calexandre are the rancher server and rancher node the same PC?

Nope, separate hosts.

I've created a forum post with more detail here:
https://forums.rancher.com/t/rancher-1-6-10-upgrade-issues-ipsec-upgrade-error/7476

Let me know if you want me to bring the issue to GitHub.

I had the same issue yesterday. I have two hosts, one of them on the same machine as my Rancher server. The node I created on the same PC as the server worked well, but the other one was giving me the rancher-bridge error.
I resolved this by removing both hosts and reinstalling them, but first installing the host from the other machine and then the one on the same machine as the server.

I did the same steps as @guilhermefernandes1 described above and solved the issue.

I had the same problem, and it came from the fact that a machine was used both as a server and as an agent, so in the Hosts tab it was detected with a wrong IP (e.g. 172.17.0.1), because it resolved the IP locally instead of resolving it in an external way. This made the machine inaccessible from anywhere.

I stopped the agent where the IP was 172.17.0.1 and ran the agent again with:

-e CATTLE_AGENT_IP="$(ifconfig eth0 | grep 'inet ' | cut -d ' ' -f10 | awk '{ print $1}')"

You may have to modify the line a bit if your configuration is different.
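A possibly more robust alternative to grepping ifconfig output is to parse the "src" field of ip route get, which asks the kernel which local address it would use to reach the Rancher server. This helper is a hypothetical sketch (the function name and server address are placeholders), not an official Rancher mechanism:

```shell
#!/bin/sh
# Extract the "src" address from the output of `ip -4 route get <server>`:
# the local address the kernel would pick to reach that destination.
route_src_ip() {
    # stdin = output of `ip -4 route get ...`; prints the src address
    awk '{for (i = 1; i < NF; i++) if ($i == "src") print $(i + 1)}' | head -n 1
}

# Live usage (server address is a placeholder; assumes iproute2):
# CATTLE_AGENT_IP=$(ip -4 route get 192.168.0.87 | route_src_ip)
```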

Everything works fine now and the right IP shows on the Hosts tab.

LATE EDIT:

Actually, the "Add Host" page now warns that you'll have a problem when installing the server and the host on the same machine.

You can specify the IP of your server in the input.

With the release of Rancher 2.0, development on v1.6 is only limited to critical bug fixes and security patches.
