Rancher: Unable to access Rancher UI after a reboot.

Created on 31 Jul 2015 · 50 comments · Source: rancher/rancher

I'm on Ubuntu 14.04.2 LTS

I did docker run -d --restart=always -p 8080:8080 rancher/server

I was able to connect to Rancher UI with my server public IP. http://PUBLIC_IP:8080
Then if I reboot the server, I get an ERR_CONNECTION_TIMED_OUT.

I have a VPN installed, and the UI is still responding on the container IP. http://172.17.0.10:8080/

I tried again with docker run -d --restart=always -p 8090:8080 rancher/server, same thing happened.

docker ps

CONTAINER ID        IMAGE                           COMMAND                CREATED             STATUS              PORTS                                          NAMES
ec7bdcc144ac        rancher/server                  "/usr/bin/s6-svscan    24 minutes ago      Up 15 minutes       3306/tcp, 0.0.0.0:8090->8080/tcp               loving_wright
de3a8ef8cb23        rancher/agent-instance:v0.3.1   "/etc/init.d/agent-i   About an hour ago   Up 34 minutes       0.0.0.0:500->500/udp, 0.0.0.0:4500->4500/udp   acbc14ec-de64-4b8c-8fd8-3d2a8de39714
944718a36aa4        rancher/agent:v0.7.11           "/run.sh run"          About an hour ago   Up 36 minutes                                                      rancher-agent
5e8e126e9219        rancher/server                  "/usr/bin/s6-svscan    About an hour ago   Up 36 minutes       3306/tcp, 0.0.0.0:8080->8080/tcp               dreamy_morse

docker logs 5e8e126e9219

15:46:07.513 [main] INFO  ConsoleStatus - [DONE ] [95213ms] Startup Succeeded, Listening on port 8081
time="2015-07-31T15:46:09Z" level=info msg="Starting websocket proxy. Listening on [:8080], Proxying to cattle API at [localhost:8081], Monitoring parent pid [7]."
time="2015-07-31T15:46:09Z" level=info msg="Setting log level" logLevel=info
time="2015-07-31T15:46:09Z" level=info msg="Starting go-machine-service..." gitcommit=102d311
time="2015-07-31T15:46:10Z" level=info msg="Initializing event router" workerCount=10
time="2015-07-31T15:46:12Z" level=info msg="Connection established"
2015-07-31 15:46:16,730 ERROR [:] [] [] [] [ecutorService-5] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [1] count [1]
2015-07-31 15:46:21,718 ERROR [:] [] [] [] [ecutorService-2] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [1] count [2]
time="2015-07-31T15:46:25Z" level=info msg="Registering backend for host [d448658b-823a-4fc2-a654-93a35283d7d0]"
2015-07-31 15:46:26,806 ERROR [:] [] [] [] [ecutorService-9] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [1] count [1]
Labels: area/server, kind/bug


All 50 comments

Are you restarting the rancher/server container or are you rebooting the machine?

I'm rebooting the machine.

I have no issues with rebooting the machine. After a reboot, the container is restarted and you need to wait for rancher server to come up. That usually takes ~2 minutes.

Can you do a docker logs -f <container_id> after the machine is up again?

When you see this in the logs, the UI should be accessible?

time="2015-07-31T20:33:30Z" level=info msg="Starting websocket proxy. Listening on [:8080], Proxying to cattle API at [localhost:8081], Monitoring parent pid [9]." 

Unfortunately not, I do have the same problem even if I just restart the container after a fresh install.

I replaced my IP with PUBLIC_IP

20:36:55.322 [main] INFO  ConsoleStatus - [DONE ] [96906ms] Startup Succeeded, Listening on port 8081
time="2015-07-31T20:36:56Z" level=info msg="Starting websocket proxy. Listening on [:8080], Proxying to cattle API at [localhost:8081], Monitoring parent pid [7]."
time="2015-07-31T20:36:56Z" level=info msg="Setting log level" logLevel=info
time="2015-07-31T20:36:56Z" level=info msg="Starting go-machine-service..." gitcommit=102d311
2015-07-31 20:37:03,873 ERROR [:] [] [] [] [ecutorService-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [1] count [1]
time="2015-07-31T20:37:08Z" level=info msg="Registering backend for host [d448658b-823a-4fc2-a654-93a35283d7d0]"
2015-07-31 20:37:09,837 ERROR [:] [] [] [] [ecutorService-3] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [1] count [1]
time="2015-07-31T20:37:26Z" level=error msg="Unable to create EventRouter" Err="Get http://PUBLIC_IP:8080/v1: dial tcp PUBLIC_IP:8080: i/o timeout"
time="2015-07-31T20:37:26Z" level=info msg="Exiting go-machine-service..."
time="2015-07-31T20:37:27Z" level=info msg="Setting log level" logLevel=info
time="2015-07-31T20:37:27Z" level=info msg="Starting go-machine-service..." gitcommit=102d311

@oomathias If you are SSH'd into the machine where rancher/server is running, will the API work over localhost?
Specifically, does curl http://127.0.0.1:8080/v1 successfully return some json or does it timeout?

@cjellick It works.

curl -i http://127.0.0.1:8080/v1
HTTP/1.1 401 Unauthorized
Content-Length: 158
Content-Type: application/json; charset=utf-8
Server: Jetty(8.1.11.v20130520)
Www-Authenticate: Basic realm="Enter API access key and secret key as username and password"
X-Api-Schemas: http://127.0.0.1:8080/v1/schemas
Date: Fri, 31 Jul 2015 21:17:15 GMT

{"id":"4f9dc04b-ed2c-40d2-bc23-2f014e61511f","type":"error","links":{},"actions":{},"status":401,"code":"Unauthorized","message":"Unauthorized","detail":null}

Hm. Ok. So, rancher is running, the port is bound to the host, and it is reachable over localhost, but not over the public IP.

This is perhaps a dumb question, but is it possible that you were using an ephemeral IP that changed when you rebooted the host?

Nope, this is a dedicated server :) I don't even have to reboot the host, if I do:
docker run -d --restart=always -p 8080:8080 rancher/server
It works, and then if I restart by doing
service docker restart
It doesn't work anymore.

level=error msg="Unable to create EventRouter" Err="Get http://PUBLIC_IP:8080/v1: dial tcp PUBLIC_IP:8080: i/o timeout"
level=error msg="Unable to create EventRouter" Err="Get http://PUBLIC_IP:8080/v1: dial tcp PUBLIC_IP:8080: no route to host"

looks like the source of the problem, no?

That is actually a symptom, not the source. There is a microservice called "go-machine-service" running inside the rancher container that attempts to connect back to the API service over a websocket. That log line is the microservice failing to connect because it can't reach the API server over that IP.

On the server where you are running rancher/server, are you also running rancher/agent?

Yes, I do run rancher/agent.

I added the agent with docker run -e CATTLE_AGENT_IP=PUBLIC_IP -d --privileged -v /var/run/docker.sock:/var/run/docker.sock rancher/agent:v0.7.11 http://PUBLIC_IP:8080/v1/scripts/XXX:XXX:XXX

Hm, ok, I think this might be an iptables issue. Can you share the output of
iptables -L -n -t nat | grep 8080 and docker inspect <id of rancher/server container> | grep IPAddress?

My suspicion is that the 172 address from the iptables command will not be the same as the one from the inspect command. When you restart rancher/server, its IP changes, but that rule doesn't get cleaned up.
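The suspected mismatch can be checked with a small sketch like the one below. The rule line is sample output from this thread; on a live host you would feed in `iptables -L -n -t nat | grep 8080` and read the container IP with `docker inspect -f '{{ .NetworkSettings.IPAddress }}' <container_id>`. The helper name is mine, not part of rancher.

```shell
# extract the "to:IP:port" destination from a DNAT rule line
dnat_target() {
  echo "$1" | grep -o 'to:[0-9.]*' | cut -d: -f2
}

# sample values; replace with live iptables / docker inspect output
rule='DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.4:8080'
container_ip='172.17.0.2'

if [ "$(dnat_target "$rule")" != "$container_ip" ]; then
  echo "stale rule points at $(dnat_target "$rule"), container is at $container_ip"
fi
```

If the two IPs differ, the DNAT rule is forwarding port 8080 to an address no container holds anymore, which matches the timeout symptom described above.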

Before

MASQUERADE  tcp  --  172.17.0.4           172.17.0.4           tcp dpt:8080
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 to:172.17.0.4:8080

"IPAddress": "172.17.0.4"

After restarting the OS

MASQUERADE  tcp  --  172.17.0.1           172.17.0.1           tcp dpt:8080
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 to:172.17.0.1:8080

"IPAddress": "172.17.0.1"

After restarting Docker

MASQUERADE  tcp  --  172.17.0.2           172.17.0.2           tcp dpt:8080
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 to:172.17.0.2:8080

"IPAddress": "172.17.0.2"

Hmph. Everything appears to be in order there. Well, it was worth a shot.

@oomathias what does your route table look like? You mentioned you were on a VPN; is there any chance your private IP space is overlapping with the docker 172.17.0.0/16 route?

I made some new tests.

My VPN address: MASQUERADE all -- 10.8.0.0/16 0.0.0.0/0, so it looks OK to me.

More interestingly, when I install Rancher and add the agent, I can reboot and everything works.
It's when the Network Agent is created, after adding a basic ubuntu:14.04.2 container, that something goes wrong.

Initial state (rancher-server + agent):

MASQUERADE  tcp  --  172.17.0.1           172.17.0.1           tcp dpt:8080
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 to:172.17.0.1:8080

I can reboot the OS and/or Docker, everything still works.

Then if I add a container:

MASQUERADE  tcp  --  172.17.0.1           172.17.0.1           tcp dpt:8080
MASQUERADE  udp  --  172.17.0.2           172.17.0.2           udp dpt:4500
MASQUERADE  udp  --  172.17.0.2           172.17.0.2           udp dpt:500
ACCEPT     all  --  10.42.0.0/16         169.254.169.250
MASQUERADE  tcp  --  10.42.0.0/16        !10.42.0.0/16         masq ports: 1024-65535
MASQUERADE  udp  --  10.42.0.0/16        !10.42.0.0/16         masq ports: 1024-65535
MASQUERADE  all  --  10.42.0.0/16        !10.42.0.0/16
MASQUERADE  tcp  --  172.17.0.0/16        0.0.0.0/0            masq ports: 1024-65535
MASQUERADE  udp  --  172.17.0.0/16        0.0.0.0/0            masq ports: 1024-65535
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL tcp dpt:8080 to:172.17.0.2:8080
DNAT       udp  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL udp dpt:4500 to:10.42.220.167:4500
DNAT       udp  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL udp dpt:500 to:10.42.220.167:500
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 to:172.17.0.1:8080
DNAT       udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:4500 to:172.17.0.2:4500
DNAT       udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:500 to:172.17.0.2:500

Notice 172.17.0.1 and 172.17.0.2.
At this point the UI still mostly works, but I do not have access to logs anymore via the UI (stuck at initializing state).
Logs are still working if I use the VPN and access http://172.17.0.1:8080/.

After restarting Docker:

MASQUERADE  tcp  --  172.17.0.1           172.17.0.1           tcp dpt:8080
MASQUERADE  udp  --  172.17.0.2           172.17.0.2           udp dpt:4500
MASQUERADE  udp  --  172.17.0.2           172.17.0.2           udp dpt:500
ACCEPT     all  --  10.42.0.0/16         169.254.169.250
MASQUERADE  tcp  --  10.42.0.0/16        !10.42.0.0/16         masq ports: 1024-65535
MASQUERADE  udp  --  10.42.0.0/16        !10.42.0.0/16         masq ports: 1024-65535
MASQUERADE  all  --  10.42.0.0/16        !10.42.0.0/16
MASQUERADE  tcp  --  172.17.0.0/16        0.0.0.0/0            masq ports: 1024-65535
MASQUERADE  udp  --  172.17.0.0/16        0.0.0.0/0            masq ports: 1024-65535
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL tcp dpt:8080 to:172.17.0.2:8080
DNAT       udp  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL udp dpt:500 to:10.42.220.167:500
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 to:172.17.0.1:8080
DNAT       udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:4500 to:172.17.0.2:4500
DNAT       udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:500 to:172.17.0.2:500

Notice that DNAT udp -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL udp dpt:4500 to:10.42.220.167:4500 disappeared.
At this point, I do not have access to the UI from my public IP. (Still works from http://172.17.0.1:8080/)

Using sudo iptables -F CATTLE_PREROUTING -t nat seems to work.

But I need to restart the "Network Agent", and flush CATTLE_PREROUTING again, each time I want to add a container. Otherwise, the container is stuck in the "Networking" state.

@oomathias you have to restart the network agent every time you deploy a container!? That's not good!
Curious: do you have just one host that runs both server and agent, or do you have any other agent-only hosts, and do they experience the network agent restart problem?

I created a second host on AWS for testing purposes; everything works fine there.

CATTLE_PREROUTING comes back up after any action (stop/add/delete a container), so I need to flush it every time. Restarting the Network Agent doesn't actually seem to be required.

Ok, good to know. Thanks for the info.

I can confirm that Network Agent doesn't need to be restarted.

Chain CATTLE_PREROUTING (1 references)
target     prot opt source               destination
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL tcp dpt:8080 to:172.17.0.2:8080
DNAT       udp  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL udp dpt:4500 to:10.42.209.15:4500
DNAT       udp  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL udp dpt:500 to:10.42.209.15:500

When these rules are up, nothing works on the host, and like I said they come back all the time.
Are these rules supposed to be there? What should I do?

@oomathias what version of docker are you running?
Can you give us the docker version and docker info output.

Client version: 1.7.1
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 786b29d
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 786b29d
OS/Arch (server): linux/amd64
Containers: 9
Images: 168
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 202
 Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.13.0-61-generic
Operating System: Ubuntu 14.04.2 LTS
CPUs: 4
Total Memory: 1.938 GiB
ID: FCVM:OVMA:N42N:DEXL:OLUA:OIEM:K2YT:YIER:6W54:72RE:V4YM:OOQC
WARNING: No swap limit support

For now I run a dirty fix every second: sudo iptables -t nat -D CATTLE_PREROUTING -p tcp -m addrtype --dst-type LOCAL -m tcp --dport 8080 -j DNAT --to-destination 172.17.0.2:8080
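That "dirty fix each second" could be wrapped in a loop along these lines. This is only a sketch: the function names are mine, SERVER_IP must be the rancher/server container's current IP (read it via `docker inspect`), and it needs root/sudo to touch iptables.

```shell
# build the rule spec that cattle keeps reapplying with the stale IP
stale_rule_spec() {
  printf '%s\n' "-p tcp -m addrtype --dst-type LOCAL -m tcp --dport 8080 -j DNAT --to-destination $1:8080"
}

# keep deleting the stale DNAT rule until the underlying bug is fixed;
# word-splitting of the spec is intentional here
flush_stale_rule_loop() {
  while true; do
    sudo iptables -t nat -D CATTLE_PREROUTING $(stale_rule_spec "$1") 2>/dev/null
    sleep 1
  done
}

# usage (runs until interrupted): flush_stale_rule_loop 172.17.0.2
```

This only papers over the symptom; as noted later in the thread, cattle rewrites the chain from its database, so the rule reappears as long as the database holds the old IP.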

I had the same issue today, and after I flushed CATTLE_PREROUTING I could access the rancher container again.

I had the following rules before flushing:

Chain CATTLE_PREROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination         
 1251 75060 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL tcp dpt:8080 to:172.17.0.8:8080
    0     0 DNAT       udp  --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL udp dpt:4500 to:10.42.53.71:4500
    0     0 DNAT       udp  --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL udp dpt:500 to:10.42.53.71:500

The first rule is the one causing the issue; I'm not sure why it survives a RancherOS restart.
My situation is a bit different: there was a power outage yesterday that caused my server to shut down, and when I started it again I hit this issue.

I tried rebooting RancherOS again and the whole chain wasn't there! What is surprising is that rancher-server works fine with the network-agent container.

Containers: 6
Images: 136
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.19.8-ckt1-rancher
Operating System: <unknown>
CPUs: 8
Total Memory: 3.865 GiB
Name: RANCHER-01
ID: PJPJ:AJ2H:6ZHW:LNTA:ZYKQ:GPEY:T3G6:YAVT:HZFX:R4UH:SJZ6:22N4
Docker version 1.7.0, build 851c91a
rancherctl version v0.3.3

Please let me know if you need any other information.

Yep, we have the same problem. I was actually testing some chaos scenarios (random reboot/shutdown) with Rancher.

Can we have more info about this rule? Is it supposed to be there?

I noticed today that even after I flushed the rules they came back, so it seems that cattle is recreating them. I wanted to dig deeper but got busy with other stuff.

However I'll try to check more tomorrow and I'll keep you guys posted.

Another thing we found with @MhdAli is that the data stored in the "instance" DB table isn't updated, especially the primaryIpAddress and IPAddress columns. But changing them to the proper values doesn't help with anything.
Somehow the old IP is being used for iptables even when there's a different one in docker inspect and in the DB.
Still trying to figure out how rancher even knows about the old IP...

So...
We have the old IP address in the ip_address table;
ip_address_nic_map maps this IP address to a network interface;
nic maps this network interface to an instance;
instance.name of the mapped instance is 'rancher-server'.

After changing the problematic ip_address entry everything works as it should!
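A rough sketch of that database workaround follows. Everything here is an assumption pieced together from this thread: the embedded MySQL inside the rancher/server container, the `cattle` database and credentials, the `ip_address.address` column name, and the `<rancher_server_id>`/`<stale_id>` placeholders are all hypothetical; verify them against your own install before running anything.

```shell
# Hypothetical sketch of the DB fix (assumed schema/credentials; verify first!).
# 1. Find the stale entry for the rancher-server instance:
docker exec -i <rancher_server_id> mysql -u cattle -pcattle cattle \
  -e "SELECT id, address FROM ip_address WHERE address LIKE '172.17.%';"
# 2. Point it at the container's current IP (from docker inspect):
docker exec -i <rancher_server_id> mysql -u cattle -pcattle cattle \
  -e "UPDATE ip_address SET address = '172.17.0.2' WHERE id = <stale_id>;"
```

Once the row holds the container's real IP, cattle rewrites the CATTLE_PREROUTING chain with a correct DNAT target instead of the stale one.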

This just happened to me. Grrr. Rebooted host and rancher mgmt server is toast. Mine is showing this:

MASQUERADE tcp -- 172.17.0.3 172.17.0.3 tcp dpt:8080
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL tcp dpt:8080 to:172.17.0.1:8080
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.3:8080

it happened to me every time I restarted rancher-server container manually, I didn't try restarting it from the UI though.

However, can you check the container IP? Is it 172.17.0.3?

Yes, mine is 172.17.0.3. For some reason that 172.17.0.1 is showing up when running iptables -L -n -t nat | grep 8080

If you flush the CATTLE_PREROUTING chain it'll help, but only for a few minutes, because something updates it again from the database; I believe it's the network container.

Has anyone determined if this is just a bug or possibly the way we are running our setup? Does the agent have anything to do with issues with the mgmt server running on the same host? Like you, I installed the agent with CATTLE_AGENT_IP=X.X.X.X on that particular host.

@kpelt It is definitely caused by the agent and the server being on the same host. CATTLE_PREROUTING is created by the agent to wire up our networking solution.

I'll take some more time today to attempt to reproduce.

Ok, good to know. I am going to keep my mgmt server on its own host for now so I can continue working on the glusterfs, percona and wordpress setup you guys did as I am doing a POC to introduce this technology to my company.

Awesome! Will keep everyone updated on this thread.

@cjellick Who's responsible for updating the CATTLE_PREROUTING chain?

Steps to reproduce:
  1. docker run rancher/server as normal on an ec2 host
  2. docker run rancher/agent as normal
  3. deploy a container via rancher and map a port
  4. restart rancher server
    result:
    rancher becomes unavailable on the public IP
Cause:
  • When rancher-server is restarted, it gets a new docker IP. Typically when a container is restarted, rancher detects that and updates the IP in the rancher database. But, since rancher-server is down, it cannot receive the event to detect that its IP has changed, so the old IP is kept in rancher.
  • When a change is made that causes the host's iptables rules to change, we completely rewrite all the rules for the host. Since we have the old ip for rancher-server, we rewrite the rule with the old IP.
Temporary, very bad work around:

To get rancher-server working again, delete that specific rule (iptables -t nat -D CATTLE_PREROUTING 1, assuming the rule is number "1"), or remove them all with iptables -F CATTLE_PREROUTING -t nat (which can have temporary bad side effects). You'd need to keep doing this, because rancher keeps reapplying the rules.

Proposed fix:

Discussed with @ibuildthecloud, and fundamentally, rancher should not be rewriting rules for containers that are not on the rancher network. Originally, we did this as a feature: the idea was that it would allow users to dynamically update container ports even if they were not on the Rancher network. But really, if a user has made the explicit decision not to use the rancher network for a container, then we should just allow docker networking to do its job.

So, we'll make it so. We need to remove the Port Service from the docker network in the rancher database.

@MhdAli does the above synopsis answer your question?

This sounds good to me. When can we expect the next release?

We release weekly. This week's release won't have the fix, but we'll try to get it scheduled into next week's release.

@kpelt We aim to release weekly. If it doesn't pass our QA, then we sometimes skip a week.

Thanks for the information.

As I summarized in my last two comments, the best workaround that I found is to manually update the database record. But sure, you can go flushing the chain every time rancher gets it wrong.

@rmwpl I missed your summary. Excellent analysis and a much better workaround. Thank you!

Tested with server build - v0.34.0-rc3

docker run rancher/server on any host
On the same host, run rancher/agent
Deploy a container via the rancher server UI with a port map.
Restart rancher server
After rancher server is restarted, we are still able to access it successfully.

Confirmed, those are the proper steps to repro/prove it's fixed.

reproduced this in rancher 1.4.1:

deploy several stacks with port maps and load balancers

systemctl restart docker

rancher resets connection, load balancers and stacks continue to work and are accessible

workaround:

sudo iptables -F CATTLE_PREROUTING -t nat
systemctl restart docker

I need help with my rancher server too. I restarted my VM, which runs both rancher server and docker. When I type my IP address in the browser it doesn't go anywhere, and when I use your examples it always says "Cannot connect to the Docker daemon. Is the docker daemon running?"
