I'm on Ubuntu 14.04.2 LTS
I did docker run -d --restart=always -p 8080:8080 rancher/server
I was able to connect to the Rancher UI with my server's public IP: http://PUBLIC_IP:8080
Then if I reboot the server, I get an ERR_CONNECTION_TIMED_OUT.
I have a VPN installed, and the UI is still responding on the container IP. http://172.17.0.10:8080/
I tried again with docker run -d --restart=always -p 8090:8080 rancher/server, same thing happened.
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ec7bdcc144ac rancher/server "/usr/bin/s6-svscan 24 minutes ago Up 15 minutes 3306/tcp, 0.0.0.0:8090->8080/tcp loving_wright
de3a8ef8cb23 rancher/agent-instance:v0.3.1 "/etc/init.d/agent-i About an hour ago Up 34 minutes 0.0.0.0:500->500/udp, 0.0.0.0:4500->4500/udp acbc14ec-de64-4b8c-8fd8-3d2a8de39714
944718a36aa4 rancher/agent:v0.7.11 "/run.sh run" About an hour ago Up 36 minutes rancher-agent
5e8e126e9219 rancher/server "/usr/bin/s6-svscan About an hour ago Up 36 minutes 3306/tcp, 0.0.0.0:8080->8080/tcp dreamy_morse
docker logs 5e8e126e9219
15:46:07.513 [main] INFO ConsoleStatus - [DONE ] [95213ms] Startup Succeeded, Listening on port 8081
time="2015-07-31T15:46:09Z" level=info msg="Starting websocket proxy. Listening on [:8080], Proxying to cattle API at [localhost:8081], Monitoring parent pid [7]."
time="2015-07-31T15:46:09Z" level=info msg="Setting log level" logLevel=info
time="2015-07-31T15:46:09Z" level=info msg="Starting go-machine-service..." gitcommit=102d311
time="2015-07-31T15:46:10Z" level=info msg="Initializing event router" workerCount=10
time="2015-07-31T15:46:12Z" level=info msg="Connection established"
2015-07-31 15:46:16,730 ERROR [:] [] [] [] [ecutorService-5] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [1] count [1]
2015-07-31 15:46:21,718 ERROR [:] [] [] [] [ecutorService-2] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [1] count [2]
time="2015-07-31T15:46:25Z" level=info msg="Registering backend for host [d448658b-823a-4fc2-a654-93a35283d7d0]"
2015-07-31 15:46:26,806 ERROR [:] [] [] [] [ecutorService-9] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [1] count [1]
Are you restarting the rancher/server container or are you rebooting the machine?
I'm rebooting the machine.
I have no issues with rebooting the machine. After a reboot, the container is restarted and you need to wait for rancher server to come up. That usually takes ~2 minutes.
Can you do a docker logs -f <container_id> after the machine is up again?
When you see this in the logs, the UI should be accessible?
time="2015-07-31T20:33:30Z" level=info msg="Starting websocket proxy. Listening on [:8080], Proxying to cattle API at [localhost:8081], Monitoring parent pid [9]."
Unfortunately not, I do have the same problem even if I just restart the container after a fresh install.
I replaced my IP by PUBLIC_IP
20:36:55.322 [main] INFO ConsoleStatus - [DONE ] [96906ms] Startup Succeeded, Listening on port 8081
time="2015-07-31T20:36:56Z" level=info msg="Starting websocket proxy. Listening on [:8080], Proxying to cattle API at [localhost:8081], Monitoring parent pid [7]."
time="2015-07-31T20:36:56Z" level=info msg="Setting log level" logLevel=info
time="2015-07-31T20:36:56Z" level=info msg="Starting go-machine-service..." gitcommit=102d311
2015-07-31 20:37:03,873 ERROR [:] [] [] [] [ecutorService-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [1] count [1]
time="2015-07-31T20:37:08Z" level=info msg="Registering backend for host [d448658b-823a-4fc2-a654-93a35283d7d0]"
2015-07-31 20:37:09,837 ERROR [:] [] [] [] [ecutorService-3] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [1] count [1]
time="2015-07-31T20:37:26Z" level=error msg="Unable to create EventRouter" Err="Get http://PUBLIC_IP:8080/v1: dial tcp PUBLIC_IP:8080: i/o timeout"
time="2015-07-31T20:37:26Z" level=info msg="Exiting go-machine-service..."
time="2015-07-31T20:37:27Z" level=info msg="Setting log level" logLevel=info
time="2015-07-31T20:37:27Z" level=info msg="Starting go-machine-service..." gitcommit=102d311
@oomathias If you are SSH'd into the machine where rancher/server is running, will the API work over localhost?
Specifically, does curl http://127.0.0.1:8080/v1 successfully return some json or does it timeout?
@cjellick It works.
curl -i http://127.0.0.1:8080/v1
HTTP/1.1 401 Unauthorized
Content-Length: 158
Content-Type: application/json; charset=utf-8
Server: Jetty(8.1.11.v20130520)
Www-Authenticate: Basic realm="Enter API access key and secret key as username and password"
X-Api-Schemas: http://127.0.0.1:8080/v1/schemas
Date: Fri, 31 Jul 2015 21:17:15 GMT
{"id":"4f9dc04b-ed2c-40d2-bc23-2f014e61511f","type":"error","links":{},"actions":{},"status":401,"code":"Unauthorized","message":"Unauthorized","detail":null}%
Hm. Ok. So, rancher is running, the port is bound to the host, and it is reachable over localhost, but not over the public IP.
This is perhaps a dumb question, but is it possible that you were using an ephemeral IP that changed when you rebooted the host?
Nope, this is a dedicated server :) I don't even have to reboot the host, if I do:
docker run -d --restart=always -p 8080:8080 rancher/server
It works, and then if I restart by doing
service docker restart
It doesn't work anymore.
level=error msg="Unable to create EventRouter" Err="Get http://PUBLIC_IP:8080/v1: dial tcp PUBLIC_IP:8080: i/o timeout"
level=error msg="Unable to create EventRouter" Err="Get http://PUBLIC_IP:8080/v1: dial tcp PUBLIC_IP:8080: no route to host"
looks like the source of the problem, no?
That is actually a symptom, not the source. There is a microservice called "go-machine-service" running inside the rancher container that attempts to connect back to the API service over a websocket. That log line is the microservice failing to connect because it can't reach the API server over that IP.
On the server where you are running rancher/server, are you also running rancher/agent?
Yes, I do run rancher/agent.
I added the agent with docker run -e CATTLE_AGENT_IP=PUBLIC_IP -d --privileged -v /var/run/docker.sock:/var/run/docker.sock rancher/agent:v0.7.11 http://PUBLIC_IP:8080/v1/scripts/XXX:XXX:XXX
Hm, ok, I think this might be an iptables issue. Can you share the output of
iptables -L -n -t nat | grep 8080 and docker inspect <id of rancher/server container> | grep IPAddress?
My suspicion is that the 172 address from the iptables command will not match the one from the inspect command. When you restart rancher/server, its IP changes, but that rule doesn't get cleaned up.
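A small shell sketch of that comparison. The rule and inspect lines are hard-coded samples here so the parsing logic is clear; on a live host you'd feed in the actual output of `iptables -L -n -t nat | grep 8080` and `docker inspect <id> | grep IPAddress` instead:

```shell
# Compare the DNAT target IP against the container's current IP.
# Sample lines stand in for the real command output.
rule_line='DNAT  tcp -- 0.0.0.0/0  0.0.0.0/0  tcp dpt:8080 to:172.17.0.4:8080'
inspect_line='"IPAddress": "172.17.0.1"'

# Extract the IP the DNAT rule forwards to.
rule_ip=$(printf '%s\n' "$rule_line" | sed -n 's/.*to:\([0-9.]*\):8080.*/\1/p')
# Extract the container's current IP from the inspect output.
container_ip=$(printf '%s\n' "$inspect_line" | sed -n 's/.*"IPAddress": "\([0-9.]*\)".*/\1/p')

if [ "$rule_ip" != "$container_ip" ]; then
  echo "stale DNAT rule: forwards to $rule_ip but container is at $container_ip"
fi
```

With the sample lines above this reports a stale rule, since the rule still points at 172.17.0.4 while the container has moved to 172.17.0.1.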
Before
MASQUERADE tcp -- 172.17.0.4 172.17.0.4 tcp dpt:8080
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.4:8080
"IPAddress": "172.17.0.4"
After restarting the OS
MASQUERADE tcp -- 172.17.0.1 172.17.0.1 tcp dpt:8080
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.1:8080
"IPAddress": "172.17.0.1"
After restarting Docker
MASQUERADE tcp -- 172.17.0.2 172.17.0.2 tcp dpt:8080
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.2:8080
"IPAddress": "172.17.0.2"
Hmph. Everything appears to be in order there. Well, it was worth a shot.
@oomathias what does your route table look like? You mentioned you were on a VPN; is there any chance your private IP space is overlapping with the docker 172.17.0.0/16 route?
I made some new tests.
My VPN address: MASQUERADE all -- 10.8.0.0/16 0.0.0.0/0, so it looks OK to me.
But more interesting, when I install Rancher and add the agent, I can reboot and everything works.
It's when the Network Agent is created, after adding a basic ubuntu:14.04.2 container, that something goes wrong.
Initial state (rancher-server + agent):
MASQUERADE tcp -- 172.17.0.1 172.17.0.1 tcp dpt:8080
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.1:8080
I can reboot the OS and/or Docker, everything still works.
Then if I add a container:
MASQUERADE tcp -- 172.17.0.1 172.17.0.1 tcp dpt:8080
MASQUERADE udp -- 172.17.0.2 172.17.0.2 udp dpt:4500
MASQUERADE udp -- 172.17.0.2 172.17.0.2 udp dpt:500
ACCEPT all -- 10.42.0.0/16 169.254.169.250
MASQUERADE tcp -- 10.42.0.0/16 !10.42.0.0/16 masq ports: 1024-65535
MASQUERADE udp -- 10.42.0.0/16 !10.42.0.0/16 masq ports: 1024-65535
MASQUERADE all -- 10.42.0.0/16 !10.42.0.0/16
MASQUERADE tcp -- 172.17.0.0/16 0.0.0.0/0 masq ports: 1024-65535
MASQUERADE udp -- 172.17.0.0/16 0.0.0.0/0 masq ports: 1024-65535
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL tcp dpt:8080 to:172.17.0.2:8080
DNAT udp -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL udp dpt:4500 to:10.42.220.167:4500
DNAT udp -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL udp dpt:500 to:10.42.220.167:500
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.1:8080
DNAT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:4500 to:172.17.0.2:4500
DNAT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:500 to:172.17.0.2:500
Notice 172.17.0.1 and 172.17.0.2.
At this point the UI still mostly works, but I no longer have access to container logs via the UI (stuck in the Initializing state).
Logs are still working if I use the VPN and access http://172.17.0.1:8080/.
After restarting Docker:
MASQUERADE tcp -- 172.17.0.1 172.17.0.1 tcp dpt:8080
MASQUERADE udp -- 172.17.0.2 172.17.0.2 udp dpt:4500
MASQUERADE udp -- 172.17.0.2 172.17.0.2 udp dpt:500
ACCEPT all -- 10.42.0.0/16 169.254.169.250
MASQUERADE tcp -- 10.42.0.0/16 !10.42.0.0/16 masq ports: 1024-65535
MASQUERADE udp -- 10.42.0.0/16 !10.42.0.0/16 masq ports: 1024-65535
MASQUERADE all -- 10.42.0.0/16 !10.42.0.0/16
MASQUERADE tcp -- 172.17.0.0/16 0.0.0.0/0 masq ports: 1024-65535
MASQUERADE udp -- 172.17.0.0/16 0.0.0.0/0 masq ports: 1024-65535
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL tcp dpt:8080 to:172.17.0.2:8080
DNAT udp -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL udp dpt:500 to:10.42.220.167:500
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.1:8080
DNAT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:4500 to:172.17.0.2:4500
DNAT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:500 to:172.17.0.2:500
Notice that DNAT udp -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL udp dpt:4500 to:10.42.220.167:4500 disappeared.
At this point, I do not have access to the UI from my public IP. (Still works from http://172.17.0.1:8080/)
Using sudo iptables -F CATTLE_PREROUTING -t nat seems to work.
But I need to restart the "Network Agent", and flush CATTLE_PREROUTING again, each time I want to add a container. Otherwise, the container is stuck in the "Networking" state.
@oomathias you have to restart the network agent every time you deploy a container!? That's not good!
Curious: do you have just one host running both server and agent, or do you have other agent-only hosts? If so, do they experience the network agent restart problem?
I created a second host on AWS for testing purposes; everything works fine there.
CATTLE_PREROUTING comes back up after any action (stop/add/delete a container), so I need to flush it every time. Restarting the Network Agent doesn't seem to be required after all.
Ok, good to know. Thanks for the info.
I can confirm that the Network Agent doesn't need to be restarted.
Chain CATTLE_PREROUTING (1 references)
target prot opt source destination
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL tcp dpt:8080 to:172.17.0.2:8080
DNAT udp -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL udp dpt:4500 to:10.42.209.15:4500
DNAT udp -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL udp dpt:500 to:10.42.209.15:500
When these rules are up, nothing works on the host, and as I said, they keep coming back.
Are these rules supposed to be there? What should I do?
@oomathias what version of docker are you running?
Can you give us the docker version and docker info output?
Client version: 1.7.1
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 786b29d
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 786b29d
OS/Arch (server): linux/amd64
Containers: 9
Images: 168
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 202
Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.13.0-61-generic
Operating System: Ubuntu 14.04.2 LTS
CPUs: 4
Total Memory: 1.938 GiB
ID: FCVM:OVMA:N42N:DEXL:OLUA:OIEM:K2YT:YIER:6W54:72RE:V4YM:OOQC
WARNING: No swap limit support
For now I run a dirty fix every second: sudo iptables -t nat -D CATTLE_PREROUTING -p tcp -m addrtype --dst-type LOCAL -m tcp --dport 8080 -j DNAT --to-destination 172.17.0.2:8080
I had the same issue today, and after I flushed CATTLE_PREROUTING I could access the rancher container again.
I had the rules below before flushing:
Chain CATTLE_PREROUTING (1 references)
pkts bytes target prot opt in out source destination
1251 75060 DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL tcp dpt:8080 to:172.17.0.8:8080
0 0 DNAT udp -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL udp dpt:4500 to:10.42.53.71:4500
0 0 DNAT udp -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL udp dpt:500 to:10.42.53.71:500
The first rule is the one causing the issue; I'm not sure why it survives a RancherOS restart.
My situation is a bit different: there was a power outage yesterday that caused the server to shut down, and when I started it again I got this issue.
I tried rebooting RancherOS again and the whole chain wasn't there! What is surprising is that rancher-server works fine with the network-agent container.
Containers: 6
Images: 136
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.19.8-ckt1-rancher
Operating System: <unknown>
CPUs: 8
Total Memory: 3.865 GiB
Name: RANCHER-01
ID: PJPJ:AJ2H:6ZHW:LNTA:ZYKQ:GPEY:T3G6:YAVT:HZFX:R4UH:SJZ6:22N4
Docker version 1.7.0, build 851c91a
rancherctl version v0.3.3
Please let me know if you need any other information.
Yep, we have the same problem. I was actually testing some chaos scenarios (random reboot/shutdown) with Rancher.
Can we have more info about this rule? Is it supposed to be there?
I noticed today that even after I flushed the rules they came back, so it seems that cattle is creating them again. I wanted to dig deeper but got busy with other stuff.
However I'll try to check more tomorrow and I'll keep you guys posted.
Another thing we found with @MhdAli is that the data stored in "instance" DB table isn't updated - especially primaryIpAddress and IPAddress columns. But changing it to proper values doesn't help with anything.
Somehow the old IP is being used for iptables even when there's different one in docker inspect and in the DB.
Still trying to figure out, how does rancher even know about the old IP...
So...
We have the old IP address in ip_address table;
ip_address_nic_map maps this ip address to a network interface;
nic maps this network interface to an instance;
instance.name of the mapped instance is 'rancher-server'.
After changing the problematic ip_address entry everything works as it should!
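The chain described above can be followed with a query along these lines. This is only a sketch against the cattle MySQL database: the table names come from the comments above, but the join columns and the `address` column are assumptions, so verify them against your schema before running anything:

```sql
-- Sketch: locate the stale ip_address row for the rancher-server instance.
-- Join-column names (ip_address_id, nic_id, instance_id) are assumed.
SELECT ip.id, ip.address
FROM ip_address ip
JOIN ip_address_nic_map m ON m.ip_address_id = ip.id
JOIN nic n ON n.id = m.nic_id
JOIN instance i ON i.id = n.instance_id
WHERE i.name = 'rancher-server';

-- Then point it at the container's current IP (from docker inspect), e.g.:
-- UPDATE ip_address SET address = '172.17.0.2' WHERE id = <id from above>;
```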
This just happened to me. Grrr. Rebooted host and rancher mgmt server is toast. Mine is showing this:
MASQUERADE tcp -- 172.17.0.3 172.17.0.3 tcp dpt:8080
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL tcp dpt:8080 to:172.17.0.1:8080
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.3:8080
It happened to me every time I restarted the rancher-server container manually; I didn't try restarting it from the UI though.
However, can you check the container IP? Is it 172.17.0.3?
Yes, mine is 172.17.0.3. For some reason that 172.17.0.1 is showing up when running iptables -L -n -t nat | grep 8080
If you flush the CATTLE_PREROUTING chain it'll help, but only for a few minutes, because something updates it again from the database; I believe it's the network container.
Has anyone determined if this is just a bug, or possibly the way we are running our setup? Does the agent have anything to do with the issues when the mgmt server is running on the same host? Like you, I installed the agent with CATTLE_AGENT_IP=X.X.X.X on that particular host.
@kpelt It is definitely caused by the agent and the server being on the same host. CATTLE_PREROUTING is created by the agent to wire up our networking solution.
I'll take some more time today to attempt to reproduce.
Ok, good to know. I am going to keep my mgmt server on its own host for now so I can continue working on the glusterfs, percona and wordpress setup you guys did as I am doing a POC to introduce this technology to my company.
Awesome! Will keep everyone updated on this thread.
@cjellick Who's responsible for updating the CATTLE_PREROUTING chain?
To get rancher-server working again, delete that specific rule: iptables -t nat -D CATTLE_PREROUTING 1 (assuming the rule is number "1"), or remove them all (which can have temporary bad side effects): iptables -F CATTLE_PREROUTING -t nat. But you'd need to keep doing this, because rancher keeps reapplying the rules.
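A slightly safer variant of the per-rule delete is to look the rule number up instead of assuming it is 1. This sketch parses `iptables -t nat -L CATTLE_PREROUTING -n --line-numbers`-style output (hard-coded here as a sample) and prints the delete command rather than running it:

```shell
# Sample output of: iptables -t nat -L CATTLE_PREROUTING -n --line-numbers
chain_output='num  target  prot opt source     destination
1    DNAT    tcp  --  0.0.0.0/0  0.0.0.0/0  ADDRTYPE match dst-type LOCAL tcp dpt:8080 to:172.17.0.2:8080
2    DNAT    udp  --  0.0.0.0/0  0.0.0.0/0  ADDRTYPE match dst-type LOCAL udp dpt:4500 to:10.42.220.167:4500'

# Find the line number of the first rule matching tcp port 8080.
rule_num=$(printf '%s\n' "$chain_output" | awk '/dpt:8080/ {print $1; exit}')

# Print (rather than execute) the matching delete command.
echo "iptables -t nat -D CATTLE_PREROUTING $rule_num"
```

On a real host you would pipe the live chain listing in and run the printed command under sudo; keep in mind rancher will reapply the rule until the underlying fix lands.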
Discussed with @ibuildthecloud, and fundamentally, rancher should not be rewriting rules for containers that are not on the rancher network. Originally, we did this as a feature. The idea was that it would allow users to dynamically update container ports even if they were not on the Rancher network. But really, if a user has made the explicit decision to not use the rancher network for a container, then we should just let docker networking do its job.
So, we'll make it so. We need to remove the Port Service from the docker network in the rancher database.
@MhdAli does the above synopsis answer your question?
This sounds good to me. When can we expect the next release?
We release weekly. This week's release won't have the fix, but we'll try to get it scheduled into next week's release.
@kpelt We aim to release weekly. If it doesn't pass our QA, then we sometimes skip a week.
Thanks for the information.
As I summarized in my last two comments, the best workaround that I found is to manually update the database record. But sure, you can go flushing the chain every time rancher gets it wrong.
@rmwpl I missed your summary. Excellent analysis and a much better workaround. Thank you!
Tested with server build - v0.34.0-rc3
docker run rancher/server on any host
On the same host, run rancher/agent.
Deploy a container via the rancher server UI with a port mapping.
Restart rancher server
After rancher server is restarted, we are still able to access rancher server successfully.
Confirmed, those are the proper steps to repro and prove it's fixed.
reproduced this in rancher 1.4.1:
deploy several stacks with port maps and load balancers
systemctl restart docker
rancher resets connection, load balancers and stacks continue to work and are accessible
workaround:
sudo iptables -F CATTLE_PREROUTING -t nat
systemctl restart docker
I need help with my rancher server also. I restarted my VM, which runs docker and the rancher server. When I type my IP address in the browser it doesn't go anywhere, and when I use your examples it always says "Cannot connect to the Docker daemon. Is the docker daemon running?"