Rancher Version: 1.0.1
Docker Version: 1.10.3
OS and where are the hosts located?: Debian GNU/Linux 8.4 (jessie) bare-metal
Setup Details: multiple nodes
Environment Type: Cattle
The Rancher virtual network (10.42.*) suddenly stopped working on one node; how can I debug it? Nothing out of the ordinary (from what I can tell) in the agent-instance rancher-net.log / charon.log.
Further details: here's a traceroute from a working node to a service on another node:
traceroute to 10.42.215.27 (10.42.215.27), 30 hops max, 60 byte packets
1 10.42.248.40 (10.42.248.40) 0.055 ms 0.019 ms 0.015 ms
2 10.42.148.180 (10.42.148.180) 0.420 ms 0.360 ms 0.348 ms
3 10.42.215.27 (10.42.215.27) 0.304 ms 0.326 ms 0.320 ms
Here the first hop is the local agent-instance and the second is the remote agent-instance.
Non-working node:
traceroute to 10.42.215.27 (10.42.215.27), 30 hops max, 60 byte packets
1 10.42.112.38 (10.42.112.38) 0.050 ms 0.020 ms 0.015 ms
2 * * *
3 * * *
4 * * *
5 * * *
6 * * *
7 *^C
Once again, the first hop is the local agent-instance, so it appears the packets get stuck in the agent.
Any suggestions how to continue debugging?
I'm having a similar issue. I have a database cluster on a few nodes; when I launch new containers linked to the DB containers on other nodes, they cannot reach the DB containers. Restarting the Rancher network containers on both the database hosts and the client host will often allow them to reach each other. Another symptom is that upgrading containers on the problematic hosts often gets stuck at the "networking" stage. Again, restarting the networking container will allow it to proceed. This is highly repeatable and has been happening for weeks now. I don't see anything too strange in the Rancher agent/server logs. Any help in debugging Rancher networking would be greatly appreciated.
I'm a colleague of deadbeef84. We've found that the VPN client process inside the network agent crashed and didn't restart...
@ronag do you mind expanding on how you diagnosed this? Log messages or inspecting the processes in the container?
I too work with the machines on which this happened...
The process did not actually crash; somehow connectivity between two particular hosts on the Rancher network just mysteriously died, whereas on the "real" network outside the containers these hosts can reach each other with no problems.
We concluded this by simply pinging all IPs on the Rancher network from both of these hosts. Since other hosts are reachable, I believe the process/agent/Rancher thinks everything is fine, so no steps are taken to attempt any form of recovery. That would probably explain why there's nothing noticeable in the logs either.
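For anyone wanting to repeat that check, the ping sweep can be sketched roughly like this (hedged sketch: the three addresses are just examples pulled from this thread; substitute the container IPs from your own environment, and run it from inside the network agent on each suspect host):

```shell
# Probe a list of known 10.42.x container IPs and report which peers answer.
# The addresses below are examples from this thread, not a real inventory.
for ip in 10.42.215.27 10.42.148.180 10.42.96.68; do
    if ping -c 1 -W 1 "$ip" >/dev/null 2>&1; then
        echo "$ip reachable"
    else
        echo "$ip UNREACHABLE"
    fi
done
```

A host whose agent reports every SA as ESTABLISHED but shows some peers UNREACHABLE here matches the symptom described above.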
@retardtank: If you still have the setup in the problematic state, could you please check the status of the IPSec tunnels on all of the hosts?
For every host:
- sudo docker exec -it $(sudo docker ps | grep agent-instance | awk '{print $1;}') /bin/bash
- swanctl --list-sas
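For reference, here's a minimal sketch of the container-ID extraction those two steps rely on. The `docker ps` sample text and container ID below are fabricated for illustration; on a real host you would pipe the actual `sudo docker ps` output instead:

```shell
# Illustrative only: pick the agent-instance container ID out of `docker ps`-style
# output, then exec swanctl inside it. The sample output and ID are made up.
sample_ps='CONTAINER ID   IMAGE
f00dfeed1234   rancher/agent-instance:v0.8.1
cafebabe5678   nginx:latest'
agent_id=$(printf '%s\n' "$sample_ps" | grep agent-instance | awk '{print $1}')
echo "$agent_id"
# On a real host: sudo docker exec -it "$agent_id" swanctl --list-sas
```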
@leodotcloud Thanks, it's still in that state...
While I don't know much about strongSwan, as far as I can tell it still looks OK. Output from the two hosts that have problems, 10.206.101.11 and 10.206.101.41:
conn-10.206.101.12: #2683, ESTABLISHED, IKEv2, 8dca984d33582852:af68949b667c47df
local '172.17.0.3' @ 172.17.0.3
remote '172.17.0.2' @ 10.206.101.12
AES_CBC-128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
established 1789s ago, rekeying in 11488s
conn-10.206.101.42: #2682, ESTABLISHED, IKEv2, 03dfb284ae3a0436:6771eb4aa5a5142a
local '172.17.0.3' @ 172.17.0.3
remote '172.17.0.2' @ 10.206.101.42
AES_CBC-128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
established 2062s ago, rekeying in 10980s
child-10.206.101.42: #9, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/MODP_2048
installed 3318956s ago
in ce54523c, 1612594322872 bytes, 1480976972 packets
out cb8e96e7, 1024694633194 bytes, 1181971103 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.206.101.42: #2681, ESTABLISHED, IKEv2, 2d83053fbe08b05b:7f69ad22faffa3af
local '172.17.0.3' @ 172.17.0.3
remote '172.17.0.2' @ 10.206.101.42
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 4160s ago, rekeying in 9321s
child-10.206.101.42: #15, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 3318956s ago
in cd49be69, 448 bytes, 8 packets
out c6700d14, 240 bytes, 4 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.206.101.42: #2680, ESTABLISHED, IKEv2, 3a60d23c914b85b5:4ad8cbd1ed5a7ac8
local '172.17.0.3' @ 172.17.0.3
remote '172.17.0.2' @ 10.206.101.42
AES_CBC-128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
established 7623s ago, rekeying in 6039s
conn-10.206.101.41: #2679, ESTABLISHED, IKEv2, c30a18e60f609be2:9d06e622eea20aed
local '172.17.0.3' @ 172.17.0.3
remote '172.17.0.2' @ 10.206.101.41
AES_CBC-128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
established 8571s ago, rekeying in 4793s
conn-10.206.101.41: #2678, ESTABLISHED, IKEv2, 8b9d83f2029f1c25:892683fd74ff5c30
local '172.17.0.3' @ 172.17.0.3
remote '172.17.0.2' @ 10.206.101.41
AES_CBC-128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
established 9043s ago, rekeying in 4967s
child-10.206.101.41: #16, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/MODP_2048
installed 3318952s ago
in c1fd0f71, 3944170765020 bytes, 4284922670 packets
out c2f6b4ca, 4292441562334 bytes, 4220038060 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.206.101.41: #2677, ESTABLISHED, IKEv2, 01b563448c913d54:8ce51d02282c5386
local '172.17.0.3' @ 172.17.0.3
remote '172.17.0.2' @ 10.206.101.41
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 9627s ago, rekeying in 4044s
child-10.206.101.41: #13, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 3318956s ago
in c319551e, 448 bytes, 8 packets
out c8075691, 240 bytes, 4 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.206.101.12: #2676, ESTABLISHED, IKEv2, 934b300292ab5d59:5e26300197290fba
local '172.17.0.3' @ 172.17.0.3
remote '172.17.0.2' @ 10.206.101.12
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 10351s ago, rekeying in 3502s
child-10.206.101.12: #14, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 3318956s ago
in c6486314, 448 bytes, 8 packets
out c2075e09, 240 bytes, 4 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.206.101.41: #2675, ESTABLISHED, IKEv2, aae5ef19a175c546:f40f3c8fe236ae92
local '172.17.0.3' @ 172.17.0.3
remote '172.17.0.2' @ 10.206.101.41
AES_CBC-128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
established 11189s ago, rekeying in 1798s
child-10.206.101.41: #1, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128
installed 3943892s ago
in c04267b1, 12544 bytes, 224 packets
out c10969e9, 6720 bytes, 112 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.206.101.12: #2674, ESTABLISHED, IKEv2, 941bfe1ff1e1a43d:dd6a4ca09907d481
local '172.17.0.3' @ 172.17.0.3
remote '172.17.0.2' @ 10.206.101.12
AES_CBC-128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
established 12129s ago, rekeying in 1171s
child-10.206.101.12: #17, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/MODP_2048
installed 3318952s ago
in c8e9b3de, 949043022135 bytes, 811765266 packets
out cda9e1ef, 421287480020 bytes, 563766865 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.206.101.42: #2683, ESTABLISHED, IKEv2, 72b2facdc3a6e2ad:acbfc05b89953f53
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.206.101.42
AES_CBC-128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
established 1884s ago, rekeying in 11607s
child-10.206.101.42: #11, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/MODP_2048
installed 3317727s ago
in c1b2fc18, 756 bytes, 9 packets
out c6bcdd7b, 756 bytes, 9 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.206.101.42: #2682, ESTABLISHED, IKEv2, 5f77aef0a40a2b30:f73e59bdaa7f6931
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.206.101.42
AES_CBC-128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
established 3375s ago, rekeying in 10245s
conn-10.206.101.12: #2681, ESTABLISHED, IKEv2, ef6aee85f8dce943:d599ffa3705b85aa
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.206.101.12
AES_CBC-128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
established 4642s ago, rekeying in 8904s
conn-10.206.101.42: #2680, ESTABLISHED, IKEv2, 8cdc4a80958e5e1c:ebc609d0e0e31fbb
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.206.101.42
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 5704s ago, rekeying in 8671s
child-10.206.101.42: #17, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 3317726s ago
in cb82488d, 0 bytes, 0 packets
out ce697e31, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.206.101.11: #2679, ESTABLISHED, IKEv2, c30a18e60f609be2:9d06e622eea20aed
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 10.206.101.11
AES_CBC-128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
established 7342s ago, rekeying in 6193s
conn-10.206.101.11: #2678, ESTABLISHED, IKEv2, 8b9d83f2029f1c25:892683fd74ff5c30
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 10.206.101.11
AES_CBC-128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
established 7814s ago, rekeying in 6354s
child-10.206.101.11: #19, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/MODP_2048
installed 3317722s ago, rekeying in 25430026s, expires in 31371878s
in c2f6b4ca, 4282742024587 bytes, 4210989690 packets
out c1fd0f71, 3954923666520 bytes, 4294967295 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.206.101.11: #2677, ESTABLISHED, IKEv2, 01b563448c913d54:8ce51d02282c5386
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 10.206.101.11
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 8397s ago, rekeying in 5621s
child-10.206.101.11: #18, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 3317726s ago, rekeying in 26495034s, expires in 31371874s
in c8075691, 240 bytes, 4 packets
out c319551e, 448 bytes, 8 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.206.101.11: #2676, ESTABLISHED, IKEv2, aae5ef19a175c546:f40f3c8fe236ae92
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 10.206.101.11
AES_CBC-128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
established 9959s ago, rekeying in 4389s
child-10.206.101.11: #3, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128
installed 3942663s ago, rekeying in 25549035s, expires in 30746937s
in c10969e9, 6720 bytes, 112 packets
out c04267b1, 12544 bytes, 224 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.206.101.12: #2675, ESTABLISHED, IKEv2, abbdf0449228ff86:c40f14dcbef9d587
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.206.101.12
AES_CBC-128/HMAC_SHA1_96/PRF_HMAC_SHA1/MODP_2048
established 11714s ago, rekeying in 2287s
child-10.206.101.12: #15, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/MODP_2048
installed 3317726s ago, rekeying in 25277805s, expires in 31371874s
in cef7b5bd, 0 bytes, 0 packets
out c9955f6b, 420 bytes, 5 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.206.101.12: #2674, ESTABLISHED, IKEv2, acca3dc58fffa7e2:1cc9420fda1f6590
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.206.101.12
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 13423s ago, rekeying in 820s
child-10.206.101.12: #16, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 3317726s ago, rekeying in 25466860s, expires in 31371874s
in cf959c40, 420 bytes, 5 packets
out c17a2ef5, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
I have a setup in a problematic state as well.
Machine 1 has a Rancher-managed DB container; Machine 2 has client containers with service links to the DB on Machine 1.
Existing containers on Machine 2 can reach Machine 1.
Tracepath from existing container on machine 2 to DB container:
# tracepath 10.42.96.68
1?: [LOCALHOST] pmtu 1500
1: no reply
2: ip-10-42-160-84.ec2.internal 0.167ms pmtu 1422
2: ip-10-42-84-156.ec2.internal 0.518ms
3: ip-10-42-96-68.ec2.internal 0.428ms reached
Resume: pmtu 1422 hops 3 back 3
However, new containers on Machine 2 cannot reach the DB container on Machine 1:
Tracepath from new container to DB container:
# tracepath 10.42.96.68
1?: [LOCALHOST] pmtu 1500
1: ip-10-42-160-84.ec2.internal 0.121ms
1: ip-10-42-160-84.ec2.internal 0.056ms
2: ip-10-42-160-84.ec2.internal 0.064ms pmtu 1422
2: no reply
3: no reply
4: no reply
5: no reply
6: no reply
swanctl output machine 1 (DB server)
swanctl output machine 2
This same issue is happening with several machines in the environment. Once new containers are launched and are unable to reach containers on other machines, restarting the network agent on both machines will allow them to reach each other.
The above does look very similar to what we have experienced. It has only happened once here so far, but when it suddenly did, it had quite an impact...
Hello,
I think I am experiencing a very similar issue. I have 2 web application containers on hosts 1 and 2. Container 1 on host 1 does NOT ping the redis container that is on host 3, but container 2 on host 2 DOES ping it.
Also, as visible in the output below, container 1 DOES resolve the name and DOES ping the nameserver.
Here's the output of traceroute from both containers:
Container that works:
php-fpm_2:/var/www# traceroute redis
traceroute to redis (10.42.209.65), 30 hops max, 60 byte packets
1 * * *
2 10.42.192.9 (10.42.192.9) 0.528 ms 0.421 ms 0.432 ms
3 10.42.209.65 (10.42.209.65) 0.435 ms 0.438 ms 0.468 ms
Container that does not work:
php-fpm_1:/var/www# traceroute redis
traceroute to redis (10.42.209.65), 30 hops max, 60 byte packets
1 10.42.220.226 (10.42.220.226) 0.026 ms 0.015 ms 0.011 ms
2 * * *
3 * * *
4 * * *
5 * * *
6 * * *
7 * * *
8 * * *
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
Also, here's the output of swanctl from host 1 (where the bad container lives):
# swanctl --list-sas
conn-138.201.55.14: #2732, ESTABLISHED, IKEv2, adbf8bd39b23fd5b:cf503a5ea2e22817
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 138.201.55.14
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 774s ago, rekeying in 13165s
child-138.201.55.14: #11, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 2214320s ago, rekeying in 26883201s, expires in 32475280s
in c8aa231a, 0 bytes, 0 packets
out c5bbf995, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.14: #2731, ESTABLISHED, IKEv2, 5226afc1a26f5ecf:2a0fcd30e2ce7886
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 138.201.55.14
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 5225s ago, rekeying in 8668s
child-138.201.55.14: #12, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 2214320s ago, rekeying in 27083145s, expires in 32475280s
in c8ce9aa0, 0 bytes, 0 packets
out c887d15f, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.14: #2730, ESTABLISHED, IKEv2, d808298af1f22672:9bf2268dd0599e0d
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 138.201.55.14
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 6500s ago, rekeying in 6612s
child-138.201.55.14: #14, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 2214320s ago, rekeying in 28544838s, expires in 32475280s
in c35eea98, 4723988485623 bytes, 4294104145 packets
out cb1d0876, 8718126 bytes, 104082 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-136.243.139.110: #2729, ESTABLISHED, IKEv2, f328a93e2bda3b4f:2452733705614f9a
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 136.243.139.110
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 7744s ago, rekeying in 6134s
child-136.243.139.110: #44, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413449s ago, rekeying in 28120970s, expires in 34276151s
in c90737ae, 0 bytes, 0 packets
out c5816d05, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.14: #2728, ESTABLISHED, IKEv2, 9cb5523ccbdbd2c4:4a30fe6bd810a88a
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 138.201.55.14
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 8812s ago, rekeying in 4470s
child-138.201.55.14: #13, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 2214320s ago, rekeying in 28718734s, expires in 32475280s
in c74c6f20, 0 bytes, 0 packets
out c85a2ee5, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-136.243.139.110: #2727, ESTABLISHED, IKEv2, 9778e3082a2e2c73:cd7370d4fed61d72
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 136.243.139.110
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 8813s ago, rekeying in 4817s
child-136.243.139.110: #49, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413449s ago, rekeying in 28582747s, expires in 34276151s
in c1d3eaf4, 14944 bytes, 12 packets
out c68f1563, 676 bytes, 13 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-136.243.139.110: #2726, ESTABLISHED, IKEv2, 33f486814dbbae84:43ecbe9aba961ee7
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 136.243.139.110
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 9009s ago, rekeying in 4326s
child-136.243.139.110: #46, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413449s ago, rekeying in 28638287s, expires in 34276151s
in c6777ec6, 404 bytes, 1 packets
out cf58da8f, 164 bytes, 3 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-136.243.139.110: #2725, ESTABLISHED, IKEv2, 6ac7e25a74aa47d6:f1574de802901183
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 136.243.139.110
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 9185s ago, rekeying in 5100s
child-136.243.139.110: #48, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413449s ago, rekeying in 29158491s, expires in 34276151s
in c108e506, 0 bytes, 0 packets
out cf231cc9, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-136.243.139.110: #2724, ESTABLISHED, IKEv2, f7322b4d9b6aafd1:90b74e152131b4c0
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 136.243.139.110
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 9834s ago, rekeying in 3745s
child-136.243.139.110: #43, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413449s ago, rekeying in 28362666s, expires in 34276151s
in c727ae16, 0 bytes, 0 packets
out c28849ed, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-136.243.139.110: #2723, ESTABLISHED, IKEv2, 04dba9f9233bc7db:04721e84beb234a4
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 136.243.139.110
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 9917s ago, rekeying in 4249s
child-136.243.139.110: #52, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413449s ago, rekeying in 28265977s, expires in 34276151s
in c51e6c37, 0 bytes, 0 packets
out c8b411f8, 221893980084 bytes, 361093479 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-136.243.139.110: #2722, ESTABLISHED, IKEv2, ad5fdda7a3698dd2:850f549addc35ed8
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 136.243.139.110
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 9992s ago, rekeying in 3164s
child-136.243.139.110: #47, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413449s ago, rekeying in 28816148s, expires in 34276151s
in c11f7dbb, 250681900082 bytes, 357944555 packets
out c4ccdf0d, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-136.243.139.110: #2721, ESTABLISHED, IKEv2, d0419b2ce79eb268:9582190fef809644
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 136.243.139.110
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 10314s ago, rekeying in 2911s
child-136.243.139.110: #50, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413449s ago, rekeying in 29107448s, expires in 34276151s
in c8cb2180, 60 bytes, 1 packets
out c845d283, 2219 bytes, 4 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-136.243.139.110: #2720, ESTABLISHED, IKEv2, b4d51e7c52cb4daf:ceb3dc240e26f8f0
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 136.243.139.110
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 10933s ago, rekeying in 2917s
child-136.243.139.110: #42, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413497s ago, rekeying in 28452587s, expires in 34276103s
in c6039c78, 3738960 bytes, 3257 packets
out c1a5df01, 384222 bytes, 3401 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.14: #2719, ESTABLISHED, IKEv2, e2fd9f2f73908296:6174dfb5df38a09f
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 138.201.55.14
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 11650s ago, rekeying in 2483s
child-138.201.55.14: #8, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 2214320s ago, rekeying in 26965518s, expires in 32475280s
in c3cf15d3, 0 bytes, 0 packets
out c26baac3, 448509677877 bytes, 3594632504 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-136.243.139.110: #2718, ESTABLISHED, IKEv2, 3818cc2cca864965:d166f5df9a6c9de3
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 136.243.139.110
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 11883s ago, rekeying in 1154s
child-136.243.139.110: #51, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413449s ago, rekeying in 29246859s, expires in 34276151s
in c3e32e68, 0 bytes, 0 packets
out cb39cc73, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.14: #2717, ESTABLISHED, IKEv2, 661b6d6859fa0d3a:4e835b1b58e167f1
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 138.201.55.14
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 13203s ago, rekeying in 670s
child-138.201.55.14: #10, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 2214320s ago, rekeying in 28789653s, expires in 32475280s
in cf0112c4, 0 bytes, 0 packets
out c1bef744, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-136.243.139.110: #2716, ESTABLISHED, IKEv2, fde625f015d23e62:8016275ab0f8441a
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 136.243.139.110
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 13339s ago, rekeying in 1013s
child-136.243.139.110: #45, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413449s ago, rekeying in 29572997s, expires in 34276151s
in cfb227c1, 0 bytes, 0 packets
out c51ae854, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
And from host 2, where the healthy container lives:
# swanctl --list-sas
conn-138.201.55.14: #528, ESTABLISHED, IKEv2, ac3bc826f84653e8:65fde55c5de9665c
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 138.201.55.14
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 792s ago, rekeying in 12386s
child-138.201.55.14: #8, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in c484e808, 0 bytes, 0 packets
out c2e87e9c, 364 bytes, 7 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.7: #527, ESTABLISHED, IKEv2, f328a93e2bda3b4f:2452733705614f9a
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 138.201.55.7
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 7839s ago, rekeying in 5187s
child-138.201.55.7: #4, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in c5816d05, 0 bytes, 0 packets
out c90737ae, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.14: #526, ESTABLISHED, IKEv2, 6c92f0f7fbfa544d:5108db61b76e48ae
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 138.201.55.14
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 7927s ago, rekeying in 6060s
child-138.201.55.14: #2, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413592s ago
in c7ff559d, 19207027 bytes, 15819 packets
out ce7df6d7, 1052691 bytes, 12987 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.7: #525, ESTABLISHED, IKEv2, 9778e3082a2e2c73:cd7370d4fed61d72
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 138.201.55.7
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 8908s ago, rekeying in 4881s
child-138.201.55.7: #11, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in c68f1563, 676 bytes, 13 packets
out c1d3eaf4, 14944 bytes, 12 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.14: #524, ESTABLISHED, IKEv2, 65d1da64dcb84853:e756db135e458db4
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 138.201.55.14
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 9092s ago, rekeying in 5205s
child-138.201.55.14: #15, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in c8880cc9, 24174 bytes, 17 packets
out c44fa30a, 112 bytes, 2 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.7: #523, ESTABLISHED, IKEv2, 33f486814dbbae84:43ecbe9aba961ee7
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 138.201.55.7
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 9104s ago, rekeying in 4764s
child-138.201.55.7: #6, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in cf58da8f, 164 bytes, 3 packets
out c6777ec6, 404 bytes, 1 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.7: #522, ESTABLISHED, IKEv2, 6ac7e25a74aa47d6:f1574de802901183
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 138.201.55.7
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 9280s ago, rekeying in 3736s
child-138.201.55.7: #10, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in cf231cc9, 0 bytes, 0 packets
out c108e506, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.14: #521, ESTABLISHED, IKEv2, fbbe914b7ebf510d:2fcb52f1fa1391ad
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 138.201.55.14
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 9430s ago, rekeying in 3981s
child-138.201.55.14: #17, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in c13e7342, 1150209162001 bytes, 995204550 packets
out cae940ba, 106128960392 bytes, 849191645 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.7: #520, ESTABLISHED, IKEv2, f7322b4d9b6aafd1:90b74e152131b4c0
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 138.201.55.7
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 9929s ago, rekeying in 3738s
child-138.201.55.7: #3, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in c28849ed, 0 bytes, 0 packets
out c727ae16, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.7: #519, ESTABLISHED, IKEv2, 04dba9f9233bc7db:04721e84beb234a4
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 138.201.55.7
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 10012s ago, rekeying in 4174s
child-138.201.55.7: #16, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in c8b411f8, 221799288759 bytes, 361041337 packets
out c51e6c37, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.7: #518, ESTABLISHED, IKEv2, ad5fdda7a3698dd2:850f549addc35ed8
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 138.201.55.7
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 10087s ago, rekeying in 3699s
child-138.201.55.7: #7, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in c4ccdf0d, 0 bytes, 0 packets
out c11f7dbb, 250866738620 bytes, 358117401 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.14: #517, ESTABLISHED, IKEv2, 5b96ad8a893604f6:ce00961b16712a00
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 138.201.55.14
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 10399s ago, rekeying in 2761s
child-138.201.55.14: #9, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in c1f718ea, 277067 bytes, 211 packets
out cea1a8cc, 8956 bytes, 122 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.7: #516, ESTABLISHED, IKEv2, d0419b2ce79eb268:9582190fef809644
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 138.201.55.7
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 10409s ago, rekeying in 2748s
child-138.201.55.7: #13, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in c845d283, 2219 bytes, 4 packets
out c8cb2180, 60 bytes, 1 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.7: #515, ESTABLISHED, IKEv2, b4d51e7c52cb4daf:ceb3dc240e26f8f0
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 138.201.55.7
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 11028s ago, rekeying in 3200s
child-138.201.55.7: #1, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413592s ago
in c1a5df01, 384222 bytes, 3401 packets
out c6039c78, 3738960 bytes, 3257 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.7: #514, ESTABLISHED, IKEv2, 3818cc2cca864965:d166f5df9a6c9de3
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 138.201.55.7
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 11978s ago, rekeying in 2402s
child-138.201.55.7: #14, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in cb39cc73, 0 bytes, 0 packets
out c3e32e68, 104 bytes, 2 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.14: #513, ESTABLISHED, IKEv2, b0c95e674cf199ca:03dd646a6e7ea5f3
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.3' @ 138.201.55.14
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 12020s ago, rekeying in 1482s
child-138.201.55.14: #12, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in ce0aea90, 404 bytes, 5 packets
out cf32ecc4, 909 bytes, 15 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-138.201.55.7: #512, ESTABLISHED, IKEv2, fde625f015d23e62:8016275ab0f8441a
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 138.201.55.7
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 13434s ago, rekeying in 941s
child-138.201.55.7: #5, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 413544s ago
in c51ae854, 0 bytes, 0 packets
out cfb227c1, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
Docker version is 1.11.1 and Rancher is 1.0.1
Is there any way to deal with this? Thanks!
This just happened for us again. Second time.
The fix is to simply restart the network agent.
We're unlucky it seems. Now it happened again on another machine. Restarting the network agent resolved the issue as usual.
Given that it happened at roughly the same time on two machines that were started at roughly the same time (3 weeks ago), it sounds like a memory-leak or timeout issue?
Also, it seems strange to me that the VPN process is not automatically restarted once it crashes...
I had two more servers affected by this issue starting yesterday. They cannot be reached over rancher networking until restarting the networking container. Also, when starting or restarting a container on the problematic machines, they get stuck in the "Networking" state until the networking container is restarted.
I thought that stopping and removing the rancher-agent and the agent-instance containers might help. They were both able to be stopped; however, when attempting to remove their volumes, I get a "device busy" error, and the volumes cannot be removed. I'm not sure if that is expected or related, but I thought it was worth noting.
Do you have any steps that will help me reproduce this issue at my end? I have been trying to recreate the problem for a while, but with no success.
This happens to us occasionally as well. There doesn't seem to be a reliable way to trigger it that I can find. Here's our setup that caused the issue:
@deniseschannon: As several people seem to have the same issue maybe this could be re-triaged to kind/bug?
As @remkade said, we haven't been able to find anything obvious in the logs either that gives us the slightest clue as to what might have triggered this.
And as mentioned before, the network agent process itself has _not crashed_, but rather seems to report everything as OK, while the actual _transport_ between some peers on the rancher network seems stuck, even though networking outside the containers works just fine.
What @retardtank said.
@leodotcloud: Our setup is:
Node1: Load-Balancer (ha-proxy, rancher built in)
Node2: HTTP server (file server)
Node3: HTTP server (file server)
Node4: HTTP server (file server)
We're having the issues on Node2-Node4. All nodes are identical apart from which containers are running.
According to https://wiki.strongswan.org/issues/183 we need the kernel to be compiled with CONFIG_INET_XFRM_MODE_TRANSPORT set.
Can you please check if this flag is set in your kernel config?
Try /proc/config.gz or /boot/config-$(uname -r) for the kernel config files.
boot2docker v1.11.1 doesn't have this flag set, so Rancher cross-host communication doesn't work with boot2docker.
Also can you check the output of the command:
ip xfrm state
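To make both checks concrete, here is a sketch; the fallback file locations follow the paths mentioned above, and nothing in it is Rancher-specific:

```shell
# Report whether the running kernel has transport-mode XFRM support,
# which strongSwan needs (=y built-in, or =m with the module loaded).
check_xfrm_flag() {
  if [ -r /proc/config.gz ]; then
    zgrep CONFIG_INET_XFRM_MODE_TRANSPORT /proc/config.gz
  elif [ -r "/boot/config-$(uname -r)" ]; then
    grep CONFIG_INET_XFRM_MODE_TRANSPORT "/boot/config-$(uname -r)"
  else
    echo "kernel config not found" >&2
    return 1
  fi
}

check_xfrm_flag || true
# Then dump the installed IPsec state on the host:
# ip xfrm state
```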
@remkade : Are you using boot2docker inside your VM?
https://forums.rancher.com/t/how-to-get-rancher-overlay-network-working-locally/3044/10?u=leodotcloud
CONFIG_INET_XFRM_MODE_TRANSPORT=m is set on our hosts.
A while ago, when we had two peers that could not reach each other, I took a thorough look through the charon log, but I can't say I spotted anything that stood out, like NO_PROPOSAL_CHOSEN. There's lots of repetitive "querying policy failed: No such file or directory (2)" for all peers, but I guess that's normal since networking still (mostly) works(?).
The following strongSwan-related posts seem to describe exactly the same problem we are facing (except that networking is usually up for longer than "minutes" in our case):
https://www.mail-archive.com/[email protected]/msg00921.html
http://forum.ipfire.org/viewtopic.php?t=15683
Unfortunately, as far as I can see, they don't really provide a fix. At least not one that's applicable to Rancher, our environment, or the year 2016...
@leodotcloud no boot2docker, these are PXE-booted hosts on a QEMU VM or bare metal.
+1, seeing this behavior as well.
We have 3x Rancher hosts deployed. I deployed a single sample container, which ended up on host 3. The HAProxy container on host 1 and host 3 could communicate with this container just fine; however, the same container on host 2 could not. Running ab, I saw a 33% failure rate over 1k connection requests through an ELB to the haproxy instances (one for each host).
I spent a lot of time digging and stumbled on this thread. Without any changes or restarts, here is the output of swanctl --list-sas:
host 1:
root@09e98a5ce5f9:/# swanctl --list-sas
conn-10.202.32.31: #1707, ESTABLISHED, IKEv2, d615263e8a1506bd:4a68f1bf19f3f4fc
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.32.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 1354s ago, rekeying in 12584s
child-10.202.32.31: #2, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 5712809s ago
in cc994365, 0 bytes, 0 packets
out c04cef57, 540 bytes, 9 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.202.32.31: #1706, ESTABLISHED, IKEv2, e5c0484f8109335f:247e0ebf913442c0
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.32.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 2510s ago, rekeying in 11850s
child-10.202.32.31: #3, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 5712801s ago
in c3f31335, 11993182546 bytes, 174347937 packets
out cc2c1209, 11770029055 bytes, 177738279 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.202.33.31: #1705, ESTABLISHED, IKEv2, 68447cf8aa0c0a40:33a0d43b0608ba0d
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.33.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 7003s ago, rekeying in 7243s
child-10.202.33.31: #1, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 5712809s ago
in cdb3cf24, 0 bytes, 0 packets
out cc318e0c, 240 bytes, 4 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.202.33.31: #1704, ESTABLISHED, IKEv2, 314e1a6bf2c732ca:5200fd83a0db5d0f
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.33.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 7692s ago, rekeying in 5388s
child-10.202.33.31: #4, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 5712801s ago
in c2bd067e, 13146217397 bytes, 190925708 packets
out c90cf5d2, 12837992636 bytes, 196628215 packets
local 0.0.0.0/0
remote 0.0.0.0/0
host 2:
root@24122a0d8c8b:/# swanctl --list-sas
conn-10.202.31.31: #933, ESTABLISHED, IKEv2, d615263e8a1506bd:4a68f1bf19f3f4fc
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.31.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 1356s ago, rekeying in 11928s
child-10.202.31.31: #1, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 5712812s ago, rekeying in 23070460s, expires in 28976788s
in c04cef57, 540 bytes, 9 packets
out cc994365, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.202.31.31: #932, ESTABLISHED, IKEv2, e5c0484f8109335f:247e0ebf913442c0
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.31.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 2513s ago, rekeying in 11815s
child-10.202.31.31: #3, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 5712803s ago, rekeying in 24022766s, expires in 28976797s
in cc2c1209, 11769540162 bytes, 177730946 packets
out c3f31335, 11993331445 bytes, 174350068 packets
local 0.0.0.0/0
remote 0.0.0.0/0
host 3:
root@0bcece89fc2b:/# swanctl --list-sas
conn-10.202.31.31: #931, ESTABLISHED, IKEv2, 68447cf8aa0c0a40:33a0d43b0608ba0d
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.31.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 7008s ago, rekeying in 7316s
child-10.202.31.31: #1, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 5712814s ago, rekeying in 23070458s, expires in 28976786s
in cc318e0c, 240 bytes, 4 packets
out cdb3cf24, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.202.31.31: #930, ESTABLISHED, IKEv2, 314e1a6bf2c732ca:5200fd83a0db5d0f
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.31.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 7696s ago, rekeying in 6481s
child-10.202.31.31: #3, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 5712805s ago, rekeying in 24280285s, expires in 28976795s
in c90cf5d2, 12837831134 bytes, 196625762 packets
out c2bd067e, 13146399961 bytes, 190928403 packets
local 0.0.0.0/0
remote 0.0.0.0/0
I then restarted the rancher-agent container on host 2 and now see the following output:
host1:
root@09e98a5ce5f9:/# swanctl --list-sas
conn-10.202.32.31: #1710, ESTABLISHED, IKEv2, 227be6b7e3deaf7a:7876394e0f0397db
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.32.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 331s ago, rekeying in 14055s
child-10.202.32.31: #5, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 331s ago
in c8f27d80, 52006 bytes, 749 packets
out cfb56bf5, 45265 bytes, 703 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.202.32.31: #1709, ESTABLISHED, IKEv2, 1674e6a7bc4bea9d:cb90e9fef87583a2
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.32.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 313s ago, rekeying in 14039s
child-10.202.32.31: #6, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 313s ago
in cfe67c9d, 0 bytes, 0 packets
out c2bf8d8c, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.202.32.31: #1708, ESTABLISHED, IKEv2, 3d2f7096fa91f488:95939026fc13c31d
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.32.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 313s ago, rekeying in 13191s
child-10.202.32.31: #7, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 313s ago
in c6c8b301, 1271346 bytes, 19889 packets
out cbb06ca0, 1297759 bytes, 18415 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.202.33.31: #1705, ESTABLISHED, IKEv2, 68447cf8aa0c0a40:33a0d43b0608ba0d
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.33.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 7473s ago, rekeying in 6773s
child-10.202.33.31: #1, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 5713279s ago
in cdb3cf24, 0 bytes, 0 packets
out cc318e0c, 240 bytes, 4 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.202.33.31: #1704, ESTABLISHED, IKEv2, 314e1a6bf2c732ca:5200fd83a0db5d0f
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.33.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 8162s ago, rekeying in 4918s
child-10.202.33.31: #4, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 5713271s ago
in c2bd067e, 13148807030 bytes, 190963617 packets
out c90cf5d2, 12840465985 bytes, 196665300 packets
local 0.0.0.0/0
remote 0.0.0.0/0
host2:
root@24122a0d8c8b:/# swanctl --list-sas
conn-10.202.31.31: #4, ESTABLISHED, IKEv2, 1674e6a7bc4bea9d:cb90e9fef87583a2
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.31.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 342s ago, rekeying in 12638s
child-10.202.31.31: #3, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 342s ago, rekeying in 28860443s, expires in 34689258s
in c2bf8d8c, 0 bytes, 0 packets
out cfe67c9d, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.202.31.31: #3, ESTABLISHED, IKEv2, 3d2f7096fa91f488:95939026fc13c31d
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.31.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 342s ago, rekeying in 13371s
child-10.202.31.31: #4, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 342s ago, rekeying in 28694644s, expires in 34689258s
in cbb06ca0, 1426008 bytes, 20231 packets
out c6c8b301, 1394312 bytes, 21814 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.202.33.31: #2, ESTABLISHED, IKEv2, 91c6cda0c9664265:43e44deb1499c177
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.33.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 359s ago, rekeying in 13602s
child-10.202.33.31: #2, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 359s ago
in c0982eaf, 1581371 bytes, 22145 packets
out ced98a7b, 1494307 bytes, 23330 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.202.31.31: #1, ESTABLISHED, IKEv2, 227be6b7e3deaf7a:7876394e0f0397db
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.31.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 359s ago, rekeying in 13622s
child-10.202.31.31: #1, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 359s ago, rekeying in 29812695s, expires in 34689241s
in cfb56bf5, 45265 bytes, 703 packets
out c8f27d80, 52006 bytes, 749 packets
local 0.0.0.0/0
remote 0.0.0.0/0
host3:
root@0bcece89fc2b:/# swanctl --list-sas
conn-10.202.32.31: #932, ESTABLISHED, IKEv2, 91c6cda0c9664265:43e44deb1499c177
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.32.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 388s ago, rekeying in 13047s
child-10.202.32.31: #4, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 388s ago, rekeying in 30663425s, expires in 34689212s
in ced98a7b, 1618611 bytes, 25276 packets
out c0982eaf, 1715737 bytes, 24021 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.202.31.31: #931, ESTABLISHED, IKEv2, 68447cf8aa0c0a40:33a0d43b0608ba0d
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.31.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 7531s ago, rekeying in 6793s
child-10.202.31.31: #1, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 5713337s ago, rekeying in 23069935s, expires in 28976263s
in cc318e0c, 240 bytes, 4 packets
out cdb3cf24, 0 bytes, 0 packets
local 0.0.0.0/0
remote 0.0.0.0/0
conn-10.202.31.31: #930, ESTABLISHED, IKEv2, 314e1a6bf2c732ca:5200fd83a0db5d0f
local '172.17.0.2' @ 172.17.0.2
remote '172.17.0.2' @ 10.202.31.31
AES_CBC-128/AES_XCBC_96/PRF_AES128_XCBC/MODP_2048
established 8219s ago, rekeying in 5958s
child-10.202.31.31: #3, reqid 1234, INSTALLED, TUNNEL-in-UDP, ESP:AES_CBC-128/HMAC_SHA1_96
installed 5713328s ago, rekeying in 24279762s, expires in 28976272s
in c90cf5d2, 12840607349 bytes, 196667353 packets
out c2bd067e, 13149301148 bytes, 190970904 packets
local 0.0.0.0/0
remote 0.0.0.0/0
After that restart, with no other modifications I can now access the container on host 3 from host 2.
I did a decent amount of digging through the logs and don't see anything. This is definitely concerning though.
Host 3, as you can see, is also missing one connection. I suspect that if I restart it I will see 4x connections, as host 1 and host 2 now show. However, I'll hold off in case there is any additional debug info to be obtained.
For versioning here is what I am running:
Component Version
Rancher v1.0.1
Cattle v0.159.7
User Interface v1.0.5
Rancher Compose v0.7.4
In the next couple of days I plan to upgrade to the latest version, and I suspect this has been squashed (although I dug through the patch notes and don't see anything directly relevant, unless I missed it).
Happened twice today.
@ronag have you tried upgrading to the latest version, or are you still on 1.0.1?
@chkelly: Still on 1.0.1. Has anything happened since then with the network agent that might affect this?
Happened to me twice in the last 2 days. Rancher is already on 1.1.2. In my case restarting the rancher/agent container does nothing, as the rancher/agent-instance is still up. Restarting rancher/agent-instance got stuck and caused the host to become unresponsive. I had to reboot the instance to get it up and running.
May I ask @ronag how you restarted the network agent?
Here is docker info:
Containers: 46
Running: 42
Paused: 0
Stopped: 4
Images: 17
Server Version: 1.10.3
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
Volume: local
Network: bridge null host
Kernel Version: 4.6.3-coreos
Operating System: CoreOS 1068.8.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 14.94 GiB
Name: ip-172-31-1-103.ap-southeast-1.compute.internal
ID: OI4J:WALR:6M2Y:YAN4:WTJ4:OHQO:XNKZ:56AB:6IOX:DOUT:N3PW:U2HV
The OS is CoreOS
NAME=CoreOS
ID=coreos
VERSION=1068.8.0
VERSION_ID=1068.8.0
BUILD_ID=2016-07-18-0616
PRETTY_NAME="CoreOS 1068.8.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
@PanJ that may not be the same issue based on the fact that you had trouble restarting the container.
For reference, I simply did a docker restart <container_id>. I was also able to exec into the container first and run swanctl --list-sas, and see that it did not have a connection to the other Docker host running the containers that failed basic connectivity checks.
@ronag since the upgrade I have not encountered this issue, however we are continuing to monitor it.
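The zero-byte inbound counters in the swanctl dumps above are a machine-checkable symptom. A rough sketch of an awk filter that flags them; the 10-minute threshold and the exact output format are assumptions based on the pastes in this thread:

```shell
# awk filter over `swanctl --list-sas` output: flag peers whose child SA
# was installed long ago but whose inbound ESP counter is still 0 bytes.
filter='
  /^conn-/    { peer = $1 }
  /installed/ { age = $2 + 0 }
  /^ *in /    { if ($3 + 0 == 0 && age > 600) print "suspect:", peer }
'

# Usage inside the Network Agent (AGENT_ID taken from docker ps):
#   docker exec "$AGENT_ID" swanctl --list-sas | awk "$filter"
```

Note that a freshly established tunnel legitimately shows 0 bytes for a short while, hence the age threshold; a long-installed SA with nothing inbound matches the stuck-transport pattern reported here.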
There were some networking fixes in v1.1.2, so @PanJ could you share your rancher-net.log and charon.log from inside the Network Agent?
http://docs.rancher.com/rancher/latest/en/faqs/troubleshooting/#where-can-i-find-logs-for-networking
Also, instead of restarting the entire network agent, you should be able to just run monit restart charon inside the network agent to get networking restarted. Please give it a try.
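A minimal sketch combining the log check and the charon-only restart. The log paths follow the linked troubleshooting doc but are assumptions, and AGENT_ID is a placeholder; the commands are printed rather than executed so nothing runs by accident:

```shell
# Build the commands for pulling the networking logs and restarting the
# charon daemon inside the Network Agent. AGENT_ID is a placeholder;
# the log paths are assumptions based on the troubleshooting doc.
AGENT_ID="${AGENT_ID:-changeme}"
logs_cmd="docker exec $AGENT_ID tail -n 50 /var/log/rancher-net.log /var/log/charon.log"
restart_cmd="docker exec $AGENT_ID monit restart charon"

echo "+ $logs_cmd"
echo "+ $restart_cmd"
# eval "$logs_cmd" && eval "$restart_cmd"
```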
Thank you @chkelly and @deniseschannon for your support. I will try to log next time the problem occurs.
Currently, it looks like this
rancher-net.log
time="2016-07-28T16:00:48Z" level=info msg=Reconfiguring
time="2016-07-28T16:00:48Z" level=info msg="Deleted policy: {Dst:10.42.0.0/16 Src:10.42.145.49/32 Dir:dir in Priority:10000 Index:22720 Tmpls:[{Dst:172.17.0.2 Src:52.77.99.217 Proto:esp Mode:tunnel Reqid:1234}]}"
time="2016-07-28T16:00:48Z" level=info msg="Deleted policy: {Dst:10.42.145.49/32 Src:10.42.0.0/16 Dir:dir out Priority:10000 Index:22713 Tmpls:[{Dst:52.77.99.217 Src:172.17.0.2 Proto:esp Mode:tunnel Reqid:1234}]}"
time="2016-07-28T16:00:48Z" level=info msg="Deleted policy: {Dst:10.42.0.0/16 Src:10.42.145.49/32 Dir:dir fwd Priority:10000 Index:22730 Tmpls:[{Dst:172.17.0.2 Src:52.77.99.217 Proto:esp Mode:tunnel Reqid:1234}]}"
time="2016-07-28T16:00:48Z" level=info msg="Added policy: {Dst:10.42.9.177/32 Src:10.42.0.0/16 Dir:dir out Priority:10000 Index:0 Tmpls:[{Dst:52.77.99.217 Src:172.17.0.2 Proto:esp Mode:tunnel Reqid:1234}]}"
time="2016-07-28T16:00:48Z" level=info msg="Added policy: {Dst:10.42.0.0/16 Src:10.42.9.177/32 Dir:dir in Priority:10000 Index:0 Tmpls:[{Dst:172.17.0.2 Src:52.77.99.217 Proto:esp Mode:tunnel Reqid:1234}]}"
time="2016-07-28T16:00:48Z" level=info msg="Added policy: {Dst:10.42.0.0/16 Src:10.42.9.177/32 Dir:dir fwd Priority:10000 Index:0 Tmpls:[{Dst:172.17.0.2 Src:52.77.99.217 Proto:esp Mode:tunnel Reqid:1234}]}"
time="2016-07-28T16:10:32Z" level=info msg=Reconfiguring
time="2016-07-28T16:10:32Z" level=info msg="Deleted policy: {Dst:10.42.0.0/16 Src:10.42.125.43/32 Dir:dir fwd Priority:10000 Index:22682 Tmpls:[{Dst:172.17.0.2 Src:52.77.99.217 Proto:esp Mode:tunnel Reqid:1234}]}"
time="2016-07-28T16:10:32Z" level=info msg="Deleted policy: {Dst:10.42.0.0/16 Src:10.42.125.43/32 Dir:dir in Priority:10000 Index:22672 Tmpls:[{Dst:172.17.0.2 Src:52.77.99.217 Proto:esp Mode:tunnel Reqid:1234}]}"
time="2016-07-28T16:10:32Z" level=info msg="Deleted policy: {Dst:10.42.125.43/32 Src:10.42.0.0/16 Dir:dir out Priority:10000 Index:22665 Tmpls:[{Dst:52.77.99.217 Src:172.17.0.2 Proto:esp Mode:tunnel Reqid:1234}]}"
time="2016-07-28T16:10:32Z" level=info msg="Added policy: {Dst:10.42.18.9/32 Src:10.42.0.0/16 Dir:dir out Priority:10000 Index:0 Tmpls:[{Dst:52.77.99.217 Src:172.17.0.2 Proto:esp Mode:tunnel Reqid:1234}]}"
time="2016-07-28T16:10:32Z" level=info msg="Added policy: {Dst:10.42.0.0/16 Src:10.42.18.9/32 Dir:dir in Priority:10000 Index:0 Tmpls:[{Dst:172.17.0.2 Src:52.77.99.217 Proto:esp Mode:tunnel Reqid:1234}]}"
time="2016-07-28T16:10:32Z" level=info msg="Added policy: {Dst:10.42.0.0/16 Src:10.42.18.9/32 Dir:dir fwd Priority:10000 Index:0 Tmpls:[{Dst:172.17.0.2 Src:52.77.99.217 Proto:esp Mode:tunnel Reqid:1234}]}"
This loops every 10 minutes.
charon.log
Jul 28 16:16:56 12[IKE] <conn-52.77.221.132|2> peer not responding, trying again (1271/0)
Jul 28 16:16:56 12[IKE] <conn-52.77.221.132|2> initiating IKE_SA conn-52.77.221.132[2] to 52.77.221.132
Jul 28 16:16:56 12[ENC] <conn-52.77.221.132|2> generating IKE_SA_INIT request 0 [ SA KE No N(NATD_S_IP) N(NATD_D_IP) N(HASH_ALG) N(REDIR_SUP) ]
Jul 28 16:16:56 12[NET] <conn-52.77.221.132|2> sending packet: from 172.17.0.2[500] to 52.77.221.132[500] (500 bytes)
Jul 28 16:17:00 13[IKE] <conn-52.77.221.132|2> retransmit 1 of request with message ID 0
Jul 28 16:17:00 13[NET] <conn-52.77.221.132|2> sending packet: from 172.17.0.2[500] to 52.77.221.132[500] (500 bytes)
Jul 28 16:17:07 07[IKE] <conn-52.77.221.132|2> retransmit 2 of request with message ID 0
Jul 28 16:17:07 07[NET] <conn-52.77.221.132|2> sending packet: from 172.17.0.2[500] to 52.77.221.132[500] (500 bytes)
Jul 28 16:17:20 11[IKE] <conn-52.77.221.132|2> retransmit 3 of request with message ID 0
Jul 28 16:17:20 11[NET] <conn-52.77.221.132|2> sending packet: from 172.17.0.2[500] to 52.77.221.132[500] (500 bytes)
The 52.77.221.132 host does not exist anymore, as its IP changed after the host instance was rebooted.
We're also encountering this issue. We upgraded to 1.1.2 today and will see if that's cleared it up...
On 1.1.2 this reoccurred. The following in the charon logs seemed informative:
Aug 3 06:27:39 15[IKE] <conn-10.0.0.210|2269> IKE_SA deleted
Aug 3 06:27:39 15[ENC] <conn-10.0.0.210|2269> generating INFORMATIONAL response 1 [ ]
Aug 3 06:27:39 15[NET] <conn-10.0.0.210|2269> sending packet: from 172.17.0.2[4500] to 10.0.0.210[4500] (57 bytes)
Aug 3 06:29:28 08[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.0.2.37
Aug 3 06:29:28 13[JOB] CHILD_SA ESP/0x00000000/10.0.2.37 not found for delete
Aug 3 06:29:28 10[KNL] creating acquire job for policy 10.42.100.246/32[tcp/55650] === 10.42.25.47/32[tcp/9300] with reqid {1234}
Aug 3 06:29:28 10[CFG] trap not found, unable to acquire reqid 1234
Aug 3 06:32:13 13[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.0.2.37
Aug 3 06:32:13 13[JOB] CHILD_SA ESP/0x00000000/10.0.2.37 not found for delete
Aug 3 06:32:17 05[KNL] creating acquire job for policy 10.42.8.91/32[tcp/51294] === 10.42.25.47/32[tcp/9300] with reqid {1234}
Aug 3 06:32:17 05[CFG] trap not found, unable to acquire reqid 1234
10.42.25.47 is the IP of the container that was no longer contactable.
As before, restarting the network agent restored connectivity.
strongSwan 5.5 has an interesting line in the changelog about rekeying:
The behavior during IKEv2 exchange collisions has been improved/fixed in several corner cases
and support for TEMPORARY_FAILURE and CHILD_SA_NOT_FOUND notifies, as defined by RFC 7296,
has been added (#379, #464, #876, #1293). The behavior is tested with a series of new unit tests.
Might this be related? It looks like Rancher is currently using 5.4.x.
Any update regarding this critical issue? Had it again in production last night at 6 am, not really something I'm happy to wake up for. Mh?
I have this issue once a week, more or less, in a 17-machine cluster. Most of the time I have to restart both the network agent and the container. One time I had to restart the physical node.
For anyone following this issue (@ahfeel, @ddmng). I've shown a failed setup to @leodotcloud. He checked through the different logs and confirmed that there was actually a failed connection between two nodes.
The feedback I received is that they will try to reproduce it in their test setups.
The only guess I have is that it is somehow related to high CPU load, since the problem only seems to occur on hosts that run close to 100% CPU usage over longer periods.
We have a similar issue with the following setup:
Rancher v1.1.3
Cattle v0.165.8.1
User Interface v1.1.12
Rancher Compose v0.8.6
Docker version: 1.12.1
OS: Ubuntu 16.04.1 LTS (4.4.0-34-generic)
We have several nodes and it has happened twice now. Restarting the network agent helps.
We're experiencing the same thing. We have a couple of peculiar observations that might point toward a useful solution.
We have an internal/public network, and the nodes suffering from this behavior are only our edge nodes. This could be because our edge nodes are Ubuntu 14.04 and our internal nodes 16.04,
but I think it's more likely because of the amount of data that gets processed through them. Restarting the network agent seemed to solve the issue, however.
@leodotcloud Is there any update on this? It has happened twice this week already. This is becoming VERY critical.
@ahfeel Do you know if running Rancher on Kubernetes also uses IPsec/strongSwan? If not, I might consider moving to Kubernetes.
No idea :-/ Actually I'm already finding the stack quite complex with all the docker + rancher stuff, I'd really love not to add another level with K8S.. If the issue doesn't get resolved anytime soon I'd rather switch to a simpler setup based on docker 1.12 and a less powerful frontend :(
@ahfeel There is no other reliable setup with Docker if you do actual things with containers. It's not the front-end that's the issue here, it's the overlay network. Swarm isn't ready for production yet. So you either end up with Kubernetes, Mesos, or Cattle. Rancher does a good job of setting up containers fast.
@ahfeel I experienced this issue for some time on a production system, with Elasticsearch nodes losing connection to the cluster once or twice a week: it's now been 3 weeks that things seem to work without problems.
The only thing I can say has changed is the network agent version: during the last event I saw that the cluster (18 nodes) was running version 0.8.1 on some nodes and 0.8.3 on others (w/ Rancher 1.1.0).
It all started to work when all nodes were updated to network agent v0.8.3... but I'm still keeping my fingers crossed!
I can't tell whether that was the cause, but this is what I observed in my experience. Hope it helps.
Thank you for the feedback @ddmng. What is the easiest way to force the upgrade to the latest version of the agent? (I'm running Rancher 1.1.0, but I just checked and the image used by the network agents is still 0.8.1.)
@ahfeel http://docs.rancher.com/rancher/v1.2/en/upgrading/ says that it should be triggered "when there is a trigger regarding networking" but #2837, #5572 and #5573 say you should do it manually. I did manually as described in those issues.
I think I just hit this issue on an 11-host setup between two nodes that have been up and working without problems for ~2 weeks. I'm running Rancher Server 1.2.0-pre2 and all hosts are running rancher/agent-instance:v0.8.3 and rancher/agent:v1.0.2. Restarting the agent instance container on one of the affected hosts cleared things up.
It seems like only communication between two particular hosts (say A and B) was affected. I was able to exec into containers on either A or B and ping containers on other hosts (e.g. C), but containers from A could not ping containers on B and vice versa. One thing I did not try but probably should have was pinging containers on A or B from a container on C, although I suspect that would have worked fine.
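That pairwise check can be sketched as a small loop, run from a container on each host against one container IP per peer host. The IPs below are placeholders, and a full matrix needs the loop run on every host:

```shell
# Ping each target over the 10.42.x overlay and report OK/FAIL per peer.
check_targets() {
  for ip in "$@"; do
    if ping -c 2 -W 2 "$ip" >/dev/null 2>&1; then
      echo "OK   $ip"
    else
      echo "FAIL $ip"
    fi
  done
}

# Example (placeholder container IPs, one per peer host):
# check_targets 10.42.215.27 10.42.148.180 10.42.112.38
```

Running this from every host and comparing the results would show exactly which host pairs have lost their tunnel, which is more conclusive than testing a single direction.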
It doesn't look like CPU usage went over ~50% on either affected host around the time at which I think the networking dropped. I see this in the charon logs from one failing host around that time as well (the last line, "destroying IKE_SA in state CONNECTING without notification", is the only thing that looks out of the ordinary):
Sep 22 20:01:30 09[IKE] <conn-172.31.15.182|1718> initiating IKE_SA conn-172.31.15.182[1738] to 172.31.15.182
Sep 22 20:01:30 09[ENC] <conn-172.31.15.182|1718> generating CREATE_CHILD_SA request 0 [ SA No KE ]
Sep 22 20:01:30 09[NET] <conn-172.31.15.182|1718> sending packet: from 172.17.0.2[4500] to 172.31.15.182[4500] (465 bytes)
Sep 22 20:01:30 15[NET] <conn-172.31.15.182|1718> received packet: from 172.31.15.182[4500] to 172.17.0.2[4500] (465 bytes)
Sep 22 20:01:30 15[ENC] <conn-172.31.15.182|1718> parsed CREATE_CHILD_SA request 0 [ SA No KE ]
Sep 22 20:01:30 15[IKE] <conn-172.31.15.182|1718> 172.31.15.182 is initiating an IKE_SA
Sep 22 20:01:30 15[IKE] <conn-172.31.15.182|1718> detected IKE_REKEY collision with IKE_REKEY
Sep 22 20:01:30 15[ENC] <conn-172.31.15.182|1718> generating CREATE_CHILD_SA response 0 [ SA No KE ]
Sep 22 20:01:30 15[NET] <conn-172.31.15.182|1718> sending packet: from 172.17.0.2[4500] to 172.31.15.182[4500] (405 bytes)
Sep 22 20:01:30 14[NET] <conn-172.31.15.182|1718> received packet: from 172.31.15.182[4500] to 172.17.0.2[4500] (405 bytes)
Sep 22 20:01:30 14[ENC] <conn-172.31.15.182|1718> parsed CREATE_CHILD_SA response 0 [ SA No KE ]
Sep 22 20:01:30 14[IKE] <conn-172.31.15.182|1718> IKE_SA rekey collision won, waiting for delete
Sep 22 20:01:30 14[IKE] <conn-172.31.15.182|1718> scheduling rekeying in 13575s
Sep 22 20:01:30 14[IKE] <conn-172.31.15.182|1718> maximum IKE_SA lifetime 15015s
Sep 22 20:01:30 14[IKE] <conn-172.31.15.182|1718> IKE_SA conn-172.31.15.182[1738] rekeyed between 172.17.0.2[172.17.0.2]...172.31.15.182[172.17.0.2]
Sep 22 20:01:30 14[IKE] <conn-172.31.15.182|1718> deleting IKE_SA conn-172.31.15.182[1718] between 172.17.0.2[172.17.0.2]...172.31.15.182[172.17.0.2]
Sep 22 20:01:30 14[IKE] <conn-172.31.15.182|1718> sending DELETE for IKE_SA conn-172.31.15.182[1718]
Sep 22 20:01:30 14[ENC] <conn-172.31.15.182|1718> generating INFORMATIONAL request 1 [ D ]
Sep 22 20:01:30 14[NET] <conn-172.31.15.182|1718> sending packet: from 172.17.0.2[4500] to 172.31.15.182[4500] (65 bytes)
Sep 22 20:01:30 15[NET] <conn-172.31.15.182|1718> received packet: from 172.31.15.182[4500] to 172.17.0.2[4500] (57 bytes)
Sep 22 20:01:30 15[ENC] <conn-172.31.15.182|1718> parsed INFORMATIONAL response 1 [ ]
Sep 22 20:01:30 15[IKE] <conn-172.31.15.182|1718> IKE_SA deleted
Sep 22 20:01:40 10[IKE] <conn-172.31.15.182|1739> destroying IKE_SA in state CONNECTING without notification
where 172.31.15.182 is the VPC IP of the other host that suddenly became unreachable. On that host, I see the following corresponding charon logs:
Sep 22 20:01:30 05[IKE] <conn-172.31.14.161|1715> initiating IKE_SA conn-172.31.14.161[1735] to 172.31.14.161
Sep 22 20:01:30 05[ENC] <conn-172.31.14.161|1715> generating CREATE_CHILD_SA request 0 [ SA No KE ]
Sep 22 20:01:30 05[NET] <conn-172.31.14.161|1715> sending packet: from 172.17.0.2[4500] to 172.31.14.161[4500] (465 bytes)
Sep 22 20:01:30 15[NET] <conn-172.31.14.161|1715> received packet: from 172.31.14.161[4500] to 172.17.0.2[4500] (465 bytes)
Sep 22 20:01:30 15[ENC] <conn-172.31.14.161|1715> parsed CREATE_CHILD_SA request 0 [ SA No KE ]
Sep 22 20:01:30 15[IKE] <conn-172.31.14.161|1715> 172.31.14.161 is initiating an IKE_SA
Sep 22 20:01:30 10[MGR] ignoring request with ID 0, already processing
Sep 22 20:01:30 15[IKE] <conn-172.31.14.161|1715> detected IKE_REKEY collision with IKE_REKEY
Sep 22 20:01:30 15[ENC] <conn-172.31.14.161|1715> generating CREATE_CHILD_SA response 0 [ SA No KE ]
Sep 22 20:01:30 15[NET] <conn-172.31.14.161|1715> sending packet: from 172.17.0.2[4500] to 172.31.14.161[4500] (405 bytes)
Sep 22 20:01:30 13[NET] <conn-172.31.14.161|1715> received packet: from 172.31.14.161[4500] to 172.17.0.2[4500] (65 bytes)
Sep 22 20:01:30 13[ENC] <conn-172.31.14.161|1715> parsed INFORMATIONAL request 1 [ D ]
Sep 22 20:01:30 13[IKE] <conn-172.31.14.161|1715> received DELETE for IKE_SA conn-172.31.14.161[1715]
Sep 22 20:01:30 13[IKE] <conn-172.31.14.161|1715> deleting IKE_SA conn-172.31.14.161[1715] between 172.17.0.2[172.17.0.2]...172.31.14.161[172.17.0.2]
Sep 22 20:01:30 13[IKE] <conn-172.31.14.161|1715> IKE_SA deleted
Sep 22 20:01:30 13[ENC] <conn-172.31.14.161|1715> generating INFORMATIONAL response 1 [ ]
Sep 22 20:01:30 13[NET] <conn-172.31.14.161|1715> sending packet: from 172.17.0.2[4500] to 172.31.14.161[4500] (57 bytes)
We are running server v1.1.3 in single-node mode, managing two hosts with several containers on each. When it happened, the health checks for containers on one host started to fail and Rancher kept re-creating all containers on that host. We encountered this issue twice, once in August and once today; each time, restarting the Rancher network agent on the host where the containers were re-created fixed the issue.
Keeps happening to me, with the load balancer losing connections to the web nodes. @will-chan @deniseschannon.
This is becoming a real production issue for me, and I was wondering whether you have any update? It's gotten to the point that we're thinking of migrating to a different platform/networking solution.
This has been open since May, and I would say this is super critical.
The networking infrastructure is changing in v1.2.
Since there are no reliable reproducible steps to get the networking into this state, it's a bit trickier to debug and pinpoint the exact root cause of this.
@deniseschannon Thanks for replying. There are, however, at least 14 people here with the same experience; I'm pretty sure one of them (including me) is willing to give you access to the servers to debug the issue. For us this is a real showstopper, because I really can't hold up my SLAs this way.
What are the options? And when is v1.2 scheduled?
Happened again just now, and again restarting the network agent fixed it. We are planning to move back to ECS (yes we migrated from ECS to Rancher).
Networking in 1.2.0 was completely revamped; we now use CNI. v1.2 is now out, but there have been some issues for people whose Docker bridge uses something besides docker0 or who have Docker installed somewhere outside of /var/lib/docker. I'll close this issue, as any networking issues should be reported against v1.2.0 and we'll fix them.