rancher 🚀 - Agents Randomly Flapping since Rancher 1.5.2 Upgrade

Agents flapping in UI.
screen shot 2017-03-28 at 8 27 08 am

AWS ELB setup with proxy_protocol enabled.
screen shot 2017-03-28 at 8 54 19 am

Agent logs looping

3/27/2017 11:15:45 AMINFO: Already up to date
3/27/2017 11:15:45 AMINFO: Sending python-agent applied 1-8d88d6ac405cca84313a9fb1c8b4eba2a4a0f125e4309cdb47625bafbb10ace4
3/27/2017 11:15:45 AMINFO: Sending pyagent applied 1-9e3dcab12585149a65d818082279708965cd826df1937a55b24f48428f8f0c0644fc1033b5be15c88dcf9540b1c04c67
3/27/2017 11:15:46 AMINFO: Starting agent /var/lib/cattle/pyagent/apply.sh
3/27/2017 11:15:46 AMINFO: Executing /var/lib/cattle/pyagent/agent
3/27/2017 11:15:46 AMINFO: Updating host-config
3/27/2017 11:15:46 AMINFO: Downloading https://rancher.demandbase.com/v1//configcontent//host-config current=host-config-1-4995a4a06e8cea52da7f729987a035724633b2abc04a05a443b107618c92c60b
3/27/2017 11:15:46 AMINFO: Already up to date
3/27/2017 11:15:46 AMINFO: Sending host-config applied 1-4995a4a06e8cea52da7f729987a035724633b2abc04a05a443b107618c92c60b
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="Launching agent"
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="https://rancher.demandbase.com/v1EA81F417CA84DC80627EW4YCWcczT4EbLS3ruqWFA7hAGFdEykDXrppRgeYh"
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="Listening for events on https://rancher.demandbase.com/v1"
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="launching hostapi"
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="Starting event router."
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="Initializing event router" workerCount=250
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="Connection established"
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="Starting websocket pings"
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="Connecting to proxy." url="wss://rancher.demandbase.com/v1/connectbackend?token=<REDACTED>
3/27/2017 11:16:16 PMINFO: Starting agent for <REDACTED>
3/27/2017 11:16:17 PMINFO: Access Key: <REDACTED>
3/27/2017 11:16:17 PMINFO: Config URL: https://rancher.demandbase.com/v1
3/27/2017 11:16:17 PMINFO: Storage URL: https://rancher.demandbase.com/v1
3/27/2017 11:16:17 PMINFO: API URL: https://rancher.demandbase.com/v1
3/27/2017 11:16:17 PMINFO: IP: 10.9.1.91
3/27/2017 11:16:17 PMINFO: Port:
3/27/2017 11:16:17 PMINFO: Required Image: rancher/agent:v1.2.1
3/27/2017 11:16:17 PMINFO: Current Image: rancher/agent:v1.2.1
3/27/2017 11:16:17 PMINFO: Using image rancher/agent:v1.2.1

additional agent logs,

time="2017-03-23T04:12:06Z" level=info msg="Registering backend for host 0f2925d0-f6e5-4589-495b-dac751bbae39 with session ID c22c6eba-ff61-4a98-b44a-fe76877746d6." 
time="2017-03-23T04:12:10Z" level=info msg="Handling backend connection request." 
time="2017-03-23T04:12:10Z" level=info msg="Registering backend for host 41ba6d9a-10f9-4c46-61b2-b796a7c9dd3c with session ID 2ecfd38e-520d-4846-9795-20bd443c4611." 
time="2017-03-23T04:12:10Z" level=info msg="Handling backend connection request." 
time="2017-03-23T04:12:10Z" level=info msg="Registering backend for host fc90e4e2-2e23-4900-4ca5-b669bc6b4e9e with session ID 9da14b33-aad7-4556-912e-77458becf034." 
time="2017-03-23T04:12:11Z" level=info msg="Handling backend connection request." 
time="2017-03-23T04:12:11Z" level=info msg="Registering backend for host 56aa5a0a-fb87-464d-6c42-9a23b2e02de6 with session ID 0f4edad7-b55b-4b17-a053-e20d814eac02." 
time="2017-03-23T04:12:12Z" level=info msg="Handling backend connection request." 
time="2017-03-23T04:12:12Z" level=info msg="Registering backend for host 124b306f-6d38-4da2-6bd8-aaea8aee6f67 with session ID 5c248471-bdfe-4370-9725-32d345daf22f." 
2017-03-23 04:12:13,577 ERROR [1be01214-be8f-4778-8734-24d58532e2dd:6184117] [instance:233787] [instance.start] [] [ecutorService-5] [c.p.e.p.i.DefaultProcessInstanceImpl] Unknown exception org.jooq.exception.DataChangedException: Database record has been changed
    at org.jooq.impl.UpdatableRecordImpl.checkIfChanged(UpdatableRecordImpl.java:550) ~[jooq-3.3.0.jar:na]
    at org.jooq.impl.UpdatableRecordImpl.storeUpdate0(UpdatableRecordImpl.java:291) ~[jooq-3.3.0.jar:na]
    at org.jooq.impl.UpdatableRecordImpl.access$200(UpdatableRecordImpl.java:90) ~[jooq-3.3.0.jar:na]
    at org.jooq.impl.UpdatableRecordImpl$3.operate(UpdatableRecordImpl.java:260) ~[jooq-3.3.0.jar:na]
    at org.jooq.impl.RecordDelegate.operate(RecordDelegate.java:123) ~[jooq-3.3.0.jar:na]
    at org.jooq.impl.UpdatableRecordImpl.storeUpdate(UpdatableRecordImpl.java:255) ~[jooq-3.3.0.jar:na]
    at org.jooq.impl.UpdatableRecordImpl.update(UpdatableRecordImpl.java:149) ~[jooq-3.3.0.jar:na]
    at io.cattle.platform.object.impl.JooqObjectManager.persistRecord(JooqObjectManager.java:228) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.object.impl.JooqObjectManager.setFieldsInternal(JooqObjectManager.java:133) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.object.impl.JooqObjectManager$3.execute(JooqObjectManager.java:121) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]

ecliptik on 28 Mar 2017

Seeing similar issues. Rancher 1.5.1, HA setup behind ELB and connected to RDS. Docker 1.12.6.

Last night I lost 3 of my rancher agents spontaneously. I SSH'd to the server, and everything looked fine, as far as I could tell. All of the contaiers were still running, just weren't connected to the cluster any longer. I cleared the rancher-agent and the other containers on the host, and rerant the rancher bootstrap script and the node rejoined the cluster.

I can't figure out why the agents randomly disconnected though, which is concerning. I found this in the logs though. Any ideas why there would be websocket issues? The IP address for TLS handshake error is 100% an ELB address.

time="2017-03-28T04:32:33Z" level=info msg="Handling backend connection request."
time="2017-03-28T04:32:33Z" level=info msg="Registering backend for host 5ca832d9-e949-4e41-7b70-5b653b41ba1f with session ID e07437cc-0b52-49d9-b34b-ce8bc8d245a6."
2017/03/28 05:03:26 http: TLS handshake error from 172.26.30.88:63316: EOF
2017/03/28 05:03:29 http: TLS handshake error from 172.26.30.88:63324: EOF
time="2017-03-28T05:38:31Z" level=info msg="Shutting down backend d11849cb-d511-4055-4001-30001f8e149d. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-28T05:38:31Z" level=info msg="Removed backend. Key: d11849cb-d511-4055-4001-30001f8e149d. Session ID f9302ce6-8634-427f-82e8-d674978b09cb ."
time="2017-03-28T05:38:44Z" level=info msg="Handling backend connection request."
time="2017-03-28T05:38:44Z" level=info msg="Registering backend for host d11849cb-d511-4055-4001-30001f8e149d with session ID 29397ff5-664a-40ce-a3b7-77d48aedc376."
2017-03-28 05:45:31,659 WARN  [:] [] [] [] [p859417998-4093] [i.c.p.a.p.s.j.WebSocketMessageWriter] Unhandled Error (closing connection) org.eclipse.jetty.websocket.api.WebSocketException: RemoteEndpoint unavailable, current state [CLOSING], expecting [OPEN or CONNECTED]
at org.eclipse.jetty.websocket.common.WebSocketSession.getRemote(WebSocketSession.java:252) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.websocket.common.events.AbstractEventDriver.incomingFrame(AbstractEventDriver.java:141) ~[websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.websocket.common.WebSocketSession.incomingFrame(WebSocketSession.java:309) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.websocket.common.extensions.ExtensionStack.incomingFrame(ExtensionStack.java:214) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.websocket.common.Parser.notifyFrame(Parser.java:220) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.websocket.common.Parser.parse(Parser.java:258) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.readParse(AbstractWebSocketConnection.java:628) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:476) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540) [jetty-io-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) [jetty-util-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) [jetty-util-9.2.11.v20150529.jar:9.2.11.v20150529]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]

jmreicha on 28 Mar 2017

After talking with @aemneina in the Rancher slack channel, this appears this could be database related, with our master running into resource limits due to our RDS instance size of m4.large.

screen shot 2017-03-28 at 9 33 20 am

We're upgrading to the RDS instance to m4.xlarge to see if it helps.

ecliptik on 28 Mar 2017

Seeing this in the logs repeatedly,

time="2017-03-29T05:51:13Z" level=info msg="Starting websocket pings"
time="2017-03-29T15:40:40Z" level=error msg="Received error reading from socket. Exiting." error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2017-03-29T15:40:40Z" level=error msg="websocket: close 1006 (abnormal closure): unexpected EOF"

happening in all environments on almost all the agents.

ecliptik on 29 Mar 2017

Rancher Master logs are showing:

time="2017-03-25T18:13:57Z" level=info msg="Handling backend connection request."
time="2017-03-25T18:13:57Z" level=info msg="Registering backend for host 80addee7-b1c3-46ad-5082-79a3d3eccea8 with session ID 11d8d8c6-7205-4167-b151-40d191fcce64."
time="2017-03-25T18:14:01Z" level=info msg="Handling backend connection request."
time="2017-03-25T18:14:01Z" level=info msg="Registering backend for host 6a83381a-e309-439f-7fab-25b54e54e47e with session ID 5b920945-d810-4946-b0e6-d285d27773fb."
2017-03-25 18:14:02,082 WARN  [:] [] [] [] [p859417998-4530] [i.c.p.a.p.s.j.WebSocketMessageWriter] Unhandled Error (closing connection) org.eclipse.jetty.websocket.api.WebSocketException: RemoteEndpoint unavailable, current state [CLOSING], expecting [OPEN or CONNECTED]
        at org.eclipse.jetty.websocket.common.WebSocketSession.getRemote(WebSocketSession.java:252) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
        at org.eclipse.jetty.websocket.common.events.AbstractEventDriver.incomingFrame(AbstractEventDriver.java:141) ~[websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
        at org.eclipse.jetty.websocket.common.WebSocketSession.incomingFrame(WebSocketSession.java:309) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
        at org.eclipse.jetty.websocket.common.extensions.ExtensionStack.incomingFrame(ExtensionStack.java:214) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
        at org.eclipse.jetty.websocket.common.Parser.notifyFrame(Parser.java:220) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
        at org.eclipse.jetty.websocket.common.Parser.parse(Parser.java:258) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
        at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.readParse(AbstractWebSocketConnection.java:628) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
        at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:476) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
time="2017-03-25T18:14:02Z" level=info msg="Shutting down backend 01a12e35-363f-41b6-6a41-819030c2c765. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:02Z" level=info msg="Removed backend. Key: 01a12e35-363f-41b6-6a41-819030c2c765. Session ID 1c369dcd-e8a9-442d-8d2e-75d610ecc7ef ."
time="2017-03-25T18:14:03Z" level=info msg="Handling backend connection request."
time="2017-03-25T18:14:03Z" level=info msg="Registering backend for host 8a3f62c8-2547-4a95-5734-e0a84c328acd with session ID 4b6cf13b-669f-4c20-9955-f9a36678ad43."
time="2017-03-25T18:14:03Z" level=info msg="Handling backend connection request."
time="2017-03-25T18:14:03Z" level=info msg="Registering backend for host 618f1923-255c-498c-666c-90cdee1a8c51 with session ID 767c61c5-c12f-4c90-b75e-cd0778cddbe5."
time="2017-03-25T18:14:04Z" level=info msg="Shutting down backend be5570d4-cbc2-45c5-7099-0d44591dff8c. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:04Z" level=info msg="Removed backend. Key: be5570d4-cbc2-45c5-7099-0d44591dff8c. Session ID 431ce8d5-74c1-4572-83a4-60f1c4f6fc52 ."
time="2017-03-25T18:14:05Z" level=info msg="Shutting down backend cde44bd4-6e55-487b-64ba-8f5edf338c97. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:05Z" level=info msg="Removed backend. Key: cde44bd4-6e55-487b-64ba-8f5edf338c97. Session ID 47d8b9a0-7f90-418d-b345-e9869ceb4d90 ."
time="2017-03-25T18:14:07Z" level=info msg="Shutting down backend 6f75f30c-6c84-4109-6e0c-82a1da2e57a2. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:07Z" level=info msg="Removed backend. Key: 6f75f30c-6c84-4109-6e0c-82a1da2e57a2. Session ID 1ae4a641-a2bd-4492-8fc1-a208e9bb7118 ."
time="2017-03-25T18:14:10Z" level=info msg="Shutting down backend f4dbaf2c-8e10-4174-5634-1e0f2d59de84. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:10Z" level=info msg="Removed backend. Key: f4dbaf2c-8e10-4174-5634-1e0f2d59de84. Session ID 80c8916d-b91c-4604-b58d-865bd3bc1ccd ."
time="2017-03-25T18:14:10Z" level=info msg="Shutting down backend 026ee519-0ee0-4f5e-4bd0-640fa6cd201d. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:10Z" level=info msg="Removed backend. Key: 026ee519-0ee0-4f5e-4bd0-640fa6cd201d. Session ID 27d5393f-3046-4880-8ec1-18d637130c2c ."
time="2017-03-25T18:14:11Z" level=info msg="Shutting down backend e5a881cf-b76c-44b3-4c1d-ca55c4f9d8f0. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:11Z" level=info msg="Removed backend. Key: e5a881cf-b76c-44b3-4c1d-ca55c4f9d8f0. Session ID f61ca915-e570-4750-9072-b5a4ec365f20 ."
time="2017-03-25T18:14:11Z" level=info msg="Shutting down backend 81675666-b6de-4d48-65b5-5713ce221ce9. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:11Z" level=info msg="Removed backend. Key: 81675666-b6de-4d48-65b5-5713ce221ce9. Session ID 992a5ab2-2ae9-45c8-a8b0-f47bca743969 ."
time="2017-03-25T18:14:13Z" level=info msg="Shutting down backend 2bbec463-98e8-4be8-4dfe-21faaca5a642. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:13Z" level=info msg="Removed backend. Key: 2bbec463-98e8-4be8-4dfe-21faaca5a642. Session ID 0d3de709-c5dd-4ed8-a166-b7d32d67b380 ."
time="2017-03-25T18:14:15Z" level=info msg="Handling backend connection request."
time="2017-03-25T18:14:15Z" level=info msg="Registering backend for host 01a12e35-363f-41b6-6a41-819030c2c765 with session ID 45774d2d-7921-424d-b55b-23ce1f39b358."
time="2017-03-25T18:14:16Z" level=info msg="Shutting down backend 53e4a749-a2a5-40c2-7b9d-18c3f2286ac6. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:16Z" level=info msg="Removed backend. Key: 53e4a749-a2a5-40c2-7b9d-18c3f2286ac6. Session ID db4ab728-0cd0-40c2-b72f-cbf2578fd404 ."
time="2017-03-25T18:14:17Z" level=info msg="Shutting down backend 11228144-5f4a-42a0-4d3f-5e9318d6227b. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:17Z" level=info msg="Removed backend. Key: 11228144-5f4a-42a0-4d3f-5e9318d6227b. Session ID a4f8e0b6-64cb-49b8-98d8-f26d991307a6 ."
time="2017-03-25T18:14:17Z" level=info msg="Handling backend connection request."
time="2017-03-25T18:14:17Z" level=info msg="Registering backend for host be5570d4-cbc2-45c5-7099-0d44591dff8c with session ID f65b437e-ad06-4a74-a5ef-6c2b63247675."
time="2017-03-25T18:14:17Z" level=info msg="Handling backend connection request."
time="2017-03-25T18:14:17Z" level=info msg="Registering backend for host cde44bd4-6e55-487b-64ba-8f5edf338c97 with session ID 01b4cf4b-72e9-4323-99cd-778e7ce1fc05."
time="2017-03-25T18:14:19Z" level=info msg="Shutting down backend c95ce07e-70b8-4d08-73b4-b45f03d377e1. Connection closed because: websocket: close 1006 unexpected EOF."

more errors in logs

2017/03/29 15:29:48 http: proxy error: net/http: request canceled
2017/03/29 15:29:49 http: proxy error: net/http: request canceled
time="2017-03-29T15:31:01Z" level=info msg="Listing webhooks"
2017-03-29 15:40:58,979 INFO    [main] [ConsoleStatus] Starting [11/86]: WebsocketProxyLauncher
time="2017-03-29T15:41:01Z" level=info msg="Downloading key from http://localhost:8081/v1/scripts/api.crt"
time="2017-03-29T15:41:01Z" level=fatal msg="Error getting config." error="Invalid key content"
time="2017-03-29T15:41:02Z" level=fatal msg="Failed to configure cattle client: Get http://localhost:8080/v1: dial tcp [::1]:8080: getsockopt: connection refused"
time="2017-03-29T15:41:02Z" level=info msg="Starting Catalog Service (port 8088, refresh interval 300 seconds)"
time="2017-03-29T15:41:03Z" level=info msg="Downloading key from http://localhost:8081/v1/scripts/api.crt"
time="2017-03-29T15:41:03Z" level=info msg="Starting websocket proxy. Listening on [:8080], Proxying to cattle API at [localhost:8081], Monitoring parent pid [9]."

websocket EOFs

2017/03/28 17:54:07 http: proxy error: net/http: request canceled
time="2017-03-29T05:50:45Z" level=info msg="Exiting go-machine-service"
time="2017-03-29T05:50:46Z" level=info msg="Setting log level" logLevel=info
time="2017-03-29T05:50:46Z" level=info msg="Starting go-machine-service..." gitcommit=v0.36.1
time="2017-03-29T05:50:46Z" level=info msg="Waiting for handler registration (1/2)"
time="2017-03-29T05:50:47Z" level=info msg="Initializing event router" workerCount=10
time="2017-03-29T05:50:47Z" level=info msg="Connection established"
time="2017-03-29T05:50:47Z" level=info msg="Waiting for handler registration (2/2)"
time="2017-03-29T05:50:47Z" level=info msg="Starting websocket pings"
time="2017-03-29T05:50:48Z" level=info msg="Initializing event router" workerCount=250
time="2017-03-29T05:50:48Z" level=info msg="Connection established"
time="2017-03-29T05:50:48Z" level=info msg="Starting websocket pings"
time="2017-03-29T05:50:48Z" level=info msg="Installing builtin drivers"
time="2017-03-29T05:50:48Z" level=info msg="Downloading all drivers"
time="2017-03-29T05:50:48Z" level=info msg="Copying /var/lib/cattle/machine-drivers/56b60d32caac27345f4f55c4c91ed8adcfe7b3d5abcadd29d0551812f222f37e-docker-machine-driver-ubiquity => /usr/local/bin/docker-machine-driver-ubiquity"
time="2017-03-29T05:50:48Z" level=info msg="Done downloading all drivers"
time="2017-03-29T05:50:47Z" level=info msg="Starting websocket pings"
time="2017-03-29T05:51:01Z" level=info msg="Shutting down backend 56aa5a0a-fb87-464d-6c42-9a23b2e02de6. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-29T05:51:01Z" level=info msg="Removed backend. Key: 56aa5a0a-fb87-464d-6c42-9a23b2e02de6. Session ID abd1dd6c-fe84-45c6-a1e4-d6963cf98a03 ."
time="2017-03-29T05:51:04Z" level=info msg="Shutting down backend 41ba6d9a-10f9-4c46-61b2-b796a7c9dd3c. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-29T05:51:04Z" level=info msg="Removed backend. Key: 41ba6d9a-10f9-4c46-61b2-b796a7c9dd3c. Session ID 30c7734b-2c72-4e4a-ac15-86e7cf08ed06 ."
time="2017-03-29T05:51:05Z" level=info msg="Shutting down backend 9461f60b-2497-4458-5387-cafdfe548cea. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-29T05:51:05Z" level=info msg="Removed backend. Key: 9461f60b-2497-4458-5387-cafdfe548cea. Session ID 4e5dc44e-5228-4739-810c-ab4dcecb6ea2 ."
time="2017-03-29T05:51:06Z" level=info msg="Shutting down backend fc90e4e2-2e23-4900-4ca5-b669bc6b4e9e. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-29T05:51:06Z" level=info msg="Removed backend. Key: fc90e4e2-2e23-4900-4ca5-b669bc6b4e9e. Session ID 80361a87-a872-4d5c-bb77-273d71ed3c6e ."
time="2017-03-29T05:51:07Z" level=info msg="Shutting down backend 618f1923-255c-498c-666c-90cdee1a8c51. Connection closed because: websocket: close 1006 unexpected EOF."

ecliptik on 29 Mar 2017

Rebooting the EC2 instance that master is on seems to fix the issue, but we did this the day before and overnight the agents started flapping again, so something is triggering it within 24 hours.

Master info:
EC2 instance type m4.xlarge
Memory: 16Gb memory (free showed 12Gb free during flapping)
CPU: 4
Disk: /dev/xvda1 126G 38G 83G 32% /

jvm process,

java -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -Xms128m -Xmx4g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/cattle/logs -Dlogback.bootstrap.level=WARN -cp /usr/share/cattle/9949e600a7bcdd427ae7bc9d33bf477c:/usr/share/cattle/9949e600a7bcdd427ae7bc9d33bf477c/etc/cattle io.cattle.platform.launcher.Main

ecliptik on 29 Mar 2017

Can you open your browser's developer tab and look at the Rancher UI and let me know if the request to /v2-beta/suscribe (or /v1/subscribe) is succeeding or not? Its a websocket connection.

cjellick on 29 Mar 2017

@cjellick it looks like it succeeding

screen shot 2017-03-29 at 10 36 32 am

In Safari I am still getting the occasional Red Box of Death mentioned in issue https://github.com/rancher/rancher/issues/7389 too.

ecliptik on 29 Mar 2017

Succeeding? It got a 404

cjellick on 29 Mar 2017

That tells me that websockets in general seem to be not working (not just backed connections from agents).

Is this a setup that was working on a previous version of rancher and then was upgraded or is it a fresh setup? If upgrade, what version did you upgrade from?

My first reaction is to suspect a misconfiguration of the ELB. Was this particularly ELB part of a previously working setup? Has any of its configuration changed? Any chance you misconfigured the proxy protocol bit?

cjellick on 29 Mar 2017

@cjellick here's a timeline

Rancher 1.0 was running behind a ELB rancher-master with proxy_protocol setup (this was last fall) and was stable
Upgraded to Rancher 1.2 and followed recommended use of ALB instead of ELB
1.2.2 was running for a few months behind the ALB with minor stability issues and agent disconnects (and https://github.com/rancher/rancher/issues/7389 occasionally)
Upgraded to 1.5.2 last week and used original ELB rancher-master from 1.0

Here is our load balancer policy config with the aws cli as followed from https://docs.rancher.com/rancher/v1.5/en//installing-rancher/installing-server/basic-ssl-config/

Polices

aws elb describe-load-balancer-policies --load-balancer-name rancher-master --region=us-west-1
....
        {
            "PolicyAttributeDescriptions": [
                {
                    "AttributeName": "ProxyProtocol",
                    "AttributeValue": "true"
                }
            ],
            "PolicyName": "rancher-master-proxyprotocol-policy",
            "PolicyTypeName": "ProxyProtocolPolicyType"
        }

Load Balancer

aws elb describe-load-balancers --load-balancer-name rancher-master --region=us-west-1
{
    "LoadBalancerDescriptions": [
        {
            "Subnets": [
                "subnet-388e5861",
                "subnet-95bb2af0"
            ],
            "CanonicalHostedZoneNameID": "REDACTED",
            "CanonicalHostedZoneName": "rancher-master-2099044364.us-west-1.elb.amazonaws.com",
            "ListenerDescriptions": [
                {
                    "Listener": {
                        "InstancePort": 8081,
                        "LoadBalancerPort": 80,
                        "Protocol": "TCP",
                        "InstanceProtocol": "TCP"
                    },
                    "PolicyNames": []
                },
                {
                    "Listener": {
                        "InstancePort": 8080,
                        "SSLCertificateId": "arn:aws:iam::REDACTED:server-certificate/GoDaddy_11_2018_demandbase.com",
                        "LoadBalancerPort": 443,
                        "Protocol": "SSL",
                        "InstanceProtocol": "TCP"
                    },
                    "PolicyNames": [
                        "AWSConsole-SSLNegotiationPolicy-rancher-master-1464896788615"
                    ]
                }
            ],
            "HealthCheck": {
                "HealthyThreshold": 5,
                "Interval": 5,
                "Target": "HTTP:8080/ping",
                "Timeout": 4,
                "UnhealthyThreshold": 5
            },
            "VPCId": "vpc-4dc16a28",
            "BackendServerDescriptions": [
                {
                    "InstancePort": 80,
                    "PolicyNames": [
                        "rancher-master-proxyprotocol-policy"
                    ]
                },
                {
                    "InstancePort": 443,
                    "PolicyNames": [
                        "rancher-master-proxyprotocol-policy"
                    ]
                },
                {
                    "InstancePort": 8080,
                    "PolicyNames": [
                        "rancher-master-proxyprotocol-policy"
                    ]
                },
                {
                    "InstancePort": 8081,
                    "PolicyNames": [
                        "rancher-master-proxyprotocol-policy"
                    ]
                }
            ],
            "Instances": [
                {
                    "InstanceId": "i-0bd99aa655bd7c004"
                }
            ],
            "DNSName": "rancher-master-2099044364.us-west-1.elb.amazonaws.com",
            "SecurityGroups": [
                "sg-79ec971c",
                "sg-95fd43f1",
                "sg-52f38c37",
                "sg-afb534c8"
            ],
            "Policies": {
                "LBCookieStickinessPolicies": [],
                "AppCookieStickinessPolicies": [],
                "OtherPolicies": [
                    "AWSConsole-SSLNegotiationPolicy-rancher-master-1464896788615",
                    "ELBSecurityPolicy-2015-05",
                    "AWSConsole-SSLNegotiationPolicy-rancher-master-1464892662822",
                    "rancher-master-proxyprotocol-policy"
                ]
            },
            "LoadBalancerName": "rancher-master",
            "CreatedTime": "2016-06-02T18:37:42.860Z",
            "AvailabilityZones": [
                "us-west-1a",
                "us-west-1b"
            ],
            "Scheme": "internet-facing",
            "SourceSecurityGroup": {
                "OwnerAlias": "REDACTED",
                "GroupName": "spotinst"
            }
        }
    ]
}

The one thing differently we do is the ELB forwards port 80 -> 8081 which is a nginx container running. That container does a 301 re-direct, so if you go to http://rancher.demandbase.com it re-directs to https://rancher.demandbase.com so we're always going over SSL

879b177c0a25        demandbase/docker-nginx-https-redirect:proxy-protocol   "nginx"                  58 minutes ago      Up 58 minutes       443/tcp, 0.0.0.0:8081->80/tcp      elastic_shannon

Dockerfile for that container

https://github.com/Demandbase/docker-nginx-https-redirect/blob/master/Dockerfile.proxy_protocol

Nginx config with proxy_protocol support

https://github.com/Demandbase/docker-nginx-https-redirect/blob/master/nginx.conf.proxy_protocol

ecliptik on 29 Mar 2017

@ecliptik coudl you run:

docker logs <rancher server container id> 2>&1 | grep -B1000 'Shutting down'

and post the results?

cjellick on 29 Mar 2017

@cjellick nothing in the logs so far, this is a newer container though since we replaced the older m4.xlarge instance the container was running on with a c4.2xlarge since we were spiking the CPU when things were flapping.

ecliptik on 29 Mar 2017

@cjellick I also searched my console history which had logs from the earlier Rancher master that was experiencing the flapping and didn't see a Shutting down either. I'm keeping a watch on it now to see if it pops up again.

ecliptik on 29 Mar 2017

@ecliptik I looked closer at your browser screenshot and I htink you grabbed it from the wrong place. I meant just go into the normal Rancher UI, not the API view.

That said, since your setup is currently working, there's no point doing that now.

cjellick on 29 Mar 2017

@cjellick I did go to just the UI and it seems that in the Network tab of the dev console the socket is in pending

screen shot 2017-03-29 at 12 53 14 pm

but in the console it looks like it connected and opened,

screen shot 2017-03-29 at 12 53 22 pm

not sure if that means anything or not.

ecliptik on 29 Mar 2017

Thats normal bc it is a long lived websocket connection that is streaming data

cjellick on 29 Mar 2017

makes sense, if things start flapping again I'll bring up the console and see what it says.

ecliptik on 29 Mar 2017

came back to the web console after lunch and saw some Socket refusing messages, not sure if that's normal behavior though.

screen shot 2017-03-29 at 2 38 28 pm

ecliptik on 29 Mar 2017

@ecliptik RE seeing this in the agent logs:

time="2017-03-29T05:51:13Z" level=info msg="Starting websocket pings"
time="2017-03-29T15:40:40Z" level=error msg="Received error reading from socket. Exiting." error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2017-03-29T15:40:40Z" level=error msg="websocket: close 1006 (abnormal closure): unexpected EOF"

Are you seeing that at a regular interval? Do the times line up across hosts?

cjellick on 30 Mar 2017

What is your advertise-http-port set to?

cjellick on 30 Mar 2017

I just saw this happen again. This is starting to seriously affect us.

I pulled logs from one of the rancher-server hosts while this was happening. https://gist.github.com/jmreicha/4271b5343a0915e6a6ee94854b36d73a. Seeing tons of "Error dialing websocket backend." error="dial tcp 172.26.3.26:8080: getsockopt: connection refused and other socket errors on the rancher-servers.

The rancher-agent logs aren't showing much, just that they disconnect and reconnect. https://gist.github.com/jmreicha/7da0db598e7bcda170159ebddda93c17

The container reboot fixed things again eventually, but this time there was still memory available.

jmreicha on 30 Mar 2017

@jmreicha A couple follow up questions:

RE: "This time there was still memory available" - Why did you say that? Have you observed this being tied to OOM issues? Apologies if you already said this in slack, but could you restate it here?
How long since last time you rebooted the container? And how long sincne you last expereinced the problem?

cjellick on 30 Mar 2017

@jmreicha in your setup, are there any additional hops between your ELB and the rancher/server container (like an nginx or hapoxy container)?

cjellick on 30 Mar 2017

@cjellick For 1) Yes we saw this issue crop up a few days ago and observed that the host had basically no memory available to the sysem (~100mb). I have https://github.com/rancher/rancher/issues/8330 open for what I was seeing originally.

The last container reboot looks like it was about 3 weeks ago.

Here is the command I'm using to start up the rancher-server:

docker run -d --restart=unless-stopped \
        --log-driver=json-file \
        --log-opt max-size=100mb \
        --log-opt max-file=2 \
        -p 8080:8080 -p 9345:9345 \
        --name rancher-server rancher/server:${rancher_version} \
        --db-host ${db_host} \
        --db-name ${db_name} \
        --db-port ${db_port} \
        --db-user ${db_username} \
        --db-pass ${db_password} \
        --advertise-address $INTERNAL_IP

jmreicha on 30 Mar 2017

As for extra hops, I tried to follow the guide on the docs for HA setup as much as possible, so it is ELB -> rancher server, RDS in the backend and proxy protocol enabled. If it makes a difference, things were running smoothly for about a month or two, until just a few days ago.

jmreicha on 30 Mar 2017

Did it go bad after an upgrade? If so, from what version to what version?

cjellick on 30 Mar 2017

I've been running on 1.5.1 for about 3 to 4 weeks now. Before that, it was 1.4.2.

jmreicha on 30 Mar 2017

@ecliptik @jmreicha Thanks for the info, I'm going to dig into this issue now. I have some other setups where I think I've seen some related things.

ibuildthecloud on 30 Mar 2017

@cjellick we have the --advertise-address 10.9.1.97arg when we start up the Rancher master container. The IP is the private VPC IP of the EC2 instance.

The ELB we connect to is public:

;; QUESTION SECTION:
;rancher.demandbase.com.                IN      A

;; ANSWER SECTION:
rancher.demandbase.com. 35      IN      A       52.52.222.176
rancher.demandbase.com. 35      IN      A       52.9.76.25

Our setup is

rancher.demandbase.com DNS A record-> public ELB -> rancher ec2 instance in private VPC

all agents connect to the rancher.demandbase.com A record form different VPCs in different regions.

ecliptik on 30 Mar 2017

@ibuildthecloud Let me know if you need any other info, I can pull logs, etc.

jmreicha on 30 Mar 2017

To add to what @jmreicha said - it looks like every twelve hours (occuring in our logs at 9:43 and 21:43 UTC) all the websockets are disconnected, go-machine-service exits, and restarts, and websocket pings are restarted.

The previous two times (9:43 UTC and yesterday 2017-03-28 21:43 UTC) it seems to start the websocket pings and has no issue.

This time (2017-03-29 21:43 UTC) it started websocket pings but was immediately unable to reach any of the agents. This continued with ping errors in the rancher-server logs until we restarted the container at 22:00. Then we see one agent display a ping error before reconnecting successfully.

When did a deploy at 21:45 to one of our services, which got stuck in an Initializing state, and was dropped from our rancher/ha load balancer and caused 503s to show up until we restarted the rancher-server and also finalized the service upgrade (which, for some reason, involved it recreating the containers, probably the unhealthy/init timeout).

We're still looking to see if there's more info.

shakefu on 30 Mar 2017

One more thing to add for now. It looks like just before I started losing connectivity the io.cattle.platform.launcher.Main process starts hitting RDS fairly hard.

jmreicha on 30 Mar 2017

@shakefu that is almost exactly what we're experiencing, especially the websocket issues during those time periods. I'll keep an eye on it tonight to see if the same thing happens and try to get more info.

ecliptik on 30 Mar 2017

Due to our security policy, we locate our Rancher Servers in our Staging VPC which is in AWS us-west-1, and our Production hosts in our Production VPC which is in AWS us-east-1. We've run with this cross-VPC setup with Rancher 1.0.x for over a year without seeing this issue, as well as having run it on 1.4.1 for a couple weeks before upgrading to 1.5.x, which has been stable for several weeks until the last two days for us.

Here's the timeline of the most recent event we were able to put together:

Rancher-Server hosts

172.26.1.85 (was master)
172.26.3.26 (becomes master)

Timeline (UTC)

21:43 - [1.85] Backend web sockets close with unexpected EOF
21:43 - [1.85] go-machine-service exits and restarts
21:43 - [1.85] Start web socket ping
21:44 - [1.85] (+39s) Failed ping messages start and continue until restart at 22:00
21:49 - Service upgrade started by our QA team
21:55 - Service upgrade done, service in "Upgraded" state
22:00 - [1.85] rancher-server manually restarted
22:00 - [3.26] (+4s) rancher-server connection reset from other host (1.85)
22:01 - [3.26] Errors connecting to other host (1.85)
22:01 - [3.26] (+13s) Starts go-machine-service
22:01 - [3.26] (+3s) Appears to elect itself master ("Master address: none, using local")
22:01 - [3.26] Starts taking backend requests from hosts
22:01 - [3.26] "Unable to start event router" error="Timed out waiting for transtion." and "Exiting go-machine-service: Timed out waiting for transtion."
22:01 - [3.26] Starts go-machine-service
22:01 - [3.26] (+1s) Failed ping messages start and continues until 22:04
22:03 - [1.85] Failed ping messages to one host start and continue until 22:04
22:03 - Service upgrade is finished
22:04 - [1.85] (+29s) Failed ping messages end (No further logs)
22:04 - [3.26] (+1s) Failed ping messages end (No further logs)
22:04 - (+15s) Service stops showing 503s

shakefu on 30 Mar 2017

I'm still investigating this issue, but just two random things about 1.5. 1) Rancher 1.5 may use more memory than 1.4 due to caching that we added. 2) The max number of DB connections was set to 50. The DB connection setting being at 50 has proven to be problem for some setups. What happens is that connections spike (for whatever reason) and then the cluster heartbeat thread fails to get a DB connection from the pool and fails. If the cluster heartbeat fails we restart the server. To add more DB connections change the setting (look at the settings in http://localhost:8080/v1/settings)

db.cattle.maxidle=50
db.cattle.maxtotal=150

If you want to turn off the caching we added (which may slow down container deployments in large environments) then set.

cache.metadata=false

I suspect the DB connection pool setting may be the root cause of some of the peoples issues here.

ibuildthecloud on 30 Mar 2017

@ibuildthecloud Is there some scheduled task that you're aware of that would drive up those conns every 12 hours, and/or force the websockets to refresh?

shakefu on 30 Mar 2017

@ibuildthecloud Is there some metric or group of metrics that would be useful to monitor so we can catch this earlier potentially? We adjusted the settings you suggested, keeping an eye on things and our fingers crossed.

jmreicha on 30 Mar 2017

@shakefu there is a bit of code that is ultimately killing websocket connections every 12 hours. It's been in rancher for a very long time. It in and of itself should not cause this problem though. In fact, our design assumes that those connections can be dropped at any time. I guess the question is why, as of 1.5.X, why do agents fail to reconnect when a disconnect occurs.

cjellick on 30 Mar 2017

😕1

After updating our rancher master EC2 instance type to a c4.2xlarge yesterday afternoon it seems to have stabilized. We didn't have any agent disconnects and overall the health of the clusters looks good. This could be coincidental, but if there is something causing CPU spikes at specific times I'm wondering if a snowball effect happens when the system is under load.

Still would like to know the root cause of the issue and if it's something that is only solved with more resources or if it's a bug that could be addressed.

Here's a graph of the new instance running only the rancer/server container since it came online yesterday,

screen shot 2017-03-30 at 10 09 43 am

ecliptik on 30 Mar 2017

🎉1

@ecliptik We're going to try upping the connection pool (with the reco'd settings) before we increase our instance size to see if that helps... should know sometime after 2:43pm PDT today.

shakefu on 30 Mar 2017

👍1

I wonder if the bit of code that terminates websocket connections every 12 hours (from when the connection was established, not at a specific time and not the same time for all connections) can in some cases cause either a connection storm that spikes resources into an unrecoverable state. Or if we are leaking a resource related to those connections such that after a few rounds of dropping, we've exhausted that resource.

cjellick on 30 Mar 2017

@cjellick We manually turned on the hostRemoveDelaySeconds setting in the API originally in our production environment. We have since turned that setting off, but maybe that could be interfering with things?

I have logs from last night when the agents got disconnected from Rancher, hopefully they help.

jmreicha on 30 Mar 2017

New info. We were adding a new agent/server into the Rancher cluster this morning and we started seeing the flapping behavior. Lots of failures to connect and scheduling reconnects in the logs:

2017-03-30 17:32:09,552 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [245] count [6]
2017-03-30 17:32:09,553 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [245]
2017-03-30 17:32:09,594 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [695] count [6]
2017-03-30 17:32:09,595 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [695]
2017-03-30 17:32:09,634 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [534] count [6]
2017-03-30 17:32:09,635 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [534]
2017-03-30 17:32:09,683 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [716] count [6]
2017-03-30 17:32:09,683 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [716]
2017-03-30 17:32:09,723 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [760] count [6]
2017-03-30 17:32:09,723 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [760]
2017-03-30 17:32:09,775 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [956] count [6]
2017-03-30 17:32:09,776 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [956]
2017-03-30 17:32:09,822 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [1087] count [6]
2017-03-30 17:32:09,823 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [1087]
2017-03-30 17:32:09,868 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [848] count [6]
2017-03-30 17:32:09,868 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [848]
2017-03-30 17:32:09,918 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [1109] count [6]
2017-03-30 17:32:09,918 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [1109]
2017-03-30 17:32:09,963 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [1201] count [6]
2017-03-30 17:32:09,963 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Scheduling reconnect for [1201]

So the process of the new agent coming up seems to have triggered the flapping behavior.

jmreicha on 30 Mar 2017

We are going to try doubling the resources for the Rancher servers this afternoon to see if helps.

jmreicha on 30 Mar 2017

Appears pausing/unpausing some hosts cause it to go all haywire again. Restarted the master and it's taking a while to come back up. Applying the DB and Cache value changes @ibuildthecloud recommended to see if it helps.

ecliptik on 30 Mar 2017

After changing those settings and rebooting the master, the host reconnects instantly resolved, however stacks now have an issue of coming up with a Timeout getting IP address

Example log snippet from master

2017-03-30 19:58:50,268 ERROR [4bbd1f76-8733-40be-8db1-5fc1cb800f52:6492399] [instance:236579->instanceHostMap:133653] [instance.start->(InstanceStart)->instancehostmap.activate] [] [cutorService-32] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=25122]: Timeout getting IP address

Some infra services are also in Reinitializing like scheduler and convoy-nfs.

ecliptik on 30 Mar 2017

Looking at the ipsec infra stack logs, this keeps repeating over and over

3/30/2017 1:05:11 PM13[IKE] sending keep alive to 10.9.1.66[4500]
3/30/2017 1:05:11 PM09[NET] received packet: from 10.9.1.111[4500] to 10.42.199.43[4500] (465 bytes)
3/30/2017 1:05:11 PM09[ENC] parsed CREATE_CHILD_SA request 0 [ SA No KE ]
3/30/2017 1:05:11 PM09[IKE] 10.9.1.111 is initiating an IKE_SA
3/30/2017 1:05:11 PM09[IKE] scheduling rekeying in 12982s
3/30/2017 1:05:11 PM09[IKE] maximum IKE_SA lifetime 14422s
3/30/2017 1:05:11 PM09[IKE] IKE_SA conn-10.9.1.111[19703] rekeyed between 10.42.199.43[10.42.199.43]...10.9.1.111[10.42.53.6]
3/30/2017 1:05:11 PM09[ENC] generating CREATE_CHILD_SA response 0 [ SA No KE ]
3/30/2017 1:05:11 PM09[NET] sending packet: from 10.42.199.43[4500] to 10.9.1.111[4500] (405 bytes)
3/30/2017 1:05:11 PM04[NET] received packet: from 10.9.1.111[4500] to 10.42.199.43[4500] (65 bytes)
3/30/2017 1:05:11 PM04[ENC] parsed INFORMATIONAL request 1 [ D ]
3/30/2017 1:05:11 PM04[IKE] received DELETE for IKE_SA conn-10.9.1.111[19379]
3/30/2017 1:05:11 PM04[IKE] deleting IKE_SA conn-10.9.1.111[19379] between 10.42.199.43[10.42.199.43]...10.9.1.111[10.42.53.6]
3/30/2017 1:05:11 PM04[IKE] IKE_SA deleted
3/30/2017 1:05:11 PM04[ENC] generating INFORMATIONAL response 1 [ ]
3/30/2017 1:05:11 PM04[NET] sending packet: from 10.42.199.43[4500] to 10.9.1.111[4500] (57 bytes)
3/30/2017 1:05:11 PM10[JOB] deleting half open IKE_SA after timeout
3/30/2017 1:05:11 PM14[KNL] querying policy failed: No such file or directory (2)
3/30/2017 1:05:11 PM14[IKE] sending keep alive to 10.9.2.191[4500]
3/30/2017 1:05:11 PM13[JOB] deleting half open IKE_SA after timeout
3/30/2017 1:05:11 PM05[KNL] querying policy failed: No such file or directory (2)
3/30/2017 1:05:11 PM05[IKE] sending keep alive to 10.9.1.53[4500]
3/30/2017 1:05:11 PM12[KNL] querying policy failed: No such file or directory (2)
3/30/2017 1:05:11 PM12[IKE] sending keep alive to 10.9.1.111[4500]
3/30/2017 1:05:11 PM10[JOB] deleting half open IKE_SA after timeout
3/30/2017 1:05:11 PM08[JOB] deleting half open IKE_SA after timeout
3/30/2017 1:05:11 PM06[KNL] querying policy failed: No such file or directory (2)
3/30/2017 1:05:11 PM06[IKE] sending keep alive to 10.9.2.136[4500]
3/30/2017 1:05:11 PM04[KNL] querying policy failed: No such file or directory (2)
3/30/2017 1:05:11 PM04[IKE] sending keep alive to 10.9.2.30[4500]
3/30/2017 1:05:11 PM08[KNL] querying policy failed: No such file or directory (2)
3/30/2017 1:05:11 PM08[IKE] sending keep alive to 10.9.1.66[4500]
3/30/2017 1:05:11 PM15[KNL] querying policy failed: No such file or directory (2)

ecliptik on 30 Mar 2017

heads up, DO NOT set cache.metadata=false, stacks all start getting into weird stats like the log messages above. Thankfully setting it back to true and restarting master brings everything back okay.

ecliptik on 30 Mar 2017

👍1

Wellp, we upgraded our instances to double the RAM but didn't see the websocket cleaning on the 12 hour tick, which I'm guessing means the 12 hour rotation was reset to the time the server came up. I suppose we'll find out at 2:20am PDT whether it worked for our setup.

shakefu on 30 Mar 2017

@shakefu a few questions:

How big is your rancher deployment? Roughly how many hosts and how many containers. Ballparked to the closest 10 for hosts and the closest 100 for containers is fine
Are you you in AWS and behind an ELB?
Do you have any other proxies (nginx, haproxy) between the rancher/server containers and the ELB?

cjellick on 31 Mar 2017

@cjellick (I work with shakefu)

1) 11 hosts, 2 environments, ~330 containers total including infra stacks
2) Yes we are behind ELB, proxy protocol enabled for 8080 and 443
3) No other proxies involved in the Rancher server setup

jmreicha on 31 Mar 2017

@jmreicha are you at the same org as shakefu? Just wanted to clarify

cjellick on 31 Mar 2017

Yey, we managed to reproduce it again, by killing a spot instance we were using in our Staging environment via the AWS console.

The instance going into a "reconnecting" state didn't seem to cause issues, but it seemed like as soon as I manually "deactivated" that host, putting it in "deactivating" state, that all the other hosts, in both environments, started to flap the reconnecting loop.

It also appeared that the rancher/server containers became nearly unresponsive (took 1min+ to load the UI) while this was going on.

We were able to recover by restarting both the rancher/server HA containers.

shakefu on 31 Mar 2017

@cjellick Yeap, same org.

shakefu on 31 Mar 2017

Looks like the Deactivation was a spurious correlation (though it was impeccably timed.) We deactivated the old host when our auto-scaling group fired up a new instance.

We've reproduced it (again... ooh our poor infrastructure is taking a whoopin' today), by simply increasing the size of the ASG by one, to add a new host, which then causes all other hosts in all environments to start flapping.

It looks like this flapping happens before any containers are actually started (all were in the Storage state) with, I think, the only exception being the Network metadata sidekick. Hard to tell for sure since I was observing through the UI which was barely responsive. Could be a clue? Dunno.

These reproductions have all been done with the 50/150 DB conn setting, and 2x'd the RAM we had from earlier. Our monitoring is showing plenty of available RAM, CPU and we're well below the ulimits. Not really sure what else to poke at, at this point.

shakefu on 31 Mar 2017

@shakefu Are you using RDS? If so, is it under pressure? Does CPU or IOPs seemed to be spiked there? I'm told memory in RDS instances always appears fully consumed, so that one is less useful.

cjellick on 31 Mar 2017

@cjellick here's the 1hr graph of our RDS instance of when the hosts were flapping, you can see the change when I rebooted the master instance and things went back to stable.

screen shot 2017-03-30 at 5 23 26 pm

ecliptik on 31 Mar 2017

@cjellick During the flapping periods we see CPU spikes up to around 40%, write IOPS around 400-500, network throughput at around 20-25 Mbit/s.

Here's what our graphs look like as well:

jmreicha on 31 Mar 2017

@jmreicha or @ecliptik If you see the problem again, can you grab the syslogs from the host os? Want to see if there is anything in there related to running out of memory in there.

cjellick on 31 Mar 2017

@cjellick I don't have the syslogs, but we do have a New Relic agent installed on the master server and it has system stats from today when we were seeing the flapping from just before 12:00PM to when we rebooted around 1:30PM

screen shot 2017-03-30 at 9 06 58 pm

screen shot 2017-03-30 at 9 08 32 pm

ecliptik on 31 Mar 2017

@cjellick I have systemd logs, there's not much in there unfortunately but I can get them to you tomorrow.

Memory usage appears to be low for us as well, never reaching above 15-20% used on the host and JVM memory never maxing out according to Sysdig. The main things that I've noticed are the disk throughput on RDS, CPU spikes on the Rancher servers and connections flipping between hosts when the master server reboots and all its connections are picked up by the other server.

jmreicha on 31 Mar 2017

@cjellick The Rancher server started doing it's things last night right at 3 am. Restarting the container was the fix. We're starting to run out of ideas.

jmreicha on 31 Mar 2017

Only other thing I can think of that changed is that we upgraded the scheduler a little over a week ago. Could that have anything to do with the Rancher servers acting up?

jmreicha on 31 Mar 2017

@jmreicha No, problems with the scheduler would not manifest as agents disconnecting.

cjellick on 31 Mar 2017

worked with @cjellick in slack, and we can reproduce the issue on-demand,

Pause a host in any environment
Unpause the host
Hosts in all environments will start flapping

The flapping stops if on the master you issue sudo killall -2 websocket-proxy or by rebooting the host.

Gave additional logs to @cjellick to look over with goroutine dumps from doing a kill -2 <pidofwebsocket-proxy>. Only akillall -2 websockt-proxyseems to stop the flapping, not akill -2 `.

ecliptik on 31 Mar 2017

To clarify pause/unpause are the Rancher UI deactivate/activate functions

cjellick on 31 Mar 2017

@cjellick as discussed in Slack, I created a feature request for a heartbeat network as a suggestion to help mitigate this type of flapping scenario https://github.com/rancher/rancher/issues/8380

ecliptik on 1 Apr 2017

I also have this issue. I looks like changing the database charset to utf8 and collation to utf8_general_ci like recommended fixed it so far.
I used this command to do it :

DB="cattle"; ( echo 'SET foreign_key_checks = 0;'; echo 'ALTER DATABASE `'"$DB"'` CHARACTER SET utf8 COLLATE utf8_general_ci;'; mysql "$DB" -e "SHOW TABLES" --batch --skip-column-names | xargs -I{} echo 'ALTER TABLE `'{}'` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;' ; echo 'SET foreign_key_checks = 1;' ) | mysql "$DB"

mcanevet on 5 Apr 2017

Tested with rancher server version - v1.5.4-rc1 have 3 hosts.
Set up had ~45 global services running.
Deactivate host.
Activate host.
After this we do not see any hosts in the set up going to "reconnecting" state.

When a similar scenario was tested on v1.5.3, all the host in the environment was flapping ( switching between "reconnecting" and "active")

sangeethah on 5 Apr 2017

🎉3

@mcanevet the utf settings on the DB really shouldnt have anything to do with this particular bug. If changing those settings caused you to have to either restart rancher or sever connections from rancher to the DB, it may have appeared to fix it, but I don't think that actually fixed it.

Note that if you are experiencing this problem, you can disable the scheduler service as a work around until 1.5.4 comes out. Though, disabling the scheduler service will cause the following scheduling functions to not work properly:

proper even spreading of containers across hosts (spread will be random instead)
Scheduling based on memory and cpu fields on containers
The "Require Container Label" scheduling feature that you can see on the Edit Host screen

cjellick on 6 Apr 2017

Rancher: Agents Randomly Flapping since Rancher 1.5.2 Upgrade

Most helpful comment

All 71 comments

Related issues