Kubernetes-client: Connections are leaked when java.net.ProtocolException occurs while a WebSocket establishing

Created on 4 Jun 2018  路  24Comments  路  Source: fabric8io/kubernetes-client

Hi folks,

I'm from Eclipse Che Team, we use fabric8-kubernetes client to work with Kubernetes/OpenShift clusters.

I discovered an issue with connections leaks when java.net.ProtocolException occurs while a WebSocket establishing. An example of it is trying to execute a command in a non-existing pod.

It can be stably reproduced on Mock KubernetesServer or real K8s/OS cluster. I have prepared a test for that in my fork https://github.com/sleshchenko/kubernetes-client/blob/connectionLeaksTest/kubernetes-tests/src/test/java/io/fabric8/kubernetes/client/mock/ExecTest.java

#testConnectionLeaksAfterExecutingCommandInNonExistingPodOnMockServer for reproducing on mock kubernetes server.

#testConnectionLeaksAfterExecutingCommandInNonExistingPodOnRealServer can be used for reproducing on real Kubernetes/OpenShift server. Note that MASTER_URL and OAUTH_TOKEN should be configured.

Note that System.gc(); just accelerates appearance connection was leaked messages and this bug can be reproduced without invocation of garbage collector.

So, HOW TO REPRODUCE IT:

  1. Clone fork
  2. Run test, for example, #testConnectionLeaksAfterExecutingCommandInNonExistingPodOnMockServer.
  3. Wait ~10 minutes. For me after 5th iteration(after 5 minutes) logs appears.
...
10:57:11.354 [main] INFO  io.fabric8.kubernetes.client.mock.WatchTest - #######################################################
10:57:11.354 [main] INFO  io.fabric8.kubernetes.client.mock.WatchTest - ####                                               ####
10:57:11.354 [main] INFO  io.fabric8.kubernetes.client.mock.WatchTest - ####                   Iteration #  5              ####
10:57:11.354 [main] INFO  io.fabric8.kubernetes.client.mock.WatchTest - ####                                               ####
10:57:11.354 [main] INFO  io.fabric8.kubernetes.client.mock.WatchTest - #######################################################
10:57:11.354 [main] INFO  io.fabric8.kubernetes.client.mock.WatchTest - Try to execute 5 echo command in non-existing pod
Jun 04, 2018 10:57:11 AM okhttp3.mockwebserver.MockWebServer$3 processOneRequest
INFO: MockWebServer[53229] received request: GET /api/v1/namespaces/test/pods/non-existing/exec?command=sh&command=-c&command=echo%20hello&stderr=true HTTP/1.1 and responded: HTTP/1.1 404 Client Error
Jun 04, 2018 10:57:11 AM okhttp3.mockwebserver.MockWebServer$3 processOneRequest
INFO: MockWebServer[53229] received request: GET /api/v1/namespaces/test/pods/non-existing/exec?command=sh&command=-c&command=echo%20hello&stderr=true HTTP/1.1 and responded: HTTP/1.1 404 Client Error
10:57:11.359 [OkHttp http://localhost:53229/...] ERROR io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener - Exec Failure: HTTP:404. Message:
java.net.ProtocolException: Expected HTTP 101 response but was '404 Client Error'
    at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:219) [okhttp-3.9.1.jar:?]
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:186) [okhttp-3.9.1.jar:?]
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:153) [okhttp-3.9.1.jar:?]
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [okhttp-3.9.1.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
10:57:11.360 [main] ERROR io.fabric8.kubernetes.client.mock.WatchTest - ERROR: Exception occurred: 
...
10:57:11.373 [main] INFO  io.fabric8.kubernetes.client.mock.WatchTest - Wait 1 minute
Jun 04, 2018 10:58:10 AM okhttp3.internal.platform.Platform log
WARNING: A connection to http://localhost:53229/ was leaked. Did you forget to close a response body? To see where this was allocated, set the OkHttpClient logger level to FINE: Logger.getLogger(OkHttpClient.class.getName()).setLevel(Level.FINE);
Jun 04, 2018 10:58:10 AM okhttp3.internal.platform.Platform log
WARNING: A connection to http://localhost:53229/ was leaked. Did you forget to close a response body? To see where this was allocated, set the OkHttpClient logger level to FINE: Logger.getLogger(OkHttpClient.class.getName()).setLevel(Level.FINE);
Jun 04, 2018 10:58:10 AM okhttp3.internal.platform.Platform log
...
bug statustale

Most helpful comment

Hm, One more place where the response is not closed properly . This is coming from here:https://github.com/fabric8io/kubernetes-client/blob/3b5c121c95725abfb6c4150c1e60515e2f8c3d1c/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/PodOperationsImpl.java#L265-L268

All 24 comments

Hey @sleshchenko

Can you please provide the information about client version you are using,

This has already been reported here https://github.com/fabric8io/kubernetes-client/issues/1013 and the fix is available in Client 3.1.12 If you are still facing this issue, we can look at it again.

@piyush1594 Thanks for really quick response. We use 3.1.0 kubernetes-client, but the test that I provided is based on master branch. I guess I'll reproduce it easily on 3.1.12. Let me try.

@piyush1594 I tried and I confirm that it can be reproduced on Kubernetes client 3.1.12. Here is a branch based on 3.1.12 for testing https://github.com/sleshchenko/kubernetes-client/commits/connectionLeaksTest_v3.1.12

Thanks @sleshchenko We will take a look at it. cc @rohanKanojia If you can take a look.

@sleshchenko can you try close the stream/connection to avoid connection leaks?

private void doExec(KubernetesClient client) {
    ExecWatch exec = null;
    try {
      exec = client
        .pods()
        .inNamespace("test")
        .withName("non-existing")
        .redirectingError()
        .exec("sh", "-c", "echo hello")

    } catch (Exception e) {

    } finally {
      // null check
     exec.close();
    }
  }

@hrishin I think that try-with-resources statement should it do for me. But of course, I can try to close it obviously.

@hrishin Good news, try-with-resources statement works as it supposed to do. So, obviously closing of stream/connection doesn't fix an issue.

Hm, One more place where the response is not closed properly . This is coming from here:https://github.com/fabric8io/kubernetes-client/blob/3b5c121c95725abfb6c4150c1e60515e2f8c3d1c/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/PodOperationsImpl.java#L265-L268

@rohanKanojia can we address this along with #1067?

Lurking due to my interest in #1013.

I am no expert but note that the response body seems not to be closed here: https://github.com/fabric8io/kubernetes-client/blob/7ea9ce885f9fad72949db6e3f5cc8dceb8ce8766/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/ExecWebSocketListener.java#L211-L237

Also, OperationSupport#createStatus(Response) accesses the body in full by calling string() on it (rendering further access on it impossible) but does not close it. Maybe a close() call should happen in there?

Hopefully this helps.

Working on a PR for further discussion.

@ljnelson : Yep, I also feel that response body should be closed(I've tried that too). But the test case provided by @sleshchenko seems to be producing leaks every time.

See #1198. Again, this is mainly for discussion. If this does no (further) harm, then I'll put together a PR to do an analogous set of operations in the WatchConnectionManager.

I went ahead and copied the test into my own branch and pared it way down so that it runs only against the mocks. Whether or not the responses surfaced in the test are closed, connections eventually leak, or at least the leak detection algorithm reports that they do.

@sleshchenko @rohanKanojia Making a little more progress here.

This screenshot:

screen shot 2018-09-10 at 4 04 00 pm

鈥s from a debugging session and shows the "innards" of a RealConnection in the OkHttp connection pool during one iteration of the ExecTest. The leak detection is time-based, not iteration based:

https://github.com/square/okhttp/blob/5b37cda9e00626f43acf354df145fd452c3031f1/okhttp/src/main/java/okhttp3/ConnectionPool.java#L86

So after five minutes the cleanup thread goes through all allocations associated with a given connection. If one's referent is null, because the backing WeakReference has had its referent garbage collected, then you get the leak messages. What you see in the screenshot above is one such allocation; note that the referent field is null.

The ExecTest is nice in that (in my pared down version) no websocket is actually created because of course there is no pod to connect to. This means in terms of listeners and whatnot only the onFailure method is ever invoked. So we can continue attacking this from just this one simple case.

Closing the Response has no effect on the connection allocations in the case of websockets since (I think鈥攏ot fully confident here) the response has effectively been "upgraded" to websocket streams:

https://github.com/square/okhttp/blob/5b37cda9e00626f43acf354df145fd452c3031f1/okhttp/src/main/java/okhttp3/internal/ws/RealWebSocket.java#L202-L205

This also means that as far as _this_ project is concerned there's nothing we can do here that isn't already being done I don't think. The only thing I could think of is to forcibly cancel the websocket when there's a failure. This is a little brute forcey. If you follow the code from here:

https://github.com/square/okhttp/blob/5b37cda9e00626f43acf354df145fd452c3031f1/okhttp/src/main/java/okhttp3/internal/ws/RealWebSocket.java#L176

鈥o here:

https://github.com/square/okhttp/blob/5b37cda9e00626f43acf354df145fd452c3031f1/okhttp/src/main/java/okhttp3/RealCall.java#L104

鈥o here:

https://github.com/square/okhttp/blob/dd720d260a7073b66e38d73f3956706c5c7a0b84/okhttp/src/main/java/okhttp3/internal/http/RetryAndFollowUpInterceptor.java#L87

鈥ou can see that that seems to be the only path by which a StreamAllocation is ever de-allocated before the cleanup thread gets to it (at least in the case of websockets).

So what is to be done? I still believe that the changes I've made in PR #1198 are correct and necessary (they close ResponseBody objects where previously they were not being closed) but they will not ever make this particular unit test "succeed". Or, to look at it another way, this test is already succeeding: the cleanup thread just happens to be the only way that allocations in this particular case are ever deallocated.

To boil all this down further:

I think that what ExecTest shows is that _in the case of failures_ stream allocations inside the OkHttp library are not being released. Since at least in the case of ExecTest the failure in question doesn't even result in a websocket being created, we _probably_ (maybe? 馃槃 ) should cancel() the WebSocket in ExecWebSocketListener after calling the onFailure() method.

In the case of a normal websocket closing, I think the streams are explicitly deallocated.

I have tried using cancel() calls at the right places in ExecWebSocketListener and this did not work. So much for that theory!

Getting closer.

The problem is definitely in the OkHttp library.

When you debug the initial HTTP call to the mock server to establish the websocket (which will never complete because the pod doesn't exist), you will end up here:

https://github.com/square/okhttp/blob/0b2486f7f32fbe21c8630f330d240546a15beb66/okhttp/src/main/java/okhttp3/internal/connection/ConnectInterceptor.java#L42

That allocates a new connection. The response that comes back here is of course, ultimately, a 404. Because the pod doesn't exist and therefore isn't found, this connection is not ever upgraded to a websocket connection (obviously). Other OkHttp code figures out that the websocket attempt needs to fail, and all the listeners and whatnot are invoked. The stream is never released. There's nothing that this project can do (that I can see) to release it without hacking.

@ljnelson : FYI, I've also filed an issue on okhttp, we can move the discussion there also if you think fix lies in okhttp. WDYT??

Yes, that sounds good. Let's keep track of https://github.com/square/okhttp/issues/4258.

You can put all your findings on that issue, I linked most of the things done till last weekend. There we can get much broader idea of this problem from okhttp maintainers :-)

This issue has been automatically marked as stale because it has not had any activity since 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!

Was this page helpful?
0 / 5 - 0 ratings