For some reason (I haven't been able to reproduce this locally yet), under heavy load some of our okhttp clients can get into an infinite loop (eating up one or more CPU cores) when connection failure retry is enabled. It seems to affect only some of the connections in the connection pool.
If I disable connection failure retry then a SocketException is thrown at https://github.com/square/okhttp/blob/master/okhttp/src/main/java/okhttp3/internal/connection/RealConnection.kt#L594 with the following stack trace:
java.net.SocketException: Socket is closed
at java.net.Socket.setSoTimeout(Unknown Source)
at sun.security.ssl.BaseSSLSocketImpl.setSoTimeout(Unknown Source)
at sun.security.ssl.SSLSocketImpl.setSoTimeout(Unknown Source)
at okhttp3.internal.connection.RealConnection.newCodec$okhttp(RealConnection.kt:594)
at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:83)
at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:245)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:82)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:74)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:197)
at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:502)
In this case a certain percentage of the requests fail constantly, suggesting a faulty pooled connection that is stuck and never evicted.
The only option is to restart the affected application.
I think the cause is related to request cancellation and concurrency. Perhaps the socket is closed by a different thread right after creation?
This issue has been observed since at least 3.14.0, and it is also happening with the latest release, 4.4.0.
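For reference, disabling connection failure retry is done through the public builder flag; a minimal sketch (the client setup itself is just placeholder code):

```kotlin
import okhttp3.OkHttpClient

// Minimal sketch of how the retry flag is toggled for the experiments above.
// retryOnConnectionFailure is the public OkHttpClient.Builder option.
val client = OkHttpClient.Builder()
    .retryOnConnectionFailure(false) // surfaces the SocketException immediately instead of looping
    .build()
```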
Thanks for the report. I'm disappointed to learn you're seeing this with 4.4. With that update we tried to address issues similar to this.
Will try to reproduce and fix with urgency!
I've been trying to reproduce the problem locally for the last few days without luck. While I could put together conditions where the very same error was thrown from the same location, I couldn't make it get stuck in a loop.
Meanwhile, we got several thread dumps from production. Most of the stuck threads looked like this:
"OkHttp https://service.xxx/..." #14950 prio=5 os_prio=0 cpu=573879.69ms elapsed=1946.81s tid=0x0000000008fba000 nid=0x3a85 runnable [0x00007fe30e48a000]
java.lang.Thread.State: RUNNABLE
at java.lang.Throwable.fillInStackTrace(java.base/Native Method)
at java.lang.Throwable.fillInStackTrace(java.base/Unknown Source)
- locked <0x00000007ffe459a0> (a okhttp3.internal.connection.RouteException)
at java.lang.Throwable.<init>(java.base/Unknown Source)
at java.lang.Exception.<init>(java.base/Unknown Source)
at java.lang.RuntimeException.<init>(java.base/Unknown Source)
at okhttp3.internal.connection.RouteException.<init>(RouteException.kt:25)
at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:89)
at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:245)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:82)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:74)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:197)
at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:502)
at brave.propagation.CurrentTraceContext$1CurrentTraceContextRunnable.run(CurrentTraceContext.java:246)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/Unknown Source)
at java.lang.Thread.run(java.base/Unknown Source)
Hope this helps.
nit: I don't think the thread is actually stuck. I suspect this is just the most expensive part of the failure handling, running in a tight loop or similar.
I don't think we can help here; I'd just suggest adding some debugging. For example, if you add the LoggingEventListener it will show whether the client is cycling between multiple DNS results for the server, causing additional client load unless you disable the retries.
Sorry we can't help here.
edit: I thought it was failing in the Zipkin sender, but I assume that's just in the stack trace because of trace propagation, etc.
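Roughly like this (a minimal sketch; LoggingEventListener.Factory comes from the okhttp-logging-interceptor artifact):

```kotlin
import okhttp3.OkHttpClient
import okhttp3.logging.LoggingEventListener

// Minimal sketch: attach OkHttp's LoggingEventListener so every call logs its
// events, including dnsEnd results and connect/connectFailed attempts.
val client = OkHttpClient.Builder()
    .eventListenerFactory(LoggingEventListener.Factory())
    .build()
```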
We already gather and log events but it's hard to correlate a specific request with the start of this issue. I'm going to add more tracing information and get back to you.
With event logs in place, I managed to narrow down the issue to the first failing request.
callStart=0 ms
proxySelectStart=0 ms
proxySelectEnd=0 ms
dnsStart=0 ms
dnsEnd=101 ms
connectionAcquired=101 ms
requestBodyStart=102 ms
requestBodyEnd=102 ms
requestHeadersStart=102 ms
requestHeadersEnd=102 ms
canceled=5832 ms
responseFailed=17499 ms
connectionReleased=17504 ms
callFailed=17504 ms
The weird thing is that there are no connect* or secureConnect* events here.
The rest of the calls for (presumably) this connection look like the following, with a 2000 ms timeout applied on top:
callStart=0 ms
connectionAcquired=0 ms
canceled=2000 ms
connectionReleased=2000 ms
callFailed=2000 ms
I'm investigating further how this can happen.
Thanks for the logging, but I'm not immediately sure how to help here.
@dave-r12 Any chance this is related? https://github.com/square/okhttp/pull/5888
@danielimre the missing connect* means we were able to reuse an existing connection.
@yschimke yes that's possible!
Thanks @dave-r12. Meanwhile, I also figured that out. I tried several scenarios but couldn't reproduce the exact same issue locally.
My best guess is still some race condition in error tracking and request cancellation.
As the issue only surfaces when connection retries are enabled, a possible workaround would be to limit the maximum number of retries, e.g. in okhttp3.internal.http.RetryAndFollowUpInterceptor.
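Roughly the idea, as a hypothetical sketch only (MAX_RETRIES and the helper are invented names; the real change would have to live inside okhttp's internal retry loop):

```kotlin
import java.io.IOException

// Hypothetical sketch of the capped-retry idea; MAX_RETRIES and retryCapped
// are invented names, not okhttp code. The real change would go into
// okhttp3.internal.http.RetryAndFollowUpInterceptor's while(true) loop.
const val MAX_RETRIES = 20

fun <T> retryCapped(isRecoverable: (IOException) -> Boolean, attemptOnce: () -> T): T {
    var attempts = 0
    while (true) {
        try {
            return attemptOnce()
        } catch (e: IOException) {
            // Give up once the failure is unrecoverable or the cap is reached,
            // instead of spinning forever on the same broken connection.
            if (!isRecoverable(e) || ++attempts >= MAX_RETRIES) throw e
        }
    }
}
```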
Unfortunately, I couldn't spend more time on this issue and had to look for an alternative :(
@danielimre I'm actively investigating this. Can you help me with more info? I think you are using HTTP/1 and the async API. When you see this, any idea how many requests are pending? In-flight?
While I could put together conditions where the very same error was thrown from the same location, I couldn't make it get stuck in a loop.
Can you post a test case or sample code?
@dave-r12 yes, this is HTTP/1 with the async API, using retrofit2.adapter.rxjava2.RxJava2CallAdapterFactory#createAsync
I cannot really provide any meaningful sample code. What I was doing was just firing requests against a local web server. I was running my tests in debug mode from IntelliJ with logging-only breakpoints (not suspending any threads) on selected okhttp internals.
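For completeness, the client setup is along these lines (a sketch; the base URL is a placeholder and no converter is shown):

```kotlin
import okhttp3.OkHttpClient
import retrofit2.Retrofit
import retrofit2.adapter.rxjava2.RxJava2CallAdapterFactory

// Sketch of the setup described above; the base URL is a placeholder.
// createAsync() makes Retrofit use OkHttp's async enqueue() path, which is
// the code path visible in the stack traces above.
val client = OkHttpClient.Builder().build()

val retrofit = Retrofit.Builder()
    .baseUrl("https://service.example/")
    .client(client)
    .addCallAdapterFactory(RxJava2CallAdapterFactory.createAsync())
    .build()
```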
@danielimre gotcha thanks.
Just making sure we're on the same page, you listed events:
callStart=0 ms
connectionAcquired=0 ms
canceled=2000 ms
connectionReleased=2000 ms
callFailed=2000 ms
I was thinking that if we were in an infinite loop, calls would not be completing. But this one did complete (it failed). Do you happen to have the events from when it goes into an infinite loop? In that case I'd expect the call to never complete.
@dave-r12 , here's my theory:
In okhttp3.internal.http.RetryAndFollowUpInterceptor there's a while(true) loop at line 65.
Eventually, at line 76 it gets a connection that is broken (it has a closed socket), so a SocketException is thrown when setSoTimeout is called on it.
The exception is caught, and at line 89 it checks whether the call can be recovered. It decides it can, so it retries the loop with the same connection.
This fails over and over again until the call is interrupted by an external timeout through cancel. In our case this was a timeout on the RxJava flow.
Since no events or logs are emitted during this period (an existing connection is reused), it's hard to track what is actually going on.
If connection retries are disabled, the SocketException is thrown immediately. While that prevents the busy spin, it doesn't fix the problem, as the connection is kept in the pool and will be reused, failing a bunch of requests.
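To illustrate the suspected spin, here's a heavily simplified, self-contained model (the classes below are made up; this is not the actual interceptor source):

```kotlin
import java.io.IOException
import java.net.SocketException

// Hypothetical model of the suspected failure mode; FakeConnection and
// StuckPool are made up. A pooled connection whose socket is already closed
// keeps being handed out, every attempt throws SocketException("Socket is
// closed"), the failure is judged recoverable, and the loop spins until the
// call is cancelled from outside.
class FakeConnection(val socketClosed: Boolean)

class StuckPool(private val pooled: FakeConnection) {
    fun acquire(): FakeConnection = pooled // always returns the same pooled connection
}

fun executeWithRetries(pool: StuckPool, isCanceled: () -> Boolean): String {
    while (true) {
        if (isCanceled()) throw IOException("Canceled")
        val connection = pool.acquire()
        try {
            // In the real code this is where newCodec() calls setSoTimeout
            // on the already-closed socket.
            if (connection.socketClosed) throw SocketException("Socket is closed")
            return "response"
        } catch (e: SocketException) {
            // The recoverability check decides the call can be retried; the
            // broken connection is never evicted, so we go around again.
        }
    }
}
```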
What I couldn't figure out is how the connection can become broken.
I tried a few scenarios, like a connection serving a single successful request and then timing out on the next one, which closes the socket, yet the connection is kept for some reason. In this case, RealConnection has a special flow at line 713 which looked suspicious.
Unfortunately, I didn't manage to reproduce this from a test unless I tampered with some variables in debug mode. Maybe a threading and local-cache visibility problem (a missing volatile) outside of synchronized blocks?
@danielimre ahh right. I spotted one code path that might explain this. But it's pretty extraordinary.
When the application gets in this state are any requests making it to the server? Or no requests make it?
@dave-r12 When this happens, a fixed percentage of the calls fails, suggesting that only a limited number of connections are affected.
I'd highlight that this only happens at high load, when we are hitting CPU limits: either at startup when the instances are still cold, or during heavy traffic spikes.
@danielimre thanks. I believe I was able to identify a code path that reproduces the exception and behavior you are seeing.
Thanks for the release. Haven't seen the issue with 4.5.0 for a week now.
Thanks for the feedback, really useful to know. I'm hopeful Dave's nice fix is for a class of problems, so let us know if it reoccurs somehow.