Description: When downloading many medium-sized files from an Nginx HTTP/2 server, if requests are frequently canceled (RST_STREAM), the connection eventually times out and can no longer send or receive any data from the server.
On the server side, configure Nginx to support HTTP/2 with a self-signed certificate and host about 100 files of roughly 500 KB each.
On the client side, I built an Android test that uses a single thread to download file no. 1 through file no. 100 consecutively. During each download, the request is canceled before proceeding to the next file. Eventually, a request times out. Depending on the setup, the timeout may happen on a different download, but once the testing environment is fixed, it always fails on the same file.
Server info: nginx/1.13.10
Client info: Android 6.0 with OKHttp 3.10.0
This issue is reproducible against multiple servers, for example a Go HTTP/2 server and AWS CloudFront.
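For reference, the client side of that test might look roughly like the sketch below. This is a minimal illustration, not code from the linked project: the chunk size and the point at which the call is canceled are assumptions, and trust configuration for the self-signed certificate is omitted.

import java.io.IOException;
import okhttp3.Call;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

public class CancelLoopSketch {
  public static void main(String[] args) {
    OkHttpClient client = new OkHttpClient();  // h2 is negotiated via ALPN on HTTPS
    for (int i = 1; i <= 100; i++) {
      Request request = new Request.Builder()
          .url("https://10.0.0.186/files/" + i)
          .build();
      Call call = client.newCall(request);
      try (Response response = call.execute()) {
        // Read only the first few KB of the ~500 KB body, then cancel the call
        // mid-download, leaving the rest of the stream's data undelivered.
        byte[] chunk = new byte[8192];
        response.body().source().read(chunk, 0, chunk.length);
        call.cancel();
      } catch (IOException e) {
        // After enough canceled streams, this eventually surfaces as a timeout.
        System.out.println("file " + i + " failed: " + e);
      }
    }
  }
}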
A server log filtered to sid:75 shows that the HEADERS frame was sent at 14:39:32 and the stream then timed out after 60 seconds. Note that it is not always the HEADERS frame; sometimes it is a DATA frame that never gets sent out.
2018/03/05 14:39:32 [debug] 92282#0: *4 http2 frame type:1 f:5 l:14 sid:75
2018/03/05 14:39:32 [debug] 92282#0: *4 http2 HEADERS frame sid:75 depends on 0 excl:0 weight:16
2018/03/05 14:39:32 [debug] 92282#0: *4 http2 frame out: 00007FD4B082ABA0 sid:75 bl:1 len:122
2018/03/05 14:39:32 [debug] 92282#0: *4 http2 frame sent: 00007FD4B082ABA0 sid:75 bl:1 len:122
2018/03/05 14:40:32 [debug] 92282#0: *4 http2 send RST_STREAM frame sid:75, status:1
Meanwhile, the client log shows the headers were not received until 14:40:32:
03-05 14:39:32.229 5979-5995/? V/VB_Network: Sending request https://10.0.0.186/files/37 on Connection{10.0.0.186:443, proxy=DIRECT hostAddress=/10.0.0.186:443 cipherSuite=TLS_RSA_WITH_AES_128_CBC_SHA protocol=h2}
User-Agent:
Host: 10.0.0.186
Connection: Keep-Alive
Accept-Encoding: gzip
03-05 14:40:32.397 5979-5995/? V/VB_Network: Received response for https://10.0.0.186/files/37 (h2) in 60167.4ms
server: nginx/1.13.10
date: Mon, 05 Mar 2018 22:39:32 GMT
content-type: application/octet-stream
content-length: 562176
last-modified: Sun, 04 Mar 2018 00:23:13 GMT
etag: "5a9b3c71-89400"
accept-ranges: bytes
03-05 14:40:32.400 5979-5995/? E/VB_Test: stream was reset: PROTOCOL_ERROR https://10.0.0.186/files/37
At first, I thought it was a server issue, so I filed a bug with the Nginx team. They replied that this happens because the client never sends a WINDOW_UPDATE frame after the initial connection setup.
Server Full Log
Client Full Log
The test project is under git@github.com:jifang/nginx_bug_repo1.git
Thanks @jifang for including the GitHub project to reproduce it. Not sure who is going to look at this first. N.B. have you looked at the Http2 frame logging?
Excuse the Kotlin:
// Attach a FINE-level handler so the frame log actually shows up (assumed setup).
val activeLogger = java.util.logging.Logger.getLogger(okhttp3.internal.http2.Http2::class.java.name)
activeLogger.level = java.util.logging.Level.FINE
activeLogger.addHandler(java.util.logging.ConsoleHandler().apply { level = java.util.logging.Level.FINE })
Should give output like
>> CONNECTION 505249202a20485454502f322e300d0a0d0a534d0d0a0d0a
>> 0x00000000 6 SETTINGS
>> 0x00000000 4 WINDOW_UPDATE
<< 0x00000000 18 SETTINGS
>> 0x00000000 0 SETTINGS ACK
>> 0x00000003 65 HEADERS END_STREAM|END_HEADERS
<< 0x00000000 0 SETTINGS ACK
<< 0x00000003 209 HEADERS END_HEADERS
<< 0x00000003 14181 DATA
<< 0x00000003 2829 DATA
<< 0x00000003 16375 DATA
<< 0x00000003 8505 DATA
<< 0x00000003 5667 DATA
<< 0x00000003 16384 DATA
<< 0x00000003 16366 DATA
<< 0x00000003 11343 DATA
<< 0x00000003 5667 DATA
<< 0x00000003 5083 DATA END_STREAM
>> 0x00000000 8 GOAWAY
Thanks @yschimke for the tips.
Here is the client log filtered by Http2
The test project is also updated.
I have a theory.
In our HTTP/2 code we don't ack bytes until they're delivered to the application. It's insufficient to get them to the Http2Stream; they actually need to be read from the stream.
I'm guessing that when a stream is cancelled we need to do something with the leftover data.
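To put rough numbers on that theory: every DATA frame the server sends consumes connection-level flow-control window, and the window is only replenished via WINDOW_UPDATE once those bytes are actually read. The sketch below is illustrative arithmetic only, not OkHttp internals; the window size and per-cancel leftover are assumed values.

public class WindowMathSketch {
  public static void main(String[] args) {
    long connectionWindow = 65_535;   // HTTP/2 default initial window (SETTINGS can raise it)
    long leftoverPerCancel = 16_384;  // bytes buffered but never read before each cancel (assumed)

    int cancels = 0;
    while (connectionWindow >= leftoverPerCancel) {
      // Data arrives and sits in the Http2Stream's buffer, the stream is canceled,
      // the bytes are never read by the application, so no WINDOW_UPDATE goes out
      // and the window credit is permanently lost.
      connectionWindow -= leftoverPerCancel;
      cancels++;
    }
    System.out.println("After " + cancels + " cancels only " + connectionWindow
        + " bytes of window remain; the server stalls and the next response times out.");
  }
}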
Nice. That theory should be reasonably easy to repro with a test if so.
Tagging for 3.11 as this seems hi-pri
I have been debugging an issue with ExoPlayer and OkHttp HTTP/2 connections that I now think may be the same underlying issue as this one. I have a test project attached to the issue with which I can get 100% reproduction of the SocketTimeoutException if you run it on a Genymotion Android emulator with the network throttled to GPRS or EDGE.
Would love to have an idea of when I could grab a snapshot to test a potential fix to this. I tried the latest 3.11.0-SNAPSHOT from today (April 11th).
Is there any update on this issue? Thanks!
Also looking for an update; we're still seeing this in production, causing any connection to a domain to fail until the socket gets removed from the connection pool.
Can you try the latest SNAPSHOT? Be sure to fully read the response body; that's our signal to notify the peer that we are ready for more bytes.
If you're still seeing an issue, can you please provide some steps to reproduce? Or ideally a test case if that's possible.
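For anyone following along, "fully read the response body" at the application layer can be as small as the sketch below. The helper name is made up; the OkHttp tests use the internal Util.discard for the same purpose, and for a canceled call the read typically ends once the stream reset is observed.

import java.io.IOException;
import okhttp3.Response;
import okio.BufferedSource;
import okio.Okio;

final class DrainSketch {
  /** Consume whatever remains of the body so its flow-control credit is returned. */
  static void drainQuietly(Response response) {
    try (BufferedSource source = response.body().source()) {
      source.readAll(Okio.blackhole());
    } catch (IOException ignored) {
      // The canceled stream may already have been reset; nothing more to read then.
    }
  }
}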
@ojw28 do you think it's feasible to have the loaders in ExoPlayer clean up cancelled requests by finishing reading the bytes?
As an aside @dave-r12, I personally don't know if it's reasonable to ask the application layer to do something like that, given that you don't need to do this with HTTP/1.1 connections and there it doesn't break all subsequent connections to that domain. I will say that this is a huge issue for our video playback on Android right now, and I'm open to fixing it in the short term with a workaround, but this is just my initial reaction.
If you're looking to reproduce, I have a test project attached to the issue with which I can get 100% reproduction of the SocketTimeoutException if you run it on a Genymotion Android emulator with the network throttled to GPRS or EDGE.
Bump: can I work with someone from Square to figure out how to fix this outside of the application layer? This is a huge issue for us right now.
I don't think it's reasonable to require the application layer to fully read the response body. What if it's a 2GB video or something?
@natez0r did you try the latest snapshot? If that fixes the problem I can cut a release.
@swankjesse I just tried now with https://oss.sonatype.org/content/repositories/snapshots/com/squareup/okhttp3/okhttp/3.11.0-SNAPSHOT/okhttp-3.11.0-20180522.074215-92.pom and unfortunately I am still reproducing the issue in my test app.
Are you fully reading the response body for the canceled request? I agree it's a bit of a burden for the application layer but that code should be straightforward to write. Or if you're canceling from a separate thread it should happen naturally. Maybe we can do this differently?
@natez0r here is a test case showing the original theory: https://github.com/square/okhttp/blob/master/okhttp-tests/src/test/java/okhttp3/internal/http2/HttpOverHttp2Test.java#L303
Could you try to muck with that and get it to fail with what you are seeing?
We're not reading the full response body in the test app (I'm just using ExoPlayer and the OkHttp extension for it).
I modified the test in question to stop discarding the bytes on cancel and it fails with the exact error we're seeing in our application:
// Cancel the call and discard what we've buffered for the response body. This should free up
// the connection flow-control window so new requests can proceed.
call1.cancel();
// assertFalse("Call should not have completed successfully.",
// Util.discard(response1.body().source(), 1, TimeUnit.SECONDS));
and the test sees a SocketTimeoutException, just like in our app code:
Tests run: 82, Failures: 0, Errors: 2, Skipped: 5, Time elapsed: 37.908 sec <<< FAILURE! - in okhttp3.internal.http2.HttpOverHttp2Test
connectionWindowUpdateAfterCancelingh2_prior_knowledge Time elapsed: 10.063 sec <<< ERROR!
java.net.SocketTimeoutException: timeout
at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:602)
at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:610)
at okhttp3.internal.http2.Http2Stream.takeResponseHeaders(Http2Stream.java:146)
at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:127)
at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200)
at okhttp3.RealCall.execute(RealCall.java:77)
at okhttp3.internal.http2.HttpOverHttp2Test.connectionWindowUpdateAfterCanceling(HttpOverHttp2Test.java:333)
connectionWindowUpdateAfterCancelingh2 Time elapsed: 10.206 sec <<< ERROR!
java.net.SocketTimeoutException: timeout
at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:602)
at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:610)
at okhttp3.internal.http2.Http2Stream.takeResponseHeaders(Http2Stream.java:146)
at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:127)
at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200)
at okhttp3.RealCall.execute(RealCall.java:77)
at okhttp3.internal.http2.HttpOverHttp2Test.connectionWindowUpdateAfterCanceling(HttpOverHttp2Test.java:333)
So, I guess my biggest question here is: what is the point of the cancel() call if calling it doesn't gracefully clean up the request?
@natez0r right -- that makes sense because the current fix relies on the app to read the bytes.
I just consulted the docs on Call.cancel():
Cancels the request, if possible. Requests that are already complete cannot be canceled.
There is no mention of cleaning up resources so I'm unsure if that'd be a behavior change.
A potential middle ground: require the app layer to close the Response or ResponseBody? I changed the test to do this and it failed so I think there is a little work to do here. I believe we do this for HTTP/1 (exhaust the response body when closed.)
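At the call site, the contract being discussed would look roughly like this (hypothetical snippet; call and response are whatever the app holds for the request it is abandoning):

call.cancel();
if (response != null) {
  // Closing the body is the signal that lets OkHttp reclaim whatever it buffered
  // for this stream; response.close() is equivalent.
  response.body().close();
}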
I think asking the app to close the response body seems like a reasonable middleground from my perspective. Exoplayer already does this in the closeConnectionQuietly helper (https://github.com/google/ExoPlayer/blob/release-v2/extensions/okhttp/src/main/java/com/google/android/exoplayer2/ext/okhttp/OkHttpDataSource.java#L389)
As an aside, I attempted to drain the buffer in my ExoPlayer test and it didn't work for me because there was contention between the old DataSource attempting to drain the bytes and a new request attempting to use the same socket to fetch another video. It's much easier to reproduce if you have a slow connection.
@dave-r12 if we exhaust the response body when closing an HTTP/2 stream, does that mean that all other connections to that domain will have to wait for the exhausting of the response to complete before they could be multiplexed over the same socket? Do we only have to read whatever's actually been received, or does that mean we need to read the entire response the server was planning to serve up?
Forgive my ignorance here, but is there any other way? With HTTP/1 it's really no big deal to exhaust the data (since there was no time requirement), but since other connections are waiting to use this same socket, pulling down bytes we don't plan to actually read and blocking other connections on it seems less than ideal.
Do we only have to read whatever's actually been received, or does that mean we need to read the entire response the server was planning to serve up?
+1 to this question. The entire response may be very large (e.g. multiple gigabytes for a progressive video stream), so it's an important question to address. Clearly cancelling such a request early shouldn't require the entire response to still be received.
does that mean that all other connections to that domain will have to wait for the exhausting of the response to complete before they could be multiplexed over the same socket?
Connections or requests? Assuming you mean requests and the connection flow-control window is full, then yes the server should not send us any more data frames until we free up space.
Do we only have to read whatever's actually been received, or does that mean we need to read the entire response the server was planning to serve up?
Should only be what we have buffered so far, not the entire response. There should be no blocking on IO.
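The shape of that cleanup might be something like the sketch below. This is a hypothetical illustration, not OkHttp's actual frame-writer API: count the buffered bytes, drop them locally, credit the connection-level window (stream 0), and reset the stream, with no further reads from the network.

import okio.Buffer;

final class StreamCleanupSketch {
  /** Hypothetical stand-in for the real HTTP/2 frame writer. */
  interface FrameWriter {
    void windowUpdate(int streamId, long increment);
    void rstStream(int streamId, int errorCode);
  }

  static void discardOnClose(int streamId, Buffer receiveBuffer, FrameWriter writer) {
    long unread = receiveBuffer.size();  // bytes received for this stream but never read by the app
    receiveBuffer.clear();               // drop them locally; no network I/O involved
    if (unread > 0) {
      writer.windowUpdate(0, unread);    // stream 0 addresses the connection-level window
    }
    writer.rstStream(streamId, 0x8);     // 0x8 = CANCEL, tells the server to stop sending
  }
}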
So I think where we are with this is: the fix will rely on the app calling response.body().close() after canceling (we already do this in ExoPlayer). Thanks!
Yes, I believe the commit above does that (pending code review.)
Awesome, thanks for the fix @dave-r12! Much appreciated. If you'd like, I can run my test project on it once it is in a snapshot.
@natez0r can you run your test project on the latest snapshot?
I tried it, but I am still seeing a socket timeout. I'm trying to investigate a bit further, but wanted to let you know I'm working on it.
Shoot, ok. Let us know if we can help out at all. Ideally we can get an executable test case for this and then fix.
Hey @natez0r let me know if you'd like a second set of hands on this! We're eager to get this resolved.
I am able to verify that calling response.body().close() has fixed this issue with the latest snapshot. Thanks!
I got a chance to debug this a bit more today. It looks like my issue can occur when we've 'opened' the connection but have yet to begin reading the response body by the time we decide to cancel it.
@dave-r12 I guess my issue is slightly different than this, but I am trying to figure out what the actual cause could be:
If we've opened a connection (gotten the response code) but have yet to start reading the body of the response, and we decide to cancel the request, does the underlying Http2Stream need to write a reset code to the server? I connected the debugger and it appears that unacknowledgedBytes is 0 when closing the stream, which does not fire a windowUpdate (which sounds reasonable), but I still get SocketTimeoutExceptions when attempting to make a new connection to the same domain after I've closed the response body of the connection that is in the state described in my first message.
Alright. I'll continue to stare at it but nothing is jumping out at me yet. If you could grab the HTTP/2 frame logs or can come up with an executable test case that will help as well. I'd really like to get to the root cause of this one 😄.
does the underlying Http2Stream need to write a reset code to the server?
That should already be happening if we haven't received all the data for the stream or there wasn't an existing error code for the stream.
Hey Dave, I'll try to come up with a test case when I get back from vacation late next week. Thanks for your help on this!
@natez0r Just a heads-up that 3.11 is getting cut soon. Is the current status that three good fixes have gone in but there is still an edge case we need to reproduce and fix?
@yschimke that sounds correct to me. The issue I am having could probably be broken off into a separate issue.
I just got back from vacation today; I am going to try to write a test case to repro the issue before EOD (fingers crossed).
If you think it's different enough, then please split it off.
So, I cannot get it to reproduce with the unit tests yet; I think it probably can be split off.
The issue that I am seeing is:
- EDGE connection speed
- Open one connection and get the content size, then begin reading a chunk of data from that connection (I used an 8192-byte block size).
- During that read, try to open a new connection to a URL on the same domain. Regardless of whether you close the old connection or not, the second connection will fail for me (rough sketch below).
Anywho, I will file an issue on Monday.
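A rough sketch of that sequence for whoever picks it up (URLs are placeholders, the throttled network is assumed to be configured outside the code, and the original project drives this through ExoPlayer rather than directly):

import java.io.IOException;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;
import okio.BufferedSource;

public class ConcurrentStreamsSketch {
  public static void main(String[] args) throws Exception {
    OkHttpClient client = new OkHttpClient();

    // First request: start reading the body in 8 KiB chunks over the slow link.
    Response first = client.newCall(
        new Request.Builder().url("https://example.com/video/segment1").build()).execute();
    BufferedSource source = first.body().source();
    byte[] chunk = new byte[8192];
    source.read(chunk, 0, chunk.length);

    // Second request to the same host while the first body is only partially read;
    // it is multiplexed onto the same HTTP/2 connection, and this is where the
    // SocketTimeoutException was observed.
    Thread second = new Thread(() -> {
      Request request = new Request.Builder().url("https://example.com/video/segment2").build();
      try (Response response = client.newCall(request).execute()) {
        System.out.println("second response: " + response.code());
      } catch (IOException e) {
        System.out.println("second request failed: " + e);
      }
    });
    second.start();

    first.close();   // closing (or canceling) the first call mid-read did not help here
    second.join();
  }
}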
@natez0r I tried to write a test case to match what you found: https://github.com/square/okhttp/pull/4127. What do you think?
@natez0r can you please share your test code that shows this still happens?
@natez0r can you please share your test code that shows this still happens?
@noamtamim I was using https://github.com/google/ExoPlayer/issues/4078#issuecomment-379898195 to reproduce it. I haven't used it in a while, but the basic gist of the test project was to have two connections to the same domain, cancel one mid-request, and then try to read the second.