We have an app on both Android and iOS; the Android app uses okhttp/3.6.
We enabled http/2 support on nginx, the API server the apps connect to.
Everything was OK and most requests used http/2, but something weird happened:
in the nginx access_log I found some requests with a HUGE request_time, sometimes more than 900 or even 1000 seconds, and all of these requests returned http_status 408.
When I checked these users (or devices), all of these requests were sent after the user had stopped using the app, and most of the time they are duplicates.
I cannot reproduce this on my own device or on any test device in our company, but it looks like a bug, because the iOS client was working fine with http/2 at the same time.
Here is the weird access_log: the user stopped using the app at 21:50:20 but then "sent" more requests; about 950 seconds later, nginx logged them at 22:01:27.
Time request_time response_time http2 status request
April 25th 2017, 22:23:25.000 0.021 0.021 h2 200 /messenger/v2/conversations/unread
April 25th 2017, 22:01:27.000 956.083 0.022 h2 408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000 952.644 0.081 h2 408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000 940.531 0.019 h2 408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000 936.136 0.01 h2 408 /v1.1/upgrade/check
April 25th 2017, 22:01:27.000 942.604 0.019 h2 408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000 945.977 0.023 h2 408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000 952.787 0.001 h2 408 /search/v3/config/tabs
April 25th 2017, 22:01:27.000 942.729 0.024 h2 408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000 960.572 0.021 h2 408 /training/v2/running/facade/config?type=3x
April 25th 2017, 22:01:27.000 936.136 0.017 h2 408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000 950.519 0.033 h2 408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000 946.369 0.017 h2 408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 22:01:27.000 956.083 0.023 h2 408 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 21:50:20.000 0.001 0.001 h2 200 /v1.1/log/client?format=json&priority=normal
April 25th 2017, 21:45:28.000 0.004 0.004 200 /config/v2/basic
April 25th 2017, 21:45:21.000 0.022 0.022 h2 200 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 21:45:21.000 0.012 0.012 h2 200 /messenger/v2/conversations/unread
April 25th 2017, 21:45:21.000 0.019 0.019 h2 200 /search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20
April 25th 2017, 21:45:20.000 0.008 0.008 h2 200 /messenger/v2/conversations/unread
April 25th 2017, 21:45:20.000 0.071 0.071 h2 200 /account/v2/dashboard
April 25th 2017, 21:45:20.000 0.017 0.017 h2 200 /training/v2/event/outdoor
April 25th 2017, 21:45:20.000 0.003 0.003 204 /v1.1/ads/splash?type=android
April 25th 2017, 21:45:20.000 0.015 0.015 h2 200 /social/v2/rankinglist/brief?date=20170425
April 25th 2017, 21:45:20.000 0.009 0.009 h2 200 /v1.1/home/dashboard/statistics
Interesting analysis. Would love some help figuring out the root cause.
@swankjesse Thanks. I have no idea how to investigate this further.
One more thing: all of these requests have a different $request_id, which is generated by nginx.
I suspect this isn't a problem with OkHttp, but rather a case where some timeout tuning would help to deal with flaky connections.
Notice that the client is trying to get the same resource, and the requests are staggered by a few seconds. In a generic scenario, the client sends a request, doesn't hear back from the server, and then sends another. All the while, nginx is waiting for the client's request. It'd be good to check whether those logs that you posted are the 408 responses. It seems fishy that they all happen at the same moment, each with a different elapsed time. The event at 22:01:27 may be where the client tears down its TCP connections.
Some ideas to investigate on your side:
[1] http://nginx.org/en/docs/http/ngx_http_core_module.html#client_body_timeout
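For reference, the client-side timeout knobs live on OkHttpClient.Builder. A minimal sketch, assuming OkHttp 3.x; the timeout values and the ClientFactory helper are illustrative, not recommendations:

```java
import java.util.concurrent.TimeUnit;
import okhttp3.OkHttpClient;

public final class ClientFactory {
  // Hypothetical helper: set timeouts explicitly instead of relying on defaults.
  public static OkHttpClient newClient() {
    return new OkHttpClient.Builder()
        .connectTimeout(10, TimeUnit.SECONDS)  // give up quickly on unreachable hosts
        .readTimeout(30, TimeUnit.SECONDS)     // max silence while waiting for the response
        .writeTimeout(30, TimeUnit.SECONDS)    // max silence while sending the request body
        .retryOnConnectionFailure(false)       // surface connection failures instead of retrying silently
        .build();
  }
}
```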
@EricEdens Thanks first.
client_body_timeout is the default value: 60s.
/search/v3/global?keyword=%E5%A4%A7%E8%85%BF&trainer_gender=f&limit=20 at 21:45:21 got http_status_200 (twice, actually). And what's the matter of the listening port? It is listen 80; listen 443 ssl http2; in the nginx configuration.
Thanks for the extra info @rrfeng.
Nix the request for the listening port; I was trying to tease out which TCP connection was used for which request, and I was actually thinking of the client's port, which wouldn't be available in the server logs. I don't know what I was thinking last night :)
If you could come up with a test that reproduces this, that'd be great!
I don't know how to log the client port in nginx, but $connection should be useful for this purpose. I will add this variable to the access_log and comment later on whether the connection is the same for these requests.
I still cannot reproduce this problem manually, but with http/2 running in production I can find this kind of log every 10 minutes.
http://nginx.org/en/docs/http/ngx_http_core_module.html#variables
Interesting, I will check this user's operations.
Time request_time connection request
April 28th 2017, 11:11:59.000 1,057.841 3255947367 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 11:11:59.000 1,057.841 3255947367 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:54:20.000 0.019 3255947367 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:54:20.000 0.02 3255947367 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:54.000 0.021 3255947366 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:51.000 0.024 3255947364 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:51.000 0.025 3255947360 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:50.000 0.02 3255947367 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
April 28th 2017, 10:53:48.000 0.022 3255947363 /search/v3/global?keyword=%E8%83%8C&trainer_gender=f&limit=20
We have found a bug in our app: the /search/v3/global API may be called several times at once. But apart from that situation, the problem still exists.
I think network changes (such as Wi-Fi -> cellular), concurrency, or the app sleeping in the background may be related.
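One possible client-side mitigation for the duplicate search calls is to cancel any in-flight call for the same endpoint before starting a new one. A minimal sketch against the OkHttp 3.x dispatcher; the SEARCH_TAG constant and the SearchCalls helper are hypothetical:

```java
import okhttp3.Call;
import okhttp3.OkHttpClient;
import okhttp3.Request;

public final class SearchCalls {
  private static final String SEARCH_TAG = "search"; // hypothetical tag for /search/v3/global requests

  // Cancel queued and running search calls so rapid repeated triggers
  // don't stack duplicate requests, then create the new tagged call.
  public static Call newSearchCall(OkHttpClient client, Request request) {
    for (Call call : client.dispatcher().queuedCalls()) {
      if (SEARCH_TAG.equals(call.request().tag())) call.cancel();
    }
    for (Call call : client.dispatcher().runningCalls()) {
      if (SEARCH_TAG.equals(call.request().tag())) call.cancel();
    }
    return client.newCall(request.newBuilder().tag(SEARCH_TAG).build());
  }
}
```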
Thanks for the update. It sounds like we can close this issue?
(In the sense that the socket is changing under us. Or do you think OkHttp should be doing something different here?)
It may be a bug in nginx, I don't know.
If you guys would like to close this issue, go ahead and close it.
I will reopen it if I find evidence that OkHttp is at fault.
Yeah, please do investigate. OkHttp does retries on some calls, but this doesn’t seem like the culprit here.
If you see anything we can fix, please report back!
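One way to tell whether the duplicates come from OkHttp's internal retries or from the application layer is to log every physical attempt with a network interceptor (network interceptors run once per attempt, unlike application interceptors). A hedged sketch; the AttemptLogging wrapper is just for illustration:

```java
import java.io.IOException;
import okhttp3.Interceptor;
import okhttp3.OkHttpClient;
import okhttp3.Response;

public final class AttemptLogging {
  public static OkHttpClient newClient() {
    return new OkHttpClient.Builder()
        .retryOnConnectionFailure(false) // rule out OkHttp's silent connection retries
        .addNetworkInterceptor(new Interceptor() {
          @Override public Response intercept(Chain chain) throws IOException {
            // Logged once per network attempt, including retries and redirects.
            System.out.println("attempt " + chain.request().method() + " "
                + chain.request().url() + " via " + chain.connection());
            return chain.proceed(chain.request());
          }
        })
        .build();
  }
}
```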
@rrfeng, have you found the reason for the 408 status and how to solve it? Thanks.
I am facing the same issue. The server used to send a 408 after a request read timeout of 30 seconds, and this resulted in multiple repeated calls.
After increasing the read timeout to 290s, the issue changed to:
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="455" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878207"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="453" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878209"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="453" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878148"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="455" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878246"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878247"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878229"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="455" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878249"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878248"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878250"
[22/Oct/2018:12:07:27 +0000] "POST /api/v1/feedback/ HTTP/1.1" "-" 408 req="454" resp="0" ua="okhttp/3.11.0" rt="290.000" uct="-" uht="-" urt="-" "15878251"
All of them have the same request time and the same timeout, but different connection IDs.
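Since the EventListener API is available in okhttp 3.11, another way to confirm whether the client is really issuing these duplicate POSTs, and on which socket, is to log connection acquisition and call failures. A minimal sketch; CallTracing is a hypothetical helper:

```java
import java.io.IOException;
import okhttp3.Call;
import okhttp3.Connection;
import okhttp3.EventListener;
import okhttp3.OkHttpClient;

public final class CallTracing {
  public static OkHttpClient newClient() {
    return new OkHttpClient.Builder()
        .eventListener(new EventListener() {
          @Override public void connectionAcquired(Call call, Connection connection) {
            // The local port helps correlate with per-connection server logs.
            System.out.println(call.request().url() + " on local port "
                + connection.socket().getLocalPort());
          }
          @Override public void callFailed(Call call, IOException ioe) {
            System.out.println(call.request().url() + " failed: " + ioe);
          }
        })
        .build();
  }
}
```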
I am also having the same issue with my react-native Android app, which uses okhttp 3.8.1, when talking to nginx + HTTP/2.
In our case, with a probability of less than 0.1%, it sends multiple requests to the server.
When it happens, we see 128 duplicate requests in our access log within a second.
The number is always 128 because our nginx rejects more than 128 concurrent streams per HTTP/2 connection (128 is the default http2_max_concurrent_streams).
However, at the same time, we see a gigantic number of log lines like the ones below in our nginx error log.
In the worst case, it records a million lines of this error from the same IP within a few minutes.
I'm afraid this means the user is sending a very large number of requests.
2018/11/15 07:06:04 [info] 8638#0: *1049430794 concurrent streams exceeded 128 while processing HTTP/2 connection, client: XX.XX.XX.XX, server: 0.0.0.0:443
2018/11/15 07:06:04 [info] 8638#0: *1049430794 concurrent streams exceeded 128 while processing HTTP/2 connection, client: XX.XX.XX.XX, server: 0.0.0.0:443
2018/11/15 07:06:04 [info] 8638#0: *1049430794 concurrent streams exceeded 128 while processing HTTP/2 connection, client: XX.XX.XX.XX, server: 0.0.0.0:443
After disabling HTTP/2, the problem disappeared.
The strange thing is that it only happens with POST and DELETE, even though our app sends more GET and OPTIONS requests than POST and DELETE.
It also seems more likely to happen with slow APIs (image uploading, for example).
@pacuum oooh, that’s helpful for debugging this. Perhaps we’re timing out client side and retrying on the same HTTP/2 connection?
@swankjesse I picked some samples and checked the access log just before the problem happened, and it seems you're right.
I see a pattern: before the repeated requests happen, there is often (more than 80% of the time) a 60-second gap in the access log from the same IP, which looks like a timeout.
But there are still a few cases where there is no such gap.
XX.XX.XX.XX - - [15/Nov/2018:12:17:26 +0700] "GET /whats_hot?paging_model[page_size]=40&paging_model[page_index]=1&platform=android HTTP/2.0" 200 24276 "-" "okhttp/3.8.1" "-" 0.618
XX.XX.XX.XX - - [15/Nov/2018:12:18:26 +0700] "POST /tutorials?platform=android HTTP/2.0" 408 0 "-" "okhttp/3.8.1" "-" -
XX.XX.XX.XX - - [15/Nov/2018:12:18:26 +0700] "POST /tutorials?platform=android HTTP/2.0" 408 0 "-" "okhttp/3.8.1" "-" -
XX.XX.XX.XX - - [15/Nov/2018:12:18:26 +0700] "POST /tutorials?platform=android HTTP/2.0" 408 0 "-" "okhttp/3.8.1" "-" -
@pacuum Exactly the same as you. After we enabled http/2, we found some Android users sending around 128 requests in a second. We had to go back to http/1.1.
Yikes. What HTTP/2 server are you using? Any interesting interceptors? Any other information you provide could be quite helpful in finding the root cause here.
@swankjesse nginx version: nginx/1.10.3 (Ubuntu).
For all of these requests, $request_time is 60, $status is 408, $request_method is POST, and $body_bytes_sent is 0.
I have the same problem in JavaScript with Chrome on Android and nginx using http/2. We downgraded to http/1.1 and everything works well again.
I would say the problem is http/2 on Android rather than nginx.
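If reconfiguring the server isn't an option, the same downgrade can be done on the client: OkHttp can be restricted to HTTP/1.1 via the protocols list, so ALPN never negotiates h2 even while the server still advertises it. A minimal sketch; Http1Only is a hypothetical wrapper:

```java
import java.util.Collections;
import okhttp3.OkHttpClient;
import okhttp3.Protocol;

public final class Http1Only {
  public static OkHttpClient newClient() {
    return new OkHttpClient.Builder()
        // Only HTTP/1.1 is offered during ALPN, so h2 is never used.
        .protocols(Collections.singletonList(Protocol.HTTP_1_1))
        .build();
  }
}
```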