Okhttp: Possible bug: High amount of requests under certain conditions

Created on 16 Jun 2018 · 18Comments · Source: square/okhttp

Hi! I apologize in advance for this bug report since it will be more FYI than a specific to the point bug report as we have not been able to identify the root cause of it nor reproduce it (I know, I wish we had more info than the below).

Problem: Intermittently we see our app clients using okhttp 3.6.0 launching 60+ POST requests when only one should have been sent. These requests were launched within sub-millisecond of each other hence spamming our API.

App runs React Native 0.55.3 and utilizing the standard fetch() call which in turns falls back on okhttp 3.6.0.

Communication path: App -> AWS ELB -> Apache server.

Some patterns that might hold true (we haven't seen any deviation from this.. but who knows, black swans are out there we might just not have observed them yet):

Problem is only with POST requests. We haven't observed this with GET/PATCH/DELETE yet.
Problem is only when going over HTTP/2. When we turn off HTTP/2 on our ELB we don't observe any clients having the issue any longer.
Problem only seems to exhibit when the Apache/ELB responds with error code >= 400. We haven't observed it happening when error code is 200.

We estimate this to happen in approx 0.1% of all requests or so. Hence, infrequently.

Reproduction: We have been able to find a scenario that can reproduce it but we observe it hundreds of times per day when HTTP/2 is enabled.

We are quite certain that the issue isn't with our app logic since we do not have any logic for retries or anything similar. The problem would be down the stack, in React Native or Okhttp.

I wanted to open this issue in case it has been observed by anyone else out there and they know the root cause for it or if they have a reliable reproduction of it, in such case we could help patch the behavior (assuming the problem is with okhttp).

needs info

Source

verticelo

👍2

All 18 comments

What specific error codes are returned? Is it possible that there is a HTTP/2 stream error returned by the server? There is some HTTP/2 specific retries e.g. REFUSED_STREAM.

I don't think we can resolve just from this report. If its reproducible in a public server then we can work out the bug and fix.

yschimke on 17 Jun 2018

Any chance an authenticator is involved like in https://github.com/square/okhttp/issues/3984

yschimke on 17 Jun 2018

Hi, some additional findings based on some log review we did today:

When the issue starts the bytes received for the first request is let's say 104 bytes, for all subsequent spamming requests it is 23 bytes. Another example has first payload of 151 bytes and the rest is 19.
When the spams start it launches all the requests in about 20-60ms total then it rests for almost exactly 60 seconds and the restart with launching requests again. We have seen it launch more than 200 requests in 30ms. It could be due to the bug, or due to some security mechanisms AWS / Android might deploy.
The error code returned by our ELB is 408. Regarding specific HTTP/2 error codes, the access logs in ELB doesn't contain them so we not able to answer that question unfortunately. We do not have access to any device log where this error has occurred so we can't look there either.

Regarding Authenticator, this is how our Javascript looks:

let conf = { method: 'POST', headers: { 'Accept': 'application/json', 'X-APIKEY': this.apiKey }};
let response = await fetch("https://domain.com/api", conf);

In React Native they seem to have implemented fetch() in the following way utilizing Okhttp:

import okhttp3.Request;

Request.Builder requestBuilder;

requestBuilder = new Request.Builder().url(url);
requestBuilder.headers(requestHeaders);

verticelo on 18 Jun 2018

This might be related to #3483. Could you try upgrading to 3.9.0 or later?

dave-r12 on 18 Jun 2018

We can try that if we can find a way to reproduce the issue first. If we can't trigger the problem we won't know if 3.9.0 fixes it unless we push it all the way to production and re-enable HTTP/2. We are dedicating some time on Wednesday trying to reproduce this issue again and I will get back to you all with the results here after that. Thanks!

verticelo on 18 Jun 2018

@verticelo if the server is returning 408's, the client will retry. This should be safe according to the spec. Prior to that change, it would retry several times (assuming the server continues responding with 408's.) With that change, it will only retry one time then give up.

dave-r12 on 19 Jun 2018

👍1

Hi again. Indeed, the error code 408 is causing the issue. We can reliably reproduce the problem when we return a 408 error code, not with any other code. When we return 408 the client sends hundreds of requests within a few milliseconds.

Our servers do not return the 408 error code (we don't utilize that in our systems) but rather it was the AWS ELB that did that for some reason we are yet to determine. This caused some of our users to become nice little DDoS bots.. happy days :)

We are about to upgrade to RN 0.56 which will utilize okhttp 3.10.0. Once we have done that we will perform additional testing and make sure it is resolved. Otherwise, we'll re-open this ticket.

Thank you all for your help, much appreciated!

verticelo on 20 Jun 2018

@verticelo did you figure our why the ELB was returning 408?

gedl on 8 Nov 2018

@gedl Unfortunately we didn't, if you figure it out, please let me know.

verticelo on 8 Nov 2018

It seems like okhttp is abandoning some requests, but not closing the connection. The ALB then waits for the configured timeout (in our case 60seconds) then returns a 408.
This seems to compound with another okhttp bug that causes it firing a great number of new requests as a response to the ALB.

Does this theory fit your problem?

gedl on 11 Nov 2018

Well, what does it mean that it abandons some requests? If it closes the connection, then it shouldn't receive the 408 reply?

verticelo on 12 Nov 2018

@verticelo Did you manage to solve this? We have exactly the same issue, using Amazon ELB and okhttp 3.10.0

navro on 3 Dec 2018

@gedl @navro is there any test site/project I can use to reproduce this? Alternatively if you have an executable test case, that works too. It would be great to fix this!

dave-r12 on 3 Dec 2018

👍1

@navro We were unable to reproduce the issue after upgrading to 3.10.0. Lycka till!

verticelo on 3 Dec 2018

❤1

@verticelo I meant keeps the connection opened, but discards / ignores the result.

We've upgraded to 3.11.0 and we see this less, but it's still present.

@dave-r12 I don't have anything usable at the moment, apologies. Is there any other information I may provide ?

gedl on 9 Dec 2018

@gedl here is a test case that returns 408, https://github.com/square/okhttp/blob/master/okhttp-tests/src/test/java/okhttp3/CallTest.java#L1714

Can you see if you can alter it to match the behavior you're seeing?

dave-r12 on 9 Dec 2018

It looks like when there is StreamResetException okHttp will retry indefinable. it happens when you are using ELB and you switch from wifi to 4g in the middle of a long request.
a workaround is to add an interceptor
I fix it with this code
public class RetryInterceptor implements Interceptor { @Override public Response intercept(Chain chain) throws IOException { Request request = chain.request(); Response response = null; try { response = chain.proceed(request); } catch (StreamResetException e) { return new Response.Builder() .request(request) .body(ResponseBody.create(MediaType.parse("application/json"), "{\"name\": \"Exception\", \"message\": \"Connection Close\"}")) .protocol(Protocol.HTTP_1_1) .message("Connection Close") .code(400) .build(); } return response; } }

estebanbi on 31 Jul 2019

Still observing this with React Native 0.61 and okhttp 3.12.1. Can we re-open this? Switching between wifi and 4g, or a spotty 4g, will cause okhttp to be come a tiny DoS (about 150 requests in under a second).

Any work around for react native?