Hi! I apologize in advance for this bug report since it will be more FYI than a specific to the point bug report as we have not been able to identify the root cause of it nor reproduce it (I know, I wish we had more info than the below).
Problem: Intermittently we see our app clients using okhttp 3.6.0 launching 60+ POST requests when only one should have been sent. These requests were launched within sub-millisecond of each other hence spamming our API.
App runs React Native 0.55.3 and utilizing the standard fetch() call which in turns falls back on okhttp 3.6.0.
Communication path: App -> AWS ELB -> Apache server.
Some patterns that might hold true (we haven't seen any deviation from this.. but who knows, black swans are out there we might just not have observed them yet):
We estimate this to happen in approx 0.1% of all requests or so. Hence, infrequently.
Reproduction: We have been able to find a scenario that can reproduce it but we observe it hundreds of times per day when HTTP/2 is enabled.
We are quite certain that the issue isn't with our app logic since we do not have any logic for retries or anything similar. The problem would be down the stack, in React Native or Okhttp.
I wanted to open this issue in case it has been observed by anyone else out there and they know the root cause for it or if they have a reliable reproduction of it, in such case we could help patch the behavior (assuming the problem is with okhttp).
What specific error codes are returned? Is it possible that there is a HTTP/2 stream error returned by the server? There is some HTTP/2 specific retries e.g. REFUSED_STREAM.
I don't think we can resolve just from this report. If its reproducible in a public server then we can work out the bug and fix.
Any chance an authenticator is involved like in https://github.com/square/okhttp/issues/3984
Hi, some additional findings based on some log review we did today:
When the issue starts the bytes received for the first request is let's say 104 bytes, for all subsequent spamming requests it is 23 bytes. Another example has first payload of 151 bytes and the rest is 19.
When the spams start it launches all the requests in about 20-60ms total then it rests for almost exactly 60 seconds and the restart with launching requests again. We have seen it launch more than 200 requests in 30ms. It could be due to the bug, or due to some security mechanisms AWS / Android might deploy.
The error code returned by our ELB is 408. Regarding specific HTTP/2 error codes, the access logs in ELB doesn't contain them so we not able to answer that question unfortunately. We do not have access to any device log where this error has occurred so we can't look there either.
Regarding Authenticator, this is how our Javascript looks:
let conf = { method: 'POST', headers: { 'Accept': 'application/json', 'X-APIKEY': this.apiKey }};
let response = await fetch("https://domain.com/api", conf);
In React Native they seem to have implemented fetch() in the following way utilizing Okhttp:
import okhttp3.Request;
Request.Builder requestBuilder;
requestBuilder = new Request.Builder().url(url);
requestBuilder.headers(requestHeaders);
This might be related to #3483. Could you try upgrading to 3.9.0 or later?
We can try that if we can find a way to reproduce the issue first. If we can't trigger the problem we won't know if 3.9.0 fixes it unless we push it all the way to production and re-enable HTTP/2. We are dedicating some time on Wednesday trying to reproduce this issue again and I will get back to you all with the results here after that. Thanks!
@verticelo if the server is returning 408's, the client will retry. This should be safe according to the spec. Prior to that change, it would retry several times (assuming the server continues responding with 408's.) With that change, it will only retry one time then give up.
Hi again. Indeed, the error code 408 is causing the issue. We can reliably reproduce the problem when we return a 408 error code, not with any other code. When we return 408 the client sends hundreds of requests within a few milliseconds.
Our servers do not return the 408 error code (we don't utilize that in our systems) but rather it was the AWS ELB that did that for some reason we are yet to determine. This caused some of our users to become nice little DDoS bots.. happy days :)
We are about to upgrade to RN 0.56 which will utilize okhttp 3.10.0. Once we have done that we will perform additional testing and make sure it is resolved. Otherwise, we'll re-open this ticket.
Thank you all for your help, much appreciated!
@verticelo did you figure our why the ELB was returning 408?
@gedl Unfortunately we didn't, if you figure it out, please let me know.
It seems like okhttp is abandoning some requests, but not closing the connection. The ALB then waits for the configured timeout (in our case 60seconds) then returns a 408.
This seems to compound with another okhttp bug that causes it firing a great number of new requests as a response to the ALB.
Does this theory fit your problem?
Well, what does it mean that it abandons some requests? If it closes the connection, then it shouldn't receive the 408 reply?
@verticelo Did you manage to solve this? We have exactly the same issue, using Amazon ELB and okhttp 3.10.0
@gedl @navro is there any test site/project I can use to reproduce this? Alternatively if you have an executable test case, that works too. It would be great to fix this!
@navro We were unable to reproduce the issue after upgrading to 3.10.0. Lycka till!
@verticelo I meant keeps the connection opened, but discards / ignores the result.
We've upgraded to 3.11.0 and we see this less, but it's still present.
@dave-r12 I don't have anything usable at the moment, apologies. Is there any other information I may provide ?
@gedl here is a test case that returns 408, https://github.com/square/okhttp/blob/master/okhttp-tests/src/test/java/okhttp3/CallTest.java#L1714
Can you see if you can alter it to match the behavior you're seeing?
It looks like when there is StreamResetException okHttp will retry indefinable. it happens when you are using ELB and you switch from wifi to 4g in the middle of a long request.
a workaround is to add an interceptor
I fix it with this code
public class RetryInterceptor implements Interceptor {
@Override
public Response intercept(Chain chain) throws IOException {
Request request = chain.request();
Response response = null;
try {
response = chain.proceed(request);
} catch (StreamResetException e) {
return new Response.Builder()
.request(request)
.body(ResponseBody.create(MediaType.parse("application/json"), "{\"name\": \"Exception\", \"message\": \"Connection Close\"}"))
.protocol(Protocol.HTTP_1_1)
.message("Connection Close")
.code(400)
.build();
}
return response;
}
}
Still observing this with React Native 0.61 and okhttp 3.12.1. Can we re-open this? Switching between wifi and 4g, or a spotty 4g, will cause okhttp to be come a tiny DoS (about 150 requests in under a second).
Any work around for react native?