copied from original ticket akka/akka#23791
Akka-http Version: 10.0.10
Scala Version: 2.11.11/2.12.3
We are experiencing issues where we randomly get a
akka.http.impl.engine.client.OutgoingConnectionBlueprint$UnexpectedConnectionClosureException: The http server closed the connection unexpectedly before delivering responses for 1 outstanding requests
when running a test suite that generates a lot of connections/load
I have attempted debugging with
<logger name="akka.http.impl.engine.client.PoolGateway" level="DEBUG" additivity="false">
<appender-ref ref="STDOUT"/>
</logger>
And after inspecting the PoolGateway all of the connections are correctly being acquired/released (so its not an issue with a http request discardEntityBytes()/Unmarshal not being called)
Http-pipelining is also disabled (here is the application.conf we are using). Note for the akka-http client, http-pipelining should be disabled by default (according to the default conf here https://doc.akka.io/docs/akka-http/current/scala/http/configuration.html).
akka {
loglevel = "DEBUG"
loggers = ["akka.event.slf4j.Slf4jLogger"]
logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
http {
host-connection-pool {
// This is to speed up our tests
max-connections = 32
max-open-requests = 256
}
}
}
I highly doubt its an issue with the server we are calling since we used a different http client before (http4s) and the same test suite ran without issues (also the error happens with different hosts running different http stacks). Unfortunately this is company code so I can't post it here, however we have a lightbend subscription at our company if it helps diagnosing the problem.
Is there any low hanging fruit which can help in diagnosing/fixing this issue?
@mdedetrich, usually, I'd say that these can happen with certain kinds of HTTP servers / reverse proxies because there's an inherent race condition with persistent HTTP connections if an idle persistent connection is closed by the server at the same time that the client sends another request on it (e.g. because of an idle-timeout). In that case, a server is not able to send a Connection: close header first, because no request is currently pending.
In your case, this seems a bit unlikely because of the different HTTP stacks you mention.
Another case, where I guess this could be observed is if there's a problem right during or after establishing a connection.
In which way is the error delivered? Do you see if the request is retried or are these requests idempotent?
Are these connections using TLS? If you connect to different hosts, does it happen against every one of those?
If it's reproducible for you, in the best case you could provide a tcpdump which contains the failing interaction to see what happens on the line (if the amount of traffic is a concern, packets could be snipped off using tcpdump's -s parameter to only include TCP/IP headers and the first bytes of TCP payload, that should already be enough).
Since this is only happening during tests, I suspect that resolving https://github.com/akka/akka-http/issues/768 would at least temporarily fix the situation for us
I also made a new issue here https://github.com/akka/akka-http/issues/1463. I have been doing some reading online, and apparently in this situation automatically retrying the request if its idempotent (i.e. GET request) is considered standard behaviour. Since akka-http tracks which types of requests are idempotent this might be a better solution (rather than trying to track down the cause of errors) especially considered that it may be the other http clients that is causing this problem
In the logs you sent, I saw lots of these combos:
[DEBUG] Materializer - [client-plain-text UP ] Downstream finished.
[DEBUG] Materializer - [client-plain-text DOWN] Upstream finished.
[DEBUG] PoolGateway - [0] <...> connection was closed by peer while no requests were in flight
The third line could either be this "while no requests were in flight" or the UnexpectedConnectionClosureException if a request was ongoing at the point when it happened.
That it starts with UP Downstream finished is weird because it would mean that it's the client that cancels the incoming data stream from the network despite the later message that the "connection was closed by peer", which could indeed point to a bug. I can reproduce these log combos in our test suite, so I'll have a look if I can understand what's going on there.
Turned out that our own tests showed the similar behavior for good reasons because the requests/responses were Connection: close, so that didn't help diagnose the issue.
I reproduced it with this test in LowLevelConnectionSpec:
"special persistent case" in new TestSetup(config =
"""
akka.http.client.log-unencrypted-network-bytes = 100
akka.http.host-connection-pool.client.log-unencrypted-network-bytes = 100
""".stripMargin) {
requestsSub.sendNext(HttpRequest(HttpMethods.GET, entity = HttpEntity(ContentTypes.`text/plain(UTF-8)`, "hello")))
expectWireData(
"""GET / HTTP/1.1
|Host: example.com
|User-Agent: akka-http/test
|Content-Type: text/plain; charset=UTF-8
|Content-Length: 5
|
|hello""")
sendWireData(
s"""HTTP/1.1 200 OK
|Transfer-Encoding: chunked
|
|3
|abc
|0
|
|""")
val HttpResponse(StatusCodes.OK, _, HttpEntity.Chunked(_, chunkStream), _) = expectResponse()
//val chunks = chunkStream.runWith(Sink.seq).awaitResult(3.seconds.dilated)
//chunks shouldEqual Seq(Chunk(ByteString("abc")), LastChunk)
requestsSub.sendNext(HttpRequest(HttpMethods.PATCH, entity = HttpEntity(ContentTypes.`text/plain(UTF-8)`, "hello2")))
expectWireData(
"""PATCH / HTTP/1.1
|Host: example.com
|User-Agent: akka-http/test
|Content-Type: text/plain; charset=UTF-8
|Content-Length: 6
|
|hello2""")
chunkStream.runWith(Sink.cancelled)
}
It happens when a streamed entity of a previous response is discarded / cancelled after a new request has already been scheduled to the new connection. I'm not yet sure what the right behavior should be or how to fix it but it's good to have it verified as a bug.
On the low level, this behavior is basically ok: if you'd try to use HTTP pipelining (which doesn't really work) and push another request onto a connection where a previous request has not yet completed, your connection will be closed when the response entity of a previous request is cancelled.
The real problem seems to be once again a race condition in the PoolSlot, where the detection of whether the entity of a response to a previous request has completely been read races against a new request already coming in. What should happen, is that the connection is closed first because a previous entity was cancelled before a new request would trigger a reconnection. What actually seems to happen is that the new request comes in slightly before the slot notices that the previous connection is gone and thus schedules the new request onto the old almost-gone connection leading to the observed problems. I could not actually trigger the problem with the pool because I assume it depends on some intricate timing / scheduling details between the connection stream, the entity stream, and the pool slot stream.
We should do a few things:
discardEntity smarter not to cancel entities in all cases to prevent that persistent connections need to be closed in so many cases. That would mean looking for the end of the entity stream for a while before finally cancelling. The timeout could be quite low because in most cases of small entities the entity is already in buffers.@mdedetrich as a short-term workaround, it could make sense to use entity.dataBytes.runWith(Sink.ignore) instead of using discardEntityBytes() when a response entity is not needed. Can you try if that gets rid of the problem?
@jrudolph
On the low level, this behavior is basically ok: if you'd try to use HTTP pipelining (which doesn't really work) and push another request onto a connection where a previous request has not yet completed, your connection will be closed when the response entity of a previous request is cancelled.
To be clear, we are not using pipelining (in the specific test case). We left it as disabled (since it also doesn't work correctly in all cases)
@mdedetrich as a short-term workaround, it could make sense to use entity.dataStream.runWith(Sink.ignore) instead of using discardEntityBytes() when a response entity is not needed. Can you try if that gets rid of the problem?
Will do, we are coding the workaround now and will let you know of the results tomorrow
To be clear, we are not using pipelining (in the specific test case). We left it as disabled (since it also doesn't work correctly in all cases)
Thanks for the clarification. I actually didn't assume it, it was more of an explanation why the behavior was expected on that level.
Thanks for the clarification. I actually didn't assume it, it was more of an explanation why the behavior was expected on that level.
Yup, its for other people in the thread if they are wondering!
Huh, awesome detective work @jrudolph, the making discardEntityBytes smarter will be a bit weird perhaps, but makes sense.
@jrudolph
@mdedetrich as a short-term workaround, it could make sense to use entity.dataBytes.runWith(Sink.ignore) instead of using discardEntityBytes() when a response entity is not needed. Can you try if that gets rid of the problem?
The good news is that the errors rates have fallen down substantially. The bad news is we still got it. I did 25 test runs with our test suite yesterday, and the UnexpectedConnectionClosureException only happened once (before happened around 50-80% of the time)
Update from our side, our current error rate for getting this error is around 15-20% which is a lot lower then before (however we are still getting it). I have submitted another trace output (using the entity.dataBytes.runWith(Sink.ignore) workaround to lightbend.
Many thanks to @jrudolph who helped diagnose the root cause, it appeared that the issue is that akka-stream-json (the json streaming library we used) would sometimes cancel the json stream. This then happened to propagate to the main connection stream, which means the http connection stream would also cancel.
@johanandren close this ticket for now, will re-open if the problem persists or this isn't the root cause
Let's summarize:
So, removing the cancelling code in akka-stream-json is a good workaround for now, but it might also mean that a connection stays now open longer than necessary because it isn't cancelled any more. As I understand it the cancellation in akka-stream-json was only necessary to be able to ignore potential trailing extra (json) data after the main response. (I'll also have a look at the tickets over there.)
@jrudolph
the original here is still a bug in the connection pool: if ever possible the PoolSlot should not send a new request to a connection that is soon to be closed
To clarify, the current workaround for this issue is to use entity.dataBytes.runWith(Sink.ignore) instead of discardEntityBytes (in which case, when akka-stream-json gets updated I will remove the safeEntity fix but the discardEntityBytes workaround will remain)
Has this been fixed in the new client connection pool implementation introduced in 10.1.0? Or would a fix require an implementation of #1686?
Could someone help me to understand why entity.dataBytes.runWith(Sink.ignore) can be different of discardEntityBytes?
From the code, it is exactly what discardEntityBytes is doing:
override def discardBytes(mat: Materializer): HttpMessage.DiscardedEntity =
new HttpMessage.DiscardedEntity(dataBytes.runWith(Sink.ignore)(mat))
And DiscardedEntity is just a wrapper for Sink.ignore: Future[Done]
Has this been fixed in the new client connection pool implementation introduced in 10.1.0? Or would a fix require an implementation of #1686?
The bug portion of this has been fixed with the new pool. So let's close this issue as fixed and use #1479 to track the remaining improvements.
Most helpful comment
I reproduced it with this test in
LowLevelConnectionSpec:It happens when a streamed entity of a previous response is discarded / cancelled after a new request has already been scheduled to the new connection. I'm not yet sure what the right behavior should be or how to fix it but it's good to have it verified as a bug.