Akka-http: Random UnexpectedConnectionClosureException with akka-http client

Created on 9 Oct 2017 · 19Comments · Source: akka/akka-http

copied from original ticket akka/akka#23791

Akka-http Version: 10.0.10
Scala Version: 2.11.11/2.12.3

We are experiencing issues where we randomly get a

akka.http.impl.engine.client.OutgoingConnectionBlueprint$UnexpectedConnectionClosureException: The http server closed the connection unexpectedly before delivering responses for 1 outstanding requests

when running a test suite that generates a lot of connections/load

I have attempted debugging with

    <logger name="akka.http.impl.engine.client.PoolGateway" level="DEBUG" additivity="false">
        <appender-ref ref="STDOUT"/>
    </logger>

And after inspecting the PoolGateway all of the connections are correctly being acquired/released (so its not an issue with a http request discardEntityBytes()/Unmarshal not being called)

Http-pipelining is also disabled (here is the application.conf we are using). Note for the akka-http client, http-pipelining should be disabled by default (according to the default conf here https://doc.akka.io/docs/akka-http/current/scala/http/configuration.html).

akka {
  loglevel = "DEBUG"
  loggers = ["akka.event.slf4j.Slf4jLogger"]
  logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
  http {
    host-connection-pool {
      // This is to speed up our tests
      max-connections = 32
      max-open-requests = 256
    }
  }
}

I highly doubt its an issue with the server we are calling since we used a different http client before (http4s) and the same test suite ran without issues (also the error happens with different hosts running different http stacks). Unfortunately this is company code so I can't post it here, however we have a lightbend subscription at our company if it helps diagnosing the problem.

Is there any low hanging fruit which can help in diagnosing/fixing this issue?

bug client core tclient

Source

johanandren

Most helpful comment

I reproduced it with this test in LowLevelConnectionSpec:

    "special persistent case" in new TestSetup(config =
      """
        akka.http.client.log-unencrypted-network-bytes = 100
        akka.http.host-connection-pool.client.log-unencrypted-network-bytes = 100
      """.stripMargin) {
      requestsSub.sendNext(HttpRequest(HttpMethods.GET, entity = HttpEntity(ContentTypes.`text/plain(UTF-8)`, "hello")))
      expectWireData(
        """GET / HTTP/1.1
          |Host: example.com
          |User-Agent: akka-http/test
          |Content-Type: text/plain; charset=UTF-8
          |Content-Length: 5
          |
          |hello""")

      sendWireData(
        s"""HTTP/1.1 200 OK
           |Transfer-Encoding: chunked
           |
           |3
           |abc
           |0
           |
           |""")
      val HttpResponse(StatusCodes.OK, _, HttpEntity.Chunked(_, chunkStream), _) = expectResponse()
      //val chunks = chunkStream.runWith(Sink.seq).awaitResult(3.seconds.dilated)
      //chunks shouldEqual Seq(Chunk(ByteString("abc")), LastChunk)

      requestsSub.sendNext(HttpRequest(HttpMethods.PATCH, entity = HttpEntity(ContentTypes.`text/plain(UTF-8)`, "hello2")))
      expectWireData(
        """PATCH / HTTP/1.1
          |Host: example.com
          |User-Agent: akka-http/test
          |Content-Type: text/plain; charset=UTF-8
          |Content-Length: 6
          |
          |hello2""")
      chunkStream.runWith(Sink.cancelled)
    }

It happens when a streamed entity of a previous response is discarded / cancelled after a new request has already been scheduled to the new connection. I'm not yet sure what the right behavior should be or how to fix it but it's good to have it verified as a bug.

jrudolph on 10 Oct 2017

🎉10

All 19 comments

@mdedetrich, usually, I'd say that these can happen with certain kinds of HTTP servers / reverse proxies because there's an inherent race condition with persistent HTTP connections if an idle persistent connection is closed by the server at the same time that the client sends another request on it (e.g. because of an idle-timeout). In that case, a server is not able to send a Connection: close header first, because no request is currently pending.

In your case, this seems a bit unlikely because of the different HTTP stacks you mention.

Another case, where I guess this could be observed is if there's a problem right during or after establishing a connection.

In which way is the error delivered? Do you see if the request is retried or are these requests idempotent?

Are these connections using TLS? If you connect to different hosts, does it happen against every one of those?

If it's reproducible for you, in the best case you could provide a tcpdump which contains the failing interaction to see what happens on the line (if the amount of traffic is a concern, packets could be snipped off using tcpdump's -s parameter to only include TCP/IP headers and the first bytes of TCP payload, that should already be enough).

jrudolph on 9 Oct 2017

Since this is only happening during tests, I suspect that resolving https://github.com/akka/akka-http/issues/768 would at least temporarily fix the situation for us

mdedetrich on 9 Oct 2017

I also made a new issue here https://github.com/akka/akka-http/issues/1463. I have been doing some reading online, and apparently in this situation automatically retrying the request if its idempotent (i.e. GET request) is considered standard behaviour. Since akka-http tracks which types of requests are idempotent this might be a better solution (rather than trying to track down the cause of errors) especially considered that it may be the other http clients that is causing this problem

mdedetrich on 10 Oct 2017

In the logs you sent, I saw lots of these combos:

[DEBUG] Materializer - [client-plain-text UP  ] Downstream finished.
[DEBUG] Materializer - [client-plain-text DOWN] Upstream finished.
[DEBUG] PoolGateway - [0] <...> connection was closed by peer while no requests were in flight

The third line could either be this "while no requests were in flight" or the UnexpectedConnectionClosureException if a request was ongoing at the point when it happened.

That it starts with UP Downstream finished is weird because it would mean that it's the client that cancels the incoming data stream from the network despite the later message that the "connection was closed by peer", which could indeed point to a bug. I can reproduce these log combos in our test suite, so I'll have a look if I can understand what's going on there.

jrudolph on 10 Oct 2017

Turned out that our own tests showed the similar behavior for good reasons because the requests/responses were Connection: close, so that didn't help diagnose the issue.

jrudolph on 10 Oct 2017

I reproduced it with this test in LowLevelConnectionSpec:

    "special persistent case" in new TestSetup(config =
      """
        akka.http.client.log-unencrypted-network-bytes = 100
        akka.http.host-connection-pool.client.log-unencrypted-network-bytes = 100
      """.stripMargin) {
      requestsSub.sendNext(HttpRequest(HttpMethods.GET, entity = HttpEntity(ContentTypes.`text/plain(UTF-8)`, "hello")))
      expectWireData(
        """GET / HTTP/1.1
          |Host: example.com
          |User-Agent: akka-http/test
          |Content-Type: text/plain; charset=UTF-8
          |Content-Length: 5
          |
          |hello""")

      sendWireData(
        s"""HTTP/1.1 200 OK
           |Transfer-Encoding: chunked
           |
           |3
           |abc
           |0
           |
           |""")
      val HttpResponse(StatusCodes.OK, _, HttpEntity.Chunked(_, chunkStream), _) = expectResponse()
      //val chunks = chunkStream.runWith(Sink.seq).awaitResult(3.seconds.dilated)
      //chunks shouldEqual Seq(Chunk(ByteString("abc")), LastChunk)

      requestsSub.sendNext(HttpRequest(HttpMethods.PATCH, entity = HttpEntity(ContentTypes.`text/plain(UTF-8)`, "hello2")))
      expectWireData(
        """PATCH / HTTP/1.1
          |Host: example.com
          |User-Agent: akka-http/test
          |Content-Type: text/plain; charset=UTF-8
          |Content-Length: 6
          |
          |hello2""")
      chunkStream.runWith(Sink.cancelled)
    }

jrudolph on 10 Oct 2017

🎉10

On the low level, this behavior is basically ok: if you'd try to use HTTP pipelining (which doesn't really work) and push another request onto a connection where a previous request has not yet completed, your connection will be closed when the response entity of a previous request is cancelled.

The real problem seems to be once again a race condition in the PoolSlot, where the detection of whether the entity of a response to a previous request has completely been read races against a new request already coming in. What should happen, is that the connection is closed first because a previous entity was cancelled before a new request would trigger a reconnection. What actually seems to happen is that the new request comes in slightly before the slot notices that the previous connection is gone and thus schedules the new request onto the old almost-gone connection leading to the observed problems. I could not actually trigger the problem with the pool because I assume it depends on some intricate timing / scheduling details between the connection stream, the entity stream, and the pool slot stream.

We should do a few things:

fix the race condition
improve and fix logging, so that it is clear why the connection was closed by us (!) in the first place ("entity was cancelled/discarded by the user")
rewrite the PoolSlot / PoolConductor logic to get rid of those race conditions in the first place (see #1312)
make discardEntity smarter not to cancel entities in all cases to prevent that persistent connections need to be closed in so many cases. That would mean looking for the end of the entity stream for a while before finally cancelling. The timeout could be quite low because in most cases of small entities the entity is already in buffers.
another optimization to regard in that respect would be to deal with cancellation more gracefully in the client logic by introducing a mechanism that would eagerly notice that all data is already buffered and, in that case, just discard the buffers, keeping the persistent connection alive (probably hard to do in current implementation to get the feedback wired correctly) (created https://github.com/akka/akka-http/issues/1479 to track this)

@mdedetrich as a short-term workaround, it could make sense to use entity.dataBytes.runWith(Sink.ignore) instead of using discardEntityBytes() when a response entity is not needed. Can you try if that gets rid of the problem?

jrudolph on 10 Oct 2017

@jrudolph

On the low level, this behavior is basically ok: if you'd try to use HTTP pipelining (which doesn't really work) and push another request onto a connection where a previous request has not yet completed, your connection will be closed when the response entity of a previous request is cancelled.

To be clear, we are not using pipelining (in the specific test case). We left it as disabled (since it also doesn't work correctly in all cases)

@mdedetrich as a short-term workaround, it could make sense to use entity.dataStream.runWith(Sink.ignore) instead of using discardEntityBytes() when a response entity is not needed. Can you try if that gets rid of the problem?

Will do, we are coding the workaround now and will let you know of the results tomorrow

mdedetrich on 10 Oct 2017

To be clear, we are not using pipelining (in the specific test case). We left it as disabled (since it also doesn't work correctly in all cases)

Thanks for the clarification. I actually didn't assume it, it was more of an explanation why the behavior was expected on that level.

jrudolph on 10 Oct 2017

Thanks for the clarification. I actually didn't assume it, it was more of an explanation why the behavior was expected on that level.

Yup, its for other people in the thread if they are wondering!

mdedetrich on 10 Oct 2017

Huh, awesome detective work @jrudolph, the making discardEntityBytes smarter will be a bit weird perhaps, but makes sense.

ktoso on 11 Oct 2017

@jrudolph

@mdedetrich as a short-term workaround, it could make sense to use entity.dataBytes.runWith(Sink.ignore) instead of using discardEntityBytes() when a response entity is not needed. Can you try if that gets rid of the problem?

The good news is that the errors rates have fallen down substantially. The bad news is we still got it. I did 25 test runs with our test suite yesterday, and the UnexpectedConnectionClosureException only happened once (before happened around 50-80% of the time)

mdedetrich on 11 Oct 2017

Update from our side, our current error rate for getting this error is around 15-20% which is a lot lower then before (however we are still getting it). I have submitted another trace output (using the entity.dataBytes.runWith(Sink.ignore) workaround to lightbend.

mdedetrich on 11 Oct 2017

Many thanks to @jrudolph who helped diagnose the root cause, it appeared that the issue is that akka-stream-json (the json streaming library we used) would sometimes cancel the json stream. This then happened to propagate to the main connection stream, which means the http connection stream would also cancel.

@johanandren close this ticket for now, will re-open if the problem persists or this isn't the root cause

mdedetrich on 16 Oct 2017

Let's summarize:

cancelling a response entity stream will close an HTTP connection -> this is so by design, but it's questionable if it's the right behavior in all cases (#1479 to provide other options)
cancelling a response entity stream leading to connection closure should at least be logged at DEBUG level to help debugging
the original here is still a bug in the connection pool: if ever possible the PoolSlot should not send a new request to a connection that is soon to be closed

So, removing the cancelling code in akka-stream-json is a good workaround for now, but it might also mean that a connection stays now open longer than necessary because it isn't cancelled any more. As I understand it the cancellation in akka-stream-json was only necessary to be able to ignore potential trailing extra (json) data after the main response. (I'll also have a look at the tickets over there.)

jrudolph on 16 Oct 2017

@jrudolph

the original here is still a bug in the connection pool: if ever possible the PoolSlot should not send a new request to a connection that is soon to be closed

To clarify, the current workaround for this issue is to use entity.dataBytes.runWith(Sink.ignore) instead of discardEntityBytes (in which case, when akka-stream-json gets updated I will remove the safeEntity fix but the discardEntityBytes workaround will remain)

mdedetrich on 16 Oct 2017

Has this been fixed in the new client connection pool implementation introduced in 10.1.0? Or would a fix require an implementation of #1686?

benjamingeer on 6 Jun 2018

Could someone help me to understand why entity.dataBytes.runWith(Sink.ignore) can be different of discardEntityBytes?

From the code, it is exactly what discardEntityBytes is doing:
override def discardBytes(mat: Materializer): HttpMessage.DiscardedEntity = new HttpMessage.DiscardedEntity(dataBytes.runWith(Sink.ignore)(mat))

And DiscardedEntity is just a wrapper for Sink.ignore: Future[Done]