Go: net/http: Re-connect with upgraded HTTP2 connection fails to send Request.body

Created on 5 Jun 2019  ·  23Comments  ·  Source: golang/go

What version of Go are you using (go version)?

$ go version
go version go1.12.5 darwin/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output

$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/bfife/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/bfife/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/x7/yphtc2sj4ml2lrflpwdblvrxsmcq9d/T/go-build801671760=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

This is a follow up of issues #25009 #28206

We are sending requests as follows:

proto.transport = &http.Transport{}
proto.client = &http.Client{Transport: proto.transport}
req, err := http.NewRequest(request.Method, url, bytes.NewBuffer(request.Body))
resp, err := proto.client.Do(req)

What did you expect to see?

Good behaviour

With the above client configuration, I was able to determine the following code was sending out the request using the RoundTrip function noted here:

https://github.com/golang/go/blob/release-branch.go1.12/src/net/http/transport.go#L427

    if t.useRegisteredProtocol(req) {
        altProto, _ := t.altProto.Load().(map[string]RoundTripper)
        if altRT := altProto[scheme]; altRT != nil {
            if resp, err := altRT.RoundTrip(req); err != ErrSkipAltProtocol {
                return resp, err
            }
        }
    }

In our case, for the first and following requests t.useRegisteredProtocol(req) returns true, so altRT.RoundTrip(req) tries to send via h2_bundle's RoundTrip. As no connection exists at this point in time, this will fail due to error "http2: no cached connection was available", bubbling up ErrSkipAltProtocol "net/http: skip alternate protocol", and the code continues on. n.b. subsequent requests will be sent via altRT.RoundTrip(req) if there are cached connections available, however a GoAway received from the server will trigger the same behaviour of advancing to the next section of code.

https://github.com/golang/go/blob/release-branch.go1.12/src/net/http/transport.go#L447

    for {
        select {
        case <-ctx.Done():
            req.closeBody()
            return nil, ctx.Err()
        default:
        }

        // treq gets modified by roundTrip, so we need to recreate for each retry.
        treq := &transportRequest{Request: req, trace: trace}
        cm, err := t.connectMethodForRequest(treq)
        if err != nil {
            req.closeBody()
            return nil, err
        }

        // Get the cached or newly-created connection to either the
        // host (for http or https), the http proxy, or the http proxy
        // pre-CONNECTed to https server. In any case, we'll be ready
        // to send it requests.
        pconn, err := t.getConn(treq, cm)
        if err != nil {
            t.setReqCanceler(req, nil)
            req.closeBody()
            return nil, err
        }

        var resp *Response
        if pconn.alt != nil {
            // HTTP/2 path.
            t.decHostConnCount(cm.key()) // don't count cached http2 conns toward conns per host
            t.setReqCanceler(req, nil)   // not cancelable with CancelRequest
            resp, err = pconn.alt.RoundTrip(req)
        } else {
            resp, err = pconn.roundTrip(treq)
        }
        if err == nil {
            return resp, nil
        }
...

In the above for loop, for the first request without any connections available, a new transport connection is created, and pconn.alt.RoundTrip(req) or pconn.RoundTrip(req) is then used to submit and fulfill the Request, and this likely completes successfully. So far, so good.

As noted above, subsequent requests from our client continue to be sent from altRT.RoundTrip(req), at which point they are returned, without reaching the for loop.

Bad behaviour

After some time, our load balancer may respond with a GoAway response after a full request is transmitted, including data from the Request.Body, as seen with the following debug:

2019/06/04 02:22:26.750150 http2: Framer 0xc000470a80: wrote HEADERS flags=END_HEADERS stream=17 len=12
2019/06/04 02:22:26.750321 http2: Framer 0xc000470a80: wrote DATA stream=17 len=125 data="STRINGOFDATA"
2019/06/04 02:22:26.750386 http2: Framer 0xc000470a80: wrote DATA flags=END_STREAM stream=17 len=0 data=""
2019/06/04 02:22:26.753144 http2: Framer 0xc000470a80: read GOAWAY len=8 LastStreamID=15 ErrCode=NO_ERROR Debug=""
2019/06/04 02:22:26.753217 http2: Transport readFrame error on conn 0xc0001a3e00: (*errors.errorString) EOF
2019/06/04 02:22:26.753291 http2: Transport failed to get client conn for server:443: http2: no cached connection was available

This results in another transport being created in the for loop shown above, _however_ the Request.Body _is not_ being reset before the request is sent on the newly created transport connection.

The rewind code at https://github.com/golang/go/blob/release-branch.go1.12/src/net/http/transport.go#L497 doesn't get executed if pconn.alt.RoundTrip(req) or pconn.RoundTrip(req) are successful in sending a request after a connection is re-established. From my limited testing, I think the only time the rewind is reached is when TLSNextProto is manually set as it is in TestTransportBodyAltRewind, because this skips setting up the altProto map in transport.go/onceSetNextProtoDefaults(), as altRT is nil.

It looks like moving the rewind code from L497 to L455 would satisfy the case where an existing connection receives a GoAway frame and the Body had been sent to the server (as described above), however it would also execute in the case of the initial request where a transport does not exist and the request hasn't been sent (which makes that execution not necessary).

I believe this test case demonstrates the issue described above:

func TestTransportBodyGoAwayReceived(t *testing.T) {
        var err error
        var resp *Response
        req, _ := NewRequest("POST", "https://example.org/", bytes.NewBufferString("request"))
        tr := &Transport{DisableKeepAlives: true }
        tr.RegisterProtocol("https", httpsProto{})
        c := &Client{Transport: tr}
        resp, err = c.Do(req)
        if err != nil {
                t.Error(err)
        }
        if resp != nil {
                defer resp.Body.Close()
                ioutil.ReadAll(resp.Body)
        }
}

type httpsProto struct{}

func (httpsProto) RoundTrip(req *Request) (*Response, error) {
        defer req.Body.Close()
        ioutil.ReadAll(req.Body)
        return nil, ErrSkipAltProtocol
}

output

../bin/go test -v -run=TestTransportBodyGoAwayReceived net/http
=== RUN   TestTransportBodyGoAway
2019/06/04 21:36:24 Error enabling Transport HTTP/2 support: protocol https already registered
--- FAIL: TestTransportBodyGoAway (0.42s)
    transport_internal_test.go:274: Post https://example.org/: http: ContentLength=7 with Body length 0
FAIL
FAIL    net/http    0.495s

Moving the rewind code to the top of the for loop as described above results in success:

./bin/go test -v -run=TestTransportBodyGoAwayReceived net/http
=== RUN   TestTransportBodyGoAway
2019/06/04 21:35:54 Error enabling Transport HTTP/2 support: protocol https already registered
--- PASS: TestTransportBodyGoAway (0.56s)
PASS
ok      net/http    0.642s
NeedsFix release-blocker

Most helpful comment

This issue was incorrectly titled (x/net/http instead of net/http) and therefore incorrectly milestoned (Unreleased) and left out of Go 1.15 release planning. I've retiteld and remilestoned. Since this bug is affecting an important existing use case (Kubernetes), I've also marked it a release blocker. I also sent a fix.

The code change is fairly straightforward, but net/http itself is quite subtle, so it would be best to get the change into beta1 to get the most possible testing. Therefore I did not add the "okay-after-beta1" label.

All 23 comments

As I am not in a position to sign the CLA and submit a fix, it would be good to get some maintainers to have a look at this issue.

/cc @bradfitz @FiloSottile

Sorry to bump, but is anyone able to have a look at this ? @bradfitz @FiloSottile

Thanks for opening this issue @brianfife. Also, thanks for linking to #25009 and #28206, as they appear to be related (even though they were marked as closed as of Go 1.12). I'm on 1.12.7 and I'm seeing this issue with clients connecting to the API I work on.

You did a good job of showing the issue in the actual transport code, but I figured I'd add some of my client debug logs to the issue for additional context.

As you described, I always see this issue manifest after the server sends a GOAWAY frame after a request body has been sent. The internal retry logic in http2shouldRetryRequest in h2_bundle.go calls .GetBody() on the request in order to "recreate" the body to send on a new TCP connection, but the bytes never make it onto the wire:

...
2019/07/18 12:51:41 http2: Framer 0xc00027a000: read GOAWAY len=8 LastStreamID=1999 ErrCode=NO_ERROR Debug=""
2019/07/18 12:51:41 http2: Transport received GOAWAY len=8 LastStreamID=1999 ErrCode=NO_ERROR Debug=""
2019/07/18 12:51:41 http2: Transport failed to get client conn for api.foobar.com:443: http2: no cached connection was available
...
2019/07/18 12:51:41 http2: Transport creating client conn 0xc000508900 to 12.34.56.78:443
2019/07/18 12:51:41 http2: Transport encoding header "content-type" = "application/x-www-form-urlencoded"
2019/07/18 12:51:41 http2: Transport encoding header "content-length" = "51"
2019/07/18 12:51:41 http2: Transport encoding header "accept-encoding" = "gzip"
2019/07/18 12:51:41 http2: Framer 0xc000409500: wrote HEADERS flags=END_HEADERS stream=1 len=135
2019/07/18 12:51:41 http2: Framer 0xc000409500: wrote DATA flags=END_STREAM stream=1 len=0 data=""

In the above, we can clearly see that the request contained a 51 byte, form-encoded body but the H/2 framer never sent it, setting the END_STREAM flag after sending 0 bytes of data.

We typically see this happen when the client is using a moderate amount of concurrency, since the chances are increased that the periodic GOAWAY frames from the server will arrive after a request body has been sent.

Interestingly, the Go bug manifests when the transport is left to choose H/2 on it's own via the connection upgrade codepath (aka TLSNextProto).

So, either of these options will result in a client that does not repro this bug:

// Force H/2 explicitly
client := &http.Client{
    Transport: &http2.Transport{},
}

// OR Force HTTP/1.1 explicitly
client := http.Client{
    Transport: &http.Transport{
        TLSNextProto: make(map[string]func(string, *tls.Conn) http.RoundTripper),
    },
}

// OR, run the repro code with GODEBUG="http2client=0" to disable H/2

To echo @brianfife, I'm also sorry to bump this, but I'm wondering if there's any movement or thoughts on how to move forward with this particular issue? @bradfitz @FiloSottile

Correct, body is not re-winded after altRT.RoundTrip here:
https://github.com/golang/go/blob/cf06b9aa81004a017e5c98422fce2fafd1f24921/src/net/http/transport.go#L427-L434
I've missed that during #25009 fix because the trigger condition is less frequent and was not covered by my go-only test-case.
This condition triggers when http2Transport.RoundTripOpt is failing to get a connection during retry (e.g. on attempt to retry after GOAWAY when all other connections in pool are terminated), which results to http2noCachedConnError that is transformed into ErrSkipAltProtocol in http2noDialH2RoundTripper.RoundTrip.

If we change to something like that, everything should be OK:

    if t.useRegisteredProtocol(req) {
        altProto, _ := t.altProto.Load().(map[string]RoundTripper)
        if altRT := altProto[scheme]; altRT != nil {
            if resp, err := altRT.RoundTrip(req); err != ErrSkipAltProtocol {
                return resp, err
            } else if req.GetBody != nil {
                // Rewind the body after ErrSkipAltProtocol, because it can
                // be already used.
                newReq := *req
                var err error
                newReq.Body, err = req.GetBody()
                if err != nil {
                    return nil, err
                }
                req = &newReq
            }
        }
    }

I've tested this change with my repro and got zero "400 Bad Request" errors, but I'm not sure that it is correct solution, probably we can refactor something to make this rewind more explicit.

/cc @bradfitz

the above looks reasonable to me, @ernado do you want to open a PR?

Change https://golang.org/cl/210123 mentions this issue: net/http: rewind request body on ErrSkipAltProtocol

@ernado @bcmills any updates on this?

@brianfife, I left a comment on @ernado's CL 210123 indicating what I think is the way forward: namely, we need to add an optional method that a RoundTripper can implement in order to decline a request without reading its Body.

/cc

Could we fix this issue before next release? As an important feature(GOAWAY) in Kubernetes API Server rely on that.

No ETA from me for CL 210123, so please submit other CL if anybody wants, just check that commends from my CL are addressed.

Also current workaround is to use http2-only client.

See also https://golang.org/cl/234358 from #39086

Thanks, let me see if there is anything I can do to fix this issue ASAP.

Change https://golang.org/cl/234894 mentions this issue: net/http: handle body rewind in HTTP/2 connection loss better

This issue was incorrectly titled (x/net/http instead of net/http) and therefore incorrectly milestoned (Unreleased) and left out of Go 1.15 release planning. I've retiteld and remilestoned. Since this bug is affecting an important existing use case (Kubernetes), I've also marked it a release blocker. I also sent a fix.

The code change is fairly straightforward, but net/http itself is quite subtle, so it would be best to get the change into beta1 to get the most possible testing. Therefore I did not add the "okay-after-beta1" label.

Thanks for correcting the title, and submitting a patch - glad it's finally getting some traction!

This issue was incorrectly titled (x/net/http instead of net/http) and therefore incorrectly milestoned (Unreleased) and left out of Go 1.15 release planning. I've retiteld and remilestoned. Since this bug is affecting an important existing use case (Kubernetes), I've also marked it a release blocker. I also sent a fix.

The code change is fairly straightforward, but net/http itself is quite subtle, so it would be best to get the change into beta1 to get the most possible testing. Therefore I did not add the "okay-after-beta1" label.

Is it possible make this bugfix backport to old version like v1.14.x or v1.13.x ? Since kubernetes is still using go1.13 to build itself.

Thanks @rsc for your PR, which is very similar with my idea to fix this issue.

Indeed, @answer1991, thanks for that CL as a starting point.

@gopherbot please backport, as Kubernetes is affected.

Backport issue(s) opened: #39278 (for 1.13), #39279 (for 1.14).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

Change https://golang.org/cl/242117 mentions this issue: [release-branch.go1.14] net/http: handle body rewind in HTTP/2 connection loss better

Was this page helpful?
0 / 5 - 0 ratings