Google-cloud-go: storage: handle connection reset errors

Created on 17 Mar 2020  路  11Comments  路  Source: googleapis/google-cloud-go

I don't think the client library is handling connection reset errors

Client

Storage

Environment

ECS Service on AWS

Code

package main

func main() {
        bucket := "..."
        path := "..."
        r := bytes.NewReader("...")

    w := s.Bucket(bucket).Object(path).NewWriter(ctx)
    w.ContentType = "application/gzip"
        if _, err = io.Copy(w, in.Body); err != nil {
        panic(err)
    }

    if err = w.Close(); err != nil {
        panic(err)
    }
}

Expected behavior

I believe this a temporary error, and it should be retried by client library (following https://cloud.google.com/storage/docs/exponential-backoff). It should rarely be bubbled up to the caller - and only do so after exhausting retries.

Actual behavior

Sometimes, an error is returned by the client:

Post "https://www.googleapis.com/upload/storage/v1/b/{omitted}/o?alt=json&prettyPrint=false&projection=full&uploadType=multipart": read tcp 172.17.0.21:52100->172.217.3.170:443: read: connection reset by peer

I've only seen this during the w.Close call so far. For our production environment, this happens on approximately 0.1% of the time for some projects.

Screenshots

N/A

Additional context

N/A

storage question

Most helpful comment

cc @tritone @codyoss

Marking this as a question for now because we're solidifying our retry guidance/implementation for a couple other issues, too. Not sure what the right next step is for this quite yet.

All 11 comments

cc @tritone @codyoss

Marking this as a question for now because we're solidifying our retry guidance/implementation for a couple other issues, too. Not sure what the right next step is for this quite yet.

And thank you for reporting this!

Is unexpected EOF another similar error that could be retried? At least for certain types of requests perhaps. I observed a few of these errors when listing objects due to https://status.cloud.google.com/incident/zall/20003 I believe. Thank you!

This bug looks similar to https://github.com/googleapis/google-cloud-go/issues/1638 which will be solved https://code-review.googlesource.com/c/gocloud/+/48990 which I have just chimed in on. That linked CL whose motivation was already approved by the GCS team will retry 1net/url.Error.Temporary() -> true1 errors os that should also fix this bug report.

Is unexpected EOF another similar error that could be retried? At least for certain types of requests perhaps. I observed a few of these errors when listing objects due to https://status.cloud.google.com/incident/zall/20003 I believe. Thank you!

I am not too sure about that @horgh, great question! I think that though it should be a client retryable error by the caller, because ideally the transport layer would have retried the underlying error as many times then finally bubbled it up to the caller once all the retries were exhausted.

Here is a full reproducer without having to go out of the network

package main

import (
    "context"
    "fmt"
    "net"
    "net/url"
    "sync"

    "cloud.google.com/go/storage"
    "google.golang.org/api/option"
)

func main() {
    ln, err := net.Listen("tcp", ":0") // Reserve an address for us.
    if err != nil {
        panic(err)
    }

    addr := ln.Addr().String()
    serverURL := "http://" + addr

    var wg sync.WaitGroup
    defer wg.Wait()

    wg.Add(1)
    go func() {
        defer wg.Done()

        conn, err := ln.Accept()
        if err != nil {
            panic(err)
        }
        conn.(*net.TCPConn).SetLinger(0)
        // Immediately close it after accepting to toggle an ECONNRESET.
        conn.Close()
    }()

    ctx := context.Background()
    client, err := storage.NewClient(ctx, option.WithEndpoint(serverURL))
    if err != nil {
        panic(err)
    }
    if _, err := client.Bucket("foo").Attrs(ctx); err != nil {
        oe := err.(*url.Error).Err.(*net.OpError)
        fmt.Printf("Temporary: %t\nUnderlying error: %#v\n", oe.Temporary(), oe.Err)
        panic(err)
    }
}

which prints

Temporary: false
Underlying error: &os.SyscallError{Syscall:"read", Err:0x36}
panic: Get "http://[::]:54680/b/foo?alt=json&prettyPrint=false&projection=full": read tcp [::1]:54682->[::1]:54680: read: connection reset by peer

goroutine 1 [running]:
main.main()
    /Users/emmanuelodeke/Desktop/openSrc/bugs/google-cloud-go/1832/repro.go:52 +0x45c
exit status 2

Hi all, I merged https://code-review.googlesource.com/c/gocloud/+/55970 which should resolve this issue for _most_ operations (reads, object metadata operations, etc). However, retries for writes use a different code path involving logic in google-api-go-client so I am working on a separate fix for that. Will update here when that is ready.

Awesome, thank you!

Regarding unexpected EOF - I'm not sure I follow the rationale why it's something the caller needs to take care of vs. the others. It'd simplify my life as a caller if it was retried too! For example internally I currently wrap all read type calls and retry them if it's that or if it's a net.OpError. It sounds like that will still be necessary, which seems unfortunate!

@horgh are you experiencing this issue on writes only or on other operations as well?

The error io.ErrUnexpectedEOF should be retried for writes at least at the level of the generated client; see https://github.com/googleapis/google-api-go-client/blob/326e17a21103f4ccf44ac1b40587ce7bcdd58b14/internal/gensupport/resumable.go#L162 . However there is an open bug regarding this at googleapis/google-api-go-client#449 that says that this is not working as intended. I'm working on fixing this as well.

It was read operations where I observed it happening. For example, I saw it when listing objects.

The fix for writes has been released in storage/v1.10.0 . I'm going to close this issue.

@horgh feel free to file another issue for unexpectedEOF and I will look into whether that can be addressed as well (need to do some further research on my end).

Was this page helpful?
0 / 5 - 0 ratings