Google-cloud-go: storage: connection reset by peer on Cloud Functions

Created on 12 Dec 2018 · 8 Comments · Source: googleapis/google-cloud-go

Client

Storage

Describe Your Environment

Google Cloud Functions, using cloud-functions-go.

Expected Behavior

Transient network failures are retried.

Actual Behavior

Getting "connection reset by peer" on both downloads and uploads.

Details

See #108. I'm not sure I can expand much further on https://github.com/googleapis/google-cloud-go/issues/108#issuecomment-442793515, at least publicly.

Repeating myself, for reference:

This seems to be a property of the environment. Instances, when not in use, are left in a frozen state. The server drops the connection, but the client doesn't realize it because it's frozen. Node.js package authors have avoided Keep-Alive because of this, but that has its own issues (poorer performance, plus more connections and DNS queries, both of which have quotas). See this.
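For illustration only, here is roughly what the "avoid Keep-Alive" trade-off looks like with Go's net/http. This is a minimal sketch, not what the storage client does internally: newTransport is a hypothetical helper, and wiring such a transport into the storage client (with authentication) is deliberately left out.

```go
package main

import (
	"fmt"
	"net/http"
)

// newTransport is a hypothetical helper: it clones the default transport and
// turns connection reuse on or off. With keepAlive == false every request
// opens a fresh connection, so a stale connection cannot be reused, at the
// cost of extra TCP/TLS handshakes and DNS lookups.
func newTransport(keepAlive bool) *http.Transport {
	t := http.DefaultTransport.(*http.Transport).Clone()
	t.DisableKeepAlives = !keepAlive
	return t
}

func main() {
	client := &http.Client{Transport: newTransport(false)}
	tr := client.Transport.(*http.Transport)
	fmt.Println(tr.DisableKeepAlives) // prints "true"
}
```

Disabling keep-alive sidesteps stale connections but brings back exactly the costs listed above, which is why the rest of this report keeps the cached client instead.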

What am I doing?

My particular function downloads a file, converts it to something else, then uploads the result.
Storage operations are pretty standard, but spread out through app code.
Since this happens rarely, I don't think I can offer an MCVE.

Basically, it's a Cloud Function written in Go. In production, I'm using cloud-functions-go on the Node 8 Beta runtime. I've tested Google Cloud Functions for Go, and I have no reason to believe it might fix the issue, but I haven't used it enough to run into the issue often.

I'm caching the client (according to best practices to reduce DNS lookups and TCP connections).
I do this once (and store it in a global variable):

if gcs == nil {
    gcs, err = storage.NewClient(context.Background())
}

Then I do this to download the file:

ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
defer cancel()

obj := gcs.Bucket(inBucket).Object(inObject)
// Named rd so it doesn't clash with the request r used above.
rd, err := obj.NewReader(ctx)
if err != nil {
    return err
}
defer rd.Close()
    ...

Then I do my thing, and upload the result with:

ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
defer cancel()

obj := gcs.Bucket(outBucket).Object(outObject)
w := obj.NewWriter(ctx)
    ...
    return w.Close()

All pretty standard, and that's all I do/did.

I first observed a connection reset by peer on the download connection (to the storage.googleapis.com domain). This had me answer with a 500. The client then retries and the download is fine, but I get the same issue on the upload connection (to the www.googleapis.com domain).

What I've done to mitigate the issue

Whenever I get a storage error (and before I respond with the 500):

if gcs != nil {
    err = gcs.Close()
    gcs = nil
}

This reduced the occurrence of the first (download) error (I guess recycling the connection on timeouts and other errors helps), and basically eliminated the second (upload) error (the connection was already recycled on the download error, and both connections naturally time out when the instance is frozen).

storage p2 bug

ncruces

👍3

All 8 comments

@ncruces Heads up - we are investigating whether this should be addressed in Go's http stack. See: https://github.com/golang/go/issues/29308.

jadekler on 17 Dec 2018

I posted my findings and a proposition at https://github.com/golang/go/issues/29308#issuecomment-504617974 and as the Go 1.14 cycle proceeds, we'll hopefully get this fixed.

odeke-em on 25 Jun 2019

👍1

Closing this in favour of https://github.com/golang/go/issues/29308. Please track further progress there.

jadekler on 10 Jul 2019

@ncruces https://github.com/golang/go/issues/29308 is now closed. Could you see whether this fixes your issue?

jadekler on 6 Dec 2019

I've since moved away from cloud-functions-go to the official Go runtime, which makes it impossible to test a specific Go version. Also I was never able to repro this on a small scale (it happened only sporadically).

I'm not sure if this is going to be backported, but if it is (or if the Go runtime gets updated) in the near future, I'll report back.

ncruces on 7 Dec 2019

👍1

Still having this issue. Using cloudsql-proxy. The kubernetes deployment config:

...
        - name: cloudsql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.14
          command:
            - "/cloud_sql_proxy"
            - "-instances={{ .Values.cloudSql.instanceConnectionName }}=tcp:3306"
            - "-credential_file=/secrets/cloudsql/gcp-cloud-sql-client.json"
          securityContext:
            runAsUser: 2  # non-root user
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: cloudsql-instance-credentials
              mountPath: /secrets/cloudsql
              readOnly: true
      volumes:
        - name: cloudsql-instance-credentials
          secret:
            secretName: cloud-sql-client-credentials
...

Periodically (every hour), I get this error, which triggers a pod restart and causes timeouts in upstream services:

2020/01/27 10:03:10 Reading data from local connection on 127.0.0.1:3306 had error: read tcp 127.0.0.1:3306->127.0.0.1:57728: read: connection reset by peer
2020/01/27 10:03:10 Reading data from local connection on 127.0.0.1:3306 had error: read tcp 127.0.0.1:3306->127.0.0.1:57730: read: connection reset by peer
2020/01/27 10:03:10 Reading data from local connection on 127.0.0.1:3306 had error: read tcp 127.0.0.1:3306->127.0.0.1:57732: read: connection reset by peer
2020/01/27 10:03:10 Reading data from local connection on 127.0.0.1:3306 had error: read tcp 127.0.0.1:3306->127.0.0.1:57736: read: connection reset by peer
2020/01/27 10:03:10 Reading data from local connection on 127.0.0.1:3306 had error: read tcp 127.0.0.1:3306->127.0.0.1:57740: read: connection reset by peer

This would be OK, but it happens while our jobs are running, so they fail.
How can we fix it?

n-sviridenko on 27 Jan 2020

@n-sviridenko, the fix seems to be part of the go1.14beta1 branch. It does not appear to have been cherry-picked for backport to 1.13, so you won't have the fix if you're on Go 1.13.

ncruces on 27 Jan 2020

On go1.15, I'm still occasionally experiencing "read: connection reset by peer".