This is a cross post originally detailed at https://issuetracker.google.com/issues/113672049
Essentially, the problem is that in a Google Cloud Functions Python endpoint the google-cloud-storage API intermittently throws a ProtocolError and ConnectionResetError when getting a blob.
Linking possibly relevant Golang issue: GoogleCloudPlatform/google-cloud-go#108
/cc @frankyn
I'm not sure what's happening here, acking for now in GCS library weekly.
I did not get additional input in the weekly meeting.
@brianmhunt could you tell me more about your use-case so I can try to reproduce?
Also what is the size of a file you're trying to read?
@frankyn Of course, thanks for the follow-up.
Our function is given an array of slices of PDFs and converts that to a single PDF. An oversimplified version is something like this:
def consolidatePdf(pdfSlices):
    """Create a PDF from the given slices.

    pdfSlices is an iterable of dicts of this form: { url: "gs://", range: [start, end] }
    """
    newPdf = PdfWriter()
    for slice in pdfSlices:
        reader = PdfReader(getPdfFromGS(slice['url']))
        start, end = slice['range']
        newPdf.append(reader.getPages(start, end))
    return newPdf.getBytes()
Where the getPdfFromGS performs the blob storage read. If you think it'll help I'm happy to share the code if you email me at brianmhunt at gmail.com.
I've only seen the failure occur on the very first file being read (but that doesn't mean that the problem is limited to the first file being read).
The files failing are fairly small, in the 150k range.
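A minimal sketch of what a getPdfFromGS helper along these lines might look like (an illustration only, assuming a google-cloud-storage client and a URL-to-(bucket, path) helper, not the actual code):

from io import BytesIO
from google.cloud import storage

client = storage.Client()

def getPdfFromGS(url):
    # Hypothetical helper: split the gs:// URL into bucket and object path,
    # download the bytes, and hand them to the PDF reader as a file-like object.
    bucket_name, path = urlToBucketPath(url)  # assumed URL-parsing helper
    blob = client.bucket(bucket_name).blob(path)
    return BytesIO(blob.download_as_string())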
Thanks, I'd like to keep this discussion as much as possible through Github. So if someone else hits a similar issue they can find it later.
Is PdfReader wrapping around the google-cloud-storage package? Could you share a portion of that code as well as PdfWriter?
Thanks, here's the salient bit of code that's throwing.
def generateUrlPdfBytesMap(urls):
    upm = dict()
    for url in urls:
        bucket, path = urlToBucketPath(url)
        account = path.split('/')[1]
        blob = storage.bucket(bucket).get_blob(path)  # 🔥
        upm[url] = blob.download_as_string()
    return upm
Where urlToBucketPath converts a gs:// or https:// Firebase url to the bucket, path pair per https://stackoverflow.com/questions/52064868.
It really is as simple as one could possibly imagine. I was going to include the PDF reader code, but it looks like it never gets called because the slurp (generateUrlPdfBytesMap) happens first.
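For illustration, a rough sketch of how that conversion could look, assuming gs:// URLs and Firebase download URLs of the usual shape (this is an assumption, not the code from the Stack Overflow answer):

from urllib.parse import urlparse, unquote

def urlToBucketPath(url):
    """Return (bucket, path) for a gs:// or Firebase https:// download URL."""
    parsed = urlparse(url)
    if parsed.scheme == 'gs':
        # gs://bucket/some/object/path
        return parsed.netloc, parsed.path.lstrip('/')
    # Firebase download URLs look like:
    # https://firebasestorage.googleapis.com/v0/b/<bucket>/o/<url-encoded object path>?alt=media&token=...
    parts = parsed.path.split('/')
    return parts[3], unquote(parts[5])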
Related, according to [email protected], it appears that a newer version of google-cloud-storage is available (i.e. I was using 1.10.0; version 1.11.0 is out).
I will switch to the new version and report any occurrences.
This continues to occur with google-cloud-storage version 1.11.0.
This issue occurs for me as well, but in my case it throws a connection error when I'm trying to get a bucket before signing a URL.
Code:
bucket = storage.get_bucket(bucket)
Traceback:
File "/code/xxx/yyy/models.py", line 106, in generate_signed_url
storage_bucket = STORAGE_CLIENT.get_bucket(GCS_BUCKET)
File "/usr/lib/python3.6/site-packages/google/cloud/storage/client.py", line 225, in get_bucket
bucket.reload(client=self)
File "/usr/lib/python3.6/site-packages/google/cloud/storage/_helpers.py", line 108, in reload
_target_object=self)
File "/usr/lib/python3.6/site-packages/google/cloud/_http.py", line 290, in api_request
headers=headers, target_object=_target_object)
File "/usr/lib/python3.6/site-packages/google/cloud/_http.py", line 183, in _make_request
return self._do_request(method, url, headers, data, target_object)
File "/usr/lib/python3.6/site-packages/google/cloud/_http.py", line 212, in _do_request
url=url, method=method, headers=headers, data=data)
File "/usr/lib/python3.6/site-packages/google/auth/transport/requests.py", line 201, in request
method, url, data=data, headers=request_headers, **kwargs)
File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 512, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3.6/site-packages/requests/sessions.py", line 622, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 495, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Same here. My context: I'm in a Pub/Sub worker retrieving a URL to download and store into a bucket. I've spent an ungodly amount of time trying to debug the retrieval method I was running, only to realise it's actually the Cloud Storage HTTPS machinery that is failing.
I get the same error as OP @brianmhunt and @arvindnrbt.
Connection reset by peer
It happens during:
storage.bucket(bucket).get_blob(path)
bigquery_client.insert_rows(table, rows_to_insert)
This is running on Google Cloud Functions with Python 3.7 and google-cloud-storage==1.11.0.
Not all the time, about 10% failure rate. Function deployed to us-east1 (I also tried us-central1, about the same).
@anyone-watching, is this still occurring? We were considering migrating from AWS Lambda, but this may hold us up.
I'm heads down on another issue. @tseaver could you take a look?
@frankyn This is a request for the same feature as [Python] Storage: automatic retry behavior for transient server failures (exponential backoff + jitter) in our feature backlog (we would just need to ensure that ProtocolError and ConnectionError are tracked as transient errors for that feature).
Would like to add that I'm experiencing same error in GCF environment using google-cloud-storage==1.12.0
I see that 1.13.0 has been released and I may try that to see if problem persists.
Hi,
I'm also having this issue. There is a scenario for which this occurs for me every time.
I send a dataset to a non-Google ML provider and we then wait for the prediction to finish. When it does, I take the .CSV file and deposit it on Google Storage. The wait can be up to 10 min (normally 6 min). I get the ConnectionResetError. If this is useful, I'm happy to share more details.
g-c-s==1.13.0
runtime python37
Remember that CF times out after 9 min, so the error could be "normal".
I would check the Stackdriver log to see what exact HTTP code is returned and raise the issue on the issue tracker, not here, as it sounds related to the core API.
If, as @yiga2 suggests, the CF run is being terminated due to a time limit, there is nothing we can do in google-cloud-storage to address it: the Python process itself is being killed in that case.
Not to disregard @stephenk289's issue, but lots of us are having issues when GCS throws 5xx errors that are not related to timeouts.
Some retry semantics built-in would be really useful.
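Until something built-in lands, a minimal hand-rolled retry with exponential backoff can be put around the failing call; this is a sketch only, and the wrapped call at the end is just an example:

import time
from requests.exceptions import ConnectionError as RequestsConnectionError
from urllib3.exceptions import ProtocolError

def with_retries(fn, attempts=4, base_delay=1.0):
    # Retry fn() on connection resets, backing off exponentially between attempts.
    for attempt in range(attempts):
        try:
            return fn()
        except (ConnectionResetError, RequestsConnectionError, ProtocolError):
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# e.g.: blob = with_retries(lambda: client.bucket(bucket_name).get_blob(path))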
Thanks very much for the answers / feedback @tseaver and @yiga2.
If you'll forgive me a follow-up -
I tried to respond to the error by recreating my client object [client = storage.Client()], as this was the closest I could find to re-establishing a connection so that I could interact afresh with Google Storage under decaying retry logic, but it didn't seem to work. Given the advice that there is a ~9 min time limit, I will therefore risk losing connections during long waits. Any steer most welcome.
I'm also hitting this error from Cloud Function trying to read from Cloud Storage.
The majority of invocations work, but I get this every so often, and it makes a lot of noise even if I retry.
My CFs finish in under 2 seconds in most cases, and I've followed the recommendations on https://cloud.google.com/functions/docs/bestpractices/networking to avoid re-establishing connections, which I suspect is also part of the problem.
Could it be that the same connection is (as intended) re-used across multiple invocations of the CF, and eventually the remote server (GCS) drops it right when I try to use it?
After a lot of research and interactions with G support, it turns out that indeed a connection may either time out (our case: resumable upload) or just disconnect sporadically; this happens with other cloud storage providers too.
No big deal, as ConnectionResetError (104) is a retriable error, but this must be handled on the client side.
Unfortunately google-cloud-python, like many others, delegates exponential backoff to lower-level dependencies (signalled by the deprecated num_retries), which means less control over when retries should trigger. And these dependencies do not treat 104 as retriable.
If you search out there, you will see there is a lot of debate (resistance?) about where and how best to address this: in the googleapis libs, requests, or urllib3 (I even read this may be Py3 specific), ...
For me, this is as simple as adding 104 to the list of transient errors (the 500-505 range or so), but I may be oversimplifying.
Despite the numerous posts, many of them recent, there is no real resolution.
If you can't wait for a resolution (or don't want to patch your own fork), you can look at gcsfs, which we use for streaming from/to GCS (from GCF). The ConnectionResetError is retried there; it does log an exception (which looks like an Error in Stackdriver logging), but the retries do happen and the function does not end abruptly.
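For reference, a minimal gcsfs usage sketch (the project, bucket, and path below are placeholders):

import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')  # placeholder project id

# Read an object; gcsfs retries transient errors such as ConnectionResetError internally.
with fs.open('my-bucket/path/to/file.pdf', 'rb') as f:
    data = f.read()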
@yiga2 Can you elaborate on how "ConnectionResetError" is retriable in gcsfs? Is there an argument I need to pass to the constructor to enable this? My Cloud Function is crashing right after the error is thrown.
I had found that that exception is simply a generic unhandled exception. Be sure to wrap all of your connections, i.e. API calls, GCS file transfers, etc., with try/except error handling. You will see that one of your connections may be failing, but it could be for any number of reasons.
@dustinfarris @jbugbee126 you don't need to wrap a connection call in gcsfs for the 104 error - although a good practice overall for any (new) uncaught error.
See https://github.com/dask/gcsfs/issues/12 and the related commit; gcsfs just logs the retriable 104 error at debug level (which shows as such in Stackdriver Logging), and your GCF should not break on it.
(Thanks @martindurant, the main contributor and author of gcsfs!)
CHEERS!
Lots of stuff to learn from the comments themselves.
So, I was trying to do a similar thing: accessing a file from GCS and then updating it. The same error still exists, so other than adding a try/except statement and retrieving the file again, is there any other solution? That approach also might be unreliable.
@crwilcox @frankyn @jkwlui @tseaver (Is there a storage team alias?)
Would you mind looking into the internal issue? It's been closed as 'Won't fix - not reproducible', but folks have commented here and over on the issue since then.
@busunkim96 what's the internal tracking issue?
@brianmhunt ConnectionResetError seems like the kind of error one might see when the VM is being torn down. Can you tell whether your function is failing due to a time limit? If so, there isn't really much we can do in google-cloud-storage to mitigate the issue.
@tseaver Thanks. The problem was not the time limit; it can fail in the first ~5 seconds.
@brianmhunt OK, good to know. Here is a workaround:
from urllib3.exceptions import ProtocolError
from google.api_core import retry

predicate = retry.if_exception_type(
    ConnectionResetError, ProtocolError)

reset_retry = retry.Retry(predicate)

def generateUrlPdfBytesMap(urls):
    upm = dict()
    for url in urls:
        bucket_name, path = urlToBucketPath(url)
        account = path.split('/')[1]
        bucket = storage.bucket(bucket_name)
        blob = bucket.get_blob(path)  # Note: makes API call
        upm[url] = reset_retry(blob.download_as_string)()
    return upm
@tseaver awesome, have passed this on to our devs. Some GCP client libraries seem to expose a retry option, some don't. This will be very handy, thanks!
Thanks @tseaver. IIUC, updating the retry strategy for blob download and upload can help with GCF time limits when the VM is unscheduled.
How are retry defaults defined in the GCS Python library? Do they mainly come from api_core, and do you have examples of best practices for modifying the retry strategy in other Python manual libraries?
@tritone is a new GCS DPE who will be taking a look at fixing it.
team alias: @googleapis/storage
We are working to correct the issues with the Python libraries' retry strategy and will continue to post updates on the ongoing work in https://github.com/googleapis/google-cloud-python/issues/9298.
As stated by @tseaver, the workaround is the following:

from urllib3.exceptions import ProtocolError
from google.api_core import retry

predicate = retry.if_exception_type(
    ConnectionResetError, ProtocolError)

reset_retry = retry.Retry(predicate)

def generateUrlPdfBytesMap(urls):
    upm = dict()
    for url in urls:
        bucket_name, path = urlToBucketPath(url)
        account = path.split('/')[1]
        bucket = storage.bucket(bucket_name)
        blob = bucket.get_blob(path)  # Note: makes API call
        upm[url] = reset_retry(blob.download_as_string)()
    return upm
Thank you for your patience.
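As a follow-up on tuning the retry strategy: google.api_core.retry.Retry also accepts backoff parameters, so the workaround above can be adjusted. The values below are illustrative assumptions, not recommended defaults:

from urllib3.exceptions import ProtocolError
from google.api_core import retry

reset_retry = retry.Retry(
    predicate=retry.if_exception_type(ConnectionResetError, ProtocolError),
    initial=1.0,     # first delay between attempts, in seconds
    maximum=30.0,    # cap on any single delay
    multiplier=2.0,  # exponential backoff factor
    deadline=120.0,  # give up after two minutes overall
)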
Two weeks ago I started getting ConnectionResetError on about 10% of the calls to blob.download_to_filename() from a GCE instance.
I've tried wrapping download_to_filename with Retry, but I'm still getting the same error.
Is there something wrong with my code?
How can I verify that ConnectionResetError was caught and the download failed several times?
from google.api_core import retry
from google.api_core.exceptions import InternalServerError
from google.api_core.exceptions import TooManyRequests
predicate = retry.if_exception_type(ConnectionResetError, InternalServerError, TooManyRequests)
r = retry.Retry(predicate=predicate)
r(blob.download_to_filename)(filename)
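One way to make the retries visible is the on_error hook on google.api_core.retry.Retry; a sketch, assuming the same predicate as above:

import logging
from google.api_core import retry
from google.api_core.exceptions import InternalServerError, TooManyRequests

def log_retry(exc):
    # Called by Retry each time a retriable exception is caught before the next attempt.
    logging.warning("download failed, will retry: %r", exc)

predicate = retry.if_exception_type(ConnectionResetError, InternalServerError, TooManyRequests)
r = retry.Retry(predicate=predicate, on_error=log_retry)
r(blob.download_to_filename)(filename)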
I've wrapped the call to blob.download_to_filename with try/except, and e.__class__.__name__ on the exception gives me ChunkedEncodingError.
Why does the error say ConnectionResetError while the exception is ChunkedEncodingError?
Shouldn't ChunkedEncodingError be added to google.api_core.retry.if_transient_error?
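requests typically re-raises the underlying urllib3 ProtocolError as ChunkedEncodingError during chunked downloads, so the reset may never reach the predicate under its original name. A sketch of widening the predicate to cover the wrapping exceptions as well:

from requests.exceptions import ChunkedEncodingError, ConnectionError as RequestsConnectionError
from urllib3.exceptions import ProtocolError
from google.api_core import retry

predicate = retry.if_exception_type(
    ConnectionResetError, RequestsConnectionError, ProtocolError, ChunkedEncodingError)
r = retry.Retry(predicate=predicate)
r(blob.download_to_filename)(filename)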