We're operating at scale on GCS and regularly see transient HTTP 410 status codes when accessing Cloud Storage. These 410 status codes are misleading, though: they mask an internal backend error on GCS, as the error details show:
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException:
410 Gone { "code" : 503, "errors" : [ { "domain" : "global", "message" : "Backend Error", "reason" : "backendError" } ], "message" : "Backend Error" }
The google-cloud-storage client does not treat the 410 status code as retryable, understandably so. It should retry on backend errors, though, which are typically exposed with status code 500 or 503. I suggest treating backend errors in the client the same way it treats internal errors, namely by matching on reason == backendError independently of the HTTP status code.
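To make the suggestion concrete, here is a minimal sketch of a retry check that looks at the error reason first and the HTTP status second. It is not the client's actual retry code: the class name `BackendErrorRetry`, the method `isRetryable`, and the exact set of retryable reasons are assumptions for illustration. Only the `backendError` reason and the 500/503 status codes come from the report above.

```java
import java.util.Set;

/** Hypothetical helper, not part of google-cloud-storage. */
public final class BackendErrorRetry {
    // Reasons treated as transient in this sketch. "backendError" is the reason
    // from the error above; "internalError" is assumed to be the existing one.
    private static final Set<String> RETRYABLE_REASONS =
            Set.of("internalError", "backendError");

    private BackendErrorRetry() {}

    /**
     * Returns true if the error looks transient. The reason is checked
     * independently of the HTTP status, so a 410 that carries
     * reason == backendError is still considered retryable.
     */
    public static boolean isRetryable(int httpStatus, String reason) {
        if (reason != null && RETRYABLE_REASONS.contains(reason)) {
            return true;
        }
        // Fall back to the status codes that backend errors typically use.
        return httpStatus == 500 || httpStatus == 503;
    }
}
```

With this check, `isRetryable(410, "backendError")` is true, while a 410 with any other reason is still not retried.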
Note that we're not the first ones to experience this, and the client should be resilient against these transient GCS errors.
I've contacted the storage backend team, and if they aren't against it I'll add the retry logic.
Based on the discussion with the storage backend team, the 410 happens during a JSON API resumable upload session. The error likely indicates that the upload session has already been terminated, so retrying the individual HTTP request would not work (the entire upload session has to be restarted). An internal bug has been filed and the storage team is actively working on it.
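For anyone who needs a workaround in their own code in the meantime, restarting the entire upload could look roughly like the sketch below. Everything in it is hypothetical: `UploadSessionRestart`, `SessionGoneException`, and `uploadWithRestart` are illustrative names, and the caller is assumed to map the 410 onto `SessionGoneException` and to pass a `Callable` that opens a fresh session and re-sends all bytes on every call.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of google-cloud-storage. */
public final class UploadSessionRestart {

    /** Hypothetical marker for "the resumable session is gone" (the 410 above). */
    public static final class SessionGoneException extends Exception {
        public SessionGoneException(String message) {
            super(message);
        }
    }

    private UploadSessionRestart() {}

    /**
     * Runs wholeUpload from scratch up to maxAttempts times. Only a
     * SessionGoneException triggers a restart; any other failure propagates
     * immediately. If every attempt fails, the last SessionGoneException is rethrown.
     */
    public static <T> T uploadWithRestart(Callable<T> wholeUpload, int maxAttempts)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SessionGoneException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Each call must start a new upload session, not resume the old one.
                return wholeUpload.call();
            } catch (SessionGoneException e) {
                last = e;
            }
        }
        throw last;
    }
}
```

The sketch restarts immediately; a real caller would probably want a backoff between attempts.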
@hzyi-google and @JustinBeckwith
Is this issue still blocked by the internal bug?
For googlers: b/116709007 is the internal bug.
According to the bug it seems they decided that it's not possible to fix it in the client libraries.
Dataflow already effectively retries 410s by retrying every failed shard 4 times regardless of why it failed, so it won't be a problem for Dataflow users.
b/115694839 tracks the implementation of resumable uploads. People who are having these 410s directly in their projects might need to wait for this feature.
This issue is important and unfortunately not solvable by clients. I'm going to close this issue, since there's nothing we can do here.
@hzyi-google Could you please update the status of the internal bug? It's been almost a year...
@romange Sorry I do not work in this repo any more. cc/ @kolea2