When we run into the following problematic state, the result is an infinite loop that is handled by setting a time out.
Scenario / the problematic state:
Although it is optimal debug the root cause that puts us into this scenario, this issue is more concerned about the handling of this scenario. When this scenario occurs in the library, there should be a way to not be in a state of an infinite retry, forcing anyone running into this scenario to implement a timeout as backup or setting ChunkSize = 0.
Potentially Related: https://github.com/googleapis/google-cloud-go/issues/470
Please note for anyone experiencing this issue with GCP, these are the current possible workarounds to mitigate the issue :
Thank you for the report @kimskimchi and for the patience!
On Monday, July 15th 2019, @frankyn and I debugged this issue exploring all the possible routes for how this could happen and the results of our session were:
tl;dr: retries on 429 and 503 after uploading chunks entirely create a new resumableUploadId whose start range doesn't match 0. The server then returns 503 due to the range mismatch, yet we retry 503s and go in limbo until exponential backoff hits its limits. We can fix this by either removing the retry in the writer.go code or in gensupport manually set the last used ?upload_id when we detect such issues
a) Most likely a 429 (too many requests) occurred to the user, (or just a 503 -- backendError with the storage service itself) that's in a *googleapi.Error, in the middle of an resumable upload
b) Due to the retry logic in
https://github.com/googleapis/google-cloud-go/blob/71971b35976fc2f904ed2772536790a5458d9996/storage/writer.go#L153-L160
notice that we invoke call.Do inside that retry? That'll create a fresh UploadId
c) With a fresh UploadId, given that the media hasn't been rewound i.e. the reader position n bytes from the start yet a new UploadId requires a fresh upload with a start range of 0 bytes. On attempting to upload from the last position, this then returns a 503 from the backend
d) Due to that retry logic, where we retry on 429 and 503
https://github.com/googleapis/google-cloud-go/blob/71971b35976fc2f904ed2772536790a5458d9996/storage/go110.go#L21-L32
Tonight, with the description of the problem I've worked on a repro that 100% reproduces this condition and standalone with a server too to reproduce what triggered the problem https://gist.github.com/odeke-em/8ff0d5a815c024d8b2f0424250205c64#file-main-go
package main
import (
"context"
"fmt"
"io"
"io/ioutil"
"log"
"net/http"
"net/http/httptest"
"net/http/httputil"
"strings"
"sync/atomic"
"time"
"golang.org/x/oauth2"
"cloud.google.com/go/storage"
"google.golang.org/api/option"
)
func main() {
log.SetFlags(0)
uploadRoute := "/upload"
var resumableUploadIDs atomic.Value
resumableUploadIDs.Store(make(map[string]time.Time))
lookupUploadID := func(resumableUploadID string) (time.Time, bool) {
t, ok := resumableUploadIDs.Load().(map[string]time.Time)[resumableUploadID]
return t, ok
}
memoizeUploadID := func(resumableUploadID string) {
resumableUploadIDs.Load().(map[string]time.Time)[resumableUploadID] = time.Now().UTC()
}
cst := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
blob, _ := httputil.DumpRequest(r, false)
log.Printf("\033[33mRequest in:\n%s\033[00m\n\n", blob)
resumableUploadID := r.URL.Query().Get("upload_id")
path := r.URL.Path
switch {
case path == "/b": // Bucket creation
log.Print("Creating bucket")
w.Write([]byte(`{"kind":"storage#bucket","id":"bucket","name":"bucket"}`))
return
case strings.HasPrefix(path, "/b/") && strings.HasSuffix(path, "/o"):
body, _ := ioutil.ReadAll(r.Body)
println("Invoking the object Creation endpoint with body: ", string(body))
if resumableUploadID == "" {
uploadID := time.Now().Format(time.RFC3339Nano)
w.Header().Set("X-GUploader-UploadID", uploadID)
// Now for the resumable upload URL.
w.Header().Set("Location", fmt.Sprintf("http://%s?upload_id=%s", r.Host+uploadRoute, uploadID))
} else {
w.Write([]byte(`{"kind":"storage#object","bucket":"bucket","name":"bucket"}`))
}
for false {
blob, err := ioutil.ReadAll(io.LimitReader(r.Body, 10))
log.Printf("Server read: %q\nerr: %v\n", blob, err)
}
return
case path == uploadRoute:
start, _, _, completedUpload, spamThem := parseContentRange(r.Header)
if resumableUploadID != "" {
_, ok := lookupUploadID(resumableUploadID)
if !ok {
if start == "0" {
// First time that we are encountering this upload
// and it is at byte 0, so memoize the uploadID.
memoizeUploadID(resumableUploadID)
} else {
// If the start and end range are non-zero this is the exact
// error in https://github.com/googleapis/google-cloud-go/issues/1507
// mismatched_content_start (Invalid request. According to the Content-Range header,
// the upload offset is 1082130432 byte(s), which exceeds already uploaded size of 0 byte(s).)
errStr := fmt.Sprintf("mismatched_content_start (Invalid request. According to the Content-Range header,"+
"the upload offset is %s byte(s), which exceeds already uploaded size of 0 byte(s).)\n%s", start, r.Header["Content-Range"])
http.Error(w, errStr, http.StatusServiceUnavailable)
return
}
}
log.Printf("resumableUploadID: %q\n", resumableUploadID)
}
if spamThem {
// Reproduce https://github.com/googleapis/google-cloud-go/issues/1507
// by sending then a retryable error on the last byte.
w.WriteHeader(http.StatusTooManyRequests)
return
}
if completedUpload {
// Completed the upload.
return
}
// Consume the body since we can accept this body.
_, _ = ioutil.ReadAll(r.Body)
w.Header().Set("X-Http-Status-Code-Override", "308")
return
default:
println("Unknown path ", path)
http.Error(w, "Unimplemented", http.StatusNotFound)
return
}
}))
defer cst.Close()
hc := &http.Client{
Transport: &oauth2.Transport{
Source: new(tokenSupplier),
},
}
ctx := context.Background()
opts := []option.ClientOption{option.WithHTTPClient(hc), option.WithEndpoint(cst.URL)}
sc, err := storage.NewClient(ctx, opts...)
if err != nil {
log.Fatalf("Failed to create storage client: %v", err)
}
defer sc.Close()
obj := sc.Bucket("issue-1507").Object("object")
w := obj.NewWriter(ctx)
maxFileSize := 1082166404
chunkSize := maxFileSize / 4
w.ChunkSize = chunkSize
defer w.Close()
for i := 0; i < maxFileSize; {
nowStr := time.Now().Format(time.RFC3339Nano)
n, _ := fmt.Fprintf(w, "%s%s", nowStr, strings.Repeat("a", w.ChunkSize))
println("looping ", i)
<-time.After(20 * time.Millisecond)
i += n
}
}
type tokenSupplier int
func (ts *tokenSupplier) Token() (*oauth2.Token, error) {
return &oauth2.Token{
AccessToken: "access-token",
TokenType: "Bearer",
RefreshToken: "refresh-token",
Expiry: time.Now().Add(time.Hour),
}, nil
}
func parseContentRange(hdr http.Header) (start, end, max string, completed, spamThem bool) {
cRange := strings.TrimPrefix(hdr.Get("Content-Range"), "bytes ")
rangeSplits := strings.Split(cRange, "/")
prelude := rangeSplits[0]
max = rangeSplits[1]
if len(prelude) == 0 || prelude == "*" {
// Completed the upload.
completed = true
return
}
startEndSplit := strings.Split(prelude, "-")
start, end = startEndSplit[0], startEndSplit[1]
if max != "*" { // They've uploaded the last byte.
// Reproduce https://github.com/googleapis/google-cloud-go/issues/1507
// by sending then a retryable error on the last byte.
spamThem = true
}
return
}
go run server.go
looping 0
lastUsedResumableUpload
Request in:
POST /b/issue-1507/o?alt=json&prettyPrint=false&projection=full&uploadType=resumable HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 40
Content-Type: application/json
User-Agent: google-api-go-client/0.5
X-Goog-Api-Client: gl-go/adcb2b1e7a gccl/20180226
X-Upload-Content-Type: text/plain; charset=utf-8
Invoking the object Creation endpoint with body: {"bucket":"issue-1507","name":"object"}
Backoff: <nil>
Request in:
POST /upload?upload_id=2019-07-20T01:07:39.453196-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 270794752
Content-Range: bytes 0-270794751/*
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
resumableUploadID: "2019-07-20T01:07:39.453196-07:00"
looping 270541633
Request in:
POST /upload?upload_id=2019-07-20T01:07:39.453196-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 270794752
Content-Range: bytes 270794752-541589503/*
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
resumableUploadID: "2019-07-20T01:07:39.453196-07:00"
looping 541083266
Request in:
POST /upload?upload_id=2019-07-20T01:07:39.453196-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 270794752
Content-Range: bytes 541589504-812384255/*
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
resumableUploadID: "2019-07-20T01:07:39.453196-07:00"
looping 811624899
Request in:
POST /upload?upload_id=2019-07-20T01:07:39.453196-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 269782276
Content-Range: bytes 812384256-1082166531/1082166532
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
resumableUploadID: "2019-07-20T01:07:39.453196-07:00"
LastSeen uploadID: "2019-07-20T01:07:39.453196-07:00"
TransferChunk: 429 Too Many Requests, err: <nil>
Body in error:
Request in:
POST /upload?upload_id=2019-07-20T01:07:39.453196-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 269782276
Content-Range: bytes 812384256-1082166531/1082166532
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
resumableUploadID: "2019-07-20T01:07:39.453196-07:00"
LastSeen uploadID: "2019-07-20T01:07:39.453196-07:00"
TransferChunk: 429 Too Many Requests, err: <nil>
Body in error:
Request in:
POST /upload?upload_id=2019-07-20T01:07:39.453196-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 269782276
Content-Range: bytes 812384256-1082166531/1082166532
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
resumableUploadID: "2019-07-20T01:07:39.453196-07:00"
LastSeen uploadID: "2019-07-20T01:07:39.453196-07:00"
TransferChunk: 429 Too Many Requests, err: <nil>
Body in error:
Request in:
POST /upload?upload_id=2019-07-20T01:07:39.453196-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 269782276
Content-Range: bytes 812384256-1082166531/1082166532
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
resumableUploadID: "2019-07-20T01:07:39.453196-07:00"
Sleeping for 235.010052ms
lastUsedResumableUpload
Request in:
POST /b/issue-1507/o?alt=json&prettyPrint=false&projection=full&uploadType=resumable HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 40
Content-Type: application/json
User-Agent: google-api-go-client/0.5
X-Goog-Api-Client: gl-go/adcb2b1e7a gccl/20180226
X-Upload-Content-Type: text/plain; charset=utf-8
Invoking the object Creation endpoint with body: {"bucket":"issue-1507","name":"object"}
Backoff: <nil>
Request in:
POST /upload?upload_id=2019-07-20T01:07:43.192198-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 269782276
Content-Range: bytes 812384256-1082166531/1082166532
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
LastSeen uploadID: "2019-07-20T01:07:43.192198-07:00"
TransferChunk: 503 Service Unavailable, err: <nil>
Body in error: mismatched_content_start (Invalid request. According to the Content-Range header,the upload offset is 812384256 byte(s), which exceeds already uploaded size of 0 byte(s).)
[bytes 812384256-1082166531/1082166532]
Request in:
POST /upload?upload_id=2019-07-20T01:07:43.192198-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 269782276
Content-Range: bytes 812384256-1082166531/1082166532
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
LastSeen uploadID: "2019-07-20T01:07:43.192198-07:00"
TransferChunk: 503 Service Unavailable, err: <nil>
Body in error: mismatched_content_start (Invalid request. According to the Content-Range header,the upload offset is 812384256 byte(s), which exceeds already uploaded size of 0 byte(s).)
[bytes 812384256-1082166531/1082166532]
Request in:
POST /upload?upload_id=2019-07-20T01:07:43.192198-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 269782276
Content-Range: bytes 812384256-1082166531/1082166532
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
LastSeen uploadID: "2019-07-20T01:07:43.192198-07:00"
TransferChunk: 503 Service Unavailable, err: <nil>
Body in error: mismatched_content_start (Invalid request. According to the Content-Range header,the upload offset is 812384256 byte(s), which exceeds already uploaded size of 0 byte(s).)
[bytes 812384256-1082166531/1082166532]
Request in:
POST /upload?upload_id=2019-07-20T01:07:43.192198-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 269782276
Content-Range: bytes 812384256-1082166531/1082166532
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
Sleeping for 1.331776149s
lastUsedResumableUpload
Request in:
POST /b/issue-1507/o?alt=json&prettyPrint=false&projection=full&uploadType=resumable HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 40
Content-Type: application/json
User-Agent: google-api-go-client/0.5
X-Goog-Api-Client: gl-go/adcb2b1e7a gccl/20180226
X-Upload-Content-Type: text/plain; charset=utf-8
Invoking the object Creation endpoint with body: {"bucket":"issue-1507","name":"object"}
Backoff: <nil>
Request in:
POST /upload?upload_id=2019-07-20T01:07:45.25755-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 269782276
Content-Range: bytes 812384256-1082166531/1082166532
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
LastSeen uploadID: "2019-07-20T01:07:45.25755-07:00"
TransferChunk: 503 Service Unavailable, err: <nil>
Body in error: mismatched_content_start (Invalid request. According to the Content-Range header,the upload offset is 812384256 byte(s), which exceeds already uploaded size of 0 byte(s).)
[bytes 812384256-1082166531/1082166532]
Request in:
POST /upload?upload_id=2019-07-20T01:07:45.25755-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 269782276
Content-Range: bytes 812384256-1082166531/1082166532
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
LastSeen uploadID: "2019-07-20T01:07:45.25755-07:00"
TransferChunk: 503 Service Unavailable, err: <nil>
Body in error: mismatched_content_start (Invalid request. According to the Content-Range header,the upload offset is 812384256 byte(s), which exceeds already uploaded size of 0 byte(s).)
[bytes 812384256-1082166531/1082166532]
Request in:
POST /upload?upload_id=2019-07-20T01:07:45.25755-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 269782276
Content-Range: bytes 812384256-1082166531/1082166532
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
Sleeping for 2.760398085s
lastUsedResumableUpload
Request in:
POST /b/issue-1507/o?alt=json&prettyPrint=false&projection=full&uploadType=resumable HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 40
Content-Type: application/json
User-Agent: google-api-go-client/0.5
X-Goog-Api-Client: gl-go/adcb2b1e7a gccl/20180226
X-Upload-Content-Type: text/plain; charset=utf-8
Invoking the object Creation endpoint with body: {"bucket":"issue-1507","name":"object"}
Backoff: <nil>
Request in:
POST /upload?upload_id=2019-07-20T01:07:48.694532-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 269782276
Content-Range: bytes 812384256-1082166531/1082166532
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
LastSeen uploadID: "2019-07-20T01:07:48.694532-07:00"
TransferChunk: 503 Service Unavailable, err: <nil>
Body in error: mismatched_content_start (Invalid request. According to the Content-Range header,the upload offset is 812384256 byte(s), which exceeds already uploaded size of 0 byte(s).)
[bytes 812384256-1082166531/1082166532]
Request in:
POST /upload?upload_id=2019-07-20T01:07:48.694532-07:00 HTTP/1.1
Host: 127.0.0.1:56376
Accept-Encoding: gzip
Authorization: Bearer access-token
Content-Length: 269782276
Content-Range: bytes 812384256-1082166531/1082166532
Content-Type: text/plain; charset=utf-8
User-Agent: google-api-go-client/0.5
X-Guploader-No-308: yes
LastSeen uploadID: "2019-07-20T01:07:48.694532-07:00"
TransferChunk: 503 Service Unavailable, err: <nil>
Body in error: mismatched_content_start (Invalid request. According to the Content-Range header,the upload offset is 812384256 byte(s), which exceeds already uploaded size of 0 byte(s).)
[bytes 812384256-1082166531/1082166532]
a) We can either remove the retry code in storage/writer.go because the code in googleapis/gensupport/resumable.go tries its own uploads but the one in storage/writer.go erroneously assumes statelessness without reusing the ?upload_id or X-GUploader-UploadId header
b) or in googleapis/gensupport/resumable.go manually copy and reuse the first used ?upload_id when we detect offsets are beyond 0 since we don't want to create a new upload_id
Hey @asonawalla,
Although we need to make sure to diagnose your scenario so that you can remove the workaround safely, we weren't completely satisfied with options A and B:
Options:
A) Remove the workaround for a certain set amount of time (eg. 1 day) and capture logs if the scenario is triggered.
B) We update the code to resolve the issue with retrying a non-retryable state. Then you proceed team would remove the workaround. However this ends up with a similar state of uncertainty, of whether or not your bug is 100% addressed.
C) Gain more context on how you are using the client (environment, upload patterns, object sizes), and how often the issue happens. -- Engineers (tagged already on this ticket) may also have further questions.
Please let us know what option you feel most comfortable with. Thanks!
Thanks @kimskimchi and @odeke-em for the thorough reproduction! This does capture the loop we saw ourselves in. Though I can't say for certain what _caused_ us to get into that loop (whether it was a http.StatusTooManyRequests as in the example above or something else), it seems like either approach a) or b) in the proposed fix would solve for it.
For the context on where we've been seeing these errors, they've been observed on the brokers of our pub/sub system gazette (see the notes under "Brokers do not provide long-term storage of journal content."; brokers would occasionally get stuck uploading journal "fragments" to cloud storage). Note that we're using an older version of gazette (no longer on the master branch) which disables chunked uploads here.
wrt attempting to recreate, in practice we were only seeing this issue roughly once every other week, so not sure we'd get anything useful out of turning chunking back on for a day. Having said that, I'd at least personally be comfortable with re-enabling chunking and calling this resolved if we can solve for the scenario put forward in the reproduction above.
@frankyn @odeke-em @asonawalla How about this: we make the proposed fix (thanks @odeke-em ) on a branch, and then @asonawalla would you mind putting that branch into prod and seeing if you see reduced errors?
@jadekler yeah, that sounds great to me
Thank you for the patience @asonawalla! I've posted up a branch for you, please take a look at https://github.com/orijtech/google-cloud-go/blob/storage-remove-writer-indefinite-retry/GOMODULES_REPLACE.md if you use Go modules, otherwise if you are just doing a go get from a specific spot, please see the diff in https://github.com/orijtech/google-cloud-go/commit/e62b8e7c2ef8772630cd7d0df789988b31f5321a#diff-f3b64f8fe04302431a7368f5af80d756 or inlined below
diff --git a/storage/writer.go b/storage/writer.go
index 3a58c404..c0b23019 100644
--- a/storage/writer.go
+++ b/storage/writer.go
@@ -144,21 +144,11 @@ func (w *Writer) open() error {
call.UserProject(w.o.userProject)
}
setClientHeader(call.Header())
- // If the chunk size is zero, then no chunking is done on the Reader,
- // which means we cannot retry: the first call will read the data, and if
- // it fails, there is no way to re-read.
- if w.ChunkSize == 0 {
- resp, err = call.Do()
- } else {
- // We will only retry here if the initial POST, which obtains a URI for
- // the resumable upload, fails with a retryable error. The upload itself
- // has its own retry logic.
- err = runWithRetry(w.ctx, func() error {
- var err2 error
- resp, err2 = call.Do()
- return err2
- })
- }
+
+ // The internals that perform call.Do automatically retry
+ // requests for us hence no need to add retries here.
+ // See issue https://github.com/googleapis/google-cloud-go/issues/1507.
+ resp, err = call.Do()
}
if err != nil {
w.mu.Lock()
Hey @odeke-em, I just deployed our application with your forked storage lib in canary; so far so good. Will report back in a day or two to confirm no issues.
@odeke-em looks good in my testing, haven't seen any signs of the retry loop resurfacing.
Thanks @asonawalla, any update this week or from the past weekend?
@frankyn smooth sailing
Thanks @asonawalla. @odeke-em, could you merge the changes into master?
Roger that @frankyn, and thanks for the confirmation and canarying @asonawalla!
I've mailed CL https://code-review.googlesource.com/c/gocloud/+/44071 which will be merged soon.
Most helpful comment
Thank you for the report @kimskimchi and for the patience!
On Monday, July 15th 2019, @frankyn and I debugged this issue exploring all the possible routes for how this could happen and the results of our session were:
tl;dr: retries on 429 and 503 after uploading chunks entirely create a new resumableUploadId whose start range doesn't match 0. The server then returns 503 due to the range mismatch, yet we retry 503s and go in limbo until exponential backoff hits its limits. We can fix this by either removing the retry in the writer.go code or in gensupport manually set the last used
?upload_idwhen we detect such issuesa) Most likely a 429 (too many requests) occurred to the user, (or just a 503 -- backendError with the storage service itself) that's in a *googleapi.Error, in the middle of an resumable upload
b) Due to the retry logic in
https://github.com/googleapis/google-cloud-go/blob/71971b35976fc2f904ed2772536790a5458d9996/storage/writer.go#L153-L160
notice that we invoke
call.Doinside that retry? That'll create a fresh UploadIdc) With a fresh UploadId, given that the media hasn't been rewound i.e. the reader position n bytes from the start yet a new UploadId requires a fresh upload with a start range of 0 bytes. On attempting to upload from the last position, this then returns a 503 from the backend
d) Due to that retry logic, where we retry on 429 and 503
https://github.com/googleapis/google-cloud-go/blob/71971b35976fc2f904ed2772536790a5458d9996/storage/go110.go#L21-L32
Repro
Tonight, with the description of the problem I've worked on a repro that 100% reproduces this condition and standalone with a server too to reproduce what triggered the problem https://gist.github.com/odeke-em/8ff0d5a815c024d8b2f0424250205c64#file-main-go
end-to-end.go
logs.md
Proposed fix
a) We can either remove the retry code in
storage/writer.gobecause the code ingoogleapis/gensupport/resumable.gotries its own uploads but the one instorage/writer.goerroneously assumes statelessness without reusing the?upload_idorX-GUploader-UploadIdheaderb) or in
googleapis/gensupport/resumable.gomanually copy and reuse the first used?upload_idwhen we detect offsets are beyond 0 since we don't want to create a new upload_id