Client
Firestore
Environment
alpine:3.11 Docker image running on GKE
Go Environment
Built within docker image: golang:1.14-alpine3.11
$ go version
go version go1.14.4 linux/amd64
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build686252092=/tmp/go-build -gno-record-gcc-switches"
Code
$ grep firestore go.mod
cloud.google.com/go/firestore v1.2.0
func (db *DB) watchSnapshots(ctx context.Context, id string, t time.Time, cb ForEachCallback) error {
qsnaps := db.collection().
Query.
Where(idField, "==", id).
Where(timeField, ">", t).
OrderBy(timeField, firestore.Asc).
Snapshots(ctx)
defer qsnaps.Stop()
for {
qsnap, err := qsnaps.Next()
if err == iterator.Done {
return errors.New("stop called") // never expected since we only call `Stop()` via defer
}
if err != nil {
return fmt.Errorf("next: %w", err)
}
if err := cb(ctx, qsnap); err != nil {
return err
}
}
}
Expected behavior
The provided watchSnapshots() method should never generate a "stop called" error because the documentation says:
Next never returns iterator.Done unless it is called after Stop.
And the watchSnapshots() method only calls Stop() from defer.
Actual behavior
We see occasional instances where iterator.Done is returned from QuerySnapshotIterator.Next().
My hypothesis is something like:
Under the hood, the firestore package turns any io.EOF into an iterator.Done.
It assumes this io.EOF came from the watchStream.stop() method, but I believe they can come from underlying network issues/disconnects and the library misinterprets it.
For example, perhaps watchStream.recv()s call to watchStream.open() failed with io.EOF (either in it鈥檚 call to Listen() or Send()). The latter eventually calls grpc.ClienltStream.SendMsg() which does say that it can return io.EOF. There's some effort to retry on "non-permanent watch errors", but that's only done on errors from the underlying gRPC connection's Recv(), but apparently intentionally not during the initial/re open (which makes sense since if the connection can't even be established that may be a permanent networking issue).
As a stopgap the documentation could be changed to indicate iterator.Dones can be produced due to networking issues, but I think a better overall solution would be to propagate the io.EOF (or similar) error if a connection cannot be (re)opened so the caller can decide how to react to that situation.
Screenshots
NA
Additional context
NA
Thanks for your report! @benwhitehead any thoughts on this? We can definitely make the proposed docs fix in the immediate term.
@tritone I think getting the docs updated sounds good.
Then we can look at investigating what code change would need to happen to correct the misinformed Done.
Hey @cpick your hypothesis is correct. QuerySnapshotIterator.Next, when an io.EOF is received from the underlying stream, will modify the err to be iterator.Done
I think the following portion of this is reasonable.
As a stopgap the documentation could be changed to indicate iterator.Dones can be produced due to networking issues,
The current docstring for Next() does state this:
// Next is not expected to return iterator.Done unless it is called after Stop.
// Rarely, networking issues may also cause iterator.Done to be returned.
Though it seems query.go has this where document.go doesn't. Perhaps we need to ensure that it is mentioned on all watch types.
Regarding a suggested change to behavior:
I think a better overall solution would be to propagate the io.EOF (or similar) error if a connection cannot be (re)opened so the caller can decide how to react to that situation.
@jba I know this is likely out of your mind, but could you possibly share why io.EOF is transformed to iterator.Done for the user, instead of surfacing the inner io.EOF?
The docs are due to the earlier PR I made on this issue. I haven't done any further work to dig into why or at what level this occurs.
Best I can guess, I assumed I only saw io.EOF when the underlying connection was closed legitimately.
From a distance, it seems to me that a reasonable solution is for stop to indicate it was called using some other mechanism, like setting a bool. Then the code that transforms io.EOF into Done could first check that bool.
FYI: #3643 fixed a small error in how watch.stop was done. I believe this fix is tangential to this bug.
I also have a branch, https://github.com/googleapis/google-cloud-go/compare/master...crwilcox:refine-EOF?expand=1#diff-fbf8020e8989d7c99881c746391dda1da4706a1ce1216d16ca1dfce4f62b9284R453, where I have marked areas we might be returning done incorrectly. Though more debugging is needed.
I have had a watch running for ~3 weeks trying to find this issue. Unfortunately I haven't observed this error.
While I am going to keep digging into this, as I can't manage to reproduce there isn't much we can do at this time. Please reopen (or create a new issue) if you come across this again.