Google-cloud-go: firestore: `QuerySnapshotIterator.Next()` can return `iterator.Done` even if `Stop()` hasn't been called

Created on 14 Jul 2020 · 8Comments · Source: googleapis/google-cloud-go

Client

Firestore

Environment

alpine:3.11 Docker image running on GKE

Go Environment

Built within docker image: golang:1.14-alpine3.11
$ go version
go version go1.14.4 linux/amd64
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build686252092=/tmp/go-build -gno-record-gcc-switches"

Code

$ grep firestore go.mod
cloud.google.com/go/firestore v1.2.0

func (db *DB) watchSnapshots(ctx context.Context, id string, t time.Time, cb ForEachCallback) error {
    qsnaps := db.collection().
        Query.
        Where(idField, "==", id).
        Where(timeField, ">", t).
        OrderBy(timeField, firestore.Asc).
        Snapshots(ctx)
    defer qsnaps.Stop()

    for {
        qsnap, err := qsnaps.Next()
        if err == iterator.Done {
            return errors.New("stop called") // never expected since we only call `Stop()` via defer
        }
        if err != nil {
            return fmt.Errorf("next: %w", err)
        }

        if err := cb(ctx, qsnap); err != nil {
            return err
        }
    }
}

Expected behavior

The provided watchSnapshots() method should never generate a "stop called" error because the documentation says:

Next never returns iterator.Done unless it is called after Stop.

And the watchSnapshots() method only calls Stop() from defer.

Actual behavior

We see occasional instances where iterator.Done is returned from QuerySnapshotIterator.Next().

My hypothesis is something like:
Under the hood, the firestore package turns any io.EOF into an iterator.Done.

It assumes this io.EOF came from the watchStream.stop() method, but I believe they can come from underlying network issues/disconnects and the library misinterprets it.

For example, perhaps watchStream.recv()s call to watchStream.open() failed with io.EOF (either in it’s call to Listen() or Send()). The latter eventually calls grpc.ClienltStream.SendMsg() which does say that it can return io.EOF. There's some effort to retry on "non-permanent watch errors", but that's only done on errors from the underlying gRPC connection's Recv(), but apparently intentionally not during the initial/re open (which makes sense since if the connection can't even be established that may be a permanent networking issue).

As a stopgap the documentation could be changed to indicate iterator.Dones can be produced due to networking issues, but I think a better overall solution would be to propagate the io.EOF (or similar) error if a connection cannot be (re)opened so the caller can decide how to react to that situation.

Screenshots

Additional context

firestore p2 bug

Source

cpick

All 8 comments

Thanks for your report! @benwhitehead any thoughts on this? We can definitely make the proposed docs fix in the immediate term.

tritone on 15 Jul 2020

@tritone I think getting the docs updated sounds good.

Then we can look at investigating what code change would need to happen to correct the misinformed Done.

BenWhitehead on 15 Jul 2020

Hey @cpick your hypothesis is correct. QuerySnapshotIterator.Next, when an io.EOF is received from the underlying stream, will modify the err to be iterator.Done

I think the following portion of this is reasonable.

As a stopgap the documentation could be changed to indicate iterator.Dones can be produced due to networking issues,

The current docstring for Next() does state this:

// Next is not expected to return iterator.Done unless it is called after Stop.
// Rarely, networking issues may also cause iterator.Done to be returned.

Though it seems query.go has this where document.go doesn't. Perhaps we need to ensure that it is mentioned on all watch types.

Regarding a suggested change to behavior:

I think a better overall solution would be to propagate the io.EOF (or similar) error if a connection cannot be (re)opened so the caller can decide how to react to that situation.

@jba I know this is likely out of your mind, but could you possibly share why io.EOF is transformed to iterator.Done for the user, instead of surfacing the inner io.EOF?

crwilcox on 27 Jan 2021

👍1

The docs are due to the earlier PR I made on this issue. I haven't done any further work to dig into why or at what level this occurs.

tritone on 27 Jan 2021

👍1

Best I can guess, I assumed I only saw io.EOF when the underlying connection was closed legitimately.

From a distance, it seems to me that a reasonable solution is for stop to indicate it was called using some other mechanism, like setting a bool. Then the code that transforms io.EOF into Done could first check that bool.

jba on 29 Jan 2021

FYI: #3643 fixed a small error in how watch.stop was done. I believe this fix is tangential to this bug.

I also have a branch, https://github.com/googleapis/google-cloud-go/compare/master...crwilcox:refine-EOF?expand=1#diff-fbf8020e8989d7c99881c746391dda1da4706a1ce1216d16ca1dfce4f62b9284R453, where I have marked areas we might be returning done incorrectly. Though more debugging is needed.

crwilcox on 2 Feb 2021

I have had a watch running for ~3 weeks trying to find this issue. Unfortunately I haven't observed this error.

crwilcox on 24 Feb 2021

While I am going to keep digging into this, as I can't manage to reproduce there isn't much we can do at this time. Please reopen (or create a new issue) if you come across this again.

crwilcox on 26 Feb 2021

Was this page helpful?

0 / 5 - 0 ratings