Google-cloud-go: firestore: Deadlock when using emulator and parallel read in the transaction

Created on 14 Jul 2020  路  12Comments  路  Source: googleapis/google-cloud-go

Client

Firestore

Environment

Emulator running in the way described here: https://firebase.google.com/docs/rules/emulator-setup

firebase emulators:start --only firestore

Go Environment

$ go version
go version go1.14.4 linux/amd64

$ go env
O111MODULE=""
GOARCH="amd64"
GOBIN="/home/robert/go/bin"
GOCACHE="/home/robert/.cache/go-build"
GOENV="/home/robert/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/robert/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build613925690=/tmp/go-build -gno-record-gcc-switches"
FIRESTORE_EMULATOR_HOST=localhost:8080

Code

e.g.

package main

import (
    "context"
    "fmt"
    "sync"
    "time"

    "cloud.google.com/go/firestore"
)

func main() {
    // it can be even 1 minute, it's still working in the same way
    timeout := time.Second * 1

    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    workersCount := 2 // two workers are enough to reproduce
    workersDone := sync.WaitGroup{}
    workersDone.Add(workersCount)

    docID := fmt.Sprintf("%d", time.Now().UnixNano())

    startWorkers := make(chan struct{})

    // it doesn't matter if I'm re-using the client and using separate one
    client := newFirestoreClient()

    for worker := 0; worker < workersCount; worker++ {
        workerNum := worker

        go func() {
            defer workersDone.Done()
            <-startWorkers

            err := client.RunTransaction(ctx, func(ctx context.Context, transaction *firestore.Transaction) error {
                docRef := client.Collection("test-coll").Doc(docID)

                // without that function, it's working fine
                _, _ = transaction.Get(docRef)

                if err := transaction.Set(docRef, map[string]string{"foo": "bar"}); err != nil {
                    return err
                }

                return nil
            })
            fmt.Printf("err from worker %d: %+v\n", workerNum, err)
        }()
    }

    close(startWorkers)
    workersDone.Wait()
}


func newFirestoreClient() *firestore.Client {
    firebaseClient, err := firestore.NewClient(context.Background(), "project-id")
    if err != nil {
        panic(err)
    }
    return firebaseClient
}

Expected behavior

No deadlock in RunTransaction.

Actual behavior

After running the code, RunTransaction is blocked until the context is canceled because of timeout.

err from worker 0: rpc error: code = DeadlineExceeded desc = context deadline exceeded
err from worker 1: rpc error: code = DeadlineExceeded desc = context deadline exceeded

When we will get rid of _, _ = transaction.Get(docRef) everything is working fine.

Additional context

I'm not sure that it's not just a problem with the newest version of the emulator or client - we have this issue for a couple of months.
It's also hard to say if it's the emulator problem or Go client. Anyway, I tried to find any similar issues in other clients and I didn't found anything useful.

If you think that it may be useful, I can also create issue for the emulator and link it here :)

firestore external p2 bug

Most helpful comment

Great; thanks very much for the quick response. I'll make sure this use case is supported and we can look into filing a bug in the emulator if the problem is there.

All 12 comments

Thanks for filing the issue. I don't see where in your code or environment you are enabling the emulator-- could you clarify on this?

@benwhitehead do we support this functionality through the emulator?

Hello @tritone, I added the info :)

Sounds good-- just wanted to confirm that you are also using FIRESTORE_EMULATOR_HOST to indicate that the client library should call the emulator?

Yes, I also added that for clarification.

As I said, commenting _, _ = transaction.Get(docRef) out is fixing the issue, so I'm sure that the problem is not with connection timeout, but with transaction.

Great; thanks very much for the quick response. I'll make sure this use case is supported and we can look into filing a bug in the emulator if the problem is there.

Do you have any news about that @tritone?

I'm asking, because I'm writing an article about testing repositories which includes Firestore, MySQL, and in-memory implementation.

Unfortunately, in Firestore implementation I need to explicitly skip some tests because they are failing because of that bug :(

I didn't see any obvious reason that this shouldn't be supported in the emulator. @benwhitehead is taking a look-- he's going to see if he can repro using the Java client as well. We will keep you posted.

I've spent some time attempting to create a repro in Java. Some mixed news I'm afraid. I was able to create what appeared to be a hang when attempting to have multiple transactions modify the same document at the same time, however, the hang resulted in a transaction timeout and each of the failed transactions were retried and then succeed. The noticeable hang I was able to observe in my testing was ~30 seconds while the transaction was timing out.

A few more questions to try and help narrow things down:

  1. When you are observing the deadlock, are you able to determine if it is longer than 30 seconds?
  2. In the stdout of your emulator do you ever see the message INFO: operation failed: transaction timeout?
  3. Are you using a new emulator process each time? (The reason I ask this, is I observed that transaction timeouts became less likely the longer I had the emulator running and was running my tests. My suspicion here is that the emulators JVM is able to JIT some optomizations to reduce the window of possible transaction contention.)

hello @BenWhitehead :)

ad 1

I increased timeout to 60s on the client level, and it's correct - it's timeouts after 30s

2020/07/30 08:45:12 create firestore
2020/07/30 08:45:12 starting workers
2020/07/30 08:45:12 running tx
2020/07/30 08:45:12 running tx
2020/07/30 08:45:42 err from worker 0: rpc error: caode = Unknown desc =
2020/07/30 08:45:42 err from worker 1: rpc error: code = Unknown desc =

ad 2

When I'm setting 60s timeout on the client level

Jul 30, 2020 8:45:12 AM io.gapi.emulators.netty.HttpVersionRoutingHandler channelRead
INFO: Detected HTTP/2 connection.
Jul 30, 2020 8:45:31 AM com.google.cloud.datastore.emulator.impl.util.WrappedStreamObserver onError
INFO: operation failed: transaction timeout
Jul 30, 2020 8:45:31 AM com.google.cloud.datastore.emulator.impl.util.WrappedStreamObserver onError
INFO: operation failed: transaction timeout
Jul 30, 2020 8:45:42 AM com.google.cloud.datastore.emulator.impl.util.WrappedStreamObserver onError
INFO: operation failed: transaction timeout
Jul 30, 2020 8:45:42 AM com.google.cloud.datastore.emulator.impl.util.WrappedStreamObserver onError
INFO: operation failed: transaction timeout

With smaller 1s timeout on the consumer level, it actually also return an error. But what is interesting it's after the client is already dead (I assume that it's not handling the situation when the client connection is closed?)

Jul 30, 2020 8:48:22 AM io.gapi.emulators.netty.HttpVersionRoutingHandler channelRead
INFO: Detected HTTP/2 connection.
Jul 30, 2020 8:48:52 AM com.google.cloud.datastore.emulator.impl.util.WrappedStreamObserver onError
INFO: operation failed: transaction timeout
Jul 30, 2020 8:48:52 AM com.google.cloud.datastore.emulator.impl.util.WrappedStreamObserver onError
INFO: operation failed: transaction timeout

ad 3

It's interesting, but on the first run the error is different. But anyway, the problem is the same (deadlock and error after client or server timeout, in that case it's on the client side):

# first run
2020/07/30 08:51:21 err from worker 0: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL
2020/07/30 08:51:21 err from worker 1: rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL

# second run
2020/07/30 08:51:37 err from worker 0: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2020/07/30 08:51:37 err from worker 1: rpc error: code = DeadlineExceeded desc = context deadline exceeded

# third run
2020/07/30 08:51:46 err from worker 1: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2020/07/30 08:51:46 err from worker 0: rpc error: code = DeadlineExceeded desc = context deadline exceeded

UPDATE:
I also tried to run fresh emulator and 60s timeout, in that case server didn't timeout after 30s as usual, but client timeout worked after 60s

2020/07/30 08:53:23 create firestore
2020/07/30 08:53:23 starting workers
2020/07/30 08:53:23 running tx
2020/07/30 08:53:23 running tx
2020/07/30 08:54:23 err from worker 1: context deadline exceeded
2020/07/30 08:54:23 err from worker 0: context deadline exceeded

Thanks for the additional info @roblaszczak. This sounds like it is something that will need some deeper investigation into the internals for the Go Firestore client to determine the best way to address the issue.

There is an integration tests in the Java Firestore client that specifically tests this type of scenario and ensure that after the contention, there is a re-attempt. We probably need to ensure that something similar is added to the test suite here.

Any progress on this issue?

Was this page helpful?
0 / 5 - 0 ratings