Dgraph: Concurrent write of nodes referencing the same node fails with Dgraph v1.1

Created on 27 Sep 2019 · 11Comments · Source: dgraph-io/dgraph

What version of Dgraph are you using?

v1.1

Have you tried reproducing the issue with the latest release?

yes

What is the hardware spec (RAM, OS)?

16Gb, Linux 4.19.67-1-lts

Steps to reproduce the issue (command/config used to run Dgraph).

run test

package main

import (
    "context"
    "encoding/json"
    "strconv"
    "testing"

    "github.com/dgraph-io/dgo/v2"
    "github.com/dgraph-io/dgo/v2/protos/api"
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
    "google.golang.org/grpc"
)

func Test_it_writes_concurrently(t *testing.T) {
    conn, err := grpc.Dial(dgraphAddr(), grpc.WithInsecure())
    if err != nil {
        t.Fatalf("conntect: %v", err)
    }

    dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))
    defer dg.Alter(context.Background(), &api.Operation{DropAll: true})

    schema := `
type node {
  name: string
  child: uid
}

name: string .
child: uid .
`
    err = dg.Alter(context.Background(), &api.Operation{Schema: schema})
    require.NoError(t, err)

    type node struct {
        ID   string `json:"uid"`
        Name string `json:"name"`

        Child *node `json:"child"`
    }

    child := node{ID: "_:blank-0", Name: "child"}
    js, err := json.Marshal(child)
    require.NoError(t, err)

    res, err := dg.NewTxn().Mutate(context.Background(), &api.Mutation{SetJson: js, CommitNow: true})
    require.NoError(t, err)

    in := []node{}
    for i := 0; i < 2; i++ {
        in = append(in, node{ID: "_:blank-0", Name: strconv.Itoa(i + 1), Child: &node{ID: res.GetUids()["blank-0"]}})
    }

    errChan := make(chan error)
    for i := range in {
        go func(n node) {
            js, err := json.Marshal(n)
            require.NoError(t, err)

            _, err = dg.NewTxn().Mutate(context.Background(), &api.Mutation{SetJson: js, CommitNow: true})
            errChan <- err
        }(in[i])
    }

    errs := []error{}
    for i := 0; i < len(in); i++ {
        errs = append(errs, <-errChan)
    }

    for _, e := range errs {
        assert.NoError(t, e)
    }
}

Expected behaviour and actual result.

Expected behaviour: no error and two nodes are created
Actual behaviour: One mutate request returns Transaction has been aborted. Please retry

Problem description

I can reproduce this to be working with Dgraph 1.0.16. If the nodes that are written concurrently don't reference the same child node everything works. Hence I guess the issue is with the child node reference.

aremutations kinquestion statuaccepted

Source

jostillmanns

👍1

Most helpful comment

I think concurrent http requests which end up in a mutation referencing the same node is verly likely, hence I have to disagree on this beeing correct behavior.

jostillmanns on 28 Sep 2019

👍3

All 11 comments

This is a very interesting issue, thanks for the report.

temporary workaround

For now, you can build a workaround to this by simply retrying, so your for loop would look like this:

errChan := make(chan error)
for i := range in {
    go func(n node) {
        js, err := json.Marshal(n)
        require.NoError(t, err)

        for  {
            _, err = dg.NewTxn().Mutate(context.Background(), &api.Mutation{SetJson: js, CommitNow: true})
            if err != dgo.ErrAborted {
                break
            }
        }
        errChan <- err
    }(in[i])
}

This, unfortunately requires the next version of dgo which has not been released yet.
If you want to try that with v2.0.0 you could simply check for the string error.

Instead of err != dgo.ErrAborted you should write err == nil || !strings.Contains(err.Error(), "aborted").

Is this a bug?

Regardless of the workaround it is worth talking about whether one of the transactions should indeed fail.

I tested the same code using [email protected] and [email protected] and I was able to reproduce the error. So no, it's not a regression.

But still, the question remains.

Should this transaction be aborted since it refers to the same ID or is this something that should be accepted?

While they both refer to the same uid (the one you create first in child), they do not really modify any predicate coming out of it.
But I think rejecting it is the expected behavior, since one of the transaction could be deleting the child while the other one adds one more reference to it.

I'm not 100% sure, so I think @manishrjain is the one that should answer this one.

campoy on 28 Sep 2019

I think concurrent http requests which end up in a mutation referencing the same node is verly likely, hence I have to disagree on this beeing correct behavior.

jostillmanns on 28 Sep 2019

👍3

Also your workaround isn't really feasible for me, as some of my mutations share a transaction with other requests. Implementing this workaround would require a huge effort on my side, as I build an abstraction layer around Dgraph that build up these transactions.

jostillmanns on 1 Oct 2019

I’m trying to understand the cause of txn abort, but I don’t see the schema here. Can you please paste the schema you’re using?

But, in general, I think any transactional system can result in aborts. Even if we were to figure out that somehow our txn system is a bit too strict and can be made lenient, aborts won’t go away. So, a way to retry is unavoidable.

Perhaps, a mechanism could be built by Dgraph doing the retry internally if a user asks for it, if CommitNow is set to true.

manishrjain on 1 Oct 2019

I updated the code to also write the schema.

jostillmanns on 1 Oct 2019

I'm having similar issues,

Perhaps, a mechanism could be built by Dgraph doing the retry internally if a user asks for it, if CommitNow is set to true.

I think that would be a great feature.

Kingloko on 1 Oct 2019

So, looking at the conflict generation code in v1.0.16, for how it was treating uids:
https://github.com/dgraph-io/dgraph/blob/0590ee95d6fde75eff5f97370dd087e9a0d44158/posting/list.go#L293-L303

And with my changes: https://github.com/dgraph-io/dgraph/commit/693e7db24 , which consolidates the logic for conflict detection in list types:
https://github.com/dgraph-io/dgraph/blob/7e74f43ce7fc2f079b4575d14afeebd9c95a94b8/posting/list.go#L460-L476

The behavior is the same. If the (sub, pred, obj) where obj = UID, is the same, there would be conflict. That hasn't changed between the two versions. We do this to ensure that if a user has facets on that edge, they don't get overwritten. OR, if one op sets an object, while the other op deletes the object, we only apply one of those and reject the other. Seems like a logical approach.

I have three potential solutions here:

Perhaps, we could add a NO_ABORT flag in a transaction, which would skip the conflict key generation and just write. That would simplify your operations, but at the cost of perhaps some data overwrites. The good thing is, you could control this at the transaction level.
Alternatively, we could do this at the predicate level, where one could specify in the schema if this predicate should not be involved in conflict generation. For examples, predicates which are not crucial to the correctness of the data, like say names, or aliases, etc. Those could be marked in the schema as relaxed and we can just avoid doing conflict detection on those. Might be a better approach than the first one.
Third approach would be to retry the mutation at the server level.

manishrjain on 9 Oct 2019

🎉1

@manishrjain, I can't speak for @jostillmanns but all three of these approaches would work in my use case. Options 1 & 2 would be my preference, but I think option 3 would also solve my problem.

Kingloko on 10 Oct 2019

@manishrjain, thanks for sharing the deeper insights here.

I would agree and say that all 3 of those options sound extremely appealing to me 😄.

erhlee-bird on 10 Oct 2019

What is the status of this issue? If this is low priority for you guys I'd appreciate if you can point me in the direction to fix it myself. This particular issue is a huge priority for my team.

Kingloko on 17 Oct 2019

Hi @manishrjain. Sorry it took me so long to write back on this issue. It took us another incident regarding this issue to look back at this :wink: . We discussed this again and would like to advocate for solution 1, as we have knowledge about the cases when overwrites would be okay.

Example: we have scheduled processes, which write data in parallel and we can make sure that there are no conflicts on the client level. On the other hand we accept HTTP requests, where the current behavior (NO_ABORT = false) is desirable.

jostillmanns on 18 Nov 2019

Was this page helpful?

0 / 5 - 0 ratings