v1.1
yes
16Gb, Linux 4.19.67-1-lts
run test
package main
import (
"context"
"encoding/json"
"strconv"
"testing"
"github.com/dgraph-io/dgo/v2"
"github.com/dgraph-io/dgo/v2/protos/api"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"google.golang.org/grpc"
)
func Test_it_writes_concurrently(t *testing.T) {
conn, err := grpc.Dial(dgraphAddr(), grpc.WithInsecure())
if err != nil {
t.Fatalf("conntect: %v", err)
}
dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))
defer dg.Alter(context.Background(), &api.Operation{DropAll: true})
schema := `
type node {
name: string
child: uid
}
name: string .
child: uid .
`
err = dg.Alter(context.Background(), &api.Operation{Schema: schema})
require.NoError(t, err)
type node struct {
ID string `json:"uid"`
Name string `json:"name"`
Child *node `json:"child"`
}
child := node{ID: "_:blank-0", Name: "child"}
js, err := json.Marshal(child)
require.NoError(t, err)
res, err := dg.NewTxn().Mutate(context.Background(), &api.Mutation{SetJson: js, CommitNow: true})
require.NoError(t, err)
in := []node{}
for i := 0; i < 2; i++ {
in = append(in, node{ID: "_:blank-0", Name: strconv.Itoa(i + 1), Child: &node{ID: res.GetUids()["blank-0"]}})
}
errChan := make(chan error)
for i := range in {
go func(n node) {
js, err := json.Marshal(n)
require.NoError(t, err)
_, err = dg.NewTxn().Mutate(context.Background(), &api.Mutation{SetJson: js, CommitNow: true})
errChan <- err
}(in[i])
}
errs := []error{}
for i := 0; i < len(in); i++ {
errs = append(errs, <-errChan)
}
for _, e := range errs {
assert.NoError(t, e)
}
}
Expected behaviour: no error and two nodes are created
Actual behaviour: One mutate request returns Transaction has been aborted. Please retry
I can reproduce this to be working with Dgraph 1.0.16. If the nodes that are written concurrently don't reference the same child node everything works. Hence I guess the issue is with the child node reference.
This is a very interesting issue, thanks for the report.
For now, you can build a workaround to this by simply retrying, so your for loop would look like this:
errChan := make(chan error)
for i := range in {
go func(n node) {
js, err := json.Marshal(n)
require.NoError(t, err)
for {
_, err = dg.NewTxn().Mutate(context.Background(), &api.Mutation{SetJson: js, CommitNow: true})
if err != dgo.ErrAborted {
break
}
}
errChan <- err
}(in[i])
}
This, unfortunately requires the next version of dgo which has not been released yet.
If you want to try that with v2.0.0 you could simply check for the string error.
Instead of err != dgo.ErrAborted you should write err == nil || !strings.Contains(err.Error(), "aborted").
Regardless of the workaround it is worth talking about whether one of the transactions should indeed fail.
I tested the same code using [email protected] and [email protected] and I was able to reproduce the error. So no, it's not a regression.
But still, the question remains.
Should this transaction be aborted since it refers to the same ID or is this something that should be accepted?
While they both refer to the same uid (the one you create first in child), they do not really modify any predicate coming out of it.
But I think rejecting it is the expected behavior, since one of the transaction could be deleting the child while the other one adds one more reference to it.
I'm not 100% sure, so I think @manishrjain is the one that should answer this one.
I think concurrent http requests which end up in a mutation referencing the same node is verly likely, hence I have to disagree on this beeing correct behavior.
Also your workaround isn't really feasible for me, as some of my mutations share a transaction with other requests. Implementing this workaround would require a huge effort on my side, as I build an abstraction layer around Dgraph that build up these transactions.
I鈥檓 trying to understand the cause of txn abort, but I don鈥檛 see the schema here. Can you please paste the schema you鈥檙e using?
But, in general, I think any transactional system can result in aborts. Even if we were to figure out that somehow our txn system is a bit too strict and can be made lenient, aborts won鈥檛 go away. So, a way to retry is unavoidable.
Perhaps, a mechanism could be built by Dgraph doing the retry internally if a user asks for it, if CommitNow is set to true.
I updated the code to also write the schema.
I'm having similar issues,
Perhaps, a mechanism could be built by Dgraph doing the retry internally if a user asks for it, if CommitNow is set to true.
I think that would be a great feature.
So, looking at the conflict generation code in v1.0.16, for how it was treating uids:
https://github.com/dgraph-io/dgraph/blob/0590ee95d6fde75eff5f97370dd087e9a0d44158/posting/list.go#L293-L303
And with my changes: https://github.com/dgraph-io/dgraph/commit/693e7db24 , which consolidates the logic for conflict detection in list types:
https://github.com/dgraph-io/dgraph/blob/7e74f43ce7fc2f079b4575d14afeebd9c95a94b8/posting/list.go#L460-L476
The behavior is the same. If the (sub, pred, obj) where obj = UID, is the same, there would be conflict. That hasn't changed between the two versions. We do this to ensure that if a user has facets on that edge, they don't get overwritten. OR, if one op sets an object, while the other op deletes the object, we only apply one of those and reject the other. Seems like a logical approach.
I have three potential solutions here:
Perhaps, we could add a NO_ABORT flag in a transaction, which would skip the conflict key generation and just write. That would simplify your operations, but at the cost of perhaps some data overwrites. The good thing is, you could control this at the transaction level.
Alternatively, we could do this at the predicate level, where one could specify in the schema if this predicate should not be involved in conflict generation. For examples, predicates which are not crucial to the correctness of the data, like say names, or aliases, etc. Those could be marked in the schema as relaxed and we can just avoid doing conflict detection on those. Might be a better approach than the first one.
Third approach would be to retry the mutation at the server level.
@manishrjain, I can't speak for @jostillmanns but all three of these approaches would work in my use case. Options 1 & 2 would be my preference, but I think option 3 would also solve my problem.
@manishrjain, thanks for sharing the deeper insights here.
I would agree and say that all 3 of those options sound extremely appealing to me 馃槃.
What is the status of this issue? If this is low priority for you guys I'd appreciate if you can point me in the direction to fix it myself. This particular issue is a huge priority for my team.
Hi @manishrjain. Sorry it took me so long to write back on this issue. It took us another incident regarding this issue to look back at this :wink: . We discussed this again and would like to advocate for solution 1, as we have knowledge about the cases when overwrites would be okay.
Example: we have scheduled processes, which write data in parallel and we can make sure that there are no conflicts on the client level. On the other hand we accept HTTP requests, where the current behavior (NO_ABORT = false) is desirable.
Most helpful comment
I think concurrent http requests which end up in a mutation referencing the same node is verly likely, hence I have to disagree on this beeing correct behavior.