Dgraph: @id directive not ensuring uniqueness when adding arrays

Created on 4 Mar 2020  路  10Comments  路  Source: dgraph-io/dgraph

What version of Dgraph are you using?

Dgraph v2.0.0-rc1 (dgraph/standalone:v2.0.0-rc1 Docker image)

Have you tried reproducing the issue with the latest release?

This is the latest release.

What is the hardware spec (RAM, OS)?

MacBook Pro (15-inch, 2019), Mojave, 10.14.6

Steps to reproduce the issue (command/config used to run Dgraph).

Following the graphql example, and running the following query:

mutation {
  addAuthor(input: [
    { username: "John" },
    { username: "John" }
  ]) {
    author {
      username
    }
  }
}

I get a response like the following:

{
  "data": {
    "addAuthor": {
      "author": [
        {
          "username": "John"
        },
        {
          "username": "John"
        }
      ]
    }
  }
}

When I query authors again, I can confirm that the multiple authors with the same username are indeed inserted into the database.

Expected behaviour and actual result.

Looking at the schema definition:

type Author {
  username: String! @id
  # ...
}

The username field uses the @id directive, which in the example it says :

The schema specified those usernames as @id, so Dgraph makes sure that they are unique and you鈥檒l get an error if you try to add an author with a username that already exists (you can update an existing author with the updateAuthor mutation).

So I expect an error if I try to insert multiple authors with the same username.

aregraphql kinbug statuaccepted

Most helpful comment

Hi @n8jadams, thanks for reporting the bug and your thoughtful comments. I think some of the discussion above got a bit sidetracked with lots of possible options, so I just wanted to touch base on the approach and why.

Dgraph has always had the focus that there's nothing special you have to do to add data - there's no special graph format, no structure you have to build - just serialise to json.

We've kept the same approach in our GraphQL offering. Compare the ease of use of our mutations (where you can just serialise your client side data structures), with some other GraphQL offerings (where you have to write code to build special mutation structures where links to existing data and inserting new data require you to client side process your data structures into the structure that the API needs).

We are taking a 'strict' approach, but also considering serliasation of graph structures and ease of use. As an example of the way we'll handle this issue, let's assume the author and posts example in the comments above.

In my client-side program, I've got data structures for posts and authors and a post has a property with a 'pointer' to an author. So, client-side, I might build two new posts (P1 and P2) and a new author (A1) who authored both (so A1 is the target of both P1.author and P2.author). That's just three things, a simple graph, but if I serialise an array of [P1, P2], I'll generally get:

[ 
  { ...P1..., "author": { "username": "mjc", ...A1... } },
  { ...P2..., "author": { "username": "mjc", ...A1... } },
]

And that perfectly reasonable set of data structures and perfectly reasonable serialisation, just became an invalid mutation 馃槥... and then you'd have to write code client-side to pick through and unwind your data structures and make multiple mutations or make sure you mutate particular objects in particular orders. And that makes using the API so much harder.

The answer is simply that the schema said username is the unique identifier for the User type and the mutation gave us two JSON objects that are the serialisation of exactly the same thing (same id, same everything else), so we'll treat that as a reference to the one user node in the graph. (Dgraph already does the same thing with it's JSON mutation format and blank nodes.)

I think it's exactly the pit of success that you are talking about - makes it easy to serialise data in code and use the API.

All 10 comments

@abhimanyusinghgaur this is similar to the issue you discovered for adding rules within a group. We were also discussing this with @MichaelJCompton. One solution here is that the upsert variables that are generated by the query should also have the value for @id field in their name.

Just FYI, I have seen this bug as well and would appreciate if either @id gets enforced,
or a @unique annotation would be added that enforces the unique criterion.

Currently, I circumvent the problem but I don't exactly feel confident doing so b/c I already have at least one use case on my plate that would require a properly enforced @unique annotation of fields.

Hi, thanks for the report.

For clarification: this bug only occurs when the same id value is occurs within a single mutation. Two mutations within a block or separate mutations work as expected. So a viable workaround for now is to break up your mutation.

Also, note that any mutation that looks like it's asking for the same unique id to inserted twice is going to produce an error, so the result of mutations like the example is a response with an error and no new stored values.

We are working on a fix.

@MichaelJCompton You're the champ, yes, splitting mutations does the trick for now.

There are a few possible approaches with this:

  1. Error out the whole mutation without doing any change in db.
  2. Try to perform the mutation, author with the first unique username gets added in the db, and when an author with repeated username is encountered, mutation returns an error at that point without proceeding further.
  3. Perform the mutation completely, anytime an author object is encountered with a repeated username, merge the previous occurrence with the new one by overriding existing properties with the properties specified in the newly encountered occurrence. This way no error is generated and the db contains only one author for the username being repeated. For example, lets assume Author has following schema:
    graphql type Author { username: String! @id phone: String address: String # ... }
    So, if following mutation is performed:
    graphql mutation { addAuthor(input: [ { username: "John", phone: "1234" }, { username: "John", address: "SF" }, { username: "John", phone: "5678" } ]) { author { username phone address } } }
    Then we get below result:
    json { "data": { "addAuthor": { "author": [ { "username": "John", "phone": "5678", "address": "SF" } ] } } }

1st Approach:
The 1st approach is easier when there are mutations like addAuthor where the @id directive is a property of top-level object in input. But, in case of mutations like addPost where the @id directive is applicable on a field deep inside input :

mutation {
  addPost(input: [
    { ..., author: { username: "John" }},
    { ..., author: { username: "John" }}
  ])
}

There are 2 cases:
a. The author may be already existing in db, in which case we just need to link the post with the existing author.
b. The author may not exist in db, in which case we need to give error and not do any change in db.

Such deep mutations get a bit complex to handle although perfectly doable.

2nd Approach:
The 2nd approach returns error the moment it encounters a repeated @id value without proceeding further. This kind of leaves the system in an ambiguous state where some part of the mutation gets applied while some part may not execute. So, we are not going with this approach.

3rd Approach:
The 3rd approach is also as complex as the 1st approach, but just a bit easier.

So, between the 1st and the 3rd approach, we are going with the 3rd one, as rather than giving an error it is better to give a correct response if we can.

@abhimanyusinghgaur This seems kind of side-effecty. I as a consumer of the API would prefer a more strict and predicable behavior, like the 1st approach mentioned.

Hi @n8jadams, @marvin-hansen,

Just wanted your thoughts on a hybrid approach that we are taking:

Given the following schema:

type Author {
  username: String! @id
  phone: String
  address: String
  questions: [Question] @hasInverse(field: author)
  answers: [Answer] @hasInverse(field: author)
}

interface Post {
  id: ID!
  text: String! @search(by: [fulltext])
  datePublished: DateTime @search
  author: Author!
  comments: [Comment] @hasInverse(field: commentsOn)
}

type Question implements Post {
  answers: [Answer] @hasInverse(field: inAnswerTo)
}

type Answer implements Post {
  inAnswerTo: Question!
}

type Comment implements Post {
  commentsOn: Post!
}

Mutation Types
There can only be two kinds of cases where the @id directive can be involved:

  1. Top level mutation, like the case of addAuthor:
    graphql mutation { addAuthor(input: [ { username: "John", ...}, ... ]) { author { username } } }
  2. Deep Mutation, like the case of addPost (it will be something more concrete like addQuestion or addComment, but using addPost just for demo purposes):
    graphql mutation { addPost(input: [ { ..., author: { username: "John", ... }}, ... ]) }

Client Inputs
And, irrespective of where @id is involved, there can be following kinds of cases that clients can send as input for the Author type:

  1. All the Author objects have a unique username, e.g.,:
    graphql { username: "John", ...}, { username: "Mark", ...}, { username: "Chris", ...}
  2. There are repeating usernames:
    graphql { username: "John", ...}, { username: "John", ...}
    Now, for the case of repeating usernames, there can again be following kind of cases:

    1. Only first object has non @id properties, rest of the objects just refer the @id:

      graphql { username: "John", phone: "1234", address: "SF"}, { username: "John"}, { username: "John"}, ... { username: "John"}

    2. All the author objects are the same:

      graphql { username: "John", phone: "1234", address: "SF"}, { username: "John", phone: "1234", address: "SF"}, { username: "John", phone: "1234", address: "SF"}, ... { username: "John", phone: "1234", address: "SF"}

    3. Any other case that doesn't satisfy above two, like:

      graphql { username: "John", phone: "1234", address: "SF"}, { username: "John", phone: "5678", address: "SF"}, ...

      OR

      graphql { username: "John", phone: "1234"}, { username: "John", address: "SF"}, ...

DB State
Now again, irrespective of the input being sent by the client, there can be two cases:

  1. The username doesn't already exist in db
  2. The username already exists in db

Expected Behaviour
Now, let me list which approach is allowed in which case:

  • For top-level mutation:

    • Client inputs of type 1 will create authors only for those objects whose username doesn't already exist in db. If a username already exists in the db, then only for that object an error is returned.

    • Client inputs of type 2 always yield an error.

  • For deep mutations:

    • Client inputs of type 1 will create a new author if the username doesn't already exist in db. If the username exists in db, then that existing author will be linked to the post (If properties other than username are also provided in this case, then they will be discarded).

    • Client inputs of type 2.i and 2.ii will follow the same behaviour as in above point, but only one author will be created for repeating authors.

    • Client inputs of type 2.iii will always return error.

@abhimanyusinghgaur I like the approach you came up with, with the minor exception that I would rather have an error message when the Client Input is of types 2.i and 2.ii.

I guess ultimately I just prefer strictness, so you create a "pit of success" for consumers of the API. More specifically I think it's better to place the burden of "merging" Authors on the user, and have DGraph provide a useful error message if their query or variables are malformed. (And consider duplicate, unmerged Author objects as malformed.)

But that's just my opinion. I'm just the guy that reported the bug. 馃槈

Hi @n8jadams, thanks for reporting the bug and your thoughtful comments. I think some of the discussion above got a bit sidetracked with lots of possible options, so I just wanted to touch base on the approach and why.

Dgraph has always had the focus that there's nothing special you have to do to add data - there's no special graph format, no structure you have to build - just serialise to json.

We've kept the same approach in our GraphQL offering. Compare the ease of use of our mutations (where you can just serialise your client side data structures), with some other GraphQL offerings (where you have to write code to build special mutation structures where links to existing data and inserting new data require you to client side process your data structures into the structure that the API needs).

We are taking a 'strict' approach, but also considering serliasation of graph structures and ease of use. As an example of the way we'll handle this issue, let's assume the author and posts example in the comments above.

In my client-side program, I've got data structures for posts and authors and a post has a property with a 'pointer' to an author. So, client-side, I might build two new posts (P1 and P2) and a new author (A1) who authored both (so A1 is the target of both P1.author and P2.author). That's just three things, a simple graph, but if I serialise an array of [P1, P2], I'll generally get:

[ 
  { ...P1..., "author": { "username": "mjc", ...A1... } },
  { ...P2..., "author": { "username": "mjc", ...A1... } },
]

And that perfectly reasonable set of data structures and perfectly reasonable serialisation, just became an invalid mutation 馃槥... and then you'd have to write code client-side to pick through and unwind your data structures and make multiple mutations or make sure you mutate particular objects in particular orders. And that makes using the API so much harder.

The answer is simply that the schema said username is the unique identifier for the User type and the mutation gave us two JSON objects that are the serialisation of exactly the same thing (same id, same everything else), so we'll treat that as a reference to the one user node in the graph. (Dgraph already does the same thing with it's JSON mutation format and blank nodes.)

I think it's exactly the pit of success that you are talking about - makes it easy to serialise data in code and use the API.

@MichaelJCompton thanks for the explanation of the why behind the approach! It all sounds good to me.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

janardhan1993 picture janardhan1993  路  3Comments

pepoospina picture pepoospina  路  3Comments

protheusfr picture protheusfr  路  4Comments

yupengfei picture yupengfei  路  4Comments

pjebs picture pjebs  路  4Comments