Dgraph: Support External Ids in Dgraph

Created on 16 Feb 2018 · 24 Comments · Source: dgraph-io/dgraph

Earlier, Dgraph supported _xid_ or custom uids (I used them up to 0.8.3).
Both now seem to have been removed, which may cause a lot of problems.

As far as I remember, Dgraph does not deduplicate nodes.
Earlier I used an assigned uid (computed as a func() of the _xid_) and kept mutating nodes without worrying about anything.
Now it seems I have to first query the node, get its uid, and only then operate on it.

Most devs would want this as a core feature of Dgraph.

kind/feature

akshaydeo

👍16

All 24 comments

What I have done to solve this problem is create a helper function getExtIdToUidMap that takes an array of ext ids and returns me the array of uids so that I can operate on them. Agreed that one database call is better than two, but at least a helper can encapsulate the query -> mutate process.

I think stopping support for XID and treating it like other edges allowed the Dgraph team to simplify the code base for v1.

Another interesting problem affecting external IDs is related to the Dgraph team's suggestion to have many small predicates rather than a single large predicate (like "external id"). So the best practice solution may be to have different XID predicates for different data types... like person.xid, place.xid, company.xid.

jeffkhull on 19 Feb 2018
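
A helper like the getExtIdToUidMap described above could look roughly like the sketch below. Everything except the helper's name is an assumption made for illustration: `lookupFunc` stands in for whatever Dgraph round trip resolves the ids, and `fakeLookup` replaces the database with an in-memory map so the example runs on its own.

```go
package main

import "fmt"

// lookupFunc is a hypothetical stand-in for the Dgraph round trip that
// resolves external ids to uids (for example an eq query on an xid predicate).
type lookupFunc func(xids []string) (map[string]string, error)

// getExtIdToUidMap resolves all xids in one lookup call and also reports
// the xids that have no node yet, so the caller can decide to create them.
func getExtIdToUidMap(xids []string, lookup lookupFunc) (found map[string]string, missing []string, err error) {
	found, err = lookup(xids)
	if err != nil {
		return nil, nil, err
	}
	for _, x := range xids {
		if _, ok := found[x]; !ok {
			missing = append(missing, x)
		}
	}
	return found, missing, nil
}

// fakeLookup simulates the database with an in-memory map (illustration only).
func fakeLookup(xids []string) (map[string]string, error) {
	db := map[string]string{"person-1": "0x1", "person-2": "0x2"}
	out := make(map[string]string)
	for _, x := range xids {
		if uid, ok := db[x]; ok {
			out[x] = uid
		}
	}
	return out, nil
}

func main() {
	found, missing, err := getExtIdToUidMap([]string{"person-1", "person-9"}, fakeLookup)
	if err != nil {
		panic(err)
	}
	fmt.Println("found:", found, "missing:", missing)
}
```

Returning the missing xids separately keeps the query -> mutate sequence in one place: the caller mutates known uids directly and creates nodes only for the ids that were not found.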

Indeed, this was giving us headaches for a bit as well. What we ended up doing was to create a central "gateway" microservice that handles database writes, with an internal cache of uid mappings, as well as a Redis layer behind that.

malzzz on 19 Feb 2018

😕4
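
For readers wondering what the in-process part of such a gateway might look like, here is a minimal sketch of a uid-mapping cache. It is not the service described above: the `uidCache` type and the `resolve` callback (which would ask Redis or Dgraph on a miss) are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// uidCache memoizes xid -> uid lookups. resolve is a hypothetical callback
// that asks the slower backing store (Redis or Dgraph) on a cache miss.
type uidCache struct {
	mu      sync.Mutex
	entries map[string]string
	resolve func(xid string) (string, error)
}

func newUIDCache(resolve func(string) (string, error)) *uidCache {
	return &uidCache{entries: make(map[string]string), resolve: resolve}
}

// get returns the cached uid and calls resolve only on a miss.
// The lock is held across resolve: that keeps the sketch simple and means
// each xid is resolved once, at the cost of serializing misses.
func (c *uidCache) get(xid string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if uid, ok := c.entries[xid]; ok {
		return uid, nil
	}
	uid, err := c.resolve(xid)
	if err != nil {
		return "", err
	}
	c.entries[xid] = uid
	return uid, nil
}

func main() {
	calls := 0
	cache := newUIDCache(func(xid string) (string, error) {
		calls++ // count how often the backing store is asked
		return "0x2a", nil
	})
	for i := 0; i < 2; i++ {
		uid, err := cache.get("account_193748")
		if err != nil {
			panic(err)
		}
		fmt.Println(uid, "backing-store calls so far:", calls)
	}
}
```

A production gateway would also need eviction and a way to handle nodes deleted in Dgraph, which this sketch leaves out.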

Yeah @antikantian, this was the first thing I had thought of: using the inbuilt Badger to maintain a map of xid to uid. But this feature seems like a candidate for the product itself. It was already there and was then removed.

akshaydeo on 19 Feb 2018

@akshaydeo, I totally agree. External id support (or, I guess re-support in this case) would make things a lot easier.

malzzz on 19 Feb 2018

👍3

+1 for this. I'm actually using Dgraph with Redis for UID mapping. An XID-to-UID map is a fundamental function for normal applications. I think one of Dgraph's best features is its simplicity. But when it comes to the UID map, we need another big store like Redis, which might make people look for alternatives.

hoonmin on 26 Feb 2018

@pawanrawal @manishrjain can anyone from dgraph help us on this?

akshaydeo on 26 Feb 2018

👍2 🎉1

I have an update. Seeing little enthusiasm from the maintainers of the project 😞, I started digging into the source code.

One of the main reasons for removing this support seems to be avoiding UID collisions.
In the current setup, the current leader does the UID allocation.
The lease is shared with all the nodes for faster verification etc.

Hacky approach

There is a function called VerifyUid that verifies the UID passed with the quad against the available lease.
Refer: https://sourcegraph.com/github.com/dgraph-io/dgraph/-/blob/query/mutation.go#L96:6
I just made the function return nil.

Pros

All the test cases of the project pass 🍾
All the test cases I have in my project pass 👍
If you don't pass your own UID, the entire flow works the same as before.

Cons 🚑

This approach won't handle UID collisions. As far as my tests go, a write to an existing UID just overwrites, which was the desired behavior for my app.
I have tried my best to check all the flows that might break and could not find any, but it's still a hacky approach.

@pawanrawal @manishrjain Could you let us know of any issues you can think of with this approach, until this is officially supported by Dgraph?

akshaydeo on 23 Mar 2018

👍1

Glad someone is prepared to get their hands dirty :)

Could you please expand a bit more on “This approach won't handle collisions of the UID. As far as my tests go, it just overwrites which was desired behavior for my app”? Are you saying that this is not functionality to deal with the ‘if node exists’ vs. ‘if node doesn’t exist’ case, but that it simply ensures there can’t ever be two nodes with the same xid?

Thanks,
Vlad

jimanvlad on 23 Mar 2018

Yes. So here is the example:

I perform a mutation with the following NQuads:

{Subject: "0x1", Predicate: "Name", Object: "Jarvis"}
{Subject: "0x1", Predicate: "Color", Object: "White"}

The premise is that no node with UID 0x1 exists in the system yet.

Fetching the record by UID 0x1 then returns this node with predicates Name and Color and values Jarvis and White.

If I perform one more mutation after this:

{Subject: "0x1", Predicate: "Name", Object: "Jarvis2"}

Fetching the record by UID 0x1 now returns this node with predicates Name and Color and values Jarvis2 and White.

akshaydeo on 23 Mar 2018

But you can already perform mutations that write to a UID and ensure that there’s no duplication. Can the UID in your example be replaced with an XID? For example, can it be something like “account_193748”?

Thanks,
Vlad

jimanvlad on 23 Mar 2018

Yes yes, but with a twist.
In Dgraph, the uid is the ultimate identifier, and it has to be a hex integer.

But there is a workaround that I have been using since I started using Dgraph (0.7.x).

Derive the uid from the xid with a function. I do something like

import "hash/fnv"

// uidFromxid derives a uid from an xid with the 64-bit FNV-1a hash.
func uidFromxid(xid string) uint64 {
    f := fnv.New64a()
    f.Write([]byte(xid)) // Write on a hash.Hash never returns an error
    return f.Sum64()
}

And pass the result as the uid in the above example.

This way the mapping becomes programmatic.

akshaydeo on 23 Mar 2018
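
To make the last step concrete, here is a small runnable sketch of how the hashed value might be turned into the hex form used for the Subject in the NQuads earlier in the thread. `subjectFor` is a hypothetical helper added for illustration, and the approach only works with the VerifyUid check bypassed as described above.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// uidFromxid derives a uid from an xid with the 64-bit FNV-1a hash,
// as in the comment above.
func uidFromxid(xid string) uint64 {
	f := fnv.New64a()
	f.Write([]byte(xid)) // Write on a hash.Hash never returns an error
	return f.Sum64()
}

// subjectFor (hypothetical helper) renders the hashed uid as a hex string
// with a 0x prefix, the same shape as the "0x1" subject used earlier.
func subjectFor(xid string) string {
	return fmt.Sprintf("%#x", uidFromxid(xid))
}

func main() {
	// The same xid always maps to the same subject, so no lookup is needed.
	fmt.Println(subjectFor("account_193748"))
	fmt.Println(subjectFor("account_193748") == subjectFor("account_193748"))
}
```

Because the mapping is a pure function, any writer can compute the subject locally; the trade-off is that two different xids can hash to the same uid.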

Ah, so all you're doing is disabling the verification of the UID. Your function to translate an XID into a UID will lead to the data being stored with UIDs that are not sequential, but all over the place. So then, as your question stated above, you need to understand the unintended consequences of this.

jimanvlad on 23 Mar 2018

👍1

At least superficially, it sounds like it would fit our scenario as well. Could that validation be made optional?

liqweed on 23 Mar 2018

I can quickly send a pull request that skips the verification based on a param passed to the dgraph server.

Would that be the right way to introduce it?

akshaydeo on 23 Mar 2018

@akshaydeo I was doing this as well for a time. When you say you bypass the VerifyUid function, does that mean dgraph does no checks of the UID at all? I ask because when I tried maintaining my own set of UIDs and passing those in mutations, I eventually encountered an error to the effect of "UID value cannot be greater than lease".

malzzz on 25 Mar 2018

That's the error verification he's bypassing, check the code that he's linked to :)

Thanks,
Vlad

jimanvlad on 25 Mar 2018

Hey @akshaydeo

It's great to see the interest in this feature and I appreciate your effort on this. Let me first clarify why we have the VerifyUid function.

As you know, Dgraph allocates uids sequentially. If we were to allow the user to set data with an arbitrary uid, say 2500, then when the uid allocator actually gets to that uid (when 2499 new uids have already been allocated), it would append data to the uid instead of creating a new node (because it's hard to check if a uid has already been used). So the node with uid 2500 would have the old data that was set directly by the user as well as the new data.
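
The scenario can be reproduced with a toy model. The sketch below is not Dgraph's implementation: `toyStore` is a hypothetical in-memory stand-in that hands out uids sequentially and appends values, which is all that is needed to show the mix-up at uid 2500.

```go
package main

import "fmt"

// toyStore is an illustrative in-memory model, not Dgraph's implementation:
// it hands out uids sequentially and appends every value written to a uid.
type toyStore struct {
	next uint64
	data map[uint64][]string
}

func newToyStore() *toyStore {
	return &toyStore{next: 1, data: make(map[uint64][]string)}
}

// setAt writes to a caller-chosen uid with no lease check,
// which is what bypassing VerifyUid amounts to in this model.
func (s *toyStore) setAt(uid uint64, value string) {
	s.data[uid] = append(s.data[uid], value)
}

// create allocates the next sequential uid and writes to it.
// It has no way of knowing whether that uid was already used via setAt.
func (s *toyStore) create(value string) uint64 {
	uid := s.next
	s.next++
	s.setAt(uid, value)
	return uid
}

func main() {
	s := newToyStore()
	s.setAt(2500, "set directly by the user")

	var last uint64
	for i := 1; i <= 2500; i++ {
		last = s.create(fmt.Sprintf("node created by allocator #%d", i))
	}
	// The 2500th allocation lands on the uid the user already wrote to,
	// so that node now holds both values.
	fmt.Println(last, s.data[last])
}
```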

Another problem is the hashing function that you use to transform the xid to a uid. Since it is a hashing function, it can lead to collisions (the same uid for two different xids), which would lead to inconsistent data. The probability of collision is higher for longer xids.

The only scenario in which this would work reliably is if no new nodes are created using Dgraph and if the xids are uint64. That is when all data is created outside of Dgraph with a uint64 identifier and it is just stored and retrieved from Dgraph. If that is your use case, then sure bypassing the check can work.

pawanrawal on 25 Mar 2018
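
To get a feel for how likely such collisions are, here is a rough estimate based on the standard birthday approximation. It rests on an assumption that is not from this thread: that the 64-bit hash behaves like a uniform random function, which is an idealization of FNV-1a. Under that assumption the estimate depends on how many distinct xids are hashed, not on how long they are. `collisionProbability` is a name made up for the example.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability estimates the chance that at least two of n distinct
// xids share a uid, using the birthday approximation
//   p ≈ 1 - exp(-n(n-1) / (2 * 2^64)).
// Assumption: the hash is uniform over all 2^64 values.
func collisionProbability(n float64) float64 {
	space := math.Ldexp(1, 64) // 2^64 possible uids
	// Expm1 keeps precision when the probability is very small.
	return -math.Expm1(-n * (n - 1) / (2 * space))
}

func main() {
	for _, n := range []float64{1e6, 1e8, 1e9, math.Ldexp(1, 32)} {
		fmt.Printf("n = %.3g xids -> p ≈ %.3g\n", n, collisionProbability(n))
	}
}
```

The estimate says nothing about the separate problem of a hashed uid colliding with one handed out by the sequential allocator.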

Any updates on a good way to do this? My use case needs xids in the form of URLs, and I've only just learned they're no longer supported. I would really be interested in updates on, at the very least, some decent ways to handle this.

lambdadog on 1 Jul 2018

Is there really no other way than running a substitution on the RDF NQuads and creating a mapping from xids to uids?

lambdadog on 1 Jul 2018

So, to answer the general questions here:

We could allow a way by which you could do your own UID-allocations. Dgraph would then no longer be the one responsible for handing out new UIDs, or doing conflict checking. Let me know if that'd help, @akshaydeo ?

@beta-phenylethylamine : You could assign an xid edge to the nodes, set type as string, and put hash index on them. Then, you could query via the eq function.

In general, we have no plans to natively support an external ID -- because the index based support is the best that we can do at our end, but that mechanism is already available.

manishrjain on 1 Jul 2018

👍2
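
The xid-edge suggestion above can be sketched as a schema line plus a lookup query. The snippet only builds the strings and does not talk to a server; `lookupRequest`, the block name `node`, and the variable name `$xid` are hypothetical. The variables map uses the `$name` key shape that the Dgraph Go client's QueryWithVars is understood to accept, which is an assumption worth checking against the client version in use.

```go
package main

import "fmt"

// xidSchema declares xid as a string predicate with a hash index,
// which is what allows eq(xid, ...) lookups.
const xidSchema = `xid: string @index(hash) .`

// uidByXidQuery looks a node up by its external id. Using a query variable
// means the xid value (for example a URL) needs no manual escaping.
const uidByXidQuery = `query q($xid: string) {
  node(func: eq(xid, $xid)) {
    uid
  }
}`

// lookupRequest (hypothetical helper) pairs the query with its variables.
func lookupRequest(xid string) (string, map[string]string) {
	return uidByXidQuery, map[string]string{"$xid": xid}
}

func main() {
	q, vars := lookupRequest("http://example.com/person/1")
	fmt.Println(xidSchema)
	fmt.Println(q)
	fmt.Println(vars)
}
```

The uid returned by this query is then used as the subject of the follow-up mutation, which is the two-step query -> mutate flow discussed throughout this thread.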

@manishrjain thanks! I actually just realized that I was misinterpreting the docs on that bit myself. Still a bit of a PitA transform on outside RDF data but much more manageable than directly mapping everything imo

lambdadog on 1 Jul 2018

I don't think any action is needed here. Most users use an xid predicate to map Dgraph IDs to their own external ids. That works well already.

manishrjain on 7 Nov 2018

@manishrjain If I use the method you suggest (an xid edge of type string with a hash index, queried via eq), it is very likely that one xid will return multiple query results (corresponding to multiple uids). How can this problem be solved?

478682649 on 25 Mar 2019

I don't think any action is needed here. Most users use an xid predicate to map Dgraph IDs to their own external ids. That works well already.

@manishrjain the point of this issue is node updates.

To update a node, I have to:

Retrieve it using eq
Then update it

Another way would be to keep a uid <-> xid mapping externally.

With external ID support, you could perform this operation directly in a single query.

Correct me if I am wrong?

I would highly recommend having this feature inside Dgraph for most of the use cases where Dgraph is used as a knowledge graph alongside the main database of the system.

akshaydeo on 25 Mar 2019

👍10
