Dgraph: Using weighted edges between two entities with Dgraph?

Created on 16 Jan 2017  Â·  22Comments  Â·  Source: dgraph-io/dgraph

Working on a small project of my own, most likely will remain a hobby, it stores Topics that my application layer qualifies following several rules, and once qualified, inserts it in my graph database.

Once that's done, for that given topic, the application retrieves content from top n pages, wikipedia, amazon, etc... and qualifies a list of Terms (n-grams, where n is between 1 and 5) are related and belong to this given topic.

For each term qualified, the application inserts it in the graph database on its own, and then adds relationships:

  • topic ABC is defined_by term XYZ
  • term XYZ belongs_to topic ABC

Obviously, a very same term can describe several topics with different weights. And also, a topic can be the parent_of, child_of or just related_to another topic. Besides related_to, all relationships should be unidirectional.

So far, my schema is like this:

scalar (
    slug: string @index
    label: string
    context: string
    summary: string
    wordcount: int
    available: int @index
    created: datetime @index
    topic.topic.parent_of: uid
    topic.topic.child_of: uid
    topic.topic.related_to: uid @reverse
    topic.topic.defined_by: uid
    term.term.belongs_to: uid
)

type Topic {
    slug: string
    label: string
    context: string
    summary: string
    available: bool
    created: datetime
    children: Topic
    parents: Topic
    relations: Topic
    vocabulary: Term
}

type Term {
    slug: string
    label: string
    wordcount: int
    created: datetime
    owners: Topic
}

_I have a problem, though._

The predicates would be expressed as:

( Topic A -> parent_of -> Topic B )
( Topic B -> child_of -> Topic A )

( Topic A -> related_to -> Topic Z )
( Topic Z -> related_to -> Topic A )

( Term FFF -> belongs_to -> Topic A )
( Term FFF -> belongs_to -> Topic B )

( Topic A -> defined_by -> Term FFF )
( Topic B -> defined_by -> Term FFF )

But I would like to weigh some of them such as:

( Topic A -> defined_by (weight: .95) -> Term FFF )
( Term FFF -> belongs_to (weight: .95) -> Topic A )

( Term FFF -> belongs_to (weight: .325) -> Topic B )
( Topic B -> defined_by (weight: .325)-> Term FFF )

How to weight these relationships (defined_by, belongs_to, related_to)? Is there a way to add a property called 'weight' to a predicate/relationship (typed as float)?

Creating a third entity acting as a relationship table would be similar to what I would do in MySQL (topic_id, term_id, weight) which would defeat the point of using a graph database perhaps?

Most helpful comment

Shortest path queries (with weighted edges) are supported from from 6869cde5b1bc96bf8d372d4daf10030e28ac6ca0.

All 22 comments

Repeating from the discourse topic:

We don't have a way to attach weights to edges. That's been on my mind for a while, as something we should implement. But, hadn't come across a good use case to warrant it.

What kind of queries would you run, which could utilize the weights? Would you use these as filters? Or, as a way to sort the nodes at the end of the edge?

The project is basically building a knowledge graph of topics with terms associated before even doing anything. Topics would mainly be drawn from wikipedia, amazon categories, quora, stackoverflow, etc.

Then, many topics will share some terms but I cannot afford to have the same relationship between topic a and term x, and topic b and term x, if clearly the probability of term x being used with topic a is much greater.

The weight of a relationship is usually computed by my python or php workers who assess the source of the information and other rules (how many time it was found, in an h1 tag, as an anchor text, etc.)

The use is to then scan for example my blog for each article, find out the topics relevant to each article from the terms used in the content, and suggest relevant topics (related_to, parent_of, child_of) for a new piece of content (sister, parent or narrowed down topic idea queried from Dgraph). There are several other uses but that's an example.

Without a weight on each edge, they are all treated equally which is impossible in my logic, not all terms have the same weight when defining a topic.

(Should I copy this to discourse?)

Understand the point about the weight data and why it's useful. But, can you shed more light on what you want the queries to do, in terms of using this data? How would they use the weight?

For example, are the queries just going to return the weight data; or do you want the database to do something more, like filtering or sorting or something.

P.S. No need to copy to discourse.

I think this idea is along the lines of Neo4J edge attributes. The edge
weight could represent the cost of using the edge aka a network path or the
length or travel time if the edge was a salesman's route. Neo4K takes it
further with unlimited attributes on the edge.

On Jan 16, 2017 2:31 PM, "Manish R Jain" notifications@github.com wrote:

Understand the point about the weight data and why it's useful. But, can
you shed more light on what you want the queries to do, in terms of using
this data? How would they use the weight?

For example, are the queries just going to return the weight data; or do
you want the database to do something more, like filtering or sorting or
something.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/dgraph-io/dgraph/issues/495#issuecomment-272966530,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABL-moOJs2csJHMUna4AmYC412Wioo93ks5rS-GYgaJpZM4Lkamb
.

With a weighted graph you can find the shortest path to an object. A perfect use case would be loading OSM data into dgraph..finding the fastest way from point a to point b, finding the cheapest set of flights to a location, find me the nearest ice cream (by travel time) shop that serves rocky road, optimize routing for IP traffic, etc.

RDF NQuad spec doesn't provide a way to accept weights. So, I think to support this case, we'll have to modify how we accept mutations to Dgraph. Maybe something like this might work -- in slight deviation from NQuad spec.

    set {
        <0x0a> predicate <value> (key1=weight, key2=weight, key3=weight, label=value) .
    }

Any suggestions are welcome!

Yes, the idea is to find the shortest path from one entity to another when the number of objects in between them are the same, but their relationships aren't weighted the same. It would be used for sorting and filtering ideally within the query to get the most relevant results returned. Thanks everybody!

Just to feed back, edge weights would be great. One of the things I checked early on (having used neo4j previously) was whether ngraph supported edge weights and edge attributes in general.

I've briefly used neo4j in the past to perform routing with Open Street Map data using edge weights for travel time like @colek42 mentions.

And for dgraph I have a use case similar to what @lazharichir mentions where I'd periodically update edge weights from python and query dgraph with filtering and sorting on the logic.

Made up example:
score how strong a social network user's friendship is with each friend, save this score on the friendship predicate, and then query "find the top 10 friends of my 3 best friends who I'm not friends with". (Not sure if there's a way to do this at the moment without edge weights - for this query my mind goes straight to edge weights).

TLDR...
I think having the option of multiple (possibly-directed) edge weights and general attributes would be ideal. Is there a way to use use a unique id per predicate to do this? And then keep as close to Nquad syntax as possible and store these edge weights/attributes in separate nquads.

Quick thought on the implementation...

If a posting list looks like this...

Entity Attribute ValueId


Me friend person0
Me friend person1
Me friend person2
Me friend person3

could you store an extra id unique for each edge (here called PredicateUID), e.g.:

"friend" posting list:
Entity ValueId PredicateUID


Me person0 friend645143413
Me person1 friend431341234
Me person2 friend38302943
Me person3 friend531434311

Then one could add any attributes they wanted to the edge with a normal nquad:
friend645143413 0.787
friend431341234 0.877
friend431341234 17/01/2017
friend431341234 16

I don't know enough about dgraph's internals to judge if storing PredicateUIDs would be remotely possible. (I'm sure they could be stored, really I mean whether it would work with the nquad syntax, uploading in batch, doing distributed queries, etc.).

Side node:
Where edge attributes really shine I think is with advanced queries. If dgraph is as fast as it sounds, then maybe writing a single query to do something complicated isn't always necessary. Once could run a query and if needed aggregate and compute client-side, and store the intermediate results in the graph on individual edges, and then do queries using these edge results.

E.g. one could for all social profiles store the number of friends two users have in common on the friendship relationship between them (maybe updating this each query or likely periodically), then use this in other queries...

...e.g. to run a query filtering first on a user's friends who aren't in their main social circle.
*Made up example. The point is the case where something is either hard or not possible in a single query. Could edge attributes be used to store intermediate results such as number_of_friends_in_common. Taking advance of dgraphs speed, two queries where intermediate results are stored in the graph might work (if not live in production at least in regular batch runs).

Edit: apologies for the essay! Started drinking coffee again :).

@MikeLeonard : So, in this query: [find the top 10 friends of my 3 best friends who I'm not friends with], it would look something like this:

{
  var(id: me) {
    friends (first: 3, order: weight) {
      F: friends (first: 10, order: weight) @filter(has(friends, 0x0a))
    }
  }

  nonfriends(F) {
    name
  }
}

So, if we can order by weight, we can execute the query easily.

Alternatively, if the data was modified to store an intermediate edge between PERSON and FRIEND, it would work today.

ME -FRIEND-> INTERMEDIATE NODE -ID-> FRIEND NODE
                               -SCORE-> 0.87
                               -ATTR-> somethingelse

That data model would just immediately work for the friend query.

So, I think the only real reason for adding weights to edges, would be to support shortest path sort of queries. In fact, I think if you need to store more than the weight, the data modelling should be changed to accommodate that.

Alternatively, we could try and simplify some of the data modeling challenges by handling them within Dgraph -- this needs further thought.

But without weight, my topic/term example becomes difficult when it fact it is a perfect use of graph database. The only way to do it properly and efficiently is to add weight to each edge between TOPIC->PREDICATE(weight = 0.95)->TERM so when querying, we can sort and/or filter by the edge(s)' weight.

If at first it cannot be performed through a query (computing the shortest path) then it would be ideal to store that weight for each edge so we can retrieve it and use it in our application layer.

FIND the shortest path BETWEEN topic.X AND topic.Y so it can give us the closest nodes in between them, but there are many other use cases.

Looking forward to see how could it be shaped and added - I do like the abstract implementation that @MikeLeonard used of the Entity ValueId PredicateUID as it allows for more than just using a weight, and virtually adding any attribute(s) to a predicate. Not sure about the penalty inflicted to performance and other precious values Dgraph has, though.

tldr

So I've had another idea separate to the predicateUID idea above.

Let's take your example @manishrjain...

ME -FRIEND-> INTERMEDIATE NODE -ID-> FRIEND NODE
                               -SCORE-> 0.87
                               -ATTR-> somethingelse

How about: under the hood dgraph stores the edge weights/attributes using an intermediate node like in the example. A bit like the @reverse edge feature, the query/mutation syntax could be extended so that when a user writes a filter or sort modifier on the -FRIEND-> edge in a query, dgraph converts this query into one that uses a path via an intermediate node.

More details

If the user has for example:

ME -FRIEND-> FRIEND NODE

Let's say I want to add via a mutation the following (not sure of the syntax to this but just imagine nice syntax for now)...

  • mutation: add to that friend edge a score of 0.7 and the attribute somethingelse, to use your example.

Then like how dgraph creates the reverse edge when @reverse is set in the schema... could dgraph create the following:

ME - ___FRIEND -> INTERMEDIATE NODE -___FRIEND-ID-> FRIEND NODE
                               -SCORE-> 0.87
                               -ATTR-> somethingelse

so that we have:

ME -FRIEND-> FRIEND NODE
ME - ___FRIEND -> INTERMEDIATE NODE -___FRIEND-ID-> FRIEND NODE
                               -SCORE-> 0.87
                               -ATTR-> somethingelse

Note that the predicate names ___FRIEND and ___FRIEND-ID are essentially internal to dgraph and are formed by dgraph (not the user) from some predictable function so that dgraph can translate a query through the predicate 'FRIEND' into a query through the predicates '___FRIEND' and then '___FRIEND-ID'. (I guess advanced users could also directly query these as well, no need to hide them - so the predictable function should be easy, like adding a few underscores).

*edit: or another way would be to have the predicate also called FRIEND, and then specially mark intermediate nodes in the FRIEND posting list (and put them all at the end of the list) so that dgraph always filters them out when a normal FRIEND query is requsted, and uses them when a filter/sort/etc FRIEND query is requested.

@manishrjain your query example:

{
  var(id: me) {
    friends (first: 3, order: weight) {
      F: friends (first: 10, order: weight) @filter(has(friends, 0x0a))
    }
  }

  nonfriends(F) {
    name
  }
}

could then work as follows:

  • dgraph knows that your trying to filter on an edge weight/attribute
  • so it translates your query on the server into the correct query that uses the ---___FRIEND and ___FRIEND-ID predicates, ie one that you could write now with the current functionality [would add it below, but I'm don't actually know how to do that query quite yet, assuming it's possible - keen to here the solution if anyone can post it]

Notes:

Why ___FRIEND and not just use the same predicate 'FRIEND'? Maybe you could, would need more thought. see edit above

Origin of the idea:

If I wanted, like @manishrjain pointed out I could store edge weights/attributes now using an intermediate node. Only it's a bit confusing (at least for me) because then, if I have just:

ME -FRIEND-> INTERMEDIATE NODE -ID-> FRIEND NODE
                               -SCORE-> 0.87
                               -ATTR-> somethingelse

then I always need to write my queries remembering I have the intermediate nodes, i.e. I need to filter them out if I don't want to use edge weights.
Plus I need to decide that I want edge attributes in advance, otherwise I'd have to fuss changing me-friend-friend_node relationships that I already have into ones that use an intermediate node. In neo4j it's easy, just add edge weights/attributes to any existing edge.

Of course I could have both:

ME -FRIEND->FRIEND NODE
ME -FRIEND-> INTERMEDIATE NODE -ID-> FRIEND NODE
                               -SCORE-> 0.87
                               -ATTR-> somethingelse

but then I've got to either:
(a) have both predicates the same (FRIEND). But then queries where I don't need the edge attributes have got to filter the intermediate nodes out.
(b) have a FRIEND and some other predicate like FRIEND (I guess now you see where this idea came from...) such as ___FRIEND.

To be fair I think either (a) or (b) are fine - this was how I had thought about edge attributes for my use case (I prefer (b) but need to think about (a) more).

Summary

If there's a way for dgraph to handle edge weights/attributes for me versus me having to create intermediate nodes myself then you've got something great! I think people logically get the idea of edge weights... especially people familiar with neo4j... but teaching users how to do edge weights using intermediate nodes might be a hard challenge, so the idea is, in summary: maybe there's a way to do this automagically for people.

It would be overly complicated to implement such a thing. We wouldn't know exactly which edges need to have intermediary nodes and which don't. So, this would essentially trigger either doing it for everything, which means every traversal would now be two traversals; or not doing it at all.

One could argue that for some predicates, a @intermediate keyword could be specified in the schema. But that'd have to be done from the get-go. Otherwise, adding such a keyword after the predicate data has been written; would require rewriting everything in this new form.

Not to mention, this is a recipe for abuse. Instead of designing good schemas which tackle such cases, users would want everything to be intermediary, just in case they need it; without realizing the performance impact of such things.

What we can do, is to have multiple weights attached to each edge; and store it so they're sorted; thus allowing querying the shortest path; or shortest N paths based on (using Google maps example) time or distance or other features. But, they'd all be essentially weights with different keys, using float type; and not arbitrary values.

@manishrjain not that I need that but do you think predicate attributes could also be strings? or to you, if you start giving too many attributes besides 'measures' (weight distance time) it should become an entity/node ?

I was thinking of something similar to Micheal. My idea would look like this

Alice - edge -> 'Friend(Alice,Bob)' - edge -> Bob

This amounts to a pair of triples

Alice edge 'Friend(Alice,Bob)' .
'Friend(Alice,Bob)' edge Bob .

Where 'edge' is a generic edge, and 'Friend(Alice,Bob)' is just a name for a unique node shared by only these two triples.

This, I think, lets one simulate the subject - predicate -> object model from Cypher without requiring changes to the basic Dgraph implementation.

We can easily attach weights to the 'Friend(Alice,Bob)' node when creating the two triples.

I am not sufficiently familiar with GraphQL to know how to query such a construct.

@lazharichir -- Yeah, I think if you have to give other attributes besides measures, those must be done via intermediate nodes. And in general, intermediate nodes must be something Dgraph is fed, as opposed to Dgraph creating and maintaining them.

@djhenderson: The Friend(Alice, Bob) is what I was referring to as intermediary nodes. And yes, anything can be attached to them easily. For example, Freebase has intermediate nodes between actor and film, so they can track the character that the actor played in the film.

https://wiki.dgraph.io/Legacy_Test_Queries#Film_Schema

That's what I thought and makes total sense - looking forward to these edge weights in the future, definitely a massive step ahead and I believe in the right direction!

Also filed #549 to handle the requests made by folks here and elsewhere on Slack.

Shortest path queries (with weighted edges) are supported from from 6869cde5b1bc96bf8d372d4daf10030e28ac6ca0.

Will try it out soon as it's at the core of my project so will raise an issue if anything goes wrong or if i see any improvement possible.

Thanks a lot guys, you truly are efficient.

Closing this now. Feel free to open a new issue for improvements or suggestions on shortest path queries.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

fritzblue picture fritzblue  Â·  5Comments

armaneous picture armaneous  Â·  3Comments

mbudge picture mbudge  Â·  3Comments

yupengfei picture yupengfei  Â·  4Comments

ShawnMilo picture ShawnMilo  Â·  4Comments