Orientdb: MATCH bug in location existence check

Created on 18 Nov 2016  路  28Comments  路  Source: orientechnologies/orientdb

OrientDB Version, operating system, or hardware.

  • v2.0 SNAPSHOT[ ] - .18[ ] .17[ ] .16[ ] .15[ ] .14[ ] .13[ ] .12[ ] .11[ ] .10[ ] .9[ ] .8[ ] .7[ ] .6[ ] .5[ ] .4[ ] .3[ ] .2[ ] .1[ ] .0[ ]
  • v2.1 SNAPSHOT[ ] - .16[ ] .15[ ] .14[ ] .13[ ] .12[ ] .11[ ] .10[ ] .9[ ] .8[ ] .7[ ] .6[ ] .5[ ] .4[ ] .3[ ] .2[ ] .1[ ] .0[ ]
  • v2.2 SNAPSHOT[ ] - .13[x] .rc1[ ] .beta2[ ] .beta1[ ]

Operating System

  • [x] Linux
  • [x] MacOSX
  • [ ] Windows
  • [ ] Other Unix
  • [ ] Other, name?

Expected behavior and actual behavior

Expected: If a traversal location named baz exists in the output, then checking $matched.baz IS null should be false in the MATCH query.

Actual: See below for an instance where data is returned for a location named baz, and yet the query evaluates $matched.baz IS null as true.

Steps to reproduce the problem

Here's what the formed structure looks like:

Foo ==> Bar ==> Baz
  \
   \=> Far

Setup:

CREATE CLASS Foo EXTENDS V
CREATE CLASS Bar EXTENDS V
CREATE CLASS Baz EXTENDS V
CREATE CLASS Far EXTENDS V
CREATE CLASS Foo_Bar EXTENDS E
CREATE CLASS Bar_Baz EXTENDS E
CREATE CLASS Foo_Far EXTENDS E

CREATE VERTEX Foo SET name = 'foo'
CREATE VERTEX Bar SET name = 'bar'
CREATE VERTEX Baz SET name = 'baz'
CREATE VERTEX Far SET name = 'far'

CREATE EDGE Foo_Bar FROM (SELECT FROM Foo) TO (SELECT FROM Bar)
CREATE EDGE Bar_Baz FROM (SELECT FROM Bar) TO (SELECT FROM Baz)
CREATE EDGE Foo_Far FROM (SELECT FROM Foo) TO (SELECT FROM Far)

Problematic queries:

  • baz must always exist because it's not optional
MATCH {
    class: Foo,
    as: foo
}.out('Foo_Bar') {
    as: bar
}, {
    class: Bar,
    as: bar
}.out('Bar_Baz') {
    as: baz
}, {
    class: Foo,
    as: foo
}.out('Foo_Far') {
    where: ($matched.baz IS null),
    as: far
}
RETURN $matches

// returns:
// +----+-----+-----+-----+-----+
// |#   |foo  |bar  |far  |baz  |
// +----+-----+-----+-----+-----+
// |0   |#11:0|#12:0|#14:0|#13:0|
// +----+-----+-----+-----+-----+
// 
// Note the existence of a result for "baz".
  • baz might not exist because now it's optional
MATCH {
    class: Foo,
    as: foo
}.out('Foo_Bar') {
    as: bar
}, {
    class: Bar,
    as: bar
}.out('Bar_Baz') {
    optional: true,
    as: baz
}, {
    class: Foo,
    as: foo
}.out('Foo_Far') {
    where: ($matched.baz IS null),
    as: far
}
RETURN $matches

// returns:
// +----+-----+-----+-----+-----+
// |#   |foo  |bar  |far  |baz  |
// +----+-----+-----+-----+-----+
// |0   |#11:0|#12:0|#14:0|#13:0|
// +----+-----+-----+-----+-----+
// 
// Note the existence of a result for "baz".

This is extremely frustrating, I was hoping that your tests would have caught and prevented such a simple bug from making it into a release. This entire feature is broken at the moment.

bug

Most helpful comment

If you are not doing late evaluation of $matched until 3.0, I think it's a better idea to implement the heuristics on top of the topological sort. It's a much cleaner solution because it's provably correct and doesn't need to retry with multiple entry points like you say. Also, those retries can be really expensive -- getting a good order on the first try is much more likely to be faster overall.

In 3.0, you'll be able to update the topological sort with "explicitly ignored dependencies" where $matches conditions evaluate to true early and then are re-evaluated later. Topological sort doesn't prevent that optimization from being implemented, but it will prevent many issues until 3.0 comes out.

Since late evaluation of $matched isn't coming until 3.0, please reconsider topological sort + heuristics. Tracking down and fixing correctness errors wastes time for us both, and money for both our companies.

All 28 comments

Hi @obi1kenobi

I'm working on this right now

Thanks

Luigi

Hi @obi1kenobi

I just pushed a fix for this. It should cover all the cases, for sure it covers your one, but I'm still doing some tests to make sure that everything is ok.
You can find the fix on 2.2.x branch, it will be released with 2.2.14

Thanks

Luigi

@luigidellaquila,

Your fix seems to convert the "WHERE" clauses to a string and then check if they contain $matched. I think this is incorrect -- consider a query like the following:

MATCH {
    class: Foo,
    where: (name == "$matched.something"),
    as: foo
} RETURN $matches

Your code will incorrectly believe that an alias called "something" is checked in the filtering step, when that is not the case.

I don't think this issue is fixed, I think the bug just moved elsewhere.

Here are a few examples that prove my point.

This one causes a NullPointerException because sortEdges() returns an empty list, even though estimateRootEntries() returned a nonempty map:

MATCH {
    class: Foo,
    where: (name != 'abc'),
    as: foo
}.out('Foo_Bar') {
    as: bar
}, {
    class: Bar,
    where: (name != 'def'),
    as: bar
}.out('Bar_Baz') {
    as: baz
}, {
    class: Foo,
    as: foo
}.out('Foo_Far') {
    where: ($matched.baz IS NOT null),
    as: far
}
RETURN $matches

Here's an example that triggers the incorrect .contains("$matched.") logic I pointed out above. This one causes estimateRootEntries() to return an empty map, which obviously means the query fails with a NullPointerException because there is no entry point.

MATCH {
    class: Foo,
    where: (name != '$matched.abc'),
    as: foo
}.out('Foo_Bar') {
    as: bar
}, {
    class: Bar,
    where: (name != '$matched.def'),
    as: bar
}.out('Bar_Baz') {
    as: baz
}, {
    class: Foo,
    as: foo
}.out('Foo_Far') {
    where: ($matched.baz IS NOT null),
    as: far
}
RETURN $matches

Ouch... you're right, I missed a little piece there... thank you for pointing it out.
That has to be replaced with a deep check on the conditions

Working on it

Thanks

Luigi

I think you also need to verify that the $matches dependencies don't form a cycle like:

foo --requires--> $matches.bar
bar --requires--> $matches.baz
baz --requires--> $matches.foo

I also believe that the two-pass evaluation (no $matches, then all $matches edges) in sortEdges() has unhandled edge cases. For example:

First pass:
foo does not use $matches, and is placed first.
bar uses $matches.baz, so is moved to the second pass.
baz uses $matches.far, so is moved to the second pass.
qux uses $matches.bar, so is moved to the second pass.
far does not use $matches, and is placed second.

The second pass simply iterates over lateEvaluatedEdges in an undefined order. Nothing in the code (as far as I can tell) guarantees that the placement order in the second round will be baz, bar, qux, as it must be for all the $matches references to be correctly resolved.

To do this correctly, you must do a topological sort (which is why there mustn't be cycles, as above). The topological sort will take into account all the dependencies and produce a viable execution order.

Also, you may want to remove the "resolved" label, because it is misleading.

Hi @obi1kenobi

Yes, about cyclic dependencies, I realized it while developing the fix and I already have a point in my TODO list.

Thanks

Luigi

Ok, first fix applied.

Also a basic check on circular dependencies is implemented.

Closing, thanks!

Luigi

Luigi, this bug is not fixed, and the issue shouldn't be closed. From above:

The two-pass evaluation (no $matches, then all $matches edges) in sortEdges() has unhandled edge cases. For example:

First pass:
foo does not use $matches, and is placed first.
bar uses $matches.baz, so is moved to the second pass.
baz uses $matches.far, so is moved to the second pass.
qux uses $matches.bar, so is moved to the second pass.
far does not use $matches, and is placed second.

The second pass simply iterates over lateEvaluatedEdges in an undefined order. Nothing in the code (as far as I can tell) guarantees that the placement order in the second round will be baz, bar, qux, as it must be for all the $matches references to be correctly resolved.

To do this correctly, you must do a topological sort (which is why there mustn't be cycles, as above). The topological sort will take into account all the dependencies and produce a viable execution order.

Until you add a topological sort, there will always be situations with no cycles that still evaluate in the wrong order. If you don't believe me, I will keep producing counter-examples.

Here's a clarification of what I mean by "topological sort" (also known as "dependency resolution"):
https://en.wikipedia.org/wiki/Topological_sorting

@luigidellaquila Im missing the tests in the commit.

Doing too many things at once ;-)

@obi1kenobi unfortunately I cannot use topological sort out of the box for at least two reasons:

  1. the pattern is not a DAG, so in general a topological sort cannot be found
  2. even in cases when the pattern is a DAG, the optimization of the query execution depends on the number of loaded vertices and traversed edges, that depends a lot on the graph structure, not only on the conditions on the $matched conditions

I'm studying a good algorithm that is a trade off between performance and completeness, but it will be an iterative process, so I think in this first phase some queries will be rejected from the parser (like in cases when you have cascading $matched conditions).

Test cases for the supported cases are coming of course.

Thanks

Luigi

I don't think that line of reasoning is correct.

  1. the pattern is not a DAG, so in general a topological sort cannot be found

The pattern is not always DAG, but it is always a directed graph. If it is cyclic, it is by definition an invalid query because it can't be computed. If it isn't cyclic, then it's a DAG and it has a proper topological ordering.

  1. even in cases when the pattern is a DAG, the optimization of the query execution depends on the number of loaded vertices and traversed edges, that depends a lot on the graph structure, not only on the conditions on the $matched conditions

The topological sort doesn't specify which vertex you choose next. What it does specify is which vertices you're allowed to choose without violating the dependency constraints. This is not a completeness criterion, it's a correctness criterion. You should by all means apply the heuristics to choose which of the vertices with satisfied dependencies to execute next, but those heuristics can't tell you to execute a vertex that has an unsatisfied dependency, in violation of the topological order.

In other words, what you are proposing is going to return incorrect answers with high performance, which is not useful. What I am proposing is going to return correct answers with the best performance due to the heuristics.

Side note: if dependencies aren't correctly evaluated, or if acyclic queries are rejected by the parser / query evaluator, that would make my GraphQL-to-MATCH compiler useless. There's no way for it to know what valid acyclic queries the database will refuse to execute, so I won't be able to open-source it. That would make me, my company and perhaps @lvca and the wider OrientDB community quite sad.

No, wait, I'm not proposing to return incorrect results.

My point is a bit different, we have two alternative to evaluate a $matched condition:

  1. evaluate it in-line
  2. evaluate it after the complete pattern calculation

You'd say that the first solution is always more efficient, but it's not the case in general. Consider the following:

MATCH
{class:Person, as:a} -FriendOf-> {class:Person, as:b, where:(age = 99 and $matched.name = $a.name)}
RETURN $patterns

Evaluating the $matched in-line, will oblige me to start from a, but maybe I have 6 billion people in my db with 100 friends each, so I'd be forced to calculate 600 billion patterns.

If I start from b though, I'd just load a few hundreds of people (who are 99yo) evaluating $matched.name = $a.name to true at the beginning and then re-evaluating it later, after the complete pattern calculation. In this case I'd just evaluate a few thousands of patterns.

This second case would be much much more efficient as you can see.

Also the topic of the DAG is not simple, a pattern can be traversed in both directions in some cases (an out() is just the opposite of an in()) but not in others (eg. if I have a while condition), the query optimizer has to take this into consideration.

Of course I'll use some heuristics from graph theory to do that (spanning trees, topological sort and so on), but that's not all, I also have to consider data related assumptions. That's where things get trickier.

Bad news is that most of these things are quite simple to implement in v. 3.0 where I already have flatted WHERE conditions (only one level of AND and OR conditions) that I can easily analyze to do these assumptions, while in v. 2.2 I lack some of these pieces, so for now I'm 'asymptotically' aiming at an efficient and complete solution for 2.2, but I think we will have to deal with some "unsupported" cases (I mean explicitly unsupported, not wrong results) or with some very sub-optimal execution plans in this phase.

Also please consider that MATCH statement is a new addition, so it still has some sharp edges and probably in v 3.0 we will change some default behaviors (I'll discuss this later, when the first alpha/M1 release is finalized). Of course any help on both conceptual points like this, and on the code, is always very appreciated :)

Thanks

Luigi

Can you define "explicitly unsupported"? Is this closer to "the documentation clearly explains what you're not allowed to do", or to "the database will sometimes return an error for some MATCH queries"?

I'd say both...

Unless there is clear documentation that defines what MATCH queries will always be executable, there is no point in compiling GraphQL to MATCH. If the GraphQL compiler happens to generate MATCH that OrientDB rejects at runtime, there's nothing that can be done -- that GraphQL query just cannot be done. This is a very slippery slope.

I understand your point, I think I'll update the docs asap and make it clear what is possible now and what will be possible in the near future

Thanks

Luigi

Thanks! Are you still planning on correcting the execution order with something more thorough than the two-pass heuristic by the next hotfix release? This is currently the highest-priority bug for me to fix at my company.

Yes, I'd like to review the whole strategy based on

  • n-step heuristic for consistent behavior
  • multiple entry-points for the traversal, so that if the first heuristic fails for the first "starting" node it can be retried with another one with lower score in terms of performance
  • (in v 3.0) evaluation of strategies where a later evaluation is preferred compared to an in-line evaluation of the $matched conditions

it's not a matter of two hours, but I think I can guarantee good improvements in next few days (always depending on further problems and priorities that can arise in the meantime...)

Thanks

Luigi

If you are not doing late evaluation of $matched until 3.0, I think it's a better idea to implement the heuristics on top of the topological sort. It's a much cleaner solution because it's provably correct and doesn't need to retry with multiple entry points like you say. Also, those retries can be really expensive -- getting a good order on the first try is much more likely to be faster overall.

In 3.0, you'll be able to update the topological sort with "explicitly ignored dependencies" where $matches conditions evaluate to true early and then are re-evaluated later. Topological sort doesn't prevent that optimization from being implemented, but it will prevent many issues until 3.0 comes out.

Since late evaluation of $matched isn't coming until 3.0, please reconsider topological sort + heuristics. Tracking down and fixing correctness errors wastes time for us both, and money for both our companies.

@luigidellaquila any movement on this? This is currently the biggest roadblock to the GraphQL compiler I'm developing, and I'd really love to see this fix in the next release, which I assume will probably be sometime next week. Please let me know how I can help, and please reconsider the topological sort approach as I described in the above post.

@luigidellaquila @lvca Here's an example of the GraphQL that the topological sort in the MATCH dependency resolution would enable. Say you have an OrientDB database describing a zoo, with a variety of animals that are fed on some schedule, with the following schema:

CREATE CLASS Entity EXTENDS V ABSTRACT
CREATE PROPERTY Entity.name String

CREATE CLASS Animal EXTENDS Entity

CREATE CLASS Event EXTENDS Entity

CREATE CLASS Animal_ParentOf EXTENDS E
CREATE PROPERTY Animal_ParentOf.out LINK Animal
CREATE PROPERTY Animal_ParentOf.in LINK Animal

CREATE CLASS Animal_FedAt EXTENDS E
CREATE PROPERTY Animal_FedAt.out LINK Animal
CREATE PROPERTY Animal_FedAt.in LINK Event

My GraphQL compiler would allow users to write complex GraphQL queries with minimal effort. For example, here's how a user could query "the names of all animals who were fed together with their offspring":

{
    Animal {
        name @output(out_name: "animal_name")
        out_Animal_ParentOf {
            out_Animal_FedAt {
                name @tag(tag_name: "offspring_fed_at_event")
            }
        }
        out_Animal_FedAt {
            name @filter(op_name: "=", value: ["%offspring_fed_at_event"])
        }
    }
}

Here's the MATCH code my compiler produces as output:

MATCH {
    class: Animal,
    as: Animal___1
}.out('Animal_ParentOf') {
    as: Animal__out_Animal_ParentOf___1
}.out('Animal_FedAt') {
    as: Animal__out_Animal_ParentOf__out_Animal_FedAt___1
} , {
    class: Animal,
    as: Animal___1
}.out('Animal_FedAt') {
    where: (name = $matched.Animal__out_Animal_ParentOf__out_Animal_FedAt___1.name),
    as: Animal__out_Animal_FedAt___1
}
RETURN Animal___1.name AS animal_name

My compiler can turn GraphQL that everyone knows and loves into MATCH that OrientDB executes quickly. Just look at how much simpler the GraphQL is compared to the equivalent MATCH! Some of my company's GraphQL queries expand to hundreds of lines of MATCH, which would be impossible without the GraphQL compiler. No other database today can execute GraphQL as a single efficient query.

However, without a topological sort on the $matched dependencies, the generated MATCH queries return incorrect results -- locations that exist are resolved to null and the whole result set is incorrect. The N-step heuristics and retrying if one resolution attempt fails will not fix this -- someone will always come up with GraphQL that doesn't do the right thing. Even with your concerns on topological sort's performance, GraphQL via MATCH will still be so much faster than competing solutions that it's not even a fair comparison.

I'd love to open-source this in the next few weeks, but I'm not going to open-source a project that doesn't work. Please help me address the scheduling problem in the December release of OrientDB.

See #6983 for an example of hard-to-track-down bug that comes up due to not using topological sort.

Was this page helpful?
0 / 5 - 0 ratings