Exist: sibling axes used without element name very inefficient

Created on 20 Jul 2018 · 18Comments · Source: eXist-db/exist

What is the problem

whereas an XPath expression like
preceding-sibling::a
works as it should, more general expressions involving the sibling axes are _extremely_ inefficient.
preceding-sibling::* preceding-sibling::node() preceding-sibling::text()
do seem to work, but take ages to evaluate both as steps in an XPath and as filter expressions. The same goes for following-sibling.

What did you expect

fast execution

Describe how to reproduce or add a test

On a collection containing just a few thousand elements <a>, run a query like
$coll//a[following-sibling::*]
sit back and relax (for 10-30 minutes!!!)

I would also like to add that these queries cannot be killed, or more precisely, killing them takes just as long as running them.

Context information

Please always add the following information

eXist-db version : all recent versions, I'm using 4.3 at the moment
Java version (Java8u161)
Operating system (Windows 10)
64 bit
How is eXist-db installed? (JAR installer)

bug investigate in progress

Source

merenyics

Most helpful comment

So my findings so far are:

the EQName Test query (preceding-sibling::a) takes a very different path to both, the Wildcard name test query (preceding-sibling::*), and the AnyKindTest query (preceding-sibling::node()). This disjunction happens at line 888 of https://github.com/eXist-db/exist/blob/develop/src/org/exist/xquery/LocationStep.java#L888
1. The Wildcard and AnyKindTest queries are considered wildcard tests, these fall-back to doing a range scan over the dom.dbx file, which is known to be much slower than using an index.
2. The EQName test query uses the Structural Index to lookup the siblings in the structure.dbx file, which is much faster than a range scan.

However, a profiler shows that the problem with the wildcard tests is not actually the dom.dbx range scan, but rather that almost all of the execution time is spent in org.exist.util.FastQSort for NewArrayNodeSet.

After some examination, it seems that NewArrayNodeSet is less than ideal to use for sibling lookups (and other operations as well I suspect). The LocationStep#PrecedingSiblingFilter calls NewArrayNodeSet#get and then NewArrayNodeSet#add for every sibling it finds whilst scanning dom.dbx. This causes the NewArrayNodeSet to be re-sorted on every call to NewArrayNodeSet#get, i.e. once for every single preceding-sibling that matches the expression.

AFAIK NewArrayNodeSet has a number of problems/choices in it's design which make it unsuitable to use when building node sets.

So how can we improve this:

As the items in NewArrayNodeSet, are sorted on every call to it's get#, we know that it is already partially sorted (that being an understatement!). So instead of eXist-db's IntroSort we could use a sort which exploits that fact, if I switch NewArrayNodeSet from IntroSort to TimSort for example, then the query doc("/db/anchors.xml")/anchors/a[position() eq 9999]/preceding-sibling::node() which previously executed in 5076ms now executes in 1052ms. An improvement of 482%!

However, (1) is really just reducing the problem of the bad design. Instead we can try replacing LocationStep's use of NewArrayNodeSet with AVLTreeNodeSet, which doesn't sort constantly, as items are stored in a sorted tree manner. The query doc("/db/anchors.xml")/anchors/a[position() eq 9999]/preceding-sibling::node() which previously executed in 5076ms now executes in 20ms, an improvement of 25,380%! This is actually faster than the index based version, which makes me think that the index based version also needs some improvement!
I suspect that AVLTreeNodeSet has some performance/memory overhead that we don't even necessarily need. If I stay with NewArrayNodeSet and implement a HashMap inside PrecedingSiblingFilter#accept which tracks which nodes have been seen, so we don't need to call NewArrayNodeSet#get then I can get the performance for this query: doc("/db/anchors.xml")/anchors/a[position() eq 9999]/preceding-sibling::node() to about 11ms, i.e. a reduction in execution time of 46,145%.

I need to discuss these findings with @wolfgangmm but I think there is much room for improvement here. In summary:

Don't use NewArrayNodeSet when building unique sets... or maybe at all
When appending small nodes to it and then reading them backNewArrayNodeSet may benefit from a sort algorithm that can exploit it's already semi-sorted data.

I think ultimately we may need to implement some better NodeSets for various use-cases, or perhaps NodeSet's that operate in distinct builder and built phases.

to be continued...

adamretter on 23 Jul 2018

👍2

All 18 comments

@merenyics I don't have enough concrete information to reprdoduce exactly your XQuery and XML.

Anyway, I tried a couple of things, and I think I can see the issue that you have in 2.ii, 2.iii, 3.ii, and 3.iii.

If I create a document /db/anchors.xml with 10,000 <a> elements by executing the following XQuery:

xmldb:store("/db", "anchors.xml",
    document {<anchors> {(1 to 10000) ! <a id="{.}"/> }</anchors>})

I see the following simple query executions:
1. doc("/db/anchors.xml")//preceding-sibling::a: 78ms
2. doc("/db/anchors.xml")//preceding-sibling::*: 77ms
3. doc("/db/anchors.xml")//preceding-sibling::node(): 77ms
4. doc("/db/anchors.xml")//preceding-sibling::text(): 11ms
If I try and exercise the preceding-sibling axis further:
1. doc("/db/anchors.xml")/anchors/a[position() eq 4321]/preceding-sibling::a: 55ms
2. doc("/db/anchors.xml")/anchors/a[position() eq 4321]/preceding-sibling::*: 895ms
3. doc("/db/anchors.xml")/anchors/a[position() eq 4321]/preceding-sibling::node(): 832ms
4. doc("/db/anchors.xml")/anchors/a[position() eq 4321]/preceding-sibling::text(): 9ms
and yet further:
1. doc("/db/anchors.xml")/anchors/a[position() eq 9999]/preceding-sibling::a: 70ms
2. doc("/db/anchors.xml")/anchors/a[position() eq 9999]/preceding-sibling::*: 5066ms
3. doc("/db/anchors.xml")/anchors/a[position() eq 9999]/preceding-sibling::node(): 5076ms
4. doc("/db/anchors.xml")/anchors/a[position() eq 9999]/preceding-sibling::text(): 12ms

So the problem here seems to be with the preceding-sibling::* and preceding-sibling::node() expressions. We will have to dig into this to see if indexes are not being used or with a profiler to ascertain where the performance loss comes from...

adamretter on 23 Jul 2018

So my findings so far are:

the EQName Test query (preceding-sibling::a) takes a very different path to both, the Wildcard name test query (preceding-sibling::*), and the AnyKindTest query (preceding-sibling::node()). This disjunction happens at line 888 of https://github.com/eXist-db/exist/blob/develop/src/org/exist/xquery/LocationStep.java#L888
1. The Wildcard and AnyKindTest queries are considered wildcard tests, these fall-back to doing a range scan over the dom.dbx file, which is known to be much slower than using an index.
2. The EQName test query uses the Structural Index to lookup the siblings in the structure.dbx file, which is much faster than a range scan.

AFAIK NewArrayNodeSet has a number of problems/choices in it's design which make it unsuitable to use when building node sets.

So how can we improve this:

As the items in NewArrayNodeSet, are sorted on every call to it's get#, we know that it is already partially sorted (that being an understatement!). So instead of eXist-db's IntroSort we could use a sort which exploits that fact, if I switch NewArrayNodeSet from IntroSort to TimSort for example, then the query doc("/db/anchors.xml")/anchors/a[position() eq 9999]/preceding-sibling::node() which previously executed in 5076ms now executes in 1052ms. An improvement of 482%!

However, (1) is really just reducing the problem of the bad design. Instead we can try replacing LocationStep's use of NewArrayNodeSet with AVLTreeNodeSet, which doesn't sort constantly, as items are stored in a sorted tree manner. The query doc("/db/anchors.xml")/anchors/a[position() eq 9999]/preceding-sibling::node() which previously executed in 5076ms now executes in 20ms, an improvement of 25,380%! This is actually faster than the index based version, which makes me think that the index based version also needs some improvement!
I suspect that AVLTreeNodeSet has some performance/memory overhead that we don't even necessarily need. If I stay with NewArrayNodeSet and implement a HashMap inside PrecedingSiblingFilter#accept which tracks which nodes have been seen, so we don't need to call NewArrayNodeSet#get then I can get the performance for this query: doc("/db/anchors.xml")/anchors/a[position() eq 9999]/preceding-sibling::node() to about 11ms, i.e. a reduction in execution time of 46,145%.

I need to discuss these findings with @wolfgangmm but I think there is much room for improvement here. In summary:

Don't use NewArrayNodeSet when building unique sets... or maybe at all
When appending small nodes to it and then reading them backNewArrayNodeSet may benefit from a sort algorithm that can exploit it's already semi-sorted data.

I think ultimately we may need to implement some better NodeSets for various use-cases, or perhaps NodeSet's that operate in distinct builder and built phases.

to be continued...

adamretter on 23 Jul 2018

👍2

I would add that I see the same problems and possible improvements with the following-sibling as well, as it is implemented in the same way as preceding-sibling

adamretter on 23 Jul 2018

Thank you very much for looking into the problem, very promising findings...
I'm developing an advanced search application for a huge dictionary project and I build a lot of complex XQueries dynamically. So far my other major problem with filtering expressions has been that [1] is sometimes (usually when combined with a range index lookup, but I'm not sure about the exact circumstances) much slower than[position() = 1]. (Of course, I could always use the latter, more verbose variant.) This is probably unrelated to the current issue, but the performance degradation seems to be on a similar scale.

merenyics on 23 Jul 2018

I would like to add that text() is also problematic in my case, probably because I do have text nodes around the elements I'm filtering for, unlike your test case which has no text nodes at all (?)

merenyics on 23 Jul 2018

@merenyics about the [position() = 1] one thing, there was a problem with the old Saxon used in exist version 4, if you have access to exist 5.0.0RC could you try the same queries with it to see if there is still a difference in performance?

duncdrum on 23 Jul 2018

Sorry, right now I don't have existdb v5.0.0RC on my computer. I will install it and test my queries when I have the time, and will get back to you.

merenyics on 23 Jul 2018

👍1

@duncdrum The Saxon update won't help @merenyics here. We only use Saxon for XSLT. eXist-db has its own XQuery engine.

adamretter on 23 Jul 2018

There was a gitlab discussion many months ago about this issue:

https://gitlab.existsolutions.com/tei-publisher/tei-publisher-app/issues/92

I don't know if that is public, but maybe Adam can see it there. Wolfgang certainly can. I have copied in below the most relevant bits. The gist of it is that iterating over every node in the document for preceding and following (especially during recursion!) is a real performance nightmare. This is still causing a lot of trouble for the application I support.

From the GitLab discussion:

Craig Berry @craig commented 9 months ago

I finally have an explanation for the pathological performance, though not really a satisfactory solution. Our documents have word-level tagging but we do not store spaces between words. Spaces are inferred during processing. The predicate for when we insert a space after the word element "w" in our ODD looks like this:

not(@join = 'right' or @join = 'both') 
and not(following::*[1][self::w and (@join = 'left' or @join='both')]) 
and not(following::*[1][self::pc and not(@join='right')])
and not(parent::q/w[last()] = node()) 
and not(parent::quote/w[last()] = node())

That normally works just fine and we've had it in place for months with no serious problems. However, adding the retrieval of the related note to the processing of the ref element as above does not mix well with this. The reason I know it's the interaction of these two things is that if I change "following" to "following-sibling" in the above predicate, page next to a page with notes on it goes from many minutes to a couple of seconds.

following-sibling is not really correct, though, since words are not always at the same level in the hierarchy as the words or punctuation adjacent to them. I really do want following::*[1], i.e., the node immediately after the context node. And that normally works fine, but making it execute one level deeper in the recursion hierarchy (w within note retrieved while processing ref) makes things go haywire.

I know nothing about XPath implementations, but I would think following::* with a subscript would be the easiest of all possible expressions to implement. You just add the subscript to the enumeration of the context node in document order and that gives you the document-order number of the node you want -- no scanning or index lookups or anything.
Edited 9 months ago by Craig Berry

Craig Berry
Craig Berry @craig commented 7 months ago

I've confirmed the speculations in my previous post by looking at the getFollowing method in eXist's xquery implementation here.

It checks whether there is a positional predicate but, if I'm reading it correctly, then makes no use of the position when processing a wildcard path. Instead it iterates over every node in the entire document. So what should be the cheapest of all XPath expressions to evaluate is instead going to be very expensive. I'd be happy to be proven wrong about this, but even happier to see it fixed.

Wolfgang
Wolfgang @wolfgang commented 7 months ago
Owner

@craig yes, this is correct and I stumbled across the issue before. We need a way to make expressions like this terminate early. I'll try to come up with a fix.

craigberry on 25 Jul 2018

Apologies. I assumed this ticket would be about common flaws in wildards in combination with any before and after axis, but reading more carefully, it seems to be specific to preceding-sibling and following-sibling. Which surprises me, since those axes perform massively better than plain preceding and following. Let me know if I should open a separate issue.

craigberry on 26 Jul 2018

@craigberry I am not involved in tei-publisher and I don't have visibility of it, so I can't help you with things that are specific to that. However, it does look like you have identified an issue with eXist-db, which is something I might be able to help with :-)

I would agree that your question is not about the same issue that is detailed here. Could you please open a new issue and move your comments there. Feel free to reference this issue as having some "similarity". Also, if you open an issue for eXist-db please detail how to reproduce it in eXist-db terms, i.e. no references to tei-publisher, but rather just the simplest XML files and XQuery you can create to reproduce the problem you see. The simpler the data and query, and the easier it is to reproduce the quicker we can fix it.

adamretter on 26 Jul 2018

@craigberry The changes in this PR may help, it is more intelligent about sibling scans and overall does less work to get the same results: https://github.com/eXist-db/exist/pull/2113 please can you test?

If not, can you open a new issue succinctly describing your performance problem and how to reproduce (outside of the TEI Publisher).

adamretter on 13 Aug 2018

I compared the average of five trials of the following script in 4.3.1 and eXist-5.0.0-RC1-165-ge607599f4. The script basically starts from a known node in the middle of a largish document (getting to that point is not timed) and then times gathering the 5 preceding and 5 following nodes relative to that starting node. 4.3.1 averaged 0.439 seconds for the timed portion of the script, and the new development version 1.174 seconds, or 2.6 times worse than before. I will try to get another ticket open about that as that's exercising the following and preceding axes, but suffice it to say that gathering 11 nodes in sequence should not take 400+ milliseconds when you start from a known node reference in the middle of that sequence.

For the current ticket, I replaced following with following-sibling and preceding with preceding-sibling axes in the query below. In 4.3.1, I get 462 seconds and in 5.0.0-RC1-165-ge607599f4 I get 386 seconds. So at least it's an improvement, but still up to a thousand times slower than without the "sibling" part of the axes.

I'm thoroughly puzzled because in my real-world examples, it is the non-sibling versions of the axes that are at least an order of magnitude slower.

The development code also produced the following error in the exist log while running this query:

[Fatal Error] :-1:-1: Premature end of file.

Here is the query.

xquery version "3.1";

let $words := for $i in (1 to 50000) return <w xml:id="{$i}">{$i}</w>

let $x := if (file:exists('/db/words.xml')) then xmldb:remove('/db/' , 'words.xml') else ()

let $docname := xmldb:store('/db', 'words.xml', document {<words>{$words}</words>})

let $w := doc($docname)//w[@xml:id='25000']

let $start-time := util:system-time()
let $before := for $i in (1 to 5) return $w/preceding::*[5 + 1 - $i][self::w]/text()
let $word := '[' || $w/text() || ']'
let $after := for $i in (1 to 5) return $w/following::*[$i][self::w]/text()
let $seconds := (util:system-time() - $start-time) div xs:dayTimeDuration("PT1S")

let $string := string-join(($before, $word, $after), ' ')

return "Took " || $seconds || " seconds to produce '" || $string || "'"

craigberry on 21 Aug 2018

Do you have timings for 2.2 please?

adamretter on 22 Aug 2018

With the script posted above but the "-sibling" versions of the axes used, the xquery version changed from "3.1" to "3.0" and xml:id changed to id to get it to work with 2.2, I get:

Took 305.148 seconds to produce '24995 24996 24997 24998 24999 [25000] 25001 25002 25003 25004 25005'

so that is quite a bit faster than 4.3.1 and somewhat faster than the 5.x development branch.

All tests on mid-2014 MacBook Pro core i7 2.5GHz running macOS 10.13.6 in case that's of interest, which is doubtful.

craigberry on 23 Aug 2018

@merenyics can you confirm that the fixes in #2113 bring actually work for you.
@craigberry if they do I ll close this, and move discussion of following to #2129

duncdrum on 13 Jan 2019

@duncdrum were these fixes already present in version 4.4.0? If so, I'm afraid the performance is still horrible. If I need to check a more recent version, I'm afraid I won't be able to do so in the next few days, but I will get back to you with my findings as soon as I can.

Anyway, using 4.4.0 this is what I get:

In my real life application, I have a collection with a large number of dictionary entries.
The expression $coll//hint is evaluated in just 0.052s, giving me 24618 results
On the other hand $coll//hint[following-sibling::text()] takes 105.051s, giving around 20000 results,
and $coll//hint[following-sibling::*] takes 9.861s, giving 7981 results

I'm not sure if there's any improvement at all, in any case, I would expect execution times well under 1s.

merenyics on 14 Jan 2019

@duncdrum Sorry, I just realized that I have access to a server running "5.0.0-RC4". Unfortunately, performance is just as poor as on my local machine. Instead of ~105s, I did get ~50s execution time for the most demanding following-sibling::text() filter, but this is probably due to hardware and OS differences, and still several orders of magnitude beyond what seems reasonable.

merenyics on 14 Jan 2019

Was this page helpful?

0 / 5 - 0 ratings