Exist: Dramatically slower performance when using positional predicate on collection

Created on 26 Jun 2020 · 7Comments · Source: eXist-db/exist

What is the problem

Trying to subset my collection of JATS article metadata leads to a response in 20/40 seconds (eXist-db 4.7.1, resp. Linux/Windows), whereas the non-subset collection results are available in fractions of a second.

What did you expect

a subset to process (almost) as quickly as the original set.

Describe how to reproduce or add a test

Describe how we can can reproduce the problem.

With the collection posted here, referenced as '/db/meta' in the example below:

let $contrib as element(contrib)+ := collection( '/db/meta' )//contrib[1]
return
    count($contrib)

This returns a count of all first contributors, but it is slow.
If I remove the [1] the query is much faster (but it returns all contributors, which is not what I want!).

Context information

Please always add the following information

eXist-db version 4.7.1
Java version 1.8.0.251
Operating system (Windows 2016 and Ubuntu 18.04)
64 bit
How is eXist-db installed? JAR installer
Any custom changes in e.g. conf.xml: memory, data location, global cache, iprange.

bug triage performance

Source

PieterLamers

Most helpful comment

I stepped through the execution with the Java debugger and found that...

For the query without the [1] predicate, eXist-db does a single call to findDescendantsByTagName against the Structural Index, matching all Elements that have the QName LastName.

For the version with the predicate, the same as the above happens, but then the predicate has to be applied to the results from the Structural Index. To filter by the predicate, for the set of elements, a new set of all parent nodes is created. For each parent, the set of elements from the structural index is then filtered to find descendants of the parent, the resultant set then. The resultant set is then filtered by the first element (the [1] predicate) in that set. Unfortunately the approach taken here is rather complex and could likely be improved, but it is not a simple matter.

As the query without the [1] predicate can be answered directly from the Structural Index, it is much faster!

adamretter on 10 May 2021

👍2

All 7 comments

I can confirm that the same performance variation exists on 5.3.0-snapshot 55e77cc52f8555dda5ed058ef98f283d69a72a8e

PieterLamers on 27 Apr 2021

👎1

@adamretter the link I posted in the description above is lo longer working because of the Slack message limit. However, the collection I am referring to is available to you on our test server at /db/data/shared.benjamins.com/meta

PieterLamers on 3 May 2021

On eXist-db 5.3.0-SNAPSHOT with a single 257 MB PubMed document I can confirm that a query like:

let $contrib as element(LastName)+ := doc( '/db/pubmed_result-257m.xml' )//LastName[1]
return
    count($contrib)

Executes in ~500ms and returns a count of 123,226 elements.

Whereas the query without the [1] predicate:

let $contrib as element(LastName)+ := doc( '/db/pubmed_result-257m.xml' )//LastName
return
    count($contrib)

Executes in ~25ms and returns the same count of 123,226 elements.

adamretter on 10 May 2021

I stepped through the execution with the Java debugger and found that...

For the query without the [1] predicate, eXist-db does a single call to findDescendantsByTagName against the Structural Index, matching all Elements that have the QName LastName.

As the query without the [1] predicate can be answered directly from the Structural Index, it is much faster!

adamretter on 10 May 2021

👍2

@adamretter based on your explanations I have changed my query as follows:


let $contrib-group as element(contrib-group)+ := collection( '/db/data/shared.benjamins.com/meta' )//contrib-group
let $contrib as element(contrib)+ := $contrib-group ! ./contrib[1]
return
    count($contrib)

and this returns a result in just over a second, which is absolutely fine.
Do note that the following does not perform well at all (have not yet got a result):

let $contrib as element(contrib)+ := $contrib-group/contrib[1]

I am not sure whether it should return the same result though.

PieterLamers on 14 May 2021

It did return the same result:

PieterLamers on 14 May 2021

I will close the issue as I now think it is more of a feature than it is a bug, given Adam's explanations. I think we would need to add a few lines of documentation on this topic, stating that querying //node() is not always a good idea, esp when you need node's parent.

PieterLamers on 1 Jun 2021

Was this page helpful?

0 / 5 - 0 ratings