Exist: Dramatically slower performance when using positional predicate on collection

Created on 26 Jun 2020  路  7Comments  路  Source: eXist-db/exist

What is the problem

Trying to subset my collection of JATS article metadata leads to a response in 20/40 seconds (eXist-db 4.7.1, resp. Linux/Windows), whereas the non-subset collection results are available in fractions of a second.

What did you expect

a subset to process (almost) as quickly as the original set.

Describe how to reproduce or add a test

Describe how we can can reproduce the problem.

With the collection posted here, referenced as '/db/meta' in the example below:

let $contrib as element(contrib)+ := collection( '/db/meta' )//contrib[1]
return
    count($contrib)
  • This returns a count of all first contributors, but it is slow.
  • If I remove the [1] the query is much faster (but it returns all contributors, which is not what I want!).

Context information

Please always add the following information

  • eXist-db version 4.7.1
  • Java version 1.8.0.251
  • Operating system (Windows 2016 and Ubuntu 18.04)
  • 64 bit
  • How is eXist-db installed? JAR installer
  • Any custom changes in e.g. conf.xml: memory, data location, global cache, iprange.
bug triage performance

Most helpful comment

I stepped through the execution with the Java debugger and found that...

For the query without the [1] predicate, eXist-db does a single call to findDescendantsByTagName against the Structural Index, matching all Elements that have the QName LastName.

For the version with the predicate, the same as the above happens, but then the predicate has to be applied to the results from the Structural Index. To filter by the predicate, for the set of elements, a new set of all parent nodes is created. For each parent, the set of elements from the structural index is then filtered to find descendants of the parent, the resultant set then. The resultant set is then filtered by the first element (the [1] predicate) in that set. Unfortunately the approach taken here is rather complex and could likely be improved, but it is not a simple matter.

As the query without the [1] predicate can be answered directly from the Structural Index, it is much faster!

All 7 comments

I can confirm that the same performance variation exists on 5.3.0-snapshot 55e77cc52f8555dda5ed058ef98f283d69a72a8e

@adamretter the link I posted in the description above is lo longer working because of the Slack message limit. However, the collection I am referring to is available to you on our test server at /db/data/shared.benjamins.com/meta

On eXist-db 5.3.0-SNAPSHOT with a single 257 MB PubMed document I can confirm that a query like:

let $contrib as element(LastName)+ := doc( '/db/pubmed_result-257m.xml' )//LastName[1]
return
    count($contrib)

Executes in ~500ms and returns a count of 123,226 elements.


Whereas the query without the [1] predicate:

let $contrib as element(LastName)+ := doc( '/db/pubmed_result-257m.xml' )//LastName
return
    count($contrib)

Executes in ~25ms and returns the same count of 123,226 elements.

I stepped through the execution with the Java debugger and found that...

For the query without the [1] predicate, eXist-db does a single call to findDescendantsByTagName against the Structural Index, matching all Elements that have the QName LastName.

For the version with the predicate, the same as the above happens, but then the predicate has to be applied to the results from the Structural Index. To filter by the predicate, for the set of elements, a new set of all parent nodes is created. For each parent, the set of elements from the structural index is then filtered to find descendants of the parent, the resultant set then. The resultant set is then filtered by the first element (the [1] predicate) in that set. Unfortunately the approach taken here is rather complex and could likely be improved, but it is not a simple matter.

As the query without the [1] predicate can be answered directly from the Structural Index, it is much faster!

@adamretter based on your explanations I have changed my query as follows:


let $contrib-group as element(contrib-group)+ := collection( '/db/data/shared.benjamins.com/meta' )//contrib-group
let $contrib as element(contrib)+ := $contrib-group ! ./contrib[1]
return
    count($contrib)

and this returns a result in just over a second, which is absolutely fine.
Do note that the following does not perform well at all (have not yet got a result):

let $contrib as element(contrib)+ := $contrib-group/contrib[1]

I am not sure whether it should return the same result though.

It did return the same result:
image

I will close the issue as I now think it is more of a feature than it is a bug, given Adam's explanations. I think we would need to add a few lines of documentation on this topic, stating that querying //node() is not always a good idea, esp when you need node's parent.

Was this page helpful?
0 / 5 - 0 ratings