Is related to issue #957
Describe the bug
Applying a reindex on a document by calling the XQuery function
xmldb:reindex($collection-uri, $doc-uri)
always adds a new document entry in a Lucene index.
I stepped through the eXist code with a debugger and found that
IndexController.removeCollection). IndexController.reindexing and removing the nodes to update before rewriting them in LuceneIndexWorker.write.
Regarding our use case:
We use the Lucene index to track references to a document from other documents:
<lucene>
    <module uri="http://awb.saw-leipzig.de/xquery/facet-utils" prefix="fu" at="xmldb:exist:////db/projects/awb/scripts/facet-utils.xq"/>
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    <text qname="task">
        <field name="numberAutoInstances" expression="fu:countInstancesPointingToTaskWithId(./@id)"/>
        <facet dimension="hasInstances" expression="fu:countInstancesPointingToTaskWithId(./@id) > 0"/>
    </text>
</lucene>
We trigger the reindex of a referenced file in a trigger method trigger:before-update-document($uri as xs:anyURI) on the referencing document.
As the screenshot of the Luke Index Browser shows, each time the reference was removed from and re-added to the referencing document, a new document entry with the same docNodeId was added. A plain call to xmldb:reindex($collection-uri, $doc-uri) also produces new entries.
Expected behavior
A reindex should update existing entries in the Lucene index.
Supposed fix
Call IndexController.setReindexing(true); in the call chain of reindexDocument(...) methods or use IndexWriter.updateDocument instead of IndexWriter.addDocument in the method LuceneIndexWorker.write.
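To make the difference between the two write strategies concrete, here is a minimal toy model in plain Java. It is only a sketch: it uses no eXist or Lucene classes, and `ToyIndex`, its methods, and the node id "1.1" are hypothetical stand-ins that mimic the append-only behaviour of IndexWriter.addDocument versus the delete-then-append behaviour of IndexWriter.updateDocument.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a Lucene index: a flat list of (docNodeId, facetValue) entries. */
class ToyIndex {
    /** One index entry; docNodeId plays the role of the key identifying the indexed node. */
    static final class Entry {
        final String docNodeId;
        final String facetValue;

        Entry(String docNodeId, String facetValue) {
            this.docNodeId = docNodeId;
            this.facetValue = facetValue;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Append-only write: a repeated call for the same node keeps the old entry. */
    void addDocument(String docNodeId, String facetValue) {
        entries.add(new Entry(docNodeId, facetValue));
    }

    /** Delete-then-append write: existing entries for the node are dropped first. */
    void updateDocument(String docNodeId, String facetValue) {
        entries.removeIf(e -> e.docNodeId.equals(docNodeId));
        entries.add(new Entry(docNodeId, facetValue));
    }

    int size() {
        return entries.size();
    }

    /** Reindexes the same (hypothetical) node several times and returns the entry count. */
    static int entriesAfterReindex(int times, boolean useUpdate) {
        ToyIndex index = new ToyIndex();
        for (int i = 0; i < times; i++) {
            if (useUpdate) {
                index.updateDocument("1.1", "bar");
            } else {
                index.addDocument("1.1", "bar");
            }
        }
        return index.size();
    }

    public static void main(String[] args) {
        System.out.println("append-only:   " + entriesAfterReindex(5, false));
        System.out.println("delete+append: " + entriesAfterReindex(5, true));
    }
}
```

In this model the append-only path grows by one entry per reindex, while the delete-then-append path stays at a single entry. Whether the real fix belongs in the reindexing flag or in the writer call is for the maintainers to decide.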
To Reproduce
Create the collection /db/projects/test with two documents (test1.xml and test2.xml in the query below):
<root id="1">LuceneTest1</root>
<root id="2"><child/>LuceneTest2</root>
Create /db/system/config/db/projects/test/collection.xconf for this collection:
<collection xmlns="http://exist-db.org/collection-config/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xmldb="http://exist-db.org/xquery/xmldb">
    <index>
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <text qname="root">
                <facet dimension="testFacet" expression="empty(./*)"/>
            </text>
        </lucene>
    </index>
</collection>
xquery version "3.1";
let $ri0 := xmldb:reindex('/db/projects/test')
let $ri1_1 := xmldb:reindex('/db/projects/test', 'test1.xml')
let $ri2_1 := xmldb:reindex('/db/projects/test', 'test1.xml')
let $ri1_2 := xmldb:reindex('/db/projects/test', 'test2.xml')
let $ri2_2 := xmldb:reindex('/db/projects/test', 'test2.xml')
let $options := map {
    'facets': map {
        'testFacet': ()
    }
}
let $results := collection('/db/projects/test')//root[ft:query(., (), $options)]
let $testFacet := ft:facets($results, 'testFacet', ())
return <result>{
    element facets {
        attribute dimension {'testFacet'},
        map:for-each($testFacet, function($label, $count) {
            element facet {
                attribute v {$label},
                attribute n {$count}
            }
        })
    }
}</result>
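The query reports a count of 3 for each facet value, although each value belongs to exactly one document:
<result>
    <facets dimension="testFacet">
        <facet v="true" n="3"/>
        <facet v="false" n="3"/>
    </facets>
</result>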
Additional context
Here is an XQSuite test that further simplifies the test supplied above. It demonstrates that each time we reindex using xmldb:reindex#2, the number of facet hits is incremented. However, when we use xmldb:reindex#1, the correct results are returned. Similarly, the expected number of hits is returned on a plain call to ft:query, regardless of which reindex function is called.
Thus, there is an issue with xmldb:reindex#2 and the count returned by ft:facets.
xquery version "3.1";
module namespace t="http://exist-db.org/xquery/test";
declare namespace test="http://exist-db.org/xquery/xqsuite";
declare variable $t:XML := document {
    <root>foo</root>
};
declare variable $t:xconf := <collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <text qname="root">
                <facet dimension="test-facet" expression="'bar'"/>
            </text>
        </lucene>
    </index>
</collection>;
declare
    %test:setUp
function t:setup() {
    let $testCol := xmldb:create-collection("/db", "test")
    let $indexCol := xmldb:create-collection("/db/system/config/db", "test")
    return
        (
            xmldb:store("/db/test", "test.xml", $t:XML),
            xmldb:store("/db/system/config/db/test", "collection.xconf", $t:xconf),
            xmldb:reindex("/db/test")
        )
};
declare
    %test:tearDown
function t:tearDown() {
    xmldb:remove("/db/test"),
    xmldb:remove("/db/system/config/db/test")
};
declare
    %test:assertEquals("1", "1", "1", "1", "1")
function t:facets-after-reindex-arity-2() {
    let $reindex := xmldb:reindex("/db/test")
    for $i in (1 to 5)
    let $hits := collection("/db/test")//root[ft:query(., ())]
    let $facets := ft:facets($hits, "test-facet")
    let $reindex-doc := xmldb:reindex("/db/test", "test.xml")
    return
        $facets?bar
};
declare
    %test:assertEquals("1", "1", "1", "1", "1")
function t:facets-after-reindex-arity-1() {
    let $reindex := xmldb:reindex("/db/test")
    for $i in (1 to 5)
    let $hits := collection("/db/test")//root[ft:query(., ())]
    let $facets := ft:facets($hits, "test-facet")
    let $reindex-col := xmldb:reindex("/db/test")
    return
        $facets?bar
};
declare
    %test:assertEquals("1", "1", "1", "1", "1")
function t:hits-after-reindex-arity-2() {
    let $reindex := xmldb:reindex("/db/test")
    for $i in (1 to 5)
    let $hits := collection("/db/test")//root[ft:query(., ())]
    let $reindex-doc := xmldb:reindex("/db/test", "test.xml")
    return
        count($hits)
};
declare
    %test:assertEquals("1", "1", "1", "1", "1")
function t:hits-after-reindex-arity-1() {
    let $reindex := xmldb:reindex("/db/test")
    for $i in (1 to 5)
    let $hits := collection("/db/test")//root[ft:query(., ())]
    let $reindex-col := xmldb:reindex("/db/test")
    return
        count($hits)
};
This test suite returns the following results:
<testsuite package="http://exist-db.org/xquery/test" timestamp="2021-07-25T16:19:46.77-04:00"
    tests="4" failures="1" errors="0" pending="0" time="PT0.207S">
    <testcase name="facets-after-reindex-arity-1" class="t:facets-after-reindex-arity-1"/>
    <testcase name="facets-after-reindex-arity-2" class="t:facets-after-reindex-arity-2">
        <failure message="assertEquals failed." type="failure-error-code-1">1 1 1 1 1</failure>
        <output>1 2 3 4 5</output>
    </testcase>
    <testcase name="hits-after-reindex-arity-1" class="t:hits-after-reindex-arity-1"/>
    <testcase name="hits-after-reindex-arity-2" class="t:hits-after-reindex-arity-2"/>
</testsuite>
I used eXist 5.3.0 1934cd7cd0c0ff3decac0b770969cab435409e52 20210626123843.
"Thus, there is an issue with xmldb:reindex#2 and the count returned by ft:facets."
I don't think there is an issue with ft:facets. It works as expected in the sense that it counts all indexed objects. The problem is that the index contains more objects than it should, because xmldb:reindex#2 only adds new objects to the index without removing the existing ones.
The test could also be implemented (or extended) using ft:field, for example:
1. Document: <root><foo>bar</foo></root>
2. Index field: <field name="foo" expression=".//foo"/>
3. xmldb:reindex#1
4. //root[ft:query(., 'foo:bar')] → 1 result; ft:field($result, 'foo') → bar
5. //root[ft:query(., 'foo:foo')] → 0 results
6. Change <foo> from 'bar' to 'foo'
7. xmldb:reindex#2
8. //root[ft:query(., 'foo:bar')] → 1 result (expected: 0); ft:field($result, 'foo') → bar
9. //root[ft:query(., 'foo:foo')] → 1 result; ft:field($result, 'foo') → foo
This is because both states were indexed and saved as separate index objects (added without deleting the existing record), and they still coexist in the index. The result set built by ft:query apparently deduplicates, because a search for objects containing either 'foo' or 'bar' in the field foo also returns only one record.
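The explanation above can be pictured with a small toy model in plain Java. This is a sketch of the suspected behaviour, not eXist or Lucene code: `StaleEntryModel`, its methods, and the node id "1.1" are hypothetical, and the deduplication by node id is an assumption about how ft:query assembles its result set.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model: index entries are appended on reindex and never deleted. */
class StaleEntryModel {
    /** One index entry: the node it belongs to and the value of the field "foo". */
    static final class Entry {
        final String nodeId;
        final String foo;

        Entry(String nodeId, String foo) {
            this.nodeId = nodeId;
            this.foo = foo;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Append-only write, as suspected for xmldb:reindex#2. */
    void append(String nodeId, String foo) {
        entries.add(new Entry(nodeId, foo));
    }

    /** Raw number of index entries, which is what a facet count would see. */
    int entryCount() {
        return entries.size();
    }

    /** Hits as distinct nodes whose field matches any of the given values. */
    int distinctHits(String... values) {
        Set<String> nodes = new HashSet<>();
        for (Entry e : entries) {
            for (String v : values) {
                if (e.foo.equals(v)) {
                    nodes.add(e.nodeId);
                }
            }
        }
        return nodes.size();
    }

    /** Replays the steps from the list: index 'bar', change to 'foo', reindex by appending. */
    static StaleEntryModel afterAppendOnlyReindex() {
        StaleEntryModel model = new StaleEntryModel();
        model.append("1.1", "bar"); // initial state, indexed once
        model.append("1.1", "foo"); // changed state, added without deleting the old entry
        return model;
    }

    public static void main(String[] args) {
        StaleEntryModel model = afterAppendOnlyReindex();
        System.out.println("entries in index:   " + model.entryCount());
        System.out.println("hits for foo:bar:   " + model.distinctHits("bar"));
        System.out.println("hits for foo:foo:   " + model.distinctHits("foo"));
        System.out.println("hits for either:    " + model.distinctHits("foo", "bar"));
    }
}
```

Under these assumptions the stale 'bar' entry still matches, both queries return one node, and a query for either value returns one node as well, while the raw entry count is two.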