Exist: [BUG] Unicode characters beyond the BMP get distorted in conversion from node to string values

Created on 5 Dec 2020 · 10 Comments · Source: eXist-db/exist

Describe the bug
Characters beyond the BMP, in my case Chinese characters such as "𨛌", can be saved and retrieved as character content of XML documents without problems. However, when they are processed in XQuery where an implicit string conversion occurs, or when the text content is retrieved with the XQuery text() node test, the text gets distorted.

Expected behavior
The returned content should always be the character content of the document.

To Reproduce
The following query, when pasted into eXide and run, demonstrates the problem:
=== cut here ===

xquery version "3.1";


let
 $node := <graph>𨛌</graph>
, $savedoc := xmldb:store("/db", "temp.xml", $node)
, $doc := doc("/db/temp.xml")
 return 
<res>
    <node>{$node},{$node/text()},{string-length($node)}</node>
    <doc>{$doc},{$doc/text()},{string-length($doc)}</doc>
</res>

=== cut here ===

The result I receive is:

<res>
    <node>
        <graph>𨛌</graph>,𨛌,1</node>
    <doc>
        <graph>𨛌</graph>,,2</doc>
</res>

The expected result would be:

<res>
    <node>
        <graph>𨛌</graph>,𨛌,1</node>
    <doc>
        <graph>𨛌</graph>,𨛌,1</doc>
</res>

Note: the two-character string is not shown in eXide, but the string-length is reported as 2. In actual usage in a web application, the characters �� are frequently observed, i.e. two characters with the Unicode codepoint U+FFFD.

Context (please always complete the following information):

  • OS: Ubuntu 20.04
  • eXist-db version: 5.2.0
  • Java Version : openjdk 11.0.9.1 2020-11-04

Additional context

  • How is eXist-db installed? eXist is installed by expanding the distribution tarball
  • Any custom changes in e.g. conf.xml?
    No changes except index parameters, triggers for backup etc.

All 10 comments

Could you please attach the files, and maybe some screenshots of the documents? On the web and on my Mac the characters are not rendered correctly.

Even if you can't see the correct character, you should be able to observe that the character is not displayed when read from the database, yet it has a string-length of 2 instead of 1. Anyway, here is a screenshot from my system which might be easier to understand. The arrow marks the problem.
[Screenshot: exist-astral-unicode-bug]

Sounds like a regression against https://github.com/eXist-db/exist/issues/780, no?
@cwittern can you test this with 5.3.0-SNAPSHOT? A potential fix has been merged but is not part of 5.2.0.

Dear @duncdrum ,
Thanks for the heads up, I vaguely remembered that other issue, but could not find it.

Anyway, I have now tested with the latest snapshot (exist-distribution-5.3.0-SNAPSHOT-unix+20201210035501 to be precise) and can confirm the bug is still there.
[Screenshot from 2020-12-11 16-25-26]
The only change is that the string-length reported is now 1, as expected, not erroneously 2 as before. However, the character is still not showing.

OK, so progress in 5.3: the hidden conversion of multi-byte characters into components is fixed, and there is one codepoint as expected.

Following #780, we now need to check whether that codepoint is U+FFFD, which would mean all the character info is lost, or whether it is indeed U+286CC and we just have a display problem with fonts, like here on GitHub.
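One way to make that check independent of fonts is to look at the codepoints rather than the rendered glyph. The following is a minimal sketch, assuming the reproduction document from the report is still stored at /db/temp.xml; the element names in the result and the verdict strings are made up for illustration.

```xquery
xquery version "3.1";

(: Sketch only: assumes the reproduction document exists at /db/temp.xml.
   165580 is the decimal value of U+286CC, 65533 is U+FFFD. :)
let $text := string(doc("/db/temp.xml")/graph)
let $cps := string-to-codepoints($text)
return
    <check>
        <length>{string-length($text)}</length>
        <codepoints>{string-join($cps ! string(.), " ")}</codepoints>
        <verdict>{
            if ($cps = 65533) then
                "contains U+FFFD: character data was lost"
            else if (deep-equal($cps, (165580))) then
                "U+286CC intact: likely a font or display problem"
            else
                "unexpected content"
        }</verdict>
    </check>
```

Because the result lists the codepoints as decimal numbers, it can be read correctly even in an environment that cannot render the character itself.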

@cwittern could you take a look at exist-core/src/test/xquery/unicode.xqm and add a string-length test, which should pass? The existing tests suggest this might be a font problem. It's also possible that you have encountered another instance of the old bug by calling a function that still uses old code. If that's the case, can you add a test to the existing test suite that reproduces this?
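A string-length test of the kind requested could look roughly like the XQSuite sketch below. The module namespace, the function names, and the astral-test.xml resource name are hypothetical and are not taken from the existing unicode.xqm; only the XQSuite annotations and the xmldb functions are standard eXist-db facilities.

```xquery
xquery version "3.1";

(: Hypothetical module namespace and function names, for illustration only. :)
module namespace astral = "http://example.org/xquery/test/astral";

import module namespace xmldb = "http://exist-db.org/xquery/xmldb";

declare namespace test = "http://exist-db.org/xquery/xqsuite";

(: Store a document whose only character is U+286CC before the tests run. :)
declare
    %test:setUp
function astral:setup() {
    xmldb:store("/db", "astral-test.xml", <graph>&#x286CC;</graph>)
};

(: Remove the stored document after the tests have run. :)
declare
    %test:tearDown
function astral:cleanup() {
    xmldb:remove("/db", "astral-test.xml")
};

(: An in-memory element should report a single character. :)
declare
    %test:assertEquals(1)
function astral:string-length-in-memory() {
    string-length(<graph>&#x286CC;</graph>)
};

(: The stored document should report a single character as well. :)
declare
    %test:assertEquals(1)
function astral:string-length-stored() {
    string-length(doc("/db/astral-test.xml")/graph)
};

(: The stored codepoint should be U+286CC (decimal 165580), not U+FFFD. :)
declare
    %test:assertEquals(165580)
function astral:codepoint-stored() {
    string-to-codepoints(doc("/db/astral-test.xml")/graph)
};
```

Asserting on the codepoint as well as the length separates the two failure modes discussed above: a length of 2 points to the character being split, while a length of 1 with codepoint 65533 points to the character being replaced.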

@adamretter this was one of your fixes. Any ideas why this might still be a problem here?

Well, I gave it a try, but something seems to be amiss. No idea. @duncdrum, you should be able to spot this easily.

To properly compare the results of in-memory and on-disk, we would need to wrap the <graph> element in a document node. Otherwise, $node is a graph element and $doc is a document node containing a graph element, so of course the /text() node test would return an empty sequence when performed on a document node.
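The difference between the two context nodes can be seen without touching the database at all. This is a small in-memory sketch (the variable and result element names are chosen for illustration) that counts the text nodes found by each path:

```xquery
xquery version "3.1";

(: $elem is a graph element; $docnode is a document node wrapping a copy of it. :)
let $elem := <graph>&#x286CC;</graph>
let $docnode := document { $elem }
return
    <res>
        <!-- the element's child text node: 1 -->
        <from-element>{count($elem/text())}</from-element>
        <!-- a document node has no text node children here: 0 -->
        <from-document>{count($docnode/text())}</from-document>
        <!-- descending through graph reaches the text node: 1 -->
        <from-document-descending>{count($docnode/graph/text())}</from-document-descending>
    </res>
```

The empty result for `$docnode/text()` is standard XPath behavior for a document node whose only child is an element, so it says nothing about how the character itself is handled.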

Here is a version that implements the changes showing that eXist is working correctly - (1) wrap the <graph> element in a document node and (2) update the node tests to descend from the document nodes in $node and $doc to reach the <graph> element's child text node:

xquery version "3.1";

let
 $node := document {<graph>𨛌</graph>}
, $savedoc := xmldb:store("/db", "temp.xml", $node)
, $doc := doc("/db/temp.xml")
 return 
<res>
    <node>{$node},{$node/graph/text()},{string-length($node)}</node>
    <doc>{$doc},{$doc/graph/text()},{string-length($doc)}</doc>
</res>

In 5.3.0-SNAPSHOT, this query returns the correct results:

<res>
    <node>
        <graph>𨛌</graph>,𨛌,1</node>
    <doc>
        <graph>𨛌</graph>,𨛌,1</doc>
</res>

Thus, as far as I can see, eXist is performing correctly and as expected.

Yup, I must have accidentally reproduced this on 5.2. Using a document-node constructor on 5.3 returns the correct results. I can no longer reproduce it; it looks like it's fixed in 5.3.0-SNAPSHOT and our tests are working.

Excellent, then I'll close this. Of course, if we find an issue, please report it!

Great, good to know that this is fixed!
