Exist: incorrect "Invalid byte 2 of 4-byte UTF-8 sequence" error on import

Created on 30 Apr 2020 · 15 Comments · Source: eXist-db/exist

Describe the bug
When importing the attached UTF-8-encoded file by clicking on the "Stores one or more..." button in the Java Admin client, I get the following error:

org.xml.sax.SAXParseException; systemId: file:///var/folders/gb/9qzz8hdm8xjf6c00r6_l9z000000gp/T/exist-db-temp-file-manager-2284598656963329848/exist-db-temp-1832595209426251608.tmp; lineNumber: 29146; columnNumber: 16; Invalid byte 2 of 4-byte UTF-8 sequence.

So I assumed the error was correct and set about trying to identify the invalid character. But I ran the file through various checkers, including various regexes I found online, like this:

perl -l -ne '/
 ( ([\x00-\x7F])              # 1-byte pattern
   |([\xC2-\xDF][\x80-\xBF])   # 2-byte pattern
   |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
   |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))       # 4-byte pattern
  )/x or print'

and both of Adam Retter's utf8 validators:

https://github.com/digital-preservation/utf8-validator
https://github.com/adamretter/utf8-validator-c

The file checked out OK with every tool I could find. It does have a number of Unicode characters, but all of them appear to be valid.
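For anyone who wants to repeat this kind of check without external tools, here is a minimal Java sketch that uses the JDK's own UTF-8 decoder in strict mode. It is not taken from either validator linked above; the class name `Utf8Check` and the method `isValidUtf8` are made up for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only.
class Utf8Check {

    /** Returns true if the whole byte array decodes as UTF-8 without errors. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Utf8Check path/to/file.xml
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(bytes) ? "valid UTF-8" : "invalid UTF-8");
    }
}
```

Because this goes through the JVM's standard decoder rather than the parser's own reader, a file that passes here but fails in the parser points at the parser rather than at the data.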

I then tried to see if there was anything odd about the line 29146 / column 16 mentioned in the error message. There are no non-ASCII characters on that line or even nearby. I deleted some insignificant whitespace on the preceding line and it moved the reported column, then deleted more insignificant whitespace earlier in the file and it moved the reported line to the next line even though I had not changed the number of lines or the significant content of any line. So whatever else is wrong, the line/column part of the error message is meaningless.

I was unfamiliar with the org.xml.sax interface and thought it might be an issue with that rather than eXist as such. So I built the ancient program at http://people.apache.org/~edwingo/jaxp-ri-1.2.0-fcs/samples/SAXLocalNameCount/SAXLocalNameCount.java which uses that interface and ran the file through that. It worked fine -- no parsing errors.

So the problem seems to be specific to what eXist is doing when importing files. The first thing that occurred to me was that it's possibly feeding a byte stream to a parser that expects characters and gets incomplete characters on a buffer boundary. But that's wild speculation and I would expect a lot more problems if that were the case.
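The buffer-boundary idea can be illustrated in isolation. The sketch below is not eXist or Xerces code and makes no claim about what either actually does; it only shows, with made-up names (`ChunkedUtf8Decode`, `decodeEachChunkAlone`, `decodeWithCarryOver`), how a multi-byte character that straddles two reads is lost when each chunk is decoded on its own, and how a single stateful decoder avoids that by carrying the incomplete tail over to the next read.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class ChunkedUtf8Decode {

    /** Faulty approach: every chunk is decoded alone, so a split character is damaged. */
    static String decodeEachChunkAlone(byte[] input, int chunkSize) {
        if (chunkSize < 1) throw new IllegalArgumentException("chunkSize must be positive");
        StringBuilder sb = new StringBuilder();
        for (int pos = 0; pos < input.length; pos += chunkSize) {
            int n = Math.min(chunkSize, input.length - pos);
            // new String(...) silently replaces incomplete sequences with U+FFFD.
            sb.append(new String(input, pos, n, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Sound approach: one decoder for the whole stream; undecoded tail bytes are kept. */
    static String decodeWithCarryOver(byte[] input, int chunkSize) {
        if (chunkSize < 1) throw new IllegalArgumentException("chunkSize must be positive");
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // One chunk plus up to 3 bytes of an incomplete sequence from the previous chunk.
        ByteBuffer in = ByteBuffer.allocate(chunkSize + 3);
        CharBuffer out = CharBuffer.allocate(chunkSize + 3);
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        boolean last;
        try {
            do {
                int n = Math.min(chunkSize, input.length - pos);
                in.put(input, pos, n);
                pos += n;
                last = (pos == input.length);
                in.flip();
                CoderResult result;
                do {
                    result = decoder.decode(in, out, last);
                    if (result.isError()) result.throwException();
                    out.flip();
                    sb.append(out);
                    out.clear();
                } while (result.isOverflow());
                in.compact(); // keep any incomplete trailing sequence for the next round
            } while (!last);
            decoder.flush(out);
            out.flip();
            sb.append(out);
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "a" + new String(Character.toChars(0x1F702)) + "b";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeEachChunkAlone(bytes, 3).equals(text));
        System.out.println(decodeWithCarryOver(bytes, 3).equals(text));
    }
}
```

With a chunk size of 3, the 4-byte character is split across two reads, so the first method returns replacement characters while the second reproduces the original text.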

Expected behavior
Allow a valid file with various widths of UTF-8 characters to be imported without error.

To Reproduce
Simply import the attached file into any collection you like.
A35965.zip

Context (please always complete the following information):

OS: macOS 10.15.4
eXist-db version: 5.2.0
Java Version: 1.8.0_252-b09 (AdoptOpenJDK)(build 25.252-b09, mixed mode)

Additional context

How is eXist-db installed? DMG
Any custom changes in e.g. conf.xml? No

triage

craigberry

Most helpful comment

Nothing interesting is visible with a hex editor (no non-ASCII characters in the vicinity as I reported before). Multiple validating parsers have no problem validating the file against its schema, which I don't think would happen if there were valid UTF-8 characters that were not valid in XML.

I have a slightly whittled-down version of the file that so far is the minimum I've found to produce the error, but pretty much any line I delete from the file makes the error go away. Here's what that reduced file gets me with Xerces-J using the method Adam demonstrated for me:

$ java -classpath .:../xercesImpl.jar sax.Counter ../../tmp.xml
[Fatal Error] tmp.xml:29147:6: Invalid byte 2 of 4-byte UTF-8 sequence.

But the equivalent program from Xerces-C works fine with the exact same XML file:

wget https://downloads.apache.org/xerces/c/3/sources/xerces-c-3.2.3.zip
unzip xerces-c-3.2.3.zip
cd xerces-c-3.2.3
cmake .
make
samples/SAXCount -f  ../tmp.xml
../tmp.xml: 70 ms (28161 elems, 77892 attrs, 0 spaces, 306104 chars)

So I really think this is a bug in Xerces-J, and after much hunting I finally found XERCESJ-1668 in the Xerces-J bug tracker, which to me sounds like the same problem. Unfortunately, the bug was originally reported 13 years ago, patched, closed, reverted, reopened, and now the current bug has been sitting there with a different patch unapplied for 4 1/2 years. So despite being tagged "Major" it doesn't seem to be a high priority bug.

So, I guess it's not an eXist problem as such but it does prevent me from loading a particular file into eXist. It doesn't seem to make a lot of sense for eXist to maintain its own branch of Xerces-J with the patch in the Xerces-J ticket applied, although apparently some people have done this. If anyone here can think of a workaround, or if anyone here is involved in Xerces-J development and can upvote that issue, please step up. I will leave this ticket open just a bit for any further comments.

craigberry on 5 May 2020

All 15 comments

@craigberry I remember some tweaks I had to make related to UTF-8 in the Docker images. Can you try to run this with a Docker image? If you don't get the same error, I might have an idea where the problem is.

duncdrum on 4 May 2020

On May 3, 2020, at 5:11 PM, Duncan Paterson notifications@github.com wrote:

@craigberry I remember some tweaks I had to make related to UTF-8 in the Docker images. Can you try to run this with a Docker image? If you don't get the same error, I might have an idea where the problem is.

How do you run the Java Admin Client from a Docker instance of eXist?

craigberry on 4 May 2020

Never mind. I ran the following from the command line

$ /Applications/eXist-db.app/Contents/Resources/bin/client.sh

to get the Java Admin client from my 5.2.0 installation to connect to the running Docker instance of eXist 5.3.0-SNAPSHOT 2103fac0a4a46b9876186092a659668bebc1d39f 20200502132126. The error message when uploading the file is exactly the same, which it would pretty much have to be unless the Docker eXist had a completely different implementation of file importing than any other method of building eXist.

Now to figure out how to uninstall Docker and get a few gigabytes back.

craigberry on 4 May 2020

@craigberry I think @duncdrum likely suggested Docker because its environment is strictly configured for UTF-8. An incorrect LANG environment variable could be the cause of your issue. Can you tell me what LANG is set to on your machine running eXist-db?

adamretter on 4 May 2020

So it seems that the problem is occurring within the Apache Xerces XML parser that we use:

Caused by: org.xml.sax.SAXParseException; systemId: file:///tmp/A35965.xml; lineNumber: 29146; columnNumber: 16; Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    ... 31 more
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.skipSpaces(Unknown Source)
    at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
    at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
    ... 28 more

Debugging issues with Xerces is quite the pain, as the versions of the libraries they publish have the debug information stripped from them (for better performance).

adamretter on 4 May 2020

This is not an eXist-db issue. It is either an issue with Xerces or with the XML input file.

To reproduce the problem with Xerces, save Craig's XML file to `/tmp/A35965.xml` and then run:

$ cd /tmp
$ wget https://apache.panu.it//xerces/j/binaries/Xerces-J-bin.2.12.1.tar.gz
$ tar zxvf Xerces-J-bin.2.12.1.tar.gz
$ cd xerces-2_12_1/samples/sax
$ javac Counter.java
$ cd ..
$ java -classpath .:/tmp/xerces-2_12_1/xercesImpl.jar sax.Counter /tmp/A35965.xml

You will get the error:

[Fatal Error] A35965.xml:29146:16: Invalid byte 2 of 4-byte UTF-8 sequence.

Now this is either a problem in Xerces, or a problem with the XML. The Xerces FAQ suggests this - https://xerces.apache.org/xerces2-j/faq-common.html#faq-2

@craigberry can you follow the Xerces FAQ please and use a hex editor to rule out any non-printable invalid characters?

adamretter on 4 May 2020

can you tell me what LANG is set to on your machine running eXist-db?

LANG=en_US.UTF-8

craigberry on 4 May 2020

LANG=en_US.UTF-8

Okay that is fine. Please see the Xerces stuff.

adamretter on 4 May 2020

Thanks. I can reproduce the problem using Xerces. It will take some head-scratching to figure out where it's going wrong. If I strip all lines from the XML file containing only ASCII characters like so:

$ perl -ne 'print unless $_ =~ m/^[\x00-\x7f]+$/;' < tmp.xml > tmp2.xml

then add back an XML declaration and wrapper tag to get well-formedness, I get a file of about 1800 lines, all of which have Unicode characters and look something like this:

      <w lemma="☿" pos="sy" xml:id="A35965-072-b-0190">☿</w>
      <w lemma="🜂" pos="n1" xml:id="A35965-072-b-0690">🜂</w>
      <w lemma="☉" pos="sy" xml:id="A35965-072-b-0960">☉</w>
      <w lemma="☽" pos="sy" xml:id="A35965-072-b-0980">☽</w>
      <w lemma="☿" pos="sy" xml:id="A35965-072-b-1100">☿</w>

If I then run this reduced file containing only lines with funny characters through the Xerces counter program:

$ java -classpath .:../xercesImpl.jar sax.Counter ../../tmp2.xml
../../tmp2.xml: 25 ms (1836 elems, 5524 attrs, 0 spaces, 15557 chars)

All is well. So whatever it's complaining about appears to be something other than what's really bothering it. I guess I will go back to the original file and keep chopping it in half until the error goes away and see if I can spot what sequence is bothering it.

craigberry on 4 May 2020

@craigberry I would use a hex editor to see if there are any non-visible (and unexpected) byte sequences around the offset that Xerces complains about. It is also worth remembering that not all UTF-8 characters are valid for use in XML.
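As a rough way to check the second point, the XML 1.0 `Char` production allows only #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. The sketch below tests a decoded string against those ranges; the names `XmlChars`, `isXml10Char` and `firstInvalidXmlChar` are made up for illustration, and XML 1.1 uses different rules that are not covered here.

```java
// Hypothetical helper, for illustration only.
class XmlChars {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the char index of the first code point not allowed in XML 1.0, or -1. */
    static int firstInvalidXmlChar(String text) {
        for (int i = 0; i < text.length(); ) {
            // An unpaired surrogate comes back as its own value and is rejected.
            int cp = text.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        String fire = new String(Character.toChars(0x1F702));
        System.out.println(firstInvalidXmlChar("<w>" + fire + "</w>"));
        System.out.println(firstInvalidXmlChar("bad\u0000char"));
    }
}
```

Supplementary characters such as U+1F702 pass this check, so by itself a 4-byte UTF-8 sequence is not a reason for an XML parser to reject a file.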

adamretter on 4 May 2020

$ java -classpath .:../xercesImpl.jar sax.Counter ../../tmp.xml
[Fatal Error] tmp.xml:29147:6: Invalid byte 2 of 4-byte UTF-8 sequence.

But the equivalent program from Xerces-C works fine with the exact same XML file:

wget https://downloads.apache.org/xerces/c/3/sources/xerces-c-3.2.3.zip
unzip xerces-c-3.2.3.zip
cd xerces-c-3.2.3
cmake .
make
samples/SAXCount -f  ../tmp.xml
../tmp.xml: 70 ms (28161 elems, 77892 attrs, 0 spaces, 306104 chars)

craigberry on 5 May 2020

@craigberry I think the best thing to do would be for you to try the patch provided in XERCESJ-1668, to see if that fixes your problem.

If it does, I would suggest signing up to their JIRA and adding a comment to the issue with your file that reproduces it, as your file is very small compared to the one posted in that issue.

From there we can decide whether to fork or try to get it upstreamed; Xerces has been a little more active recently than in past years.

adamretter on 5 May 2020

Thanks, @adamretter. I can confirm that applying the patch, rebuilding, and running against the built version of xercesImpl.jar gets past the error when using the sax.Counter program:

java -classpath .:../tools/xercesImpl.jar sax.Counter tmp.xml
[Fatal Error] tmp.xml:29147:6: Invalid byte 2 of 4-byte UTF-8 sequence.

cd ..

patch -p0 -i surrogate.patch

./build.sh
./build.sh jar

cd samples

java -classpath .:../build/xercesImpl.jar sax.Counter tmp.xml
tmp.xml: 77 ms (28161 elems, 77890 attrs, 0 spaces, 306128 chars)

I thought I'd try to test this with eXist and verify that I can import the problem file. In a dirty build directory, I see:

./exist-distribution/target/exist-distribution-5.3.0-SNAPSHOT-dir/lib/xercesImpl-2.12.1-xml-schema-1.1.jar
./exist-distribution/target/exist-distribution-5.3.0-SNAPSHOT.app/Contents/Java/xercesImpl-2.12.1-xml-schema-1.1.jar

So I guess I would need to get the "schema-1.1" version of Xerces-J, do a build, and then plonk the built version of xercesImpl.jar on top of those, renaming to the expected version-specific names? Yes, I'm sure there would be some Maven magic to do this properly. For now I'm just interested in verifying that I can import my file into eXist.

I'm not sure the Xerces-J bug needs my test file as the patch there already includes a 50-line Java test program that reproduces the issue by pushing a single 4-byte character through a 3-byte buffer. But perhaps another real-world example would help get the patch some attention.
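For context on why a 4-byte character is a special case for a Java reader, the following sketch prints how many UTF-8 bytes and how many Java `char` values a few code points need. It is only an illustration of encoding widths, not a reproduction of the Xerces-J test program, and the names `EncodingWidths`, `utf8Bytes` and `utf16Chars` are made up.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class EncodingWidths {

    /** Number of bytes the code point occupies when encoded as UTF-8. */
    static int utf8Bytes(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of Java char values (UTF-16 code units) the code point occupies. */
    static int utf16Chars(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        // 'A', the Mercury symbol, and the alchemical fire symbol from the sample lines.
        int[] samples = {0x41, 0x263F, 0x1F702};
        for (int cp : samples) {
            System.out.printf("U+%04X: %d UTF-8 byte(s), %d Java char(s)%n",
                    cp, utf8Bytes(cp), utf16Chars(cp));
        }
    }
}
```

U+263F takes 3 bytes and one `char`, while U+1F702 takes 4 bytes and a surrogate pair of two `char` values, so a reader has to cope with both an input sequence and an output pair that may not fit in the space left in its buffers.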

craigberry on 6 May 2020

I have updated the Xerces-J JIRA ticket and also verified that after putting a patched xercesImpl.jar into a build of eXist 5.3.0-SNAPSHOT I can successfully load the file into eXist.

craigberry on 6 May 2020

The Lucene project has a workaround for a related bug that involves decoding the input using standard JVM methods before passing to the Xerces-J reader.parse. I'm not sure eXist knows the encoding of the input file before it parses it, though, so I'm not sure whether the same solution would work for eXist.
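A minimal sketch of that kind of workaround, assuming the input is known to be UTF-8 (which, as noted, may not hold for eXist): the bytes are decoded by a JVM `InputStreamReader` and the parser is handed a character stream, so the parser's own byte decoding is not involved. This is not Lucene's or eXist's actual code; the class `PreDecodedParse` and method `countElements` are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class PreDecodedParse {

    /** Parses UTF-8 XML through a JVM-decoded Reader and returns the element count. */
    static int countElements(InputStream utf8Xml) {
        // Assumption: the stream is UTF-8. Real code would have to detect the encoding.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        final int[] count = {0};
        try (Reader reader = new InputStreamReader(utf8Xml, decoder)) {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // With a character stream set, the parser reads chars and does no byte decoding.
            InputSource source = new InputSource(reader);
            factory.newSAXParser().parse(source, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    count[0]++;
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return count[0];
    }
}
```

The trade-off is that any encoding named in the XML declaration is ignored once a character stream is supplied, so this only makes sense when the caller can determine the encoding reliably beforehand.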

craigberry on 29 May 2020
