When saving HTML files to the database (tested with WebDAV and eXide, and noticed when deploying a .xar), eXist escapes angle brackets, even those inside CDATA sections. So > becomes &gt;.
This does not happen in the 3.x.x branch, where the brackets inside CDATA sections are preserved. This matters for JavaScript sections embedded in HTML.
Save the following in eXide as HTML:
<div xmlns="http://www.w3.org/1999/xhtml">
<h1>Test CDATA</h1>
<p>Error: When this file is saved to the database (tested with webdav, eXide and notices on deploying a .xar)
eXist serializes brackets (> becomes >), even those inside CDATA sections.</p>
<p>eXist 4.0.0 seems to be okay, error exists in 4.2.0, 4.3.0, 4.3.1 and eXist-5.0.0-RC2</p>
<script type="text/javascript"><![CDATA[ > Test < ]]></script>
</div>
Confirmed on 5.0.0 RC2 3eb07483b
Saving as HTML removes the CDATA markers and escapes the brackets; as far as I know there is no CDATA in strict HTML.
But the same happens when saving as XML, which should obviously neither remove the CDATA section nor convert the brackets to escaped form.
After saving and reopening test.xml looks like this:
<div xmlns="http://www.w3.org/1999/xhtml">
<h1>Test CDATA</h1>
<p>Error: When this file is saved to the database (tested with webdav, eXide and notices on deploying a .xar)
eXist serializes brackets (> becomes >), even those inside CDATA sections.</p>
<p>eXist 4.0.0 seems to be okay, error exists in 4.2.0, 4.3.0, 4.3.1 and eXist-5.0.0-RC2</p>
<script type="text/javascript"> &gt; Test &lt; </script>
</div>
@joewiz any idea about unintended serialization changes?
I'm able to reproduce this result with 4.3.1 with curl - ruling out eXide as the source of the behavior, presumably, since eXide also simply submits an HTTP PUT request to the database.
curl 'http://localhost:8080/exist/apps/eXide/store/db/temp/test.html' -X PUT -H 'Content-Type: application/xml' --data-binary $'<div xmlns="http://www.w3.org/1999/xhtml">\n <h1>Test CDATA</h1>\n <p>Error: When this file is saved to the database (tested with webdav, eXide and notices on deploying a .xar) \n eXist serializes brackets (> becomes >), even those inside CDATA sections.</p>\n <p>eXist 4.0.0 seems to be okay, error exists in 4.2.0, 4.3.0, 4.3.1 and eXist-5.0.0-RC2</p>\n <script type="text/javascript"><![CDATA[ > Test < ]]></script>\n</div>'
For this PUT request to work with curl, of course, you have to first create the /db/temp collection and chmod it to o+w:
xmldb:create-collection("/db", "temp"),
sm:chmod("/db/temp", "o+w")
When opening the resulting file via WebDAV or eXide, the CDATA section is removed - as in Duncan's pasted result above.
Saving the same file to 4.3.1 via WebDAV (using oXygen) or via XML-RPC (using Java admin client), the CDATA is preserved, though the angle brackets within the CDATA section are escaped:
<xhtml:div xmlns="http://www.w3.org/1999/xhtml" xmlns:xhtml="http://www.w3.org/1999/xhtml">
<h1>Test CDATA</h1>
<p>Error: When this file is saved to the database (tested with webdav, eXide and notices on deploying a .xar)
eXist serializes brackets (> becomes >), even those inside CDATA sections.</p>
<p>eXist 4.0.0 seems to be okay, error exists in 4.2.0, 4.3.0, 4.3.1 and eXist-5.0.0-RC2</p>
<script type="text/javascript"><![CDATA[ &gt; Test &lt; ]]></script>
</xhtml:div>
Could it be that some additional escaping is needed when submitting CDATA blocks inside PUT requests? Somehow WebDAV and XML-RPC are managing to upload and preserve CDATA blocks successfully, whereas plain HTTP PUT requests are not. Any ideas why, or how to fix the plain HTTP PUT requests?
All interfaces are causing the angle brackets inside the CDATA block to be escaped though. Presumably that is a bug?
Upon further study, I don't believe any additional escaping should be needed when submitting CDATA blocks inside HTTP PUT requests.
Thus, I think we have two problems here:
1. When saving via the /rest endpoint, CDATA delimiters are stripped.
2. When saving via any endpoint (/rest, /webdav, or /xmlrpc), the <, >, and & characters contained inside CDATA delimiters are escaped as &lt;, &gt;, and &amp;. (Even worse, subsequent saves of the document to the database double-escape these blocks, turning &lt; into &amp;lt;, and so on.)
This does not seem to be limited to CDATA blocks. I also encounter it when attempting to serialize my data as ttl or plain text (v4.4.0):
xquery version "3.1";
declare namespace output="http://www.w3.org/2010/xslt-xquery-serialization";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare namespace dcterms = "http://purl.org/dc/terms/";
let $prefix :=
"@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>. 
"
return
(response:set-header("Content-Type", "text/turtle; charset=utf-8"),
response:set-header("method", "text"),
response:set-header("media-type", "text/plain"),
serialize($prefix,
<output:serialization-parameters>
<output:method>text</output:method>
<output:media-type>text/plain</output:media-type>
</output:serialization-parameters>))
Returns:
@prefix rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix schema: &lt;http://schema.org/&gt; .
@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;.
It should look like this:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix wdata: <https://www.wikidata.org/wiki/Special:EntityData/> .
See: http://syriaca.org/person/13/ttl (running on version 3.5.0)
@wsalesky Remove the serialize function and declare the output method in the prolog:
xquery version "3.1";
declare namespace output="http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:method "text";
let $prefix :=
"@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>. 
"
return
(response:set-header("Content-Type", "text/turtle; charset=utf-8"),
response:set-header("media-type", "text/plain"),
$prefix)
This yields the expected result:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
As for the unwanted stripping of CDATA section markings and escaping of characters within, I believe this is answered in https://github.com/eXist-db/exist/issues/2233. Without support for the cdata-section-elements serialization parameter, eXist is falling back on the default handling of CDATA sections.
I wish I could find the commit that changed eXist's behavior, but I suspect that eXist is behaving correctly here - it's just not giving us control over the serialization of CDATA sections that we would have if it supported cdata-section-elements.
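To make the role of cdata-section-elements concrete, here is a minimal sketch in Python rather than eXist code. It uses the standard library's xml.dom.minidom to show the default serialization of the script text (escaped brackets) next to what a serializer honoring a list of CDATA elements would emit. The helper name serialize_with_cdata and the simplified, namespace-free test document are assumptions made for illustration only; nothing here reflects how eXist implements serialization.

```python
from xml.dom import minidom


def serialize_with_cdata(xml_text, cdata_elements):
    """Serialize xml_text, wrapping the text of the named elements in CDATA.

    Hypothetical helper: it mimics the effect of the cdata-section-elements
    serialization parameter and is not part of any eXist API.
    """
    doc = minidom.parseString(xml_text)
    for name in cdata_elements:
        for element in doc.getElementsByTagName(name):
            # Copy the child list because nodes are replaced while iterating.
            for child in list(element.childNodes):
                # Only plain text nodes are converted; existing CDATA nodes stay.
                if child.nodeType == child.TEXT_NODE:
                    cdata = doc.createCDATASection(child.data)
                    element.replaceChild(cdata, child)
    return doc.documentElement.toxml()


# The text as it comes back from the database: no CDATA, brackets escaped.
stored = '<div><script> &gt; Test &lt; </script></div>'

# Default serialization: character data is escaped, which is valid XML.
default_output = minidom.parseString(stored).documentElement.toxml()

# With "script" listed as a CDATA element, the same text is emitted verbatim.
cdata_output = serialize_with_cdata(stored, {"script"})

print(default_output)
print(cdata_output)
```

Both outputs carry the same character data for an XML parser. The difference only matters to consumers that read the script content literally, such as a browser parsing the file as HTML.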
Thanks @joewiz !
Still a problem, for sure, is the double escaping of CDATA blocks saved (PUT) to eXist over REST, WebDAV, and XML-RPC: https://github.com/eXist-db/exist/issues/2081#issuecomment-409664770.
Still a mystery to me!
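For anyone trying to picture the double-escaping symptom, the following Python sketch models it with the standard library's xml.sax.saxutils. It assumes, purely for illustration, that the already-escaped text is escaped again on each save. That assumption describes the observed result and says nothing about the actual code path inside eXist, which remains unidentified in this thread.

```python
from xml.sax.saxutils import escape, unescape

# Content of the <script> element as the author wrote it.
script_text = " > Test < "

# First save: the brackets are escaped once.
saved_once = escape(script_text)

# Second save, modelled as escaping the already-escaped text again:
# the ampersands introduced by the first pass are escaped too.
saved_twice = escape(saved_once)

print(repr(saved_once))
print(repr(saved_twice))

# A parser undoes exactly one level of escaping, so only the
# single-escaped form round-trips to the original script text.
round_trip_once = unescape(saved_once)
round_trip_twice = unescape(saved_twice)
```

Under this model every further save adds another level, so the stored script drifts further from the original with each edit cycle.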