Semanticmediawiki: dumpRDF.php: encoding error on page title with special chars and long sortkey with special chars

Created on 19 Jun 2020  ·  3 Comments  ·  Source: SemanticMediaWiki/SemanticMediaWiki

Setup and configuration

  • SMW version: 3.1.6
  • MW version: 1.34.1
  • PHP version: 7.3
  • DB system (MySQL, Blazegraph, etc.) and version: 10.4.13-MariaDB

Issue

When I run dumpRDF.php, the swivt:wikiPageSortKey element of a subobject ends with an illegal character at the end of its text node. The affected subobject:

  <swivt:Subject rdf:about="http://test.wiki.terminologi.no/index.php/Special:URIResolver/MRT2-3ANrrøt-23_ML718af25fff3e23325023b14c2ce036e1">
                <swivt:masterPage rdf:resource="&wiki;MRT2-3ANrrøt"/>
                <swivt:wikiNamespace rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">3002</swivt:wikiNamespace>
                <property:Language_code rdf:datatype="http://www.w3.org/2001/XMLSchema#string">nb</property:Language_code>
                <swivt:wikiPageSortKey rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Nrrøt#fiskeart i laksefamilien. Ligner på laks, mens skiller seg fra denne særlig ved mindre slank bygning, høyere halerot, flere skjell (oftest 13–15) i en skrålinje fra fettfinnen ned til sidelinjen og flere flekker på sidene. Er også gjennomg�</swivt:wikiPageSortKey>
                <property:Text rdf:datatype="http://www.w3.org/2001/XMLSchema#string">fiskeart i laksefamilien. Ligner på laks, mens skiller seg fra denne særlig ved mindre slank bygning, høyere halerot, flere skjell (oftest 13–15) i en skrålinje fra fettfinnen ned til sidelinjen og flere flekker på sidene. Er også gjennomgå</property:Text>
        </swivt:Subject>

Steps to reproduce

Add to page MRT2:Nrrøt

[[Rdfs:label::fiskeart i laksefamilien. Ligner på laks, mens skiller seg fra denne særlig ved mindre slank bygning, høyere halerot, flere skjell (oftest 13–15) i en skrålinje fra fettfinnen ned til sidelinjen og flere flekker på sidene. Er også gjennomgå@nb]]

Running from the CLI

php dumpRDF.php -q --page "MRT2:Nrrøt"  > ~/trout.xml && xmllint ~/trout.xml

returns

Input is not proper UTF-8, indicate encoding !
Bytes: 0xC3 0x3C 0x2F 0x73

If I move the page to MRT2:Nrrit without making any other changes, the same CLI commands return the full XML.

If I remove the last character "å" or write ASCII only after the last period, the export also works.
Export to RDF from special pages also seems to return valid XML.
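The reported bytes (0xC3 followed by 0x3C 0x2F 0x73, i.e. "</s") look like the first byte of a two-byte UTF-8 character followed directly by the closing tag. That would fit a sort key being cut at a fixed byte length rather than at a character boundary, and it would also fit the observation that renaming ø to i (one byte shorter) makes the problem disappear. This is only a guess at the cause, not something confirmed in the SMW code. A minimal Python sketch of the effect, with hypothetical helper names and an arbitrary limit chosen to land inside the "å":

```python
# Sketch only: these helpers are hypothetical and are not SMW code.

def truncate_bytes(text: str, max_bytes: int) -> bytes:
    """Naive cut at a byte offset; may split a multi-byte character."""
    return text.encode("utf-8")[:max_bytes]


def truncate_chars(text: str, max_bytes: int) -> bytes:
    """Cut at a byte offset, then drop any incomplete trailing sequence."""
    cut = text.encode("utf-8")[:max_bytes]
    return cut.decode("utf-8", errors="ignore").encode("utf-8")


closing = b"</swivt:wikiPageSortKey>"
sample = "gjennomgå"   # 'å' encodes as the two bytes 0xC3 0xA5
limit = 9              # assumed limit, chosen so the cut falls inside 'å'

naive = truncate_bytes(sample, limit) + closing
safe = truncate_chars(sample, limit) + closing

# Four bytes starting at the stray lead byte: c3 3c 2f 73
print(" ".join(f"{b:02x}" for b in naive[limit - 1:limit + 3]))

# The character-safe variant still decodes as UTF-8
print(safe.decode("utf-8"))
```

The first print shows the same byte sequence xmllint complains about; the second variant loses the final character but stays valid UTF-8.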

See the history of https://test.wiki.terminologi.no/index.php?title=MRT2:Nrr%C3%B8t&action=edit

All 3 comments

I'm not able to consistently reproduce the error.

Thanks for reporting.

Export to RDF from special pages also seems to return valid XML.

Indeed, this worked fine for me. So there is probably something in the water with the script.

I downloaded the export file to my local computer, and OxygenXML (21.1), which I think uses its own patched Xerces to work with XML files, handles the file with no issues.

Running a newer xmllint on my local computer still results in the error.

I'll just replace all <wikiPageSortKey... for now as a workaround.

I could probably also work around it by configuring another xmlreader in the processing pipeline.
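The first workaround (dropping the sort key elements before further processing) could be sketched roughly as follows. This is an illustration under assumptions, not a tested fix: the function name and sample data are made up, the regular expression assumes each swivt:wikiPageSortKey element has no nested markup, and the sort keys are simply discarded from the dump. It works on raw bytes so the invalid UTF-8 never has to be decoded.

```python
import re

# Matches one swivt:wikiPageSortKey element, including leading indentation
# and the trailing line break. Operates on bytes so that a broken UTF-8
# sequence inside the element cannot raise a decode error.
SORTKEY_RE = re.compile(
    rb"[ \t]*<swivt:wikiPageSortKey\b[^>]*>.*?</swivt:wikiPageSortKey>\r?\n?",
    re.DOTALL,
)


def drop_sort_keys(data: bytes) -> bytes:
    """Return the dump with every swivt:wikiPageSortKey element removed."""
    return SORTKEY_RE.sub(b"", data)


# Made-up sample with a sort key that ends in a stray 0xC3 byte.
sample = (
    b'<swivt:Subject rdf:about="urn:example">\n'
    b'  <swivt:wikiPageSortKey rdf:datatype="x">gjennomg\xc3</swivt:wikiPageSortKey>\n'
    b'  <property:Text rdf:datatype="x">gjennomg\xc3\xa5</property:Text>\n'
    b'</swivt:Subject>\n'
)

cleaned = drop_sort_keys(sample)
print(cleaned.decode("utf-8"))  # decodes cleanly once the sort key is gone
```

For a real dump, the same function could be applied to the bytes read from the export file before handing the result to xmllint or another XML reader.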

Thank you for the swift response :+1:
