Semanticmediawiki: XML: error parsing RDF document

Created on 22 Apr 2018 · 9 Comments · Source: SemanticMediaWiki/SemanticMediaWiki

Setup and configuration

  • SMW version: 3.0.0-alpha
  • MW version: 1.30.0
  • PHP version: 7.0.27
  • MariaDB: 10.1.26

Issue

The XML parser reports an error when parsing this line of the RDF export:
<property:Datenqualität/Herkunft rdf:resource="&wiki;Der_Datensatz_wurde_basierend_auf_der_ÖK50-2C_Stand_2011_digitalisiert._Es_wurden_alle_Waldbestände_für_die_Gemeinde_Kopfing_erfasst."/>

Should we escape "/" as &#47;, &#x2F;, or &sol;? [0]
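
For reference, a quick check of the escaping idea. XML recognizes character references such as &#47; only in content and attribute values, not inside element names, so escaping the slash in the tag name still leaves the document malformed. The sketch below is Python rather than SMW's PHP, uses only the standard library, and is not SMW code; `is_well_formed` is a hypothetical helper, and the element names are shortened stand-ins modelled on the property above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Illustrative documents only; the element names imitate the reported property.
samples = {
    "slash in name": '<rdf><Datenqualität/Herkunft resource="x"/></rdf>',
    "&#47; in name": '<rdf><Datenqualität&#47;Herkunft resource="x"/></rdf>',
    "underscore in name": '<rdf><Datenqualität_Herkunft resource="x"/></rdf>',
}

# Map each label to whether the parser accepted the document.
results = {label: is_well_formed(text) for label, text in samples.items()}

for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'parse error'}")
```

In this sketch, only the variant without a slash in the element name parses. That suggests the property name has to be encoded or restricted before it becomes a tag name, not entity-escaped afterwards.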

Stack trace

See sandbox.

[0] https://en.wikipedia.org/wiki/Slash_(punctuation)#Encoding

bug

All 9 comments

Well, considering the age of the code, it is at least not a regression for 3.0.0.

It may not be a blocker, but given that the sandbox is running 3.0.0-rc.1, this issue is not resolved yet. See:
https://sandbox.semantic-mediawiki.org/wiki/Sp%C3%A9cial:Export_RDF/Lorem_ipsum_Export

The issue here is with [[Datenverantwortliche Stelle – E-Mailkontakt::[email protected]]], where the first "–" is not an ordinary hyphen but a Unicode en dash, so it is not recognized as an ASCII hyphen.

As an HTML entity, – is &ndash;, and it should be banned from property names. Adding it to smwgPropertyInvalidCharacterList should prevent the creation of properties that look alike but are in fact different, such as Foo-Bar vs. Foo–Bar.

Note that this concerns the property name, not any value representation, so adding some restrictions should help users and administrators more than a "you can do anything" motto.

You could try using htmlentities($uri, ENT_COMPAT, "UTF-8") in Escaper::encodeUri to filter invalid entities, but I'm not eager to do that unless there is a very good reason for it.

Thanks for the explanation @mwjames

Added at different spots to make sure this does not get overlooked:

Maybe create a PR and add – to smwgPropertyInvalidCharacterList, because normal users cannot distinguish Foo-Bar from Foo–Bar just by looking at them (it took me some time and tools to figure out the true nature of the issue).

Doing something like this could be endless, since there are many characters that look alike. However, since I consider this a fairly common pitfall, at least for Germans, I will do a pull request tomorrow.
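
For anyone who needs to track down such characters in existing property names, here is a minimal sketch of one way to spot them. It is Python using only the standard library and is not part of SMW; `find_lookalike_dashes` is a hypothetical helper. It relies on the Unicode general category "Pd" (dash punctuation), so it covers dash-like characters only and not every look-alike (U+2212 MINUS SIGN, for example, is a math symbol and would be missed).

```python
import unicodedata
from typing import List, Tuple


def find_lookalike_dashes(name: str) -> List[Tuple[int, str]]:
    """Return (index, Unicode name) for each dash-like character in `name`
    that is not the ASCII hyphen-minus."""
    return [
        (index, unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(name)
        if char != "-" and unicodedata.category(char) == "Pd"
    ]


# ASCII hyphen-minus: nothing is flagged.
print(find_lookalike_dashes("Foo-Bar"))       # → []

# U+2013 EN DASH looks the same on screen but is a different code point.
print(find_lookalike_dashes("Foo\u2013Bar"))  # → [(3, 'EN DASH')]
```

A check along these lines could be run over a list of property names to decide which characters are worth adding to the invalid-character list, without enumerating every similar glyph by hand.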
