OpenRefine: add parseXML() GREL function

Created on 10 Nov 2018 · 9 Comments · Source: OpenRefine/OpenRefine

Is your feature request related to a problem or area of OpenRefine? Please describe.
With parseHtml() you can parse XML in GREL, but the result is not perfect, because Jsoup interprets the string as HTML: it adds html and body elements, lowercases the element names, changes the encoding of accented letters...

Ex :

'<?xml version="1.0" encoding="UTF-8"?><BODY>mon corps</BODY><SOUL>mon âme</SOUL>'.parseHtml()

Resulting tree :

<?xml version="1.0" encoding="UTF-8"?>
<html>
 <head></head>
 <body>
  mon corps
  <soul>
   mon &acirc;me
  </soul>
 </body>
</html>

So in this example, getting the BODY element of my XML won't work well, because of the body element added by the parser:

'<?xml version="1.0" encoding="UTF-8"?><BODY>mon corps</BODY><SOUL>mon âme</SOUL>'.parseHtml().select('BODY')

Result:

[ 
<body>
 mon corps
 <soul>
  mon &acirc;me
 </soul>
</body> ]
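
The lowercasing described above is typical of HTML parsers in general, not only Jsoup. As a rough illustration, here is a minimal sketch using Python's standard library as a stand-in (it is not Jsoup and not OpenRefine code). The `<ROOT>` wrapper is an assumption added so that the snippet is well-formed XML, and the helper names `html_tag_names` and `xml_tag_names` are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Adapted from the example above. The <ROOT> wrapper is an assumption,
# added only so that the XML parser accepts the snippet.
SNIPPET = "<ROOT><BODY>mon corps</BODY><SOUL>mon âme</SOUL></ROOT>"


class TagCollector(HTMLParser):
    """Record start-tag names exactly as the HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tag_names(markup):
    """Tag names as seen by an HTML parser (names are lowercased)."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def xml_tag_names(markup):
    """Tag names as seen by an XML parser (case is preserved)."""
    return [element.tag for element in ET.fromstring(markup).iter()]


if __name__ == "__main__":
    print(html_tag_names(SNIPPET))
    print(xml_tag_names(SNIPPET))
```

The HTML-oriented parser reports `root`, `body`, `soul`, while the XML parser keeps `ROOT`, `BODY`, `SOUL`, which is the behaviour this request asks for.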

Describe the solution you'd like
It seems that the Jsoup library has an option for using an XML parser instead of an HTML one: https://jsoup.org/apidocs/org/jsoup/parser/Parser.html#xmlParser--

So, would it be possible to implement that in OpenRefine, for example by creating a new parseXML() function, based on Jsoup but using an XML parser? Or maybe by adding an option to parseHtml()?

enhancement

All 9 comments

Oh yes, the current situation is pretty bad indeed! What you are proposing makes a lot of sense to me.

I'll let the Java folks write something; I don't know the language ;-)
Note that the Jsoup version used in OpenRefine seems very old, so it may be necessary to update it in order to use the XML parser.

I've started some work on this

@msaby would you be able to give some test cases for parsing XML that I could include as tests to make sure this works as desired? i.e. starting XML, some process and expected output

Hi
What do you think of this valid XML for testing?

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <foaf:Person>
        <foaf:name>John Doe</foaf:name>
        <head>head1</head>
        <head>head2</head>
        <BODY>nice body</BODY>
        <foaf:homepage rdf:resource="http://www.example.com"/>
    </foaf:Person>
    <foaf:Person>
        <foaf:name>Héloïse Dupont</foaf:name>
        <head>head3</head>
        <BODY>nice body</BODY>
        <foaf:title/>
    </foaf:Person>
</root>

I am not sure about the tests, but it could be something like this:

value.parseXML().select("foaf|Person") -> an array with the 2 elements foaf:Person

foreach(value.parseXML().select("foaf|Person foaf|name"),x,innerHTML(x) ).join(|) -> "John Doe|Héloïse Dupont"

foreach(value.parseXML().select("foaf|Person head"),x,innerHTML(x) ).join(|) ->"head1head2head3"

value.parseXML().select("BODY")[1]-> "nice body"

value.parseXML().select("foaf|homepage[rdf|resource]")[0] -> "http://www.example.com"

Note that the name innerHTML() could seem strange if parseXML() is created. Maybe innerXML() could be created as a simple alias of innerHTML()? Same for the other ...HTML functions?

Thanks @msaby

So I think I can improve the code, but what I have working so far is:

value.parseXml().select("foaf|Person") ->
forEach(value.parseXml().select("foaf|Person foaf|name"),x,x.ownText()).join("|") -> John Doe|Héloïse Dupont
forEach(value.parseXml().select("foaf|Person head"),x,x.ownText()).join("|") -> head1|head2|head3
value.parseXml().select("BODY")[1].ownText() -> nice body
value.parseXml().select("foaf|homepage")[0].xmlAttr("rdf:resource") -> http://www.example.com

This is close to what you suggest, but not identical. The main differences are that you need to use ownText() to get the text content of an XML element, and that xmlAttr uses a colon after the namespace rather than a pipe. This differs from select, but it seems to be how Jsoup handles it by default, so I'm not inclined to change it unless there is a strong argument for doing so.

This all seems reasonable and useful to me. I'll start trying to tidy up the code (basically at the moment I've duplicated a lot of code which I'd prefer not to duplicate and I need to fix that)
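
As a cross-check of the expected values against the sample XML, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. It is only an analogue, not the GREL or Jsoup implementation: ElementTree selects namespaced elements with a `prefix:name` path plus a prefix map, and reads namespaced attributes with `{namespace-uri}name`, so the syntax differs from both the pipe and colon forms discussed here. The names `SAMPLE`, `NS` and `expected_values` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample XML proposed in this thread for testing.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <foaf:Person>
        <foaf:name>John Doe</foaf:name>
        <head>head1</head>
        <head>head2</head>
        <BODY>nice body</BODY>
        <foaf:homepage rdf:resource="http://www.example.com"/>
    </foaf:Person>
    <foaf:Person>
        <foaf:name>Héloïse Dupont</foaf:name>
        <head>head3</head>
        <BODY>nice body</BODY>
        <foaf:title/>
    </foaf:Person>
</root>
"""

# Prefix map used by ElementTree path expressions (an ElementTree convention).
NS = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expected_values(xml_text):
    """Compute the values the thread's test cases expect, via ElementTree."""
    # Parse from bytes so the encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    homepage = root.find(".//foaf:homepage", NS)
    return {
        "person_count": len(root.findall("foaf:Person", NS)),
        "names": "|".join(e.text for e in root.findall("foaf:Person/foaf:name", NS)),
        "heads": "|".join(e.text for e in root.findall("foaf:Person/head", NS)),
        "second_body": root.findall(".//BODY")[1].text,
        # Namespaced attributes are addressed as {namespace-uri}local-name.
        "homepage": homepage.get("{%s}resource" % NS["rdf"]),
    }


if __name__ == "__main__":
    for key, value in expected_values(SAMPLE).items():
        print(key, "->", value)
```

The values match the outputs listed above, which suggests that the sample XML and the expected results are consistent with each other independently of the parser used.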

For the pipe after the namespace, I saw that in Jsoup documentation, so I thought it was already the case for GREL functions based on Jsoup...

In Jsoup it seems a pipe char is used when selecting elements with a namespace, but not attributes. Hence:

value.parseXml().select("foaf|homepage")[0].xmlAttr("rdf:resource") -> http://www.example.com

Ah, OK. And it already works like that with parseHtml(); I have just tested it.

Closed by #1845.
