Dataverse: Dataset (metadata) Title with '&' (ampersand) in the text causes 'Publish' to fail.

Created on 22 May 2017 · 18 Comments · Source: IQSS/dataverse

A dataset title containing the character '&' gives the UI error "This dataset may not be published because the DataCite Service is currently inaccessible. Please try again. Does the issue continue to persist? Please contact Dataverse Support for assistance."

However, the logfile shows the following xml error:

Response code: 400, [xml] xml error: The entity name must immediately follow the '&' in the entity reference.
at edu.harvard.iq.dataverse.DataCiteRESTfullClient.postMetadata(DataCiteRESTfullClient.java:183)
at edu.harvard.iq.dataverse.DOIDataCiteRegisterService.createIdentifier(DOIDataCiteRegisterService.java:89)

A colleague had tried to publish a dataset with '&' in the title and got this error.
I took a guess and replaced the & with 'and' and Publish succeeded. So the '&' is definitely the issue.

Adding a '&' to the subtitle metadata field did not cause an error so I am not sure if the issue applies only to the title or to other fields as well.

A more informative error message for the user would be nice to have (this applies to all DOI-related errors since the user only sees the error message listed above which gives no indication of the actual underlying problem).

Or escaping problematic characters in forming the xml for the DOI minting...

-M.

DOI & Handle Bug

All 18 comments

@mdmADA thanks for the detailed bug report!

It looks like retString = client.postMetadata(xmlMetadata) at https://github.com/IQSS/dataverse/blob/v4.6.1/src/main/java/edu/harvard/iq/dataverse/DOIDataCiteRegisterService.java#L89 is getting a RuntimeException because a "201" response code is not coming back from DataCite here: https://github.com/IQSS/dataverse/blob/v4.6.1/src/main/java/edu/harvard/iq/dataverse/DataCiteRESTfullClient.java#L183
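The behaviour described above can be sketched as follows. This is a minimal illustration, not the actual Dataverse code: the class name `StatusCheckSketch` and the method `requireCreated` are hypothetical, and the only thing assumed from the thread is that any response code other than 201 ends up as a RuntimeException whose message carries the code and the response body.

```java
final class StatusCheckSketch {

    /**
     * Hypothetical helper: returns the response body when DataCite answers
     * 201 (Created) and throws otherwise, so the caller sees the real reason.
     */
    static String requireCreated(int statusCode, String responseBody) {
        if (statusCode != 201) {
            // Message format modelled on the server.log lines quoted in this issue.
            throw new RuntimeException("Response code: " + statusCode + ", " + responseBody);
        }
        return responseBody;
    }

    public static void main(String[] args) {
        System.out.println(requireCreated(201, "OK"));
        try {
            requireCreated(400, "[xml] xml error: ...");
        } catch (RuntimeException e) {
            // This is the text that currently only reaches the log, not the UI.
            System.out.println(e.getMessage());
        }
    }
}
```

Because the exception message already contains the 400 and the XML error, surfacing it (or a summary of it) in the UI would address the request for a more informative error message.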

Thanks for the line numbers, @mdmADA .

Still DV version 4.6.1.

The characters "&" and ";" in the Metadata Description cause publish to fail so they should probably be avoided in the metadata completely.

Having "&" in the Description field gave the usual "DataCite unavailable" error in the UI and the following error in server.log (different from the one in the original issue):

Caused by: java.lang.RuntimeException: Response code: 400, [xml] xml error: cvc-complex-type.2.4.a: Invalid content was found starting with element 'p'. One of '{"http://datacite.org/schema/kernel-3":br}' is expected.
at edu.harvard.iq.dataverse.DataCiteRESTfullClient.postMetadata(DataCiteRESTfullClient.java:183)
at edu.harvard.iq.dataverse.DOIDataCiteRegisterService.createIdentifier(DOIDataCiteRegisterService.java:89)

Interestingly enough, when I validated the xml (as written to the doidataciteregistercache table) against the external schemas, it passed.

I guessed the "element p" in the server.log error message referred to the "p" in "& amp;" (had to put the space so the amp; part would show up). Perhaps it is actually the ";" causing the issue, however, since we had to change all of the ";" in the Description text to "," to get it to publish.

I am not sure if I should create a new issue or edit the title of this one to reflect that it is not just "&" and not just the title where these issues occur.... let me know and I will do as advised.

Thanks!

DV 4.6.1.

Running into same problem with < br >, < p >, etc markup in the Metadata description causing publish to fail.

It seems to boil down to this: the description is stored as valid HTML (as required throughout the UI), but it is then sent unchanged inside the XML posted to the Datacite DOI URL, where it would need to be valid XML.

In the DOIDataciteRegisterService createIdentifier() method, it calls:
metadataTemplate.setDescription(dataset.getLatestVersion().getDescription());

The DatasetVersion getDescription() method calls the MarkupChecker.sanitizeBasicHTML() method.

This makes sure that the description text is valid html and converts < br >< /br > to < br >< br >.

< br >< br > is valid HTML but not XML as XML requires the closing tag.
The publishing doi process requires valid xml and so throws an Exception due to the 'sanitized' HTML being invalid XML.

I believe a bare & is not allowed in XML either, so it needs to be escaped before the XML is sent to Datacite.

I am sure that whoever is assigned the bug fix can figure this out but maybe my own investigations can assist...
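A minimal sketch of the escaping mentioned above, assuming the value is plain text that has not already been escaped. The class name `XmlTextEscaper` is hypothetical and this is not taken from the Dataverse code base; it only replaces the five characters that have predefined entities in XML.

```java
final class XmlTextEscaper {

    /** Replaces the five XML special characters with their predefined entities. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Title taken from the failing example in this issue.
        System.out.println("<title>" + escape("Output dataset testing & self ingest") + "</title>");
        // prints: <title>Output dataset testing &amp; self ingest</title>
    }
}
```

Note that running this on text that is already escaped would double-escape it (`&amp;` becomes `&amp;amp;`), so it should be applied exactly once, at the point where the value is placed into the XML.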

@mdmADA can you please look at b1ae906 and let me know if those tests match your expectations? I'm trying to understand if there's a bug in the library we're using (jsoup).

Hi Phil. I believe jsoup is behaving as it should (no bugs) in that it properly sanitizes text input to valid html.

The issue is that this html is being sent to Datacite as part of the XML in the postMetadata() method for DOI minting.


The html being sent as part of the XML renders that XML invalid so Datacite is throwing a "400" status with "bad xml".



Example: Enter <br></br> into the Description field and hit 'Save Change'.

=> The description is saved to the datasetfieldvalue table with the <br></br> intact:

select value from datasetfieldvalue where value like '%<br>%';

                 value

paper for conference plus materials <br></br>\r+
testing line breaks



Now hit 'Publish'.
=> In the DOIDataCiteRegisterService createIdentifier() method, the dataset.getLatestVersion().getDescription() calls the MarkupChecker.sanitizeBasicHTML() on this description text which correctly converts the <br></br> to <br><br> as it should for valid html. This is then embedded as part of the xml sent by the postMetadata() method (see the description element):

<?xml version="1.0" encoding="UTF-8"?> <resource xsi:schemaLocation="http://datacite.org/schema/kernel-3 http://schema.datacite.org/meta/kernel-3/metadata.xsd" xmlns="http://datacite.org/schema/kernel-3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <identifier identifierType="DOI">10.5072/82/AH5ZIW</identifier> <creators><creator><creatorName>Test Author 1</creatorName><affiliation>ANU</affiliation></creator><creator><creatorName>Test Author 2</creatorName><affiliation>ANU</affiliation></creator></creators> <titles> <title>Output dataset-testing and self-ingest for Replication Materials</title> </titles> <publisher>DEV ADA Dataverse</publisher> <publicationYear>2017</publicationYear> <resourceType resourceTypeGeneral="Dataset"/> <descriptions> <description descriptionType="Abstract">paper for conference plus materials <br> <br> testing line breaks</description> </descriptions> <contributors><contributor contributorType="ContactPerson"><contributorName>Contributor 1</contributorName><affiliation>ANU</affiliation></contributor></contributors> </resource>

The <br><br> however, is not valid XML and Datacite is throwing a 400 'bad xml' error because of it.
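The failure can be reproduced locally without contacting DataCite by checking whether a fragment is well-formed XML. The sketch below uses only the JDK's built-in parser; the class name `WellFormedCheck` is hypothetical, and it checks well-formedness only, not validity against the DataCite schema (so it would not catch the schema error about element 'p' reported earlier).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {

    /** Returns true if the string parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // parse error: the input is not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unclosed <br> tags, as produced by the HTML sanitizer: not well-formed.
        System.out.println(isWellFormed("<description>materials <br> <br> testing</description>"));
        // Escaped tags: well-formed, but the tags then show up as literal text.
        System.out.println(isWellFormed("<description>materials &lt;br&gt; testing</description>"));
        // Bare ampersand, as in the title case: not well-formed.
        System.out.println(isWellFormed("<title>testing & self ingest</title>"));
    }
}
```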


Finally, I tested if using &lt;br&gt; allows publishing. While it does, the <br> tags show up in the UI as part of the text which is not what we want either.


I assume this is an issue when trying to include ANY of the html tags (<em>,<b>,etc) in the description Metadata field...


I hope I am describing this clearly!!

@mdmADA yes, this is helping. Thanks! 😄

@mdmADA I started working on this issue and pushed some scratch work to https://github.com/pdurbin/dataverse/commit/4ed4c00f5b106d266050ecd708c003b5cad70d7e if you're interested. You're definitely right about how Dataverse sometimes sends XML that's not well formed to DataCite. I'll keep updating this issue with my progress.

I gave @rbhatta99 a brain dump this morning and just pushed a branch called 3845-datacite-xml as a common starting point.

@mdmADA neither one of us is able to reproduce the "& in the title" bug. I'm not sure why.

We definitely can exercise the "<br></br> in the description" bug.

Along the way, I discovered that while using DataCite rather than EZID, I can't publish a dataset created via SWORD because contributorName wasn't being sent. I pushed a fix in e93d6b3 to my branch and this relates strongly to #3802 and #3839.

@matthew-a-dunlap @rbhatta99 and I observed that descriptions of datasets aren't even being shown at https://search.datacite.org/works/10.7910/dvn/eiwf4p so part of me wonders if an easy fix would be to always send an empty string. Does DataCite do anything with descriptions? Maybe @jggautier or @pameyer would know.

I'll keep hacking away but I wanted to give an update and get that branch pushed.

As of commit https://github.com/IQSS/dataverse/commit/dd55c08b3082e0794623398edeee04bedc6f0c64 on develop (v 4.7.1), a dataset still gets published with an & in the title.
However, adding HTML tags to the description still causes publishing to fail.

Which version of the DataCite XML schema are we sending to DataCite's API? From a quick check (edit test file and re-run validator), v3.1 doesn't support html tags in the description.

We're using 3.1. Both 3.1 and the newest version, 4.0, pretty strongly recommend including the description (although I also noticed that dataset descriptions aren't displayed on search.datacite).

Edit: Some dataset descriptions aren't displayed on search.datacite, but this one is: https://search.datacite.org/works/10.6084/m9.figshare.4223907, and the Datacite.xml you can export from that page includes the dataset description, whereas the Datacite.xml export on this page, https://search.datacite.org/works/10.7910/dvn/eiwf4p, doesn't. If we plan on using DataCite's service to add datacite metadata in schema.org/JSON-LD format to dataset pages (#3793), then finding a way to send DataCite the description would help even more.

Maybe there is a difference between 4.6.1 and 4.7.1 (I am not sure when we will move to that version) with the & in the title?

I am using this fake dataset to test: https://dataverse-dev.ada.edu.au/dataset.xhtml?persistentId=doi:10.5072/82/AH5ZIW

When I add & to the title (and change nothing else in the metadata), this is the xml that Dataverse attempts to send to Datacite:

<?xml version="1.0" encoding="UTF-8"?> <resource xsi:schemaLocation="http://datacite.org/schema/kernel-3 http://schema.datacite.org/meta/kernel-3/metadata.xsd" xmlns="http://datacite.org/schema/kernel-3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <identifier identifierType="DOI">10.5072/82/AH5ZIW</identifier> <creators><creator><creatorName>Test Author 1</creatorName><affiliation>ANU</affiliation></creator><creator><creatorName>Test Author 2</creatorName><affiliation>ANU</affiliation></creator></creators> <titles> <title>Output dataset testing & self ingest for Replication Materials</title> </titles> <publisher>DEV ADA Dataverse</publisher> <publicationYear>2017</publicationYear> <resourceType resourceTypeGeneral="Dataset"/> <descriptions> <description descriptionType="Abstract">paper for conference plus materials testing line breaks</description> </descriptions> <contributors><contributor contributorType="ContactPerson"><contributorName>Contributor 1</contributorName><affiliation>ANU</affiliation></contributor></contributors> </resource>



This is the error in server.log:

[2017-08-14T14:19:56.645+1000] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.DataCiteRESTfullClient] [tid: _ThreadID=50 _ThreadName=jk-connector(1)] [timeMillis: 1502684396645] [levelValue: 1000] [[
Response code: 400, [xml] xml error: The entity name must immediately follow the '&' in the entity reference.]]

[2017-08-14T14:19:56.646+1000] [glassfish 4.1] [WARNING] [AS-EJB-00056] [javax.enterprise.ejb.container] [tid: _ThreadID=50 _ThreadName=jk-connector(1)] [timeMillis: 1502684396646] [levelValue: 900] [[
A system exception occurred during an invocation on EJB DOIDataCiteRegisterService, method: public java.lang.String edu.harvard.iq.dataverse.DOIDataCiteRegisterService.createIdentifier(java.lang.String,java.util.HashMap,edu.harvard.iq.dataverse.Dataset) throws java.io.IOException]]

.
.
.

Caused by: java.lang.RuntimeException: Response code: 400, [xml] xml error: The entity name must immediately follow the '&' in the entity reference.
at edu.harvard.iq.dataverse.DataCiteRESTfullClient.postMetadata(DataCiteRESTfullClient.java:186)
at edu.harvard.iq.dataverse.DOIDataCiteRegisterService.createIdentifier(DOIDataCiteRegisterService.java:92)



If I make the simple change of & to 'and', it publishes. Not sure what the difference is...

I just created pull request #4075 and am moving this issue to code review. @jggautier seemed ok with sending plain text to DataCite so we're stripping out HTML tags (he started a thread at https://groups.google.com/d/msg/datacite-metadata/Di5TSstfafU/Zki5n44CAgAJ ). (Thanks to @rbhatta99 we do have code ready to go in 64e1ee4 to escape the HTML tags instead if that's what's desired.)

Hi @mdmADA and @philippconzett, does stripping all html from the description metadata sent to DataCite work for ADA?

In 2b466e7 and 47feb73 I addressed code review from @scolapasta . Moving to QA.

In (delayed) reply to @jggautier about stripping out all HTML from the description to send to DataCite, I think that is perfectly acceptable for ADA... Thanks!

@kcondon here's the template I mentioned we use when sending XML to DataCite:

<?xml version="1.0" encoding="UTF-8"?>
<resource xsi:schemaLocation="http://datacite.org/schema/kernel-3 http://schema.datacite.org/meta/kernel-3/metadata.xsd"
          xmlns="http://datacite.org/schema/kernel-3"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <identifier identifierType="DOI">${identifier}</identifier>
    <creators>${creators}</creators>
    <titles>
        <title>${title}</title>
    </titles>
    <publisher>${publisher}</publisher>
    <publicationYear>${publisherYear}</publicationYear>
    <resourceType resourceTypeGeneral="Dataset"/>
    <descriptions>
        <description descriptionType="Abstract">${description}</description>
    </descriptions>
    <contributors>{$contributors}</contributors>
</resource>

(From src/main/resources/edu/harvard/iq/dataverse/datacite_metadata_template.xml in the code.)

If you run asadmin set-log-levels edu.harvard.iq.dataverse.DOIDataCiteRegisterService=FINE you can see the XML in server.log right after it's constructed.

Like I said, the fix is to strip out HTML from the description before it's inserted into the template above. While we were in there, we also now strip out HTML from the description when inserting it into the "meta" tags that Zotero and other tools consume (#1393). The "meta" code/template looks like this:

<meta name="DC.identifier" content="#{DatasetPage.persistentId}"/>
<meta name="DC.type" content="Dataset"/>
<meta name="DC.title" content="#{DatasetPage.title}"/>
<meta name="DC.date" content="#{DatasetPage.publicationDate}"/>
<meta name="DC.publisher" content="#{DatasetPage.publisher}" />
<meta name="DC.description" content="#{DatasetPage.description}" />

Hope this helps.
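As a rough illustration of the fix described above (strip HTML from the description, then insert it into the template), here is a minimal, dependency-free sketch. It is not the code from the pull request: the class and method names are hypothetical, and the regex-based `stripTags` is a naive stand-in for a real HTML library such as jsoup. Escaping any leftover `&` is included here as an assumption about what a robust version would also need.

```java
final class DescriptionTemplateSketch {

    /** Naive tag removal (stand-in for an HTML library); collapses leftover whitespace. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Escapes the characters that would break XML text content; '&' must go first. */
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Fills the ${description} placeholder with plain, escaped text. */
    static String fill(String template, String description) {
        return template.replace("${description}", escapeXml(stripTags(description)));
    }

    public static void main(String[] args) {
        String template = "<description descriptionType=\"Abstract\">${description}</description>";
        String stored = "paper for conference plus materials <br><br>\r\ntesting line breaks & more";
        System.out.println(fill(template, stored));
    }
}
```

With the stored value above, the filled element contains only plain text with the ampersand escaped, which is what a strict XML parser on the receiving side requires.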
