Forwarded from the ticket:
https://help.hmdc.harvard.edu/Ticket/Display.html?id=245607
Hello,
I tried to validate two items exported to DDI from dataverse.harvard.edu with codebook.xsd (2.5) and got the same types of validation errors described below for item1 (below the line, should work as a well-formed xml-file):
Item 1: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/BAMCSI
Item 2: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/P4JTOD
What could be done about it (other than meddling with the schema)?
Best regards,
Joakim Philipson
Research Data Analyst, Ph.D., MLIS
Stockholm University Library
Stockholm University
SE-106 91 Stockholm
Sweden
Tel: +46-8-16 29 50
Mobile: +46-72-1464702
E-mail: joakim.[email protected]
<docDscr>
<citation>
<titlStmt>
<titl>What’s in a name? : Sense and Reference in biodiversity information </titl>
<IDNo agency="DOI">doi:10.7910/DVN/BAMCSI</IDNo>
</titlStmt>
<distStmt>
<distrbtr>Harvard Dataverse</distrbtr>
<distDate>2017-01-12</distDate>
</distStmt>
<verStmt source="DVN">
<version date="2017-01-12" type="RELEASED">1</version>
</verStmt>
<biblCit>Philipson, Joakim, 2017, "What’s in a name? : Sense and Reference in
biodiversity information", doi:10.7910/DVN/BAMCSI, Harvard Dataverse, V1</biblCit>
</citation>
<stdyInfo>
<subject>
<keyword>Medicine, Health and Life Sciences</keyword>
<keyword>Computer and Information Science</keyword>
<keyword vocab="casrai" URI="http://dictionary.casrai.org/Metadata"
>Metadata</keyword>
<keyword vocab="casrai" URI="http://dictionary.casrai.org/PID_system">PID
system</keyword>
<keyword vocab="wikipedia" URI="https://en.wikipedia.org/wiki/Biodiversity"
>Biodiversity</keyword>
<keyword vocab="smw-rda" URI="http://smw-rda.esc.rzg.mpg.de/index.php/Taxonomy"
>Taxonomy</keyword>
</subject>
<abstract>"That which we call a rose by any other name would smell as sweet.”
Shakespeare has Juliet tell her Romeo that a name is just a convention without
meaning, what counts is the reference, the 'thing itself', to which the property of
smelling sweet pertains alone. Frege in his classical paper “Über Sinn und
Bedeutung” was not so sure, he assumed names can be inherently meaningful, even
without a known reference. And Wittgenstein later in Philosophical Investigations
(PI) seems to deny the sheer arbitrariness of names and reject looking for meaning
out of context, by pointing to our inability to just utter some random sounds and by
that really implying e.g. the door. The word cannot simply be separated from its
meaning, in the same way as the money from the cow that could be bought for them (PI
120). Scientific names of biota, in particular, are often descriptive of properties
pertaining to the organism or species itself. On the other hand, in semantic web
technology and Linked Open Data (LOD) there is an overall effort to replace names by
their references, in the form of web links or Uniform Resource Identifiers (URIs).
“Things, not strings” is the motto. But, even in view of the many "challenges with
using names to link digital biodiversity information" that were extensively
described in a recent paper, would it at all be possible or even desirable to
replace scientific names of biota with URIs? Or would it be sufficient to just
identify equivalence relationships between different variants of names of the same
biota, having the same reference, and then just link them to the same “thing”, by
means of a property sameAs(URI)? The Global Names Architecture (GNA) has a resolver
of scientific names that is already doing that kind of work, linking names of biota
such as Pinus thunbergii to global identifiers and URIs from other data sources,
such as Encyclopedia of Life (EOL) and uBio Namebank. But there may be other
challenges with going from a “natural language”, even from a not entirely coherent
system of scientific names, to a semantic web ontology, a solution to some of which
have been proposed recently by means of so called 'lexical bridges'.</abstract>
<sumDscr/>
<contact affiliation="Stockholm University" email="[email protected]"
>Philipson, Joakim</contact>
<depositr>Philipson, Joakim</depositr>
<depDate>2017-01-12</depDate>
</stdyInfo>
<xs:complexType name="keywordType" mixed="true">
<xs:complexContent>
<xs:extension base="simpleTextType">
<xs:attribute name="vocab" type="xs:string"/>
<xs:attribute name="vocabURI" type="xs:string"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>
<sumDscr/>
<contact affiliation="Stockholm University" email="[email protected]"
>Philipson, Joakim</contact>
<!-- In codebook: -->
<xs:complexType name="sumDscrType">
<xs:complexContent>
<xs:extension base="baseElementType">
<xs:sequence>
<xs:element ref="timePrd" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="collDate" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="nation" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="geogCover" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="geogUnit" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="geoBndBox" minOccurs="0"/>
<xs:element ref="boundPoly" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="anlyUnit" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="universe" minOccurs="0" maxOccurs="unbounded"/>
<xs:element ref="dataKind" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:extension>
</xs:complexContent>
</xs:complexType>
<xs:element name="sumDscr" type="sumDscrType">
<xs:annotation>
<xs:documentation>
<xhtml:div>
<xhtml:h1 class="element_title">Summary Data Description</xhtml:h1>
<xhtml:div>
<xhtml:h2 class="section_header">Description</xhtml:h2>
<xhtml:div class="description">Information about the and geographic coverage of the study and unit of analysis.</xhtml:div>
</xhtml:div>
</xhtml:div>
</xs:documentation>
</xs:annotation>
</xs:element>
<useStmt>CC0 Waiver</useStmt>
Thanks @jomtov for moving this issue from our support system!
I thought it might be helpful to give some background on the issue, list what might need to change when the DDI xml is made valid, and describe the errors.
As background for anyone else interested, the DDI xml that Dataverse generates for each dataset (and datafile) needs to follow DDI's schema, so that other repositories and applications using DDI xml can use it (e.g. during harvesting).
To answer jomtov's question, I think Dataverse's xml would need to be corrected. After fixing the errors and making sure the XML is valid, these are what I imagine will need to be adjusted:
There are five errors here, described in the dataverse_1062_philipsonErrorTypes.txt file in jomtov's post:
1. DDI schema doesn't like "DVN" as a value for source in <verStmt source="DVN">.
Only "archive" and "producer" are allowed as values.
2. DDI schema doesn't like the URI attribute being called "URI":
_Attribute 'URI' is not allowed to appear in element 'keyword'._
As jomtov points out, the keyword URI is called vocabURI in Dataverse. Unless there's a reason why it's called URI in the DDI XML, I think this is as easy as changing "URI" to "vocabURI", which is okay with the schema.
<keyword vocab="term" vocabURI="http://vocabulary.org/">Metadata</keyword>
3. DDI schema doesn't like where "contact" info is placed:
<sumDscr/>
<contact affiliation="A University" email="[email protected]">Name</contact>
_Invalid content was found starting with element '{"ddi:codebook:2_5":contact}'. One of '{"ddi:codebook:2_5":sumDscr, "ddi:codebook:2_5":qualityStatement, "ddi:codebook:2_5":notes, "ddi:codebook:2_5":exPostEvaluation}' is expected._
The DDI schema says that sumDscr shouldn't hold things like contact info. The contact element should be under useStmt:
<useStmt>
...
<contact affiliation="A University" email="[email protected]">Name</contact>
...
</useStmt>
4 and 5. DDI schema doesn't like <useStmt> being followed by a value, here the value being the license:
<useStmt>CC0 Waiver</useStmt>
Two of the elements that can be nested under <useStmt> are <restrctn> and <conditions>. Either element seems appropriate to me for holding license info. The schema's descriptions of the two elements make <conditions> sound like a catchall and <restrctn> sound like the primary element to use. However, ICPSR uses <conditions> for license-like info.
Lastly, this isn't one of the five errors reported, but DDI likes <dataAccs> a level under <useStmt>. (Right now it's a level under <stdyDscr>.) So the following change should fix these errors:
<useStmt>
<dataAccs>
<conditions>CC0 Waiver</conditions>
<contact>...</contact>
</dataAccs>
</useStmt>
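The error types described above can also be checked mechanically before running a full schema validation. The following is a minimal sketch, not Dataverse code: it uses only Python's standard library, the function name `find_known_problems` and the sample fragment are made up for illustration, and it looks only for the specific problems named in this comment rather than validating against codebook.xsd.

```python
import xml.etree.ElementTree as ET

# Enumeration quoted in the validator messages for the 'source' attribute.
ALLOWED_SOURCES = {"archive", "producer"}


def local(tag):
    """Strip a '{namespace}' prefix so the check works with or without xmlns."""
    return tag.rsplit("}", 1)[-1]


def find_known_problems(xml_text):
    """Return findings for the error types discussed in this thread (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    findings = []
    for parent in root.iter():
        for child in parent:
            name = local(child.tag)
            source = child.get("source")
            if name == "verStmt" and source is not None and source not in ALLOWED_SOURCES:
                findings.append(f"verStmt source={source!r} is not one of {sorted(ALLOWED_SOURCES)}")
            if name == "keyword" and "URI" in child.attrib:
                findings.append("keyword uses attribute 'URI' instead of 'vocabURI'")
            if name == "contact" and local(parent.tag) == "stdyInfo":
                findings.append("contact is a direct child of stdyInfo")
            if name == "useStmt" and (child.text or "").strip():
                findings.append("useStmt contains text, but its content is element-only")
    return findings


# Illustrative fragment assembled from the excerpts in this issue.
SAMPLE = """<stdyDscr>
  <citation><verStmt source="DVN"><version>1</version></verStmt></citation>
  <stdyInfo>
    <subject><keyword vocab="casrai" URI="http://dictionary.casrai.org/Metadata">Metadata</keyword></subject>
    <sumDscr/>
    <contact affiliation="A University">Name</contact>
  </stdyInfo>
  <useStmt>CC0 Waiver</useStmt>
</stdyDscr>"""

for finding in find_known_problems(SAMPLE):
    print(finding)
```

A check like this only catches regressions of the known problems; it is no substitute for validating against the schema itself.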
There may be more validation errors (since these two datasets have only some of all possible metadata). @raprasad and I talked yesterday about trying to validate all (or a greater number?) of Harvard Dataverse's DDI XML to find additional errors and make sure the DDI XML is always valid.
There was also some discussion about when and how Dataverse validates the DDI it generates, and making sure that process is working.
@jomtov would you be able to tell us what tools you're using to validate against a DDI 2.5 schema? I documented how to validate against DDI 2.0 using MSV (Multi Schema Validator) at http://guides.dataverse.org/en/4.6/developers/tools.html#msv but I seem to recall that DDI 2.5 is more complicated and requires multiple schema files or something. I don't think I ever figured out how to use MSV to validate DDI 2.5. Do you use some other tool? Any tips for me? Thanks!
@pdurbin, I used the schema found in the schemaLocation of the exported xml-files of the item examples above:
<codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd"
version="2.5">
in oXygen xml-editor 18 with Xerces validation engine.
I don't think you need to invoke multiple schemas here, the errortypes are clearly described and have corresponding entries in the codebook.xsd 2.5-schema.
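For anyone who wants to find out which schema an exported file declares, the `xsi:schemaLocation` attribute can be read programmatically. This is a small sketch using only Python's standard library; the helper name `schema_locations` is hypothetical, and the sample root element is the one quoted above, closed as an empty element so that it parses on its own.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def schema_locations(xml_text):
    """Map each namespace named in xsi:schemaLocation to its schema URL."""
    root = ET.fromstring(xml_text)
    tokens = root.get(f"{{{XSI}}}schemaLocation", "").split()
    # The attribute value is a whitespace-separated list of (namespace, location) pairs.
    return dict(zip(tokens[0::2], tokens[1::2]))


EXPORT = """<codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd"
  version="2.5"/>"""

print(schema_locations(EXPORT))
```

The returned URL is the file a validator would need to load, together with any schemas it imports.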
Ah, thanks @jomtov. Judging from its Wikipedia page, the Oxygen XML Editor is not free and open source. Bummer.
In a491cd9 I just pushed some code to demonstrate the difficulty I've seen in validating against that codebook.xsd file you mentioned, which I checked into the code base long ago when I first attempted (and failed) to get Dataverse to validate the DDI 2.5 it exports.
The failing Travis build from that commit demonstrates the error I'm seeing:
Tests in error:
testValidateXml(edu.harvard.iq.dataverse.util.xml.XmlValidatorTest): src-resolve: Cannot resolve the name 'xml:lang' to a(n) 'attribute declaration' component.
That's from https://travis-ci.org/IQSS/dataverse/builds/208627544#L3805
Does anyone have any idea how to fix this test? Here's the line that's failing: https://github.com/IQSS/dataverse/blob/a491cd941493f498c320dc79f35d430e623710c8/src/test/java/edu/harvard/iq/dataverse/util/xml/XmlValidatorTest.java#L26
Well, @pdurbin, https://www.corefiling.com/opensource/schemaValidate.html (also on GitHub) is a free online xml validator that seems to work anyway. I uploaded the codebook.xsd and one of the erroneous export items from above and validated them. They are attached here as .txt files, since GitHub does not support .xsd and .xml attachments, and need to be converted back before use:
codebook.txt
dataverse_1062_Philipson_newexp2DDIcb.txt
True, the validator did not find some of the other referenced schemas, but they are not relevant here, and all the specific codebook.xsd validation errors seem to be identified anyway (scrolling down in the results):
Validation 1, 504 cvc-enumeration-valid: Value 'DVN' is not facet-valid with respect to enumeration '[archive, producer]'. It must be a value from the enumeration.
Validation 1, 504 cvc-attribute.3: The value 'DVN' of attribute 'source' on element 'verStmt' is not valid with respect to its type, '#AnonType_sourceGLOBALS'.
Validation 1, 1314 cvc-complex-type.3.2.2: Attribute 'URI' is not allowed to appear in element 'keyword'.
Validation 1, 1402 cvc-complex-type.3.2.2: Attribute 'URI' is not allowed to appear in element 'keyword'.
Validation 1, 1498 cvc-complex-type.3.2.2: Attribute 'URI' is not allowed to appear in element 'keyword'.
Validation 1, 1600 cvc-complex-type.3.2.2: Attribute 'URI' is not allowed to appear in element 'keyword'.
Validation 1, 3918 cvc-complex-type.2.4.a: Invalid content was found starting with element 'contact'. One of '{"ddi:codebook:2_5":sumDscr, "ddi:codebook:2_5":qualityStatement, "ddi:codebook:2_5":notes, "ddi:codebook:2_5":exPostEvaluation}' is expected.
Validation 1, 4071 cvc-complex-type.2.4.a: Invalid content was found starting with element 'useStmt'. One of '{"ddi:codebook:2_5":method, "ddi:codebook:2_5":dataAccs, "ddi:codebook:2_5":othrStdyMat, "ddi:codebook:2_5":notes}' is expected.
Validation 1, 4091 cvc-complex-type.2.3: Element 'useStmt' cannot have character [children], because the type's content type is element-only.
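With a larger set of exports, a report like the one above becomes easier to read if it is grouped by rule code. The sketch below assumes each line has the shape shown in this comment ("Validation", two numbers, a `cvc-` code, then the message); the meaning of the two numbers is not stated by the tool output quoted here, so they are kept as opaque values. `parse_report` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed line shape: "Validation <n>, <n> <cvc-code>: <message>"
LINE_RE = re.compile(r"^Validation\s+(\d+),\s*(\d+)\s+(cvc-[\w.\-]+):\s*(.*)$")


def parse_report(report):
    """Yield (first_number, second_number, rule_code, message) for each matching line."""
    for line in report.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            first, second, code, message = match.groups()
            yield int(first), int(second), code, message


REPORT = """\
Validation 1, 504 cvc-enumeration-valid: Value 'DVN' is not facet-valid with respect to enumeration '[archive, producer]'. It must be a value from the enumeration.
Validation 1, 1314 cvc-complex-type.3.2.2: Attribute 'URI' is not allowed to appear in element 'keyword'.
Validation 1, 1402 cvc-complex-type.3.2.2: Attribute 'URI' is not allowed to appear in element 'keyword'.
Validation 1, 4091 cvc-complex-type.2.3: Element 'useStmt' cannot have character [children], because the type's content type is element-only.
"""

# Count how often each rule code occurs in the report.
counts = Counter(code for _, _, code, _ in parse_report(REPORT))
for code, n in counts.most_common():
    print(n, code)
```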
Maybe this could be useful?
@jomtov thanks for the pointer to https://www.corefiling.com/opensource/schemaValidate.html which I just tried. It seems to work great. It's perfect for one-off validation of an XML file against a schema. To be clear, what I was trying to say in https://github.com/IQSS/dataverse/issues/3648#issuecomment-284756817 is that I'd like to teach Dataverse itself to validate XML against a schema. It works for DDI 2.0 but not DDI 2.5. I still don't understand why. For the Java developers reading this, a491cd9 is the commit I made the other day.
Hello,
I tried to validate two items exported to DDI from dataverse.harvard.edu with codebook.xsd (2.5) and got the same types of validation errors described below for item1 (below the line, should work as a well-formed xml-file):
Item 1: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/BAMCSI (direct link to dataset's DDI xml)
Hi @jomtov. Here's the corrected DDI xml for the first dataset: valid_DDIXMLforItem1.zip. At first I misinterpreted the errors you posted, but I've got it down now. It's valid as far as I can tell. The online tool you mentioned keeps timing out for me. When you get the chance, could you check to see if the corrected DDI xml is valid with the tool you use?
A while back @pdurbin posted a DDI xml file for a dataset with most of the metadata fields that Dataverse exports. That file and the corrected file (validated with "topic classification" included) are here: invalid_and_valid_DDIxml.zip. Most of the corrections were just moving elements around in the xml, but some involved changing which elements the fields go into, or how many times an element can be repeated. For example, CC0 (or what's entered into Terms of Use if CC0 isn't chosen) can't go into useStmt, since that element doesn't take a value; it takes only other elements, and license metadata doesn't fit in those subelements. I moved it to the copyright element, where ICPSR and ADA put their license metadata. These changes mean:
I'd like to rename this issue to something like "Make Dataverse produce valid DDI codebook 2.5 xml", which would involve "teaching Dataverse itself to validate" DDI xml against the codebook 2.5 schema.
@jomtov are you ok with renaming this issue as @jggautier suggests?
@pdurbin and @jggautier, Yes, I am OK with the renaming suggested. (Sorry for belated answer, been on vacation off-line for a while.) Keep up the good work!
The xml files in my earlier comment (ZIP file) don't have most of the metadata in the Terms tab, so the corrections don't take that metadata into account. Current exported DDI from Dataverse has most of the Terms metadata in the right DDI element, but just in the wrong place in the xml.
The exception is the Terms of Access metadata field - whatever's entered there is exported to DDI's dataAccs element, which shouldn't take a value (like the useStmt problem in my earlier comment). The Terms of Access field deals with file level restrictions, which may be handled differently with the upcoming work on DataTags integration, so work may need to be done to map file-level terms and access metadata to DDI.
I wrote a doc describing what I think are most of the mapping changes needed: https://drive.google.com/open?id=1ICXRL8DP5fCGYiRyRphh_3OotNaWOOak1VmnyufBNsM
I'm pointing our ADA friends to this issue and doc, especially the part about the Terms metadata, since I think the invalid mapping has complicated their own work mapping ADA's DDI to Dataverse's for their planned migration.
I rewrote the XML validator in Dataverse and now have a test to validate XML we send to DataCite (it operates on a static file) and I added a FIXME to use the validator with DDI as well: https://github.com/IQSS/dataverse/blob/825332bef8fbb2de23b6fe0fe261ae0bc173194d/src/test/java/edu/harvard/iq/dataverse/util/xml/XmlValidatorTest.java#L22
There was a recent PR submitted, related to codebooks, _739 html codebook #6081_.
In the document about making Dataverse's DDI XML valid, I added a section about how the XML becomes invalid when depositors enter double quotes in some of Dataverse's fields (specifically any field mapped to an element attribute, e.g. Author affiliation).
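To illustrate the double-quote problem: if an attribute value is placed into the XML by plain string concatenation (an assumption made here for illustration; how Dataverse's exporter actually builds attributes is not shown in this thread), a quote in the field ends the attribute early and the document is no longer well-formed. The sketch below uses Python's standard library only, and `contact_element` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def contact_element(affiliation, name):
    """Build a <contact> element as text, escaping both attribute value and content."""
    # quoteattr returns the value already wrapped in quotes that are safe for it.
    return f"<contact affiliation={quoteattr(affiliation)}>{escape(name)}</contact>"


affiliation = 'The "Best" University'

# Naive concatenation: the inner double quotes terminate the attribute early.
naive = f'<contact affiliation="{affiliation}">Name</contact>'
try:
    ET.fromstring(naive)
    naive_ok = True
except ET.ParseError:
    naive_ok = False

# Escaped version: parses, and the original value round-trips unchanged.
safe = contact_element(affiliation, "Name")
round_tripped = ET.fromstring(safe).get("affiliation")

print(naive_ok, round_tripped == affiliation)
```

The same idea applies in any language: let the XML library write attribute values instead of assembling them as strings.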
I also updated the example valid DDI XML to use https in the schema location URL (https://github.com/IQSS/dataverse/issues/6553)
Can't believe it took me this long to realize and ask about it, but in 2017 @pdurbin wrote:
To be clear, what I was trying to say in #3648 (comment) is that I'd like to teach Dataverse itself to validate XML against a schema. It works for DDI 2.0 but not DDI 2.5. I still don't understand why.
By "it works for DDI 2.0," does that mean that Dataverse's DDI exports validate against the DDI Codebook 2.0 schema (or used to validate against the 2.0 schema back in 2017)? If so, should the DDI exports be pointing to the 2.0 schema location instead of the 2.5 schema location?
@jggautier in a491cd9 I had a test in Dataverse that validates against the DDI 2.0 Codebook Schema...

... but that was in a branch called "3648-ddi-2.5-validation" that was never merged. It looks like I wrote about this a bit at https://github.com/IQSS/dataverse/issues/3648#issuecomment-284756817 and please see also the code comments above in the screenshot.
This is an excerpt from an e-mail I sent on Jan. 26, 2020, to @scolapasta after the European Dataverse workshop in Tromsø in January 2020.
_Here just a few references to the issues I mentioned then:_
https://github.com/IQSS/dataverse/issues/3648#issuecomment-315192962
_tried again with DDI md export of_ https://doi.org/10.7910/DVN/YLWCSU
_and this one_: https://doi.org/10.7910/DVN/F6OLFG
http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd
_These are not exactly the same errors as in the above issue, but they are still instances of possibly unnecessary or avoidable errors._
@jomtov, some of the easier-to-fix validation errors, including most of the five discussed in this comment, are being addressed as part of an effort to improve exporting DDI from one Dataverse repository and importing that DDI into another Dataverse repository (#6669), although it doesn't fix everything. The doc I shared a few years ago addresses all of the errors and details solutions for some of the not so easy to fix errors. I plan to update that doc once the changes in #6669 go live.
@jggautier, Great! Appreciate your efforts.
Since this issue is from 2017, most of the 5 issues Julian mentioned were already fixed by https://github.com/IQSS/dataverse/issues/6650. I'll go over them below and note which one I'm fixing.
1. DDI schema doesn't like "DVN" as a value for source in <verStmt source="DVN">.
Only "archive" and "producer" are allowed as values.
I changed the one instance of "DVN" into the default value "producer".
2. DDI schema doesn't like the URI attribute being called "URI":
_Attribute 'URI' is not allowed to appear in element 'keyword'._
This has already been fixed and merged.
3. DDI schema doesn't like where "contact" info is placed:
<sumDscr/> <contact affiliation="A University" email="[email protected]">Name</contact>
This has already been fixed and merged.
4 and 5. DDI schema doesn't like <useStmt> being followed by a value, here the value being the license:
<useStmt>CC0 Waiver</useStmt>
This has already been fixed and merged. Dataverse now puts the license info in the notes element and not in the useStmt element:

Lastly, this isn't one of the five errors reported, but DDI likes <dataAccs> a level under <useStmt>. (Right now it's a level under <stdyDscr>.)
According to the codebook, <dataAccs> should be under <stdyDscr> so there's no problem:

I'll also go over https://github.com/IQSS/dataverse/issues/6650 and Julian's Google Doc to see if there are extra improvements I can make. I've tried online XSD validators, including the ones mentioned, but they all fail when using the 2.5 codebook XSD.
@jggautier
geoBndBox
I want to make changes to the Geographic Bounding Box data below, but how can I test it on my local Dataverse? I can't seem to add geo metadata.

distrbtr
Have you concluded what you want to do with the logo URL yet? Currently, it's in the role attribute but this attribute is invalid.
Thanks @JingMa87.
Just an FYI, I commented that since some of the issues have been fixed, I would update the Google Doc that addresses all of the validation errors and details solutions for some of the not so easy to fix errors. But I haven't found time to do that, so some of the problems described in the Google Doc have already been fixed.
geoBndBox
I want to make changes to the Geographic Bounding Box data below, but how can I test it on my local Dataverse? I can't seem to add geo metadata.
You're not able to add the geospatial metadatablock to your local Dataverse? Have you had a chance to look at the metadata customization sections in the admin guides?
distrbtr
Have you concluded what you want to do with the logo URL yet? Currently, it's in the role attribute but this attribute is invalid.
I opened https://github.com/IQSS/dataverse/issues/4428 about one problem with the distributor and producer logo fields being broken URLs. @pdurbin mentioned issues with logo URLs that use http and https. But the issue doesn't mention how storing the logo URLs in the DDI export invalidates the DDI.xml. The Google Doc does. One solution mentioned in https://github.com/IQSS/dataverse/issues/4428 would involve removing logo URL metadata from the DDI export entirely, which would fix the DDI validation issues it causes. But the work of thinking through that and other solutions hasn't been prioritized.
@jggautier
_All four points?_
To me the optionality of each point of a bounding box is strange since you need all four to make the box.

Moreover, DDI requires a bounding box element to have exactly one occurrence of every point. The default minOccurs and maxOccurs is 1 when not specified.

To me it makes more sense that the four points should always be filled in when the "Geographic Bounding Box" checkbox is ticked, so the four options in the UI for the points should be removed. But this depends, of course, on the people who upload this kind of data.

_Unlimited boxes?_
Also, the geoBndBox element can only occur once according to DDI. In Dataverse you can add as many as you want. Does it logically make sense to allow an unlimited number? Do researchers define multiple boxes?


_Conclusion_
So we might need both the filtering of the metadata in the database when making the DDI XML for legacy data, as well as a change in the UI for future data in order to comply with DDI.
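The filtering step for legacy data could look roughly like the sketch below. It rests on several assumptions: the child element names `westBL`, `eastBL`, `southBL` and `northBL` and their order are taken from my reading of the DDI Codebook schema, the input is a plain list of dictionaries rather than Dataverse's real data model, and picking the first complete box is just one possible policy. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed DDI child elements of geoBndBox, each required exactly once.
POINTS = ("westBL", "eastBL", "southBL", "northBL")


def exportable_box(boxes):
    """Return the first box that has all four points filled in, or None."""
    for box in boxes:
        if all(str(box.get(point, "")).strip() for point in POINTS):
            return {point: box[point] for point in POINTS}
    return None


def to_geo_bnd_box(box):
    """Serialize one complete box as a geoBndBox element."""
    element = ET.Element("geoBndBox")
    for point in POINTS:
        ET.SubElement(element, point).text = str(box[point])
    return ET.tostring(element, encoding="unicode")


# Illustrative legacy data: one incomplete box and one complete box.
boxes = [
    {"westBL": "10", "eastBL": "20"},
    {"westBL": "-72", "eastBL": "-70", "southBL": "41", "northBL": "43"},
]

chosen = exportable_box(boxes)
if chosen is not None:
    print(to_geo_bnd_box(chosen))
```

Incomplete boxes are skipped rather than padded, so the export never contains a geoBndBox with missing points.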
To me it makes more sense that the four points should always be filled in when the "Geographic Bounding Box" checkbox is ticked, so the four options in the UI for the points should be removed.
I think this is an excellent point but it probably deserves to be in its own issue (with the great screenshots). So @JingMa87 if you feel like creating one, please go ahead.