Dataverse: resourceType for dataset files

Created on 24 Sep 2018 · 11 Comments · Source: IQSS/dataverse

File DOIs from Dataverse are marked with "Dataset" in DataCite Fabrica, thus in the same way as dataset DOIs are; see this screenshot:

[Screenshot: file DOIs marked as "Dataset" in DataCite Fabrica]

According to @pdurbin (cf. this post in the Dataverse Google Group),

"Dataset" is coming from https://github.com/IQSS/dataverse/blob/v4.9.2/src/main/resources/edu/harvard/iq/dataverse/datacite_metadata_template.xml#L12 which is referenced from https://github.com/IQSS/dataverse/blob/v4.9.2/src/main/java/edu/harvard/iq/dataverse/DOIDataCiteRegisterService.java#L279 . As you can see, it's hard coded to "Dataset". You're saying that for files it should be something other than "Dataset", right? "File" or whatever. If so, can you please open a GitHub issue about this? We recently worked on this part of the code at https://github.com/IQSS/dataverse/pull/4795 for https://github.com/IQSS/dataverse/issues/4782 if you'd like to take a look.

I suggest that the metadata of files in Dataverse be changed, so that their DOIs show up not as "Dataset", but as "Dataset file" in DataCite Fabrica. I'm not sure which metadata field we should use for this. The DataCite metadata property resourceType, with its attribute resourceTypeGeneral, is mandatory, and I guess it is the value of resourceTypeGeneral that is reflected in DataCite Fabrica. But according to the DataCite Metadata Schema 4.0, resourceTypeGeneral can only contain the following controlled list values:

Audiovisual
Collection
Dataset
Event
Image
InteractiveResource
Model
PhysicalObject
Service
Software
Sound
Text
Workflow
Other

The list does not contain "Dataset file" or similar. So maybe we just have to specify the field ResourceType, which can contain any value. I suggest a general term like "File", which covers the parts of most types of datasets. Combined with resourceTypeGeneral, we then would get the following resource type description for dataset files:

Dataset/File

where Dataset = resourceTypeGeneral, and File = resourceType.
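
To make the suggestion concrete, here is a minimal Python sketch of the element a file DOI record might carry. It is not how Dataverse builds its DataCite XML (that goes through the template linked above); the helper names `build_resource_type` and `describe` are made up for illustration, and the DataCite XML namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Controlled values for resourceTypeGeneral, copied from the list above
# (DataCite Metadata Schema 4.0).
RESOURCE_TYPE_GENERAL = {
    "Audiovisual", "Collection", "Dataset", "Event", "Image",
    "InteractiveResource", "Model", "PhysicalObject", "Service",
    "Software", "Sound", "Text", "Workflow", "Other",
}


def build_resource_type(general: str, specific: str) -> ET.Element:
    """Hypothetical helper: build a <resourceType> element.

    `general` must come from the controlled list; `specific` is the
    free-text resourceType value.
    """
    if general not in RESOURCE_TYPE_GENERAL:
        raise ValueError(f"{general!r} is not a resourceTypeGeneral value")
    element = ET.Element("resourceType", {"resourceTypeGeneral": general})
    element.text = specific
    return element


def describe(element: ET.Element) -> str:
    """Hypothetical helper: render the combined 'General/Specific' label."""
    return f"{element.get('resourceTypeGeneral')}/{element.text}"


file_type = build_resource_type("Dataset", "File")
print(ET.tostring(file_type, encoding="unicode"))
print(describe(file_type))
```

Passing "Dataset file" as the general type raises a ValueError here, which mirrors the constraint described above: only the free-text part can carry a value like "File".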

Metadata

All 11 comments

I'm not sure whether I understand your question, @jggautier. But DataCite now displays all our files as datasets in the search engine; cf. . This search results in 1 041 datasets, but we only have 178 datasets. So the rest are files.

Hi @philippconzett. Did you mean this question?: "Are dataset and file metadata records already sent to EZID/DataCite being updated?"

I referenced this GitHub issue in that issue (#5060), which is about investigating whether EZID and DataCite are getting any new metadata that Dataverse sends (as Dataverse changes things like the resourceType values for files), and making sure that the existing metadata records that EZID and DataCite have are updated to reflect those changes. Please let me know if you have any questions.

But DataCite now displays all our files as datasets in the search engine

In the Google Group conversation I thought we were discussing only how the datasets and files were displayed in Fabrica. But here do you mean the list of resource types in DataCite Search?

[Screenshot, 2018-10-05]

Hi @jggautier, sorry for the confusion, but I think the display behavior in DataCite Fabrica and in DataCite Search are both based on the Resource type. But I'm not sure whether there is a Resource type = File (or Dataset File) in DataCite. I guess other data repository applications are also interested in getting their file DOIs viewed as files and not as datasets in both DataCite Fabrica and DataCite Search.

I agree that in DataCite Search, the resource type is based on the controlled vocab you listed, and there's nothing like file. I like your earlier suggestion:

So maybe we just have to specify the field ResourceType, which can contain any value. I suggest a general term like "File", which covers the parts of most types of datasets. Combined with resourceTypeGeneral, we then would get the following resource type description for dataset files:

Dataset/File

where Dataset = resourceTypeGeneral, and File = resourceType

As long as we don't get too semantic with the word "file," since I imagine some people might ask "what about archived files, like zip files, or things in datasets that are collections of files?" Would you say the value is in being able, in DataCite Search and Fabrica, to distinguish between and filter for datasets versus the things within datasets that have bytes?

We'll have to get DataCite involved, and their metadata team has been responsive during similar conversations about resourceType in their DataCite Metadata forum.

Would you mind writing them about this use case?

Thanks, @jggautier, I have raised this issue in the DataCite Metadata forum; see this posting.

I suggest distinguishing between what can be done with the DataCite Metadata Schema now, and how the metadata schema could be updated in the future (the next schema release for the end of 2018 is basically finalized, so that would be the second half of 2019 at the earliest).

With the current schema resourceTypeGeneral Dataset is the best fit, and you can add granularity via resourceType (which is a free text field). I like DataFile, but would also consider DataDownload, which is used in DCAT and schema.org: https://schema.org/DataDownload.
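
For comparison, this is roughly how a file could be described as a DataDownload in schema.org JSON-LD. It is a sketch only: the function name `dataset_with_download`, the DOIs (under the 10.5072 test prefix), and the file name are hypothetical, and it does not claim to show what Dataverse or DataCite actually emit.

```python
import json


def dataset_with_download(dataset_doi, file_doi, file_name, media_type):
    """Hypothetical helper: a schema.org Dataset with one DataDownload."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": f"https://doi.org/{dataset_doi}",
        # Each file is listed as a distribution of the dataset.
        "distribution": [
            {
                "@type": "DataDownload",
                "@id": f"https://doi.org/{file_doi}",
                "name": file_name,
                "encodingFormat": media_type,
            }
        ],
    }


# Placeholder identifiers, not real records.
record = dataset_with_download(
    "10.5072/FK2/EXAMPLE",
    "10.5072/FK2/EXAMPLE/FILE1",
    "data.csv",
    "text/csv",
)
print(json.dumps(record, indent=2))
```

In this shape the file is typed separately from the dataset, so a consumer could tell the two apart without needing a new resourceTypeGeneral value.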

@mfenner thanks for mentioning DataDownload, which seems like an emerging standard for providing the URLs to download individual files. Last week I wrote about it at https://github.com/whole-tale/whole-tale/issues/35#issuecomment-427411937 in the context of #4371.

I just noted that this issue is still discussed also by other users; cf. this thread in the Dataverse Google group.

I'd like to urge DataCite (@mfenner) to follow up on this issue. The current situation is quite unsatisfactory, as file metadata is confused with dataset metadata, resulting in, among other things, a proliferation of file metadata records in DataCite Search result lists and ORCID record search result lists.

Currently, DataCite (in DataCite Fabrica) offers the following values for Resource Type General:

[Screenshot: Resource Type General values offered in DataCite Fabrica]

For files within a dataset, I suggest we use _Dataset file_ or _Dataset part_ or _Part of Dataset_.

Thanks!

See also the discussion thread Granularity of datasets in the PID Forum.

@philippconzett you beat me to it, I was just about to post the link.
