Dataverse: Expose individual components of citation string in datafile queries

Created on 23 Nov 2018 · 5 Comments · Source: IQSS/dataverse

Currently it's difficult to extract information about a parent dataset when querying only a datafile. All the information is hidden in dataset_citation:

$ curl -s 'https://dataverse.harvard.edu/api/search?q=entityId:3040230' | jq '.data.items[0]'
{
  "name": "2017-07-31.tab",
  "type": "file",
  "url": "https://dataverse.harvard.edu/api/access/datafile/3040230",
  "file_id": "3040230",
  "published_at": "2017-07-31T22:27:23Z",
  "file_type": "Tab-Delimited",
  "file_content_type": "text/tab-separated-values",
  "size_in_bytes": 12025,
  "md5": "e7dd2f725941b978d45fed3f33ff640c",
  "checksum": {
    "type": "MD5",
    "value": "e7dd2f725941b978d45fed3f33ff640c"
  },
  "unf": "UNF:6:6wGE3C5ragT8A0qkpGaEaQ==",
  "dataset_citation": "Durbin, Philip, 2017, \"Open Source at Harvard\", https://doi.org/10.7910/DVN/TJCLKP, Harvard Dataverse, V2, UNF:6:6wGE3C5ragT8A0qkpGaEaQ== [fileUNF]"
}

Please consider exposing the dataset's name, global_id, and authors at the file level too.
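To illustrate the problem, here is a minimal sketch of what a client has to do today to recover dataset details from dataset_citation. The helper names are hypothetical and the regular expressions are assumptions based only on the one citation shown above, not on any documented citation format.

```python
import re

# Assumption: the citation embeds the dataset DOI as an https://doi.org/ URL.
DOI_URL = re.compile(r"https://doi\.org/(10\.[^\s,]+)")


def dataset_doi_from_citation(citation):
    """Return 'doi:10.xxxx/...' parsed from a citation string, or None."""
    match = DOI_URL.search(citation)
    return "doi:" + match.group(1) if match else None


def dataset_title_from_citation(citation):
    """Return the first double-quoted span, assumed to be the dataset title."""
    match = re.search(r'"([^"]*)"', citation)
    return match.group(1) if match else None


citation = ('Durbin, Philip, 2017, "Open Source at Harvard", '
            'https://doi.org/10.7910/DVN/TJCLKP, Harvard Dataverse, V2, '
            'UNF:6:6wGE3C5ragT8A0qkpGaEaQ== [fileUNF]')
print(dataset_doi_from_citation(citation))    # doi:10.7910/DVN/TJCLKP
print(dataset_title_from_citation(citation))  # Open Source at Harvard
```

This kind of parsing is fragile: a title that itself contains double quotes would be cut short, and the authors cannot be split reliably because names such as "Durbin, Philip" use the same comma that separates the other parts of the citation.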


Xarthisius

All 5 comments

@Xarthisius thanks for opening this issue. At the file level we are already indexing the title of its dataset under the field "parentName" and the DOI/PID of its dataset under "parentIdentifier" (and in URL form under "persistentUrl") so those two would be especially easy to add. As seen in the raw Solr output below, at the file level we don't currently index individual authors of the parent dataset.

      {
        "entityId":337,
        "dataverseVersionIndexedBy_s":"4.9.4",
        "identifier":"337",
        "persistentUrl":"https://doi.org/10.5072/FK2/CY07T3",
        "dvObjectType":"files",
        "fileNameWithoutExtension":["trees"],
        "fileName":["trees",
          "trees.png"],
        "name":"trees.png",
        "nameSort":"trees.png",
        "datasetVersionId":140,
        "fileAccess":["Public"],
        "isHarvested":false,
        "metadataSource":"Root",
        "dateSort":"2018-11-02T14:07:57.497Z",
        "dateFriendly":"Nov 2, 2018",
        "publicationStatus":["Published"],
        "publicationDate":"2018",
        "dsPublicationDate":"2018",
        "id":"datafile_337",
        "fileTypeDisplay":"PNG Image",
        "fileContentType":"image/png",
        "fileType":["PNG Image",
          "Image"],
        "fileTypeGroupFacet":"Image",
        "fileSizeInBytes":8361,
        "fileMd5":"0386269a5acb2c57b4eade587ff4db64",
        "fileChecksumType":"MD5",
        "fileChecksumValue":"0386269a5acb2c57b4eade587ff4db64",
        "description":"",
        "fileDescription":"",
        "filePersistentId":"doi:10.5072/FK2/CY07T3/XNNKK5",
        "subtreePaths":["/335"],
        "parentId":"336",
        "parentIdentifier":"doi:10.5072/FK2/CY07T3",
        "parentCitation":"Finch, Fiona, 2018, \"Darwin's Finches\", https://doi.org/10.5072/FK2/CY07T3, Root, DRAFT VERSION",
        "parentName":"Darwin's Finches",
        "_version_":1616031442019549184}

https://github.com/IQSS/dataverse/blob/v4.9.4/scripts/search/query is the script I use to see the raw Solr output above.

pdurbin on 26 Nov 2018

Adding a parentIdentifier would go a long way. All the other entries that I mentioned could be derived with a second query using it.
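A rough sketch of how that second query could be built once the dataset's persistent identifier is available. It assumes the native API's dataset lookup at /api/datasets/:persistentId/ with a persistentId query parameter; that endpoint is not mentioned in this thread, so treat the path as an assumption. dataset_lookup_url is a hypothetical helper name.

```python
from urllib.parse import urlencode


def dataset_lookup_url(base_url, persistent_id):
    """Build the URL for looking up a dataset by its persistent identifier.

    Assumption: the installation exposes /api/datasets/:persistentId/ and
    accepts the identifier in the 'persistentId' query parameter.
    """
    query = urlencode({"persistentId": persistent_id})
    return f"{base_url.rstrip('/')}/api/datasets/:persistentId/?{query}"


print(dataset_lookup_url("https://dataverse.harvard.edu",
                         "doi:10.7910/DVN/TJCLKP"))
```

The function only builds the URL, so it can be checked without network access; the result can then be fetched with curl or any HTTP client.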

Xarthisius on 26 Nov 2018


After some further thought: filePersistentId would also be very useful.

Xarthisius on 27 Nov 2018


@Xarthisius thanks. We talked about this issue during sprint planning yesterday and gave it a size of "2". While it's not in the next sprint, it's relatively high up in the "Ready" column. For details, please see https://waffle.io/IQSS/dataverse

pdurbin on 29 Nov 2018


@Xarthisius I've added JSON entries for file_persistent_id, dataset_name, dataset_id, and dataset_persistent_id. Hope this helps!

curl -s 'http://localhost:8080/api/search?q=entityId:50' | jq '.'
{
  "status": "OK",
  "data": {
    "q": "entityId:50",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "20180126_180209.jpg",
        "type": "file",
        "url": "http://localhost:8080/api/access/datafile/50",
        "file_id": "50",
        "published_at": "2019-02-11T21:13:22Z",
        "file_type": "JPEG Image",
        "file_content_type": "image/jpeg",
        "size_in_bytes": 4123458,
        "md5": "e708c4e02dbf1367cd4b0b099836d485",
        "checksum": {
          "type": "MD5",
          "value": "e708c4e02dbf1367cd4b0b099836d485"
        },
        "file_persistent_id": "doi:10.5072/FK2/PCCHV7",
        "dataset_name": "resr",
        "dataset_id": "49",
        "dataset_persistent_id": "doi:10.5072/FK2/6ZQAE9",
        "dataset_citation": "Admin, Dataverse, 2019, \"resr\", https://doi.org/10.5072/FK2/6ZQAE9, Root, V1"
      }
    ],
    "count_in_response": 1
  }
}
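With these entries in place, a client no longer needs to parse the citation string. The following is a minimal sketch of reading them from a search response shaped like the one above; parent_dataset_info is a hypothetical helper, and returning None for missing keys is an assumption about how to cope with installations that do not expose the new entries yet.

```python
def parent_dataset_info(item):
    """Collect the dataset-level entries from one file item of a search response.

    Returns None for non-file items. Keys absent from the item map to None.
    """
    if item.get("type") != "file":
        return None
    keys = ("dataset_name", "dataset_id",
            "dataset_persistent_id", "dataset_citation")
    return {key: item.get(key) for key in keys}


# Abbreviated copy of the response shown above.
response = {
    "status": "OK",
    "data": {
        "items": [
            {
                "name": "20180126_180209.jpg",
                "type": "file",
                "file_persistent_id": "doi:10.5072/FK2/PCCHV7",
                "dataset_name": "resr",
                "dataset_id": "49",
                "dataset_persistent_id": "doi:10.5072/FK2/6ZQAE9",
            }
        ]
    },
}

for item in response["data"]["items"]:
    info = parent_dataset_info(item)
    if info is not None:
        print(item["name"], "->", info["dataset_persistent_id"])
```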