Currently it's difficult to extract information about the parent dataset when querying only a datafile. All of that information is buried in the free-text dataset_citation field:
$ curl -s "https://dataverse.harvard.edu/api/search?q=entityId:3040230" | jq '.data.items[0]'
{
"name": "2017-07-31.tab",
"type": "file",
"url": "https://dataverse.harvard.edu/api/access/datafile/3040230",
"file_id": "3040230",
"published_at": "2017-07-31T22:27:23Z",
"file_type": "Tab-Delimited",
"file_content_type": "text/tab-separated-values",
"size_in_bytes": 12025,
"md5": "e7dd2f725941b978d45fed3f33ff640c",
"checksum": {
"type": "MD5",
"value": "e7dd2f725941b978d45fed3f33ff640c"
},
"unf": "UNF:6:6wGE3C5ragT8A0qkpGaEaQ==",
"dataset_citation": "Durbin, Philip, 2017, \"Open Source at Harvard\", https://doi.org/10.7910/DVN/TJCLKP, Harvard Dataverse, V2, UNF:6:6wGE3C5ragT8A0qkpGaEaQ== [fileUNF]"
}
Please consider exposing the dataset's name, global_id, and authors at the file level too.
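To illustrate why this is awkward today, here is a minimal sketch of the kind of workaround a client has to resort to: scraping the title and DOI out of dataset_citation with regular expressions. The helper name `parse_citation` and both patterns are assumptions made for illustration, not part of Dataverse, and the patterns are heuristic (a title that itself contains double quotes would be cut short, for example).

```python
import re

# Sample value copied from the search result above.
citation = (
    'Durbin, Philip, 2017, "Open Source at Harvard", '
    'https://doi.org/10.7910/DVN/TJCLKP, Harvard Dataverse, V2, '
    'UNF:6:6wGE3C5ragT8A0qkpGaEaQ== [fileUNF]'
)


def parse_citation(citation):
    """Heuristically pull the dataset title and DOI out of a citation string.

    Hypothetical helper: it assumes the title is the first double-quoted
    span and that the DOI appears as an https://doi.org/ URL.
    """
    title = re.search(r'"(.+?)"', citation)
    doi = re.search(r'https://doi\.org/([^,\s]+)', citation)
    return {
        "title": title.group(1) if title else None,
        "doi": "doi:" + doi.group(1) if doi else None,
    }


print(parse_citation(citation))
```

Authors cannot be recovered reliably this way, since the comma-separated author list is indistinguishable from the other comma-separated parts of the citation.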
@Xarthisius thanks for opening this issue. At the file level we already index the title of the parent dataset under the field "parentName" and the DOI/PID of the parent dataset under "parentIdentifier" (and in URL form under "persistentUrl"), so those two would be especially easy to add. As seen in the raw Solr output below, at the file level we don't currently index the individual authors of the parent dataset.
{
"entityId":337,
"dataverseVersionIndexedBy_s":"4.9.4",
"identifier":"337",
"persistentUrl":"https://doi.org/10.5072/FK2/CY07T3",
"dvObjectType":"files",
"fileNameWithoutExtension":["trees"],
"fileName":["trees",
"trees.png"],
"name":"trees.png",
"nameSort":"trees.png",
"datasetVersionId":140,
"fileAccess":["Public"],
"isHarvested":false,
"metadataSource":"Root",
"dateSort":"2018-11-02T14:07:57.497Z",
"dateFriendly":"Nov 2, 2018",
"publicationStatus":["Published"],
"publicationDate":"2018",
"dsPublicationDate":"2018",
"id":"datafile_337",
"fileTypeDisplay":"PNG Image",
"fileContentType":"image/png",
"fileType":["PNG Image",
"Image"],
"fileTypeGroupFacet":"Image",
"fileSizeInBytes":8361,
"fileMd5":"0386269a5acb2c57b4eade587ff4db64",
"fileChecksumType":"MD5",
"fileChecksumValue":"0386269a5acb2c57b4eade587ff4db64",
"description":"",
"fileDescription":"",
"filePersistentId":"doi:10.5072/FK2/CY07T3/XNNKK5",
"subtreePaths":["/335"],
"parentId":"336",
"parentIdentifier":"doi:10.5072/FK2/CY07T3",
"parentCitation":"Finch, Fiona, 2018, \"Darwin's Finches\", https://doi.org/10.5072/FK2/CY07T3, Root, DRAFT VERSION",
"parentName":"Darwin's Finches",
"_version_":1616031442019549184}
https://github.com/IQSS/dataverse/blob/v4.9.4/scripts/search/query is the script I use to see the raw Solr output above.
Adding parentIdentifier would go a long way. All the other entries that I mentioned could be derived with a second query that uses it.
After some further thought: filePersistentId would also be very useful.
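As a rough sketch of what that second query could look like, the snippet below builds a dataset lookup URL from a parentIdentifier value. It assumes the native API's lookup-by-persistent-ID endpoint (`/api/datasets/:persistentId/?persistentId=...`); the function name `dataset_lookup_url` is hypothetical, and the base URL is just the localhost address used elsewhere in this thread. No request is sent here, only the URL is constructed.

```python
from urllib.parse import urlencode


def dataset_lookup_url(base_url, persistent_id):
    """Build the URL for a follow-up dataset query (hypothetical helper).

    Assumes the native API accepts the PID as a `persistentId` query
    parameter; urlencode percent-encodes the ':' and '/' inside the DOI.
    """
    query = urlencode({"persistentId": persistent_id})
    return f"{base_url.rstrip('/')}/api/datasets/:persistentId/?{query}"


# parentIdentifier value taken from the raw Solr output above.
print(dataset_lookup_url("http://localhost:8080", "doi:10.5072/FK2/CY07T3"))
```

The dataset JSON returned by such a request would then be the place to look up the title and authors.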
@Xarthisius thanks. We talked about this issue during sprint planning yesterday and gave it a size of "2". While it's not in the next sprint, it's relatively high up in the "Ready" column. For details, please see https://waffle.io/IQSS/dataverse
@Xarthisius I've added JSON entries for file_persistent_id, dataset_name, dataset_id, and dataset_persistent_id. Hope this helps!
curl -s "http://localhost:8080/api/search?q=entityId:50" | jq '.'
{
"status": "OK",
"data": {
"q": "entityId:50",
"total_count": 1,
"start": 0,
"spelling_alternatives": {},
"items": [
{
"name": "20180126_180209.jpg",
"type": "file",
"url": "http://localhost:8080/api/access/datafile/50",
"file_id": "50",
"published_at": "2019-02-11T21:13:22Z",
"file_type": "JPEG Image",
"file_content_type": "image/jpeg",
"size_in_bytes": 4123458,
"md5": "e708c4e02dbf1367cd4b0b099836d485",
"checksum": {
"type": "MD5",
"value": "e708c4e02dbf1367cd4b0b099836d485"
},
"file_persistent_id": "doi:10.5072/FK2/PCCHV7",
"dataset_name": "resr",
"dataset_id": "49",
"dataset_persistent_id": "doi:10.5072/FK2/6ZQAE9",
"dataset_citation": "Admin, Dataverse, 2019, \"resr\", https://doi.org/10.5072/FK2/6ZQAE9, Root, V1"
}
],
"count_in_response": 1
}
}
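With these fields in place, a client no longer needs to parse dataset_citation. The sketch below reads them from a trimmed copy of the response above; the helper `parent_info` and the `PARENT_KEYS` tuple are names made up for this example, and `.get()` is used on the assumption that older servers may not return the new keys.

```python
import json

# Trimmed copy of the search response shown above.
response_text = """
{
  "status": "OK",
  "data": {
    "items": [
      {
        "name": "20180126_180209.jpg",
        "type": "file",
        "file_id": "50",
        "file_persistent_id": "doi:10.5072/FK2/PCCHV7",
        "dataset_name": "resr",
        "dataset_id": "49",
        "dataset_persistent_id": "doi:10.5072/FK2/6ZQAE9"
      }
    ]
  }
}
"""

PARENT_KEYS = (
    "file_persistent_id",
    "dataset_name",
    "dataset_id",
    "dataset_persistent_id",
)


def parent_info(search_response):
    """Collect the new parent-dataset fields for each file item."""
    results = []
    for item in search_response["data"]["items"]:
        if item.get("type") != "file":
            continue  # datasets and dataverses do not carry these keys
        # Missing keys become None instead of raising KeyError.
        results.append({key: item.get(key) for key in PARENT_KEYS})
    return results


info = parent_info(json.loads(response_text))
print(info[0]["dataset_name"], info[0]["dataset_persistent_id"])
```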