Harvard Dataverse and Demo Dataverse, both on v4.14, aren't extracting metadata from FITS files anymore.
On Harvard Dataverse, it looks like the last published file where metadata was extracted was on June 19, 2018. Since then, no uploaded file whose file type is FITS has had its metadata extracted.
In this dataset on Demo Dataverse I tested publishing files whose metadata Harvard Dataverse has and has not extracted. When I open the three files in a text editor, I can see the metadata in the headers.
In this dataset, I tried uploading a FITS file and the metadata was not extracted. It is also not working on our instance at Scholars Portal Dataverse (currently on 4.17). Would anyone be able to look into this? Thanks!
@meghangoodchild - we'll bring this into a sprint and fix this in a release soon.
FYI: https://github.com/IQSS/dataverse/blob/develop/src/main/java/edu/harvard/iq/dataverse/ingest/IngestServiceBean.java#L169 looks suspicious. The next few lines take the storage identifier and strip any tmp:// prefix before turning it into a file path, but line 169 doesn't strip the prefix before prepending the temp-file directory and sending the result to the metadata extractor.
I just saw this while working on other things and haven't tested to see if this is really the issue.
Probably not the real issue - I just noticed that at least single-file uploads don't have tmp:// at the start of their storageidentifiers in this method, so this may be a red herring or only a secondary problem (presumably some files do have tmp:// in their storageidentifiers when they go through this method?).
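If the prefix handling is the problem, the fix would look roughly like this. This is a minimal standalone sketch, not the actual IngestServiceBean code; the class and method names are made up for illustration:

```java
// Hypothetical sketch of stripping an optional tmp:// prefix from a
// storage identifier before building the temp-file path. In the real
// code, line 169 builds the path without stripping the prefix first.
public class StorageIdentifierDemo {
    static final String TMP_PREFIX = "tmp://";

    static String tempFilePath(String tempDir, String storageIdentifier) {
        // Remove the tmp:// prefix if present, otherwise use the id as-is
        String name = storageIdentifier.startsWith(TMP_PREFIX)
                ? storageIdentifier.substring(TMP_PREFIX.length())
                : storageIdentifier;
        return tempDir + "/" + name;
    }

    public static void main(String[] args) {
        // Both forms should resolve to the same on-disk path
        System.out.println(tempFilePath("/tmp/files", "tmp://16fd86f4196-50101587beb2"));
        System.out.println(tempFilePath("/tmp/files", "16fd86f4196-50101587beb2"));
    }
}
```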
Localhost server log errors from uploading two FITS files. I checked the temp directory path locally; the files are not there.
[2020-01-24T11:43:15.823-0500] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.ingest.IngestServiceBean] [tid: _ThreadID=30 _ThreadName=http-listener-1(4)] [timeMillis: 1579884195823] [levelValue: 1000] [[
Caught exception trying to extract indexable metadata from file muench2002.fits, Could not open temp file /Applications/NetBeans/glassfish-4.1_dvn/glassfish/domains/domain1/files/temp/16fd86f4196-50101587beb2]]
[2020-01-24T11:43:15.857-0500] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.ingest.IngestServiceBean] [tid: _ThreadID=30 _ThreadName=http-listener-1(4)] [timeMillis: 1579884195857] [levelValue: 1000] [[
Caught exception trying to extract indexable metadata from file DHT02_Center_interp.fits, Could not open temp file /Applications/NetBeans/glassfish-4.1_dvn/glassfish/domains/domain1/files/temp/16fd86f8731-65de34ea8b3a]]
I wanted to mention that Alyssa Goodman pointed out somewhat recently at https://twitter.com/AlyssaAGoodman/status/1185182509556031489 that we can find some smallish FITS files to test with at https://dataverse.harvard.edu/dataverse/complete
I've been thinking it might be nice to add one of those FITS files to https://github.com/IQSS/dataverse-sample-data to aid automated testing in the future.
@pdurbin I concur that our sample data should contain files that highlight all our features, FITS metadata ingest being one we're missing.
While looking for a smoking gun in the GitHub commit history around the last known working date of June 19, 2018, ~I found this commit by @sekmiller~ UPDATE: comment retracted; it seems Phil found something. Sorry, Stephen!
That change was released in 4.9, on Jun 6, 2018, but I don't see when it went to production. I do see that we quickly released 4.9.1 on Jun 27, 2018, which appears to have gone up to production on Jun 27, 2018.
This MASSES Dataverse contains over 1,500 FITS files uploaded on or around Jun 19, 2018, none of which have ingested metadata descriptions.
Hope some of that helps someone.
Yeah, when I showed pull request #6610 to @landreev and mentioned 4.9 he said a lot of refactoring happened in that time frame due to moving to S3.
Have been in this code due to the direct S3 upload work:
Looks like https://github.com/IQSS/dataverse/commit/dc6ff8637744a0ca863470fb9c9f22a12241dc29 added a delete of the temp file prior to when it is used by the FITS extractor...
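To illustrate the ordering problem that commit would introduce (a hypothetical standalone sketch, not the actual code): if cleanup deletes the temp file before the extractor reads it, the extractor's open fails with exactly the kind of "Could not open temp file" error shown in the logs above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DeleteOrderDemo {
    // Stand-in for the FITS extractor: it needs the temp file to still exist.
    // The real extractor would open and parse the file's headers.
    static boolean extractMetadata(Path tempFile) {
        return Files.exists(tempFile);
    }

    public static void main(String[] args) throws IOException {
        Path tempFile = Files.createTempFile("fits-", ".tmp");
        Files.delete(tempFile);                          // cleanup runs too early...
        System.out.println(extractMetadata(tempFile));   // ...so extraction fails: prints false
    }
}
```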
Also, fwiw, it looks like lines 169 and 174 define near-duplicate variables pointing to the temp file (the latter would have removed a tmp:// prefix if the storageidentifier had one, but the variable on 169 wouldn't be usable if it did, so I think they end up being the same path).