Dataverse: AWS/S3 configuration and upload problem

Created on 17 Sep 2018 · 7 comments · Source: IQSS/dataverse

I have a test dataverse on AWS using S3 storage. (http://54.67.118.35:8080/). We were unable to upload files.

The JVM option dataverse.files.directory was set to the default, -Ddataverse.files.directory=/usr/local/dvn/data. Once that option was deleted, we could upload and display files.

Although it appears to work, I'm really worried about just deleting a JVM option without knowing what it should be set to.

Jamie Jamison
UCLA data science center

File Upload & Handling

All 7 comments

I should have added that I followed the installation directions for setting the storage to S3:
1) Removed: -Ddataverse.files.storage-driver-id=file
2) Replaced it with: -Ddataverse.files.storage-driver-id=s3 and added the bucket name
3) Set up the .aws credentials
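For anyone following along, the steps above can be sketched as asadmin commands. This is a sketch based on the 4.9-era installation guide; the exact option names (in particular dataverse.files.s3-bucket-name) and the bucket name are assumptions you should check against the guide for your version:

```shell
# Switch the storage driver from local filesystem to S3 (hypothetical bucket name).
./asadmin delete-jvm-options "-Ddataverse.files.storage-driver-id=file"
./asadmin create-jvm-options "-Ddataverse.files.storage-driver-id=s3"
./asadmin create-jvm-options "-Ddataverse.files.s3-bucket-name=your-bucket-name"
# AWS credentials go in ~/.aws/credentials for the user running glassfish.
```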

@jmjamison thanks for opening this issue and for all the chatter in various places:

  • https://groups.google.com/d/msg/dataverse-community/cK5ix5p9Qy4/JTI2WEiJBwAJ
  • http://irclog.iq.harvard.edu/dataverse/2018-09-17#i_72453
  • https://help.hmdc.harvard.edu/Ticket/Display.html?id=267029

It's weird to me that deleting that dataverse.files.directory JVM option had an effect, but it seems like it's been a good fix for you, which is great.

One thing I wanted to point out is that the dataverse.files.directory JVM option is set to "/usr/local/dvn/data" when using the "dataverse-ansible" configs (see https://github.com/IQSS/dataverse-ansible/blob/a7251f975c913924c8bc493264d942c5e06b56a0/defaults/main.yml#L25 ) but if you use the regular Dataverse installation process the directory "/usr/local/glassfish4/glassfish/domains/domain1/files" is used instead, as described at http://guides.dataverse.org/en/4.9.2/installation/config.html#file-storage-local-filesystem-vs-swift-vs-s3 . I mention this just so that developers aren't confused about this when working on this issue. @kcondon @matthew-a-dunlap and I discussed this a bit this morning so the three of us, at least, are on the same page.

Thanks again. At minimum, I believe we need to fix up the docs. It's also quite possible you've found a bug.

FWIW - dataverse.files.directory is used to determine the temp dir where files are written before transfer to S3. (If not set, the default is /tmp/files/temp.) So having it set probably isn't a problem in itself, but if that dir doesn't exist or isn't writable...
Maybe the fact that this dir affects the temp file location for non-file IO providers is the thing to document?
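A quick way to rule out this failure mode is to check the temp dir directly. A minimal sketch, assuming the default /tmp/files/temp fallback described above (the DATAVERSE_FILES_DIR variable is illustrative, not a real Dataverse setting); run it as the user glassfish runs as:

```shell
# Check that the ingest temp directory exists and is writable.
# If dataverse.files.directory is unset, temp files go to /tmp/files/temp.
TMP_DIR="${DATAVERSE_FILES_DIR:-/tmp/files}/temp"
mkdir -p "$TMP_DIR" && [ -w "$TMP_DIR" ] \
  && echo "writable: $TMP_DIR" \
  || echo "NOT writable: $TMP_DIR"
```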

I've confirmed through testing that it behaves the way @qqmyers described. Additionally, if I configure a path the glassfish user does not have write access to, you can select files to upload and they appear to upload, but then they do not appear in the uploaded list on the upload files page - they disappear.

Hey guys,
curious if there is room for improvement here? This could hit us (FZJ) pretty badly in production if drives start to fill up or other things go wrong.

IMHO there should be an error message, before the potentially large upload starts, saying that files cannot be uploaded right now. At the very least there should be a clear error message when it fails.

Maybe add a check at startup time that ensures a writable path? And maybe an extra check before the upload starts?

Ideally there should also be a check whether the file can be uploaded in terms of storage capacity in the temp folder, but that is another story.
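A pre-flight script along those lines might look like this. It's only a sketch: DATAVERSE_FILES_DIR and the 10 MB threshold are illustrative placeholders, not existing Dataverse settings, and a real startup check would live in the application itself:

```shell
# Pre-flight check: is the upload temp dir writable, and is there free space?
TMP_DIR="${DATAVERSE_FILES_DIR:-/tmp/files}/temp"
MIN_FREE_KB=10240   # arbitrary example threshold (~10 MB)

mkdir -p "$TMP_DIR" || { echo "ERROR: cannot create $TMP_DIR"; exit 1; }
[ -w "$TMP_DIR" ]   || { echo "ERROR: $TMP_DIR is not writable"; exit 1; }

# POSIX df -Pk prints 1K blocks; column 4 of the second line is free space.
AVAIL_KB=$(df -Pk "$TMP_DIR" | awk 'NR==2 {print $4}')
if [ "$AVAIL_KB" -lt "$MIN_FREE_KB" ]; then
    echo "WARNING: only ${AVAIL_KB} KB free under $TMP_DIR"
    exit 2
fi
echo "OK: $TMP_DIR writable, ${AVAIL_KB} KB free"
```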

@poikilotherm there is definitely room for improvement. Any interest in poking around in the code and maybe making a pull request?

FWIW: The permission issue that this issue started with should be a one-time thing, i.e. once you're set up correctly, users shouldn't be running into it. A check could be written, but it might still be overkill to run it for every upload.

Managing disk space would be trickier in many ways. Since multiple users could be uploading in parallel, and one user can drag files into the upload interface sequentially, attempts to avoid running out of space would have to reserve space and release it if/when it's not needed, and the GUI would need logic to stop additions to an upload while allowing the existing files to complete. Having just been through the code to find places where temp files were being left behind (pre-4.9.3, when users hit cancel on an upload, or deleted some files before save, temp files were left), I know there are multiple places where any reservation would have to be released. (There are still cases, such as a user leaving the page without cancelling or saving, or a network drop, where I haven't yet tracked down how to delete temp files, and where reservations would also have to be cancelled.) At a minimum, if you have limited disk space for temp files, 4.9.3 should be better at making sure it doesn't fill up.

Some good news, perhaps: I think the upload problem is visible if it happens, and it probably could be made more visible. Files dragged into the upload pane are only transferred to the bottom list if the ingest completes and writes the temp file. I did submit code a while back to catch and show non-20x responses from the server during upload; I don't know if that already catches a permission or out-of-space issue (it should, if the response is not 201). If so, there's already a GUI display when the error occurs, and one could add more logic to the JavaScript triggering it to decide whether the failure is one that should shut down other ongoing uploads (or warn the user to do a 'cancel', etc.).
