This is based on a suggestion from a user (@mankoff) made earlier in the "optimize zip" issue (#6505). I believe something similar had also been proposed elsewhere earlier. I'm going to copy the relevant discussion from that issue and add it here.
I do not consider this a possible replacement for the "download multiple files as zip" functionality. Unfortunately, we're pretty much stuck supporting zip, since it has become the de-facto standard for sharing multi-file and folder bundles. But it could be something very useful to offer as another option.
The way it would work: there would be an API call (for example, /api/access/dataset/<id>/files) that would expose the files and folders in the dataset as a crawl-able tree of links, similar to how static files and directories are shown on simple web servers. A command line user could point a client - for example, wget - at it to crawl and save the entire tree, or a sub-folder thereof. The advantages of this method are huge - the end result is the same as downloading the entire dataset as Zip and unpacking the archive locally, in one step. But it's achieved in a dramatically better way - by wget issuing individual GET calls for the individual files; meaning that those a) can be redirected to S3 and b) the whole process is completely resume-able in case it is interrupted; unlike the single continuous zip download, which cannot be resumed at all.
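To make the idea concrete, here is a minimal sketch of what the server could emit for one folder. The function name and HTML shape are purely illustrative, not an existing Dataverse API:

```python
from html import escape

def render_folder_listing(base_url, entries):
    """Render one folder as a minimal, crawler-friendly HTML page.

    `entries` is a list of (name, is_folder) tuples; folders get a
    trailing slash so a recursive client such as wget descends into
    them, mimicking an Apache-style directory index.
    """
    lines = ["<html><body><pre>"]
    for name, is_folder in sorted(entries):
        suffix = "/" if is_folder else ""
        href = f"{base_url}/{escape(name)}{suffix}"
        lines.append(f'<a href="{href}">{escape(name)}{suffix}</a>')
    lines.append("</pre></body></html>")
    return "\n".join(lines)
```

A command line user could then mirror the whole tree with something like `wget --recursive --no-parent <listing-url>`, getting the same result as a zip download plus unpack, but file by file.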
The advantages are not as dramatic for web UI users. None of the browsers I know of support drag-and-drop downloads of entire folders out of the box. However, plugins that do that are available for the major browsers. Still, even clicking through the folders and being able to download the files directly (unlike in the current "tree view" on the page) would be pretty awesome. Again, see the discussion re-posted below for more information.
I would strongly support implementing this sometime soon (soon after v5.0 that is).
From 6505:
From @mankoff:
Hello. I was sent here from #4529.
I'm curious why zipping is a requirement for bulk download. It has been a long time since I've admin'd a webserver, but if I recall correctly, many servers (e.g. Apache) perform on-the-fly compression for files that they transfer.
I'm imagining a solution where appending /download/ to any dataverse or dataset URL (where this feature is enabled) exposes the files within as a virtual folder structure. The advantages of this are:
wget and other default tools (including the GUI "DownThemAll" browser extension, for example) could be deployed against this URL, and would support filename filtering, inclusion, exclusion, etc. This offloads a whole bunch of functionality to the end-user download tool, rather than bloating Dataverse. If you zip, I promise there is or will be a feature request to "let me bulk download but filter on filename".

Just some thoughts about how I'd like to see bulk download exposed to an end-user.
From @poikilotherm:
Independent of the pros and cons of ZIP files (like for many small files), I really like the idea proposed above. The two approaches don't exclude each other, either, which makes it even more attractive.
It should be as simple as rendering a very simple HTML page, containing the links to the files. So this still allows for control of direct or indirect access to the data, even using things like secret download tokens.
Obviously the same goal of bulk download could be achieved via some script, too, but using normal system tools like curl and wget is an even lower barrier for scientists/end users than using the API.
From @landreev:
...
I actually like the idea; and would be interested in trying to schedule it for a near release. But I'm not sure this can actually replace the download-multiple-files-as-zip functionality, completely.
OK, so adding "/download" to the dataset URL "exposes the files within as a virtual folder structure" - so, something that looks like your normal Apache directory listing? Again, I like the idea, but I'm not entirely sure about the next sentence:
No waiting for zipping, which could be a long wait if the dataset is 100s of GB. The download starts instantly
Strictly speaking, we are not "waiting for zipping" - we start streaming the zipped bytes right away, as soon as the first buffer becomes available. But I am with you in principle, that trying to compress the content is a waste of cpu cycles in many cases. It's the "download starts instantly" part that I have questions about. I mean, I don't see how it starts instantly, or how it starts at all. That ".../download" call merely exposed the directory structure - i.e. it produced some html with a bunch of links. It's still the client's job to issue the download requests for these links. I'm assuming what you mean is that the user can point something like wget at this virtual directory, and tell it to crawl it. And then the downloads will indeed "start instantly", and wget will handle all the individual downloads and replicate the folder structure locally, etc. I agree that this would save a command line user a whole lot of scripting that's currently needed - first listing the files in the dataset, parsing the output, issuing individual download requests for the files etc. But as for the web users - the fact that it would require a browser extension for a folder download to happen automatically makes me think we're not going to be able to use this as the only multiple file download method. (I would definitely prefer not to have this rely on a ton of custom client-side javascript, for crawling through the folders and issuing download requests either...)
(Or is it now possible to create HTML5 folders, that a user can actually drag-and-drop onto their system - an entire folder at once? Last time I checked it couldn't be done; but even if it can be done, I would expect it not to be universally supported in all browsers/on all OSes...)
My understanding of this is that even if this can be implemented as a viable solution, for both the API and web users, we'll still have to support the zipping mechanism. Of which, honestly, I'm not a big fan either. It's kind of a legacy thing - something we need to support, because that's the format most/many users want. As you said, the actual compression that's performed when we create that ZIP stream is most likely a waste of cycles. With large files specifically - most of those in our own production are either compressed already by the user, or are in a format that uses internal compression. So this means we are using ZIP not for compression - but simply as a format for packaging multiple files in a single archive stream. Which is kind of annoying, true. But I've been told that "users want ZIP".
But please don't get me wrong, I am interested in implementing this, even if it does not replace the ZIP download.
From @mankoff:
Hi - you're right, this does not start the download. I was assuming wget is pointed at that URL, and that starts the downloads.
As for browser-users, sometimes I forget that interaction mode, but you are right, they would still need a way to download multiple files, hence zipping. If zipping is on-the-fly streaming and you don't have to wait for n files n GB in size to all get zipped, then it isn't as painful/bad as I assumed. You are right, exposing a virtual folder for CLI users (or DownThemAll browser extension users) and bulk download are separate issues. Perhaps this virtual folder should be its own Issue here on GitHub to keep things separate.
From @mankoff:
I realize that if appending /download to the URL doesn't start the download as @landreev pointed out, that may not be the best URL. Perhaps /files would be better. In which case, appending /metadata could be a way for computers to fetch the equivalent of the metadata tab that users might click on, here again via a simpler mechanism than the API.
From @landreev:
I realize that if appending /download to the URL doesn't start the download ... that may not be the best URL. Perhaps /files would be better.
I like /files. Or /viewfiles? - something like that.
I also would like to point out that we don't want this option to start the download automatically, even if it were possible. Just like with zipped downloads, either via the API or the GUI, not everybody wants all the files. So we want the command line user to be able to look at the output of this /files call, and, for example, select a subfolder they want - and then tell wget to crawl it. Same with the web user.
From @landreev:
... If zipping is on-the-fly streaming and you don't have to wait for n files n GB in size to all get zipped, then it isn't as painful/bad as I assumed.
But I readily acknowledge that it's still bad and painful, even with streaming.
The very fact that we are relying on one long uninterrupted HTTP GET request to potentially download a huge amount of data is "painful". And the "uninterrupted" part is a must - because it cannot be resumed from a specific point if the connection dies (by nature of having to generate the zipped stream on the fly). There are other "bad" things about this process, some we have discussed already (spending CPU cycles compressing = potential waste); and some I haven't even mentioned yet... So yes, being able to offer an alternative would be great.
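To illustrate why per-file GETs are resumable while the zip stream is not: a client only needs a standard HTTP Range header computed from the size of the partial local copy. A hypothetical helper, shown only to make the point:

```python
import os

def resume_request_headers(local_path):
    """Headers for resuming an interrupted download of a single file.

    This works for individual file GETs, where the byte stream is stable
    and the server (or S3) can honor a Range request - but not for a zip
    stream generated on the fly, whose bytes cannot be reproduced from an
    arbitrary offset. (Illustrative helper, not Dataverse code.)
    """
    if os.path.exists(local_path):
        size = os.path.getsize(local_path)
        return {"Range": f"bytes={size}-"}
    return {}
```

Tools like `wget --continue` and `curl -C -` build exactly this header for you.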
From @poikilotherm:
Just a side note: one might be tempted to create a WebDAV interface, which could be included in things like Nextcloud.
Related to #7174 - the /files view could expose versions, like this:
├── 1.0
├── 1.1
├── 2.0
├── 2.1
├── 2.2
├── 2.3
├── 2.4
├── 3.0
├── 4.0
├── 5.0
└── latest
More generally, it would be nice to always have access to the latest version of a file, even though the file DOI changes when the file updates. The behavior described here provides that feature. I'm not sure this is correct though, because that means doi:nn.nnnn/path/to/doi/for/v3/files/latest/, or doi:nn.nnnn/path/to/doi/for/v3/files/2.4/ will download versions that are not v3 (the actual DOI used in this example). Could be confusing...
@djbrooke I see you added this to a "Needs Discussion" card. Is there any part of the discussion I can help with?
Another use case that popped up today from a workshop: making such a structure available could help with integrating data in Dataverse with DataLad, https://github.com/datalad/datalad. DataLad is basically a wrapper around git-annex, which allows for special remotes.
DataLad is gaining traction especially in communities with big data needs like neuroimaging.
Cross-linking datalad/datalad#393 here.
@poikilotherm we're friendly with the DataLad team. In https://chat.dataverse.org the DataLad PI is "yoh" and I've had the privilege of having tacos with him in Boston and beers with him in Brussels ( https://twitter.com/philipdurbin/status/1223987847222431744 ). I really enjoyed the talk they gave at FOSDEM 2020 and you can find a recording here: https://archive.fosdem.org/2020/schedule/event/open_research_datalad/ . Anyway, we're happy to integrate with DataLad in whatever way makes sense (whenever it makes sense).
@mankoff "Needs Discussion" makes more sense if you look at our project board, which goes from left to right: https://github.com/orgs/IQSS/projects/2
Basically, "Needs Discussion" means that the issue is not yet defined well enough to be estimated or to have a developer pick it up. As of this writing it looks like there are 39 of these issues, so you might need to be patient with us. 😄
@mankoff @poikilotherm @pdurbin thanks for the discussion here. I'll prioritize the team discussing this as there seem to be a few use cases that could be supported here.
@scolapasta can you get this on the docket for next tech hours? (Or just discuss with @landreev if it's clear enough.)
Thanks all!
I agree that this should be ready to move into the "Up Next" column.
Whatever decisions may still need to be made, we should be able to resolve as we work on it.
The implementation should be straightforward enough. One big-ish question is whether there is already a good package we can use to render these crawl-able links, or if we should just go ahead and implement it from scratch (since the whole point is to have simple, straight HTML links with no fancy UI features, the latter feels like a reasonable idea?).
And I just want to emphasize that this is my understanding of what we want to develop: this is not another UI implementation of a tree view of the files and folders (like we already have on the dataset page, but with download links). This is not for human users (mostly), but for download clients (command line-based or browser extensions) to be able to crawl through the whole thing and download every file; hence this should output a simple HTML view of one folder at a time, with download links for files and recursive links to sub-folders. Again, similar to how files and directories on a filesystem look when exposed behind an httpd server.
@mankoff
Related to #7174 - the /files view could expose versions, like this:
├── 1.0
├── 1.1
├── 2.0
├── 2.1
├── 2.2
├── 2.3
├── 2.4
├── 3.0
├── 4.0
├── 5.0
└── latest
Thinking about this - I agree that this API should understand version numbers; maybe serve the latest version by default, or a specific version when one is requested. But I'm not sure about providing a top level access point for multiple versions at the same time, like in your example above. My problem with that is that if you have a file that happens to be in all 10 versions, a crawler will proceed to download and save 10 different copies of that file, if you point it at the top level pseudo folder.
I'm happy to hear this is moving toward implementation. I agree with your understanding of the features, functions, and purpose of this. This is also what you wrote when you opened the ticket.
I was just going to repeat my 'version' comment when you posted your 2nd comment above. Yes to latest by default, with perhaps some method to access earlier versions (could be a different URL you find from the GUI, not necessarily sub-folders under the default URL for this feature).
Use case: I have a dataset A that updates every 12 days via the API. I am working on another project B that is doing something every day, and I always want the latest A. It would be good if the code in B were one line (wget to the latest URL, download if the server version is newer than the local version). It would not be as good if B needed to include a complicated function to access the A dataset, parse something to get the URL for the latest version, and then download that.
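The hoped-for "one line" in project B amounts to a conditional GET against the (hypothetical) latest URL. A sketch of how a client could build such a request with only the standard library:

```python
import email.utils
import os
import urllib.request

def build_conditional_request(url, local_path):
    """Build a GET for the hypothetical `latest` URL that asks the server
    to send the file only if it is newer than our local copy.

    Passing the result to urllib.request.urlopen() would either download
    the newer file or raise an HTTPError with code 304 (Not Modified).
    """
    req = urllib.request.Request(url)
    if os.path.exists(local_path):
        mtime = os.path.getmtime(local_path)
        req.add_header("If-Modified-Since",
                       email.utils.formatdate(mtime, usegmt=True))
    return req
```

This is essentially what `wget --timestamping <url>` does in one line.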
Just a quick thought: what about making it WebDAV compatible? It could be integrated into Nextcloud/ownCloud this way (read-only for now).
The most useful minimal implementation is the latest files in a dataset: http://doi/view/latest exposes a simple wget-friendly view of all files and folders. Note that view is open for discussion - could be files or list or download or something else. Versioning would only show the files in that version, so http://doi/view/4.0 might show different files and folders.
Aux and metadata? I guess. I notice when I download a dataset I get MANIFEST.TXT even though I didn't ask for it. I'm not sure what happens if the dataset contains a real file called MANIFEST.TXT. But there could be a virtual folder of aux and metadata too.
I'm not sure what your 3rd point means. But the point of this feature is not the GUI. It's a way to make bulk download easy and accessible with the most common tools and user experience - "similar to how files and directories on a filesystem look when exposed behind an httpd server."
Thanks @mankoff, I think we're all set, I was just capturing some discussion from the sprint planning meeting this afternoon. We'll start working on this soon, and I mentioned that we may run some iterations by you as we build it out.
Thinking about the behaviors requested here after reading and commenting on #7425, I see a problem.
The original request was to allow easy file access for a dataset, so doi:nnnn/files exposes the files for wget or a similar access method.
The request grew to add a logical feature to support easy access to the latest version of the files. "Easy" here presumably means via a fixed URL. But DOIs point to a specific version, so it is counter-intuitive for doi:nnn_v1/latest to point to something that is not v1.
I note that Zenodo provides a DOI that always points to the latest version, with clear links back to the earlier, individually DOI'd versions. Would this behavior be a major architecture change for Dataverse?
Or if you go to doi:nnnn/latest does it automatically redirect to a different doi, unless nnnn is the latest? I'm not sure if this is a reasonable behavior or not.
Anyway, perhaps "easy URL access to folders" and "fixed URL access to latest" should be treated as two distinct features to implement and test, although there is a connection between the two and the latter should make use of the former.
How would a /<dataset doi>/<dataset version or :latest>/<file path> URI work? That would allow a stable URI for the file of a given path/name in the latest dataset version. If files are being replaced by files with different names this wouldn't work, but it would avoid trying to have both the dataset and file versioning schemes represented in the API.
If files are deleted or renamed, then a 404 or similar error seems fine.
Note that this ticket is about exposing files and folders in a simple view, so if you use this feature to link to the latest version of a dataset (not a file within the dataset), then everything "just works", because whatever files exist in the latest version would be in that folder, by definition.
Here are some use-cases:
YYYY-MM-DD.tif added every day

How can we easily share this dataset with colleagues (and computers) so they always get the latest data? From your suggestions above, /dataset doi/dataset version won't find the latest, but could expose the files in that dataset version in a virtual folder. The URL with :latest/file path won't work because the files for tomorrow don't exist in the 2nd example, where files get added every day. The URL dataset doi/view/latest could expose the latest version in a simple virtual folder, but may confuse people because of the DV vs. Zenodo architecture decision, where the dataset doi is not meant to point to the latest version.
Not sure I follow. /10.5072/ABCDEF/:latest/file1.csv would always be the latest file with that name, and /10.5072/ABCDEF/:latest/ would always be a point where you'd get the latest dataset version's list of files, so new files with new dates would appear there. Does that support your two cases OK? (In essence the doi + :latest serves as the ID for the latest version versus there being a separate DOI for that.)
The concern I raised was that, because I could replace file1.csv with filen.csv in later versions, a scheme using the file path/name won't expose the relationship between file1.csv and filen.csv, but if the names are the same, or the names include dates that can be used to get the file you want, knowing Dataverse's internal relationship between those files may not be important. (Conversely - what happens if someone deleted file1 in dataverse and added a new file1, versus using the replace function? Should the API not provide a single URI that would get you to the latest?)
Please recall the opening description by @landreev, "similar to how static files and directories are shown on simple web servers." Picture this API allowing browsing a simple folder. This may help answer some of the questions below. If a file is deleted from a folder, it is no longer there. If a file is renamed, or replaced, the latest view should be clearly defined based on our shared common (Mac, Windows, Linux, not VAX or DropBox web view behavior) OS experiences of browsing folders containing files.
Another option that may simplify implementation: :latest is only valid for a dataset, not a file. Recall again that we're talking about two things in this ticket: 1) :latest and 2) :view, providing the virtual folder. If :latest is limited to datasets and not files, then combining it with :view provides access to the files within the latest dataset.
Not sure I follow. /10.5072/ABCDEF/:latest/file1.csv would always be the latest file with that name, and /10.5072/ABCDEF/:latest/ would always be a point where you'd get the latest dataset version's list of files, so new files with new dates would appear there. Does that support your two cases OK? (In essence the doi + :latest serves as the ID for the latest version versus there being a separate DOI for that.)
Yes this works for both use cases.
I still point out that 10.5072/ABCDEF is (in theory) the DOI for v1, so having it also point to the latest because of an additional few characters (i.e., :latest) could be confusing. But I think that is a requirement given the architecture decision that there is no minted DOI that always points to the latest (like Zenodo).
Furthermore if /10.5072/ABCDEF/:latest/ is generalized to support :v1, :v2, etc. in addition to :latest, then any DOI for any version within a dataset can be used to access any other version. For my daily updating data, after a year I have 365 DOIs, each of which can be used to access all 365 versions.
The concern I raised was that, because I could replace file1.csv with filen.csv in later versions, a scheme using the file path/name won't expose the relationship between file1.csv and filen.csv, but if the names are the same, or the names include dates that can be used to get the file you want, knowing Dataverse's internal relationship between those files may not be important.
I personally am not concerned by this. The relationship is still available for people to see in the GUI "Versions" tab.
(Conversely - what happens if someone deleted file1 in dataverse and added a new file1, versus using the replace function? Should the API not provide a single URI that would get you to the latest?)
Hmmm. Ugh :). So I see the following choices:

1. File is deleted and not in latest version.
2. File is replaced and in the latest version.
3. File is deleted, then added, and exists in latest version:
   - [ ] API can return error: ambiguous file
   - [ ] API for ":latest" can look at the DOI used, and trace it downstream. If the DOI was for the earlier version that got deleted, then return the latest file before deletion. If the DOI was for an intermediate version where it did not exist, return an error. If the DOI was for a later version after it was added, trace it downstream and return the latest one.
This seems overly complicated and I'd vote for "just return the latest".
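The "just return the latest" option is also the simplest to implement: scan the dataset's versions from newest to oldest and return the first match. A sketch with a made-up data shape, not actual Dataverse internals:

```python
def resolve_latest(versions, filename):
    """Return the file id for `filename` from the newest dataset version
    that contains it ("just return the latest"), or None if it never
    existed in any version.

    `versions` maps version label -> {filename: file_id}, ordered from
    oldest to newest (an illustrative shape only).
    """
    for files in reversed(list(versions.values())):
        if filename in files:
            return files[filename]
    return None
```

Note that this deliberately ignores whether the file was deleted and re-added or replaced; the newest occurrence wins either way.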
given the architecture decision that there is no minted DOI that always points to the latest (like Zenodo)
@mankoff I'm confused by this. It's actually the opposite. The dataset DOI in Dataverse always points to the latest version of the dataset. If you use the "download all files in a dataset" API ( https://guides.dataverse.org/en/5.1.1/api/dataaccess.html#downloading-all-files-in-a-dataset ) and pass the dataset DOI, you will get the latest files from version 7 or whatever.
You're absolutely right that Zenodo mints DOIs for each version of a dataset and that Dataverse doesn't do this (it's been requested in #4499). But again, in Dataverse the dataset DOI always points to the latest version. In Dataverse, if you want to download files from a specific (possibly older) version, you pass "3.1" or whatever. Please see https://guides.dataverse.org/en/5.1.1/api/dataaccess.html#download-by-dataset-by-version
@pdurbin you are correct. I apologize for adding confusion to this conversation. I was confusing DOIs for datasets with DOIs for files. The file DOIs update.
@mankoff no problem, for downloading the latest version of a file (as you know, but for others) there's a new issue:
And now that I'm not as confused, I'll note that (I think) #7425 is solved if this ticket is implemented. If doi:nnnn/view/ exposes the dataset as a virtual folder, then you can link in there to get a fixed URL for a file as long as it exists in the latest version of the dataset.
Initial thoughts:
- persistentId as a path param, using a regex in JAX-RS to match the param instead of having to place it in a query parameter. This doesn't break the current API spec, just extends it.
- Restricted files fetched via curl? Show, but download as a text file stating the restricted access and how to gain access?
- This is meant for wget, so Data Access API. But it allows for browsing, so more like the JSON file view, so Native API?

I would vote for not introducing a new API path but keeping it in line with what we have and staying consistent. Either stay with the Access API or the Native API. Yet we can mix them a bit: implement the view itself in the Native API but let the download links point to the Access API using the data file endpoints.
Let's create a more vivid example. If I would like to browse the files of https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/IPZBAU as virtual folders, I would go to:
edu.harvard.iq.dataverse.api.Access. Dropping the "view" verb will download the files as a ZIP package. Please note that the current Access API has no verbs (with one yet irrelevant exception).

The Native API already has endpoints for versions and the files in them, but so far using JSON only. We need to be careful not to break anything relying on it. We could introduce /api/datasets/{id}/versions/{version}/tree for a folder view (there are already /files and /metadata for JSON).
This endpoint already has support for versions ":latest", ":draft", and ":latest-published" (which should also accept them without the colons).
Subfolders have to be depicted in the URL (so folders get created during the download). Files would be links to https://demo.dataverse.org/api/access/datafile/{fileId} or redirects to the same place if given via the URL.
Again, a vivid example:
https://demo.dataverse.org/api/datasets/:persistentId/versions/latest/tree/subfolder/foo/bar/file.txt?persistentId=10.70122/FK2/IPZBAU redirects to https://demo.dataverse.org/api/access/datafile/xyz, triggering the download.
(Imagine using the PID in the URL directly... 🤩 )
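The redirect in the example above boils down to a simple path lookup. A sketch with a hypothetical helper and data shape (the real resolution would happen inside the JAX-RS resource):

```python
def tree_path_redirect(dataset_tree, path):
    """Resolve a /tree/... sub-path to an Access API redirect target.

    `dataset_tree` maps folder paths ("" for the dataset root) to
    {filename: file_id} dicts - an illustrative shape, not the real
    internal model. Redirecting to the existing datafile endpoint means
    the byte-serving (and S3 redirect) code is reused unchanged.
    """
    folder, _, name = path.rpartition("/")
    file_id = dataset_tree.get(folder, {}).get(name)
    if file_id is None:
        return None  # caller would respond with 404
    return f"/api/access/datafile/{file_id}"
```

So a request for .../tree/subfolder/foo/bar/file.txt would redirect to /api/access/datafile/xyz, as in the example above.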