This is a user community request for the ability to organize files in a dataset hierarchically, so that when a user exports files from Google Drive, Dropbox, OSF, or any other location, the file structure is maintained without extra work on the Dataverse side.
One highly related thing, extremely prevalent in microscopy (and, I would guess, other fields), is that in addition to encoding metadata into the directory hierarchy, people have also encoded it into the filenames, usually underscore-separated.
E.g. /usr/people/bioc0759/data/EB1-posterior-polarity/EB1-Colcemid-UV-inactivation/RMP_20090228_colcemid_UV_inactivation/rmp_20090228_colcemid-6hrs_EB1EB1_stg9_Az_18_R3D.dv
Some of this will be important metadata, some of it may not be. The ability to automatically or semi-automatically import some of this metadata (but hopefully not the junk) in the form of tag annotations sounds useful so that search/filtering can make use of them.
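To make the idea of semi-automatic tag extraction concrete, here is a minimal sketch (not part of any existing tool) of splitting such a filename into candidate tag tokens. The heuristic of dropping purely numeric fragments as "junk" is an assumption for illustration; in practice dates like the one below may well be metadata worth keeping.

```python
import re
from pathlib import Path

def filename_tags(path):
    """Split an underscore/hyphen-separated filename stem into
    candidate tag tokens, dropping purely numeric fragments."""
    stem = Path(path).stem  # strip directory and extension
    tokens = re.split(r"[_\-]+", stem)
    return [t for t in tokens if t and not t.isdigit()]

print(filename_tags("rmp_20090228_colcemid-6hrs_EB1EB1_stg9_Az_18_R3D.dv"))
# ['rmp', 'colcemid', '6hrs', 'EB1EB1', 'stg9', 'Az', 'R3D']
```

A real pipeline would likely let the user review and edit the extracted tokens before turning them into tag annotations.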
It might be useful to have a look at this tool, built for the Open Microscopy Environment. The UI is not beautiful, but it does this kind of metadata extraction into Tag Annotations: https://www.openmicroscopy.org/site/products/partner/omero.webtagging/
There is also a "search" tool which should really be called "navigation" because it allows the user to browse the graph of tags from any origin point. This resembled filesystem navigation somewhat and seemed to satisfy some users.
Caveat emptor: because the tags are stored in a relational DB, the queries needed for this navigation can get very slow when there are large numbers of tags and/or large amounts of data tagged with them. Ideally, a graph DB would be stored and updated for this functionality to keep these queries performant.
FRD for this feature (work in progress): https://docs.google.com/document/d/1PqL6EljP-N51rt3puy3HedStrnV5DOJ3Gf7H_zPHcA0/edit?usp=sharing
Feedback from @pameyer: "preserving file naming and directory structure (with the exception of files.sha which holds the checksums) is important for users downloading the dataset, and doing computation locally on it".
Mostly I just want to make it clear that download is a use case. (We probably need a separate issue to talk about running computation on files.) In the FRD above this is currently a question ("Do these carry over into a folder structure when downloaded as a zip?") and the answer for many users, I think, is that they want/expect to be able to upload a zip and later download a zip that has the same directory structure inside it. Some months ago @cchoirat was talking about the importance of this for her (though she may not have been talking about zip files specifically). It's a common expectation. Right now Dataverse flattens your files into a single namespace/directory on upload.
I think this would be really valuable. It was how things worked with versions < 4.0, as I recall, and makes it somewhat unpredictable what will happen currently when uploading a project (e.g., via the API).
One possibility might be to do what S3 does with object keys that can have slashes in them:
Note that the Amazon S3 data model is a flat structure: you create a bucket, and the bucket stores objects. There is no hierarchy of subbuckets or subfolders; however, you can infer logical hierarchy using keyname prefixes and delimiters as the Amazon S3 console does. The Amazon S3 console supports a concept of folders.
The examples they give of object keys are:
Development/Projects1.xls
Finance/statement1.pdf
Private/taxdocument.pdf
s3-dg.pdf
This would allow a "flat" Dataset to contain files that can be batch downloaded into a hierarchical structure. Of course, I don't know if that works on the backend.
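The S3-console trick of inferring folders from delimiters in flat keys can be sketched in a few lines. This is an illustration of the concept, not Dataverse or S3 code; the function name and mapping shape are made up for the example.

```python
def infer_folders(keys, delimiter="/"):
    """Group flat object keys into a {prefix: [filenames]} mapping,
    the way the S3 console infers logical folders from key prefixes.
    Keys without a delimiter land under the empty ("root") prefix."""
    tree = {}
    for key in keys:
        folder, _, name = key.rpartition(delimiter)
        tree.setdefault(folder, []).append(name)
    return tree

keys = [
    "Development/Projects1.xls",
    "Finance/statement1.pdf",
    "Private/taxdocument.pdf",
    "s3-dg.pdf",
]
print(infer_folders(keys))
# {'Development': ['Projects1.xls'], 'Finance': ['statement1.pdf'],
#  'Private': ['taxdocument.pdf'], '': ['s3-dg.pdf']}
```

The appeal is that storage stays flat while the UI (and a batch download) can still present a hierarchy.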
This issue was raised yesterday by @pameyer and others from @sbgrid . @bmckinney if you want you could assign yourself to this issue to at least think about it. I remember @dpwrussell of OMERO fame talking about it during the 2015 Dataverse Community Meeting.
@leeper you're right. From what I've heard from @landreev in the DVN 3.x days a zip download would sort of reconstruct the file system hierarchy. I'm obviously fuzzy on the details.
I'm wondering whether there is a way to upload an entire directory (for example, by dragging a folder; currently only dragging a file is supported) so that the structure is maintained. The user could then browse the directories and files by simply clicking through, like Dropbox and GitHub, without explicitly downloading and unzipping the data.
:+1: on this request. The directory structure is often important. Sometimes there are even multiple subdirectories that contain identically named files, e.g., for different experimental subjects or different versions of an experiment.
@pdurbin is correct that download is a use case. So is online browsing of the dataset to get a feel for what's there -- the directory structure provides very useful organization.
My biggest concern with this issue is how versioning of files would be supported. Imagine if you could just rsync some files up to Dataverse. Then you publish your dataset as 1.0. You add some more files and publish the 2.0 version of your dataset. How do you toggle back and forth between the files in 1.0 and 2.0? When you uploaded your files the first time did they go into a directory called "1"? Does that directory get copied to "2" and then you can start uploading additional files there? That sounds potentially expensive on a file system that doesn't do any de-duplication. Do we use ZFS and snapshot the directory whenever a dataset version is published? How do other systems handle this?
@pdurbin I'm confused why hierarchical directory structure is related to versioning. If the directory is flat, don't you still have the problem that version 1.0 and version 2.0 might have some identical or similar files?
That said: The standard way to store identical files is symlinks. And the standard way to store similar files is to store only the latest version, together with diffs that make it possible to reconstruct the earlier versions on demand (as version-control systems from CVS to git have always done). Ideally, these storage details would be invisible to the user, who can just decide which version they want to grab (latest by default).
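One common alternative to symlinks and diffs, worth noting alongside the above, is content-addressed storage: each dataset version maps filenames to checksums, and identical files across versions share a single stored blob. This is a minimal sketch of the idea under assumed names (`store`, a plain dict as the blob store), not a description of how Dataverse works.

```python
import hashlib

def store(blobstore, content: bytes) -> str:
    """Content-addressed put: identical content across dataset
    versions shares one blob, keyed by its SHA-256 checksum."""
    digest = hashlib.sha256(content).hexdigest()
    blobstore.setdefault(digest, content)  # no-op if already stored
    return digest

blobs = {}
v1 = {"data.csv": store(blobs, b"a,b\n1,2\n")}
v2 = {"data.csv": store(blobs, b"a,b\n1,2\n"),   # unchanged in 2.0
      "notes.txt": store(blobs, b"new in 2.0")}

assert v1["data.csv"] == v2["data.csv"]  # same blob, stored once
print(len(blobs))  # 2 blobs back three logical files
```

As the comment above says, ideally these storage details stay invisible to the user, who just picks a version to download.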
@pameyer brought this up on the community call 8/16 and asked if it would be included in 4.6. There are currently no plans to work on this for 4.6.
@cchoirat stopped by and asked about this today and reiterated its usefulness.
Yes, please. We are about to upload a directory hierarchy. At present, we have to do it as a single tarball, which means that browsing / replacement / versioning of single files is not possible.
Related to #3247
Retention of directory structure in a .zip file is a critical component of our Project TIER protocol, where we teach students how to create reproducible empirical research. The inability to upload and download .zip files that retain this structure is a real impediment to our use of Dataverse as a platform to showcase these efforts. It seems to me that users upload .zip files either for convenience (many files at one time) or out of necessity (the folder structure is important to retain). Providing an option on upload -- to unzip or leave intact -- might satisfy both communities.
@nmedeiro as a workaround, your students can "double zip" their files as discussed in #2107.
@jeisner we still owe you a response to your question at https://github.com/IQSS/dataverse/issues/2249#issuecomment-222241350 about how hierarchical directory structure is related to versioning. I'm not an authority on this but if you look at http://phoenix.dataverse.org/schemaspy/latest/tables/filemetadata.html (screenshot below), you'll see how rows in the datafile table are associated with rows in the filemetadata table, which are associated with rows in the datasetversion table. The filesystemname field in datafile is a random UUID which is what the file is renamed to on disk after it is uploaded. All these files that have been renamed to UUIDs on disk are stored in a single directory for each dataset (and different versions of the dataset can include these files per the associations above). I can keep going but I hope this gives a flavor of how the system works now.

@bmckinney is giving a demo next week of some potential changes in this area as part of #3145 which will help guide future direction.
From @pameyer in #3247
For data files transferred through the Data Capture Module, datasets whose data files have directory structure available should display that structure.
Related to #2249, but distinct in that this doesn't require supporting re-organization of uploaded files (and possibly should support disabling such an option in the UI).
@nmedeiro see also this issue opened by @bjonnh about automating the "double zip" workaround: #3439
Let me pile on some more for hierarchical file structure support :)
Just want to point out that this issue is important for the mission of Dataverse. A key part of reproducibility is good data hygiene. And good data hygiene means nested folders with input, output, code etc. Having this feature makes Dataverse more consistent!
@pdeffebach thanks for the comment and for chatting over at http://irclog.iq.harvard.edu/dataverse/2018-01-11#i_62119 ! I'm glad the double zip workaround is working for you.
While I'm leaving a comment here, I thought I'd mention that there's also interest in the feature over at https://twitter.com/bshor/status/949417291132887041 which reads:
"@dataverseorg Is there any way to maintain the folder structure of studies in Dataverse? Seems mine was melted away."
I feel like there was another recent tweet but I can't find it. Suffice it to say there is broad interest in this feature.
To build on @pdeffebach's comment, here are some folks making the same point in general:
http://kbroman.org/steps2rr/pages/organize.html
"Perhaps the most important step to take towards ease of reproducibility is to be organized...Separate the data from the code. I prefer to put code and data in separate subdirectories."
http://www.fragilefamilieschallenge.org/author/matt-salganik/
"...we think it will be helpful to organize your input files, intermediate files, and output files into a standard directory (i.e., folder) structure. We think that this structure would help you create a modular research pipeline; see Peng and Eckel (2014) for more on the modularity of a research pipeline. This modular pipeline will make it easier for us to ensure computational reproducibility, and it will make it easy for other researchers to understand, re-use, and improve your code.
Here’s a basic structure that we think might work well for this project and others:
data/
code/
output/
README
LICENSE
"
I really appreciate your openness to community feedback on this and related issues. (It took me a few tries to format this correctly, my apologies.)
@setgree this is a very useful comment. Thanks. When I have a minute I'll try to make a screenshot from @leeper 's talk at the 2017 Dataverse Community Meeting that shows the file hierarchy expected by his academic discipline (political science).
Actually, I'll make the screenshot from https://osf.io/xfj5h/ now. Here it is:

I think concrete examples like this help explain the need for this feature.
That looks great! My only comment is that I do not think that makefiles are data; even one that preprocesses/cleans data is code, I think. (I was at that conference, BTW.) The ambiguity of such things is a good reason to allow readers a lot of flexibility in how they choose to subdivide. Anyway, this is just to say that I am looking forward to seeing nested folders on Dataverse.
Oh! I certainly don't get a chance to meet everyone at the community meeting.
I guess one other thing I'll mention is that over in https://github.com/IQSS/dataverse-client-r/issues/18 I made a little noise about a feature we had in DVN 3 (the predecessor to Dataverse 4) that allowed users to upload a zip file that gets expanded by DVN and then have other users download a zip of the files, but it didn't work quite as well as I had hoped. Folders were renamed from folder1/sub1 to folder1-sub1, for example. The zip files are not quite identical. Anyway, I thought I'd mention that I looked into this at least. To me, supporting zip upload and download would be one way to get some sort of support for file hierarchy, if it's implemented so that the zip files are as close as possible to identical when they are uploaded and downloaded.
Folder names and hierarchy are critical to reproducibility, so I'd love to see the option to retain zipped folders on upload. It may be that some users upload zips as a convenience, in which case extracting them in DVN is useful to them. For others, and especially for the work we do at Project TIER, retaining the zipped folder intact is essential. We've been using the double-zip hack to achieve this, but would love to see a zip retention option in future versions of the software.
Best,
Norm
Norm Medeiros
Associate Librarian of the College
Coordinator for Collection Management and Metadata Services
Haverford College
370 Lancaster Ave., Haverford, PA 19041
(610) 896-1173
Somehow I've managed to miss commenting on this issue until now:
tar instead of zip should avoid the double-zip hack, and should be a one-line change in a data deposition procedure.

I would be interested in using folders for a Dataset, and for example retaining the folder structure when dragging a folder with sub-folders to the upload box.
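The tar suggestion works because a single tar archive carries relative paths through a round trip. A quick sketch (illustrative only; file and folder names are made up):

```python
import os
import tarfile
import tempfile

def tar_roundtrip_names():
    """Create a small hierarchy, tar it, and return the archived
    member names: relative paths survive intact, which is why a
    single tar (like a single zip) can round-trip a directory tree."""
    with tempfile.TemporaryDirectory() as root:
        for sub, fname in [("data", "input.csv"), ("code", "analysis.py")]:
            os.makedirs(os.path.join(root, sub))
            open(os.path.join(root, sub, fname), "w").close()
        archive = os.path.join(root, "dataset.tar.gz")
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(os.path.join(root, "data"), arcname="data")
            tar.add(os.path.join(root, "code"), arcname="code")
        with tarfile.open(archive) as tar:
            return sorted(tar.getnames())

print(tar_roundtrip_names())
# ['code', 'code/analysis.py', 'data', 'data/input.csv']
```

The "one-line change" is simply `tar czf dataset.tar.gz data/ code/` instead of zipping twice.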
More discussion on maintaining folders in these IRC logs. http://irclog.iq.harvard.edu/dataverse/2018-04-20
I wonder if it is possible to retain the folder structure? ... I think Nextcloud, Owncloud, Google Drive, Dropbox, almost all cloud based storage systems can handle a hierarchical folder structure.
But when you have hundreds of files in many folders this is still not going to reduce the number of files in the root folder I guess...
This comment refers to the case where the full path to a file is given, e.g.
"/Example/Hierarchy/Structure/README.docx".
Thanks for the additional info, @mdehollander. Please feel free to add any other details of this feature specific to your use case here in this GitHub issue. We're still in the early stages of researching and designing potential solutions, and your feedback will help in that process.
In the short term, we are considering using the file hierarchy information as metadata, stored in the database, rather than having the files in a hierarchy on disk. This would allow users to view the hierarchy in a preview, with the file display in the table (adding filtering and sorting capabilities).
Users could add or move things around from the UI by specifying or editing the file's path.
On download, the original structure is recreated in the zip file they download.
This doesn't address everything desired, but we are interested in getting comments on this proposal. Here is a more detailed description:
Depositor drags a zip (not a double zip) into the dataset. The file will unzip and preserve the directory structure (see #3448). Individual files will be ingested (if necessary) and displayed just like any other file in a dataset: flat. Individual files can be downloaded. If all or any files are downloaded, the hierarchy will be re-created in a zip, matching the structure of the file that was uploaded in the first place.
A user wanting to access data selects “Download all” and downloads the original zip hierarchy. The system behavior is transparent to depositors.
- Add files
- Move files: similar function to the above; provide a way to edit the file path
- Versioning: consider treating a move as a metadata change and displaying a new version in the version table; a removed file is handled the same as any other file
- View hierarchy: show a "preview" of the hierarchical contents of the dataset
- Replace/unzip existing .zip: for existing double .zips, users can delete the original .zip and then upload a single .zip
- Download: for Stata files, add a toggle for original vs. ingested? Decide to show one? What about the download limit? How might that affect download? Can we leverage the S3/large/package file download UI?
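The core of the proposal above, hierarchy stored as per-file path metadata and re-applied when building the download zip, can be sketched as follows. Everything here (the function name, the dict shapes) is invented for illustration; it is not Dataverse's implementation.

```python
import io
import zipfile

def zip_from_metadata(files):
    """Rebuild a download zip from per-file path metadata: storage is
    flat, but each stored file carries a directory path that is
    re-joined onto its name at download time.
    `files` maps a stored (flat) name to (path_metadata, content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for stored_name, (path, content) in files.items():
            arcname = f"{path}/{stored_name}" if path else stored_name
            zf.writestr(arcname, content)
    buf.seek(0)
    return zipfile.ZipFile(buf).namelist()

files = {
    "input.csv": ("Original-Data", b"a,b\n"),
    "analysis.do": ("Command-Files", b"* stata\n"),
    "README": ("", b"top level\n"),
}
print(zip_from_metadata(files))
# ['Original-Data/input.csv', 'Command-Files/analysis.do', 'README']
```

Because the hierarchy lives in the database rather than on disk, editing a file's path is just a metadata update, which fits naturally with the versioning bullet above.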
@dpwrussell @pameyer @leeper @wddabc @jeisner @nmedeiro @christophergandrud @pdeffebach @setgree @mdehollander (and anyone else who is following this issue) good news! Dataverse 4.12 has support for organizing files into folders!
Can you all please try it out at https://demo.dataverse.org and give us feedback? Here are some screenshots that show how to introduce a folder hierarchy to your dataset's files:


This feature is documented as "File Path" at http://guides.dataverse.org/en/4.12/user/dataset-management.html#file-path and here's a screenshot of the docs:

Please just leave a comment below! Thanks!
@pdurbin Good that this works now with a zip file. Ideally I would like to see this also work with drag & drop, and to be able to browse through folders in the interface instead of seeing the folder name listed for each file. But hey, thanks for making this possible already!
@mdehollander great suggestion! Please feel free to open a new issue for this.
Everyone, while I'm writing I'll mention that I also wrote about the progress so far in this "Control over dataset file hierarchy + directory structure (new feature in Dataverse 4.12)" thread and feedback is welcome there as well: https://groups.google.com/d/msg/dataverse-community/8gn5pq0cVc0/MCMQAQHRAQAJ
If anyone wants to reply via Twitter, I would suggest piling on to one of these tweets:
We're currently working on "Enable the display of file hierarchy metadata on the dataset page" in #5572.
Phil, this is great! It worked perfectly with the test dataset I uploaded to the demo site. Thanks very much to you and your team for getting this much-needed functionality into Dataverse. It's critical to the computational reproducibility we're teaching.
All the best,
Norm
@nmedeiro fantastic! If you have a public sample zip file with a folder hierarchy that you give to your students that we can also use in our own testing, please let us know where to download it. 😄
Yes, I've been thinking that this is an important step toward more automated reproducibility. Code Ocean, for example, wants a "data" folder and "code" folder, as I wrote about at https://github.com/IQSS/dataverse/issues/4714#issuecomment-443344987 . Here's a screenshot:

I loaded one to the demo site: https://doi.org/10.5072/FK2/86JG25. Feel free to use it for testing.
@nmedeiro thanks! It's only 6.5 MB. Can I make it public by attaching it to this issue?
Sure.
@nmedeiro thanks! Here it is: dataverse_files.zip
Inside the "Replication Documentation for Midlife Crisis Paper" directory are the following files:
Original-Data/importable-pew.dta
Original-Data/original-pew.sav
Original-Data/original-wdi.xlsx
Command-Files/5-analysis.do
Command-Files/4-data-appendix.do
Analysis-Data/country-analysis.dta
Analysis-Data/individual-analysis.dta
Introduction to the Tier Protocol (v. 3.0) demo project. 2017-07-11.pdf
Introduction to the Tier Protocol (v. 3.0) demo project. 2017-07-11.docx
Thanks all for the feedback as we evaluated and implemented this in Dataverse. Very exciting to see this feature added.
@nmedeiro here's how the files and folders look in the "tree" view we shipped in Dataverse 4.13:

Thanks again!
Beautiful! Thanks for your efforts with this. It's very important to our work with students.