We had an in-depth discussion about deeply nested data structures. This issue summarizes that meeting and our perceived takeaways; we will close it once we have concrete issues (action items 😄) from that.
Some obvious existing GUI issues are #2495 and #3689 - and those just need to be fixed.
Here is the @jmchilton summary of this part of the meeting.
list:list:pairs instead of a list:list:list*:pair (where the list* always contains the element identifiers condition and control) - the workflow would be executable today in Galaxy. The part that isn't there with the proposed approach is the ability to take MACS without modification and feed the "control" sublist and the "condition" sublist to separate inputs.
As I have been thinking about the meeting I had some more thoughts about record types, and my final impression was that simply adding constraints to lists would get us farther, faster than building up record types - which I see as being more general (and potentially too general to be useful in the context of our GUI). I want to implement record types - but the GUI problems seem more tractable with list constraints.
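To make the missing piece concrete, here is a minimal sketch of splitting the "condition" and "control" sublists so each could be fed to a separate input. It uses plain Python dicts as a hypothetical stand-in for a nested collection; `split_sublists`, the sample identifiers, and the file names are all made up for illustration and are not Galaxy API.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (forward, reverse) file names, hypothetical
Nested = Dict[str, Dict[str, List[Pair]]]


def split_sublists(collection: Nested, keys=("condition", "control")):
    """Return one collection per key, each keeping the outer identifiers."""
    result = {key: {} for key in keys}
    for outer_id, sublists in collection.items():
        # This check plays the role of a "list constraint": every inner
        # list must contain exactly the expected element identifiers.
        missing = [key for key in keys if key not in sublists]
        if missing:
            raise KeyError(f"{outer_id} lacks sublists: {missing}")
        for key in keys:
            result[key][outer_id] = sublists[key]
    return result


experiment: Nested = {
    "sample_a": {
        "condition": [("a_cond_1.fq", "a_cond_2.fq")],
        "control": [("a_ctrl_1.fq", "a_ctrl_2.fq")],
    },
    "sample_b": {
        "condition": [("b_cond_1.fq", "b_cond_2.fq")],
        "control": [("b_ctrl_1.fq", "b_ctrl_2.fq")],
    },
}

split = split_sublists(experiment)
```

Both resulting collections keep the same outer identifiers, so a tool mapped over them would still see matching elements side by side.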
Digesting all of that I'm tempted to create these concrete issues:
Here is an example from my discussion with @shiltemann @yhoogstrate David van Zessen and Andrew Stubbs.
Suppose you have data from multiple patients. Each patient has three (or, really, any number of) types of biopsies taken. Each biopsy is sequenced in several technical replicates with a paired-end approach. In addition, there are other types of data about a patient, such as smoker/non-smoker, age, sex, etc. So it looks something like this:
| UID | Patient | Feature 1 | Feature 2 | Feature 3 | ... | Metadata |
|-------------|--------------|---------------------|--------------------|------------|----|----------------|
| 1 | P1 | Biopsy 1 | Replicate 1 | Forward | | smoker, 41 years |
| 2 | P1 | Biopsy 1 | Replicate 1 | Reverse | | smoker, 41 years |
| 3 | P1 | Biopsy 1 | Replicate 2 | Forward | | smoker, 41 years |
| 4 | P1 | Biopsy 1 | Replicate 2 | Reverse | | smoker, 41 years |
| 5 | P1 | Biopsy 2 | Replicate 1 | Forward | | smoker, 41 years |
| 6 | P1 | Biopsy 2 | Replicate 1 | Reverse | | smoker, 41 years |
| 7 | P2 | Biopsy 1 | Replicate 1 | Forward | | non-smoker, 25 years |
| 8 | P2 | Biopsy 1 | Replicate 1 | Reverse | | non-smoker, 25 years |
This is taken from this image:

To upload such a structure into Galaxy, users must be able to create a spreadsheet-like manifest in which they can associate individual files with the appropriate metadata. This example is very similar to the ChIP-seq example we discussed during the team meeting.
This is also somewhat related to the ISA-tab discussions we had for the metabolomics datatypes. It would be nice to have a general concept of uploading data as an archive with a self-describing format that can be converted into a list-of-lists and so on.
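As a rough illustration of how a flat, one-row-per-file manifest like the table above could be turned into a nested structure (patient, then biopsy, then replicate, then forward/reverse), here is a small sketch. The column names, the file names, and the `nest_manifest` helper are assumptions made up for this example; nothing here reflects an existing Galaxy format or API.

```python
import csv
import io

# Hypothetical tab-separated manifest: one row per file.
MANIFEST = """patient\tbiopsy\treplicate\tdirection\tfile
P1\tBiopsy 1\tReplicate 1\tForward\tp1_b1_r1_f.fq
P1\tBiopsy 1\tReplicate 1\tReverse\tp1_b1_r1_r.fq
P1\tBiopsy 1\tReplicate 2\tForward\tp1_b1_r2_f.fq
P1\tBiopsy 1\tReplicate 2\tReverse\tp1_b1_r2_r.fq
P2\tBiopsy 1\tReplicate 1\tForward\tp2_b1_r1_f.fq
P2\tBiopsy 1\tReplicate 1\tReverse\tp2_b1_r1_r.fq
"""


def nest_manifest(text):
    """Group manifest rows into patient -> biopsy -> replicate -> pair."""
    nested = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        pair = (
            nested.setdefault(row["patient"], {})
            .setdefault(row["biopsy"], {})
            .setdefault(row["replicate"], {})
        )
        direction = row["direction"].lower()
        if direction in pair:
            # Two rows claim the same slot, so the manifest is ambiguous.
            raise ValueError(f"duplicate {direction} read in row: {row}")
        pair[direction] = row["file"]
    return nested


nested = nest_manifest(MANIFEST)
```

Each innermost dict holds one forward/reverse pair, which is roughly the shape a list:list:list:paired collection would have; per-patient metadata such as smoker status or age would need to travel alongside it.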
I've updated the original issue with whiteboard pictures - a huge thanks to Jen for taking these.
Just for clarification and discussion purposes. My understanding of this is the following:
The collection data is uploaded to Galaxy through the FTP loader (or maybe as a single compressed file)
Additionally the user uploads a manifest file which is basically a tabular file with assignments and data attributes. One row corresponds to one file.
Users can click on the history dropdown and select something like "Create collection from Manifest"
Once selected a table/grid view with the manifest data is displayed.
Users are now able to edit attributes and assignments through text and select input fields.
When the user confirms the manifest rows, the data is sent to the backend, which builds a collection dataset from the provided inputs.
thank you @guerler
@guerler and others, yes, this sounds really really great.
@nekrut asked us to provide some more info on our process, so here are my 2 cents:
> The collection data is uploaded to Galaxy through the FTP loader (or maybe as a single compressed file)

Most of our users just upload it as separate files through the upload menu (unless the files are very big). With the drag-and-drop feature and multiple file select it is easy enough to upload many files at once this way as well.
Not sure if you were thinking of doing this upon upload, but often our users would want to change their initial design later (e.g. remove poor-quality samples, fix mistakes, change/add metadata, etc.) or build their collection from data already on Galaxy (think shared data libraries, data imported from data sources, or data uploaded by others and shared with them). So the ability to edit a manifest file, or build one from scratch in Galaxy from items in the history, would be great.
Our experimental design/manifest usually looks like the one described by Anton. To give you a concrete example, right now we have an experiment where we have 100 samples we are analyzing with mothur, 3 technical replicates each, and metadata associated with each of them. Additionally we have 10 negative control samples, also consisting of 3 replicates each. Each sample has one negative control associated with it, but each negative control is associated with 10 of the samples. So one of the features/metadata of a sample could be a reference to another dataset as well.
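A minimal sketch of that design, just to show what "a feature that references another dataset" could look like as data. The `Sample` class, the identifiers, the file names, and the block-of-ten assignment of samples to controls are all assumptions for illustration; the real assignment would come from the experimental design.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Sample:
    name: str
    replicates: Tuple[str, ...]  # three technical replicates
    control: str  # identifier of the associated negative control


def replicate_files(name):
    """Hypothetical file names for the three technical replicates."""
    return tuple(f"{name}_rep{r}.fq" for r in (1, 2, 3))


# 10 negative controls, keyed by identifier.
controls = {f"NC{i}": replicate_files(f"NC{i}") for i in range(1, 11)}

# 100 samples; assume, for illustration only, that consecutive blocks
# of ten samples share one negative control.
samples = [
    Sample(
        name=f"S{i}",
        replicates=replicate_files(f"S{i}"),
        control=f"NC{(i - 1) // 10 + 1}",
    )
    for i in range(1, 101)
]


def resolve_control(sample):
    """Follow the reference from a sample to its control's replicates."""
    return controls[sample.control]


# How many samples point at each control.
usage = Counter(sample.control for sample in samples)
```

The point is that `control` is a reference rather than a nested element: the same control is shared by ten samples, which a strictly tree-shaped collection cannot express without duplicating it.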
I've created two big issues for what I see as the next big steps in the direction outlined here - #4707 for the advanced dataset input piece and #4733 for getting large amounts of nested data into Galaxy. We can keep this open for general comments - but specific comments about those two big issues I guess should be redirected to said issues?
Alright, I think action points 3 and 4 would help a lot with things that came up in #740
I'm going to close this issue - it was a good conversation, it shaped half a year of my development time, and I'm proud of the outcome. I don't think we are done by any means, but the landscape has really shifted - I think we've made a lot of progress on all of these issues with 18.05 - and we should have a new discussion at some point that reflects the current state of things and the new constructs we have to address these concerns. @mvdbeek and I will discuss a bunch of the enhancements we've made to tackle these problems at the GCC.