Galaxy: Discuss Better Support for Deeply Nested Dataset Structures

Created on 3 May 2017 · 9 Comments · Source: galaxyproject/galaxy

We had an in-depth discussion about deeply nested data structures. This issue summarizes that meeting and our perceived takeaways; we will close it once we have concrete issues (action items 😄) from it.

Some obvious existing GUI issues are #2495 and #3689 - and those just need to be fixed.

Here is the @jmchilton summary of this part of the meeting.

- James outlined a use case for deeply nested collections based on a ChIP-seq workflow - I believe this example had multiple samples, replicates, then a control and condition for each replicate - where the control or condition could each be a single fastq file or a pair of fastq files. James walked through which tools need which inputs and such.
- John argued the backend supports 85% of this - if the inputs were supplied as two separate list:list:pairs instead of a list:list:list*:pair (where the list* always contains the element identifiers condition and control) - the workflow would be executable today in Galaxy. The part that isn't there with the proposed approach is taking MACS without modification and feeding the "control" sublist and the "condition" sublist to separate inputs.
  - I think people were generally convinced that a list is a generalization of the control/condition concept (or for that matter the pair concept) but it would be good if users could describe these constraints for lists ahead of time. I don't think we discussed in depth where and how they could do this.
  - We discussed in a small way how sub-parts of the collection should be supplied to the tool. John argued that tools could be augmented to consume structured data via conditionals - for instance the cuff suite has some intelligence about how to consume nested collections. James argued that tools shouldn't be augmented. John agreed that the tool form should allow users to slice up collections and supply them to tools and workflows in different ways - but still feels augmenting the tools themselves would in many cases be a superior user experience and potentially less error-prone. The action item here was to allow the tool form to capture a subset of a collection. I think we were generally in agreement that dataset inputs need an "advanced" selection dialog of some kind.
- We discussed that the frontend needs a lot of work regardless of the state of the backend.
- We talked in depth about how to create such collections. I think there was some consensus that we should allow sample sheet uploads first along with files - and then refine that to allow building up the equivalent sheet structures from lists of files.
- We agreed that how mapping in the UI works doesn't give a user a clear impression of what is going to be generated when they click execute. We discussed potentially visualizing the outputs that would be expected from the supplied input.
- There were a lot of conversations about record types and a few comments about composite types.
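
To make the slicing idea concrete, here is a minimal sketch that models the list:list:list*:pair structure as plain nested dictionaries and splits it into the two list:list:pair structures John described. This is not Galaxy code: the helper `slice_inner_list`, the sample and replicate names, and the file names are all hypothetical, and the dictionaries only stand in for real collections.

```python
from typing import Any, Dict

Nested = Dict[str, Any]


def slice_inner_list(samples: Nested, key: str) -> Nested:
    """Keep only `key` ("control" or "condition") from the innermost list.

    Input shape:  sample -> replicate -> {control, condition} -> pair
    Output shape: sample -> replicate -> pair
    """
    return {
        sample: {rep: groups[key] for rep, groups in replicates.items()}
        for sample, replicates in samples.items()
    }


# Hypothetical ChIP-seq layout: one sample, two replicates, each with a
# control and a condition that are themselves forward/reverse pairs.
chipseq = {
    "sample1": {
        "rep1": {
            "control": {"forward": "s1_r1_ctl_1.fastq", "reverse": "s1_r1_ctl_2.fastq"},
            "condition": {"forward": "s1_r1_cnd_1.fastq", "reverse": "s1_r1_cnd_2.fastq"},
        },
        "rep2": {
            "control": {"forward": "s1_r2_ctl_1.fastq", "reverse": "s1_r2_ctl_2.fastq"},
            "condition": {"forward": "s1_r2_cnd_1.fastq", "reverse": "s1_r2_cnd_2.fastq"},
        },
    }
}

controls = slice_inner_list(chipseq, "control")
conditions = slice_inner_list(chipseq, "condition")
```

Both results share the same sample and replicate identifiers, which is what would let a tool with separate control and treatment inputs be mapped over them in lockstep.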

As I have been thinking about the meeting I had some more thoughts about record types, and my final impression was that simply adding constraints to lists would get us farther, faster than building up record types - which I see as being more general (and potentially too general to be useful in the context of our GUI). I want to implement record types - but the GUI problems seem more tractable with list constraints.
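
As a rough illustration of what a "list constraint" could mean, the sketch below checks a list's element identifiers against a fixed set of required keys. The function name `check_list_keys` and the error messages are hypothetical; nothing here reflects an existing Galaxy API, and a real design would need to decide whether extra keys are allowed.

```python
from typing import Iterable, List


def check_list_keys(element_identifiers: Iterable[str],
                    required_keys: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means the constraint holds."""
    present = set(element_identifiers)
    required = set(required_keys)
    problems = []
    missing = sorted(required - present)
    extra = sorted(present - required)
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    if extra:
        problems.append("unexpected keys: " + ", ".join(extra))
    return problems


# The control/condition list and the familiar pair are both just lists
# with constrained keys under this view.
CONTROL_CONDITION = ("control", "condition")
PAIR = ("forward", "reverse")

ok = check_list_keys(["condition", "control"], CONTROL_CONDITION)
bad = check_list_keys(["control", "input"], CONTROL_CONDITION)
```

Here `ok` is empty, while `bad` reports that `condition` is missing and `input` is unexpected.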

Digesting all of that I'm tempted to create these concrete issues:

1. Redesign tool inputs with an "advanced mode" drop down that allows greater flexibility in selecting inputs.
2. Once done with 1., create a visualization for mapping ahead of time using output types and input collection types.
3. Augment the tool execution APIs and workflow representation to allow slicing out sub-lists when executing tools.
4. Once done with 2. and 3., redesign tool inputs to allow selecting sub-parts of the collection.
5. Allow creation of nested collections in the GUI via sample sheets.
6. Allow users to create abstract list definitions that constrain the keys.
   We didn't discuss where the user would do this - but since the meeting I have been thinking workflow inputs is a good start as well as during creation itself. When uploading/creating collections - you can create the list constraints directly or import an input collection definition from a workflow via one of its inputs.
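
For the "visualization for mapping ahead of time" item, the core computation is predicting the shape of the output from the input collection type and the type a tool input consumes. The sketch below is a simplification under stated assumptions: `mapped_over_type` is a hypothetical helper, it only handles exact suffix matches on colon-separated type strings, and it writes the pair type as `paired`, which is how Galaxy spells it in collection type strings. Galaxy's real matching rules are richer than this.

```python
from typing import Optional


def mapped_over_type(input_type: str, consumed_type: Optional[str]) -> Optional[str]:
    """Predict the collection type of outputs produced by mapping.

    input_type:    type of the supplied collection, e.g. "list:list:paired"
    consumed_type: type the tool input consumes per job, or None when the
                   tool input takes a single dataset
    Returns the type of the implicit output collection, or None when the
    tool consumes the whole collection in one job (no mapping).
    """
    if consumed_type is None:
        # One job per dataset: outputs mirror the full input structure.
        return input_type
    ranks = input_type.split(":")
    consumed = consumed_type.split(":")
    split = len(ranks) - len(consumed)
    if split < 0 or ranks[split:] != consumed:
        raise ValueError(
            f"a {consumed_type!r} input cannot consume a {input_type!r} collection"
        )
    outer = ranks[:split]
    return ":".join(outer) if outer else None
```

For example, feeding a `list:list:paired` collection to a tool input that consumes a `paired` collection would be predicted to produce `list:list` outputs, which is the kind of information a preview could show before the user clicks execute.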

area/API area/UI-UX area/dataset-collections area/tool-framework area/workflows kind/feature status/planning


jmchilton


All 9 comments

Here is an example from my discussion with @shiltemann @yhoogstrate David van Zessen and Andrew Stubbs.

Suppose you have data from multiple patients. Each patient has three (or, really, any number of) types of biopsies taken. Each biopsy is sequenced in several technical replicates with a paired-end approach. In addition, there are other types of data about a patient such as smoker/non-smoker, age, sex, etc. So it looks something like this:

| UID | Patient | Feature 1 | Feature 2 | Feature 3 | ... | Metadata |
|-------------|--------------|---------------------|--------------------|------------|----|----------------|
| 1 | P1 | Biopsy 1 | Replicate 1 | Forward | | smoker, 41 years |
| 2 | P1 | Biopsy 1 | Replicate 1 | Reverse | | smoker, 41 years |
| 3 | P1 | Biopsy 1 | Replicate 2 | Forward | | smoker, 41 years |
| 4 | P1 | Biopsy 1 | Replicate 2 | Reverse | | smoker, 41 years |
| 5 | P1 | Biopsy 2 | Replicate 1 | Forward | | smoker, 41 years |
| 6 | P1 | Biopsy 2 | Replicate 1 | Reverse | | smoker, 41 years |
| 7 | P2 | Biopsy 1 | Replicate 1 | Forward | | non-smoker, 25 years |
| 8 | P2 | Biopsy 1 | Replicate 1 | Reverse | | non-smoker, 25 years |

This is taken from this image:

img_20170501_145452

To upload such a structure into Galaxy, users must be able to create a spreadsheet-like manifest in which they can associate individual files with the appropriate metadata. This example is very similar to the ChIP-seq example we discussed during the team meeting.
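
As a sketch of how such a manifest could be turned into a nested structure, the snippet below parses a tab-separated sheet with one row per file and groups the rows by patient, biopsy, replicate, and read direction. The column names, the file names, and the helper `nest_manifest` are assumptions made for illustration; patient-level metadata such as smoker status is not handled here.

```python
import csv
import io
from typing import Any, Dict, Sequence

# Hypothetical manifest mirroring the first rows of the table above.
MANIFEST = """\
file\tpatient\tbiopsy\treplicate\tdirection
p1_b1_r1_1.fastq\tP1\tBiopsy 1\tReplicate 1\tforward
p1_b1_r1_2.fastq\tP1\tBiopsy 1\tReplicate 1\treverse
p1_b1_r2_1.fastq\tP1\tBiopsy 1\tReplicate 2\tforward
p1_b1_r2_2.fastq\tP1\tBiopsy 1\tReplicate 2\treverse
p2_b1_r1_1.fastq\tP2\tBiopsy 1\tReplicate 1\tforward
p2_b1_r1_2.fastq\tP2\tBiopsy 1\tReplicate 1\treverse
"""


def nest_manifest(text: str, levels: Sequence[str], leaf: str = "file") -> Dict[str, Any]:
    """Group manifest rows into nested dicts, one nesting level per column."""
    root: Dict[str, Any] = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        node = root
        for level in levels[:-1]:
            node = node.setdefault(row[level], {})
        last = row[levels[-1]]
        if last in node:
            # Two files claiming the same slot indicates a manifest error.
            raise ValueError(f"duplicate entry: {[row[l] for l in levels]}")
        node[last] = row[leaf]
    return root


nested = nest_manifest(MANIFEST, ["patient", "biopsy", "replicate", "direction"])
```

The result has the shape patient -> biopsy -> replicate -> forward/reverse, which corresponds to a list:list:list:pair layout in the terms used in the issue description.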

nekrut on 4 May 2017

This is also somehow related to the ISA-tab discussions we had for the metabolomics datatypes. It would be nice to have a general concept of uploading data as an archive with a self-describing format - that can be converted into list-of-list and so on.

bgruening on 4 May 2017


I've updated the original issue with whiteboard pictures - a huge thanks to Jen for taking these.

jmchilton on 4 May 2017


Just for clarification and discussion purposes. My understanding of this is the following:

1. The collection data is uploaded to Galaxy through the FTP loader (or maybe as a single compressed file).
2. Additionally, the user uploads a manifest file, which is basically a tabular file with assignments and data attributes. One row corresponds to one file.
3. Users can click on the history dropdown and select something like "Create collection from Manifest".
4. Once selected, a table/grid view with the manifest data is displayed.
5. Users are now able to edit attributes and assignments through text and select input fields.
6. When the user confirms the manifest rows, the data is sent to the backend, which builds a collection dataset from the provided inputs.

guerler on 11 May 2017


thank you @guerler

nekrut on 15 May 2017

@guerler and others, yes, this sounds really really great.

@nekrut asked us to provide some more info on our process, so here are my 2 cents:

> The collection data is uploaded to Galaxy through the FTP loader (or maybe as a single compressed file)

Most of our users just upload the data as separate files through the upload menu (unless the files are very big). With the drag-and-drop feature and multiple file select it is easy enough to upload many files at once this way as well.

Not sure if you were thinking of doing this upon upload, but often our users would want to change their initial design later (e.g. remove poor quality samples, fix mistakes, change/add metadata etc) or build their collection from data already on Galaxy (think shared data libraries, imported from data sources, or uploaded by others and shared with them) so the ability to edit or build a manifest file from scratch in Galaxy from items in the history would be great.

Our experimental design/manifest usually looks like the one described by Anton. To give you a concrete example, right now we have an experiment where we have 100 samples we are analyzing with mothur, 3 technical replicates each, and metadata associated with each of them. Additionally we have 10 negative control samples, also consisting of 3 replicates each. Each sample has one negative control associated with it, but each negative control is associated with 10 of the samples. So one of the features/metadata of a sample could be a reference to another dataset as well.
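
The design described above can be sketched as sample metadata that refers to another entry by identifier. Everything in this snippet is hypothetical (the identifiers, the `negative_control` field, and the helper `unresolved_references`); it only illustrates that one metadata column of a manifest could point at another dataset or group of datasets.

```python
from collections import Counter
from typing import Dict, List

# 10 negative controls with 3 technical replicates each.
controls: Dict[str, List[str]] = {
    f"NC{i:02d}": [f"NC{i:02d}_rep{r}" for r in (1, 2, 3)] for i in range(1, 11)
}

# 100 samples with 3 technical replicates each; every block of 10 samples
# shares one negative control, referenced by its identifier.
samples = {
    f"S{i:03d}": {
        "replicates": [f"S{i:03d}_rep{r}" for r in (1, 2, 3)],
        "negative_control": f"NC{(i - 1) // 10 + 1:02d}",
    }
    for i in range(1, 101)
}


def unresolved_references(samples: dict, controls: dict) -> List[str]:
    """Return the samples whose negative_control is not a known control."""
    return sorted(
        name for name, meta in samples.items()
        if meta["negative_control"] not in controls
    )


# How many samples point at each control.
usage = Counter(meta["negative_control"] for meta in samples.values())
```

A manifest editor could run a check like `unresolved_references` before building collections, so that a dangling reference is caught while the user is still editing the sheet.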

shiltemann on 16 May 2017

I've created two big issues for what I see as the next big steps in the direction outlined here - #4707 for the advanced dataset input piece and #4733 for getting large amounts of nested data into Galaxy. We can keep this open for general comments - but specific comments about those two big issues I guess should be redirected to said issues?

jmchilton on 2 Oct 2017

Alright, I think action points 3 and 4 would help a lot with things that came up in #740

mvdbeek on 3 Feb 2018

I'm going to close this issue - it was a good conversation, it shaped half a year of my development time, and I'm proud of the outcome. I don't think we are done by any means but the landscape has really shifted - we've made a lot of progress on all of these issues with 18.05 I think - and we should have a new discussion at some point that reflects the current state of things and the new constructs we have to address these concerns. @mvdbeek and I will discuss a bunch of the enhancements we've made to tackle these problems at the GCC.

jmchilton on 16 May 2018
