Galaxy: followup for gzipped format support

Created on 26 Jan 2017 · 25 Comments · Source: galaxyproject/galaxy

Implemented in https://github.com/galaxyproject/galaxy/pull/3145

Please add what is needed @blankenberg @jmchilton @mvdbeek

solve double-counting of data towards quota (the implicit conversion of .gz file creates a hidden fastq that counts towards the quota)
provide 'peek' of gzipped contents
modify fastqgroomer to take in and output .gz files
find and adapt other high-gain tools
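
As an illustration of the 'peek' item above, here is a minimal sketch of previewing the first lines of a gzipped file without decompressing it to disk. It uses only Python's standard `gzip` module; the function name `peek_gzipped` and the `max_lines` parameter are hypothetical and are not taken from the Galaxy code base.

```python
import gzip
import itertools


def peek_gzipped(path, max_lines=5):
    """Return up to max_lines text lines from the start of a gzip file.

    Hypothetical helper for illustration only, not Galaxy's implementation.
    The file is streamed, so only the beginning is ever decompressed.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, max_lines)]
```

Because `gzip.open` decompresses lazily, a preview like this should stay cheap even for very large read files.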

area/UI-UX kind/enhancement

martenson

All 25 comments

Can we now import gzipped FASTQ files into a shared library (pointing at the original file)? This will be a big win for many local instances using this for distributing raw reads.

peterjc on 27 Jan 2017

@peterjc This should work out of the box since #3145 - if there are specific problems with library import or something we can work to address them.

jmchilton on 27 Jan 2017

Thanks @jmchilton - looks like nothing on https://wiki.galaxyproject.org/Admin/DataLibraries/UploadingLibraryFiles needs updating (as it never talked about compression).

peterjc on 27 Jan 2017

@peterjc I expect the same as @jmchilton - this should now work. If it does not we will fix it.

martenson on 27 Jan 2017

added the point that the implicit conversion of .gz file creates a hidden file that counts towards the quota - this is a major regression I believe

martenson on 27 Jan 2017

They were going to have an uncompressed fastq file counting against their quota anyway - so I don't see it as a major regression. If you dislike the fastq.gz thing being the default - can we just remove the sniffer? That should be less controversial I guess since then people would have to opt into optimized workflows.

jmchilton on 27 Jan 2017

@jmchilton I think that before this update the compressed file was thrown away after decompression and was not counted towards the quota.

martenson on 27 Jan 2017

Right.

uncompressed counting against quota = big problem
compressed counting against quota = compressed problem

jmchilton on 27 Jan 2017

Having the .gz and uncompressed versions of a file contributing to the 250 GB quota on Main will make staying under the limit even more difficult for the user.

It is incredibly difficult to do a real analysis of NGS experimental data and stay under the 250 GB limit. Anything we can do to help the user stay under 250 GB would benefit them.

MoHeydarian on 27 Jan 2017

👍1

I guess it would make sense to disable the sniffing -- we can enable it once the major tools support fastq.gz and/or we have better ways to educate users about implicit conversion and the quota.

mvdbeek on 27 Jan 2017
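
For readers unfamiliar with what "sniffing" means here, the following is a rough sketch of how a compressed FASTQ file could be detected on upload. It is an assumption-laden illustration, not Galaxy's actual sniffer: the function name `looks_like_gzipped_fastq` is hypothetical, and the check only inspects the gzip magic bytes and the first four-line record.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzipped_fastq(path):
    """Heuristically decide whether path is a gzip file holding FASTQ records.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as raw:
        if raw.read(2) != GZIP_MAGIC:
            return False
    try:
        with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
            record = [handle.readline().rstrip("\n") for _ in range(4)]
    except (OSError, EOFError):
        # Corrupt or truncated gzip data: treat as "not a gzipped FASTQ".
        return False
    header, sequence, separator, quality = record
    return (
        header.startswith("@")
        and separator.startswith("+")
        and len(sequence) > 0
        and len(sequence) == len(quality)
    )
```

With a check like this disabled, an uploaded .gz file would not be recognised as a compressed FASTQ, so users would have to pick the compressed datatype explicitly.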

Fixed in #3510

@MoHeydarian I hope you understand that a large part of what we are trying to do is to improve that situation - large uncompressed fastq files consume a large portion of many Galaxy quotas on many different Galaxy servers. We can't improve on the quota usage of these fastq files if users insist on keeping them uncompressed and using older tools indefinitely. I hope we get to the point where we can compress them by default within a release or two.

jmchilton on 27 Jan 2017

👍1

Can think of few options to address #3511 - but what do others suggest?

Wanted to get that considered before this change is finalized.

jennaj on 27 Jan 2017

@jennaj #3511 is attempting to fix this I believe.

Update: I meant #3510.

jmchilton on 27 Jan 2017

The hidden fastq.gz datasets are showing up as valid selections for tool inputs, along with the active dataset. They share a dataset number (expected).

Problem: When the hidden dataset is selected, it causes tools to error. An example is BWA-Mem (pic below). Would correcting this be part of this ticket or should it be put in a distinct ticket?

In the example below, the history contains two active fastq.gz datasets. Each has two associated hidden datasets (unexpected?).

This will go away _if this class of hidden dataset is corrected to not be a selectable input_ - but notice that when one of those two active datasets is already selected (18) on the tool form for another option (reverse reads), the primary dataset for 18 shows up in the choices but the hidden dataset 18 does not. The fastq.gz input that has not already been selected for any other tool option (dataset 17) has both the primary and hidden dataset in the select menu. Odd, but likely an effect of the hidden dataset's metadata on the select menu filtering method - might be a clue to fixing.

[Screenshot: BWA-MEM tool form input select menu, 2017-02-15]

jennaj on 15 Feb 2017

All issues related to this seem fixed now - closing out, thank you!

https://usegalaxy.org/u/jen/h/compressed-uploaded-fq-tests

jennaj on 23 Mar 2018

@jennaj are you sure these two are solved?

solve double-counting of data towards quota (the implicit conversion of .gz file creates a hidden fastq that counts towards the quota)
find and adapt other high-gain tools

martenson on 24 Mar 2018

solve double-counting of data towards quota (the implicit conversion of .gz file creates a hidden fastq that counts towards the quota)

I don't think that's really solvable though, or do you want to change the hidden-by-default behavior?

find and adapt other high-gain tools

That's not a galaxy issue IMO.

mvdbeek on 26 Mar 2018

@mvdbeek Maybe Galaxy should delete the implicitly converted dataset after the job that used it finishes? I think there are many possible solutions and that the current situation is detrimental to the new and great feature of handling compressed datasets.

martenson on 26 Mar 2018

It might be important to note that per @natefoo this is one of the main reasons why we do not have SNIFF_COMPRESSED_FASTQS enabled on Main - fear of people not being able to use their quota efficiently.

martenson on 26 Mar 2018

@martenson This is a transient problem - I don't think anyone running a modern workflow should need uncompressed FASTQ files. The underlying tools all support compressed files, and if there is some odd tool that doesn't support them we just need to solve that at the wrapper level - Galaxy shouldn't ever really need to implicitly convert these files. Solving the tools problem makes the implicit conversion space problem go away and we shouldn't be spending energy on it IMO.

jmchilton on 26 Mar 2018

👍1

I don't think we can make a reasonable choice between deleting the converted or the source dataset. We could though offer something in the user preferences (or the history actions?) that would list and/or delete all converted datasets.

mvdbeek on 26 Mar 2018

@jmchilton in that case we should enable sniffing of compressed formats on Main imo, because unless you specify it explicitly we will always uncompress it now.

martenson on 26 Mar 2018

👍1

People run old versions of things for a long time on Main, so I don't think this problem is going to go away as quickly as you hope, @jmchilton. How difficult would it be to have implicit datasets appear as some kind of notification on the parent?

natefoo on 27 Mar 2018

I've removed the bug label. Nate's suggestion is good though, plus we could have something like delete all implicitly converted datasets (if the original still exists).

mvdbeek on 7 Nov 2019
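
The "delete all implicitly converted datasets (if the original still exists)" idea can be sketched roughly as follows. This is a toy model under stated assumptions, not Galaxy's data model: the `Dataset` class, its `converted_from` field, and the functions `quota_usage` and `delete_converted` are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dataset:
    """Toy stand-in for a history dataset (hypothetical, not Galaxy's model)."""
    id: int
    size_bytes: int
    converted_from: Optional[int] = None  # id of the source dataset, if any
    deleted: bool = False


def quota_usage(datasets):
    """Sum the sizes of all datasets that still count towards the quota."""
    return sum(d.size_bytes for d in datasets if not d.deleted)


def delete_converted(datasets):
    """Mark implicitly converted datasets as deleted when their source exists.

    Returns the number of bytes freed. Converted datasets whose source is
    missing or already deleted are kept, since they cannot be regenerated.
    """
    live_ids = {d.id for d in datasets if not d.deleted}
    freed = 0
    for d in datasets:
        if d.deleted or d.converted_from is None:
            continue
        if d.converted_from in live_ids:
            d.deleted = True
            freed += d.size_bytes
    return freed
```

In this model a fastq.gz upload plus its hidden uncompressed copy both count towards `quota_usage` until `delete_converted` is run, which mirrors the double-counting concern raised at the top of this issue.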

Don't intend to muck this up more but some tools require uncompressed and I don't think that is going to change. Or maybe it will?

Example: Trinity (Beta) De novo assembly of RNA-Seq data Using Trinity on PSC's Bridges (Galaxy Version 0.0.1) as run at usegalaxy.org

I see a lot of histories where people are trying to use compressed but then run into a tool like this one .. and the quota explodes. Trinity is heavily used -- if it just needs a wrapper update to use compressed, then needs a ticket & priority. What repo? Who do I ping on it?

jennaj on 14 Nov 2019
