Galaxy: followup for gzipped format support

Created on 26 Jan 2017 · 25 Comments · Source: galaxyproject/galaxy

Implemented in https://github.com/galaxyproject/galaxy/pull/3145

Please add what is needed @blankenberg @jmchilton @mvdbeek

  • solve double-counting of data towards quota (the implicit conversion of .gz file creates a hidden fastq that counts towards the quota)
  • provide 'peek' of gzipped contents
  • modify fastqgroomer to take in and output .gz files
  • find and adapt other high-gain tools
Labels: area/UI-UX, kind/enhancement

All 25 comments

Can we now import gzipped FASTQ files into a shared library (pointing at the original file)? This will be a big win for many local instances using this for distributing raw reads.

@peterjc This should work out of the box since #3145 - if there are specific problems with library import or something we can work to address them.

Thanks @jmchilton - looks like nothing on https://wiki.galaxyproject.org/Admin/DataLibraries/UploadingLibraryFiles needs updating (as it never talked about compression).

@peterjc I expect the same as @jmchilton - this should now work. If it does not we will fix it.

added the point that the implicit conversion of .gz file creates a hidden file that counts towards the quota - this is a major regression I believe

They were going to have an uncompressed fastq file counting against their quota anyway - so I don't see it as a major regression. If you dislike the fastq.gz thing being the default - can we just remove the sniffer? That should be less controversial I guess since then people would have to opt into optimized workflows.

@jmchilton I think that before this update the compressed file was thrown away after decompression and was not counted towards the quota.

Right.

uncompressed counting against quota = big problem
compressed counting against quota = compressed problem

Having the .gz and uncompressed versions of a file contributing to the 250 GB quota on Main will make staying under the limit even more difficult for the user.

It is incredibly difficult to do a real analysis of NGS experimental data and stay under the 250 GB limit. Anything we can do to help the user stay under 250 GB would benefit them.

I guess it would make sense to disable the sniffing -- we can enable it once the major tools support fastq.gz and/or we have better ways to educate users about implicit conversion and the quota.
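For readers unfamiliar with the term, "sniffing" here means inspecting the leading bytes of an upload to guess its datatype. The following is a minimal sketch in plain Python, not Galaxy's actual sniffer code: the function names `is_gzipped`, `looks_like_fastq_gz`, and `peek_gzipped` are hypothetical, and the only hard fact relied on is that gzip streams begin with the bytes `1f 8b`. The FASTQ check is a deliberately loose heuristic on the first record.

```python
import gzip
import io

# The two-byte magic number that opens every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(data: bytes) -> bool:
    """Return True if the payload starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def looks_like_fastq_gz(data: bytes) -> bool:
    """Loose heuristic (an assumption, not Galaxy's rule): the first record
    has a header line starting with '@' and a separator line starting with '+'."""
    if not is_gzipped(data):
        return False
    with gzip.open(io.BytesIO(data), "rt") as handle:
        lines = [handle.readline() for _ in range(4)]
    return lines[0].startswith("@") and lines[2].startswith("+")


def peek_gzipped(data: bytes, max_lines: int = 5) -> list[str]:
    """Decompress only as much as needed to show the first few lines."""
    with gzip.open(io.BytesIO(data), "rt") as handle:
        return [line.rstrip("\n") for _, line in zip(range(max_lines), handle)]
```

A peek built this way streams from the compressed file, so it does not need an uncompressed copy on disk.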

Fixed in #3510

@MoHeydarian I hope you understand that a large part of what we are trying to do here is to improve that situation - large uncompressed fastq files consume a large portion of many Galaxy quotas on many different Galaxy servers. We can't improve on the quota usage of these fastq files if users insist on keeping them uncompressed and using older tools indefinitely. I hope we get to the point where we can compress them by default within a release or two.

I can think of a few options to address #3511 - but what do others suggest?

Wanted to get that considered before this change is finalized.

@jennaj #3511 is attempting to fix this I believe.

Update: I meant #3510.

The hidden fastq.gz datasets are showing up as valid selections for tool inputs, along with the active dataset. They share a dataset number (expected).

Problem: When the hidden dataset is selected, it causes tools to error. An example is BWA-Mem (pic below). Would correcting this be part of this ticket or should it be put in a distinct ticket?

In the example below, the history contains two active fasta.gz datasets. Each has two associated hidden datasets (unexpected?).

This will go away _if this class of hidden dataset is corrected to not be a selectable input_ - but notice that when one of those two active datasets (18) is already selected on the tool form for another option (reverse reads), the primary dataset for 18 shows up in the choices but the hidden dataset 18 does not. The fastq.gz input that has not already been selected for any other tool option (dataset 17) has both the primary and hidden dataset in the select menu. Odd, but likely an effect of the hidden dataset's metadata on the select menu filtering method - might be a clue to fixing.

[Screenshot, 2017-02-15 9:47:39 am: BWA-MEM tool form input select menu]

All issues related to this seem fixed now - closing out, thank you!

https://usegalaxy.org/u/jen/h/compressed-uploaded-fq-tests

@jennaj are you sure these two are solved?

  • solve double-counting of data towards quota (the implicit conversion of .gz file creates a hidden fastq that counts towards the quota)
  • find and adapt other high-gain tools

solve double-counting of data towards quota (the implicit conversion of .gz file creates a hidden fastq that counts towards the quota)

I don't think that's really solvable though, or do you want to change the hidden-by-default behavior ?

find and adapt other high-gain tools

That's not a galaxy issue IMO.

@mvdbeek Maybe Galaxy should delete the implicitly converted dataset after the job that used it finishes? I think there are many possible solutions and that the current situation is detrimental to the new and great feature of handling compressed datasets.

It might be important to note that per @natefoo this is one of the main reasons why we do not have SNIFF_COMPRESSED_FASTQS enabled on Main - fear of people not being able to use their quota efficiently.

@martenson This is a transient problem - no one running in any modern workflow should need uncompressed FASTQ files I don't think. The underlying tools all support them and if there is some odd tool that doesn't support them we just need to solve that at the wrapper level - Galaxy shouldn't ever really need to implicitly convert these files. Solving the tools problem makes the implicit conversion space problem go away and we shouldn't be spending energy on it IMO.

I don't think we can make a reasonable choice between deleting the converted or the source dataset. We could though offer something in the user preferences (or the history action ?) that would list and/or delete all converted datasets.

@jmchilton in that case we should enable sniffing of compressed formats on Main imo, because unless you specify it explicitly we will always uncompress it now.

People run old versions of things for a long time on Main, so I don't think this problem is going to go away as quickly as you hope, @jmchilton. How difficult would it be to have implicit datasets appear as some kind of notification on the parent?

I've removed the bug label. Nate's suggestion is good though, plus we could have something like delete all implicitly converted datasets (if the original still exists).
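The "delete all implicitly converted datasets (if the original still exists)" idea can be sketched as follows. This is only an illustration over a toy in-memory model, not Galaxy's data model or API: the `Dataset` class, its `converted_from` field, and the functions `quota_usage` and `delete_implicit_conversions` are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dataset:
    """Toy stand-in for a history dataset (hypothetical, not Galaxy's model)."""
    id: int
    ext: str
    size: int  # size in arbitrary units counted against the quota
    deleted: bool = False
    converted_from: Optional[int] = None  # id of the source dataset, if implicitly converted


def quota_usage(datasets: list[Dataset]) -> int:
    """Sum the sizes of all datasets that are not deleted."""
    return sum(d.size for d in datasets if not d.deleted)


def delete_implicit_conversions(datasets: list[Dataset]) -> int:
    """Mark implicitly converted datasets as deleted, but only when their
    source dataset still exists. Returns the amount of quota freed."""
    by_id = {d.id: d for d in datasets}
    freed = 0
    for d in datasets:
        if d.converted_from is None or d.deleted:
            continue
        source = by_id.get(d.converted_from)
        if source is not None and not source.deleted:
            d.deleted = True
            freed += d.size
    return freed
```

In this sketch a converted dataset whose source is gone is left alone, since it may be the only remaining copy of the data.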

Don't intend to muck this up more but some tools require uncompressed and I don't think that is going to change. Or maybe it will?

Example: Trinity (Beta) De novo assembly of RNA-Seq data Using Trinity on PSC's Bridges (Galaxy Version 0.0.1) as run at usegalaxy.org

I see a lot of histories where people are trying to use compressed input but then run into a tool like this one .. and the quota explodes. Trinity is heavily used -- if it just needs a wrapper update to use compressed input, then it needs a ticket & priority. What repo? Who do I ping on it?
