Galaxy: vcf.gz files can be loaded, but can't then be used as the input to another step; vcf files are fine

Created on 15 Feb 2019 · 8 Comments · Source: galaxyproject/galaxy

It's possible to upload a vcf.gz file (e.g. see 19:), but when an attempt is made to use it as tool input (e.g. with VCF-BedIntersect in the main panel), 19: shows as (unavailable).

If an uncompressed vcf file is loaded (e.g. as 21:), there is no such issue (e.g. see 22: and 23:).

[Screenshot 2019-02-15 at 11.24.19]

Best, Tim

area/datatypes kind/bug

All 8 comments

The assigned datatype is confusing. Is there a reason we are not using vcf.gz instead of vcf_bgzip? That naming would be more consistent with other datatypes. Example: fasta + fasta.gz

The Upload tool will uncompress vcf.gz data to vcf when it is added to the history with "autodetect datatype". True as of at least release 19.01.

We should decide whether to preserve/detect compressed format or not in Upload, then autodetect datatype appropriately (however named).

There are probably many tools that won't work with compressed vcf data.

The differences are because vcf_bgzip is not using our compressed datatypes framework (i.e. auto_compressed_types).

The datatype is defined as <datatype extension="vcf_bgzip" type="galaxy.datatypes.tabular:VcfGz" display_in_upload="true"> instead.

vcf_bgzip is not a gzipped vcf, it's compressed with bgzip and indexed.
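To make the distinction concrete, here is a minimal Python sketch that tells the two ".gz" flavours apart by inspecting the first block header. It relies on the BGZF layout from the SAM/BAM specification: a gzip header with the FEXTRA flag set and an extra subfield tagged 'B', 'C' with a 2-byte payload. The function name `is_bgzf` is made up for illustration; this is not Galaxy's sniffer code.

```python
import gzip
import struct

# The 28-byte BGZF end-of-file marker from the SAM/BAM specification.
# It is itself a valid (empty) BGZF block, so it works as sample input.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"  # magic, CM, FLG, MTIME, XFL, OS, XLEN
    "4243" "0200" "1b00"                 # SI1='B', SI2='C', SLEN=2, BSIZE
    "0300" "00000000" "00000000"         # empty deflate block, CRC32, ISIZE
)


def is_bgzf(header: bytes) -> bool:
    """Return True if the bytes start with a BGZF (bgzip) block header."""
    if len(header) < 12 or header[:2] != b"\x1f\x8b":
        return False  # not gzip at all
    if header[2] != 8 or not header[3] & 0x04:
        return False  # gzip without the FEXTRA flag, so not bgzip
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (66, 67, 2):
            return True
        pos += 4 + slen
    return False


plain = gzip.compress(b"##fileformat=VCFv4.2\n")
print(is_bgzf(plain), is_bgzf(BGZF_EOF))
```

Both inputs begin with the same gzip magic bytes, which is why the file extension and a naive magic-number check cannot separate them. The index (e.g. a .tbi file) is a separate file and is not detected here.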

@nsoranzo @martenson Ah, right, the index -- forgot about that. Helps!

Questions now are:

  1. Should we add in the datatype vcf.gz and support it? (gzip version)
  2. The extension is the same on both outputs ".gz" -- how can we avoid user confusion? Even our code is confusing, eg type="galaxy.datatypes.tabular:VcfGz".
  3. Should we add help to tool forms noting which version of vcf data is accepted? Is there any way to do that in batch (across tools, without modifying the wrappers)? Maybe a message inside the input select area? That would be a Galaxy function rather than a wrapper function, or do I have that wrong?
  4. What can we do in the Upload tool to check if the wrong datatype was selected? Maybe not easy/possible if compressed, although we do (or did) check for some (ex: bam).

This has come up in other Q&A recently. I told the users to Upload with autodetect, which uncompresses gzip ".gz" files. I haven't yet tested autodetect specifically with a vcf_bgzip ".gz" file (I will and post back, curious about the behavior).

Manuals for two different types to help with discussion:

So we just need an implicit converter to vcf for this datatype I think, that should be fairly straightforward.
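For illustration only, the core of such a conversion can be sketched in a few lines of Python. This is not the converter that ended up in Galaxy, and `decompress_vcf` is a hypothetical helper name. It relies on the fact that a bgzip file is a series of concatenated gzip members, which Python's `gzip` module reads transparently.

```python
import gzip
import shutil
from pathlib import Path


def decompress_vcf(src: Path, dest: Path) -> None:
    """Write the decompressed contents of a .vcf.gz file to dest."""
    # gzip.open reads every gzip member in the file, so the same call
    # handles single-member gzip and multi-block bgzip input.
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

The same function can also serve as a local workaround: decompress the file before uploading it so that it is stored as plain vcf.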

See also https://github.com/galaxyproject/galaxy/issues/4892 : uploading a vcf.gz (compressed with gzip, not bgzip) results in an error (red dataset). I'm working on a fix, just need to finish the tests.

@nsoranzo See the test I ran today in https://github.com/galaxyproject/galaxy/issues/4892. Not a red dataset anymore, but a misassigned vcf_bgzip datatype with warnings. Your PR should fix that anyway .. or looks like it will.

This different issue will be fixed by the converter @almahmoud is working on. We can stage it all on Test before 19.05 goes out and see how it all works together once integrated on a public server (with everything else that is changing in the release).

Thanks all!

Can I ask how to deal with other file/data types that are gzipped? For instance, we have .cel.gz files; when I try to upload them to our local Galaxy, I get a warning that the file type does not match the uploaded file. I tried this both with the distro datatypes_conf.xml and after adding the auto_compressed_types="gz" flag to the "cel" datatype. Both produced the same warning. Is there any way to set the file type for these .cel.gz files correctly? Or are we left with only using "auto" (does that work?)?
