Galaxy: Upload tool assigns `unsorted.bam` to any BAMs without the "SO:coordinate" header and these are missing a downloadable index.

Created on 30 Oct 2018 · 13 Comments · Source: galaxyproject/galaxy

Any uploaded bam that is not coordinate sorted results in a dataset with unsorted.bam assigned as the datatype. The index cannot be downloaded (server error). This data _does_ work directly with Galaxy's tools as of release 18.09.

Workaround for users until the index issue is corrected:

_If you plan to use the BAM dataset(s) in Galaxy_, tools will coordinate sort the data during runtime (using temporary files/indexes). _No datatype changes or other modifications are needed by the end user._ Avoid changing the datatype to bam with Edit Attributes > Datatype, or an unstable dataset will be created and downstream tools will pause (or fail). The datatype must remain as unsorted.bam or be changed back to that (as needed). Alternatively, follow the steps for the next item (create a coordinate-sorted BAM), _starting with a dataset having the unsorted.bam datatype_.
_If you need to download the BAM index_, the data will need to be coordinate-sorted first. For both options, the result will be a new dataset with the "SO:coordinate" header in the BAM dataset, the bam datatype assigned, and a downloadable index available under the disc icon.
- Option 1: Per-dataset, click on the pencil icon to reach the Edit Attributes tabs. Click into the second one (Edit Attributes > Convert). The BAM can be converted to a coordinate or queryname sorted BAM, but if you need an index to download, choose _coordinate-sort_.
- Option 2: Use the tool Picard SortSam or Samtools sort to sort by coordinate.
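For anyone doing the same sort outside Galaxy, here is a minimal sketch of the equivalent steps. It assumes `samtools` is installed and on the PATH of a local machine; the function names and file names are placeholders for illustration only, and this is not what Galaxy itself runs.

```python
import subprocess


def coordinate_sort_commands(unsorted_bam, sorted_bam):
    """Build the samtools commands to coordinate-sort a BAM and index the result.

    `samtools sort` sorts by coordinate by default, and `samtools index`
    writes the index next to the sorted file (sorted_bam + ".bai").
    """
    return [
        ["samtools", "sort", "-o", sorted_bam, unsorted_bam],
        ["samtools", "index", sorted_bam],
    ]


def sort_and_index(unsorted_bam, sorted_bam):
    """Run the commands in order; raises CalledProcessError if one fails."""
    for command in coordinate_sort_commands(unsorted_bam, sorted_bam):
        subprocess.run(command, check=True)
```

Calling `sort_and_index("input.bam", "input.sorted.bam")` would leave a coordinate-sorted BAM plus its `.bai` index on disk, which mirrors what the two options above produce inside Galaxy.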

History with test data: https://usegalaxy.org:/u/jen/h/test-history-bam-upload-best

Reported at Biostars (thank you!): https://biostar.usegalaxy.org/p/29842

ping @martenson (thanks for help getting this fixed!)

area/upload kind/bug

jennaj

All 13 comments

> This data cannot be used with tools

It'll be auto-converted if you select the file as input.

> the index cannot be downloaded.

Right, we can fix this.

> Workaround for users until fixed:

If you want to manually convert, don't use Picard; go to edit attributes -> convert -> bam.

mvdbeek on 30 Oct 2018

@mvdbeek The original bam was not coordinate sorted, and neither the peek view nor the download of these bams is sorted either. I think the sort is needed -- am I missing something?

jennaj on 30 Oct 2018

Right, so why would the upload change that? If you need it sorted it'll be auto converted.

mvdbeek on 30 Oct 2018

To elaborate a bit on this, Galaxy used to have just a single BAM datatype that was loosely defined as coordinate sorted, but nothing really enforced this. Tools that would consume bam files and expect them to be coordinate sorted would fail or give wrong results when fed a bam file that wasn't sorted.
What we used to do when setting metadata for bam files was check whether the file needed grooming, and eventually sort in place.
This had some problems:

- This check wasn't very reliable and was dependent on the samtools version, which led to server-to-server and file-to-file differences. Not great for reproducibility (although we have at least fixed the server-to-server differences by controlling the pysam version).
- We used to set metadata after we sniffed a file, so the conversion could only happen if we detected a coordinate-sorted bam file.
- We started seeing long metadata run times if a file had to be converted.

Now we have additional unsorted and queryname sorted datatypes. If we upload these we sniff them as the datatype that they really are. If they are needed in another sort-order they will be automatically converted if you run a job that needs them in another sort order.

This, I think, is less surprising to the user and more consistent: Uploading doesn't change what file you have uploaded, and setting a datatype only "helps out" the sniffer, it doesn't trigger any conversions.
The downside is the additional storage needed by the conversion (which we'll eliminate if we schedule the conversion with the job itself, something we talked about, can't find the issue now).
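To make the sort-order sniffing described above concrete, here is a minimal sketch that reads the `SO:` tag from a BAM header using only the Python standard library. It is not Galaxy's actual sniffer: the function names, the mapping table, and the in-memory fixture are all assumptions made for illustration, with the datatype names taken from this thread.

```python
import gzip
import io
import struct

# Assumed mapping from the @HD "SO:" value to the datatype names in this thread.
SORT_ORDER_TO_DATATYPE = {
    "coordinate": "bam",
    "queryname": "qname_sorted.bam",
}


def read_bam_header_text(bam):
    """Return the SAM header text from a BAM path or binary file object.

    BAM is BGZF-compressed, which is readable as concatenated gzip members.
    """
    with gzip.open(bam, "rb") as handle:
        if handle.read(4) != b"BAM\x01":
            raise ValueError("not a BAM file")
        (l_text,) = struct.unpack("<i", handle.read(4))
        return handle.read(l_text).decode("utf-8", errors="replace").rstrip("\x00")


def sort_order(header_text):
    """Return the SO: value from the @HD line, or "unknown" if absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return "unknown"


def guess_datatype(bam):
    """Anything that is not coordinate or queryname sorted maps to unsorted.bam."""
    return SORT_ORDER_TO_DATATYPE.get(sort_order(read_bam_header_text(bam)), "unsorted.bam")


def make_header_only_bam(header_text):
    """Hypothetical fixture: an in-memory BAM with a header and no references or reads."""
    text = header_text.encode("utf-8")
    payload = b"BAM\x01" + struct.pack("<i", len(text)) + text + struct.pack("<i", 0)
    return io.BytesIO(gzip.compress(payload))
```

With this sketch, a file with no `@HD` line at all falls through to `unsorted.bam`, and nothing in the file is modified along the way.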

mvdbeek on 30 Oct 2018

Ah, that explains the new usage, thank you! I thought all bams Uploaded using autodetect were sorted still. If not now, and tools do the conversion/sort, and we fix the index link, agree that should be enough. Plus it makes it easier on users for the reasons you describe: preserve original data and still benefit from a built-in sort that avoids input/usage tool errors.

I'm curious how a sequence-only BAM will function in all of this. Will test that out. Distinct ticket if there are problems (there have been some in the past).

jennaj on 30 Oct 2018

This should be sniffed as unsorted.bam, which is maybe a little unintuitive as a name. Is there something distinct that we can sniff in those?

mvdbeek on 30 Oct 2018

There won't be any header and no alignment data. The NCBI bam download tool produces these and used to give them the bam datatype. Let me test and see what it does now. Will also check how one Uploads. Will write back once done & share history.

jennaj on 30 Oct 2018

Tested the four types of bam (based on the SO: header info). Seems like sniffing is binary - either bam or unsorted.bam. Queryname bam didn't get sniffed as qname_sorted.bam. Should it be now? Or is that still a to-do or something we won't be doing now (maybe the datatype gets dropped?)

Test history with dinky bam/sams of all the flavors and a few tests to see how they function with tools: https://usegalaxy.org:/u/jen/h/test-history-bam-upload-best

jennaj on 30 Oct 2018

@mvdbeek Changing the datatype from any unsorted.bam to bam with edit attributes results in a metadata failure. Because they are not sorted? Or because of a missing/unattached index (.bam.bai)?

So for now, if a sniffed unsorted.bam is to be set with the datatype bam, it looks like the BAM data actually needs to be sorted, not just have the datatype changed.

Examples are in data 25-26-27 in the test history above. Those only had the datatype changed.

Autodetect does not fix the metadata warning. Examples are in data 28-29-30, same history.

Running tools against those datasets results in paused jobs. Examples are in data 31-42. There is no way back from here for a user except to revert to the unsorted.bam datatype. This isn't obvious: a user needs to know (1) what a datatype is and (2) how it is handled now (e.g. that Galaxy will do the coordinate sorting at runtime). Even a sort job on those datasets in an unstable metadata state will pause.

I'm going to change the workaround. People will need to coordinate sort right now to create a valid bam dataset that has an index that can be downloaded (the original problem reported -- end user couldn't download the index from a BAM that uploaded as unsorted.bam).

jennaj on 31 Oct 2018

> Changing the datatype from any unsorted.bam to bam with edit attributes results in a metadata failure. Because they are not sorted? Or because of a missing/unattached index (.bam.bai).

@jennaj Did you change the datatype from the "Datatypes" tab or did you use the converter from the "Convert" tab? The second should be the right one.

nsoranzo on 31 Oct 2018

Good catch! I used edit attributes > datatype, not edit attributes > convert.

Tested that out and it works for both coordinate-sorted and queryname-sorted.

Under the hood the convert function is basically the same as using either Samtools or Picard sort from the tool panel. Not sure which is more obvious for users. I'll cover both -- the way Upload works and the usage to get an index need to be written up in a clear way (new, distinct FAQ) and changed in existing FAQs. I just made some changes in the hub but will do more this week.

I can't think of a reason why a qname_sorted.bam couldn't have an attached (and downloadable) .bam.bai index or be autodetected with that datatype directly in Upload .. but maybe there is a reason? Possibly it is planned dev for later?

Thanks to you both for walking me through this and isolating what are actual issues versus new usage! Think I understand it now, at least for this part, hope we can tackle qname next and figure out what is usage (my problematic usage) vs potential sniffer or index-download issues.

My goal is to create a clear yet comprehensive FAQ reference for how this all works that can be shared with end users. Changes always bring about a bit of confusion, especially at first when older data is still around in histories and in active use. But certainly, in the big picture, the bam-type data modeling is MUCH more robust now & will avoid many, many prior usage issues. Very useful suite of changes!

jennaj on 31 Oct 2018

> Tested the four types of bam (based on the SO: header info). Seems like sniffing is binary - either bam or unsorted.bam. Queryname bam didn't get sniffed as qname_sorted.bam. Should it be now?

Thanks for testing this, and I can't believe we didn't actually test this automatically. This is a bug, which I will squash once and for all. We now have a testing framework that makes sure we sniff the correct datatype for a file. We just need to place a test file in lib/galaxy/datatypes/test.

> I can't think of a reason why a qname_sorted.bam couldn't have an attached (and downloadable) .bam.bai index or be autodetected with that datatype directly in Upload .. but maybe there is a reason?

@martenson had the same question yesterday, the short answer is it isn't possible / useful:
The bam index contains byte offsets to (IIRC) the first read for every 16,384 bases in the reference.
So if a bam file isn't coordinate sorted you can't do this. You usually need the index for accessing genomic regions without loading the entire file, which again only works for coordinate sorted bam files.
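A toy model of that idea, to show why the sort order matters. This is a deliberate simplification and not the real BAI layout: it stores a record number per 16,384-base window instead of a virtual file offset, ignores read lengths and the binning scheme, and all names are made up for illustration.

```python
WINDOW_SHIFT = 14  # 2**14 = 16,384 bases per window


def build_linear_index(positions):
    """Map each window to the record number of the first read seen in it."""
    index = {}
    for record_number, pos in enumerate(positions):
        index.setdefault(pos >> WINDOW_SHIFT, record_number)
    return index


def reads_in_window(positions, index, window):
    """Jump to the indexed record and scan forward until a read leaves the window.

    The early stop is only correct when the reads are coordinate sorted.
    """
    start = index.get(window)
    if start is None:
        return []
    hits = []
    for pos in positions[start:]:
        if pos >> WINDOW_SHIFT != window:
            break
        hits.append(pos)
    return hits


sorted_positions = [100, 200, 16384, 20000, 40000]
unsorted_positions = [100, 20000, 200, 40000]

# Sorted input: the jump-and-scan lookup finds every read in window 0.
print(reads_in_window(sorted_positions, build_linear_index(sorted_positions), 0))
# Unsorted input: the scan stops at the read at 20000 and never reaches 200.
print(reads_in_window(unsorted_positions, build_linear_index(unsorted_positions), 0))
```

On the sorted list the lookup returns both reads in the first window; on the unsorted list it silently drops one, which is the kind of wrong answer an index over a non-coordinate-sorted file would give.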

mvdbeek on 31 Oct 2018

I think this is now in good shape. Thanks a bunch @jennaj @mvdbeek for improving this!