Galaxy: Upload and sniffer performance problems and bugs

Created on 28 Feb 2018 · 6 Comments · Source: galaxyproject/galaxy

There are multiple issues (dating back quite a long time, it would appear) with uploads:

  1. The sniffers are still run even when a datatype has been explicitly selected
  2. Many sniffers for text types attempt to read files by line, and thus can easily run out of memory trying to read the first "line" of a binary file
  3. Even if you don't run out of memory (the binary file is smaller than the memory allocated for the job, or a newline byte occurs before that limit is reached), running files through every sniffer is extremely slow when each sniffer can read GBs of data
  4. Each sniffer separately opens the uploaded dataset and reads some amount of it. We could read a fixed amount into memory once and give sniffers an itertools.tee()'d iterator (though this may not be a big performance problem if sniffers were guaranteed to read only a few KB to a few MB)
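
A minimal sketch of the idea in item 4, assuming a fixed prefix size and sniffers that accept an iterator of lines. `PREFIX_SIZE`, `read_prefix`, `run_sniffers`, `sniff_fasta` and `sniff_fastq` are hypothetical names for illustration, not Galaxy's actual API:

```python
import itertools

PREFIX_SIZE = 64 * 1024  # hypothetical cap, not an existing Galaxy setting


def read_prefix(path, size=PREFIX_SIZE):
    """Open the file once and read at most `size` bytes from its start."""
    with open(path, "rb") as handle:
        return handle.read(size)


def sniff_fasta(lines):
    """Hypothetical sniffer: the first line starts with '>'."""
    return next(lines, b"").startswith(b">")


def sniff_fastq(lines):
    """Hypothetical sniffer: the first line starts with '@'."""
    return next(lines, b"").startswith(b"@")


def run_sniffers(prefix, sniffers):
    """Run (name, function) sniffers over independent views of one prefix.

    The prefix is already bounded, so a binary file without newlines
    yields at most one "line" of `len(prefix)` bytes.
    """
    lines = iter(prefix.splitlines())
    # tee() gives every sniffer its own iterator, so one sniffer
    # consuming lines does not affect what the next one sees.
    copies = itertools.tee(lines, len(sniffers))
    for (name, sniff), line_iter in zip(sniffers, copies):
        if sniff(line_iter):
            return name
    return None
```

Usage would look like `run_sniffers(read_prefix(path), [("fasta", sniff_fasta), ("fastq", sniff_fastq)])`. Since the prefix is held in memory anyway, handing each sniffer the raw bytes would work just as well; `tee()` is shown only because it is the approach suggested above.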

A few potential solutions:

  1. Don't run sniffers when a datatype is explicitly selected.
  2. Modify all sniffers (and ensure all future sniffers) to read files only in chunks of fixed size, instead of by line (this should probably be done anyway due to #3796).
  3. Default to sniffers being "binary unaware" and only allow sniffers that assert themselves as "binary aware" to sniff binary files. Only run sniffers of datatypes subclassed from the Binary datatype on binary files.
  4. ???
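
Solutions 1 to 3 could be combined roughly as follows. This is only a sketch under stated assumptions: the `Sniffer` base class, its `binary_aware` flag, `sniff_prefix`, `looks_binary` and `guess_ext` are hypothetical names, and the NUL-byte check is a crude stand-in for whatever binary detection would actually be used:

```python
def looks_binary(prefix):
    """Crude heuristic (an assumption): a NUL byte suggests binary data."""
    return b"\x00" in prefix


class Sniffer:
    """Hypothetical base class; sniffers are binary unaware by default."""

    ext = None
    binary_aware = False

    def sniff_prefix(self, prefix):
        raise NotImplementedError


class FastaSniffer(Sniffer):
    ext = "fasta"

    def sniff_prefix(self, prefix):
        return prefix.lstrip().startswith(b">")


class GzipSniffer(Sniffer):
    ext = "gz"
    binary_aware = True  # explicitly asserts it can handle binary input

    def sniff_prefix(self, prefix):
        # gzip streams start with the magic bytes 0x1f 0x8b
        return prefix.startswith(b"\x1f\x8b")


def guess_ext(prefix, sniffers, requested_ext=None):
    """Pick an extension from a fixed-size prefix of the upload."""
    # Solution 1: an explicitly selected datatype skips sniffing entirely.
    if requested_ext is not None:
        return requested_ext
    is_binary = looks_binary(prefix)
    for sniffer in sniffers:
        # Solution 3: binary-unaware sniffers never see binary data.
        if is_binary and not sniffer.binary_aware:
            continue
        # Solution 2: sniffers only ever see the bounded prefix.
        if sniffer.sniff_prefix(prefix):
            return sniffer.ext
    return None
```

Here `prefix` is assumed to be the first fixed-size chunk of the upload (for example, the result of a single `read(size)` call), so no sniffer can read more than that amount regardless of where newlines fall.
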
Labels: area/performance, area/upload, kind/bug, major

All 6 comments

Milestoned to 18.05 so it does not hold up the release but fixes should be backported to 18.01.

Not duplicates, but related, thanks.

Agreed, if someone is going to spend some time on this, good to look at these other issues as well.

I think we should move as much as possible of this out of Galaxy into its own lib that we can test separately and other projects can make use of the sniffers. Some general scientific sniffer library, if possible.

A fair amount of galaxy.datatypes is already fairly independent; I'd guess the most Galaxy-specific aspect of it is the registry (datatypes_conf.xml).
