Galaxy: sniffer problem in 19.05?

Created on 14 May 2019 · 17Comments · Source: galaxyproject/galaxy

From internal bug list:

> Attempt to upload GTF from pseudomonas.com (genome resource) failed with
> the error above. Not sure how to proceed:
> Fatal error: Exit code 1 ()
> Traceback (most recent call last):
>   File "/cvmfs/main.galaxyproject.org/galaxy/tools/data_source/upload.py",
> line 326, in <module>
>     __main__()
>   File "/cvmfs/main.galaxyproject.org/galaxy/tools/data_source/upload.py",
> line 319, in __main__
>     metadata.append(add_file(dataset, registry, output_path))
>   File "/cvmfs/main.galaxyproject.org/galaxy/tools/data_source/upload.py",
> line 128, in add_file
>     convert_spaces_to_tabs=dataset.space_to_tab,
>   File "/cvmfs/
> main.galaxyproject.org/galaxy/lib/galaxy/datatypes/upload_util.py", line
> 54, in handle_upload
>     convert_spaces_to_tabs=convert_spaces_to_tabs,
>   File "/cvmfs/main.galaxyproject.org/galaxy/lib/galaxy/datatypes/sniff.py",
> line 770, in handle_uploaded_dataset_file_internal
>     line_count, _converted_path = convert_fxn(converted_path,
> in_place=in_place, tmp_dir=tmp_dir, tmp_prefix=tmp_prefix)
>   File "/cvmfs/main.galaxyproject.org/galaxy/lib/galaxy/datatypes/sniff.py",
> line 122, in convert_newlines
>     for i, line in enumerate(io.open(fname, mode="U", encoding='utf-8')):
>   File "/cvmfs/main.galaxyproject.org/venv/lib/python2.7/codecs.py", line
> 314, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 690:
> invalid start byte

kinbug

Source

jmchilton

All 17 comments

The problem is that we do the newline / space to tab conversion for text datatypes, but that fails with non-utf8 characters. So we could try reading this in rb mode instead, but there's a good chance downstream tools would fail. Automatic recoding isn't perfect either. Any ideas here @nsoranzo ?

mvdbeek on 14 May 2019

xref https://github.com/galaxyproject/galaxy/issues/5250 / https://github.com/galaxyproject/galaxy/issues/7412#issuecomment-469631452

mvdbeek on 14 May 2019

Given the timing I assumed it was a new issue, sorry for not searching. Hmm...

jmchilton on 14 May 2019

I think this is the file and I think it is readily reproducible on test.galaxyproject.org.

http://www.pseudomonas.com/downloads/pseudomonas/pgd_r_17_2/Pseudomonas_aeruginosa_PAO1_107/Pseudomonas_aeruginosa_PAO1_107_genes_and_intergenic.gtf

jmchilton on 14 May 2019

So we could try reading this in rb mode instead, but there's a good chance downstream tools would fail.

If we want to reject files that contain non-UTF encoding that is fine but we should do that explicitly and not as part of newline conversion right? I'm not sure how to figure out where the newlines are in rb mode though right... I feel like we should just give up on convert_newlines if there is a UnicodeDecodeError error.

jmchilton on 14 May 2019

I'm not sure how to figure out where the newlines are in rb mode though right.

There may be some caveats I'm not thinking about, but this seems to work (this file is in ISO-8859-1, which raises the same error):

wget https://raw.githubusercontent.com/galaxyproteomics/tools-galaxyp/master/tools/cardinal/test-data/Example_Processed.imzML
unix2dos Example_Processed.imzML

and then

with open('test', 'wb') as fp:
    for i, line in enumerate(open('Example_Processed.imzML', mode='rb')):
        fp.write(b"%s\n" % line.rstrip(b"\r\n"))

seems to work and convert the newlines properly

mvdbeek on 14 May 2019

It'll still not be utf-8, but I agree with you that failing the newline conversion shouldn't be the way we reject non utf8 files.

mvdbeek on 14 May 2019

Looks like this might be causing the following,

https://github.com/peterjc/pico_galaxy/pull/33#issuecomment-492653885

======================================================================
FAIL: ( venn_list ) > Test-1
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/peterjc/pico_galaxy/galaxy-dev/test/functional/test_toolbox.py", line 99, in test_tool
    self.do_it(tool_version=tool_version, test_index=test_index)
  File "/home/travis/build/peterjc/pico_galaxy/galaxy-dev/test/functional/test_toolbox.py", line 36, in do_it
    verify_tool(tool_id, self.galaxy_interactor, resource_parameters=resource_parameters, test_index=test_index, tool_version=tool_version, register_job_data=register_job_data)
  File "/home/travis/build/peterjc/pico_galaxy/galaxy-dev/lib/galaxy/tools/verify/interactor.py", line 779, in verify_tool
    raise e
JobOutputsError: 'utf8' codec can't decode byte 0xac in position 10: invalid start byte

In this case the first line of the PDF generated by matplotlib in the test is likely causing this, making me suspect the sniffer code:

$ hexdump -C example.pdf | head -n 1
00000000  25 50 44 46 2d 31 2e 34  0a 25 ac dc 20 ab ba 0a  |%PDF-1.4.%.. ...|

peterjc on 15 May 2019

👍1

That's another good point that we should probably do https://github.com/galaxyproject/galaxy/issues/7957#issuecomment-492377768 and not worry aobut whether this is a valid downstream format, I can PR this

mvdbeek on 15 May 2019

@mvdbeek I was working on https://github.com/jmchilton/galaxy/commit/f8fb89006b3471d457c89e41f4503f449d33e90e - I think it will conflict badly. Any chance you can start from there, I understand if you'd rather just fix this bug though.

jmchilton on 15 May 2019

Or I can try to pour some more time into this this afternoon and see if I can integrate your suggestion. I just don't understand how that works, but I believe you that it does.

jmchilton on 15 May 2019

👍1

If you can give it a try that'd be great.

mvdbeek on 15 May 2019

I can't get rb mode to recognize carriage return as a newline at all.

>>> from six import BytesIO
>>> BytesIO("1 2\r3 4").readline()
'1 2\r3 4'
>>> for line in BytesIO("1 2\r3 4"):
...     print line.rstrip(b"\r\n")
...
3 4

I ran the unit tests on your proposed solution above and my memory-limited variant and the same thing happened both times.

We could try reading the file byte by byte instead?

jmchilton on 16 May 2019

https://github.com/jmchilton/galaxy/commit/778b1c293c51197a2db5c5c1c31bc37b43c7a340 fails the unit tests 😢

jmchilton on 16 May 2019

Fixed in https://github.com/galaxyproject/galaxy/pull/7995

mvdbeek on 18 May 2019

This should have been fixed on the dev branch, right? I'm still seeing the problem with a matplotlib generated PDF file https://github.com/peterjc/pico_galaxy/pull/33#issuecomment-493954485

Should I upload a small PDF file as a test case?

peterjc on 20 May 2019

Thanks @peterjc, this is because the test interactor assumes utf-8 encoded text data when using contains. https://github.com/galaxyproject/galaxy/pull/8010 is going to fix that

mvdbeek on 21 May 2019

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Filter Failed tool in workflow coerces wrong noodle type when connecting collection data

mtekman · 3Comments

Problems with Collection output created via structured_like from a data input with multiple=True

blankenberg · 4Comments

Inter-Galactic transfer of histories failing with 'Export history to file' & 'Import from file'

MoHeydarian · 4Comments

Batch Mode Execution Multiple Select List

katbeaulieu · 3Comments

Enhance History functions "Operations on multiple datasets" to include Edit Attributes functions (esp reassign/redetect datatype)

jennaj · 4Comments