Galaxy: sniffer problem in 19.05?

Created on 14 May 2019  路  17Comments  路  Source: galaxyproject/galaxy

From internal bug list:

> Attempt to upload GTF from pseudomonas.com (genome resource) failed with
> the error above. Not sure how to proceed:
> Fatal error: Exit code 1 ()
> Traceback (most recent call last):
>   File "/cvmfs/main.galaxyproject.org/galaxy/tools/data_source/upload.py",
> line 326, in <module>
>     __main__()
>   File "/cvmfs/main.galaxyproject.org/galaxy/tools/data_source/upload.py",
> line 319, in __main__
>     metadata.append(add_file(dataset, registry, output_path))
>   File "/cvmfs/main.galaxyproject.org/galaxy/tools/data_source/upload.py",
> line 128, in add_file
>     convert_spaces_to_tabs=dataset.space_to_tab,
>   File "/cvmfs/
> main.galaxyproject.org/galaxy/lib/galaxy/datatypes/upload_util.py", line
> 54, in handle_upload
>     convert_spaces_to_tabs=convert_spaces_to_tabs,
>   File "/cvmfs/main.galaxyproject.org/galaxy/lib/galaxy/datatypes/sniff.py",
> line 770, in handle_uploaded_dataset_file_internal
>     line_count, _converted_path = convert_fxn(converted_path,
> in_place=in_place, tmp_dir=tmp_dir, tmp_prefix=tmp_prefix)
>   File "/cvmfs/main.galaxyproject.org/galaxy/lib/galaxy/datatypes/sniff.py",
> line 122, in convert_newlines
>     for i, line in enumerate(io.open(fname, mode="U", encoding='utf-8')):
>   File "/cvmfs/main.galaxyproject.org/venv/lib/python2.7/codecs.py", line
> 314, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 690:
> invalid start byte
kinbug

All 17 comments

The problem is that we do the newline / space to tab conversion for text datatypes, but that fails with non-utf8 characters. So we could try reading this in rb mode instead, but there's a good chance downstream tools would fail. Automatic recoding isn't perfect either. Any ideas here @nsoranzo ?

Given the timing I assumed it was a new issue, sorry for not searching. Hmm...

So we could try reading this in rb mode instead, but there's a good chance downstream tools would fail.

If we want to reject files that contain non-UTF encoding that is fine but we should do that explicitly and not as part of newline conversion right? I'm not sure how to figure out where the newlines are in rb mode though right... I feel like we should just give up on convert_newlines if there is a UnicodeDecodeError error.

I'm not sure how to figure out where the newlines are in rb mode though right.

There may be some caveats I'm not thinking about, but this seems to work (this file is in ISO-8859-1, which raises the same error):

wget https://raw.githubusercontent.com/galaxyproteomics/tools-galaxyp/master/tools/cardinal/test-data/Example_Processed.imzML
unix2dos Example_Processed.imzML

and then

with open('test', 'wb') as fp:
    for i, line in enumerate(open('Example_Processed.imzML', mode='rb')):
        fp.write(b"%s\n" % line.rstrip(b"\r\n"))

seems to work and convert the newlines properly

It'll still not be utf-8, but I agree with you that failing the newline conversion shouldn't be the way we reject non utf8 files.

Looks like this might be causing the following,

https://github.com/peterjc/pico_galaxy/pull/33#issuecomment-492653885

======================================================================
FAIL: ( venn_list ) > Test-1
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/peterjc/pico_galaxy/galaxy-dev/test/functional/test_toolbox.py", line 99, in test_tool
    self.do_it(tool_version=tool_version, test_index=test_index)
  File "/home/travis/build/peterjc/pico_galaxy/galaxy-dev/test/functional/test_toolbox.py", line 36, in do_it
    verify_tool(tool_id, self.galaxy_interactor, resource_parameters=resource_parameters, test_index=test_index, tool_version=tool_version, register_job_data=register_job_data)
  File "/home/travis/build/peterjc/pico_galaxy/galaxy-dev/lib/galaxy/tools/verify/interactor.py", line 779, in verify_tool
    raise e
JobOutputsError: 'utf8' codec can't decode byte 0xac in position 10: invalid start byte

In this case the first line of the PDF generated by matplotlib in the test is likely causing this, making me suspect the sniffer code:

$ hexdump -C example.pdf | head -n 1
00000000  25 50 44 46 2d 31 2e 34  0a 25 ac dc 20 ab ba 0a  |%PDF-1.4.%.. ...|

That's another good point that we should probably do https://github.com/galaxyproject/galaxy/issues/7957#issuecomment-492377768 and not worry aobut whether this is a valid downstream format, I can PR this

@mvdbeek I was working on https://github.com/jmchilton/galaxy/commit/f8fb89006b3471d457c89e41f4503f449d33e90e - I think it will conflict badly. Any chance you can start from there, I understand if you'd rather just fix this bug though.

Or I can try to pour some more time into this this afternoon and see if I can integrate your suggestion. I just don't understand how that works, but I believe you that it does.

If you can give it a try that'd be great.

I can't get rb mode to recognize carriage return as a newline at all.

>>> from six import BytesIO
>>> BytesIO("1 2\r3 4").readline()
'1 2\r3 4'
>>> for line in BytesIO("1 2\r3 4"):
...     print line.rstrip(b"\r\n")
...
3 4

I ran the unit tests on your proposed solution above and my memory-limited variant and the same thing happened both times.

We could try reading the file byte by byte instead?

This should have been fixed on the dev branch, right? I'm still seeing the problem with a matplotlib generated PDF file https://github.com/peterjc/pico_galaxy/pull/33#issuecomment-493954485

Should I upload a small PDF file as a test case?

Thanks @peterjc, this is because the test interactor assumes utf-8 encoded text data when using contains. https://github.com/galaxyproject/galaxy/pull/8010 is going to fix that

Was this page helpful?
0 / 5 - 0 ratings