Pip: Tests test_str_to_display and test_path_to_display are not endian-safe

Created on 28 Mar 2020 · 5 Comments · Source: pypa/pip

Environment

  • pip version: 19.3.1
  • Python version: Python 3.6.10
  • OS: Gentoo/Linux musl ppc64

Description

These two tests encode strings as utf-16, and compare the result against a hardcoded bytestring. However, utf-16 means something different depending on the system endianness. The tests should accept either byte ordering, or explicitly use utf-16le (though this will remove the BOM).
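To make the platform dependence concrete, here is a minimal sketch (not taken from the pip test suite) comparing the plain 'utf-16' codec with the explicit 'utf-16-le' and 'utf-16-be' variants. It assumes CPython, where the 'utf-16' encoder writes a BOM followed by the data in the host's native byte order; the variable names are only illustrative.

```python
import codecs
import sys

text = u'ab'

# Plain 'utf-16' prepends a BOM and then uses the host's native byte order,
# so the resulting bytes differ between little- and big-endian machines.
native = text.encode('utf-16')

# The explicit variants give the same bytes everywhere, but emit no BOM.
little = text.encode('utf-16-le')  # b'a\x00b\x00'
big = text.encode('utf-16-be')     # b'\x00a\x00b'

# Rebuild what 'utf-16' should have produced on this particular machine.
if sys.byteorder == 'little':
    expected_native = codecs.BOM_UTF16_LE + little
else:
    expected_native = codecs.BOM_UTF16_BE + big

print(sys.byteorder, native == expected_native)
```

If the BOM matters to the test, it can be prepended by hand (codecs.BOM_UTF16_LE or codecs.BOM_UTF16_BE) to the output of an explicit codec.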

Expected behavior

The tests pass.

How to Reproduce

  1. Run the test suite (pytest) on a big-endian system.
  2. The error shown below occurs.

Output

========================================================================== FAILURES ===========================================================================
______________________________________________________________ test_str_to_display__decode_error ______________________________________________________________

monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x3fff895bd828>, caplog = <_pytest.logging.LogCaptureFixture object at 0x3fff895b3fd0>

    def test_str_to_display__decode_error(monkeypatch, caplog):
        monkeypatch.setattr(locale, 'getpreferredencoding', lambda: 'utf-8')
        # Encode with an incompatible encoding.
        data = u'ab'.encode('utf-16')
        actual = str_to_display(data)

>       assert actual == u'\\xff\\xfea\x00b\x00', (
            # Show the encoding for easier troubleshooting.
            'encoding: {!r}'.format(locale.getpreferredencoding())
        )
E       AssertionError: encoding: 'utf-8'
E       assert '\\xfe\\xff\x00a\x00b' == '\\xff\\xfea\x00b\x00'
E         - \xfe\xff^@a^@b
E         + \xff\xfea^@b^@

tests/unit/test_compat.py:96: AssertionError
-------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------
WARNING: Bytes object does not appear to be encoded as utf-8
---------------------------------------------------------------------- Captured log call ----------------------------------------------------------------------
WARNING  pip._internal.utils.compat:compat.py:127 Bytes object does not appear to be encoded as utf-8
 test_path_to_display[\xfe\xff\x00/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f-utf-8-b'\\xff\\xfe/\\x00p\\x00a\\x00t\\x00h\\x00/\\x00d\\x00\\xe9\\x00f\\x00'] _

monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x3fff88f3ff60>, path = b'\xfe\xff\x00/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f'
fs_encoding = 'utf-8', expected = "b'\\xff\\xfe/\\x00p\\x00a\\x00t\\x00h\\x00/\\x00d\\x00\\xe9\\x00f\\x00'"

    @pytest.mark.parametrize('path, fs_encoding, expected', [
        (None, None, None),
        # Test passing a text (unicode) string.
        (u'/path/déf', None, u'/path/déf'),
        # Test a bytes object with a non-ascii character.
        (u'/path/déf'.encode('utf-8'), 'utf-8', u'/path/déf'),
        # Test a bytes object with a character that can't be decoded.
        (u'/path/déf'.encode('utf-8'), 'ascii', u"b'/path/d\\xc3\\xa9f'"),
        (u'/path/déf'.encode('utf-16'), 'utf-8',
         u"b'\\xff\\xfe/\\x00p\\x00a\\x00t\\x00h\\x00/"
         "\\x00d\\x00\\xe9\\x00f\\x00'"),
    ])
    def test_path_to_display(monkeypatch, path, fs_encoding, expected):
        monkeypatch.setattr(sys, 'getfilesystemencoding', lambda: fs_encoding)
        actual = path_to_display(path)
>       assert actual == expected, 'actual: {!r}'.format(actual)
E       AssertionError: actual: "b'\\xfe\\xff\\x00/\\x00p\\x00a\\x00t\\x00h\\x00/\\x00d\\x00\\xe9\\x00f'"
E       assert "b'\\xfe\\xff...0\\xe9\\x00f'" == "b'\\xff\\xfe/...9\\x00f\\x00'"
E         - b'\xfe\xff\x00/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f'
E         ?   ----      ^^
E         + b'\xff\xfe/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f\x00'
E         ?         ^^                                            ++++

tests/unit/test_utils.py:393: AssertionError
=================================================================== short test summary info ===================================================================
SKIPPED [1] tests/functional/test_configuration.py:19: Can't modify underlying file for any mode
SKIPPED [1] tests/functional/test_install.py:572: Python 2 only
SKIPPED [1] tests/functional/test_requests.py:4: <Skipped instance>
SKIPPED [1] tests/unit/test_collector.py:194: condition: sys.platform != 'win32'
SKIPPED [3] tests/unit/test_req.py:686: Test only available on Windows
SKIPPED [1] tests/unit/test_urls.py:27: condition: sys.platform != 'win32'
SKIPPED [1] tests/unit/test_urls.py:59: condition: sys.platform != 'win32'
SKIPPED [2] tests/unit/test_utils_subprocess.py:105: condition: sys.version_info >= (3,)
XFAIL tests/functional/test_install_reqs.py::test_install_distribution_union_conflicting_extras
XFAIL tests/functional/test_yaml.py::test_yaml_based[install/extras-2]
XFAIL tests/functional/test_yaml.py::test_yaml_based[install/conflicting_triangle]
XFAIL tests/functional/test_yaml.py::test_yaml_based[install/conflicting_diamond]
FAILED tests/unit/test_compat.py::test_str_to_display__decode_error - AssertionError: encoding: 'utf-8'
FAILED tests/unit/test_utils.py::test_path_to_display[\xfe\xff\x00/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f-utf-8-b'\\xff\\xfe/\\x00p\\x00a\\x00t\\x00h\\x00/\\x00d\\x00\\xe9\\x00f\\x00']
================================ 2 failed, 1389 passed, 11 skipped, 249 deselected, 4 xfailed, 32 warnings in 1739.17 seconds =================================

All 5 comments

@smaeul, regarding the options you proposed for resolving this issue: can you explain what you mean when you say that the tests should accept either byte ordering? Would the byte ordering depend on the platform on which the test runs?

I am aiming to resolve this issue, so any references would also help.

@gutsytechster What I mean is that there are two possible utf-16 encodings of any given string, and both are valid. For the second test:

>>> b'\xfe\xff\x00/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f'.decode('utf-16') == \
... b'\xff\xfe/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f\x00'.decode('utf-16')
True

So in the first test, assert actual == u'\\xff\\xfea\x00b\x00' is wrong, because data = u'ab'.encode('utf-16') could return the other, equally valid, encoding. The assertion should accept either value (utf16le+bom or utf16be+bom), explicitly use utf16le or utf16be (although that will omit the BOM), or ideally not hardcode the data at all (i.e. assert actual == data probably works here).
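A rough sketch of the first two options follows. Note that display_stand_in is a hypothetical stand-in written for this example, not pip's str_to_display; it only assumes that undecodable bytes end up backslash-escaped, which is what the expected strings in the failing assertion suggest. It needs Python 3.5+ for errors='backslashreplace' when decoding.

```python
import codecs


def display_stand_in(data, encoding='utf-8'):
    # Hypothetical stand-in, NOT pip's str_to_display: decode the bytes and
    # backslash-escape whatever cannot be decoded.
    return data.decode(encoding, errors='backslashreplace')


# Option 1: pin the byte order and add the BOM by hand, so the hardcoded
# expectation holds on every platform.
pinned = codecs.BOM_UTF16_LE + u'ab'.encode('utf-16-le')
actual_pinned = display_stand_in(pinned)

# Option 2: keep the native 'utf-16' codec and accept either byte order.
actual_native = display_stand_in(u'ab'.encode('utf-16'))
accepted = {
    u'\\xff\\xfea\x00b\x00',  # little-endian BOM and data
    u'\\xfe\\xff\x00a\x00b',  # big-endian BOM and data
}

print(actual_pinned == u'\\xff\\xfea\x00b\x00', actual_native in accepted)
```

Option 1 keeps a single expected value in the test; option 2 keeps the original input but widens the assertion.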

@samuel, could you let me know the byte representation of u'/path/déf' on the big-endian system (which I assume you have)? Or rather, the output of the following statement:

ascii(u'/path/déf'.encode('utf-16'))

I've seen conversions from little-endian to big-endian, but they all seem to apply to integers. I couldn't find a working solution for characters.

@smaeul^^ soft ping!

@gutsytechster utf16 bytestrings in both endiannesses are in my comment above. But to answer your specific question:

>>> ascii(u'/path/déf'.encode('utf-16'))
"b'\\xfe\\xff\\x00/\\x00p\\x00a\\x00t\\x00h\\x00/\\x00d\\x00\\xe9\\x00f'"

Endianness here follows the same principle as for integers: the two bytes of each pair are swapped. So '/' is encoded as either b'\0/' or b'/\0'. The U+FEFF character (byte order mark, or BOM) at the start tells you in which order to interpret the bytes.
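As a small illustration of the byte swapping and of the BOM's role, the sketch below builds both encodings of '/' explicitly and converts one into the other. swap_pairs is a hypothetical helper written for this example; it assumes an input of even length.

```python
import codecs

# Both BOM-prefixed encodings of '/', built with the explicit codecs.
le = codecs.BOM_UTF16_LE + u'/'.encode('utf-16-le')  # b'\xff\xfe/\x00'
be = codecs.BOM_UTF16_BE + u'/'.encode('utf-16-be')  # b'\xfe\xff\x00/'


def swap_pairs(data):
    """Swap the two bytes of every pair (hypothetical helper, even length only)."""
    swapped = bytearray(data)
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)


# Swapping every pair, BOM included, turns one byte order into the other.
print(swap_pairs(le) == be)

# The 'utf-16' decoder reads the BOM to pick the byte order and then drops
# it, so both byte strings decode to the same text.
print(le.decode('utf-16') == be.decode('utf-16') == u'/')
```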
