Environment
Description
These two tests encode strings as utf-16 and compare the result against a hardcoded bytestring. However, utf-16 means something different depending on the system endianness. The tests should accept either byte ordering, or explicitly use utf-16le (though this will remove the BOM).
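To make the dependence on endianness concrete, here is a minimal sketch using only the standard library. It assumes CPython's behaviour that the plain 'utf-16' codec writes a BOM followed by the text in the host's native byte order; the variable names are illustrative and not taken from the pip test suite.

```python
import codecs
import sys

text = u'ab'

# 'utf-16' writes a BOM followed by the text in the machine's native byte order.
native = text.encode('utf-16')

# The two explicit variants, built so that both keep their BOM.
little = codecs.BOM_UTF16_LE + text.encode('utf-16-le')
big = codecs.BOM_UTF16_BE + text.encode('utf-16-be')

# Which of the two 'utf-16' produces depends on the host.
expected = little if sys.byteorder == 'little' else big
assert native == expected

# Both decode to the same text, because the BOM announces the byte order.
assert little.decode('utf-16') == big.decode('utf-16') == text
```

On a little-endian machine `native` equals `little`, which is the value the tests hardcode; on a big-endian machine it equals `big`, which is what the failure output shows.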
Expected behavior
The tests pass.
How to Reproduce
Run pytest on a big-endian system.
Output
========================================================================== FAILURES ===========================================================================
______________________________________________________________ test_str_to_display__decode_error ______________________________________________________________
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x3fff895bd828>, caplog = <_pytest.logging.LogCaptureFixture object at 0x3fff895b3fd0>
def test_str_to_display__decode_error(monkeypatch, caplog):
monkeypatch.setattr(locale, 'getpreferredencoding', lambda: 'utf-8')
# Encode with an incompatible encoding.
data = u'ab'.encode('utf-16')
actual = str_to_display(data)
> assert actual == u'\\xff\\xfea\x00b\x00', (
# Show the encoding for easier troubleshooting.
'encoding: {!r}'.format(locale.getpreferredencoding())
)
E AssertionError: encoding: 'utf-8'
E assert '\\xfe\\xff\x00a\x00b' == '\\xff\\xfea\x00b\x00'
E - \xfe\xff^@a^@b
E + \xff\xfea^@b^@
tests/unit/test_compat.py:96: AssertionError
-------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------
WARNING: Bytes object does not appear to be encoded as utf-8
---------------------------------------------------------------------- Captured log call ----------------------------------------------------------------------
WARNING pip._internal.utils.compat:compat.py:127 Bytes object does not appear to be encoded as utf-8
test_path_to_display[\xfe\xff\x00/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f-utf-8-b'\\xff\\xfe/\\x00p\\x00a\\x00t\\x00h\\x00/\\x00d\\x00\\xe9\\x00f\\x00'] _
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x3fff88f3ff60>, path = b'\xfe\xff\x00/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f'
fs_encoding = 'utf-8', expected = "b'\\xff\\xfe/\\x00p\\x00a\\x00t\\x00h\\x00/\\x00d\\x00\\xe9\\x00f\\x00'"
@pytest.mark.parametrize('path, fs_encoding, expected', [
(None, None, None),
# Test passing a text (unicode) string.
(u'/path/déf', None, u'/path/déf'),
# Test a bytes object with a non-ascii character.
(u'/path/déf'.encode('utf-8'), 'utf-8', u'/path/déf'),
# Test a bytes object with a character that can't be decoded.
(u'/path/déf'.encode('utf-8'), 'ascii', u"b'/path/d\\xc3\\xa9f'"),
(u'/path/déf'.encode('utf-16'), 'utf-8',
u"b'\\xff\\xfe/\\x00p\\x00a\\x00t\\x00h\\x00/"
"\\x00d\\x00\\xe9\\x00f\\x00'"),
])
def test_path_to_display(monkeypatch, path, fs_encoding, expected):
monkeypatch.setattr(sys, 'getfilesystemencoding', lambda: fs_encoding)
actual = path_to_display(path)
> assert actual == expected, 'actual: {!r}'.format(actual)
E AssertionError: actual: "b'\\xfe\\xff\\x00/\\x00p\\x00a\\x00t\\x00h\\x00/\\x00d\\x00\\xe9\\x00f'"
E assert "b'\\xfe\\xff...0\\xe9\\x00f'" == "b'\\xff\\xfe/...9\\x00f\\x00'"
E - b'\xfe\xff\x00/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f'
E ? ---- ^^
E + b'\xff\xfe/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f\x00'
E ? ^^ ++++
tests/unit/test_utils.py:393: AssertionError
=================================================================== short test summary info ===================================================================
SKIPPED [1] tests/functional/test_configuration.py:19: Can't modify underlying file for any mode
SKIPPED [1] tests/functional/test_install.py:572: Python 2 only
SKIPPED [1] tests/functional/test_requests.py:4: <Skipped instance>
SKIPPED [1] tests/unit/test_collector.py:194: condition: sys.platform != 'win32'
SKIPPED [3] tests/unit/test_req.py:686: Test only available on Windows
SKIPPED [1] tests/unit/test_urls.py:27: condition: sys.platform != 'win32'
SKIPPED [1] tests/unit/test_urls.py:59: condition: sys.platform != 'win32'
SKIPPED [2] tests/unit/test_utils_subprocess.py:105: condition: sys.version_info >= (3,)
XFAIL tests/functional/test_install_reqs.py::test_install_distribution_union_conflicting_extras
XFAIL tests/functional/test_yaml.py::test_yaml_based[install/extras-2]
XFAIL tests/functional/test_yaml.py::test_yaml_based[install/conflicting_triangle]
XFAIL tests/functional/test_yaml.py::test_yaml_based[install/conflicting_diamond]
FAILED tests/unit/test_compat.py::test_str_to_display__decode_error - AssertionError: encoding: 'utf-8'
FAILED tests/unit/test_utils.py::test_path_to_display[\xfe\xff\x00/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f-utf-8-b'\\xff\\xfe/\\x00p\\x00a\\x00t\\x00h\\x00/\\x00d\\x00\\xe9\\x00f\\x00']
================================ 2 failed, 1389 passed, 11 skipped, 249 deselected, 4 xfailed, 32 warnings in 1739.17 seconds =================================
@smaeul, regarding the options you proposed to resolve this issue: can you explain what you mean when you say that the tests should accept either byte ordering? Would the byte ordering depend on the platform on which the test runs?
I am aiming to resolve this issue, so any references would also help.
@gutsytechster What I mean is that there are two possible utf-16 encodings of any given string, and both are valid. For the second test:
>>> b'\xfe\xff\x00/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f'.decode('utf-16') == \
... b'\xff\xfe/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f\x00'.decode('utf-16')
True
So in the first test, assert actual == u'\\xff\\xfea\x00b\x00' is wrong, because data = u'ab'.encode('utf-16') could return the other, equally valid, encoding. The assertion should accept either value (utf16le+bom or utf16be+bom), explicitly use utf16le or utf16be (although that will omit the BOM), or ideally not hardcode the data at all (i.e. assert actual == data probably works here).
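One further way to get deterministic input while keeping the BOM is to prepend it by hand to an explicitly little-endian encoding. This is only a sketch of that idea, not the fix adopted in pip; `utf16_le_with_bom` is a hypothetical helper introduced here for illustration.

```python
import codecs


def utf16_le_with_bom(text):
    # Hypothetical helper (not part of pip): little-endian UTF-16 bytes
    # with an explicit BOM, identical on every platform.
    return codecs.BOM_UTF16_LE + text.encode('utf-16-le')


data = utf16_le_with_bom(u'/path/d\xe9f')

# The little-endian bytes that the failing test hardcodes now match
# regardless of the host's byte order.
assert data == b'\xff\xfe/\x00p\x00a\x00t\x00h\x00/\x00d\x00\xe9\x00f\x00'

# The result is still valid input for the generic 'utf-16' decoder.
assert data.decode('utf-16') == u'/path/d\xe9f'
```

With input built this way, the hardcoded expectations in the two tests would presumably stay valid on both kinds of machine, since the bytes passed to the functions under test no longer vary.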
@samuel, could you let me know the byte representation of u'/path/déf' on the big-endian system (which I assume you have)? Or rather, the output of the following statement:
ascii(u'/path/déf'.encode('utf-16'))
I've seen conversions from LE to BE, but they all seem to apply to integers. I couldn't find a working solution for characters.
@smaeul^^ soft ping!
@gutsytechster utf16 bytestrings in both endiannesses are in my comment above. But to answer your specific question:
>>> ascii(u'/path/déf'.encode('utf-16'))
"b'\\xfe\\xff\\x00/\\x00p\\x00a\\x00t\\x00h\\x00/\\x00d\\x00\\xe9\\x00f'"
Endianness here is the same principle as for integers: pairs of bytes are swapped. So '/' is encoded as either b'\0/' or b'/\0'. The U+FEFF (byte order mark, or BOM) tells you which order to interpret the bytes.
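The pair-swapping described above can be sketched in a few lines. Both `swap_byte_pairs` and `byte_order_from_bom` are hypothetical helpers written for this illustration; they are not part of pip or the standard library.

```python
import codecs


def swap_byte_pairs(data):
    # Hypothetical helper: exchange the two bytes of every 16-bit unit.
    if len(data) % 2:
        raise ValueError('expected an even number of bytes')
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]
    swapped[1::2] = data[0::2]
    return bytes(swapped)


def byte_order_from_bom(data):
    # Report which byte order a leading BOM announces, or None if absent.
    if data.startswith(codecs.BOM_UTF16_LE):
        return 'little'
    if data.startswith(codecs.BOM_UTF16_BE):
        return 'big'
    return None


little = codecs.BOM_UTF16_LE + u'/'.encode('utf-16-le')  # b'\xff\xfe/\x00'
big = swap_byte_pairs(little)                            # b'\xfe\xff\x00/'

assert byte_order_from_bom(little) == 'little'
assert byte_order_from_bom(big) == 'big'
assert little.decode('utf-16') == big.decode('utf-16') == u'/'
```

Swapping the pairs also swaps the BOM itself, which is why the swapped bytestring still decodes correctly with the generic 'utf-16' codec.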