Pip: problems with LC_ALL=C

Created on 3 Jan 2013 · 12 Comments · Source: pypa/pip

There is a pattern in pip of calling open(path, 'r').read() without an explicit encoding.

This pattern causes problems under Python 3.x with an ASCII locale: the file contents are decoded as ASCII, which fails on any non-ASCII data.
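The failure can be shown without touching the locale at all. The sketch below decodes UTF-8 bytes with the ascii codec, which is roughly what a text-mode open() does under Python 3 when the locale's preferred encoding is ASCII. The sample string is made up for illustration.

```python
# UTF-8 bytes for "café": the "é" is encoded as the two bytes 0xc3 0xa9.
raw = u"caf\u00e9".encode("utf-8")

try:
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    # Same kind of error as in the tracebacks in this report.
    message = str(exc)

print(message)

# Naming the codec explicitly works regardless of the locale.
text = raw.decode("utf-8")
```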

The first occurrence (in setup.py) is clearly wrong IMHO: the utility function is used to read pip's own index.txt and news.txt files, which are encoded in UTF-8. It may cause the following exception:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 871: ordinal not in range(128)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 16, in <module>
      File "/var/folders/_5/cbsg50991szfp1r9nwxpx8580000gq/T/pip-61p_z7-build/setup.py", line 31, in <module>
        "\n\n" + read("docs", "news.txt"))
      File "/var/folders/_5/cbsg50991szfp1r9nwxpx8580000gq/T/pip-61p_z7-build/setup.py", line 9, in read
        return codecs.open(os.path.join(os.path.abspath(os.path.dirname(__file__)), *parts), 'r').read()
      File "/Users/kmike/svn/pip/.tox/py32-ascii/lib/python3.2/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 871: ordinal not in range(128)

if the following is added to pip's own tox.ini:

[testenv:py32-ascii]
basepython = python3.2
setenv = LC_ALL=C
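For this first case, a minimal sketch of a locale-independent read helper might look like the following. It is not the fix that was committed to pip; the base_dir parameter is added here only so the function can be run outside setup.py, and io.open is used because it accepts an encoding on both Python 2 and Python 3.

```python
import io
import os

def read(base_dir, *parts):
    """Read a text file as UTF-8, whatever the current locale is."""
    path = os.path.join(os.path.abspath(base_dir), *parts)
    with io.open(path, "r", encoding="utf-8") as fp:
        return fp.read()

# In setup.py this might be called as (hypothetical usage):
#   long_description = read(os.path.dirname(__file__), "docs", "news.txt")
```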

The second is trickier, and I didn't debug it. It causes the following exception:

Unpacking /Users/kmike/svn/DAWG/.tox/dist/DAWG-0.5.3.zip
  Running setup.py egg_info for package from file:///Users/kmike/svn/DAWG/.tox/dist/DAWG-0.5.3.zip

Exception:
Traceback (most recent call last):
  File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/basecommand.py", line 107, in main
    status = self.run(options, args)
  File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/commands/install.py", line 256, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/req.py", line 1042, in prepare_files
    req_to_install.run_egg_info()
  File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/req.py", line 241, in run_egg_info
    "%(Name)s==%(Version)s" % self.pkg_info())
  File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/req.py", line 334, in pkg_info
    data = self.egg_info_data('PKG-INFO')
  File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/req.py", line 274, in egg_info_data
    data = fp.read()
  File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2130: ordinal not in range(128)

in the test suite of https://github.com/kmike/DAWG (https://github.com/kmike/DAWG/blob/master/tox.ini).
The DAWG package has a non-ASCII README.rst (which is loaded into long_description: bytes under Python 2.x and unicode under Python 3.x).

Under Python 2.x this works fine because req.py doesn't try to decode the data.
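One way to make the Python 3 path independent of the locale is to read PKG-INFO as bytes and decode it explicitly. The sketch below assumes the file is UTF-8 and parses its RFC 822 style headers with the standard email parser. read_pkg_info is a hypothetical name, not pip's actual code.

```python
import io
from email.parser import Parser

def read_pkg_info(path):
    """Parse PKG-INFO after decoding it as UTF-8, ignoring the locale."""
    # Binary mode: no implicit decoding with the locale's encoding.
    with io.open(path, "rb") as fp:
        text = fp.read().decode("utf-8")
    return Parser().parsestr(text)
```

The Name and Version fields used in the traceback above could then be read as msg["Name"] and msg["Version"] from the returned message object.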

encoding bug

All 12 comments

It seems the first point, the issue in setup.py, was solved by @qwcode. But the second point, req.py and fp.read(), still seems to be buggy.

For the record: I gave up on this :) I don't know how to make reliably installable packages with non-ascii metadata under Python 2.x.

As for the first point, I think the committed solution is fragile (because non-ASCII characters could accidentally be introduced again), and it is better to explicitly decode news.txt as ASCII to prevent such errors in the future.

Also, I think pip should have a Travis/tox environment with LC_ALL=C to test against this.
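A related guard, sketched here under the assumption that the file is meant to stay ASCII-only, is a small check that reports where non-ASCII bytes appear. non_ascii_positions is a hypothetical helper and is not part of pip.

```python
import io

def non_ascii_positions(path):
    """Return (line_number, column) pairs for every non-ASCII byte in a file."""
    positions = []
    with io.open(path, "rb") as fp:
        for line_number, line in enumerate(fp, 1):
            # bytearray yields integers on both Python 2 and Python 3.
            for column, byte in enumerate(bytearray(line), 1):
                if byte > 127:
                    positions.append((line_number, column))
    return positions
```

A test could assert that the returned list is empty for docs/news.txt, so a stray non-ASCII character fails on every machine rather than only under an ASCII locale.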

Same here:

$ LANG="POSIX" pip install -e .
Obtaining file:///home/ielectric/dev/pyramid_jinja2
  Running setup.py egg_info for package from file:///home/ielectric/dev/pyramid_jinja2

Cleaning up...
Exception:
Traceback (most recent call last):
  File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/basecommand.py", line 134, in main
    status = self.run(options, args)
  File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/commands/install.py", line 236, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/req.py", line 1047, in prepare_files
    req_to_install.run_egg_info()
  File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/req.py", line 262, in run_egg_info
    "%(Name)s==%(Version)s" % self.pkg_info())
  File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/req.py", line 355, in pkg_info
    data = self.egg_info_data('PKG-INFO')
  File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/req.py", line 295, in egg_info_data
    data = fp.read()
  File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 1087: ordinal not in range(128)

cc @pauleveritt

Indeed, this was the source of my problem. In one shell I had this in my environment:

LANG=en_US.UTF-8

...and pip worked. In another shell I didn't, and got the UnicodeDecodeError exception.
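To compare two shells like this, one can ask a child interpreter which encoding it would use for text files under a given environment. This is only a sketch: preferred_encoding is a hypothetical helper, and the result depends on the platform and Python version. On Python 3.7 and later, for example, the C locale can switch on UTF-8 mode, so the error may not reproduce there.

```python
import os
import subprocess
import sys

def preferred_encoding(**overrides):
    """Return the preferred encoding reported by a child interpreter."""
    env = dict(os.environ)
    # Drop locale variables so only the overrides decide the locale.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        env.pop(name, None)
    env.update(overrides)
    code = "import locale; print(locale.getpreferredencoding(False))"
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return out.decode("ascii").strip()

print(preferred_encoding(LC_ALL="C"))
print(preferred_encoding(LANG="en_US.UTF-8"))
```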

Hi there,

This issue still seems to be open; the way I see it, there seem to be three options:

  • Consider that all setup.py files are UTF-8, expect users to use a UTF-8 locale, and don't change anything
  • Consider that all setup.py files are UTF-8, and add explicit codecs.open(__file__, 'r', 'utf-8') instead of open(__file__, 'r') in pip/req.py
  • Prepare for non-UTF-8, non-ascii setup.py, and emulate Python's handling of coding: utf-8 & co markers.
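For the third option, Python 3's standard library already exposes the interpreter's own detection of coding markers through tokenize.open(). The sketch below shows how that could be used; read_source and the demo file are made up for illustration, and this is not what pip/req.py did.

```python
import os
import tempfile
import tokenize

def read_source(path):
    """Read Python source using its declared encoding (UTF-8 if none is declared)."""
    # tokenize.open() detects a PEP 263 coding cookie or a UTF-8 BOM.
    with tokenize.open(path) as fp:
        return fp.read()

# Demo: a Latin-1 encoded file that declares its encoding on the first line.
fd, demo_path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "wb") as fp:
    fp.write(b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n")

source = read_source(demo_path)
os.remove(demo_path)
```

The second option corresponds to passing an explicit 'utf-8' to the open call, which is simpler but fails for files that declare another encoding.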

It seems that fixing the manually written micro-scripts around lines 600 and 285, as well as egg_info_data at line 296 of req.py, is enough to install packages with a UTF-8 setup.py and metadata under LC_ALL=C.

I'd like to contribute a patch to fix this issue; which option would you prefer?

Is this still a thing?

@pradyunsg I haven't faced this recently, though I saw it before. It might not exist anymore. But in the face of ambiguity, refuse the temptation to guess.

Can anyone tell me how to reproduce this (if it's still a bug)? The initial comment seems to refer to an older piece of code, and I'm really not able to understand much of it.

Though if it's not present, we can then close it. :)

I believe this part of the code base has been removed. pip now uses distlib to read legacy metadata (egg-info), which always uses UTF-8.

In that case, let's close this! :)
