Pip: problems with LC_ALL=C

Created on 3 Jan 2013 · 12 Comments · Source: pypa/pip

There is a pattern in pip of using open(path, 'r').read() without an explicit encoding, for example in setup.py and in pip/req.py.

This pattern causes issues under Python 3.x with an ASCII locale: the file contents are decoded as ASCII in that case, and decoding fails for non-ASCII data.
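A minimal sketch of the failure mode, assuming a temporary file with made-up content. Newer Python versions may treat the C locale differently, so the sketch passes encoding="ascii" explicitly as a stand-in for what open() picked implicitly under LC_ALL=C on Python 3.2. The helper names are hypothetical and not part of pip.

```python
import io
import os
import tempfile


def read_as_ascii(path):
    # Stand-in for open(path, 'r').read() under an ASCII locale.
    with io.open(path, "r", encoding="ascii") as fp:
        return fp.read()


def read_as_utf8(path):
    # Same read, but with the encoding stated explicitly.
    with io.open(path, "r", encoding="utf-8") as fp:
        return fp.read()


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # U+00E9 is encoded in UTF-8 as the bytes 0xc3 0xa9.
        with io.open(path, "w", encoding="utf-8") as fp:
            fp.write(u"caf\u00e9\n")
        try:
            read_as_ascii(path)
            ascii_failed = False
        except UnicodeDecodeError:
            ascii_failed = True
        return ascii_failed, read_as_utf8(path)
    finally:
        os.remove(path)
```

The ASCII read raises UnicodeDecodeError on the 0xc3 byte, the same byte value the tracebacks in this report complain about, while the explicit UTF-8 read succeeds regardless of locale.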

The first occurrence (in setup.py) is clearly wrong IMHO: the utility function is used for reading pip's own index.txt and news.txt files, which are encoded in UTF-8. It may cause the following exception:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 871: ordinal not in range(128)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 16, in <module>
      File "/var/folders/_5/cbsg50991szfp1r9nwxpx8580000gq/T/pip-61p_z7-build/setup.py", line 31, in <module>
        "\n\n" + read("docs", "news.txt"))
      File "/var/folders/_5/cbsg50991szfp1r9nwxpx8580000gq/T/pip-61p_z7-build/setup.py", line 9, in read
        return codecs.open(os.path.join(os.path.abspath(os.path.dirname(__file__)), *parts), 'r').read()
      File "/Users/kmike/svn/pip/.tox/py32-ascii/lib/python3.2/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 871: ordinal not in range(128)

if the following is added to pip's own tox.ini:

[testenv:py32-ascii]
basepython = python3.2
setenv = LC_ALL=C

The second is trickier and I didn't debug it. It causes the following exception:

Unpacking /Users/kmike/svn/DAWG/.tox/dist/DAWG-0.5.3.zip
  Running setup.py egg_info for package from file:///Users/kmike/svn/DAWG/.tox/dist/DAWG-0.5.3.zip

Exception:
Traceback (most recent call last):
  File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/basecommand.py", line 107, in main
    status = self.run(options, args)
  File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/commands/install.py", line 256, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/req.py", line 1042, in prepare_files
    req_to_install.run_egg_info()
  File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/req.py", line 241, in run_egg_info
    "%(Name)s==%(Version)s" % self.pkg_info())
  File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/req.py", line 334, in pkg_info
    data = self.egg_info_data('PKG-INFO')
  File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/req.py", line 274, in egg_info_data
    data = fp.read()
  File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2130: ordinal not in range(128)

in the test suite of https://github.com/kmike/DAWG (https://github.com/kmike/DAWG/blob/master/tox.ini).
The DAWG package has a non-ASCII README.rst, which is loaded into long_description (a byte string under Python 2.x, unicode under Python 3.x).

Under Python 2.x this works fine because req.py doesn't try to decode the data.
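To illustrate the difference, here is a sketch that reads metadata as raw bytes (which never consults the locale) and then decodes explicitly. It assumes the metadata is UTF-8; read_pkg_info is a hypothetical helper, not pip's egg_info_data, and the sample PKG-INFO content is made up.

```python
import io
import os
import tempfile


def read_pkg_info(path, encoding="utf-8"):
    # Hypothetical helper: binary mode returns bytes without any
    # locale-dependent decoding; the decode step is then explicit.
    with io.open(path, "rb") as fp:
        raw = fp.read()
    return raw.decode(encoding)


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Made-up metadata with one non-ASCII character (U+00EF).
        sample = u"Name: DAWG\nSummary: na\u00efve example\n"
        with io.open(path, "wb") as fp:
            fp.write(sample.encode("utf-8"))
        return read_pkg_info(path)
    finally:
        os.remove(path)
```

The point of the design is that the encoding becomes a visible parameter instead of something inherited from LANG or LC_ALL.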

encoding bug


kmike

Most helpful comment

I believe this part of the code base has been removed. pip now uses distlib to read legacy metadata (egg-info), which always uses UTF-8.

uranusjr on 21 Apr 2020


All 12 comments

It seems the first point, the issue in setup.py, was solved by @qwcode. But the second point, fp.read() in req.py, still seems to be buggy.

hltbra on 28 Mar 2013

For the record: I gave up on this :) I don't know how to make reliably installable packages with non-ascii metadata under Python 2.x.

As for the first point, I think the committed solution is fragile (because non-ASCII chars could accidentally be introduced again), and it is better to explicitly decode news.txt from ASCII to prevent such errors in the future.
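One way to catch reintroduced non-ASCII characters early, independent of the locale the tests run under, is a small check like the sketch below. This is an illustration of the idea rather than the change proposed here; non_ascii_lines is a hypothetical name and the sample file content is made up.

```python
import io
import os
import tempfile


def non_ascii_lines(path):
    # Return the 1-based numbers of lines containing non-ASCII bytes.
    bad = []
    with io.open(path, "rb") as fp:
        for lineno, line in enumerate(fp, 1):
            try:
                line.decode("ascii")
            except UnicodeDecodeError:
                bad.append(lineno)
    return bad


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Three lines; only the second contains a non-ASCII character.
        sample = u"plain line\nJos\u00e9 fixed a bug\nanother plain line\n"
        with io.open(path, "wb") as fp:
            fp.write(sample.encode("utf-8"))
        return non_ascii_lines(path)
    finally:
        os.remove(path)
```

A test could assert that the returned list is empty for files that are meant to stay ASCII-only.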

kmike on 28 Mar 2013

Also, I think pip should have a Travis/tox environment with LC_ALL=C to test against this.
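As a rough sketch of what such an environment changes, the snippet below starts a child interpreter with LC_ALL set and asks which encoding it would prefer by default. The function name is hypothetical, and the answer depends on the Python version and platform, so no particular value is assumed.

```python
import os
import subprocess
import sys


def preferred_encoding_under(locale_name):
    # Copy the current environment and override LC_ALL for the child.
    env = dict(os.environ)
    env["LC_ALL"] = locale_name
    code = "import locale; print(locale.getpreferredencoding(False))"
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    # Encoding names are plain ASCII, so this decode is safe.
    return out.decode("ascii").strip()


if __name__ == "__main__":
    print(preferred_encoding_under("C"))
```

On the Python 3.2 setups in this report the child would presumably report an ASCII codec; other versions and platforms may report something else.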

kmike on 28 Mar 2013

Same here:

$ LANG="POSIX" pip install -e .
Obtaining file:///home/ielectric/dev/pyramid_jinja2
  Running setup.py egg_info for package from file:///home/ielectric/dev/pyramid_jinja2

Cleaning up...
Exception:
Traceback (most recent call last):
  File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/basecommand.py", line 134, in main
    status = self.run(options, args)
  File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/commands/install.py", line 236, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/req.py", line 1047, in prepare_files
    req_to_install.run_egg_info()
  File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/req.py", line 262, in run_egg_info
    "%(Name)s==%(Version)s" % self.pkg_info())
  File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/req.py", line 355, in pkg_info
    data = self.egg_info_data('PKG-INFO')
  File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/req.py", line 295, in egg_info_data
    data = fp.read()
  File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 1087: ordinal not in range(128)

domenkozar on 9 Aug 2013

cc @pauleveritt

domenkozar on 16 Oct 2013

Indeed, this was the source of my problem. In one shell I had this in my environment:

LANG=en_US.UTF-8

...and pip worked. In another shell I didn't, and got the UnicodeDecodeError exception.

pauleveritt on 16 Oct 2013


Hi there,

This issue still seems to be open; the way I see it, there are three options:

1. Consider that all setup.py files are UTF-8, expect users to use a UTF-8 locale, and don't change anything.
2. Consider that all setup.py files are UTF-8, and use an explicit codecs.open(__file__, 'r', 'utf-8') instead of open(__file__, 'r') in pip/req.py.
3. Prepare for non-UTF-8, non-ASCII setup.py files, and emulate Python's handling of coding: utf-8 & co markers.

It seems that fixing the manually written micro-scripts around lines 600 and 285, as well as egg_info_data at line 296 of req.py, is enough to install packages with a UTF-8 setup.py and metadata under LC_ALL=C.
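For the third option listed above, the standard library already offers tokenize.open() (available since Python 3.2), which detects a BOM or a "coding:" comment and falls back to UTF-8. A minimal sketch, with a hypothetical helper name and a made-up Latin-1 source file:

```python
import io
import os
import tempfile
import tokenize


def read_python_source(path):
    # tokenize.open() honours a BOM or a "coding:" comment and
    # defaults to UTF-8, independent of the current locale.
    with tokenize.open(path) as fp:
        return fp.read()


def demo():
    fd, path = tempfile.mkstemp(suffix=".py")
    os.close(fd)
    try:
        source = u"# -*- coding: latin-1 -*-\nname = 'caf\u00e9'\n"
        # Store the file as Latin-1, matching its coding marker.
        with io.open(path, "wb") as fp:
            fp.write(source.encode("latin-1"))
        return read_python_source(path)
    finally:
        os.remove(path)
```

This only covers Python source files such as setup.py; metadata files like PKG-INFO carry no coding marker, so they would still need an agreed encoding.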

I'd like to contribute a patch to fix this issue; which option would you prefer?

rbarrois on 26 Dec 2013

Is this still a thing?

pradyunsg on 5 Nov 2017

@pradyunsg I haven't faced this recently, though I did see it before. It might not exist anymore. But in the face of ambiguity, refuse the temptation to guess.

auvipy on 7 May 2019

Can anyone tell me how to reproduce this (if it's still a bug)? The initial comment seems to refer to an older piece of code, and I'm really not able to understand much of it.

Though if it's not present, we can then close it. :)