There is a pattern of using open(path, 'r').read() without an explicit encoding in pip.
This pattern causes issues under Python 3.x with an ASCII locale, because the file contents are decoded as ASCII in that case, which fails for non-ASCII data.
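To illustrate the failure mode, here is a minimal sketch (Python 3 only). The helper name read_text and the sample file are made up for the example and are not pip code; forcing encoding="ascii" stands in for what an ASCII locale would do implicitly.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the locale default."""
    with open(path, "r", encoding=encoding) as fp:
        return fp.read()


# Write a small UTF-8 file containing one non-ASCII character.
# In UTF-8, "\u00e9" starts with byte 0xc3, the byte named in the tracebacks.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as fp:
    fp.write("caf\u00e9\n".encode("utf-8"))

# Decoding as ASCII reproduces the error seen under an ASCII locale.
try:
    read_text(sample_path, encoding="ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding explicitly as UTF-8 works regardless of the locale.
utf8_text = read_text(sample_path)
os.remove(sample_path)
```

Passing the encoding explicitly makes the result independent of LC_ALL and LANG, which is the point of the suggestion above.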
The first occurrence (in setup.py) is clearly wrong IMHO: the utility function is used for reading pip's own index.txt and news.txt files, which are encoded in UTF-8. It may cause the following exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 871: ordinal not in range(128)
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 16, in <module>
File "/var/folders/_5/cbsg50991szfp1r9nwxpx8580000gq/T/pip-61p_z7-build/setup.py", line 31, in <module>
"\n\n" + read("docs", "news.txt"))
File "/var/folders/_5/cbsg50991szfp1r9nwxpx8580000gq/T/pip-61p_z7-build/setup.py", line 9, in read
return codecs.open(os.path.join(os.path.abspath(os.path.dirname(__file__)), *parts), 'r').read()
File "/Users/kmike/svn/pip/.tox/py32-ascii/lib/python3.2/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 871: ordinal not in range(128)
if the following is added to pip's own tox.ini:
[testenv:py32-ascii]
basepython = python3.2
setenv = LC_ALL=C
The second is trickier and I didn't debug it. It causes the following exception:
Unpacking /Users/kmike/svn/DAWG/.tox/dist/DAWG-0.5.3.zip
Running setup.py egg_info for package from file:///Users/kmike/svn/DAWG/.tox/dist/DAWG-0.5.3.zip
Exception:
Traceback (most recent call last):
File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/basecommand.py", line 107, in main
status = self.run(options, args)
File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/commands/install.py", line 256, in run
requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/req.py", line 1042, in prepare_files
req_to_install.run_egg_info()
File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/req.py", line 241, in run_egg_info
"%(Name)s==%(Version)s" % self.pkg_info())
File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/req.py", line 334, in pkg_info
data = self.egg_info_data('PKG-INFO')
File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/site-packages/pip-1.2.1-py3.2.egg/pip/req.py", line 274, in egg_info_data
data = fp.read()
File "/Users/kmike/svn/DAWG/.tox/py32-locale/lib/python3.2/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2130: ordinal not in range(128)
in the https://github.com/kmike/DAWG test suite (https://github.com/kmike/DAWG/blob/master/tox.ini).
The DAWG package has a non-ASCII README.rst (which is loaded into long_description, as binary under Python 2.x and as unicode under Python 3.x).
Under Python 2.x this works fine because req.py doesn't try to decode the data.
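One way to avoid the locale dependency when reading PKG-INFO would be to read bytes and decode explicitly. This is only a sketch under the assumption that the metadata is UTF-8; read_metadata is a hypothetical helper, not the actual egg_info_data implementation in req.py.

```python
import os
import tempfile


def read_metadata(path):
    """Hypothetical helper: read a metadata file as bytes, then decode as UTF-8."""
    with open(path, "rb") as fp:
        raw = fp.read()
    # errors="replace" substitutes U+FFFD for invalid bytes instead of raising.
    return raw.decode("utf-8", errors="replace")


# Stand-in for a PKG-INFO file with a non-ASCII description.
fd, pkg_info_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fp:
    fp.write("Name: DAWG\nSummary: d\u00e9mo\n".encode("utf-8"))

metadata = read_metadata(pkg_info_path)
os.remove(pkg_info_path)
```

Whether to use errors="replace" or let the error propagate is a design choice; replacing keeps installation going at the cost of possibly mangled metadata.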
It seems the first point, the issue in setup.py, has been solved by @qwcode. But the second point, fp.read() in req.py, still seems to be buggy.
For the record: I gave up on this :) I don't know how to make reliably installable packages with non-ascii metadata under Python 2.x.
As for the first point, I think the committed solution is fragile (because non-ASCII chars could accidentally be introduced again), and it is better to explicitly decode news.txt from ASCII to prevent such errors in the future.
Also, I think pip should have a Travis/tox environment with LC_ALL=C to test against this.
Same here:
$ LANG="POSIX" pip install -e .
Obtaining file:///home/ielectric/dev/pyramid_jinja2
Running setup.py egg_info for package from file:///home/ielectric/dev/pyramid_jinja2
Cleaning up...
Exception:
Traceback (most recent call last):
File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/basecommand.py", line 134, in main
status = self.run(options, args)
File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/commands/install.py", line 236, in run
requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/req.py", line 1047, in prepare_files
req_to_install.run_egg_info()
File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/req.py", line 262, in run_egg_info
"%(Name)s==%(Version)s" % self.pkg_info())
File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/req.py", line 355, in pkg_info
data = self.egg_info_data('PKG-INFO')
File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/site-packages/pip/req.py", line 295, in egg_info_data
data = fp.read()
File "/home/ielectric/dev/pyramid_jinja2/lib/python3.2/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 1087: ordinal not in range(128)
cc @pauleveritt
Indeed, this was the source of my problem. In one shell I had this in my environment:
LANG=en_US.UTF-8
...and pip worked. In another shell I didn't, and got the UnicodeDecodeError exception.
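A quick way to check what a given shell will do is to ask Python which encoding it picks by default. This is a small sketch; default_text_encoding is a name invented for the example, and newer Python versions may report UTF-8 even under LC_ALL=C, so the output is not shown.

```python
import locale


def default_text_encoding():
    """Return the locale-derived encoding used for text files by default."""
    # False means: do not call setlocale(), just report the current preference.
    return locale.getpreferredencoding(False)


# Run once with LANG=en_US.UTF-8 and once with LC_ALL=C to compare.
print(default_text_encoding())
```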
Hi there,
This issue still seems to be open; the way I see it, these are the options:
- setup.py files are UTF-8: expect users to use a UTF-8 locale, and don't change anything
- setup.py files are UTF-8: add an explicit codecs.open(__file__, 'r', 'utf-8') instead of open(__file__, 'r') in pip/req.py
- read setup.py, and emulate Python's handling of coding: utf-8 & co markers.

It seems that fixing the manually written micro-scripts around lines 600 and 285, as well as egg_info_data at line 296 of req.py, is enough to install packages with UTF-8 setup.py and metadata with LC_ALL=C.
I'd like to contribute a patch to fix this issue; which option would you prefer?
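For the option of emulating Python's coding markers, the Python 3 standard library already exposes the detection logic in the tokenize module. A rough sketch follows; read_python_source and the sample script are hypothetical and not part of pip.

```python
import os
import tempfile
import tokenize


def read_python_source(path):
    """Hypothetical helper: decode a Python file the way the interpreter would."""
    # First pass: inspect the BOM / coding cookie on the raw bytes.
    with open(path, "rb") as fp:
        encoding, _ = tokenize.detect_encoding(fp.readline)
    # Second pass: tokenize.open() applies the detected encoding itself.
    with tokenize.open(path) as fp:
        return encoding, fp.read()


# A stand-in setup.py that declares latin-1 and contains a non-ASCII name.
fd, script_path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "wb") as fp:
    fp.write("# -*- coding: latin-1 -*-\nauthor = 'Jos\u00e9'\n".encode("latin-1"))

detected, source = read_python_source(script_path)
os.remove(script_path)
```

Files without a marker fall back to UTF-8 in detect_encoding, so this would also cover the "assume UTF-8" options for unmarked files.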
Is this still a thing?
@pradyunsg I haven't faced this recently, though I saw it before. It might not exist anymore. But in the face of ambiguity, refuse the temptation to guess.
Can anyone tell me how to reproduce this (if it's still a bug)? The initial comment seems to refer to an older piece of code, and I'm really not able to understand much of it.
Though if it's not present, we can then close it. :)
I believe this part of the code base has been removed. pip now uses distlib to read legacy metadata (egg-info), which always uses UTF-8.
In that case, let's close this! :)