Pip: Sorting TypeError in move_wheel_files() during install (e.g. Poetry)

Created on 9 Oct 2018  路  25Comments  路  Source: pypa/pip

Environment

  • pip version: Latest
  • Python version: 3.6.2
  • OS: Fedora 25

Description

I'm getting the exception:

TypeError: '<' not supported between instances of 'int' and 'str'

when attempting to install poetry (example dockerfile and full stacktrace below).
Looking at the code for pip, in move_wheel_files it calls sorted(outrows) which is sorting a tuple. The 3rd column for that tuple looks like it could be an int or string, which is a bug:

>>> sorted((('','',''),('','',1)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'int' and 'str'

so the code in pip thats relevant:

            outrows = []
            for row in reader:
                row[0] = installed.pop(row[0], row[0])
                if row[0] in changed:
                    row[1], row[2] = rehash(row[0])
                outrows.append(tuple(row))
            for f in generated:
                digest, length = rehash(f)
                outrows.append((normpath(f, lib_dir), digest, length))
            for f in installed:
                outrows.append((installed[f], '', ''))
            for row in sorted(outrows):
                writer.writerow(row)

as you can see, the for f in installed will always place a string in the 3rd column of the tuple, however, the paths that use rehash put length in the 3rd column which looking at the code for rehash, will always be an integer:

def rehash(path, blocksize=1 << 20):
    """Return (hash, length) for path using hashlib.sha256()"""
    h = hashlib.sha256()
    length = 0
    with open(path, 'rb') as f:
        for block in read_chunks(f, size=blocksize):
            length += len(block)
            h.update(block)
    digest = 'sha256=' + urlsafe_b64encode(
        h.digest()
    ).decode('latin1').rstrip('=')
    return (digest, length)

Expected behavior

Not error out on sorting

How to Reproduce

Example dockerfile that reproduces the issue:

FROM fedora:25

RUN dnf -y update
RUN dnf -y install python36
RUN bash -c "curl https://bootstrap.pypa.io/get-pip.py | python3.6"
RUN bash -c "curl -sSL https://raw.githubusercontent.com/sdispater/poetry/master/get-poetry.py | python3.6"

Output

$ docker build -t poetry_fail .
Sending build context to Docker daemon    554kB
Step 1/5 : FROM fedora:25
 ---> 9cffd21a45e3
Step 2/5 : RUN dnf -y update
 ---> Using cache
 ---> 9afadb458128
Step 3/5 : RUN dnf -y install python36
 ---> Using cache
 ---> 7291182779ff
Step 4/5 : RUN bash -c "curl https://bootstrap.pypa.io/get-pip.py | python3.6"
 ---> Using cache
 ---> debf1591f40a
Step 5/5 : RUN bash -c "curl -sSL https://raw.githubusercontent.com/sdispater/poetry/master/get-poetry.py | python3.6"
 ---> Running in 04bb671e7e1e
Retrieving metadata

Installing version: 0.11.5
  - Getting dependencies
  - Vendorizing dependencies
  - Installing poetry
An error has occured: Command '('/usr/bin/python3.6', '-m', 'pip', 'install', '--upgrade', '--no-deps', '/tmp/poetry-installer-t434oaug/poetry-0.11.5-py2.py3-none-any.whl')' returned non-zero exit status 2.
Processing /tmp/poetry-installer-t434oaug/poetry-0.11.5-py2.py3-none-any.whl
Installing collected packages: poetry
Exception:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 143, in main
    status = self.run(options, args)
  File "/usr/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 366, in run
    use_user_site=options.use_user_site,
  File "/usr/lib/python3.6/site-packages/pip/_internal/req/__init__.py", line 49, in install_given_reqs
    **kwargs
  File "/usr/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 760, in install
    use_user_site=use_user_site, pycompile=pycompile,
  File "/usr/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 382, in move_wheel_files
    warn_script_location=warn_script_location,
  File "/usr/lib/python3.6/site-packages/pip/_internal/wheel.py", line 514, in move_wheel_files
    for row in sorted(outrows):
TypeError: '<' not supported between instances of 'int' and 'str'

The command '/bin/sh -c bash -c "curl -sSL https://raw.githubusercontent.com/sdispater/poetry/master/get-poetry.py | python3.6"' returned a non-zero code: 2
wheel auto-locked bug

Most helpful comment

Ah, I think I know why this happens for Poetry, but not anyone else. Python sorts tuples by looking at items one by one. If a pair is sortable, it stops looking. The first two items in each record entry are file name and hash, both guarenteed to be strings. In most situations, you only have one row for each file (for obvious reasons), so the sorting never gets to use the third item (length). The bad Poetry wheel, however, contains two additional entries: poetry-0.11.5.dist-info/INSTALLER and ../../Scripts/poetry.exe. They conflict with the two dynamically generated rows by move_wheel_files()[1], having the same file names and hashes, thus triggering Python to use the third argument, causing the exception.

I think the short-term solution would be for Poetry to somehow prevent adding those conflicting rows into RECORDS. But the wheel format specification does not seem to prohibit those entries from existing, so either the PEP needs amendment, or pip needs to deal with potential row conflicts. Or maybe both should be done, since I don鈥檛 really think the ../../Scripts/poetry.exe row should be valid anyway, for security reasons (I don鈥檛 think a wheel should be able to write outside its installation root).

By the way鈥攖he bad Poetry wheel also contains a lot of pyc files (and entries for them in RECORD). It really shouldn鈥檛.

All 25 comments

Issue caused by https://github.com/pypa/pip/commit/f4bda78815bc9144255dc5ed29a89d97a54b131a -- looks like the rehash length should just be cast to str() so that there are consistent types?

having the same issue when trying to install poetry on CircleCI

cc @xavfernandez

I posted a fix for this here: https://github.com/pypa/pip/pull/5883

Thanks so much for the fix @cjerdonek. Do you know when we might be able to expect a new release with this fix merged? We are currently unable to use Poetry in our Docker builds without applying your patch inside the container as a build step, and all our Circle CI tests are broken.

@shabbyrobe Thanks for the note, and you're welcome. TBH, I don't know. @pradyunsg should know better as he did the last release. He would be in a better position to judge whether a bugfix release can go out. Here is more info about the process:
https://pip.pypa.io/en/stable/development/release-process/#creating-a-bug-fix-release

By the way, do you know what it is about Poetry that is causing that package to be affected and not others? I'm somewhat surprised we're not seeing other reports of the same issue.

Thanks heaps for getting back to me. I'm no expert on these matters but I may have been able to narrow it down a little with some grubby "printf debugging".

It seems that the size in the 3-tuple for outrows can be either an integer or a string, and when sorted decides it needs to compare on that column, it chokes. For some reason, poetry's installation process ends up with two outrows with the same shasum and very similar file names, and sorted ends up comparing on the size:

x = [
    ('../../bin/poetry', 'sha256=PFjhIj66KrOck_DzsYUZmdGTSB7Z2jFkOrJF6uEg3OU', '220'),
    ('../../../bin/poetry', 'sha256=PFjhIj66KrOck_DzsYUZmdGTSB7Z2jFkOrJF6uEg3OU', 220),
]
for row in sorted(x):
    print(row)

# BORK!

Not sure why, I'll see what else I can find.

Good digging! Hmm, I wonder if there might be a second issue lurking somewhere, either in pip, Poetry, or some other packaging component involved..

I've been lurking on this issue for a while since I also encountered the same error while installing Poetry. I believe the issue is within the Poetry library itself:

https://github.com/sdispater/poetry/blob/master/poetry/masonry/builders/wheel.py#L167

In writing the wheel zip, it string formats the tuple, which causes the output of os.stat.st_size to become stringized. If you agree that this is an error in Poetry's wheel packaging implementation, we probably should not fix this on the pip side.

Pip really shouldn't fail this far down the line with such a limited and vague exception on malformed input though.

@y4n9squared Thanks for the extra info. I do feel like some change in pip is needed. At the least, pip should be failing with a less cryptic message, for example, "Malformed RECORD line" (and then provide the path to the file and line contents). I don't know though whether pip should be liberal or strict with regard to the lines it accepts.

I do feel like some change in pip is needed. At the least, pip should be failing with a less cryptic message, for example, "Malformed RECORD line" (and then provide the path to the file and line contents).

@cjerdonek A valid point. A less cryptic error message would be helpful.

For reference purposes, here are two descriptions of the RECORD format:

  1. From PEP 376 (original): https://www.python.org/dev/peps/pep-0376/#record
  2. From PEP 427 (updated for the wheel format): https://www.python.org/dev/peps/pep-0427/#signed-wheel-files

It looks like the file is indeed malformed per these descriptions.

Thanks! With this information, I'll refer the bug back to the Poetry maintainers so we can get a fix in on that side.

I feel the anti-Postel鈥檚 principle should be applied here鈥攚hen handling an exchangeable format with potential alternative implementations, it is better to be strict than liberal. pip being the de facto standard package installer means that if it chooses to accept a format, other implementations would be forced to as well. It would be quite counterproductive for everyone if pip has too many unspecified quirks.

Oops, I think I goofed and may have been hasty in agreeing with @y4n9squared's suggestion that Poetry is writing a malformed line. If you look elsewhere in Poetry's code, you'll see that the size they use is indeed an integer. Here--

https://github.com/sdispater/poetry/blob/6065675dedd73153c59b8c16ec410595aebb83ab/poetry/masonry/builders/wheel.py#L246-L249

And here--

https://github.com/sdispater/poetry/blob/6065675dedd73153c59b8c16ec410595aebb83ab/poetry/masonry/builders/wheel.py#L266

Also, the output of the csv module's writer doesn't seem to be affected by whether something is a string or integer anyways. If you look at one of the example RECORD lines in PEP 376 above:

lib/python2.6/site-packages/docutils/__init__.py,md5=nWt-Dge1eug4iAgqLS_uWg,9544

you will see that it matches what Poetry is doing in the line that @y4n9squared linked to. So I think the problem is with pip after all? And @uranusjr's suggestion on my PR is perhaps a better way to deal with it.

ISTM the crux of the issue is that if you use the csv module to write an integer and then read it back (using the options spelled out in the PEP), you'll get a string and not an integer. (So it doesn't roundtrip.) Can someone confirm? I'm still surprised we haven't seen other reports of this issue if it really is an issue with pip.

I can confirm the behaviour of csv (it is a very annoying quirk):

>>> import csv
>>> with open('test.csv', 'w') as f:
...  w = csv.writer(f)
...  w.writerow([1, '1', 'test'])
... 
10
>>> with open('test.csv') as f:
...  r = csv.reader(f)
...  for row in r:
...   print(repr(row))
... 
['1', '1', 'test']

I traced the function a bit and it seems that a row may come from one of three sources: The to-be-installed wheel, dynamically generated, and the already installed dist-info directory. Only the last writes the length parameter as a string; the other two both write an integer. So this only happens if you try to install a wheel when a matching dist-info in the target location.

pip always uninstall the package (temporarily) when re-installing/upgrading. My guess is it never sees a dist-info in the target location, thus never hits this code path.

Oops, I think I goofed and may have been hasty in agreeing with @y4n9squared's suggestion that Poetry is writing a malformed line. If you look elsewhere in Poetry's code, you'll see that the size they use is indeed an integer. Here--

https://github.com/sdispater/poetry/blob/6065675dedd73153c59b8c16ec410595aebb83ab/poetry/masonry/builders/wheel.py#L246-L249

And here--

https://github.com/sdispater/poetry/blob/6065675dedd73153c59b8c16ec410595aebb83ab/poetry/masonry/builders/wheel.py#L266

Also, the output of the csv module's writer doesn't seem to be affected by whether something is a string or integer anyways. If you look at one of the example RECORD lines in PEP 376 above:

lib/python2.6/site-packages/docutils/__init__.py,md5=nWt-Dge1eug4iAgqLS_uWg,9544

you will see that it matches what Poetry is doing in the line that @y4n9squared linked to. So I think the problem is with pip after all? And @uranusjr's suggestion on my PR is perhaps a better way to deal with it.

ISTM the crux of the issue is that if you use the csv module to write an integer and then read it back (using the options spelled out in the PEP), you'll get a string and not an integer. (So it doesn't roundtrip.) Can someone confirm? I'm still surprised we haven't seen other reports of this issue if it really is an issue with pip.

@cjerdonek sorry about the false alarm -- I did some misleading testing. Also confirmed the stated behavior of csv as well.

So I did some additional digging and noticed that on pip 18.1, installing poetry from their provided installer causes the error, but installing directly from pip is fine. On pip 18.0, both wheel archives successfully install.

The two archive files have been attached as .good.whl (from pip) and .bad.whl (from get-poetry.py). I had to convert the file extension to zip to make Github happy.

poetry-0.11.5-py2.py3-none-any.bad.zip
poetry-0.11.5-py2.py3-none-any.good.zip

@cjerdonek reverting this commit (f4bda78815bc9144255dc5ed29a89d97a54b131a) on master seems to solve the issue, although I'm still confused as to why it doesn't cause errors on both wheel archives.

Ah, I think I know why this happens for Poetry, but not anyone else. Python sorts tuples by looking at items one by one. If a pair is sortable, it stops looking. The first two items in each record entry are file name and hash, both guarenteed to be strings. In most situations, you only have one row for each file (for obvious reasons), so the sorting never gets to use the third item (length). The bad Poetry wheel, however, contains two additional entries: poetry-0.11.5.dist-info/INSTALLER and ../../Scripts/poetry.exe. They conflict with the two dynamically generated rows by move_wheel_files()[1], having the same file names and hashes, thus triggering Python to use the third argument, causing the exception.

I think the short-term solution would be for Poetry to somehow prevent adding those conflicting rows into RECORDS. But the wheel format specification does not seem to prohibit those entries from existing, so either the PEP needs amendment, or pip needs to deal with potential row conflicts. Or maybe both should be done, since I don鈥檛 really think the ../../Scripts/poetry.exe row should be valid anyway, for security reasons (I don鈥檛 think a wheel should be able to write outside its installation root).

By the way鈥攖he bad Poetry wheel also contains a lot of pyc files (and entries for them in RECORD). It really shouldn鈥檛.

Author of Poetry here!

I would like to point out that the bad wheel is generated by the custom installer that Poetry uses (https://github.com/sdispater/poetry/blob/master/get-poetry.py) which uses a non-standard (and may I say a bad) way to build a wheel that will be used to install Poetry for the current Python version. This has since been fixed on the develop branch with the implementation of a new installer (https://github.com/sdispater/poetry/pull/378) which will be released with the next 0.12.0 version.

Note that the wheels uploaded to PyPI are not affected and are proper, standard wheels.

Sorry about the confusion.

I just merged the PR to prevent pip from crashing when sorting. Thanks for the helpful review discussion, @uranusjr and @xavfernandez, for reporting @tomalexander, and for the other helpful commenters on this issue!

Would anyone like to create a new issue to discuss whether we should be warning or erroring out going forward in the case of a path occurring twice in the RECORD rows?

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

GregBorrelly picture GregBorrelly  路  3Comments

dstufft picture dstufft  路  3Comments

cjolowicz picture cjolowicz  路  3Comments

imzi picture imzi  路  3Comments

therefromhere picture therefromhere  路  3Comments