Warehouse: Warehouse doesn't check whether uploaded packages ending in tar.gz are actually tarballs

Created on 18 Mar 2019  路  6Comments  路  Source: pypa/warehouse

Describe the bug

The function _is_valid_dist_file() function in warehouse/forklift/legacy.py fails to check that a file ending in .tar.gz is actually a taball file. This is different to a .zip, where the code does a basic check that it is indeed a valid zip file.

Expected behavior

_is_valid_dist_file() should check that if the filename ends in .tar.gz that it can open the file using `tarfile, e.g.:

# import tarfile at the top of file
# ...
elif filename.endswith(".tar.gz") :
    try:
        with tarfile.open(filename, "r:gz") as tfp:
            for member in tfp.getmembers():
                pass  # or do some other processing
            else:
                return False
    except tarfile.ReadError:
        return False
# ...

To Reproduce

See package https://pypi.org/project/wutdn
This is a file that ends in .tar.gz, but instead is just a plain text file

My Platform

N/A

Additional context

Apart from the above actual bug, I feel the 'assume OK unless otherwise told' layout of this function could risk unknown filetypes being uploaded to Pypi. It relies on the function _dist_file_regexes() being called by the parent function prior, which while the code currently does this, if this were to change it could introduce a more serious issue.

bug help needed testing

All 6 comments

One (relatively minor) security concern here is DoS from misformed/malicious tarballs: whatever code validates the tarball should be careful to avoid input-controlled loops or recursion.

We talked about this today & would love help from a volunteer on this!

So: please go ahead and grab this issue! Just comment and say that you're starting to work on it, and link in a comment to your work-in-progress branch once you start.

As always, if you have questions along the way as you work on this, please feel free to ask them here, in #pypa-dev on Freenode, or on the pypa-dev mailing list.

PR coming... (hadn't seen woodruffw's update before i started working on it)

As far as security concerns go... there is always a DoS concern when decompressing supplied data (problem 1). And a concern about zeroday or unpatched bugs in the underlying compiled decompression library.

For input controlled loops in my PR as it stands, the code behaves the same as the existing zipfile validation code by looping over all entries in the tarfile until it finds a PKG-INFO or there are none left.

It comes down to if we trust the Python stdlib tarfile and zipfile modules to not get stuck in an infinite loop upon malicious input here. I don't know if anyone has validated them for such purposes.

The decompression denial of service of cpu or memory consumption seems to be the most likely one to target and trigger.

Fixed in #5609.

Thank you @gpshead!

Was this page helpful?
0 / 5 - 0 ratings