We discovered this accidentally by way of https://github.com/zopefoundation/zc.relation/issues/4#issuecomment-397532224: if you pass a string containing newlines to the description argument of setup(), setuptools will generate a malformed PKG-INFO.
To reproduce:
# setup.py
from setuptools import setup
setup(
name='test-package',
version='0.1',
author='Blah Blah',
author_email='[email protected]',
description='description\n\n',
py_modules=['blah'],
)
(The contents of blah.py do not matter, but the file should exist.)
Run python setup.py sdist and the inspect test_package.egg-info/PKG-INFO. For me, with setuptools 39.1.0, it looks like this:
Metadata-Version: 1.0
Name: test-package
Version: 0.1
Summary: description
Home-page: UNKNOWN
Author: Blah Blah
Author-email: [email protected]
License: UNKNOWN
Description: UNKNOWN
Platform: UNKNOWN
The extra newlines lead tools to treat the rest of the PKG-INFO as a long_description.
I would expect setuptools to complain about the newlines in the description field, or at least escape them properly (i.e. prepend whitespace, like it does for the long_description field).
I looked at this for about 10 minutes, and I'm still not sure where validation should happen for these options/metadata fields. From what I can tell, only a few of the options do any sort of validation, and even then only in distutils.
So while I agree it would be nice for Setuptools to reject invalid metadata fields, it may not even have a good hook point to do so. Perhaps a hook could be added in Distribution.finalize_options.
Do you think this would be best fixed in distutils? Such a fix would take (much) longer to reach users, but I don't think this foot-gun is very dangerous.
@mgedmin I think at this point things are not really being fixed much in distutils. Some time soon we'll be removing setuptools' dependency on distutils and distutils will be deprecated entirely.
Just wanted to point out that this is not limited to just the Summary field -- any field with a double newline will cause the rest of the PKG-INFO file after those newlines to be interpreted as the "message body" and thus become the Long-Description.
@di That's not great. I think we could reasonably add strip('\n') to all fields, but that wouldn't stop anyone from doing something like A\n\nB. Is there any way to escape double-newlines so they aren't interpreted this way?
I see the options as:
long_description.s/\n+/\n/g for everything except long_description.I think I prefer 1 as least disruptive, but I'm ambivalent between 2 and 3. 2 is more explicit, but 3 creates less "churn" (since this will just be auto-magically fixed).
I think #1 is probably not the right thing to do here, as the resulting field would still contain the extra (albeit escaped) newline characters, which would not be what the user wants.
I think if we could do #2, it would be easy enough to just do #3 and avoid a possibly-confusing error message.
@mgedmin @di Even a single newline at the end of the description parameter of the setup function seems break PKG-INFO. (The second newline is added when writing PKG-INFO to separate the parameters in the file.)
See https://pypi.org/project/zc.relation/1.1.post1/#files for such a malformed PKG-INFO and https://pypi.org/project/zc.relation/1.1.post1/#description for the PyPI parsing result.
description.I tend to think that "newlines are not allowed in that field" would be a clear enough error message to support option (2) for that case.
However, if you go for the explicit options (1 or 3), note that setup.py upload (as opposed to twine) does NOT fail in that case (not sure what/how it does exactly, but data is shown correctly in testpypi).
Hence - having a look at what the raw upload command does might provide more objective feedback re: "what the user expects" here.
I see the options as:
- Escape newlines in the text so they aren't interpreted as delimiters.
- Raise an error whenever double newlines are found outside of
long_description.- Silently
s/\n+/\n/gfor everything exceptlong_description.I think I prefer 1 as least disruptive, but I'm ambivalent between 2 and 3. 2 is more explicit, but 3 creates less "churn" (since this will just be auto-magically fixed).
We should fail hard for configurations that are invalid. There should be an error message that gives a clear indication on how to fix the problem. This is a developer API, and developers should deal with their broken code. We shouldn't try to guess-fix broken stuff.
Do the fields/kwargs have an explicit specification of the data they expect? This should be enforced. The fields description and license obviously expect single-line data. Hence, when a newline is submitted in the data for those fields the generation of PKG-INFO should be aborted with a clear error message, e.g.
$ python3 setup.py sdist
ERROR: The "description" field requires a single line of text. Multi-line text was supplied.
Aborting.
OK, anyone who wants to submit a PR, let's go with a hard-failure if newlines are found in fields other than long_description.
I think the easiest way to do this is to implement the check right before PKG-INFO is actually written out. Hopefully that doesn't disconnect the error message too much from the location where these inputs are actually being generated.
I can do this -- if no-one else has already started.
IIUC, I have to adapt the write_pkg_file pseudo-method, which is used to monkey patch the distutils implementation from the Python standard library.
I'd really prefer to introduce a class for this that can encapsulate the validation logic for all values. It feels too much like spaghetti code, otherwise. I hope that's fine.
I'd really prefer to introduce a class for this that can encapsulate the validation logic for all values. It feels too much like spaghetti code, otherwise. I hope that's fine.
One of my top priorities for this weekend's sprint is to move metadata validation logic out of Warehouse and into pypa/packaging for reuse elsewhere, and this would be a good example of "elsewhere".
The API will likely behave a lot like how it's currently being used in Warehouse: there will be a class which is initialized with a dict adhering to JSON-compatible metadata, and an instance of that class will have a .validate() function and a .errors attribute.
If you want to move ahead with that assumption, we can probably just connect the dots later.
@di I have some difficulties with the idea of passing in only a (flat) dict with JSON-compatible Metadata to the constructor. Currently, the resulting code would look something like below. But this way we'd have to somehow stick the validation logic into the validate method:
def write_pkg_file(self, file):
"""Write the PKG-INFO format data to a file object."""
metadata = Metadata(dict(
name=self.get_name(),
version=self.get_version(),
description=self.get_description(),
# ...
))
metadata.validate(throw_exception=True)
version = get_metadata_version(self)
file.write('Metadata-Version: %s\n' % version)
file.write('Name: %s\n' % metadata.name)
file.write('Version: %s\n' % metadata.version)
file.write('Summary: %s\n' % metadata.description)
# ...
To avoid setting a validator for every single value we may set a "default" validator, and set individual validators with method calls, e.g.
# ...
))
metadata.add_validator('default', single_line)
metadata.add_validator('long_description', multi_line)
metadata.add_validator('requires', check_requirements)
metadata.validate(throw_exception=True)
Note that this splits up the value and its validation specification, and duplicates the metadata key.
Instead, I'd rather specify the validator together with with the metadata item, e.g.
metadata = Metadata(dict(
name=Value(self.get_name(), validator=single_line),
version=Value(self.get_version(), validator=single_line),
description=Value(self.get_description(), validator=multi_line),
# ...
))
Obviously, this is not particularly beautiful, but value and validation would be visible next to each other. Maybe there is a similar but better way. Any ideas?
I've also noticed that there are quite a few check_* functions the the dist module. Would it be okay to move them to a separate validation module together with the Metadata class?
@bittner Not sure I follow... The validator(s) would be part of the packaging library, which would understand how to validate a field based on the field name. You shouldn't have to do an add_validator here at all.
@di Please see PR #1562 and let me know if this is roughly what you envisioned. Thanks!
I'd like to point out, the error message given when this issue occurs is somewhat confusing from an end user perspective, since description could actually be a text issue with the license param of the setup.
https://github.com/pypa/warehouse/blob/44fed4afb2cac5da56a3e0f9d9d73872e4e6d0ec/warehouse/forklift/legacy.py#L1012
I was scratching my head on this trying to figure out why description wasn't working, when it was actually the license text that wasn't being parsed - https://github.com/pypa/warehouse/issues/3664#issuecomment-552218625
Thanks to https://github.com/pypa/warehouse/pull/7582 , https://github.com/pypa/packaging/issues/147 is now resolved, which may unblock progress on this per https://github.com/pypa/setuptools/issues/1390#issuecomment-432411556 .
@di Has there been any progress on this one?
I observed a similar problem if settings in setup.cfg span more than one line.
For example:
#Â setup.cfg
[metadata]
name = cyminiball
version = file: VERSION
url = https://github.com/hirsch-lab/cyminiball
license = LGPLv3
license_file = LICENSE
description =
Compute the smallest bounding ball of a point cloud.
Cython binding of the popular miniball utility. Fast!
long_description = file: README.md
long_description_content_type = text/markdown
author = Norman
keywords = miniball, geometry, fast
classifiers =
Programming Language :: Python :: 3
Programming Language :: Python :: 3.6
Programming Language :: Python :: 3.7
Programming Language :: Python :: 3.8
Programming Language :: Python :: 3.9
Programming Language :: Python :: Implementation :: CPython
License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System :: OS Independent
Topic :: Scientific/Engineering :: Mathematics
Topic :: Software Development :: Libraries
Topic :: Utilities
Intended Audience :: Developers
Intended Audience :: Science/Research
Reading the meta-data for this package results in incomplete information:
try:
from importlib.metadata import metadata
except ModuleNotFoundError:
from importlib_metadata import metadata
print(dict(metadata("cyminiball")))
# {'Metadata-Version': '2.1',
# 'Name': 'cyminiball',
# 'Version': '1.1.10',
# 'Summary': '',
# 'Requires-Dist': 'numpy'}
Note the empty Summary. Author, URL and other keys are missing. The same applies for other multi-line values, e.g. "classifiers", breaking the meta-data readout.
What to do about this?
@normanius: The project cyminiball needs to fix their use of the description field not to include a newline. In 904f4a1, I'm adding a crash when newlines are present, which addresses the reported issue, but will cause cyminiball and similar packages not to build.
@jaraco Thanks for the quick response. I also found the summary of @di's recent activities in this area.
Just out of curiosity: Why are newlines not possible? What breaks with newlines? When using a setup.cfg, such restrictions are not evident because configparser is able to handle newlines. Breaking long lines is a natural reflex. Plus, I cannot read any notes about this anywhere. For example, this setuptools userguide article was the source I used as a template for my setup.cfg files, but it didn't make a reference to any line breaking.
This is probably unrelated, but it's unexpected that
from importlib.metadata import metadata
print(metadata("cyminiball")["classifier"])
# 'Programming Language :: Python :: 3'
returns only the first entry in the list of specified classifiers. Is this behavior intended or should it be reported somewhere?
Update: I just found the (undocumented) feature metadata("cyminiball").get_all("classifier"), which returns the complete list of classifiers.
Just out of curiosity: Why are newlines not possible? What breaks with newlines?
In its final form, the PKG-INFO metadata file is an RFC 822 document. Double newlines at any point in this document indicate a switch from the message headers to the message body. This causes everything after a double newline to be interpreted as part of the message body, including important metadata fields.
@jaraco with the latest update it's likely that more people will encounter pip install failures due to dependencies having newlines in the descriptions. It's causing my CI/CD pipelines to fail. I'm using Docker so I had to do a pip install setuptools==51.2.0 before pip install to work around this, which isn't ideal.
That’s unfortunate. I wonder how widespread these issues are. What packages are affected? How much control do you have to either install from wheel or fix/patch the affected packages?
For me, it was spleeter==2.0.2, which doesn't seem to have wheel. But they recently migrated over to using Poetry in 2.1 which conveniently does have it. So it's no longer an issue in the latest version. I'll have to make some changes to my project regardless.
Hello,
is this problem in the process of being fixed. I have just rum some of my azure pipelines and got a similar error https://dev.azure.com/matteoravasi/PyLops/_build/results?buildId=221&view=logs&j=011e1ec8-6569-5e69-4f06-baf193d1351e&t=fe9cbe73-a432-5abe-f444-5cc590841043&l=229
It is not fully clear to me if this is one of the dependencies or my description file? But I do not have any explicit \n myself https://github.com/PyLops/pyproximal/blob/main/setup.py. Could the fact I go to the new line be a problem, even though being the text between triple quotes that should not be really a new line. Thanks!
@mrava87 Yes, AFAIU your code (before the fix) was buggy. The problem here is not that setuptools doesn’t accept newlines in the description field — it’s not supposed to, the (short) description _must_ fit on a single line — but that in this case it produces corrupt output instead of halting and catching fire immediately.
That is to say, yes, there’s a bug in setuptools, even if there weren’t, what you just fixed in your code was a real mistake as well. (Your Travis pipeline is still failing, but for a different reason, as far as I can see.)
Thanks! My travis pipeline is back on track, but yes that was another mistake.
Strange as I had two lines for long time and it always worked fine. But I have no problem to make it one line :)
In retrospect, the change introduced was probably too aggressive, especially because it affects third-party packages and older releases. I'll back off the change to emit a warning and remove the newlines with the intention of making the behavior invalid in the future.
@mrava87 wrote:
Strange as I had two lines for long time and it always worked fine.
That’s the problem with broken output, yes: whatever fields happened to come after the description in the PKG-INFO just got silently reinterpreted as a part of the text of the long description. In my case, for example, the package was _almost_ correct, except that the Requires-Python aka python_requires field was silently discarded (and that the package description on PyPI was broken, of course). I never even recognized this before somebody using an old and unsupported Python complained. (Yes, I could and should have, but the point is that it’s not obvious.)
setuptools 51.3.0 landed and I immediately got some error report about broken (internal) builds.
Reporting error here is not a good idea, really.
Please, don't go that way. Drop description altogether if you don't like it and can't fix it, but don't break installations. Ecosystem is huge, libraries are plenty, automated builds are numerous…
@Mekk Immediately issuing a fatal error here is perhaps not a good idea (if you are affected, fetch 51.3.3 and get a warning instead), but please do recognize that if these errors break your build, _your build was already subtly broken_. It’s not that setuptools dropped support for newlines in the description field, it’s that it never supported them properly in the first place and produced bad and wrong output when they were encountered, only that bad and wrong output was still good enough that it still worked in most cases (though see my comment above re python_requires). Again, this is _not_ about dropping something that used to be supported, this is turning silent breakage into loud breakage. Which I think we can all agree is a good thing.
I am not even trying to comprehend all possible setuptools use cases. But my builds worked happily and ignored the issue (mayhaps because they had single newlines, not double…), then they stopped to. And I strongly bet I am not the only one.
I understand, that the problem discussed here in some cases caused misinterpretation of some metadata field(s). Well, looks like this is more difficult to face and only rarely causes problems. Allowing package to install on incompatible interpreter is rather minor.
BTW: as I look at some random PKG-INFO files around, looks like Description field (mapped from long_description if I understand correclty) happily handles newlines by simply indenting whole field content. Couldn't the same technique be used for Summary in case it contains newlines?
For example, here is some random package I wrote on PyPi: https://pypi.org/project/mercurial_path_pattern/ Note that keywords and classifiers are correct (that's important as they are below long description in my PKG INFO).
And here is it's PKG-INFO (cut a bit to make it shorter:
Metadata-Version: 1.1
Name: mercurial_path_pattern
Version: 1.4.2
Summary: Mercurial Path Pattern Extension
Home-page: https://foss.heptapod.net/mercurial/mercurial-path_pattern
Author: Marcin Kasperski
Author-email: [email protected]
License: BSD
Description: .. -*- mode: rst; compile-command: "rst2html README.rst README.html" -*-
=======================================================
Mercurial Path Pattern
=======================================================
Don't repeat yourself defining ``[paths]`` over many repositories,
specify the general rule once in ``~/.hgrc``.
Path Pattern is a Mercurial_ extension used to define default
remote path aliases. You may find it helpful if you maintain
consistently layed out repository trees on a few machines.
… and so on and so on ending in empty line …
Keywords: mercurial hg path alias
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
…and remaining classifiers…
At least pypi parses that properly.
So: is there any fundamental reason why the same method (indenting) which nicely works for Description, couldn't work for Summary? Then one wouldn't have to forbid or patch those descriptions, they would simply work.
(having said all that, I probably do not care, not sure whether I see those short descriptions anywhere and I doubt I would spot if they started to be silently discarded)
Probably the strongest argument is that the field is specified to be one line. I do see that the "email headers" format does appear to support multiline values in any field, although the indentation semantics are not elegant:
draft $ cat > METADATA
Summary: foo
bar
draft $ import importlib.metadata
draft $ dist = importlib.metadata.Distribution.at('.')
draft $ dist.metadata['Summary']
'foo\n bar'
That is, the indents get included when the value is loaded. The same happens when there's a leading bar:
draft $ cat > METADATA
Summary: foo
|bar
draft $ dist.metadata['Summary']
'foo\n |bar'
The spec additionally indicates that for description (and only for description), seven spaces plus a pipe character should be used to indicate a continuation following a newline.
The spec is very imprecise about what characters are allowed. Most indicate "a string" and some examples do appear to have newlines and not the pipe characters. Even PEP 566 JSON conversion has the metadata converted without any indication of how to deal with multiline strings (all of the padding and pipes in continuation lines would be included).
Probably the PEPs need a refresh or new PEP to specify how multi-line fields should be serialized and deserialized to remove extraneous content.
Until someone takes up that mantle, Setuptools will do its best to honor and enforce the specified format as best as possible.
@mrava87 wrote:
Strange as I had two lines for long time and it always worked fine.
That’s the problem with broken output, yes: whatever fields happened to come after the description in the PKG-INFO just got silently reinterpreted as a part of the text of the long description. In my case, for example, the package was _almost_ correct, except that the
Requires-Pythonakapython_requiresfield was silently discarded (and that the package description on PyPI was broken, of course). I never even recognized this before somebody using an old and unsupported Python complained. (Yes, I could and should have, but the point is that it’s not obvious.)
Thanks for the explanation. Makes sense and good that things get fixed and users can't follow bad practices anymore. No problem with things breaking as long as it is clear why and we can change, its all good :D
Most helpful comment
We should fail hard for configurations that are invalid. There should be an error message that gives a clear indication on how to fix the problem. This is a developer API, and developers should deal with their broken code. We shouldn't try to guess-fix broken stuff.
Do the fields/kwargs have an explicit specification of the data they expect? This should be enforced. The fields
descriptionandlicenseobviously expect single-line data. Hence, when a newline is submitted in the data for those fields the generation of PKG-INFO should be aborted with a clear error message, e.g.