As it removes md5 fragment on the fly, pkg_resources.Distribution.hashcmp is slow:
@property
def hashcmp(self):
return (
self.parsed_version,
self.precedence,
self.key,
_remove_md5_fragment(self.location),
self.py_version or '',
self.platform or '',
)
Would there be side effects if location is changed into a property where the setter would not only store the location but also store the location without fragment ?
Can you quantify how slow is hashcmp and what the user impact is (e.g. what are the conditions under which a user would notice the slowness)?
The simple example below calls hashcmp (thus _remove_md5_fragment) nearly 1.3 million times:
import pkg_resources
import setuptools.package_index
pki = setuptools.package_index.PackageIndex()
req = pkg_resources.Requirement.parse('setuptools')
pki.obtain(req)
I guess that setuptools is the most patholigical case.
Nevertheless, trying to obtain django calls hashcmp more than 200k times.
Thanks for that. It sounds like they way package_index uses it, it's very inefficient.
On my machine, that code takes about 9000ms to run. A long time, to be sure, but bearable on an incidental basis and a good motivator to avoid that code path.
I'm pretty sure package_index is deprecated as it's only used as part of the deprecated easy_install. Did you encounter this issue in real-world usage not using easy_install?
I've added a couple of alternative implementations. Here's a comparison of the performance of each:
setuptools master $ cat hashcmp.py
import pkg_resources
import setuptools.package_index
def run():
pki = setuptools.package_index.PackageIndex()
req = pkg_resources.Requirement.parse('setuptools')
pki.obtain(req)
setuptools master $ python -m timeit -s 'import hashcmp' 'hashcmp.run()'
1 loop, best of 5: 8.41 sec per loop
setuptools optimize-ditribution-hashcmp $ python -m timeit -s 'import hashcmp' 'hashcmp.run()'
1 loop, best of 5: 2.15 sec per loop
setuptools feature/faster-distribution-hashcmp $ python -m timeit -s 'import hashcmp' 'hashcmp.run()'
1 loop, best of 5: 2.12 sec per loop
setuptools feature/faster-distribution-hashcmp-wrap $ python -m timeit -s 'import hashcmp' 'hashcmp.run()'
1 loop, best of 5: 1.68 sec per loop
setuptools feature/faster-distribution-hashcmp-no-remove $ python -m timeit -s 'import hashcmp' 'hashcmp.run()'
1 loop, best of 5: 2.21 sec per loop
I'm liking this last option the best. I think I'm going to go with that.
In #2098, @gotcha had some good questions:
* Should the full `hashcmp` be cached ? I tend to answer yes because it already relies on other cached properties.
Perhaps, especially if there's a compelling use-case. I'm not yet convinced there is.
* Could it be cached simply, like `key` or `parsed_version` ? I also tend to answer positively. * Which implementation should we choose ? `hasattr` or `try except` ? I am not deep enough in Python internals to know which is quicker. * Is there a reason to use a different implementation of caching for `key`and `parsed_version` ? Should we unify on the quickest ?
I'd like to avoid micro-optimizing this method unless we can show the value of it. I'd also like not to re-invent the best caching method.
* What is the reason to use 3.8 `cached_property` instead of a simpler caching implementation ?
Yes! Using a published, standardized, maintained solution is cleaner (better separation of concerns), shorter (once backports are no longer needed), and easier to maintain. And in my experience, the performance is adequate if not near optimal. It avoids the pitfalls you identify above (variety of possible approaches). IMO, the separation of concerns is the most important benefit, separating the caching behavior from the essential behavior.
Most importantly, this code is probably not worth our efforts. pkg_resources and packageindex especially are largely legacy implementations and I'd like to address critical bug fixes, but otherwise focus on the essential, long supported features of setuptools (mainly building packages).
It is always better to remove code when possible. Thanks for #2108
To answer your question above about real-world usage, I am in the process of removing zc.buildout dependency on setuptools.easy_install.
In a first phase, I plan to keep the dependency on setuptools.package_index.
Is there some doc that presents how quick it will be deprecated ?
Really appreciate the work on zc.buildout. That's a good question about package_index. I don't believe there's anything official about deprecating it. I don't have any near-term plans to deprecate it. My instinct is that there are already another implementation out there. Looks like pip._internal.index is what pip uses.
Maybe the best thing here would be to extract package_index as a third-party package so that you or anyone could iterate on it more rapidly. I'll open a ticket about it.