First of all, kudos guys on the v20+ release :+1: , wheel caching is much better and the addition of cached built wheels is an awesome time saver.
TL;DR
Reference implementation for pip install --parallel, which will download, inspect, and install packages in parallel, greatly accelerating the environment installation of an entire project (or any requirements.txt file).
A few speedup numbers (Ubuntu w/ python 3.6):
Requirements variant | Parallel + D/L | PyPI v20.1 + D/L | Speedup incl. Download | Parallel | PyPI v20.1 | Speedup Cached Wheels
-------------------- | -------------- | ---------------- | ---------------------- | -------- | ---------- | ---------------------
A                    | 0:58.14        | 1:51.08          | x1.90                  | 0:16.16  | 0:33.57    | x2.08
B                    | 5:47.36        | 6:55.49          | x1.19                  | 0:52.19  | 1:12.07    | x1.37
C                    | 0:56.08        | 1:44.34          | x1.86                  | 0:14.36  | 0:29.21    | x2.01
D                    | 0:36.45        | 1:39.55          | x2.71                  | 0:14.59  | 0:33.20    | x2.22
Average              |                |                  | x1.91                  |          |            | x1.92
Details:
We heavily rely on pip to set up environments on our ever-changing production systems, specifically either setting up a new virtual environment or working inside a Docker container. The full implementation of our ML/DL agent is open source and can be found here: https://github.com/allegroai/trains
With the introduction of pip 20.1 we decided to try and parallelize the pip install process in order to save precious spin up time.
Downloading and inspecting packages is done with a ThreadPool (a global lock was added to RequirementSet). The only caveat is the download progress bar (more on that below). The reason for using threads is the fact that most of the speedup here is network and I/O, which parallelize well on Python threads. The second reason is the way RequirementSet is constantly growing with every package discovered, together with its requirements; sharing the set among all threads is quite trivial, as opposed to sharing it across processes.
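For illustration, here is a minimal sketch of that thread-pool pattern (not the actual branch code): download/inspection workers share a single lock-protected requirement set, and the helper download_and_read_dependencies is a hypothetical stand-in for the real work.

```python
from multiprocessing.pool import ThreadPool
from threading import Lock

class SharedRequirementSet:
    """Grows as new requirements are discovered; guarded by a single lock."""
    def __init__(self):
        self._lock = Lock()
        self._requirements = {}

    def add(self, name, req):
        # Returns True only for newly discovered requirements.
        with self._lock:
            if name in self._requirements:
                return False
            self._requirements[name] = req
            return True

def download_and_read_dependencies(req):
    # Placeholder for "download the distribution and read its metadata".
    return []

def fetch_and_inspect(req, shared_set):
    for name, dep in download_and_read_dependencies(req):
        shared_set.add(name, dep)

if __name__ == "__main__":
    initial_requirements = ["numpy", "requests"]  # illustrative input
    shared = SharedRequirementSet()
    with ThreadPool(processes=8) as pool:
        pool.map(lambda r: fetch_and_inspect(r, shared), initial_requirements)
```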
Installing the wheels is done with a multiprocessing Pool, as installing the wheels will not execute any script and can be parallelized without risk. The reason for choosing processes over threads is that the limiting factor here is actually CPU cycles in the installation process. Also, wheel unpacking and copying are totally independent of one another and of the rest of the process, and lastly order has no meaning at this stage, as the requirement order is only important when building packages.
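As a rough sketch of that idea (not pip's implementation; the target directory and wheel discovery below are purely illustrative), installing pre-built wheels really can be reduced to unzipping in worker processes:

```python
import glob
import zipfile
from multiprocessing import Pool

TARGET_DIR = "site-packages"  # illustrative destination

def install_wheel(wheel_path):
    # A wheel "install" is essentially an unzip into the environment;
    # no scripts run, which is why this is safe in a worker process.
    with zipfile.ZipFile(wheel_path) as whl:
        whl.extractall(TARGET_DIR)
    return wheel_path

if __name__ == "__main__":
    wheels = glob.glob("*.whl")  # already-downloaded/built wheels
    with Pool() as pool:
        for done in pool.imap_unordered(install_wheel, wheels):
            print("installed", done)
```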
Unpacking the wheels was optimized to reuse the unpacked folders; this happens between the inspection part (the resolver) on a cached wheel and the installation of the same wheel (essentially, the wheel used to be unpacked twice).
Solving the parallel progress bar: the progress bar now has a global lock; the first progress bar to acquire it will be the one outputting progress to the tty. Once that progress bar is done, another instance can acquire the lock and will continue to report from its current progress stage.
This looks something like:
|████████████████████████████████| 101 kB 165 kB/s
|█                               | 71 kB 881 kB/s eta 0:00:03
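A simplified sketch of the "one bar owns the tty" locking described above (the real PoC wraps pip's progress-bar classes; this standalone class is only illustrative):

```python
import sys
import threading

_tty_lock = threading.Lock()

class ProgressBar:
    """Each download thread owns one bar; only the lock holder prints."""
    def __init__(self, total):
        self.total = total
        self.done = 0
        self.owns_tty = False

    def update(self, n):
        self.done += n
        if not self.owns_tty:
            # Try to become the bar that writes to the terminal.
            self.owns_tty = _tty_lock.acquire(blocking=False)
        if self.owns_tty:
            sys.stdout.write("\r%d/%d" % (self.done, self.total))
            sys.stdout.flush()

    def finish(self):
        if self.owns_tty:
            sys.stdout.write("\n")
            _tty_lock.release()  # let the next bar take over the tty
```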
The full reference code can be found here
Usage example:
pip install --parallel -r requirements.txt
Benchmark test details.
Take the download (D/L) columns' time/speed with a grain of salt; download speed is not stable enough between runs, but obviously parallel download is much more efficient.
Requirements variant B is actually somewhat of an outlier: it contains the torch==1.5.0 package, which is a 700+ MB package. Obviously the download and installation process is limited by the single slowest package, and that is the case for variant B. The reason for having it in the first place is that we use it quite often (and also TensorFlow, a package of almost the same size). For comparison, scikit-learn is only a few tens of MB.
The other variants are a more off-the-shelf mixture of packages.
Notice that in requirements variants A/B/C all the package versions are fixed.
Attached requirement files used for the benchmark:
Thank you for sharing the benchmark, the design, as well as the proof of concept. I am glad to see the positive benchmark for larger collections (I suppose, estimating the size of the others based on that of B; I'd love to have them to test in the future though), compared to the ones tested under GH-3981.
I made a quick run and found that the PoC still has the issue with:
We need to find a way to clean up correctly when SIGINT is issued. Since the progress bar is initialized in a different thread we cannot set the signal handler as InterruptibleMixin did before. That functionality is currently commented out and needs to be somehow restored before we merge this.
I am unaware whether this is because you and your team have not worked on that, or there's still a technical obstacle to be solved. If the latter is the case, I'd love to know what you found in addition to the quote above.
I've also submitted a plan for solving GH-825 as part of GSoC, however it's purely a plan. Theoretically it should work, but I can't be sure that it definitely will in practice. I'd be really grateful if you took a look at it and left some comments. Note that I'll only know if it's approved by pip's maintainers tomorrow (May 5), so don't take it too seriously.
Given the situation in GH-8169, I wonder how Pool performs on a single-core machine, and whether there is any perk that we're yet to be aware of. I also suspect multiprocessing.ThreadPool to have the same problem as multiprocessing.dummy.Pool, however I'm not sure how to reproduce it (that's why the issue was opened in the first place).
Thanks @McSinyx for the references!
To answer a few of your concerns:
Requirement files added here
Notice that the GH-3981 implementation is quite outdated (more than 3 years old), and actually the good people of PyPA did a great job with v20, simplifying the process of package download and dependency resolving. This is the main reason for us to tackle this issue now, as the solution is relatively straightforward.
If you have a single core, Pool should not be used (the performance loss is negligible, but why would you create another subprocess, right? :)
See the if statement and program flow here
You could actually add multiprocessing.cpu_count() > 1 to the if statement, and that will make sure that even if one uses the --parallel flag but has a single core, there is no performance loss.
The same can be added here
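Roughly something like this is what I mean by the guard (an illustrative sketch only, not the actual diff; install_one_wheel is a stand-in for the real unpacking step):

```python
import multiprocessing

def install_one_wheel(wheel):
    # Placeholder for the actual wheel-unpacking step.
    print("installing", wheel)

def install_wheels(wheels, parallel=False):
    # Only spawn worker processes when --parallel was requested AND
    # more than one core is available; otherwise stay serial.
    if parallel and multiprocessing.cpu_count() > 1:
        with multiprocessing.Pool() as pool:
            pool.map(install_one_wheel, wheels)
    else:
        for wheel in wheels:
            install_one_wheel(wheel)

if __name__ == "__main__":
    install_wheels(["a.whl", "b.whl"], parallel=True)
```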
Regarding the Progress Bar signal handler, see here.
Basically only the main thread will replace the handler; any other thread will just ignore it.
This means that ctrl-c will be caught and processed, while still allowing multiple threads to report progress bars. To be honest, the self.finish() that is called in the signal handler seems quite redundant; the only thing it does is flush the tty and return the cursor, both of which will happen anyhow when the process quits (just my 2 cents :)...
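For reference, the main-thread check is essentially this (a simplified sketch, not the exact code in the branch):

```python
import signal
import threading

def install_sigint_handler(handler):
    # signal.signal() raises ValueError when called from a non-main thread,
    # so only the main thread registers the handler; worker threads skip it
    # and keep reporting their progress bars.
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGINT, handler)
```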
On a different note, if you are willing to help out with this code and make it PR worthy, please don't be a stranger, feel free to push to the forked repo :smile:
This is awesome.
Looking forward to the PR.
Also note that pip is in the middle of the process of integrating a new resolver. So make sure you're able to also introduce the parallelism to the new resolver, otherwise all the good work would be lost once the transition is finished.
@uranusjr thanks for the heads up!
Where could I find the new resolver?
EDIT:
Are you referring to this?
https://github.com/pypa/pip/blob/master/src/pip/_internal/resolution/resolvelib/resolver.py
If so, the implementation is almost the same, should work without any real effort.
How do I make sure I'm using it (seems like the default is "legacy")?
@bmartinn Your observations are correct. The new resolver can be enabled (in 20.1 or later) by passing --unstable-feature=resolver to pip.
@bmartinn, thank you for the requirement files as well as explanations 2 and 3. As for the termination, yesterday I noticed that after canceling in the middle of a parallel download, my cursor disappeared, so there might be something wrong. However, to make the UI user-friendly, I believe that it might be a good idea to only have the main thread take care of all of the displaying, as in my proposal, which is now accepted (yay, and thanks to the PyPA maintainers and others for commenting during the drafting process).
To @uranusjr as well: I'm not sure how the proposal was reviewed, but if you haven't taken a look at it, could you please skim through the implementation idea (section 3.3.3) to see if it fits your vision of coordinating with the new resolver, i.e. does it cause any trouble for the ongoing implementation of the resolver?
BTW I feel the need to say that I've just gotten (physically) back to school this week and am still trying to get used to the new and denser schedule, so please bear with me if in the next one or two days I am not able to carefully examine things to meaningfully participate in the discussion.
@McSinyx, following your suggestion, I pushed a fix for the Logger handler issue (basically forcing at least one instance to be created on the main thread), let me know if it solved the cursor problem.
@uranusjr I'm testing the --unstable-feature=resolver feature, and it seems to download tar.gz and not .whl files. Are you guys switching to tar.gz?
@bmartinn pip maintains a wheel cache and won’t redownload if you had downloaded a wheel previously. You probably need to supply --no-cache or something.
Apologies @uranusjr, I was seeing the future package download being tar.gz, and apparently there is no wheel for it (it compiles); other packages behave as expected and download .whl files :)
Thanks!
Hi @uranusjr
I updated the code here to support the new Resolver as well as the legacy one. I like the new resolver/resolution design, and I tried to make the parallelization as agnostic as possible; basically there is another resolution instance running in the background, trying to guess ahead which package will need to be downloaded.
This design should hold with the continuous development of the new Resolver/Resolution classes.
With your approval I'd like to push it as a PR :)
BTW:
It seems the new resolver is slower than the current default one, but I guess that's reasonable as it looks like it is still under development :+1:
Currently with the new experimental Resolver, we still get a x1.7 factor speedup through the parallelization, so at least we have that.
EDIT:
@uranusjr
Do you think it is better to have the ResolverLib parallelization PR to resolvelib or here?
I think it would be a reasonable addition, but some considerations are needed. threading (and by extension parts of multiprocessing) is not a universally available module (#8169), so a fallback to single-thread operation would be needed on such platforms. The main task (IMO) would be to come up with a reasonable abstraction so we don’t end up maintaining two partially duplicated pieces of code.
Sure, we can quite quickly protect the import and disable the feature automatically.
(notice that in my code, the parallelization is an additional flag anyhow)
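Something like the following guard would do (an illustrative sketch, assuming a fallback to the serial path is acceptable; resolve_one here is just a placeholder):

```python
try:
    from multiprocessing.pool import ThreadPool
    PARALLEL_SUPPORTED = True
except ImportError:
    # Platform without threading/multiprocessing support: disable the feature.
    ThreadPool = None
    PARALLEL_SUPPORTED = False

def resolve_one(req):
    # Placeholder for "inspect one requirement and return its metadata".
    return req

def resolve(requirements, parallel=False):
    if parallel and PARALLEL_SUPPORTED:
        with ThreadPool() as pool:
            return pool.map(resolve_one, requirements)
    # Serial fallback when parallelism is disabled or unsupported.
    return [resolve_one(req) for req in requirements]
```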
The main task (IMO) would be to come up with a reasonable abstraction so we don’t end up maintaining two partially duplicated pieces of code.
You mean regarding ResolveLib?
I meant regarding within ResolveLib, the parallelisation needs to reduce duplicated resolving logic implementations. In your current implementation, for example, there are two completely separate pieces of code for updating criteria in single-thread and multi-threaded modes, which would be hard to maintain if we want to revise the resolution logic.
Yes @uranusjr you are correct, I put it there just so the execution path is easier to see (i.e. when viewing the diff), I'll extract it out to a function and make sure we call it, obviously easier to maintain :)
Still, with that in mind, should it be pushed to the pip or the ResolveLib repository? It seems you already have a PR updating to the new ResolveLib version, and the parallelization itself is fully contained in the new ResolveLib, so I guess it makes the most sense (from a maintenance perspective as well) to PR it there and then just have it synced back to pip?!
What do you think ?
EDIT:
@uranusjr I extracted the "initial requirements to criteria" method and added safeguards on multiprocessing (basically disabling the parallelism if multiprocessing is not supported).
Hi @bmartinn, it took me quite a while to get back in shape :smile: IIUC, your branch has three separate enhancements:
I don't have enough understanding to comment on 3 (I like the idea, just haven't carefully read the code), but for the other two:
It's true that the current situation is much better now than in the past where download/resolve/build/install were coupled. I believe there shouldn't be any problem with throwing packages to be installed to a pool since currently we only display Installing collected packages and Successfully installed.
It'd also be nice if we could have parallel builds, but I think that is out of the scope of this discussion. Three features in a single branch are already really hard for me to follow, and I believe pip's maintainers might have the same issue.
I have two main concerns over the design:
TL;DR
See "A few thoughts:" at the bottom
Hi @McSinyx
Yes, your observations are correct :)
Regarding the install process: as you noticed, parallelizing the wheel unpacking/installation across multiple processes is not a problem, as they are truly independent. Builds, however, can rely on other packages in the build process, so building them in parallel is a bit more tricky; hence, as a first step, I left them serially built. (Notice that there is actually some work that needs to be done there regardless of the parallelization, as the build process is not aware of dependencies, but it actually should be, especially when installing packages from git repos where you have to build them even if they are pure Python. We have seen multiple installations break due to that, and it would be great to see someone tackle the problem ;)
A short note: the installation process can be very slow if you have multiple packages, and if all packages are cached it is the most time-consuming part of the pip install process. It can easily be a 30-second process before parallelization and a few seconds after...
Regarding the download process, this is actually tied to the resolving process. When packages are resolved they need to be downloaded in order to get all the dependencies from the wheel files. If you have a list of packages with fixed versions (that you know will not collide), then you only download each package once. But if you have to "resolve" some collisions, you might end up downloading several versions of a package until you manage to remove all collisions between dependency versions. More specifically, the current resolver (under the 'legacy' folder in master, but still the default one) will "optimistically" hope that all versions can co-exist with one another. The parallelization of this resolver is quite trivial after adding some protection mechanisms so that we can access a collection from multiple threads safely. Since each resolve will download its own package, we essentially parallelize the download process (but also the inspection of the cache, etc.); this is quite efficient and easy to maintain.
The new Resolver is more sophisticated and tries to resolve collisions by iterating over the least dependent packages first, then slowly "climbing" the dependency tree. The actual implementation comes from the [resolvelib](https://github.com/sarugaku/resolvelib) package.
In order to parallelize the new resolver, we decided on a slightly different approach: we have a copy of the Resolution class (which does most of the heavy lifting in terms of downloading/resolving specific versions), and we try to "guess ahead" what the main thread's Resolution will need, getting those packages and their dependencies in advance. Every time the main Resolution instance collects a new package, it syncs its state to the background Resolution instance, so that the background can better guess what is needed. Notice that the actual decision making is done by the main instance, and the background instance might download the "wrong" version of a package, but that's okay, because it's just in the background, so no harm done :) Then, when the main Resolution needs a specific package, it might just download it itself.
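Very roughly, the shape of it is something like this (all class and callable names here are illustrative placeholders, not the actual classes in the branch):

```python
import threading

class BackgroundPrefetcher:
    """Speculatively pre-downloads candidates the main resolution may need."""

    def __init__(self, guess_candidates, download_to_cache):
        self._guess_candidates = guess_candidates    # hypothetical: yields likely candidates
        self._download_to_cache = download_to_cache  # hypothetical: fetches into the cache
        self._pinned = {}
        self._state_changed = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def sync_state(self, pinned_candidates):
        # Called by the main resolution whenever it pins a new candidate,
        # so the background guesses stay close to the real resolution.
        self._pinned = dict(pinned_candidates)
        self._state_changed.set()

    def _run(self):
        while not self._stop.is_set():
            self._state_changed.wait(timeout=1.0)
            self._state_changed.clear()
            for candidate in self._guess_candidates(self._pinned):
                # Wrong guesses only waste a cached download; the main
                # thread still decides what actually gets installed.
                self._download_to_cache(candidate)

    def stop(self):
        self._stop.set()
        self._thread.join()
```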
UI design. The latest progress bar implementation (again, thank you for noticing the hide-cursor bug) will give the user as much information as possible, but still, hopefully, keep the log readable.
Starting a new download will be printed on a new line, the end of a background download will be printed as a full progress bar with all the info, and the current download will be printed as the last line (like in a serial installation). The only issue with storing the background downloads and then printing them is the following: 1. the data is currently not passed to the progress bar itself, so it does not know which package it's reporting for; 2. if we do not report that it is being downloaded, it might reduce transparency...
A few thoughts:
Regarding PRs, please make multiple ones, one for each feature, as per pip's guidelines. Since IMHO we have our mini roadmap here, it'd be easier to discuss with well-scoped patches attached. Personally I feel that the ones involving multithreading and multiprocessing will not be merged until we have a better understanding and/or consensus over the situation at GH-8169, but the caching should not suffer such delay.
About the progress bar, as a user, frankly I feel that the current mock-up is quite confusing. Even with a changing number of downloads at the same time, we could just redraw the entire screen, but (1) it's complicated and (2) it'd be a mess on half-broken terminals, which in my experience are quite common. That being said, if that's the way to go, I'd try my best to help make it happen. However, I still have concerns over duplicate downloads of future ones, the kind of concern I don't really know is avoidable. We might need to refactor the following to do convenient hashing of URLs:
_internal.operations.prepare.RequirementPreparer.prepare_linked_requirement:
_internal.operations.prepare.unpack_url:
_internal.operations.prepare.get_http_url:
_internal.operations.prepare._download_http_url # actual download
and I'm not a fan of while not ready: sleep for a short interval for parallel waiting on the same thing. Also, I'm starting to feel that I'm talking too abstractly here and I might not know all of what I say, so please split your branch into PRs so we can better discuss.
I'd love to hear an elaboration on the parallel build, but this might not be the right place, since there are so many things to be discussed at the same time here already. AFAIK there's no ticket dedicated to it, so feel free to create one (or just tell me to do it).
Thanks @McSinyx sounds like a plan to me :)
btw:
What do you mean by
I'm not a fan of while not ready: sleep for a short interval
Where did you see this pseudo code?
Where did you see this pseudo code?
For the full context, what I was trying to say was:
I still have concerns over duplicate downloads of future ones [...] and I'm not a fan of while not ready: sleep for a short interval for parallel waiting on the same thing
E.g. if A is needed by both B and C and we don't control the collection of packages to be downloaded, then if, say, B initiates the download first, and the progress is not finished when C asks for A too, then the thread resolving C would need to wait for that future A to finish. There could be another way that I'm not aware of, but if that's not the case then we need to do void loops to wait for it, and personally I don't like that.
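One possible way around that, sketched here with illustrative names (this is not what the branch currently does), is to deduplicate in-flight downloads by URL and have the second requester block on the same Future rather than poll:

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from urllib.request import urlopen

def download(url):
    # Placeholder download; real code would stream into pip's cache.
    return urlopen(url).read()

class DownloadPool:
    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=8)
        self._lock = Lock()
        self._in_flight = {}  # url -> Future

    def fetch(self, url):
        with self._lock:
            future = self._in_flight.get(url)
            if future is None:
                future = self._executor.submit(download, url)
                self._in_flight[url] = future
        # If B started the download of A, the thread resolving C ends up
        # waiting on the very same Future instead of polling in a loop.
        return future.result()
```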
Heads up: This is on my "take a look at" list, but I'll have my hands full right now at least for the next few days, with pip 20.1.1.
I'm super excited that @bmartinn has done so much useful work on this, and filed a PR for this as well! It's much appreciated! I'm 100% onboard for doing this.
However, to use Brett Cannon's excellent analogy, you're gifting a puppy here and I'm wondering how accepting this will affect my life once the puppy is all grown up. I really think we should carefully consider how this affects various other parts of pip, before moving too far with this:
I do have some thoughts on the specific implementation here (w.r.t. output etc) but I think we should address the more meaty design/technical concerns around the effects of such a change (like the ones above and possibly more!), prior to getting to the implementation/CLI details.
FWIW, 4 is covered by pip's existing --unstable-feature flag already, so I'm not too concerned about that TBH.
One thing I'll note here, is that there's a minor overlap with @McSinyx's GSoC project for this summer here (#825, parallel downloads), in that they both need parallelization. However, beyond the discussion about how pip does parallelization (#8320), I don't think there's any overlap between these two tasks, since the GSoC project is more about what happens during dependency resolution and this specific issue is more about what happens after the dependency resolution process is completed.
@pradyunsg answering a few of the concerns you raised:
Regarding the --parallel flag, I would definitely not have the parallel resolver as default, at least not until we receive enough feedback. Regarding the installer, since this is really only unzipping wheels, order is meaningless, so there is no real reason why we should not want to use parallelism (with the exception of OS support, of course).
Regarding @McSinyx's GSoC, please feel free to use this branch as a starting point; most of the work was put into the resolver itself and making sure it is parallel. There is also some work done with the logger (making it parallel friendly) and a few ideas on improvements. Bottom line, I think there is enough work here to cover GSoC and more :)
It is nice to have parallel download and parallel installation of wheel files. But did someone think about doing package builds with multiple threads (like "make -jX")?
On AArch64 we have to wait and wait, watching 'pip install package1' occupy a single core for 20+ minutes while 45 cores sit idle waiting for pip to finish.
But did someone think about doing package builds with multiple threads
That's a backend issue for setuptools, or whatever backend the package uses, to address. I don't know if setuptools has looked at parallel builds - maybe you could ask there.
But did someone think about doing package builds with multiple threads
That's a backend issue for setuptools, or whatever backend the package uses, to address. I don't know if setuptools has looked at parallel builds - maybe you could ask there.
That would be quite cool! and really accelerate the process of installing packages.
I have the feeling we will be saving lots of time/electricity with these PRs, go green :)
explore about how this affects #8119 and other similar issues.
Should work as each wheel gets its own process (this is actually when we are reinstalling the package)
I think we need to re-examine the implications here. There may be overlaps in the wheels we are going to install, and there would be race conditions in those cases. Having to deal with parallel installation possibly introducing files at any point in time requires a different strategy than proposed in #8119, at least.
@chrahunt parallel installation is only for wheels. Wheel installation is essentially an unzip, that is it; no builds and no scripts are supported. This is the only reason why it is safe to assume we can parallelize this process.
Now regarding #8119, this is just a warning, which should be displayed regardless of whether this is serial or parallel.
Race condition is irrelevant; either serial or parallel, you are destroying your own setup.
This is of course assuming that somehow you managed to install two different wheels with the same name in a single pip install call (which is something I'd like to see ;)
The race condition I'm referring to would occur during installation of two wheels that both contain a file with the same name. That is very easy to do, since any wheel can contain any file. You can test it yourself, from a checkout of pip:
$ ./.tox/py38/bin/python
Python 3.8.0 (default, Jan 4 2020, 14:08:19)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tests.lib.wheel import make_wheel
>>> make_wheel("simple1", "0.1.0", extra_files={"simple.py":"1"}).save_to_dir(".")
'./simple1-0.1.0-py2.py3-none-any.whl'
>>> make_wheel("simple2", "0.1.0", extra_files={"simple.py":"2"}).save_to_dir(".")
'./simple2-0.1.0-py2.py3-none-any.whl'
>>>
$ ./.tox/py38/bin/python -m pip install --find-links=. simple1 simple2
Looking in links: .
Processing ./simple1-0.1.0-py2.py3-none-any.whl
Processing ./simple2-0.1.0-py2.py3-none-any.whl
Installing collected packages: simple1, simple2
Successfully installed simple1-0.1.0 simple2-0.1.0
$ cat ./.tox/py38/lib/python3.8/site-packages/simple.py
2
Race condition is irrelevant; either serial or parallel, you are destroying your own setup.
Letting someone destroy their own setup inadvertently is not a good user experience! Part of the motivation for #8119 is to reduce that chance.
The race condition I'm referring to
Can we clarify terminology here, just to prevent misunderstanding? What you're referring to seems to me to be an operation whose result depends on the order in which two sub-operations are executed. That's not quite the same as what I think of when I hear the term "race condition" - which I associate with the additional property that one of the outcomes is demonstrably correct. So, in my view, "race condition" implies "non-deterministic chance of getting a broken result rather than a success". (I'm not arguing that my definition has to be correct, just that we should be careful to be clear what we are discussing).
In this case, I don't think there's an obviously "correct" result for installing two wheels that both include the same filename (with differing content). And I also don't believe that pip guarantees order of installation¹.
So running installs in parallel isn't as significant a degradation as you're suggesting, unless I'm missing something. The one place it is objectively worse is that if two wheels have two files in common, parallel install could end up with one from wheel A and one from wheel B - which means neither A nor B will work. Serial install will always leave one of A or B functional. But I don't think that's a huge improvement.
Letting someone destroy their own setup inadvertently is not a good user experience!
100% agree with this. And unconstrained parallel installs look like they will make it harder to achieve the goal in #8119, so I think it's reasonable to ask that this PR "doesn't do things that will make #8119 impossible".
¹ The new resolver installs dependencies before dependents, but otherwise selects the order arbitrarily.
@chrahunt kudos on the quick toy implementation! very nice!
os.open("filename", os.O_CREAT | os.O_EXCL)
which will fail if the file already exists, and is of course an atomic call.WDYT?
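For example (a minimal standalone illustration, with a relative path standing in for the real install target):

```python
import os

def claim_file(path):
    # Atomically create the file; a second wheel providing the same path
    # gets FileExistsError instead of silently overwriting the first one.
    try:
        return os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None

first = claim_file("simple.py")    # succeeds: we own the file
second = claim_file("simple.py")   # None: conflict detected, report to the user
```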
in my view, "race condition" implies "non-deterministic chance of getting a broken result rather than a success"
I think this view is fine - what matters isn't the definition but the fact that we're trying to avoid bad outcomes (for users, for maintenance burden, for debugging).
The situation I gave above was one in which a race condition can occur under a multi-process installation, not a race condition itself. There are places in code that we make assumptions that could be rendered false and lead to failure in the face of multi-process installation. See #8562 (comment) for an explicit example: now if this code overlaps with this code in separate processes then the latter will raise an exception¹. We probably make similar assumptions in the rest of the wheel installation code, the internal utilities it uses, 3rd party libraries we use, and standard library modules we use.
Using code in multiprocessing forces contributors to consider those situations (and more) on an ongoing basis (e.g. development, code review). That can be worth it, but we should be intentional about deciding it.
¹ In the proposed implementation an exception will be swallowed and we will try to install sequentially. IMO that approach leads to a much worse experience, because terminal failure of any one wheel installation will leave all but that one wheel installed. That environment would potentially be much more broken than today.
Point taken.
If I now understand correctly, you're flagging a general problem with any form of parallel installation, not just a possible issue with this PR. Which would imply that any proposal to do parallel installs needs to start with (some form of) design document, discussing the trade-offs, integrity guarantees that we want to make, and explicit "undefined behaviour" situations where we don't offer guarantees and consider the use case unsupported.
That seems fair - although in practice, I suspect it will result in parallel installs being postponed for a relatively long time - in my experience open source workflows are not good at handling the sort of broad design questions we're talking about here. I'd love to be proved wrong, though!
@chrahunt I might be missing something here, but with all examples, you will end up with a broken (i.e. overwritten) setup. This is bad, but unavoidable. The only thing you need is to make sure you convey it to the user.
Are we in agreement on this one?
If we are, then again the implementations you pointed to are all atomicity issues, not race conditions¹, and yes they should probably be dealt with; I think we can assume locks are supported. Anyhow, I think that at least to begin with, parallel installation / download / resolving should be turned off by default. It only makes sense to have this feature when pip is used to build an entire environment, and then the speed benefit is huge, and I hope it is safe to assume you do not have packages overwriting one another.
¹ An atomic operation assumes correctness outside of the atomic call, as long as only a single atomic call is executed on the "system" at a single moment. A race condition is defined when there is no guarantee who will get to point A first, but that does not assume incorrectness of code; if it breaks correctness, then this falls under the need for an atomic section. In your example even introducing an atomic section will not solve the race condition: there is no guarantee which package will be installed first, and that is okay, as long as we do not end up with two packages installed without knowing they overwrote one another.
That seems fair - although in practice, I suspect it will result in parallel installs being postponed for a relatively long time
@pfmoore I'm with you on this one, and IMHO this is exactly the opposite of the open-source spirit. If you try to "play it safe" and try to guarantee all scenarios, you are limiting your ability to introduce innovation. Pip dropping Python 2.7 and soon Python 3.5 support is a good example; we have to keep moving forward. I think that trying to protect against extreme scenarios which are broken to begin with is missing the point of speeding up the process. And since I think that these days accelerating the pip install process will affect so many users and systems, as they rely on pip to restore environments and do that very often, this is a notable goal. Just my 2 cents :)
BTW:
This is a draft implementation, and should be considered as a proof-of-concept :)
IMHO this is exactly the opposite of the open-source spirit. If you try to "play it safe" and try to guarantee all scenarios, you are limiting your ability to introduce innovation.
This, I disagree with. The problem here is a combination of compatibility and support, it's nothing to do with an "open-source spirit". Pip is used so widely, and in so many different workflows and environments, that we have to be extremely careful to not break things - that's true just as much for an open source project as for a paid one. IMO, the reason open source has become so popular and common, is precisely because volunteers and open source maintainers care about their projects and users, and take issues like backward compatibility seriously (typically more seriously than many commercial vendors).
Open source gives us more options and flexibility to innovate, but not at the expense of our users.
you will end up with a broken (i.e. overwritten) setup. This is bad, but unavoidable. The only thing you need is to make sure you convey it to the user. Are we in agreement on this one?
We disagree here. The point of #8119 is to introduce a deprecation notice, which would turn into a hard failure. So I think a broken setup will be avoidable in that case.
Anyhow I think that at least to begin with, parallel installation / download / resolving, should be turned off by default. It only makes sense to have this feature when pip is used to build an entire environment, and then the speed benefit is huge, and I hope it is safe to assume you do not have packages overwriting one another.
I don't think this addresses the maintenance overhead I mentioned above. Once we support multi-processing here then it's something we need to think about during code review and development, regardless of whether it is the default. On a side note, having code that is not often run increases the chances that we will break it inadvertently.
If you try to "play it safe" and try to guarantee all scenarios, you are limiting your ability to introduce innovation.
And that is totally OK! Everything is a trade-off. Having slightly slower installs that, IMO, are easier to reason about, review code for, contribute code for, and support users of may be more aligned with our desired outcomes.
And since I think that these days accelerating the pip install process will affect so many users and systems, as they rely on pip to restore environments and do that very often, this is a notable goal.
I agree 100%, this is an impactful thing to look into.
This is a draft implementation, and should be considered as a proof-of-concept :)
Noted!
@pfmoore I think I failed to explain myself; I'm not saying we should break stuff or drop backwards compatibility, that would be horrible. My point is that a proper design document that covers all angles is challenging and very hard to achieve in the asynchronous workflows of an open-source community. This is why so many times these things evolve organically and iteratively, where each step improves on the previous one. This is basically my suggestion in regards to parallelization in general. This also connects with @chrahunt's point: it might take a few iterations until we achieve stability, and that is exactly why I'm against making parallelization the default behavior; I really think this is too complicated to end with a single PR.
We disagree here. The point of #8119 is to introduce a deprecation notice, which would turn into a hard failure. So I think a broken setup will be avoidable in that case.
@chrahunt, my bad, I was not aware this is a warning before installation, and not during/after. BTW I would imagine that just listing files before unzipping them will take a toll on speed, but of course you can always parallelize the check itself... Anyhow, you are most definitely correct; the only way to implement warn-before-unzip is to test everything before installing. And again I do not understand whether #8119 produces a warning before the specific wheel is unpacked or should stop the installation before any package is unpacked, but I guess that is a discussion that should be in the #8119 thread.
I don't think this addresses the maintenance overhead I mentioned above. Once we support multi-processing here then it's something we need to think about during code review and development, regardless of whether it is the default. On a side note, having code that is not often run increases the chances that we will break it inadvertently.
It does not address the maintenance issue, and this is always challenging with multiprocessing workloads. That said, this is exactly why I think this feature should only be turned on with a specific flag when installing, since it will take time until it is stable.
I think a step in the right direction was taken by @McSinyx in #8320, which introduces parallelization utilities that will be used throughout the code, in order to solve any atomic-section, races, etc.
Looks like there's been some spirited meta-esque discussion here, containing words and phrases like "innovation" / "Everything is a trade-off" / "open source spirit" / "flexibility to innovate" etc and on interpretations of meaning of phrases. I don't want to add to that spirited discussion (mostly because I don't have the mental bandwidth for it right now).
I have faith that everyone here will take a moment to step away if they need to, and avoid escalating things.
The new resolver installs dependencies before dependents
Well... the old one did that too! However, that logic is difficult to understand (for me at least), without spending a lot of time frowning at the screen and going "huh?" multiple times over the course of 20 minutes.
(yes, I've spent a substantial amount of time on the old resolver. yes, I am eagerly waiting to delete all of that code.)
in order to solve any atomic-section, races, etc
Those utilities are not going to "solve" race conditions and logical/design issues. They do not avoid the need to consider the implications of parallel/async programming.
They avoid having the entire codebase riddled with calls to multiprocessing/multithreading etc, and provide a clean way to have fallback-logic when a platform doesn't have support for some parallelization routine. [note 1]
I think everything I've said in https://github.com/pypa/pip/issues/8187#issuecomment-636508240 is still True. Allow me to use more words what I was trying to convey by referring to Brett's post:
This would be wonderful and I would love if we can make this speedup happen. But, we ought to consider both the benefits and drawbacks of it.
In an ideal case, we'd have no drawbacks when comparing parallel installs to serial installs. Sadly, we're not in an ideal world (gestures at everything) so I think we ought to figure out how parallel installs would differ from serial installs. Knowing this will help us when we're implementing this, reviewing the implementation, planning the rollout and communicating to users during the rollout.
Yes, all this is a lot of work (and perhaps "inertia" even), but I don't want us to release something that'll "not work" for a substantial chunk of the Python community. If someone disagrees and thinks we should not be careful about avoiding disruption, please say so and elaborate on why.
OTOH, I really don't care how we do this -- all I know is that I'm not volunteering to do this work myself. Whether this takes the form of prototyping where we discover things as we go, or a design document that we discuss before starting implementation, or some other workflow -- I really don't care, as long as I can provide feedback before merging the code into master (since master is where we make releases to end users).
I'll note that poetry recently implemented parallel installs and there's ideas that can be borrowed if we want to do so.
Thanks @pradyunsg ! That sounds like a plan to me.
I will gladly start a design document draft, as obviously this feature is dear to my heart :)
What would be the best way to have everyone contribute to the document (I'm not aware of any docs-feature built into GitHub ...) ?
Should I just open a repo with a markdown document, and everyone could edit it via commits ?
@bmartinn Awesome! Thanks for being willing to champion this! ^>^
If you wanna collaboratively work on a design document, I'd suggest just pick a platform that lets you put text in a place and share it -- a GitHub Gist / HackMD / Etherpad / Google Docs / DropBox Paper etc. :)
The main difference between these platforms is the level of access control for various folks -- I think Google Docs with "view" access would likely work best, since folks can (publicly) suggest edits and make comments on specific parts of the content, while the author/owner is the only person who can actually make those changes. YMMV tho, and as long as it's possible to provide feedback on the content "near" it (i.e. not in this issue), I'm on board. :)
I'm passing by just to mention that after GH-8562, caching unpacked wheels is no longer needed.
I'm passing by just to mention that after GH-8562, caching unpacked wheels is no longer needed.
What is the meaning of GH and expansion for other labels?
It's the same as #8562, but IMHO it looks classier, and I grew a habit of using it after watching projects with multiple issue trackers (not at the same time though, but they use GH-* to distinguish from issues on another tracker).