This is opened as a continuation of GH-7962, GH-8161, GH-8162, GH-3981 and GH-4654, and is one of the approaches to solving GH-825.
One common inconvenience with using `pip` is the delay caused by networking: most package indices are not particularly fast[citation needed], and during package management `pip` needs to fetch many things (the package list, the packages themselves, etc.). Parallelization is one obvious way to tackle this, and I hope it will be the cheaper one; hence this issue is opened to ensure that the implementation will not be labor-expensive work.
Until next year when Python 2 support is dropped, there are two options: multithreading and multiprocessing. While the latter is safer, (1) not every platform has multiple CPU cores and (2) the code would need a huge refactoring to give each process the data it needs. So we are left with multithreading. Python 3's `asyncio` is not an immediate solution either (plus it would also require making many existing routines awaitable).
Putting thread-safety aside (not because it's not a problem, but because I think everyone knows how problematic it is), the most obvious solution Python provides, `multiprocessing.dummy.Pool`, requires `sem_open` (bpo-3770), which seems to raise `ImportError` during initialization of the pool's attributes. Since `sem_open` is to be provided by the operating system, this raises the questions of whether `multiprocessing.dummy` is supported on the platforms that `pip` cares to support, and whether the (more generic?) `threading` module suffers the same issue if we implement the `Pool` ourselves. How about `concurrent.futures` (GH-3981)? Would it be worth it, from the developers' perspective as well as that of our users, if things go wrong on their platform?
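For what it's worth, `sem_open` availability can be probed up front; a minimal sketch (the flag name `HAVE_SEM_OPEN` is made up here), assuming that importing `multiprocessing`'s synchronization primitives is what triggers the `ImportError` on such platforms:

```python
# Probe for sem_open support: on platforms lacking it (bpo-3770),
# importing multiprocessing's synchronization primitives raises
# ImportError, which the pool machinery may otherwise hit at runtime.
try:
    import multiprocessing.synchronize  # noqa: F401
    HAVE_SEM_OPEN = True
except ImportError:
    HAVE_SEM_OPEN = False

print("sem_open available:", HAVE_SEM_OPEN)
```

A check like this would let the rest of the code pick a code path once, instead of wrapping every pool creation in a try/except.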
From GH-8162, IMHO it is safe to assume (this is a really dangerous thing to say :disappointed:) that we can fall back to `map` if `multiprocessing.dummy.Pool` can't have `sem_open`. If this works, I personally suggest declaring a higher-order function to reuse in other places, namely for parallel downloading of packages (GH-825). Still under the assumption that this is correct, we can easily mock the failing behavior for testing. However, given my modest experience in threading and the overwhelming responsibility of not breaking thousands[citation needed, could be millions] of people's workflows, please do not take my words for granted and kindly share your thoughts on this particular matter.
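The suggested higher-order function might look roughly like this (the name `map_multithread` and the exact fallback shape are only illustrative, not a committed design):

```python
def map_multithread(func, iterable, chunksize=1):
    """Apply func over iterable in parallel when possible, serially if not.

    Hypothetical helper: falls back to the builtin map when
    multiprocessing.dummy.Pool is unusable (e.g. sem_open missing,
    see bpo-3770).
    """
    try:
        from multiprocessing.dummy import Pool
        pool = Pool()
    except ImportError:  # platform without usable semaphore support
        return list(map(func, iterable))
    try:
        return pool.map(func, iterable, chunksize)
    finally:
        pool.close()
        pool.join()

print(map_multithread(lambda x: x * x, [1, 2, 3]))
```

Both branches return a list in input order, so callers would not need to care which path was taken.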
I think the broader question here is whether pip should support platforms that don't provide usable threading support. That's basically what @McSinyx said, but summarised down to the bare essential point.
From the Python documentation, the `threading` module is required in Python 3.7+ (before that it was optional), and `multiprocessing.dummy` is documented as just being a wrapper around `threading`. And `concurrent.futures` is available from Python 3.2, and I believe there's a backport as well.
We currently claim to support Python 3.5+ (I'm going to ignore Python 2, as we'll be dropping support for that in 2021, and it's not the real issue anyway). So on that basis, we need to cover platforms without threading¹, at least until we drop Python 3.5 and 3.6 support.
Personally, I'd suggest that what we do is have a compatibility module that implements whatever concurrency primitives we want, and has fallbacks for non-threaded platforms. We can then unit-test those wrappers to ensure that we behave the same with or without threading, and then we use the wrappers wherever we need them in the rest of the code. Once we drop support for platforms without threading, we can decide whether to keep the wrappers or use the core features directly.
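A rough sketch of how such a compat wrapper could be unit-tested on both paths, forcing the non-threaded fallback with `unittest.mock` (`map_parallel` is a hypothetical name here, not an existing pip API):

```python
import builtins
from unittest import mock

def map_parallel(func, iterable):
    """Hypothetical compat wrapper: threaded map with a serial fallback."""
    try:
        from multiprocessing.dummy import Pool
    except ImportError:  # platform without usable threading support
        return list(map(func, iterable))
    with Pool() as pool:
        return pool.map(func, iterable)

# Normal path: threading is available on this interpreter.
assert map_parallel(str.upper, ["a", "b"]) == ["A", "B"]

# Simulate a platform where the import fails, forcing the fallback.
_real_import = builtins.__import__

def _failing_import(name, *args, **kwargs):
    if name == "multiprocessing.dummy":
        raise ImportError("sem_open unavailable")
    return _real_import(name, *args, **kwargs)

with mock.patch("builtins.__import__", side_effect=_failing_import):
    assert map_parallel(str.upper, ["a", "b"]) == ["A", "B"]

print("both paths agree")
```

The key property being asserted is that the two code paths are observably identical, which is exactly what would let the rest of the codebase use the wrapper without caring about the platform.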
¹ #8161 was actually reported on Python 3.8.2, on Android Termux. If we take the Python docs seriously, that platform is broken by not providing a working threading implementation. I don't know how we want to deal with that. Python on mobile is an important enough area that I can see core Python being sympathetic to the idea of not being too strict here. Luckily, the point is irrelevant for now if we are going to support platforms without threading anyway.
> Python 3.5

We'll likely be dropping Python 3.5 at the same time as Python 2.7, btw, since Python 3.5 goes EoL in August / September 2020.
One potential consideration after 2021 is asyncio (if pip ever wants to use it). A lot of the async stuff uses threading as a backend when whatever it wants to do doesn't have OS-level event loop support.
I do not think it is important to support non-threading Pythons. We don't have to support it just because it is a possible option.
That being said, I think longer term we ideally want to use some form of async code instead of threading or multiprocessing directly (I would love to use trio, as it is much better than asyncio IMO, but it has some C stuff so it would be a much harder change).
Hi @McSinyx, I was not aware of this thread and just opened #8187. It has a reference implementation of parallelization of the install process. Bottom line: a 1.9x speedup :) It should support Python 2.7 and Python 3.5 because it relies on `ThreadPool` and `Pool`, both of which are available on 2.7/3.5.
Notice that in the reference implementation, multi-threading support is turned off by default; only adding `--parallel` will actually use the `ThreadPool`/`Pool`. This way we still support platforms with limited multiprocessing capabilities.
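This is not the actual #8187 implementation, but the opt-in pattern described above could be sketched as follows (all names here are illustrative):

```python
from multiprocessing.pool import ThreadPool

def install_all(requirements, install_one, parallel=False):
    """Install each requirement, using a ThreadPool only when opted in.

    With parallel=False (the default), this stays strictly serial, so
    platforms with limited threading support are unaffected.
    """
    if not parallel:
        return [install_one(req) for req in requirements]
    with ThreadPool() as pool:
        return pool.map(install_one, requirements)

def fake_install(req):
    # Stand-in for the real per-requirement install step.
    return "installed " + req

print(install_all(["a", "b"], fake_install))
print(install_all(["a", "b"], fake_install, parallel=True))
```

`pool.map` preserves input order, so the opt-in path returns the same results as the serial one, just faster on I/O-bound work.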
Notice how we solved the multi-instance progress bar issue for downloads; I would love some feedback on our solution, and any suggestions are welcome.