Conda-forge.github.io: CI for GPU packages

Created on 16 May 2020 · 18 Comments · Source: conda-forge/conda-forge.github.io

Some packages require actual GPU devices to run tests, and this is currently not possible - is there any way we can get this to happen?

This has already been mentioned in #901 by @leofang:

Is CI set up to build and test GPU packages? If not, where is this done?

However, I think this is a much more specific question than what #901 tries to address, so I'm opening this issue. Note that the (somewhat stalled) discussion in #902 might provide a clue as to the way forward:

MS-rep: We're soon going to start private preview of elastic self-hosted pools. Basically, we'll do the elastic management side for you; you run it in your Azure subscription so you can have whatever beefy machine you want.

@mariusvniekerk: That sounds great. So essentially these will run the same build host configurations as stock ones, just with different hosting? Or do we have to build the machine images ourselves?

For more details see discussion there; but maybe there are other/better ways as well?

A non-exhaustive list of packages that are affected by this: pytorch, cupy, pyarrow, faiss, etc.

h-vetinari

👍1

Most helpful comment

So, for some context for everyone subscribed to this issue (not least all those people I tagged).

In the meantime, I tried to procure some initial funding for the idea of a jointly sponsored build queue from my employer (a small-ish data consulting company based in Switzerland; only a small fraction of our work is related to conda, but we're interested in the health of the ecosystem, particularly from the testing & security side of the story), and got approval for $500/month for a year as an experiment. My hope is that by placing some initial chips on the table, some others of the (much more involved) players might be enticed to join as well. 🙃

That $6000/year is roughly what it costs to run one NC6 agent (the smallest GPU instance) on Azure continuously. This likely wouldn't be enough to cover all of conda-forge's needs, but I'm guesstimating that roughly 3-4 times that much would be more than enough at the moment (e.g. the drone queue is also managing on two machines).
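As a quick check of the arithmetic in this comment, here is a minimal sketch that uses only the figures quoted above ($500/month, and the "3-4 times" estimate). The helper name `yearly_budget` is hypothetical and only for illustration; the numbers are the commenter's estimates, not measured costs.

```python
# Figures taken from the comment above; the helper name is hypothetical.
MONTHLY_PLEDGE_USD = 500  # sponsorship approved per month
MONTHS_PER_YEAR = 12


def yearly_budget(monthly_usd: int, multiplier: int = 1) -> int:
    """Annual budget for `multiplier` times the monthly pledge."""
    return monthly_usd * MONTHS_PER_YEAR * multiplier


# One pledge roughly covers one NC6 agent for a year, per the comment.
one_agent = yearly_budget(MONTHLY_PLEDGE_USD)

# The comment guesses that 3-4 times that amount would cover current needs.
low, high = (yearly_budget(MONTHLY_PLEDGE_USD, m) for m in (3, 4))

print(f"one agent: ${one_agent}/year; estimated need: ${low} to ${high}/year")
```

Under these assumptions the estimated need comes to between $18000 and $24000 per year.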

That amount represents around a person-week of engineer time (cf. NEP46), which I think should be a drop in the bucket for companies that employ people who spend any non-trivial amount of time on packaging GPU-related software, especially compared to the time lost by doing it separately, or to the potential time saved by doing it through conda-forge.

I suggested this to @conda-forge/core, and it turns out that there is at least one other proposal along those lines currently in the pipeline. This would be great from my POV, because unifying all those efforts is IMO the ideal scenario (assuming the legalities are solvable). In any case, those discussions are now slowly unfolding, and perhaps provide some background colour to the opening of #1272 & #1273.

In short: it would be amazing if any of the involved people / companies could take this as impetus to chip in something as well. GPU computing is only ever going to get larger, and I believe that sharing some (comparatively low) CI infra costs to enable conda-forge to do the building & integration would provide huge bang-per-buck for the people & companies that are building & using such packages.

h-vetinari on 13 Mar 2021

👍5

All 18 comments

Thanks, @h-vetinari. This is an important request. For CuPy, we have repeatedly hit bugs when attempting to enable some (experimental) support that upstream did not cover well enough in their CIs. Having CF's own CI would be very helpful.

That said, in our case this could have been completely avoided if upstream had tested it thoroughly, and I do feel this is the right way to go, especially for GPU packages. The upstream CIs should have a large build matrix (Python ver * NumPy ver * CUDA ver * OS ver * ...) to provide good coverage. In contrast, CF's CIs should only focus on "getting the packaging right" and nothing further. Taking CuPy as an example, it is impractical to run its full test suite on CF's CIs, as each run takes 1 to 1.5 hours, and we have 12 builds from the aforementioned matrix. It simply takes too long.
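To put a number on "it simply takes too long", here is a rough sketch of the build-matrix arithmetic. The axis values below are hypothetical, chosen only so that the product has 12 entries as stated in the comment; the actual CuPy feedstock matrix is not given in this thread. The per-run duration range is the one quoted above.

```python
from itertools import product

# Hypothetical axis values: picked only so the matrix has 12 entries,
# matching the build count mentioned in the comment.
python_versions = ["3.6", "3.7", "3.8"]
cuda_versions = ["9.2", "10.0", "10.1", "10.2"]

# A build matrix is the Cartesian product of its axes.
matrix = list(product(python_versions, cuda_versions))
n_builds = len(matrix)

# Duration of one full test-suite run, in hours, as quoted in the comment.
min_hours, max_hours = 1.0, 1.5
total_min = n_builds * min_hours
total_max = n_builds * max_hours

print(f"{n_builds} builds: {total_min:g} to {total_max:g} GPU-hours per full test pass")
```

With 12 builds at 1 to 1.5 hours each, one full pass would take 12 to 18 GPU-hours if run serially, and every extra axis (NumPy version, OS version) multiplies that further.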

leofang on 18 May 2020

cc: @jakirkham

leofang on 20 May 2020

cc @scopatz

beckermr on 15 Aug 2020

I wanted to bring up something along the lines of this again:

The birds

Given the effort expended for packaging GPU packages, the role of conda(-forge) in the scientific/ML stack (including "network effect") & the capabilities of existing conda(-forge) infrastructure, it would kill a lot of metaphorical birds with one stone if conda-forge CI had support for GPUs, because:

a lot of redundant efforts could be saved
without sacrificing CI quality
yielding high-quality packaging with large os-/arch-/version-coverage

The stone - A jointly sponsored build queue for conda-forge?

In the comment I referenced above, I just mentioned Microsoft (who are by now powering most of the CI of CF) as a possible sponsor for this, but that goes even more so for the companies that are more directly involved, like NVIDIA (obviously), Facebook (pytorch, but also faiss, etc.) and perhaps others like NumFOCUS, Ursa Labs (arrow), Quansight or QuantStack.

I'm thinking that an opt-in build queue based on a separate, GPU-enabled Azure subscription (which is already feasible for CF) would have huge bang-for-buck even just for the companies directly involved (in the form of less time spent by employees on packaging), to say nothing of the ecosystem.

I apologise for the multi-ping, but I'm hoping that by bringing different people to the same table, a way forward might emerge more quickly & easily. 🙃

Some NVIDIA / RAPIDS / CuPy folks
@cjnolet @dantegd @leofang @kkraus14 @mike-wendt @teju85
Some Facebook folks
@beauby @ezyang @mdouze @seemethere @soumith
Some NumFOCUS / Quansight / Ursa Labs peeps
@kszucs @pearu @scopatz @rgommers @wesm
Microsoft
@vtbassmatt
conda-forge / anaconda + other possibly relevant parties
@conda-forge/core @hadim @hmaarrfk @jph00

Happy holidays!

PS. I had this idea for a while, but the thought got retriggered by the new GPU support for pytorch in conda-forge, which, however, times out (requiring manual builds) and does not test with actual GPUs (plus having some time to write these words).

h-vetinari on 29 Dec 2020

👍2

We should all chat offline. We have other things going on around this as well and hope to make progress in the new year.

beckermr on 29 Dec 2020

Another idea would be to use a self-hosted runner.

https://docs.github.com/en/free-pro-team@latest/actions/hosting-your-own-runners/about-self-hosted-runners

For Windows it is trickier, but for Linux we should be OK to pool a few resources, maybe.

hmaarrfk on 29 Dec 2020

There are a bunch of practical and legal issues around this. The technical bits of setting it up are straightforward.

beckermr on 29 Dec 2020

cc @datametrician @jakirkham from the NVIDIA side for visibility as well

kkraus14 on 29 Dec 2020

Happy New Year! 🎉

In terms of the tech, you can probably use scale set agents with an appropriate choice of VM and image to achieve this.

From a funding/sponsorship perspective, Microsoft's contribution model is free hosted agents and parallelism.

vtbassmatt on 30 Dec 2020

@vtbassmatt are you saying the permissions on the conda-forge account already allow us to use agents w/ GPUs attached?

beckermr on 30 Dec 2020

Yes, you (whoever is the organization admin) should be able to create a scale set pool pointing to an Azure subscription. The subscription will need a scale set running on the type of VM you prefer, in this case with GPU support. Then any pipelines authorized to use that pool will have access to that GPU-enabled virtual hardware.

If this isn't clear, my apologies and I can help out more next week when I'm back at work.

vtbassmatt on 30 Dec 2020

👀1 👍1

Sounds like a Christmas gift for free!

leofang on 30 Dec 2020

That seems perfectly clear! Thank you!

@mariusvniekerk is one of our most knowledgeable Azure folks.

We will give this a shot and see what we find!

beckermr on 30 Dec 2020

Ping @mariusvniekerk :)

h-vetinari on 9 Jan 2021

Marius and I chatted. I think I misunderstood what was being said here. We'll need a hosted pool of VMs with GPUs to do this.

beckermr on 9 Jan 2021

@beckermr: Marius and I chatted. I think I misunderstood what was being said here. We'll need a hosted pool of VMs with GPUs to do this.

That was my understanding when I tried to make the case for bringing together interested/affected parties to come up with (the funding for) such a hosted pool.

h-vetinari on 9 Jan 2021

👍1

We have action here on multiple fronts. I am going to close this issue in favor of https://github.com/conda-forge/conda-forge.github.io/issues/1272.

beckermr on 12 Mar 2021
