Conda-forge.github.io: CI for GPU packages

Created on 16 May 2020 · 18 Comments · Source: conda-forge/conda-forge.github.io

Some packages require actual GPU devices to run tests, and this is currently not possible - is there any way we can get this to happen?

This has already been mentioned in #901 by @leofang:

Is CI set up to build and test GPU packages? If not, where is this done?

However, I think this is a much more specific question than what #901 tries to address, so I'm opening this issue. Note that the (somewhat stalled) discussion in #902 might provide a clue as to the way forward:

MS-rep: We're soon going to start private preview of elastic self-hosted pools. Basically, we'll do the elastic management side for you; you run it in your Azure subscription so you can have whatever beefy machine you want.

@mariusvniekerk: That sounds great. So essentially these will run the same build host configurations as stock ones, just with different hosting? Or do we have to build the machine images ourselves?

For more details see discussion there; but maybe there are other/better ways as well?

A non-exhaustive list of packages that are affected by this: pytorch, cupy, pyarrow, faiss, etc.

Most helpful comment

So, for some context for everyone subscribed to this issue (not least all those people I tagged).

In the meantime, I tried to procure some initial funding for the idea of a jointly sponsored build queue from my employer (a small-ish data consulting company based in Switzerland; only a small fraction of our work is related to conda, but we're interested in the health of the ecosystem, particularly from the testing & security side of the story), and got approval for $500/month for a year as an experiment. My hope is that by placing some initial chips on the table, some others of the (much more involved) players might be enticed to join as well. 🙃

That $6000/year is roughly the cost of continuously running one NC6 agent on Azure (the smallest GPU instance). This likely wouldn't be enough to cover all of conda-forge's needs, but I'm guesstimating that roughly 3-4 times that much would be more than enough at the moment (e.g. the drone queue is also managing on two machines).

That amount represents around a person-week of engineer time (cf. NEP46), which I think should be a drop in the bucket for companies that employ people who spend any non-trivial amount of time on packaging GPU-related software, especially compared to the time lost by doing it disjointly, or to the time that could be saved by doing it through conda-forge.
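To make the arithmetic behind these figures explicit, here is a rough sketch. The pledge and the "3-4 times" multiplier are taken from the comment above; the hourly rate is merely derived from them and is not a quoted Azure price.

```python
# Back-of-the-envelope budget for a jointly sponsored GPU build queue.
# Inputs come from the comment above; the hourly rate is derived, not quoted.

HOURS_PER_YEAR = 24 * 365  # one agent running continuously

monthly_pledge_usd = 500
annual_pledge_usd = monthly_pledge_usd * 12

# Hourly price implied by "one NC6 agent for roughly this annual amount".
implied_hourly_rate_usd = annual_pledge_usd / HOURS_PER_YEAR

# "Roughly 3-4 times that much", expressed as an annual range.
estimated_need_usd = (3 * annual_pledge_usd, 4 * annual_pledge_usd)

print(f"annual pledge: ${annual_pledge_usd}")
print(f"implied rate: ${implied_hourly_rate_usd:.2f}/hour")
print(f"estimated need: ${estimated_need_usd[0]} to ${estimated_need_usd[1]} per year")
```

Actual instance pricing varies by region and over time, so the implied rate is only a sanity check on the estimate.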

I suggested this to @conda-forge/core, and it turns out that there is at least one other proposal along those lines currently in the pipeline. This would be great from my POV, because unifying all those efforts is IMO the ideal scenario (assuming the legalities are solvable). In any case, those discussions are now slowly unfolding, and perhaps provide some background colour to the opening of #1272 & #1273.

In short: it would be amazing if any of the involved people / companies could take this as impetus to chip in something as well. GPU computing is only ever going to get larger, and I believe that sharing some (comparatively low) CI infra costs to enable conda-forge to do the building & integration would provide huge bang-per-buck for the people & companies that are building & using such packages.

All 18 comments

Thanks, @h-vetinari. This is an important request. In the case of CuPy, we have repeatedly hit bugs when attempting to enable some (experimental) support that upstream's CI did not cover sufficiently. Having CF's own CI would be very helpful.

That said, in our case this could have been avoided completely if upstream had tested it thoroughly, and I do feel this is the right way to go, especially for GPU packages. The upstream CI should have a large build matrix (Python ver * NumPy ver * CUDA ver * OS ver * ...) to provide good coverage. In contrast, CF's CI should focus only on "getting the packaging right" and nothing further. Taking CuPy as an example, it is impractical to run its full test suite on CF's CI, as each run takes 1 to 1.5 hours and we have 12 builds from the aforementioned matrix. It simply takes too long.
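A quick sketch of why the full suite is impractical. The axis values below are hypothetical, chosen only so that the matrix has 12 entries as mentioned above; they are not CuPy's actual build matrix. The per-run duration is the figure from the comment.

```python
from itertools import product

# Hypothetical axis values, picked only to yield the 12 builds mentioned above.
python_versions = ["3.6", "3.7", "3.8"]
cuda_versions = ["10.1", "10.2"]
platforms = ["linux-64", "win-64"]

build_matrix = list(product(python_versions, cuda_versions, platforms))

# Each full test run takes 1 to 1.5 hours, per the comment above.
min_hours_per_run, max_hours_per_run = 1.0, 1.5

total_min_hours = len(build_matrix) * min_hours_per_run
total_max_hours = len(build_matrix) * max_hours_per_run

print(
    f"{len(build_matrix)} builds -> "
    f"{total_min_hours:.0f} to {total_max_hours:.0f} GPU-hours per full run"
)
```

Every additional axis (for instance NumPy version) multiplies that total, which is the argument for keeping the packaging CI to a small smoke test.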

cc: @jakirkham

cc @scopatz

I wanted to bring up something along the lines of this again:

The birds

Given the effort expended for packaging GPU packages, the role of conda(-forge) in the scientific/ML stack (including "network effect") & the capabilities of existing conda(-forge) infrastructure, it would kill a lot of metaphorical birds with one stone if conda-forge CI had support for GPUs, because:

  • a lot of redundant efforts could be saved
  • without sacrificing CI quality
  • yielding high-quality packaging with large os-/arch-/version-coverage

The stone - A jointly sponsored build queue for conda-forge?

In the comment I referenced above, I just mentioned Microsoft (who are by now powering most of the CI of CF) as a possible sponsor for this, but that goes even more so for the companies that are more directly involved, like NVIDIA (obviously), Facebook (pytorch, but also faiss, etc.) and perhaps others like NumFOCUS, Ursa Labs (arrow), Quansight or QuantStack.

I'm thinking that an opt-in build queue based on a separate, GPU-enabled Azure subscription (which is already feasible for CF) would have huge bang-for-buck even just for the companies directly involved (in the form of less time spent by employees on packaging), to say nothing of the ecosystem.

I apologise for the multi-ping, but I'm hoping that by bringing different people to the same table, a way forward might emerge more quickly & easily. 🙃

Some NVIDIA / RAPIDS / CuPy folks
@cjnolet @dantegd @leofang @kkraus14 @mike-wendt @teju85
Some Facebook folks
@beauby @ezyang @mdouze @seemethere @soumith
Some NumFOCUS / Quansight / Ursa Labs peeps
@kszucs @pearu @scopatz @rgommers @wesm
Microsoft
@vtbassmatt
conda-forge / anaconda + other possibly relevant parties
@conda-forge/core @hadim @hmaarrfk @jph00

Happy holidays!

PS. I've had this idea for a while, but the thought was retriggered by the new GPU support for pytorch in conda-forge, which however times out (requiring manual builds) and does not test with actual GPUs (plus I had some time to write these words).

We should all chat offline. We have other things going on around this as well and hope to make progress in the new year.

Another idea would be to use a self-hosted runner.

https://docs.github.com/en/free-pro-team@latest/actions/hosting-your-own-runners/about-self-hosted-runners

For Windows it is trickier, but for Linux we should be OK to pool a few resources, maybe.

There are a bunch of practical and legal issues around this. The technical bits of setting it up are straightforward.

cc @datametrician @jakirkham from the NVIDIA side for visibility as well

Happy New Year! 🎉

In terms of the tech, you can probably use scale set agents with an appropriate choice of VM and image to achieve this.

From a funding/sponsorship perspective, Microsoft's contribution model is free hosted agents and parallelism.

@vtbassmatt are you saying the permissions on the conda-forge account already allow us to use agents w/ GPUs attached?

Yes, you (whoever is the organization admin) should be able to create a scale set pool pointing to an Azure subscription. The subscription will need a scale set running on the type of VM you prefer, in this case with GPU support. Then any pipelines authorized to use that pool will have access to that GPU-enabled virtual hardware.

If this isn't clear, my apologies and I can help out more next week when I'm back at work.
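Once a pipeline runs on such a GPU-enabled pool, a first step could confirm that the agent actually sees a device before any GPU tests start. The following is a minimal sketch, assuming an NVIDIA driver with the `nvidia-smi` tool on the agent; the helper name `visible_gpus` is made up for illustration and is not part of any conda-forge tooling.

```python
import shutil
import subprocess


def visible_gpus():
    """Return the device lines printed by `nvidia-smi -L`, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tooling on this agent
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L lists the GPUs attached to the machine
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    # Device lines look like "GPU 0: <name> (UUID: ...)".
    return [line for line in result.stdout.splitlines() if line.startswith("GPU ")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print(f"{len(gpus)} GPU(s) visible:")
        for line in gpus:
            print("  " + line)
    else:
        print("no GPU visible; GPU tests cannot run on this agent")
```

On a stock hosted agent without a GPU this simply reports that nothing is visible, which makes it easy to tell the two kinds of pool apart in the logs.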

Sounds like a Christmas gift for free!

That seems perfectly clear! Thank you!

@mariusvniekerk is one of our most knowledgeable Azure folks.

We will give this a shot and see what we find!

Ping @mariusvniekerk :)

Marius and I chatted. I think I misunderstood what was being said here. We'll need a hosted pool of VMs with GPUs to do this.

@beckermr: Marius and I chatted. I think I misunderstood what was being said here. We'll need a hosted pool of VMs with GPUs to do this.

That was my understanding when I tried to make the case for bringing together interested/affected parties to come up with (the funding for) such a hosted pool.

We have action here on multiple fronts. I am going to close this issue in favor of https://github.com/conda-forge/conda-forge.github.io/issues/1272.

