Spack: Alternate CUDA provider

Created on 19 Oct 2020 · 21 comments · Source: spack/spack

The nvhpc package bundles CUDA, but there is currently no way to use it as a cuda provider.

Continues discussion started in https://github.com/spack/spack/pull/19294#issuecomment-708470862.

Description

The NVIDIA HPC SDK is a comprehensive set of compilers, libraries, and tools. The nvhpc package currently exposes the compilers, CPU math libraries (+blas, +lapack), and MPI (+mpi). While the HPC SDK includes CUDA and CUDA math libraries, they are not currently exposed (no +cuda). The included CUDA may be used with other compilers and is not limited to the NV compilers.

CUDA is currently provided by the cuda package. A virtual package cannot exist with the same name as a real package.

Potential solutions:

  1. Create a new virtual package with a name like cuda-virtual (packages would then have to change their depends_on declarations to indicate that any provider of cuda-virtual is acceptable).
  2. Rename the cuda package, for instance to cuda-toolkit, and have it provide cuda; the nvhpc package could then also provide cuda (see the sketch after this list).
  3. Packages explicitly depend_on('nvhpc') to use the CUDA bundled with the HPC SDK.
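
For concreteness, here is a minimal sketch of what option 2 could look like in package.py terms, assuming the existing cuda package is renamed cuda-toolkit. The class layout, version numbers, and checksums are illustrative placeholders, not final recipes:

```python
from spack import *

# Hypothetical var/spack/repos/builtin/packages/cuda-toolkit/package.py
class CudaToolkit(Package):
    """CUDA Toolkit (the current `cuda` package, renamed)."""

    version("11.0.2", sha256="...")  # placeholder checksum

    # Each CTK release provides exactly its own CUDA version.
    provides("cuda@11.0.2", when="@11.0.2")


# Hypothetical additions to var/spack/repos/builtin/packages/nvhpc/package.py
class Nvhpc(Package):
    """NVIDIA HPC SDK (existing compiler/BLAS/LAPACK/MPI bits omitted)."""

    variant("cuda", default=False, description="Expose the bundled CUDA")

    # Illustrative version mapping: an SDK release bundles a range of
    # CTK releases, so it can provide that range of the cuda virtual.
    provides("cuda@10.1:11.0", when="@20.9+cuda")


# Downstream packages would then depend on the virtual instead of the
# concrete package, e.g. depends_on("cuda@10.2:", when="+cuda"), and the
# concretizer could pick either provider.
```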

The same issue also applies to nccl. The HPC SDK includes NCCL, but it is already provided by the nccl package.

cc @scheibelp

Labels: cuda, feature, virtual-dependencies


All 21 comments

I vote for solution 2: rename cuda to cuda-toolkit and have both packages provide cuda. For nccl, maybe something like nvidia-nccl?

Also pinging our official CUDA maintainers: @ax3l @Rombur

I also think solution 2 is the way to go.

Agreed.

I reached out to one of the NVHPC architects, @brycelelbach, and got the following detailed info on nvhpc and the cuda toolkit:

  • The NVHPC SDK supports multiple versions of the CUDA toolkit. Currently, downloads provide either a bundle with only the newest CUDA version or another bundle with the newest plus two previous CUDA versions, i.e. provides("[email protected]:11.1").
  • NVHPC packages existing releases of CUDA; it will never ship a special version of its own.
  • Solution 2 seems right.

This has come up a couple times very recently (SDKs including implementations of packages) and I am considering an alternative approach:

  • The SDK package (nvhpc) can come with a method to define externals that reference its installation prefix
  • A new concretizer directive could be included (something like supplies) that tells the concretizer that the SDK comes with additional packages (which would fit well with the concretize-together option in environments)

The goal of this would be to avoid the work of converting packages to virtuals when an SDK provides them (so I agree that once packages are converted to virtuals that Spack should be able to resolve these sorts of issues, but I think it would be ideal to avoid the need for that conversion).
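
To make the idea concrete, here is a rough sketch of what such a directive might look like in the nvhpc recipe. Note that supplies and the supplied_prefix hook below do not exist in Spack; they are purely hypothetical, and the version numbers and directory layout are assumptions:

```python
from spack import *

class Nvhpc(Package):
    """NVIDIA HPC SDK (existing directives omitted)."""

    # Hypothetical directive: tell the concretizer that this installation
    # also carries these packages, so they can be treated as externals
    # rooted in the SDK prefix instead of being rebuilt.
    supplies("cuda@11.0", when="@20.9")  # versions illustrative only
    supplies("nccl", when="@20.9")

    def supplied_prefix(self, name):
        """Hypothetical hook: where a supplied package lives inside the
        SDK installation, e.g. <prefix>/Linux_x86_64/20.9/cuda."""
        ...
```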

This is based on the assumption that nvhpc is not providing a distinct CUDA implementation (in the sense that openmpi provides a distinct implementation of MPI) but instead is downloading the same binaries that you could get when installing the CUDA package directly. I should say that if an SDK does provide a distinct version, making it a provider could be an appropriate choice.

I'm curious what you all think of that.

My product manager pointed out another caveat I didn't fully cover: while NVHPC packages specific existing versions of the CUDA toolkit, the CUDA libraries (CUBLAS, etc.) are independently semantically versioned, and the versions in an NVHPC release may differ from the versions in the corresponding CUDA toolkit release. There are also libraries in the NVHPC SDK that are not in the CTK.

So, to summarize:

  • There are CUDA toolkit releases, which contain NVCC, a CUDA runtime, a subset of CUDA libraries, and an NVIDIA driver.

    • NVCC and the CUDA runtime use the same version as the CUDA toolkit.

    • The NVIDIA driver has a distinct version from the CUDA toolkit (rNNN); a minimum version is required for each CUDA toolkit.

    • The CUDA libraries have an independent semantic version scheme from the CUDA toolkit.

  • There are NVHPC SDK releases, which contain multiple CUDA toolkits, NVC++, NVFORTRAN, and all the CUDA libraries.

    • There are multiple versions of NVCC and the CUDA toolkit in an NVHPC release. These versions are always some existing release of NVCC and the CUDA toolkit; we don't introduce new versions of NVCC and the CUDA toolkit in an NVHPC release.

    • The CUDA libraries in an NVHPC release may not be the same versions as those included with the CUDA toolkit; they're independently versioned. New versions of CUDA libraries may be introduced by NVHPC SDK releases.

    • NVC++, NVFORTRAN, and some CUDA libraries are only released in the NVHPC SDK.

So, my suggestions:

  • The CUDA toolkit package should provide an NVCC package (with CUDA toolkit versioning) and a CUDA runtime package (with CUDA toolkit versioning).
  • The CUDA toolkit package should provide a package for each CUDA library (with independent semantic versioning).
  • The NVHPC SDK should provide multiple NVCC packages (with CUDA toolkit versioning) and CUDA runtime packages (with CUDA toolkit versioning).
  • The NVHPC SDK should provide a package for each CUDA library (with independent semantic versioning).
  • The NVHPC SDK should provide NVC++ and NVFORTRAN packages (with NVHPC SDK versioning).

Given all the above, I'd still encourage something like option 2.
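
A rough sketch of how those suggestions might translate into Spack provides declarations follows. None of these virtuals (nvcc, cudart, cublas, nvcxx, nvfortran) exist today, and the version mapping is illustrative only:

```python
from spack import *

class Nvhpc(Package):
    """NVIDIA HPC SDK (existing directives omitted)."""

    # NVCC and the CUDA runtime follow CUDA toolkit versioning; one SDK
    # release bundles several toolkit versions.
    provides("nvcc@10.1:11.0", when="@20.9")
    provides("cudart@10.1:11.0", when="@20.9")

    # CUDA libraries are semantically versioned independently of the CTK,
    # and the SDK may ship newer ones than the matching CTK release.
    provides("cublas@11.2", when="@20.9")  # version illustrative only

    # NVC++ and NVFORTRAN only ship with the SDK, so they carry SDK
    # versioning.
    provides("nvcxx@20.9", when="@20.9")
    provides("nvfortran@20.9", when="@20.9")
```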

Now that I've given this some more thought, I think it would be useful for me to understand how y'all package all the NVIDIA-related stuff. E.g. what are the packages you have today, and what version schemes are associated with them.

We, NVIDIA, could potentially take on some of the work here in defining Spack packages, if that would be helpful.

We would love to get contributions directly from NVIDIA. Another thing you can do is add the GitHub handles of any NVIDIA employees who would like to be listed as official maintainers for the build recipe. This gives us someone to ping when we review PRs or get reports of build issues.

Ah, it seems there was some confusion. @samcmill is my coworker at NVIDIA and the relevant person. I think neither I nor Axel realized he filed this bug!

In case folks don't realize it, I am employed by NVIDIA. We recently contributed support for the NVIDIA HPC SDK (#19294). Please go ahead and ping me if there are any NV software issues.

This sounds fantastic; yes, we would love to add your GitHub handles as co-maintainers, e.g. to the cuda package, so you receive pings on them.

We currently ship a package called cuda that provides the CTK (spack edit cuda). We also have thrust and cub packages that could be used to install a development version and/or an older/newer version.

Spack has a pretty on-point Python DSL in its package.py files that pop up if you spack edit <package>. The class name inside such a file is the package name, e.g. class Cub(Package) inside var/spack/repos/builtin/packages/cub/package.py is cub, class NlohmannJson(CMakePackage) inside var/spack/repos/builtin/packages/nlohmann-json/package.py is nlohmann-json.

Packages can also provide("virtual-name") other packages, e.g. openmpi and mpich both provide mpi at a certain version range in their package.py. One could potentially make cuda, thrust, cub, etc. virtual packages that are provided by various packages like cuda-toolkit and nvhpc or thrust-oss. @adamjstewart et al. can definitely brief you further; for now, let me point you to the tutorial and packaging guide. There is also an open discussion in #19269 on whether we should provide("cublas") et al. from the cuda (CTK) package.
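
For illustration, this is roughly how the existing mpi virtual works today; the provider declarations are abridged from the real openmpi and mpich recipes, and the consumer package is hypothetical:

```python
from spack import *

# Providers declare which versions of the virtual they implement
# (constraints abridged from the real recipes).
class Openmpi(AutotoolsPackage):
    provides("mpi@:3.1", when="@2.0.0:")

class Mpich(AutotoolsPackage):
    provides("mpi@:3.0", when="@3.0:")

# A consumer only asks for the virtual; the concretizer picks a provider.
class MySolver(CMakePackage):
    depends_on("mpi")
```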

Another neat little CUDA thing we do is provide a mixin class for packages that (optionally) depend on CUDA. Above, cub is a Package with simple install logic (it downloads and runs the install phase; patch-ing is optional, etc.), whereas thrust is a CMakePackage that additionally knows the cmake-configure, build, and install phases and has a method for CMake arguments, among others. Everything that is not overridden as a method is taken from the defaults (e.g. compare spack edit thrust with spack edit adios2).

The CudaPackage class, defined in lib/spack/spack/build_systems/cuda.py, maintains host-compiler conflicts and provides unified package variants (options) to select the GPU architecture. It's currently used by ~150 packages, which just derive from it in their package.py, and reduces duplication (see also PR #19038).
As an example of how this looks, see spack edit paraview and spack info paraview, which list cuda and cuda_arch among the build variants; both are derived from CudaPackage.
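
A minimal sketch of a hypothetical downstream package using the mixin (the package itself and its CMake flag are made up; the cuda and cuda_arch variants and the cuda dependency come from CudaPackage):

```python
from spack import *

class MyGpuSolver(CMakePackage, CudaPackage):
    """Hypothetical package; deriving from CudaPackage contributes the
    cuda and cuda_arch variants, depends_on("cuda", when="+cuda"), and
    the host-compiler conflicts."""

    version("1.0", sha256="...")  # placeholder checksum

    def cmake_args(self):
        args = []
        if "+cuda" in self.spec:
            # cuda_arch is a multi-valued variant provided by the mixin.
            archs = self.spec.variants["cuda_arch"].value
            args.append("-DCMAKE_CUDA_ARCHITECTURES=" + ";".join(archs))
        return args
```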

There are also libraries in the NVHPC SDK that are not in the CTK.

@brycelelbach Could you mention an example?

I also have a few other questions:

  • Is there a means to obtain the CUDA libraries other than via the CUDA toolkit or by the NVHPC SDK?
  • Would you have a means to identify the different libraries provided by the CUDA toolkit and the NVHPC SDK (as well as their versions)?
  • It is mentioned that NVHPC SDK provides all the libraries that the CTK does: are the libraries that overlap provided by the CTK instance that is bundled with the NVHPC SDK, or are there distinct libraries provided (in which case I assume the CTK is useful for the driver/runtime/nvcc)?

@brycelelbach Could you mention an example?

NCCL, cuTENSOR.

Is there a means to obtain the CUDA libraries other than via the CUDA toolkit or by the NVHPC SDK?

Typically, yes, there's a way to obtain just the library from our website or otherwise.

Would you have a means to identify the different libraries provided by the CUDA toolkit and the NVHPC SDK (as well as their versions)?

Uh, can you elaborate on what you want? The documentation for both packages should list the contents.

It is mentioned that NVHPC SDK provides all the libraries that the CTK does: are the libraries that overlap provided by the CTK instance that is bundled with the NVHPC SDK, or are there distinct libraries provided (in which case I assume the CTK is useful for the driver/runtime/nvcc)?

The overlapping libraries are provided by the NVHPC SDK, and may not be the same version that was packaged with the associated CTK versions.

Could you mention an example?

NCCL, cuTENSOR.

Is there a means to obtain the CUDA libraries other than via the CUDA toolkit or by the NVHPC SDK?

Typically, yes, there's a way to obtain just the library from our website or otherwise.

I think the Spack nccl package is an example of this: it downloads from (e.g.) https://github.com/NVIDIA/nccl/archive/v2.7.3-1.tar.gz

Does the NVHPC SDK provide a distinct version of NCCL, or is it a compiled instance of an archive available at https://github.com/NVIDIA/nccl/archive/?

Would you have a means to identify the different libraries provided by the CUDA toolkit and the NVHPC SDK (as well as their versions)?

can you elaborate on what you want? The documentation for both packages should list the contents.

For NVHPC SDK I see https://docs.nvidia.com/hpc-sdk/index.html at https://developer.nvidia.com/hpc-sdk which leads me to https://docs.nvidia.com/hpc-sdk/hpc-sdk-release-notes/index.html. That table includes mapped versions of e.g. NCCL for the NVHPC SDK version 20.9.

Likewise for a CTK release I see https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

Based on https://github.com/spack/spack/issues/19365#issuecomment-714178028, it sounds like any CTK library also mentioned in the NVHPC SDK release is overridden by NVHPC SDK.

cudart is an example of a library provided only in CTK. Unlike NCCL, I do not see a separate download option, so is it the case that the cudart library is only available via the CTK?

Btw, I'm currently listed as a maintainer for our cudnn and nccl packages, but I honestly don't know much about them. The only reason I've been trying to keep them up-to-date is because I'm a DL researcher and I use PyTorch pretty heavily. I would love for NVIDIA to officially take them over and add any features others might be interested in.

cudart is an example of a library provided only in CTK. Unlike NCCL, I do not see a separate download option, so is it the case that the cudart library is only available via the CTK?

cudart is also provided in the HPC SDK. Since this is versioned directly in lock step with CUDA (e.g. libcudart.so.11.0 is included for CUDA 11.0) we did not list it out separately in the docs. If that is confusing, we can enhance the docs.

This has come up a couple times very recently (SDKs including implementations of packages) and I am considering an alternative approach:

  • The SDK package (nvhpc) can come with a method to define externals that reference its installation prefix
  • A new concretizer directive could be included (something like supplies) that tells the concretizer that the SDK comes with additional packages (which would fit well with the concretize-together option in environments)

The goal of this would be to avoid the work of converting packages to virtuals when an SDK provides them (so I agree that once packages are converted to virtuals that Spack should be able to resolve these sorts of issues, but I think it would be ideal to avoid the need for that conversion).

Can those more familiar with the Spack internals please comment on @scheibelp's alternative proposal above? It is not clear to me how much effort it would take to architect and implement.

Otherwise, the consensus pretty clearly is option 2.

cudart is an example of a library provided only in CTK. Unlike NCCL, I do not see a separate download option, so is it the case that the cudart library is only available via the CTK?

cudart is also provided in the HPC SDK. Since this is versioned directly in lock step with CUDA (e.g. libcudart.so.11.0 is included for CUDA 11.0) we did not list it out separately in the docs. If that is confusing, we can enhance the docs.

Based on your comment and looking at https://docs.nvidia.com/hpc-sdk/hpc-sdk-release-notes/index.html, my impression is that cudart is supplied via the CTK that comes with the NVHPC SDK: I assume the labels CUDA 10.1 | CUDA 10.2 | CUDA 11.0 at the top of table 1 refer to CUDA toolkit releases. By "only available via the CTK" I also include the CTK supplied by the NVHPC SDK; in other words, every cudart library provided by the NVHPC SDK appears in some release of the CTK - is that correct?

The SDK package (nvhpc) can come with a method to define externals that reference its installation prefix

A new concretizer directive could be included (something like supplies) that tells the concretizer that the SDK comes with additional packages (which would fit well with the concretize-together option in environments)

Can those more familiar with the Spack internals please comment on @scheibelp's alternative proposal above? It is not clear to me how much effort it would take to architect and implement.

The first suggestion would be easy (IMO): you would just add logic for locating each library inside of the nvhpc installation prefix. The second would require some work on my end to create the directives but also would not be much more difficult than adding provides declarations. The definite advantage of using provides here is that it integrates seamlessly into Spack's current concretizer. The only hangup is confusion/effort related to conversion of existing implementation packages (e.g. spack edit nccl) into virtuals.
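
As a sketch of the "locate libraries inside the nvhpc prefix" idea, something like the following could live in the nvhpc recipe. The helper name and the directory layout are assumptions, not existing Spack API:

```python
import os

from spack import *

class Nvhpc(Package):
    """NVIDIA HPC SDK (existing directives omitted)."""

    @property
    def bundled_component_prefixes(self):
        """Hypothetical helper mapping bundled components to the
        directories where they live under the SDK installation prefix
        (layout shown is an assumption)."""
        root = os.path.join(self.prefix, "Linux_x86_64", str(self.version))
        return {
            "cuda": os.path.join(root, "cuda"),
            "nccl": os.path.join(root, "comm_libs", "nccl"),
            "cublas": os.path.join(root, "math_libs"),
        }
```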

Commenting to get emails on this.

Commenting to get emails on this

It would be good if GitHub provided a "subscribe to discussion" button.

It would be good if GitHub provided a "subscribe to discussion" button.

You mean this?
[Screenshot of GitHub's "Subscribe" notifications button in the issue sidebar]
