Taichi: AMDGPU backend

Created on 30 Jan 2020 · 24 Comments · Source: taichi-dev/taichi

Hi, I think it'd be cool to add support for AMDGPU using LLVM's AMDGPU backend and the ROCm software stack (AMD's CUDA equivalent). If you already have the NVPTX backend working, I don't think it is too hard to add support for AMDGPU as well. And the graphics on AMDGPU are fantastic.

In fact, the Apache project TVM has AMDGPU codegen support, which was almost copy-pasted from its NVPTX backend. You also need a runtime, but again TVM's implementation can help.

I helped bring up their support for AMD more than two years ago, and that was how I started contributing to TVM. As a computer graphics enthusiast I'm also very interested in this project and am looking for opportunities to contribute. If you are interested in this topic, I can start taking a look @yuanming-hu

feature request stale

All 24 comments

Disclaimer: I don't work for AMD and have no relationship with them whatsoever. I just like my GPU.

I think it's a great idea! One question I have:

  • Is supporting AMD GPUs via AMDGCN better than via OpenCL?

Yeah, good question, if you already had OpenCL support in mind. I think there are many metrics for "being better" (performance, tooling, ease of development, etc.). For AMD it is a question of supporting OpenCL or ROCm, so let me try giving pros/cons for each:

OpenCL pros:

  • Cross platform, can support other backends (I can think of Adreno and Mali, but if I remember correctly Apple ditched OpenCL and Android is also moving away from OpenCL in favor of Vulkan)
  • AMD has OpenCL drivers for both Win and Linux
  • Can share codegen implementation with CUDA (at least that is how TVM implements CUDA and OpenCL codegen)

OpenCL cons:

  • From what I see, OpenCL on AMD is more or less in "maintenance mode". They are stuck with OpenCL 2.0 forever. But I would say this applies to the industry in general (Apple, Android, etc.)

ROCm pros:

  • AMD is actively pushing the software stack built on ROCm (actively developed libraries for math, DL, etc. only work in a ROCm environment).
  • Has more potential for optimizing for AMDGPU specifically (inline asm is even possible) than a "generic" OpenCL backend. We at TVM compared performance on OpenCL vs ROCm after auto tuning, and we saw better performance on ROCm.
  • Can share codegen implementation with NVPTX (in principle it just needs to replace NV-specific intrinsics for thread id, barrier instructions, etc. with the AMD ones)
  • AMD's open source OpenCL driver is built on top of ROCm stack and this is the direction they are going.

ROCm cons:

  • For now it is Linux only.
  • The software stack itself is not mature compared to CUDA (but I'd say it is much better than OpenCL on AMD)
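
To make the "share codegen with NVPTX" point above a bit more concrete, here is a minimal C++ sketch of a per-target intrinsic table. The LLVM intrinsic names are, as far as I know, the ones the NVPTX and AMDGPU backends exposed at the time of this thread; `GpuTarget`, `GpuIntrinsics` and `intrinsics_for` are hypothetical names made up for this illustration and are not taken from Taichi or TVM.

```cpp
#include <cassert>
#include <string>

// Hypothetical names for illustration only; not part of Taichi or TVM.
enum class GpuTarget { NVPTX, AMDGPU };

struct GpuIntrinsics {
  std::string thread_id_x;  // index of the thread within its block/workgroup
  std::string block_id_x;   // index of the block/workgroup within the grid
  std::string barrier;      // block-level (workgroup-level) barrier
};

// Returns the LLVM intrinsic names a shared codegen could emit per target.
// Everything else in the codegen would stay target-independent.
inline GpuIntrinsics intrinsics_for(GpuTarget target) {
  switch (target) {
    case GpuTarget::NVPTX:
      return {"llvm.nvvm.read.ptx.sreg.tid.x",
              "llvm.nvvm.read.ptx.sreg.ctaid.x",
              "llvm.nvvm.barrier0"};
    case GpuTarget::AMDGPU:
      return {"llvm.amdgcn.workitem.id.x",
              "llvm.amdgcn.workgroup.id.x",
              "llvm.amdgcn.s.barrier"};
  }
  assert(false && "unknown GPU target");
  return {};
}
```

The shared codegen would then look up `intrinsics_for(target).thread_id_x` wherever it currently hard-codes the NVPTX name. Quantities such as the block size have no one-to-one intrinsic on AMDGPU, so a real port would need a bit more than a name swap.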

I don't know much about ROCm, but there doesn't seem to be a way to allocate virtual memory yet. In C++ and CUDA this is done with mmap and cudaMallocManaged, respectively. Taichi relies on virtual memory and page faulting mechanisms in its memory management design (at least for sparse data structures). So you may need to wait until ROCm supports these features before implementing a backend for it.

It's that unified memory thing right? It does seem to be supported by ROCm. See https://github.com/ROCm-Developer-Tools/HIP/search?q=cudaMallocManaged&unscoped_q=cudaMallocManaged

Yes, it is the unified memory thing. Maybe hipMallocManaged would work too, but it would need some testing to see if it actually behaves like mmap. That is, can it be used to reserve a huge amount of virtual memory without allocating physical memory until pages are touched? This is a relatively new feature of cudaMallocManaged and it isn't really advertised. I wouldn't be super surprised if HIP doesn't support this specific thing yet. But it would be cool if it does!
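
For reference, this is the host-side mmap behaviour being compared against, as a minimal sketch for Linux/POSIX. It says nothing about what hipMallocManaged or cudaMallocManaged actually do; the helper names (`reserve_virtual`, `release_virtual`, `touch_one_page`) are made up for this example.

```cpp
#include <sys/mman.h>

#include <cassert>
#include <cstddef>

// Reserves `bytes` of virtual address space without committing physical
// memory up front. Returns nullptr on failure. Linux/POSIX only.
inline void *reserve_virtual(std::size_t bytes) {
  void *p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  return p == MAP_FAILED ? nullptr : p;
}

inline void release_virtual(void *p, std::size_t bytes) {
  if (p != nullptr) munmap(p, bytes);
}

// Writes a single byte in the middle of the reservation and reads it back.
// Only the page containing that byte has to be backed by physical memory;
// the rest of the range stays a pure address-space reservation.
inline bool touch_one_page(std::size_t reserved_bytes) {
  char *base = static_cast<char *>(reserve_virtual(reserved_bytes));
  if (base == nullptr) return false;
  base[reserved_bytes / 2] = 42;  // first touch triggers the page fault
  bool ok = (base[reserved_bytes / 2] == 42);
  release_virtual(base, reserved_bytes);
  return ok;
}
```

Calling `touch_one_page(std::size_t(1) << 30)` reserves 1 GiB but should only make one page resident. The open question in this thread is whether a managed GPU allocation gives the same lazy behaviour.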

@masahi @KLozes thanks for the discussions! I learned a lot from you guys regarding AMDGPU :-)

Based on what I learned, it seems to me that AMDGPU is a better option than OpenCL for Taichi, because

  • According to @masahi, AMDGPU can share codegen with NVPTX and it is not too much work (not sure about the messy things such as linking against libdevice.bc though)
  • The CUDA backend is obsolete in Taichi now, since its compilation speed is too slow compared to LLVM NVPTX codegen. (Taichi used to take > 1 minute to compile all CUDA kernels in a program...) Invoking nvcc also sometimes leads to issues. Therefore OpenCL sharing codegen with CUDA is not really an OpenCL pro for Taichi at this stage.
  • My feeling about OpenCL is roughly the same as @masahi's: hardware manufacturers are not very active in supporting its new versions
  • I'd say it's sad that ROCm supports only Linux (which in my understanding is mostly used by developers). Hopefully, they can support Windows soon and compete with CUDA on all platforms...

(Sorry about my delayed reply. I was busy fixing an urgent bug for v0.4.1...)

Regarding unified memory and Taichi's memory allocator, that's another story. I'll post more thoughts on those tomorrow.

Linking with bitcode on AMD during LLVM codegen is straightforward. I did that for TVM in https://github.com/apache/incubator-tvm/pull/570

Cool! One less thing to worry about!

A figure illustrating the current memory management system in Taichi. More details coming tomorrow. I'm considering supporting backends without hardware unified memory as well, depending on how soon every device will support unified memory...

[Image: Screen Shot 2020-02-01 at 12 33 00 AM]

Considering removing the dependency on unified memory since it seems CUDA on Windows does not have very good support for it... http://on-demand.gputechconf.com/gtc/2018/presentation/s8430-everything-you-need-to-know-about-unified-memory.pdf (search for "windows")

[Image: Screenshot from 2020-02-02 01-22-49]

I can see how removing the dependency on unified memory would make Taichi more portable. But, do you think it will slow down Taichi an appreciable amount? I imagine data activation would be slower without the page faulting mechanism. Also, the bitmasked data structure wouldn't be possible anymore, which I think should have less memory fragmentation than using pointers for sparsity.

I think the tradeoff would be between performance and memory fragmentation. If we allocate smaller chunks of memory to reduce fragmentation, there will be more allocation/lock contention, which will harm performance.

I wouldn't worry about performance too much. A bigger issue would be supporting Python-scope (single) tensor element access. Without unified memory, reading/writing a single element means one kernel launch. For writing, we need to batch the write requests to reduce the number of kernel launches. For reading, we have to do prefetching/caching; otherwise there will be a lot of kernel launches and cudaMemcpy calls across PCI-e... This will need a lot of refactoring to realize.
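
A rough sketch of the write-batching idea, with the GPU simulated by a plain host vector and a counter standing in for kernel launches. `WriteBatcher` and `launches_for` are hypothetical names for this illustration; nothing here is Taichi's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: `device` stands in for GPU memory, and `launches()`
// counts how many simulated kernel launches were needed.
class WriteBatcher {
 public:
  WriteBatcher(std::vector<float> &device, std::size_t capacity)
      : device_(device), capacity_(capacity) {}

  // Queue a single-element write; flush automatically when the batch is full.
  void write(std::size_t index, float value) {
    pending_.emplace_back(index, value);
    if (pending_.size() >= capacity_) flush();
  }

  // Apply all queued writes in one simulated kernel launch.
  void flush() {
    if (pending_.empty()) return;
    for (const auto &w : pending_) device_[w.first] = w.second;
    pending_.clear();
    ++launches_;
  }

  std::size_t launches() const { return launches_; }

 private:
  std::vector<float> &device_;
  std::size_t capacity_;
  std::vector<std::pair<std::size_t, float>> pending_;
  std::size_t launches_ = 0;
};

// Writes n elements through the batcher and returns the number of launches.
inline std::size_t launches_for(std::size_t n, std::size_t capacity) {
  std::vector<float> device(n, 0.0f);
  WriteBatcher batcher(device, capacity);
  for (std::size_t i = 0; i < n; ++i) batcher.write(i, 1.0f);
  batcher.flush();  // pending writes must land before anything reads them
  return batcher.launches();
}
```

With a batch capacity of 4, ten single-element writes cost three launches instead of ten. The catch is that every read (and every user kernel launch) has to flush first, which is part of why this needs refactoring rather than a local patch.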

Or we simply disable Python-scope data access, but instruct the users to use numpy instead.

I don't think disabling Python-scope access would be a big deal. Would be nice to keep it for 0D tensors though.

Can't we abstract over the memory management details of the different backends so that the rest of Taichi doesn't have to care whether unified memory is supported or not?

I haven't looked at the code in detail, but if disabling unified memory is an option, I wonder why we cannot turn unified memory on/off on a per-backend basis.

Or are the low-level memory management details strongly tied to the rest of the system? If this constraint comes from the need to handle sparse data structures, I am interested. I believe this is not an issue for "dense" domain systems such as Halide or TVM.

I don't think disabling Python-scope access would be a big deal. Would be nice to keep it for 0D tensors though.

Yeah I think the worst case is that we will still support it, yet at a cost of one kernel launch per read/write in Python.

Can't we abstract over the memory management details of the different backends so that the rest of Taichi doesn't have to care whether unified memory is supported or not? I haven't looked at the code in detail, but if disabling unified memory is an option, I wonder why we cannot turn unified memory on/off on a per-backend basis.

Good question. We want unified memory because memory management performance is important, especially when we have sparse data structures. We can, of course, go without unified memory, if we support dense structures only, or simply implement sparse data structures with a higher cost.

OK, I took a quick look at the code base and I see some good refactoring opportunities there:

  • Currently, LLVM GPU means NV and CUDA, and the code is hard-coded that way. I want to separate the concept of "GPU" from "CUDA" so that the AMDGPU backend can share some code with NVPTX.

  • There are many places in the code base where if/else is used to dispatch into backend-specific logic, like below. This quickly becomes awkward as we add more backends in the future (AMDGPU, ARM, upcoming fancy dGPU from Intel..?). I think using abstract interfaces and virtual method calls is better for code hygiene. They can also be used to hide the memory management details I mentioned above.

  if (arch == Arch::x86_64) {
    // do something for x64
  } else if (arch == Arch::cuda) {
    // do something else for cuda
  } else {
    ...
  }

If you think this is a good idea, I can open a separate issue to discuss some refactoring plans. I want them to be a prereq for AMD work.
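
To sketch what I mean by abstract interfaces: the if/else chain collapses into a single factory, and everything else talks to a `Backend` object. All class and method names below are hypothetical and only illustrate the shape of the refactoring; they are not Taichi's actual classes. The target triples are the usual LLVM ones, and the answers returned by `host_can_access_data_directly` are assumptions for the sake of the example (CUDA with unified memory, AMDGPU without).

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical names throughout; this is not Taichi's real class hierarchy.
enum class Arch { x86_64, cuda, amdgpu };

class Backend {
 public:
  virtual ~Backend() = default;
  virtual std::string target_triple() const = 0;
  // Lets callers ask about memory capabilities instead of checking the arch.
  virtual bool host_can_access_data_directly() const = 0;
};

class X64Backend : public Backend {
 public:
  std::string target_triple() const override { return "x86_64-unknown-linux-gnu"; }
  bool host_can_access_data_directly() const override { return true; }
};

class CudaBackend : public Backend {
 public:
  std::string target_triple() const override { return "nvptx64-nvidia-cuda"; }
  // Assumption: unified memory is available and enabled.
  bool host_can_access_data_directly() const override { return true; }
};

class AmdgpuBackend : public Backend {
 public:
  std::string target_triple() const override { return "amdgcn-amd-amdhsa"; }
  // Assumption: no unified memory, so host access needs explicit copies.
  bool host_can_access_data_directly() const override { return false; }
};

// The only place that still switches on the architecture.
inline std::unique_ptr<Backend> make_backend(Arch arch) {
  switch (arch) {
    case Arch::x86_64: return std::make_unique<X64Backend>();
    case Arch::cuda:   return std::make_unique<CudaBackend>();
    case Arch::amdgpu: return std::make_unique<AmdgpuBackend>();
  }
  return nullptr;
}

// Example of backend-agnostic code: no arch checks, only interface calls.
inline std::string describe(const Backend &backend) {
  return backend.target_triple() +
         (backend.host_can_access_data_directly() ? " (direct host access)"
                                                  : " (needs explicit copies)");
}
```

Adding a new backend then means adding one subclass and one `case`, rather than touching every if/else chain in the code base.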

I think these are great ideas! Please feel free to propose a more hygienic refactoring solution. Starting with just two backends (x64 and CUDA), a lot of legacy designs will no longer suit the current tendency of having more and more backends.

The current list of potentially supported backends is here:
https://github.com/taichi-dev/taichi/blob/5866eb5148297941e82e9998d48ea2eed0d9bf01/taichi/inc/archs.inc.h

I'm quite interested to learn how you plan to remove the dependency on unified memory. Currently the Metal backend (#396 ) can only support dense snodes, and I took a look at the dynamic snodes. The memory allocation part seems to happen in request_allocate_aligned, where the CUDA side busy loops waiting for the host side to pass back the new memory address.

For Metal, we may be able to do the host/kernel sync via MTLSharedEvent. However, as far as I know, the host side must bind all the buffers before launching a Metal kernel (no dynamic buffer binding while the kernel is running). Is this the same in OpenGL/Vulkan as well? If so, what would be a good way to pass back new chunks of memory?

Yeah, currently the host-device communication via request_allocate_aligned is pretty tricky and might lead to some portability issues.

The worst case is that we simply pre-allocate a buffer of, say, 2GB before the kernel launch, and define the cases where a single kernel activates > 2GB of memory as undefined behavior on devices where host/kernel sync is not supported...

Going to sleep now and will think more about this tomorrow.

Unfortunately, after more investigation I think we should not depend on unified memory anymore, because:

  • Backends such as OpenGL compute shaders and AMD GPUs do not support it
  • Even for NVIDIA GPUs, support on Windows is poor
    https://devtalk.nvidia.com/default/topic/1030191/cuda-unified-memory-oversubscription-in-windows-systems/
  • According to my experiments, the unified memory support on NVIDIA Jetson TX2 is also a little mysterious.
  • In practice I've also found the NVIDIA driver to be problematic under the current request-based memory allocation during a single kernel launch. It seems to me that for some magical reason after a cudaMallocManaged during another kernel launch, CUDA can no longer JIT compile new PTX (loadModule).

This means the memory allocator needs to pre-allocate a huge piece of memory ahead of time.
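
As a host-side sketch of what "pre-allocate ahead of time" could look like: a fixed pool plus an atomic bump pointer, so that allocation never has to call back into the host. `PreallocatedPool` is a hypothetical name, and this version returns nullptr when the pool runs out instead of leaving that case undefined; a real device-side allocator would look different.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a pre-allocated pool with lock-free bump allocation.
class PreallocatedPool {
 public:
  explicit PreallocatedPool(std::size_t bytes) : buffer_(bytes), head_(0) {}

  // Returns `size` bytes whose offset from the start of the pool is a
  // multiple of `alignment` (must be a power of two), or nullptr once the
  // pool is exhausted. Memory is never handed back to the pool.
  void *allocate(std::size_t size, std::size_t alignment) {
    std::size_t old_head = head_.load(std::memory_order_relaxed);
    for (;;) {
      std::size_t start = (old_head + alignment - 1) & ~(alignment - 1);
      std::size_t end = start + size;
      if (end > buffer_.size()) return nullptr;
      // On failure, old_head is refreshed with the current value and we retry.
      if (head_.compare_exchange_weak(old_head, end, std::memory_order_relaxed))
        return buffer_.data() + start;
    }
  }

  std::size_t used() const { return head_.load(std::memory_order_relaxed); }

 private:
  std::vector<unsigned char> buffer_;
  std::atomic<std::size_t> head_;
};
```

The single atomic counter keeps contention cheap compared to a lock, which matters for the fragmentation/contention tradeoff mentioned earlier in the thread. The price is that the pool size has to be guessed up front.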

For Python-scope accesses, we need to either launch a GPU kernel or maintain a software cache for more batched load/store.

However, as far as I know, the host side must bind all the buffers before launching a Metal kernel (no dynamic buffer binding while the kernel is running). Is this the same in OpenGL/Vulkan as well?

But I think binding should be possible when no kernel is running. Does that help (dynamic buffer)?

If so, what would be a good way to pass back new chunks of memory?

The only way to pass data back is glMapBuffer, which maps the buffer somewhere into host memory.
The best I could do when snode_reader/writer is not on x86_64 is to map only args, extra_args, and external_ptr, but not root.

Warning: The issue has been out-of-update for 50 days, marking stale.

I think it's a great idea! One question I have:

  • Is supporting AMD GPUs via AMDGCN better than via OpenCL?

A Vulkan (the Metal equivalent on Linux) backend may be a better fit.
There is no OSS OpenCL implementation on Linux yet, and amdgpu-pro is not attractive to many users because its quality is actually worse than that of the OSS driver.
