Hi, I think it'd be cool to add support for AMDGPU using LLVM's AMDGPU backend and ROCm software stack (their CUDA equivalent). If you already have NVPTX backend working, I don't think it is too hard to add support for AMDGPU as well. And the graphics on AMDGPU is fantastic.
In fact, the Apache project TVM has AMDGPU codegen support which was almost copy-pasted from the NVPTX backend. You also need a runtime, but again TVM's implementation can help.
I helped bring up their support for AMD more than two years ago and that was how I started contributing to TVM. As a computer graphics enthusiast I'm also very interested in this project and am looking for opportunities to contribute. If you are interested in this topic, I can start taking a look @yuanming-hu
Disclaimer: I don't work for AMD and have no relationship with them whatsoever. I just like my GPU.
I think it's a great idea! One question I have:
Yeah, good question, if you already had OpenCL support in mind. I think there are many metrics for "being better" (performance, tooling, ease of development, etc.). For AMD it is a question of supporting OpenCL or ROCm, so let me try giving pros/cons of each:
OpenCL pros:
OpenCL cons:
ROCm pros:
ROCm cons:
I don't know much about ROCm, but there doesn't seem to be a way to allocate virtual memory yet. In C++ and CUDA this is done with mmap and cudaMallocManaged, respectively. Taichi relies on virtual memory and page faulting mechanisms in its memory management design (at least for sparse data structures). So you may need to wait until ROCm supports these features before implementing a backend for it.
It's that unified memory thing right? It does seem to be supported by ROCm. See https://github.com/ROCm-Developer-Tools/HIP/search?q=cudaMallocManaged&unscoped_q=cudaMallocManaged
Yes, it is the unified memory thing. Maybe hipMallocManaged would work too, but it would need some testing to see if it actually behaves like mmap. That is, can it be used to reserve a huge amount of virtual memory without allocating physical memory until pages are touched? This is a relatively new feature of cudaMallocManaged and it isn't really advertised. I wouldn't be super surprised if HIP doesn't support this specific thing yet. But it would be cool if it does!
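For reference, the mmap behaviour in question (reserve a large virtual range, with physical pages backed only on first touch) looks roughly like this on Linux. This is a minimal sketch, not Taichi's allocator: `reserve_virtual` and `release_virtual` are hypothetical names, and it says nothing about whether hipMallocManaged behaves the same way.

```cpp
#include <sys/mman.h>

#include <cstddef>
#include <cstdint>

// Hypothetical helper: reserve `bytes` of virtual address space without
// committing physical memory up front. Returns nullptr on failure.
inline std::uint8_t *reserve_virtual(std::size_t bytes) {
  void *p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  return p == MAP_FAILED ? nullptr : static_cast<std::uint8_t *>(p);
}

// Hypothetical helper: give the whole range back to the OS.
inline void release_virtual(std::uint8_t *p, std::size_t bytes) {
  if (p != nullptr) munmap(p, bytes);
}
```

Only the pages that are actually written get physical backing; untouched pages of an anonymous mapping read as zero.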
@masahi @KLozes thanks for the discussions! I learned a lot from you guys regarding AMDGPU :-)
Based on what I learned, it seems to me that AMDGPU is a better option than OpenCL for Taichi, because
libdevice.bc though)
nvcc sometimes also leads to issues. Therefore OpenCL sharing codegen with CUDA is not really an OpenCL pro for Taichi at this stage.
(Sorry about my delayed reply. I was busy fixing an urgent bug for v0.4.1...)
Regarding unified memory and Taichi's memory allocator, that's another story. I'll post more thoughts on those tomorrow.
Linking with bitcode on AMD during LLVM codegen is straightforward. I did that for TVM in https://github.com/apache/incubator-tvm/pull/570
Cool! One less thing to worry about!
A figure illustrating the current memory management system in Taichi. More details coming tomorrow. I'm considering supporting the backends without hardware unified memory as well, depending on how soon every device will support unified memory...

Considering removing the dependency on unified memory since it seems CUDA on Windows does not have very good support for it... http://on-demand.gputechconf.com/gtc/2018/presentation/s8430-everything-you-need-to-know-about-unified-memory.pdf (search for "windows")

I can see how removing the dependency on unified memory would make Taichi more portable. But, do you think it will slow down Taichi an appreciable amount? I imagine data activation would be slower without the page faulting mechanism. Also, the bitmasked data structure wouldn't be possible anymore, which I think should have less memory fragmentation than using pointers for sparsity.
I think the tradeoff would be between performance and memory fragmentation. If we allocate smaller memory chunks to reduce fragmentation, there will be more allocation/lock contention, which will harm performance.
I wouldn't worry about performance too much. A bigger issue would be supporting Python-scope (single) tensor element access. Without unified memory, reading/writing a single element means one kernel launch. For writing, we need to batch the write requests to reduce the number of kernel launches. For reading, we have to do prefetching/caching; otherwise there will be a lot of kernel launches and cudaMemcpy calls across PCI-e... This will need a lot of refactoring to realize.
Or we simply disable Python-scope data access, but instruct the users to use numpy instead.
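To make the write-batching idea concrete, here is a rough host-side sketch. Everything in it is hypothetical (`WriteBatcher`, its methods, and the launch counter are not Taichi APIs), and a plain `std::vector` stands in for device memory so the snippet runs without a GPU.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: queue single-element writes and apply them together,
// so N writes cost roughly N / capacity simulated launches instead of N.
class WriteBatcher {
 public:
  WriteBatcher(std::vector<float> &device_data, std::size_t capacity)
      : device_(device_data), capacity_(capacity), launches_(0) {}

  // Queue a write; flush automatically once the batch is full.
  void write(std::size_t index, float value) {
    pending_.emplace_back(index, value);
    if (pending_.size() >= capacity_) flush();
  }

  // Apply all queued writes in one simulated kernel launch.
  void flush() {
    if (pending_.empty()) return;
    for (const auto &w : pending_) device_[w.first] = w.second;
    pending_.clear();
    ++launches_;
  }

  // A read must observe queued writes, so it flushes first and then pays
  // one more simulated launch for the read itself.
  float read(std::size_t index) {
    flush();
    ++launches_;
    return device_[index];
  }

  std::size_t launches() const { return launches_; }

 private:
  std::vector<float> &device_;  // stand-in for device memory
  std::size_t capacity_;
  std::size_t launches_;
  std::vector<std::pair<std::size_t, float>> pending_;
};
```

Reads are deliberately left uncached here; the prefetching/caching side is the harder part and is not sketched.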
I don't think disabling Python-scope access would be a big deal. Would be nice to keep it for 0D tensors though.
Can't we abstract over the memory management details of different backends so that the rest of Taichi doesn't have to care whether unified memory is supported or not?
I haven't looked at the code in detail, but if disabling unified memory is an option I wonder why we cannot turn unified memory on/off on a per-backend basis.
Or are low-level memory management details strongly tied to the rest of the system? If this constraint comes from the need to handle sparse data structures, I am interested. I believe this is not an issue for "dense" domain systems such as Halide or TVM.
I don't think disabling Python-scope access would be a big deal. Would be nice to keep it for 0D tensors though.
Yeah I think the worst case is that we will still support it, yet at a cost of one kernel launch per read/write in Python.
Can't we abstract over the memory management details of different backends so that the rest of Taichi doesn't have to care whether unified memory is supported or not? I haven't looked at the code in detail, but if disabling unified memory is an option I wonder why we cannot turn unified memory on/off on a per-backend basis.
Good question. We want unified memory because memory management performance is important, especially when we have sparse data structures. We can, of course, go without unified memory, if we support dense structures only, or simply implement sparse data structures with a higher cost.
OK, I took a quick look at the code base and I see some good refactoring opportunities there:
Currently, LLVM GPU means NV and CUDA, and the code is hard-coded that way. I want to separate the concept of "GPU" from "CUDA" so that the AMDGPU backend can share some code with NVPTX.
There are many places in the code base where if/else is used to dispatch into backend-specific logic, like below. This quickly becomes awkward as we add more backends in the future (AMDGPU, ARM, upcoming fancy dGPU from Intel..?). I think using abstract interfaces and virtual method calls is better for code hygiene. They can also be used to hide the memory management details I mentioned above.
if (arch == Arch::x86_64) {
  // do something for x64
} else if (arch == Arch::cuda) {
  // do something else for cuda
} else {
  ...
}
If you think this is a good idea, I can open a separate issue to discuss some refactoring plans. I want them to be a prereq for AMD work.
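As a starting point for that discussion, the interface-based dispatch could look something like the sketch below. All names here (`Backend`, `GpuBackend`, `make_backend`, and the `Arch::amdgpu` enumerator) are hypothetical and not taken from the Taichi code base, and the `has_unified_memory` return values are placeholders rather than verified facts about any device.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

enum class Arch { x86_64, cuda, amdgpu };  // amdgpu added for illustration

// Hypothetical interface: callers ask the backend instead of branching on Arch.
class Backend {
 public:
  virtual ~Backend() = default;
  virtual std::string name() const = 0;
  virtual bool is_gpu() const = 0;
  virtual bool has_unified_memory() const = 0;  // placeholder answers below
};

class X64Backend : public Backend {
 public:
  std::string name() const override { return "x86_64"; }
  bool is_gpu() const override { return false; }
  bool has_unified_memory() const override { return true; }
};

// Logic common to every GPU target lives here, separate from anything CUDA-specific.
class GpuBackend : public Backend {
 public:
  bool is_gpu() const override { return true; }
};

class CudaBackend : public GpuBackend {
 public:
  std::string name() const override { return "cuda"; }
  bool has_unified_memory() const override { return true; }
};

class AmdgpuBackend : public GpuBackend {
 public:
  std::string name() const override { return "amdgpu"; }
  bool has_unified_memory() const override { return false; }  // unverified
};

// The only place that still switches on Arch.
std::unique_ptr<Backend> make_backend(Arch arch) {
  switch (arch) {
    case Arch::x86_64: return std::make_unique<X64Backend>();
    case Arch::cuda: return std::make_unique<CudaBackend>();
    case Arch::amdgpu: return std::make_unique<AmdgpuBackend>();
  }
  throw std::invalid_argument("unknown arch");
}
```

With this shape, adding a backend means adding one class and one `case`, and the per-backend unified memory switch becomes a virtual query rather than another if/else chain.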
I think these are great ideas! Please feel free to propose a more hygienic refactoring solution. Starting with just two backends (x64 and CUDA), a lot of legacy designs will no longer suit the current tendency of having more and more backends.
The current list of potentially supported backends is here:
https://github.com/taichi-dev/taichi/blob/5866eb5148297941e82e9998d48ea2eed0d9bf01/taichi/inc/archs.inc.h
I'm quite interested to learn how you plan to remove the dependency on unified memory. Currently the Metal backend (#396) can only support dense snodes, and I took a look at the dynamic snodes. The memory allocation part seems to happen in request_allocate_aligned, where the CUDA side busy loops waiting for the host side to pass back the new memory address.
For Metal, we may be able to do the host/kernel sync via MTLSharedEvent. However, as far as I know, the host side must bind all the buffers before launching a Metal kernel (no dynamic buffer binding while the kernel is running). Is this the same in OpenGL/Vulkan as well? If so, what would be a good way to pass back new chunks of memory?
Yeah, currently the host-device communication via request_allocate_aligned is pretty tricky and might lead to some portability issues.
The worst case is that we simply pre-allocate a buffer of, say, 2 GB before the kernel launch, and define the cases where a single kernel activates more than 2 GB of memory as undefined behavior on devices where host/kernel sync is not supported...
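A rough sketch of what allocating out of one pre-allocated buffer could look like, written as host C++ with an atomic bump pointer. `PoolAllocator` is a hypothetical name and this is not how Taichi's allocator is implemented; unlike the undefined-behavior option above, this sketch returns `nullptr` when the pool runs out, purely so the failure is observable.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bump allocator over a single buffer allocated ahead of time.
class PoolAllocator {
 public:
  explicit PoolAllocator(std::size_t pool_bytes)
      : pool_(pool_bytes), head_(0) {}

  // Returns a chunk of `bytes` aligned to `alignment` (a power of two),
  // or nullptr once the pool is exhausted. Chunks are never freed.
  void *allocate(std::size_t bytes, std::size_t alignment) {
    const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(pool_.data());
    std::size_t old_head = head_.load(std::memory_order_relaxed);
    for (;;) {
      // Round the absolute address up, then convert back to a pool offset.
      const std::uintptr_t aligned =
          (base + old_head + alignment - 1) & ~(std::uintptr_t(alignment) - 1);
      const std::size_t start = static_cast<std::size_t>(aligned - base);
      if (start > pool_.size() || bytes > pool_.size() - start) return nullptr;
      // Claim [start, start + bytes); on failure old_head is refreshed.
      if (head_.compare_exchange_weak(old_head, start + bytes,
                                      std::memory_order_relaxed)) {
        return pool_.data() + start;
      }
    }
  }

  std::size_t used() const { return head_.load(std::memory_order_relaxed); }

 private:
  std::vector<std::uint8_t> pool_;  // stand-in for the pre-allocated buffer
  std::atomic<std::size_t> head_;   // offset of the first free byte
};
```

The compare-and-swap loop is the contention point: smaller chunks mean more trips through it, which is the performance versus fragmentation tradeoff mentioned earlier in the thread.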
Going to sleep now and will think more about this tomorrow.
Unfortunately, after more investigation I think we should not depend on unified memory anymore. Because
This means the memory allocator needs to pre-allocate a huge piece of memory ahead of time.
For Python-scope accesses, we need to either launch a GPU kernel or maintain a software cache for more batched load/store.
However, as far as I know, the host side must bind all the buffers before launching a Metal kernel (no dynamic buffer binding while the kernel is running). Is this the same in OpenGL/Vulkan as well?
But I think binding should be possible when no kernel is running. Does that help (dynamic buffer)?
If so, what would be a good way to pass back new chunks of memory?
The only way to pass data back is glMapBuffer, which maps the buffer somewhere into host memory.
If snode_reader/writer is not on x86_64, the best I could do is map only args, extra_args, and external_ptr, but not root.
Warning: The issue has been out-of-update for 50 days, marking stale.
I think it's a great idea! One question I have:
* Is supporting AMD GPUs via AMDGCN better than via OpenCL?
A Vulkan (Metal equivalent on Linux) backend may be a better fit.
There is no OSS OpenCL implementation on Linux yet, and amdgpu-pro is not attractive to many users because its quality is actually worse than the OSS driver.