Taichi: [Refactors] Code Modularization

Created on 3 Mar 2020 · 24 Comments · Source: taichi-dev/taichi

Over the past week, the Taichi code structure was drastically changed. Since each module should have only a single responsibility, a few huge classes are broken down into smaller pieces:

  • (Abstract) KernelCodeGen

    • CodeGenLLVM

    • CodeGenLLVMCPU

    • CodeGenLLVMCUDA (We may create a CodeGenLLVMSIMT for the common parts of CUDA and AMDGPU @masahi)

    • CodeGenMetal (not really a derived class, but should it be?)

    • CodeGenOpenGL (not really a derived class, but should it be?)

  • (Abstract) StructCompiler

    • ...

  • (Abstract) JITModule: holds compiled functions that are ready to launch

    • JITModuleCUDA

    • JITModuleCPU

    • (Again, should JITModule be extended, so that Metal/OpenGL modules can share this interface?)

  • (Abstract) Runtime (not yet implemented; needs further discussion on whether it is really necessary): JIT-compiles generated code, launches compiled functions from JITModule, allocates memory, binds device buffers (#542), syncs host/device, etc. (JITSession is now part of RuntimeEnv.)

    • RuntimeCPU

    • RuntimeCUDA

    • RuntimeMetal

    • RuntimeOpenGL

  • Program: As mentioned by @k-ye, maybe we should make it virtual as well (https://github.com/taichi-dev/taichi/issues/554#issuecomment-593303288). The other option is to simply use a non-virtual Program holding pointers to instances of the abstract classes above, and create members using, e.g., JITSession::create(arch). This saves creating new derived Program classes. For now, we keep Program a class with a bunch of (virtual) pointers to abstract components.

  • TaichiLLVMContext: this class serves as an LLVM module building util, and needs a bit more refactoring.

I'm open to all kinds of input on whether this design makes sense, especially from backend developers (@k-ye @archibate @masahi). Also note that the interfaces are currently designed for LLVM-based backends, and we need some extensions to make them work with source2source backends (Metal, OpenGL, etc.), if people think having common base classes for all backends is a good idea.
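To make the "non-virtual Program holding pointers to abstract components" option concrete, here is a minimal sketch in Python rather than Taichi's C++. The class names follow the list above, but the `compile` method, the `create(arch)` registry, and the returned strings are assumptions made for illustration only, not Taichi's actual interfaces.

```python
from abc import ABC, abstractmethod


class KernelCodeGen(ABC):
    """Abstract code generator; one concrete subclass per backend."""

    @abstractmethod
    def compile(self, kernel_name: str) -> str:
        ...

    @staticmethod
    def create(arch: str) -> "KernelCodeGen":
        # Hypothetical factory keyed by architecture, so that Program
        # itself never needs a derived class per backend.
        registry = {"x64": CodeGenLLVMCPU, "cuda": CodeGenLLVMCUDA}
        if arch not in registry:
            raise ValueError(f"unsupported arch: {arch}")
        return registry[arch]()


class CodeGenLLVM(KernelCodeGen):
    """Logic shared by all LLVM-based backends."""

    target = "generic"

    def compile(self, kernel_name: str) -> str:
        # Placeholder for real IR generation.
        return f"llvm[{self.target}]:{kernel_name}"


class CodeGenLLVMCPU(CodeGenLLVM):
    target = "cpu"


class CodeGenLLVMCUDA(CodeGenLLVM):
    target = "nvptx"


class Program:
    """Non-virtual Program: it only holds a reference to an abstract component."""

    def __init__(self, arch: str):
        self.codegen = KernelCodeGen.create(arch)

    def compile_kernel(self, name: str) -> str:
        return self.codegen.compile(name)


if __name__ == "__main__":
    print(Program("cuda").compile_kernel("paint"))
```

With this shape, supporting a new backend means registering one more codegen class; `Program` stays unchanged.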

discussion

All 24 comments

One more thing: Currently, source files are organized in a tree structure. For example, under taichi/codegen/ we have codegen_llvm_cuda, codegen_metal, codegen_opengl.

I'm wondering if we should organize everything in a per-backend manner. I.e., everything related to CUDA should be put in taichi/backends/cuda, including codegen, struct, jit_session, jit_module, runtime_env etc. The pros are:

  • Higher cohesion in folders. Adding a backend means creating a new folder and implementing the modules.
  • No guards like TI_WITH_CUDA needed anymore, since we only include files under the taichi/backends/cuda folder in CMake if the TI_WITH_CUDA:BOOL is on.
  • In the future, we may want to compile each backend into a shared object (instead of having everything in a huge libtaichi_core.so). With this file organization, it's easier to decide which shared object/dynamic library a file belongs to. For example, everything under taichi/backends/cuda belongs to libtaichi_core_cuda.so.

(I can't come up with clear cons of this, but please comment if you find one :-)

> I'm wondering if we should organize everything in a per-backend manner. I.e., everything related to CUDA should be put in taichi/backends/cuda, including codegen, struct, jit_session, jit_module, runtime_env etc.

Agreed. Having platform/opengl/opengl_api.h live apart is annoying; it should sit together with the codegen, which would also make the commit history easier to follow.

> No guards like TI_WITH_CUDA needed anymore, since we only include files under the taichi/backends/cuda folder in CMake if the TI_WITH_CUDA:BOOL is on.

A CUDA hater can simply rm -rf taichi/backends/cuda :-) Solves #529.

> CodeGenMetal (not really a derived class, but should it be?)
> CodeGenOpenGL (not really a derived class, but should it be?)

I think this depends on where you want to make that per-backend separation? I don't see much implementation being shared between these two codegens and the LLVM-based codegen (probably except for a very high-level gen() function...). If we have already encapsulated each backend's impl at a higher level, then these codegen classes can just be an implementation detail inside that higher-level abstraction.

What I do think can be shared is between Metal and OpenGL codegen, just like CodeGenLLVMSIMT, because these two are C-ish languages.

> (Abstract) JITSession: compile LLVM Modules/Source files into JITModules
> (Again, should JITModule be extended, so that Metal/OpenGL modules can share this interface?)

I'm not super familiar with this part, but this sounds similar to what the MetalRuntime class does (registering kernels, caching the compiled kernels, launching the kernels, etc). I guess for Metal & OpenGL, the underlying runtime API has already taken care of the compilation part?

> No guards like TI_WITH_CUDA needed anymore, since we only include files under the taichi/backends/cuda folder in CMake if the TI_WITH_CUDA:BOOL is on.

+1.

As for codegens, I wonder if there's ever gonna be a need where someone wants to do cross-compilation. After all, it's perfectly fine for you to generate CUDA/OpenGL/Metal code without having the runtime supporting it... (On the other hand, I'm not sure if the compiled code without Taichi runtime is useful in any way...)

> After all, it's perfectly fine for you to generate CUDA/OpenGL/Metal code without having the runtime supporting it...

+1.
Using something like the ti.export mentioned in #439, we could have arch=js and run Taichi code in the browser. With this, Taichi is more like a compiler (CodeGen) than an interpreter (JITModule).
Also, you may not want arch=js but rather lang=js, just as you wouldn't want to confuse arch=x64 with lang=cpp.

> I'm wondering if we should organize everything in a per-backend manner.

The only con I can think of is, since I still see a possibility where heterogeneous backends (or mixed backends?) could be useful, this might be a bit too restrictive. For example, the Metal backend used part of the LLVM runtime for its memory. Although this particular example is probably "working by accident", we may still find cases where it makes sense for the runtime to use different arch for different kernels.

On the other hand, I would call this concern a bit of over-engineering, which we should solve only when there's a real need. I believe the much more important thing is to cleanly figure out the boundary of each of these modules. For example, all GPU backends may exhibit a pattern of submitting a bunch of kernels, then doing synchronization at a certain point. Such logic should be placed in a very high-level base module. That module will have a few virtual functions for each GPU backend to implement.
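The "submit a bunch of kernels, then synchronize" pattern described above could look roughly like the following sketch. `GpuRuntimeBase`, `FakeMetalRuntime`, and the `_submit`/`_wait` hooks are hypothetical names invented for this illustration; they are not existing Taichi classes.

```python
from abc import ABC, abstractmethod


class GpuRuntimeBase(ABC):
    """Hypothetical high-level base: queue kernels, then synchronize once."""

    def __init__(self):
        self._pending = []
        self.log = []

    def launch(self, kernel_name):
        # Shared policy: a launch is only recorded, never waited on.
        self._pending.append(kernel_name)

    def synchronize(self):
        # Shared policy: submit everything queued so far, then wait once.
        for name in self._pending:
            self._submit(name)
        self._pending.clear()
        self._wait()

    @abstractmethod
    def _submit(self, kernel_name):
        """Backend-specific: hand one kernel to the device API."""

    @abstractmethod
    def _wait(self):
        """Backend-specific: block until the device is idle."""


class FakeMetalRuntime(GpuRuntimeBase):
    """Stand-in backend that records calls instead of touching a GPU."""

    def _submit(self, kernel_name):
        self.log.append(f"metal: encode {kernel_name}")

    def _wait(self):
        self.log.append("metal: wait")


if __name__ == "__main__":
    rt = FakeMetalRuntime()
    rt.launch("clear_list")
    rt.launch("listgen")
    rt.synchronize()
    print(rt.log)
```

Because the launch/sync policy lives only in the base class, changing it (say, dropping a sync before launch) would affect every backend at once.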

> The only con I can think of is, since I still see a possibility where heterogeneous backends (or mixed backends?) could be useful, this might be a bit too restrictive.

That con may be solved by something like taichi/backends/common/c-like: this directory is included iff either TI_WITH_METAL or TI_WITH_OPENGL is specified.
We can also use #ifdef TI_WITH_METAL guards in common code for extra flexibility.

> For example, all GPU backends may exhibit a pattern of submitting a bunch of kernels, then doing synchronization at a certain point. Such logic should be placed in a very high-level base module. That module will have a few virtual functions for each GPU backend to implement.

+1. I believe it will save backend developers a ridiculous amount of time. It also makes our code better unified: e.g., when yuanming-hu says "no more sync before launch" and changes the class CommonBackend, every backend changes, with no need to rework the three backends separately.
The common base should also be selectable: e.g., src2src backends may want emit() in their base, while LLVM ones don't want it at all :)

> I think this depends on where you want to make that per-backend separation? I don't see much implementation being shared between these two codegens and the LLVM-based codegen (probably except for a very high-level gen() function...). If we have already encapsulated each backend's impl at a higher level, then these codegen classes can just be an implementation detail inside that higher-level abstraction.
>
> What I do think can be shared is between Metal and OpenGL codegen, just like CodeGenLLVMSIMT, because these two are C-ish languages.

Exactly. I feel like we should do a CodeGenCFamily that serves as the common base class for Metal/OpenGL codegens.
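As a rough illustration of what a shared base for C-like source backends might factor out, consider the sketch below. Everything in it is an assumption: the `emit`/`gen` methods, the `Toy*` subclass names, and the deliberately simplified kernel headers are made up for the example and do not reflect Taichi's real Metal or OpenGL codegens.

```python
class CodeGenCFamily:
    """Hypothetical shared base for backends that emit C-like source text."""

    def __init__(self):
        self._lines = []
        self._indent = 0

    def emit(self, text):
        # Shared helper: append one line at the current indentation level.
        self._lines.append("  " * self._indent + text)

    def kernel_header(self, name):
        # The only backend-specific piece in this toy example.
        raise NotImplementedError

    def gen(self, name, body_statements):
        # Shared skeleton: header, indented body statements, closing brace.
        self.emit(self.kernel_header(name) + " {")
        self._indent += 1
        for stmt in body_statements:
            self.emit(stmt + ";")
        self._indent -= 1
        self.emit("}")
        return "\n".join(self._lines)


class ToyCodeGenMetal(CodeGenCFamily):
    def kernel_header(self, name):
        return f"kernel void {name}()"  # simplified Metal-style header


class ToyCodeGenOpenGL(CodeGenCFamily):
    def kernel_header(self, name):
        return "void main()"  # simplified GLSL-style entry point


if __name__ == "__main__":
    print(ToyCodeGenMetal().gen("fill", ["x = 1"]))
    print(ToyCodeGenOpenGL().gen("fill", ["x = 1"]))
```

The statement emission and indentation bookkeeping are written once; each backend only overrides the parts where the two shading languages actually differ.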

> As for codegens, I wonder if there's ever gonna be a need where someone wants to do cross-compilation. After all, it's perfectly fine for you to generate CUDA/OpenGL/Metal code without having the runtime supporting it... (On the other hand, I'm not sure if the compiled code without Taichi runtime is useful in any way...)

Yeah, that sounds useful sometime in the future. For now we haven't got a strong request for supporting cross-compilation. Maybe some mobile developers want to compile to ARM on x64 platforms, but we need #439 first for them.

> I'm not super familiar with this part, but this sounds similar to what the MetalRuntime class does (registering kernels, caching the compiled kernels, launching the kernels, etc). I guess for Metal & OpenGL, the underlying runtime API has already taken care of the compilation part?

Let me think a bit more about this. We want modularized code, but not over-design. Maybe letting RuntimeEnv deal with everything is better.

> For example, all GPU backends may exhibit a pattern of submitting a bunch of kernels, then doing synchronization at a certain point. Such logic should be placed in a very high-level base module. That module will have a few virtual functions for each GPU backend to implement.

Yeah, I'm actually planning to make everything as async as possible in Taichi. I can't talk about it too much here since that's one of our research projects :-)

Hi, I haven't had a chance to read all this, but I'm excited about the direction we are going :) I was also thinking about how to refactor the backends, and I'm glad people are interested in the topic.

> I can't come up with clear cons of this, but please comment if you find one

It may make sharing codegen code between similar backends awkward. NVPTX/AMD, Vulkan/DX/Metal etc.

To me JITSession seems redundant. If we don't have a plan for AOT and we always know we will be jitting, making it a part of codegen may be simpler. Also I'm not sure if we need separate JIT class for cuda and cpu like JITSessionCUDA, JITSessionCPU, since both x86 and NVPTX are jitted by LLVM (no need for runtime cuda kernel compilation anymore since the legacy backend is gone).

As one of data points, here is a structure of TVM https://github.com/apache/incubator-tvm/tree/master/src

codegen/
   source/
      codegen_cuda.cc
      codegen_opencl.cc
      codegen_metal.cc
      codegen_opengl.cc
      codegen_source_base.cc (This could correspond to `CodeGenCFamily` above)
      ...
   spirv/
      codegen_spirv.cc
      ir_builder.cc
      ...
   llvm/
      codegen_llvm.cc (the base of all llvm based backends)
      codegen_amdgpu.cc
      codegen_nvptx.cc
      codegen_cpu.cc, codegen_arm.cc
      llvm_module.cc (for JIT)

runtime/
    cuda/
        cuda_module.cc
        cuda_device_api.cc
    opencl/
        opencl_module.cc
        opencl_device_api.cc
    opengl/
        ...
    vulkan/
        ...
    metal/
        ...
    rocm(amdgpu)/
        ...
    ... (similar)

> I can't come up with clear cons of this, but please comment if you find one
>
> It may make sharing codegen code between similar backends awkward. NVPTX/AMD, Vulkan/DX/Metal etc.

That's a good point. One possible solution is to have shared components in a "common" folder that is always compiled. For example, we can have a CodeGenLLVM base class that covers CPUs/NVPTX/AMD, and a CodeGenCFamily class for Vulkan/DX/Metal/OpenGL, both living in the same folder as the abstract CodeGen class. The CodeGenLLVM base class is always necessary, since we need it as the fallback CPU backend on every platform.

> To me JITSession seems redundant. If we don't have a plan for AOT and we always know we will be jitting, making it a part of codegen may be simpler. Also I'm not sure if we need separate JIT class for cuda and cpu like JITSessionCUDA, JITSessionCPU, since both x86 and NVPTX are jitted by LLVM (no need for runtime cuda kernel compilation anymore since the legacy backend is gone).

That makes sense. Let's demote JITSession. I prefer to merge it into RuntimeEnv (@k-ye) instead of CodeGen, since

  • A lot of device APIs (CUDA/Metal/OpenGL) expose interfaces for both JIT compiling and launching kernels, so it makes sense to merge JITing and runtime environment management.
  • (In the current design) CodeGen is created per kernel, yet JITSession and RuntimeEnv have the same lifetime as the program.

Also thanks for sharing the TVM folder structure!

Given all of the discussions/considerations above, I think we can go with the per-backend file organization:

  • Put common codegens, such as CodeGenLLVM/CodeGenCFamily, in taichi/codegen.
  • In taichi/backends/ create one folder for one backend and its corresponding runtime.

    • No more guards such as TI_WITH_CUDA needed, since the folder will be compiled only if TI_WITH_CUDA=TRUE

    • The default host backend (x64/ARM64) will be always shipped with libtaichi_core.so

    • Files in a backend folder, say taichi/backends/cuda will be compiled into libtaichi_cuda.so, which can be dynamically downloaded/loaded on demand. In the future the user only needs to pip install taichi instead of something like pip install taichi-cuda-10-0-with-opengl-with-metal-with-js-mac #233

> To me JITSession seems redundant. If we don't have a plan for AOT and we always know we will be jitting, making it a part of codegen may be simpler.

> A lot of device APIs (CUDA/Metal/OpenGL) expose interfaces for both JIT compiling and launching kernels, so it makes sense to merge JITing and runtime environment management.

+1. The JIT* part in Taichi currently seems to only serve the LLVM side. Since not every backend needs that level of abstraction, I think hiding them behind RuntimeEnv would be better.

> • Files in a backend folder, say taichi/backends/cuda will be compiled into libtaichi_cuda.so, which can be dynamically downloaded/loaded on demand. In the future the user only needs to pip install taichi instead of something like pip install taichi-cuda-10-0-with-opengl-with-metal-with-js-mac #233

Another way to unify the non-cuda/cuda 10.0/cuda 10.1 packages: just load (dlopen) libcudart.so at program runtime (instead of link time/load time), and use dlsym to look up the necessary CUDA function pointers.
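The runtime-loading idea can be sketched from Python with `ctypes`, whose `CDLL` constructor wraps `dlopen` on POSIX systems and whose attribute lookup resolves symbols much like `dlsym`. The helper names `try_load` and `runtime_version` are made up for this example; `cudaRuntimeGetVersion` is a CUDA runtime API function, but whether Taichi would query it this way is an assumption.

```python
import ctypes


def try_load(names):
    """Return the first library in `names` that can be loaded, else None."""
    for name in names:
        try:
            return ctypes.CDLL(name)  # dlopen() under the hood on POSIX
        except OSError:
            continue
    return None


def runtime_version(lib):
    """Query the CUDA runtime version, or return None if unavailable."""
    if lib is None:
        return None
    try:
        fn = lib.cudaRuntimeGetVersion  # symbol lookup, like dlsym()
    except AttributeError:
        return None
    version = ctypes.c_int(0)
    # cudaRuntimeGetVersion(int*) returns 0 (cudaSuccess) on success.
    if fn(ctypes.byref(version)) != 0:
        return None
    return version.value


if __name__ == "__main__":
    cudart = try_load(["libcudart.so"])
    if cudart is None:
        print("CUDA runtime not found; falling back to the CPU backend")
    else:
        print("CUDA runtime version:", runtime_version(cudart))
```

Since nothing links against the CUDA library at build time, a single package can start on machines with or without CUDA installed and decide at runtime which backends to enable.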

I also realized that eventually we may need a way to unify the runtime system on the device side. (Basically what's inside taichi/runtime, e.g. Offload tasks like clear_list, element_listgen, and methods like lookup, activate for each SNode type).

The problem is that we want to express that logic once, and emit different code for different backends. This somewhat implies another codegen here -- We could have an IR dedicated for the runtime (overkill??). Or we could express that logic in languages like Python as a blueprint, add AST visitors to emit CUDA/OpenGL/Metal, and make this part of the build process.

Neither seems super clean or satisfactory to me, yet I cannot think of another way now. I wonder if TVM or Halide could shed some insights here. On the other hand, I don't know if their systems involve such a nontrivial part on the device side. A large part of Taichi's runtime efforts is to support sparsity, but this isn't what TVM or Halide focuses on? (To me it seems like TVM/Halide's scheduling and execution problem are done on the host side). But maybe this is just due to my own ignorance...

Lastly, the end goal of removing the unified memory is to:

  • pre-allocate a large fixed amount of memory -- no dynamic allocation inside kernels anymore
  • then memory allocation (activate, append) probably just becomes incrementing some atomic integers.

This should significantly simplify the data structures and the logic of the device runtime. Maybe we can defer these decisions to when unified memory is completely gone.
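The "pre-allocate, then bump an atomic integer" scheme described above can be sketched as follows. This is a host-side toy, not device code: `BumpAllocator` is a hypothetical name, and a `threading.Lock` stands in for the hardware atomic add that a real device runtime would use.

```python
import threading


class BumpAllocator:
    """Toy allocator over a pool that is reserved once, up front."""

    def __init__(self, capacity_bytes):
        self.pool = bytearray(capacity_bytes)  # pre-allocated, never grows
        self._head = 0
        self._lock = threading.Lock()  # stand-in for an atomic fetch-and-add

    def _atomic_add(self, n):
        # Return the old head and advance it by n, as atomic_add would.
        with self._lock:
            old = self._head
            self._head += n
            return old

    def allocate(self, n):
        # Allocation is just one atomic increment plus a bounds check.
        offset = self._atomic_add(n)
        if offset + n > len(self.pool):
            raise MemoryError("pre-allocated pool exhausted")
        return offset


if __name__ == "__main__":
    alloc = BumpAllocator(64)
    print(alloc.allocate(16), alloc.allocate(16))
```

The trade-off is visible in the bounds check: once the pool is used up there is nothing to fall back on, which is why the pool size has to be chosen ahead of time.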

> I also realized that eventually we may need a way to unify the runtime system on the device side. (Basically what's inside taichi/runtime, e.g. Offload tasks like clear_list, element_listgen, and methods like lookup, activate for each SNode type).
>
> The problem is that we want to express that logic once, and emit different code for different backends. This somewhat implies another codegen here -- We could have an IR dedicated for the runtime (overkill??). Or we could express that logic in languages like Python as a blueprint, add AST visitors to emit CUDA/OpenGL/Metal, and make this part of the build process.

Yeah, I agree that unifying the runtime system is a great idea, and one direction to go is to have an IR at a smaller granularity.

Actually, in the old src2src backends, we didn't even have element_listgen as part of the IR. For every struct-for, we had to generate the snode lists from the root node of the tree to the leaf node, in C++/CUDA, as a single JITed function. Later we had an OffloadedTask system to break down the node generation to layer-by-layer OffloadedTask::list_gen's.

If we want to further break it down, we need an IR to describe the list_gen functions themselves. For example, we can actually represent OffloadedTask::list_gen as an OffloadedTask::struct_for, which essentially expands all the active parent SNode into the list of active child SNode. Then the IR would need something like push_back_active_element_into_snode_queue(int snode_id, Ptr).
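A toy model of layer-by-layer list generation may help here. It assumes each active element simply stores a list of child cells with an `active` flag. `ToyRuntime`, `cell`, and the dictionary layout are invented for this sketch; only the name `push_back_active_element_into_snode_queue` comes from the comment above.

```python
from collections import defaultdict


def cell(name, active=True, children=()):
    """Build one toy element; real SNode cells are of course not dicts."""
    return {"name": name, "active": active, "children": list(children)}


class ToyRuntime:
    def __init__(self, root):
        self.queues = defaultdict(list)  # snode_id -> active elements
        self.queues[0].append(root)      # snode 0 is the root in this toy

    def push_back_active_element_into_snode_queue(self, snode_id, ptr):
        self.queues[snode_id].append(ptr)

    def list_gen(self, parent_id, child_id):
        """A 'struct_for' over active parents that enqueues active children."""
        self.queues[child_id].clear()
        for parent in self.queues[parent_id]:
            for child in parent["children"]:
                if child["active"]:
                    self.push_back_active_element_into_snode_queue(child_id, child)


if __name__ == "__main__":
    root = cell("root", children=[
        cell("a", children=[cell("a0"), cell("a1", active=False)]),
        cell("b", active=False, children=[cell("b0")]),
    ])
    rt = ToyRuntime(root)
    rt.list_gen(0, 1)  # root -> first layer
    rt.list_gen(1, 2)  # first layer -> second layer
    print([c["name"] for c in rt.queues[2]])
```

Each `list_gen` call only looks one layer down, so children of an inactive parent are never visited.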

Breaking down the IR to this granularity still makes sense to me, since it makes it easier to develop the runtime systems. Going even further, as you said, might be an overkill, as the LLVM IR already does that, and the engineering cost looks higher than maintaining a few different runtime systems.

> I wonder if TVM or Halide could shed some insights here. On the other hand, I don't know if their systems involve such a nontrivial part on the device side. A large part of Taichi's runtime efforts is to support sparsity, but this isn't what TVM or Halide focuses on? (To me it seems like TVM/Halide's scheduling and execution problem are done on the host side). But maybe this is just due to my own ignorance...

Exactly, supporting sparse operations and dynamic small-granularity memory allocation is the key component of Taichi that gives everybody a headache. Halide/TVM do not need to do that.

> Lastly, the end goal of removing the unified memory is to:
>
>   • pre-allocate a large fixed amount of memory -- no dynamic allocation inside kernels anymore

That's very true. Pre-allocation is not the perfect solution, but I can't think of a better way. Now, without unified memory, users have to do ti.init(device_memory_GB=8) and pray their program does not run out of memory... (Note that here we are using the host/device page faulting mechanism to allow the host-side memory pool to read device-side memory allocation requests.)

>   • then memory allocation (activate, append) probably just becomes incrementing some atomic integers.

The old memory allocator (source2source) uses atomic_adds to allocate memory. This actually introduces a unified memory dependency: for each SNode with element size, say, 1 KB, we might need to reserve 8 GB of virtual memory (i.e. unified memory on CUDA) on the device just in case the atomic pointers overflow...

To get rid of the unified memory dependency, the current memory allocator strives to use atomic_add for >99% of the cases, and does something slightly more complex when an overflow happens.

> This should significantly simplify the data structures and the logic of the device runtime. Maybe we can defer these decisions to when unified memory is completely gone.

In fact, if we do not support automatically growing device memory space, then unified memory is already gone. You can disable that by setting env TI_USE_UNIFIED_MEMORY=0.

Another way to unify the runtime system, e.g. taichi/runtime/llvm/runtime.cpp: just use macros so that the runtime.cpp file can be compiled into LLVM bitcode/Metal/GLSL etc... Hacky, but it seems easier than introducing a set of low-level IR...

> If we want to further break it down, we need an IR to describe the list_gen functions themselves.
> Breaking down the IR to this granularity still makes sense to me, since it makes it easier to develop the runtime systems.

+1. The reason an IR seems overkill to me is that we don't really have a demand to support "arbitrary" behavior via this IR. Taichi has full control of list_gen's logic, and it is supposed to be stable most of the time. Maybe we can start by breaking down list_gen's logic into smaller, more atomic pieces of abstract functions for each backend to implement...

> In fact, if we do not support automatically growing device memory space, then unified memory is already gone.

Ack, but I was thinking about dynamic features like append... I wonder if it is possible to get to a state where even the NodeManager would be removed... IIUC, it is for dynamic memory allocation? Is it correct to say that features like ptr.activate() or dynamic.append() would still rely on it?

> Another way to unify the runtime system, e.g. taichi/runtime/llvm/runtime.cpp: just use macros so that the runtime.cpp file can be compiled into LLVM bitcode/Metal/GLSL etc... Hacky, but it seems easier than introducing a set of low-level IR...

+1, just painful to maintain (or maybe even to look at three months after it's committed... 😂)

> Actually, in the old src2src backends, we didn't even have element_listgen as part of the IR. For every struct-for, we had to generate the snode lists from the root node of the tree to the leaf node, in C++/CUDA, as a single JITed function. Later we had an OffloadedTask system to break down the node generation to layer-by-layer OffloadedTask::list_gen's.

I guess this will still be the case for Metal/OpenGL or any other to-source backends. And that makes me sad, because to support this device runtime, it will have to be emitted entirely as string literals as well. Or maybe it's time to consider having each backend define its runtime impl in its native shader language, which is similar to "just use macros so that the runtime.cpp file can be compiled into LLVM bitcode/Metal/GLSL etc".

> If we want to further break it down, we need an IR to describe the list_gen functions themselves. For example, we can actually represent OffloadedTask::list_gen as an OffloadedTask::struct_for, which essentially expands all the active parent SNode into the list of active child SNode. Then the IR would need something like push_back_active_element_into_snode_queue(int snode_id, Ptr).

Hmm, I'm actually thinking about one level below here. For example, struct_for for LLVM uses grid-stride loops, but for Metal we simply launch one thread per element. Ideally we can express the grid-stride loop logic at a backend-independent level, and translate that to each backend...
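The difference between the two iteration schemes can be shown with a small host-side simulation. The function names are made up for this sketch, and real backends would of course run these loops on the device rather than in Python.

```python
def grid_stride_indices(thread_id, num_threads, num_elements):
    """Elements handled by one thread in a grid-stride loop.

    Thread t processes t, t + num_threads, t + 2 * num_threads, ...
    so a fixed-size grid can cover any number of elements.
    """
    return list(range(thread_id, num_elements, num_threads))


def one_thread_per_element(num_elements):
    """The alternative scheme: launch as many threads as elements."""
    return [[i] for i in range(num_elements)]


def simulate_grid_stride(num_threads, num_elements):
    """Work assignment for every thread in the grid."""
    return [grid_stride_indices(t, num_threads, num_elements)
            for t in range(num_threads)]


if __name__ == "__main__":
    for t, work in enumerate(simulate_grid_stride(4, 10)):
        print(f"thread {t}: {work}")
```

Both schemes visit every element exactly once; they differ only in how many threads are launched and how much work each thread does.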

> Ack, but I was thinking about dynamic features like append... I wonder if it is possible to get to a state where even the NodeManager would be removed... IIUC, it is for dynamic memory allocation? Is it correct to say that features like ptr.activate() or dynamic.append() would still rely on it?

My feeling is that NodeManager is already a set of minimal infrastructure for us to support dynamic allocation... With unified memory, we can, of course, allocate 1TB virtual address space which never overflows and thereby an atomic variable is sufficient. Note that the other charge of NodeManager is to track garbage-collected nodes. I believe the current design is already rather minimal yet still remains effective...

[I posted this in the wrong issue a moment ago, sorry]

@yuanming-hu Is this expected to be complete during the v0.6 time frame? I want to come back to Taichi after the refactoring effort is done.

(I quit my job and started a master's program this month. Unfortunately I haven't had time to spend on Taichi recently. But the AMDGPU backend is still on my list!)

> @yuanming-hu Is this expected to be complete during the v0.6 time frame?

Good question. I plan to wrap this up ideally by the end of this week so that we can release v0.6 sometime next week.

> I want to come back to Taichi after the refactoring effort is done.

Welcome back! And sorry, it's taking me too long to actually finalize this...

> (I quit my job and started a master's program this month. Unfortunately I haven't had time to spend on Taichi recently. But the AMDGPU backend is still on my list!)

Enjoy your new life, and please, no rush at all! We absolutely welcome a new AMDGPU backend at any time :-)
