Taichi: [Discussions] Unify temporary global variables and those in the SNode data structures

Created on 13 May 2020 · 5 Comments · Source: taichi-dev/taichi

(Following the discussions in https://github.com/taichi-dev/taichi/pull/951#issuecomment-627558999)

Proposal

Currently, we allocate a temporary buffer for relaying variables between different offloaded tasks. Maybe we should simply allocate such a buffer in the SNode tree, so that everything is just part of the SNodes.

For example, we can allocate 6 dense SNodes each of size 512, with types i32, u32, f32, i64, u64, f64 respectively.
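As a rough illustration of the footprint of this layout, the sketch below models the six buffers with Python's ctypes instead of real SNodes. The names `ELEMENT_TYPES` and `temp_buffers` are assumptions made for illustration; Taichi itself would declare these buffers through its SNode API.

```python
import ctypes

SLOTS_PER_TYPE = 512  # size of each dense buffer, as in the proposal

# One fixed-width element type per primitive type listed in the proposal.
ELEMENT_TYPES = {
    "i32": ctypes.c_int32,
    "u32": ctypes.c_uint32,
    "f32": ctypes.c_float,
    "i64": ctypes.c_int64,
    "u64": ctypes.c_uint64,
    "f64": ctypes.c_double,
}

# Hypothetical stand-in for the six dense SNodes: one typed array per type.
temp_buffers = {
    name: (ctype * SLOTS_PER_TYPE)() for name, ctype in ELEMENT_TYPES.items()
}

total_bytes = sum(ctypes.sizeof(buf) for buf in temp_buffers.values())
print(total_bytes)  # 512 * (4 + 4 + 4 + 8 + 8 + 8) = 18432 bytes
```

Every additional primitive type would add another entry to this table, which is the concern raised below.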

Benefits

  • GlobalTemporaryStmt can be removed
  • Optimization and aliasing analysis are easier

Concerns

  • What if someday we have more types?

All 5 comments

> For example, we can allocate 6 dense SNodes each of size 512, with types i32, u32, f32, i64, u64, f64 respectively.

We don't need 6 at all. We only need one with i32, and make use of ti.bit_cast.
Please check out the OpenGL implementation:
https://github.com/taichi-dev/taichi/blob/1045bc24e9f7e98687302d8864bc1ad177fe6445/taichi/backends/opengl/codegen_opengl.cpp#L116-L118
https://github.com/taichi-dev/taichi/blob/1045bc24e9f7e98687302d8864bc1ad177fe6445/taichi/backends/opengl/codegen_opengl.cpp#L292-L299
https://github.com/taichi-dev/taichi/blob/1045bc24e9f7e98687302d8864bc1ad177fe6445/taichi/backends/opengl/opengl_data_types.h#L38-L49
I will tell you more about the GLSL pointer implementation at TaichiCon; you will like it.

I realized that if we view an offloaded task as a kernel, then a gtmp (global temporary) corresponds to its arg and ret.
A gtmp is basically the return value of the previous offloaded task and the argument of the next one.
Is it then also worth making the arg buffer part of the SNode data structures? For now the arguments live in ctx.args[0]; would moving them to the root buffer, and copying them in and out on each launch, harm performance?
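The view of a global temporary as the return value of one offloaded task and the argument of the next can be sketched in plain Python. Everything here (the `gtmp` list, the two task functions, the slot index) is hypothetical and only models the data flow, not Taichi's actual runtime.

```python
# Hypothetical model: a kernel split into two offloaded tasks that
# communicate only through a shared global temporary buffer.
gtmp = [0] * 8  # stand-in for the global temporary buffer

SLOT_TOTAL = 0  # slot used to relay the value between the two tasks


def task_reduce(data):
    """First offloaded task: its "return value" is written to gtmp."""
    gtmp[SLOT_TOTAL] = sum(data)


def task_scale(data):
    """Second offloaded task: its "argument" is read back from gtmp."""
    total = gtmp[SLOT_TOTAL]
    return [x / total for x in data]


values = [1.0, 3.0, 4.0]
task_reduce(values)
normalized = task_scale(values)
print(normalized)  # [0.125, 0.375, 0.5]
```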

> We don't need 6 at all. We only need one with i32, and make use of ti.bit_cast.

I wonder: if we need, say, an f32 global temp and have to ti.bit_cast to and from the i32 buffer every time we access it, will that be slower than allocating an f32 global temp directly?
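To make the question concrete, the sketch below emulates a single i32 buffer in plain Python, with the struct module standing in for ti.bit_cast. The helper names and the `i32_buffer` list are hypothetical. It shows that every f32 access needs one reinterpretation step on the way in and another on the way out.

```python
import struct


def f32_to_i32_bits(x):
    """Reinterpret the bits of an f32 as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", x))[0]


def i32_bits_to_f32(bits):
    """Reinterpret a signed 32-bit integer as an f32."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]


i32_buffer = [0] * 512  # hypothetical single i32 global temporary buffer

# Store an f32: cast to i32 bits first, then write.
i32_buffer[0] = f32_to_i32_bits(1.5)

# Update it: read, cast to f32, compute, cast back, write.
value = i32_bits_to_f32(i32_buffer[0])
i32_buffer[0] = f32_to_i32_bits(value * 2.0)

print(hex(i32_buffer[0]))              # 0x40400000, the bit pattern of 3.0
print(i32_bits_to_f32(i32_buffer[0]))  # 3.0
```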

Thanks for the inputs!

> We don't need 6 at all. We only need one with i32, and make use of ti.bit_cast.
> Please check out the OpenGL implementation:

If we don't allocate 6 buffers, at least we will need to add a new PointerCast Stmt (e.g. i64 * to f32 *). bit_cast is not enough since there might be atomic operations directly on the global temporary buffer.
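One way to see why bit_cast alone does not cover atomics: an atomic add applied directly to the buffer operates on whatever element type the buffer has. If f32 values are stored in an integer buffer, an integer add on the stored bit pattern does not produce the floating-point sum. The sketch below emulates this in plain Python; the helper names are hypothetical and no real atomic instruction is involved.

```python
import struct


def f32_bits(x):
    """Bit pattern of an f32, as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", x))[0]


def bits_to_f32(bits):
    """The f32 whose bit pattern is the given signed 32-bit integer."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]


# The slot holds the f32 value 1.0, stored as its bit pattern 0x3F800000.
slot = f32_bits(1.0)

# What an integer atomic add on the slot would do when asked to "add 1.0":
# it adds the two bit patterns as integers.
slot += f32_bits(1.0)

integer_add_result = bits_to_f32(slot)  # reinterpret the result as f32
float_add_result = 1.0 + 1.0            # what a float atomic add should give

print(hex(slot))           # 0x7f000000
print(integer_add_result)  # 2**127, about 1.7e38
print(float_add_result)    # 2.0
```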

> I realized that if we view an offloaded task as a kernel, then a gtmp (global temporary) corresponds to its arg and ret.
> A gtmp is basically the return value of the previous offloaded task and the argument of the next one.
> Is it then also worth making the arg buffer part of the SNode data structures? For now the arguments live in ctx.args[0]; would moving them to the root buffer, and copying them in and out on each launch, harm performance?

This will introduce a lot more host-device synchronization and make the system run slower.

> We don't need 6 at all. We only need one with i32, and make use of ti.bit_cast.

> I wonder: if we need, say, an f32 global temp and have to ti.bit_cast to and from the i32 buffer every time we access it, will that be slower than allocating an f32 global temp directly?

Yes, it will be slower and atomics won't work.

Given all the considerations, I guess the easiest solution is to allocate, say, a 1024xi64 buffer and simply implement a PointerCastStmt.
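A minimal sketch of this idea, using Python's ctypes to stand in for both the buffer and the cast. `pointer_cast` is a hypothetical helper, not the eventual PointerCastStmt; it only shows that a typed view over the shared storage lets reads, writes, and in-place updates use the element's own type.

```python
import ctypes

GTMP_SLOTS = 1024
gtmp = (ctypes.c_int64 * GTMP_SLOTS)()  # the single 1024 x i64 buffer


def pointer_cast(buffer, byte_offset, ctype):
    """Hypothetical helper: a typed view of `buffer` at `byte_offset`.

    The returned object shares memory with the buffer, so no copy and
    no per-access bit reinterpretation is needed.
    """
    return ctype.from_buffer(buffer, byte_offset)


# Treat the start of slot 0 as an f32 and update it in place.
f32_slot = pointer_cast(gtmp, 0, ctypes.c_float)
f32_slot.value = 1.5
f32_slot.value += 2.0  # operates as a float, directly on the shared storage

# Treat the start of slot 1 (byte offset 8) as an i32.
i32_slot = pointer_cast(gtmp, 8, ctypes.c_int32)
i32_slot.value = -7

# Slot 2 (byte offset 16) used with the buffer's native i64 type.
pointer_cast(gtmp, 16, ctypes.c_int64).value = 2**40

print(pointer_cast(gtmp, 0, ctypes.c_float).value)  # 3.5
print(pointer_cast(gtmp, 8, ctypes.c_int32).value)  # -7
print(gtmp[2] == 2**40)                             # True
```

Using i64 as the element type keeps every slot 8-byte aligned, so any of the six primitive types fits in one slot.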

> Given all the considerations, I guess the easiest solution is to allocate, say, a 1024xi64 buffer and simply implement a PointerCastStmt.

Agreed. By the way, do we have void * pointers? Could we make the 1024xi64 buffer a void * buffer for generality?
