Taichi: [Discussions] Unify temporary global variables and those in the SNode data structures

Created on 13 May 2020 · 5 Comments · Source: taichi-dev/taichi

(Following the discussions in https://github.com/taichi-dev/taichi/pull/951#issuecomment-627558999)

Proposal

Currently, we allocate a temporary buffer for relaying variables between different offloaded tasks. Maybe we should simply allocate such a buffer in the SNode tree, so that everything is just part of the SNodes.

For example, we can allocate 6 dense SNodes each of size 512, with types i32, u32, f32, i64, u64, f64 respectively.

Benefits

GlobalTemporaryStmt can be removed
Optimization and aliasing analysis are easier

Concerns

What if someday we have more types?

discussion

yuanming-hu

👍1

All 5 comments

For example, we can allocate 6 dense SNodes each of size 512, with types i32, u32, f32, i64, u64, f64 respectively.

We don't need 6 at all. We only need one with i32, and make use of ti.bit_cast.
Please check out the OpenGL implementation:
https://github.com/taichi-dev/taichi/blob/1045bc24e9f7e98687302d8864bc1ad177fe6445/taichi/backends/opengl/codegen_opengl.cpp#L116-L118
https://github.com/taichi-dev/taichi/blob/1045bc24e9f7e98687302d8864bc1ad177fe6445/taichi/backends/opengl/codegen_opengl.cpp#L292-L299
https://github.com/taichi-dev/taichi/blob/1045bc24e9f7e98687302d8864bc1ad177fe6445/taichi/backends/opengl/opengl_data_types.h#L38-L49
I will tell you more about the GLSL pointer implementation at TaichiCon; you will like it.

archibate on 13 May 2020

👍1
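
The single-i32-buffer idea above can be sketched in plain Python. This is only an illustration of bit reinterpretation, not Taichi code: the buffer `gtmp_i32`, its size of 512, and the helpers `store_f32` / `load_f32` are hypothetical names, and the standard `struct` module stands in for `ti.bit_cast`.

```python
import struct

# Hypothetical stand-in for one i32 global temporary buffer.
gtmp_i32 = [0] * 512


def f32_to_i32_bits(x: float) -> int:
    """Reinterpret the 32 bits of a float as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", x))[0]


def i32_bits_to_f32(bits: int) -> float:
    """Reinterpret the bits of a signed 32-bit integer as a float."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]


def store_f32(slot: int, x: float) -> None:
    # Every f32 store goes through a bit cast into the i32 buffer.
    gtmp_i32[slot] = f32_to_i32_bits(x)


def load_f32(slot: int) -> float:
    # Every f32 load goes through a bit cast out of the i32 buffer.
    return i32_bits_to_f32(gtmp_i32[slot])


store_f32(0, 1.5)
raw_bits = gtmp_i32[0]  # the integer the buffer actually holds
value = load_f32(0)     # the float recovered from those bits
```

Note that each access pays for an extra cast, which is the cost asked about later in the thread.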

I realized that if we think of an offloaded task as a kernel, then a gtmp (global temporary) corresponds to its arguments and return value.
A gtmp is basically the return value of the last offloaded task and the argument of the next one.
Then is it also worth making the argument buffer part of the SNode data structures? For now the arguments live in ctx.args[0]; would moving them to the root buffer and copying them in and out on each launch harm performance?

archibate on 13 May 2020

👀1 👍1
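
The "return value of one offloaded task, argument of the next" view above can be sketched in plain Python. This is a toy model under stated assumptions, not how Taichi launches tasks: `gtmp`, `task_reduce`, and `task_scale` are hypothetical names, and ordinary function calls stand in for offloaded task launches.

```python
# Hypothetical global temporary buffer shared by all offloaded tasks.
gtmp = [0] * 512
data = [3, 1, 4, 1, 5]


def task_reduce(gtmp, data):
    """First task: its 'return value' is written to gtmp[0]."""
    gtmp[0] = sum(data)


def task_scale(gtmp, data):
    """Second task: its 'argument' is read back from gtmp[0]."""
    total = gtmp[0]
    return [x / total for x in data]


# The two tasks communicate only through the shared buffer.
task_reduce(gtmp, data)
normalized = task_scale(gtmp, data)
```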

We don't need 6 at all. We only need one with i32, and make use of ti.bit_cast.

I wonder: if we need, say, an f32 global temp and apply ti.bit_cast to the i32 buffer every time we access it, will that be slower than allocating an f32 global temp directly?

xumingkuan on 14 May 2020

👍2

Thanks for the inputs!

We don't need 6 at all. We only need one with i32, and make use of ti.bit_cast.
Please check out the OpenGL implementation:

If we don't allocate 6 buffers, we will at least need to add a new PointerCastStmt (e.g. i64 * to f32 *). bit_cast is not enough, since there might be atomic operations applied directly to the global temporary buffer.

I realized that if we think of an offloaded task as a kernel, then a gtmp (global temporary) corresponds to its arguments and return value.
A gtmp is basically the return value of the last offloaded task and the argument of the next one.
Then is it also worth making the argument buffer part of the SNode data structures? For now the arguments live in ctx.args[0]; would moving them to the root buffer and copying them in and out on each launch harm performance?

This will introduce a lot more host-device synchronization and make the system run slower.

We don't need 6 at all. We only need one with i32, and make use of ti.bit_cast.

I wonder: if we need, say, an f32 global temp and apply ti.bit_cast to the i32 buffer every time we access it, will that be slower than allocating an f32 global temp directly?

Yes, it will be slower and atomics won't work.

Given all the considerations, I guess the easiest solution is to allocate, say, a 1024xi64 buffer and simply implement a PointerCastStmt.

yuanming-hu on 14 May 2020
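
The difference between a value-level bit cast and a pointer cast over one 1024xi64 buffer can be sketched in plain Python. This is an analogy only, not the proposed PointerCastStmt: `memoryview.cast` plays the role of the pointer cast, the names `raw`, `as_i64`, `as_i32`, `as_f32`, and both helper functions are hypothetical, and the sizes assume a platform where a native `int` is 4 bytes and a native `long long` is 8 bytes. Python's in-place update is not atomic either; it only marks the place where a typed address would be available to an atomic operation.

```python
import struct

# One untyped allocation the size of 1024 i64 elements.
raw = bytearray(1024 * 8)
view = memoryview(raw)

# "Pointer casts": typed views over the same bytes, no copying.
as_i64 = view.cast("q")
as_i32 = view.cast("i")
as_f32 = view.cast("f")


def add_via_bit_cast(slot: int, delta: float) -> None:
    """Load i32 bits, cast to f32, add, cast back, store.

    This is a separate load and store, so it cannot be expressed
    as a single atomic f32 add on the buffer.
    """
    bits = as_i32[slot]
    value = struct.unpack("f", struct.pack("i", bits))[0]
    value += delta
    as_i32[slot] = struct.unpack("i", struct.pack("f", value))[0]


def add_via_pointer_cast(slot: int, delta: float) -> None:
    """Update the element in place through the f32-typed view.

    A typed address like this is what an atomic f32 add would target.
    """
    as_f32[slot] += delta


as_f32[0] = 1.5
add_via_pointer_cast(0, 2.0)
add_via_bit_cast(0, 0.5)
result = as_f32[0]
```

Single-threaded, both helpers produce the same value; the distinction only matters once several threads update the same slot.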

Given all the considerations, I guess the easiest solution is to allocate, say, a 1024xi64 buffer and simply implement a PointerCastStmt.

Agreed. By the way, do we have void * pointers? Could we make the 1024xi64 buffer a void * buffer for generality?

archibate on 18 May 2020
