(Following the discussions in https://github.com/taichi-dev/taichi/pull/951#issuecomment-627558999)
Proposal
Currently, we allocate a temporary buffer for relaying variables between different offloaded tasks. Maybe we should simply allocate such a buffer in the SNode tree, so that everything is just part of the SNodes.
For example, we can allocate 6 dense SNodes each of size 512, with types i32, u32, f32, i64, u64, f64 respectively.
Benefits
GlobalTemporaryStmt can be removed
Concerns
For example, we can allocate 6 dense SNodes each of size 512, with types i32, u32, f32, i64, u64, f64 respectively.
We don't need 6 at all. We only need one with i32, and make use of ti.bit_cast.
Please check out the OpenGL implementation:
https://github.com/taichi-dev/taichi/blob/1045bc24e9f7e98687302d8864bc1ad177fe6445/taichi/backends/opengl/codegen_opengl.cpp#L116-L118
https://github.com/taichi-dev/taichi/blob/1045bc24e9f7e98687302d8864bc1ad177fe6445/taichi/backends/opengl/codegen_opengl.cpp#L292-L299
https://github.com/taichi-dev/taichi/blob/1045bc24e9f7e98687302d8864bc1ad177fe6445/taichi/backends/opengl/opengl_data_types.h#L38-L49
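To make the single-buffer idea concrete, here is a minimal sketch in plain Python of storing an f32 value in an i32 slot by reinterpreting its bits. It is not the OpenGL backend code linked above; the helper names and the `gtmp_i32` list are hypothetical stand-ins for illustration only.

```python
import struct


def f32_to_i32_bits(x: float) -> int:
    """Reinterpret the bits of an f32 value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", x))[0]


def i32_bits_to_f32(bits: int) -> float:
    """Reinterpret a signed 32-bit integer as an f32 value."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]


# Hypothetical stand-in for one i32 global temporary buffer of size 512.
gtmp_i32 = [0] * 512

gtmp_i32[3] = f32_to_i32_bits(1.5)        # store an f32 through the i32 buffer
restored = i32_bits_to_f32(gtmp_i32[3])   # load it back as f32
```

Note that every access pays for one reinterpretation, and the stored integer is not meaningful as a number on its own.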
I will tell you more about the GLSL pointer implementation at TaichiCon; you will like it.
I realized that if we consider offloaded -> kernel, then gtmp -> arg&ret.
A gtmp is basically the return value of the last offloaded task and the argument for the next one.
Then is it also worth making the arg buffer an SNode data structure? For now the args are ctx.args[0]; would moving them to the root buffer, and copying in and out on each launch, harm performance?
We don't need 6 at all. We only need one with i32, and make use of ti.bit_cast.
I wonder: if we need, say, an f32 global temp and use ti.bit_cast on the i32 buffer every time we access it, will it be slower than allocating an f32 global temp directly?
Thanks for the inputs!
We don't need 6 at all. We only need one with i32, and make use of ti.bit_cast.
Please check out the OpenGL implementation:
If we don't allocate 6 buffers, at least we will need to add a new PointerCast Stmt (e.g. i64 * to f32 *). bit_cast is not enough since there might be atomic operations directly on the global temporary buffer.
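The distinction can be sketched in plain Python, under the assumption that a pointer cast behaves like a typed view over the same storage. The 4-element buffer and the view names below are hypothetical, and Python cannot express device atomics; the sketch only shows that the f32 view addresses the storage directly, so a read-modify-write needs no separate conversion of a loaded i64 value.

```python
import struct

# Hypothetical 4 x i64 global temporary buffer (32 bytes of raw storage).
buf = bytearray(8 * 4)

# Two typed views over the SAME bytes, analogous to i64* and f32* pointers.
as_i64 = memoryview(buf).cast("q")
as_f32 = memoryview(buf).cast("f")

as_f32[0] = 1.0
as_f32[0] += 0.5  # read-modify-write directly on the f32 location

# The update is visible in the underlying storage without any copy.
first_four_bytes = bytes(buf[:4])
```

With a value-level bit cast, the same update would be load i64, convert, add, convert back, store, which is the sequence an atomic operation cannot cover as a single step.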
I realized that if we consider offloaded -> kernel, then gtmp -> arg&ret.
A gtmp is basically the return value of the last offloaded task and the argument for the next one.
Then is it also worth making the arg buffer an SNode data structure? For now the args are ctx.args[0]; would moving them to the root buffer, and copying in and out on each launch, harm performance?
This will introduce a lot more host-device synchronization and make the system run slower.
We don't need 6 at all. We only need one with i32, and make use of ti.bit_cast.
I wonder: if we need, say, an f32 global temp and use ti.bit_cast on the i32 buffer every time we access it, will it be slower than allocating an f32 global temp directly?
Yes, it will be slower and atomics won't work.
Given all the considerations, I guess the easiest solution is to allocate, say a 1024xi64 buffer and simply implement a PointerCastStmt.
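As a rough illustration of how temporaries of different types could share one 1024xi64 buffer, here is a minimal bump-allocator sketch. The thread does not say how slots would be assigned, so the `GlobalTempAllocator` class, its alignment rule, and the `SIZES` table are all assumptions made for this example.

```python
from dataclasses import dataclass

# Byte sizes of the types mentioned in the proposal.
SIZES = {"i32": 4, "u32": 4, "f32": 4, "i64": 8, "u64": 8, "f64": 8}


@dataclass
class GlobalTempAllocator:
    """Hypothetical allocator handing out aligned byte offsets in one buffer."""

    capacity_bytes: int = 1024 * 8  # 1024 x i64
    cursor: int = 0

    def alloc(self, dtype: str) -> int:
        size = SIZES[dtype]
        # Round the cursor up to a multiple of the element size.
        offset = (self.cursor + size - 1) // size * size
        if offset + size > self.capacity_bytes:
            raise MemoryError("global temporary buffer exhausted")
        self.cursor = offset + size
        return offset


allocator = GlobalTempAllocator()
a = allocator.alloc("f32")  # first slot
b = allocator.alloc("i64")  # padded up to an 8-byte boundary
c = allocator.alloc("i32")  # packed right after the i64
```

Each returned offset is where a pointer cast to the requested type would point, so a slot is naturally aligned for its own type.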
Given all the considerations, I guess the easiest solution is to allocate, say a 1024xi64 buffer and simply implement a PointerCastStmt.
Agreed. By the way, do we have void * pointers? Could we make the 1024xi64 buffer a void * buffer for generality?