Concisely describe the proposed feature
Following the step of @k-ye (who made the Metal backend), I made the OpenGL backend.
Now Metal supports sparse structures like bitmasked, there's no reason for GL and me to stop our step here and feel it's already fullfilled.
Describe the solution you'd like (if any)
But, where to start from? Maybe pointer is a good choice?
Additional comments
Reference issue #593 for Metal's implementation.
I'd suggest to start with bitmasked. It is the simplest sparse data structure because it doesn't need memory allocation on the GPU side (Not sure if this is possible on OpenGL).
The only difference between dense and bitmasked is that, the later = dense + a bit for each element marking its activation.
x = ti.var(ti.i32)
ti.root.bitmasked(ti.i, 4).place(x)
# | i32 | i32 | i32 | i32 | 0101 |
# In this case, index 2 and 0 are activated
To implement sparse data structures, LLVM's Runtime is the best reference, especially elements_lists.
Each SNode has a corresponding ListManager. What ListManager does is basically to record the coordinates of the active elements at that SNode level. Then, inside struct_for tasks, instead of iterating over all the coordinates, you only need to iterate through this ListManager to figure out the active elements.
Basically, you need to support two more types of offloaded tasks: clear_list and listgen. The first one is trivial, it clears ListManager. The second one is to find out the active elements and to properly initialize its ListManager.
| i32 | i32 | i32 | i32 | 0101 |
Does taichi store 0101 in another i32 array already? Or we can customize it for different backends? (GLSL have boolean arrays, can be optimized better)
Does taichi store 0101 in another i32 array already? Or we can customize it for different backends?
The later. Each backend needs to maintain there own bitmasks, just like the main data.
As for the example, in additional to 4 B * 4 = 16 B allocated for the 4 i32s, you need one extra i32 to store the four bit masks (each i32 can store up to 32 bit masks). The minimum bytes you need is thus 17. However, due to the alignment requirement, it will be padded to 20 (or 24 if it is 64-bit alignment)
Each SNode has a corresponding ListManager. What ListManager does is basically to record the coordinates of the active elements at that SNode level.
Can we have no ListManager for dense snode? Since always have a_dense_snode->list_man->active_bits == [1, 1, 1,...].
Can we have no ListManager for dense snode?
I think the easiest way will be to have them, because you need to invoke listgen for every node in the path between root to the leaf. Having ListManager for dense snodes means the listgen process can take care of all kinds of SNodes homogenously.
You mean, listgen have a member function called iter_active? But note that GLSL doesn't have a keyword like class, as you can see I was faking #define S1_children(sn) (sn + i * xx) like the OO-ish S1 sn; sn.children() in Metal. To have such OO-ish ListManager not really fit GLSL very much. It's better to consider how to impl this more in compile-time but not runtime for me.
listgen process? I thought listgen is only to deal with bitmasked...
Yeah, I see llvm&metal have a heavy runtime support which contains OO-ish class and high-level abstractions...but I think abstractions for intermediate code (which focus on performance and seldomly edited by devs) sounds like an overkill. Especially true to NV GL compiler whose compile time mysteriously depends on O(n鲁) of code length...
Yeah, I see llvm&metal have a heavy runtime support which contains OO-ish class and high-level abstractions...
Not really. I think these backends really just defined a Runtime struct to hold data, plus a few methods to operate on that struct. Conceptually, it's similar to this example:
struct Foo {
int x;
bool y;
};
void inc_x_if_y(Foo* f) {
if (f->y) {
f->x++;
}
}
This is pretty much the C way to do encapsulation. On the other hand, I think OO is mainly about inheritance (and I agree that OO is too heavy for GPU..), which is not used on either backend..
I thought GLSL would allow you to do the same (https://www.khronos.org/opengl/wiki/Data_Type_(GLSL)#Structs)?
You mean, listgen have a member function called iter_active?
listgen itself is a kernel function..
Would the Metal listgen give you more clue about what it's doing and how it's implemented? (Some util functions are defined in https://github.com/taichi-dev/taichi/blob/master/taichi/backends/metal/shaders/runtime_utils.metal.h)
I just remembered you said that GLSL doesn't support pointers? For instance, I guess it's impossible to implement the above inc_x_if_y() by passing in a Foo*?
Hmm, in that case it definitely makes the coding more difficult, due to the weaker abstraction model...
I thought GLSL would allow you to do the same (https://www.khronos.org/opengl/wiki/Data_Type_(GLSL)#Structs)?
Yes it allows. However, the true difficulty is that, the struct's is only allowed to be statically declared, i.e. all data must be specified at launch-time. We can't dynamically allocate memory within a GLSL kernel.
I just remembered you said that GLSL doesn't support pointers? For instance, I guess it's impossible to implement the above inc_x_if_y() by passing in a Foo*?
Yes, that's why I want to make a less abstract runtime.
I see https://github.com/taichi-dev/taichi/blob/ad4d7ae04f4e559e84f6dee4a64ad57c3cf0c7fb/taichi/backends/metal/shaders/runtime_kernels.metal.h#L68-L69
and https://github.com/taichi-dev/taichi/blob/ad4d7ae04f4e559e84f6dee4a64ad57c3cf0c7fb/taichi/backends/metal/shaders/runtime_kernels.metal.h#L51
Is these two args specified by taichi, not metal only?
Let me explain, please see if I'm telling correct:
element_listgen(parent, child) appends all active elements in child snode to the ListManager of child.
clear_list(parent, child) remove all elements that once appended from the ListManager of child.
https://github.com/taichi-dev/taichi/blob/ad4d7ae04f4e559e84f6dee4a64ad57c3cf0c7fb/taichi/backends/metal/shaders/runtime_utils.metal.h#L39-L54
I found you made a memcpy to listman and it contains a whole copy of that data?
all data must be specified at launch-time. We can't dynamically allocate memory within a GLSL kernel.
Hmm, what do you mean by "dynamically allocating memory within a GLSL kernel"? Metal doesn't dynamically allocate memory inside the kernel. I thought if you allocate the buffer of the correct size at CPU side, and fill in data whose memory layout is identical to your custom GLSL struct, then the kernel will be able to read that data?
Is these two args specified by taichi, not metal only?
Nope, they are actually Metal only. I didn't not generate a unique kernel for every (parent, child) pair, so I have to pass in their IDs.
Let me explain, please see if I'm telling correct:
Exactly.. And you may also noticed that clear_list() doesn't not really clear anything, just resets the next counter to zero.
I found you made a memcpy to listman and it contains a whole copy of that data?
Right. This way we can simulate a "generic" list container.
simulate a "generic" list container.
Now I get more puzzled, why we need a container to store duplicated values?
Then the kernel will be able to read that data?
Yes, it will.
I didn't not generate a unique kernel for every (parent, child) pair.
I realized this requires getting parent/child SNode by their id? Sounds like another way is to generate a whole new clear_list when called for different (parent, child) pair.
Why is
import taichi as ti
ti.init(arch=ti.opengl)
ti.misc.util.set_gdb_trigger(False)
x = ti.var(ti.i32)
y = ti.var(ti.i32, shape=())
z = ti.var(ti.i32, shape=())
ti.root.bitmasked(ti.i, 4).place(x)
@ti.kernel
def func():
y[None] = ti.is_active(x, 0)
x[0] = 1
z[None] = ti.is_active(x, 0)
func()
print(x[0])
print(y[None])
print(z[None])
resulted in RuntimeError: [ir.h:flatten@2304] ti.is_active only works on pointer, hash or bitmasked nodes.?
Because x is a i32 instead of bitmasked. How about using x.parent()?
Because
xis ai32instead ofbitmasked. How about usingx.parent()?
Cool! Now I realized that there's a distinguish between SNode and DataType's.
Now, I found ti.deactivate(x.parent(), 0) results in a segfault...
Why there is only a Warning place in https://taichi.readthedocs.io/en/latest/sparse.html? I can't find the keyword deactivate anywhere.
Right, to be clear: x is a place SNode with data type i32. x.parent() is the bitmasked node you want.
Why there is only a
Warningplace in https://taichi.readthedocs.io/en/latest/sparse.html? I can't find the keyworddeactivateanywhere.
I haven't got a chance to do the documentation there...
I haven't got a chance to do the documentation there...
Do we have an issue opened to track this? I thought it's a key feature of taichi?
Too bad, a new comer never learn the knowledge of activate/deactivate unless he/she ask the developers how-to-play, or go view the tests/python to find out the usage.
[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-ifga5h6u
[Taichi] sandbox prepared
[Taichi] version 0.5.10, cpu only, commit ad4d7ae0, python 3.8.2
[E 04/10/20 00:18:34.723] Received signal 11 (Segmentation fault)
***********************************
* Taichi Compiler Stack Traceback *
***********************************
/tmp/taichi-ifga5h6u/taichi_core.so: taichi::Logger::error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)
/tmp/taichi-ifga5h6u/taichi_core.so: taichi::signal_handler(int)
/usr/lib/libc.so.6(+0x3bd70) [0x7f48af03cd70]
[0x7f48a36dd018]
I still don't understand why ti.deactivate(x.parent(), 0) causes a segfault on x64... Is this the wrong usage?
Stack backtrace terminated unexpectedly, may because call to a bad function pointer?
I tried x64 and it works. Could you try x64 as well?
I'll track the documentation issues in https://github.com/taichi-dev/taichi/issues/736
Yes, I'm running on x64 and got segfault. While on opengl no fault but wrong answer.
import taichi as ti
#ti.init(arch=ti.opengl)
ti.misc.util.set_gdb_trigger(False)
x = ti.var(ti.i32)
y = ti.var(ti.i32, shape=())
z = ti.var(ti.i32, shape=())
ti.root.bitmasked(ti.i, 4).place(x)
@ti.kernel
def func():
y[None] = ti.is_active(x.parent(), 0)
x[0] = 233
z[None] = ti.is_active(x.parent(), 0)
func()
print(x[0])
print(y[None])
print(z[None])
Oh! Sorry, I found the problem, it's because ti.init must be called before any invocation, and I just commented to disable opengl to x64, that's not right. Maybe we can have better inform ti.init?