Taichi: [OpenGL] Sparse data structure support

Created on 5 Apr 2020  路  26Comments  路  Source: taichi-dev/taichi

Concisely describe the proposed feature

Following the step of @k-ye (who made the Metal backend), I made the OpenGL backend.
Now Metal supports sparse structures like bitmasked, there's no reason for GL and me to stop our step here and feel it's already fullfilled.

Describe the solution you'd like (if any)
But, where to start from? Maybe pointer is a good choice?

Additional comments

feature request

All 26 comments

Reference issue #593 for Metal's implementation.

I'd suggest to start with bitmasked. It is the simplest sparse data structure because it doesn't need memory allocation on the GPU side (Not sure if this is possible on OpenGL).

The only difference between dense and bitmasked is that, the later = dense + a bit for each element marking its activation.

x = ti.var(ti.i32)
ti.root.bitmasked(ti.i, 4).place(x)

# |    i32    |    i32    |    i32    |    i32    | 0101 |
# In this case, index 2 and 0 are activated

To implement sparse data structures, LLVM's Runtime is the best reference, especially elements_lists.

https://github.com/taichi-dev/taichi/blob/811f468a8bde4911815f540cbd9ba84e9e0578e8/taichi/runtime/llvm/runtime.cpp#L495

Each SNode has a corresponding ListManager. What ListManager does is basically to record the coordinates of the active elements at that SNode level. Then, inside struct_for tasks, instead of iterating over all the coordinates, you only need to iterate through this ListManager to figure out the active elements.

Basically, you need to support two more types of offloaded tasks: clear_list and listgen. The first one is trivial, it clears ListManager. The second one is to find out the active elements and to properly initialize its ListManager.

| i32 | i32 | i32 | i32 | 0101 |

Does taichi store 0101 in another i32 array already? Or we can customize it for different backends? (GLSL have boolean arrays, can be optimized better)

Does taichi store 0101 in another i32 array already? Or we can customize it for different backends?

The later. Each backend needs to maintain there own bitmasks, just like the main data.

  • LLVM

https://github.com/taichi-dev/taichi/blob/a2f499182f76cdbabd92f05fe497e89c8726d53d/taichi/runtime/llvm/node_bitmasked.h#L19-L20

  • Metal

https://github.com/taichi-dev/taichi/blob/a2f499182f76cdbabd92f05fe497e89c8726d53d/taichi/backends/metal/shaders/runtime_utils.metal.h#L76-L77

As for the example, in additional to 4 B * 4 = 16 B allocated for the 4 i32s, you need one extra i32 to store the four bit masks (each i32 can store up to 32 bit masks). The minimum bytes you need is thus 17. However, due to the alignment requirement, it will be padded to 20 (or 24 if it is 64-bit alignment)

Each SNode has a corresponding ListManager. What ListManager does is basically to record the coordinates of the active elements at that SNode level.

Can we have no ListManager for dense snode? Since always have a_dense_snode->list_man->active_bits == [1, 1, 1,...].

Can we have no ListManager for dense snode?

I think the easiest way will be to have them, because you need to invoke listgen for every node in the path between root to the leaf. Having ListManager for dense snodes means the listgen process can take care of all kinds of SNodes homogenously.

You mean, listgen have a member function called iter_active? But note that GLSL doesn't have a keyword like class, as you can see I was faking #define S1_children(sn) (sn + i * xx) like the OO-ish S1 sn; sn.children() in Metal. To have such OO-ish ListManager not really fit GLSL very much. It's better to consider how to impl this more in compile-time but not runtime for me.

listgen process? I thought listgen is only to deal with bitmasked...
Yeah, I see llvm&metal have a heavy runtime support which contains OO-ish class and high-level abstractions...but I think abstractions for intermediate code (which focus on performance and seldomly edited by devs) sounds like an overkill. Especially true to NV GL compiler whose compile time mysteriously depends on O(n鲁) of code length...

Yeah, I see llvm&metal have a heavy runtime support which contains OO-ish class and high-level abstractions...

Not really. I think these backends really just defined a Runtime struct to hold data, plus a few methods to operate on that struct. Conceptually, it's similar to this example:

struct Foo {
  int x;
  bool y;
};

void inc_x_if_y(Foo* f) {
  if (f->y) {
    f->x++;
  }
}

This is pretty much the C way to do encapsulation. On the other hand, I think OO is mainly about inheritance (and I agree that OO is too heavy for GPU..), which is not used on either backend..

I thought GLSL would allow you to do the same (https://www.khronos.org/opengl/wiki/Data_Type_(GLSL)#Structs)?

You mean, listgen have a member function called iter_active?

listgen itself is a kernel function..

Would the Metal listgen give you more clue about what it's doing and how it's implemented? (Some util functions are defined in https://github.com/taichi-dev/taichi/blob/master/taichi/backends/metal/shaders/runtime_utils.metal.h)

https://github.com/taichi-dev/taichi/blob/ad4d7ae04f4e559e84f6dee4a64ad57c3cf0c7fb/taichi/backends/metal/shaders/runtime_kernels.metal.h#L58-L103


I just remembered you said that GLSL doesn't support pointers? For instance, I guess it's impossible to implement the above inc_x_if_y() by passing in a Foo*?

Hmm, in that case it definitely makes the coding more difficult, due to the weaker abstraction model...

I thought GLSL would allow you to do the same (https://www.khronos.org/opengl/wiki/Data_Type_(GLSL)#Structs)?

Yes it allows. However, the true difficulty is that, the struct's is only allowed to be statically declared, i.e. all data must be specified at launch-time. We can't dynamically allocate memory within a GLSL kernel.

I just remembered you said that GLSL doesn't support pointers? For instance, I guess it's impossible to implement the above inc_x_if_y() by passing in a Foo*?

Yes, that's why I want to make a less abstract runtime.

Let me explain, please see if I'm telling correct:
element_listgen(parent, child) appends all active elements in child snode to the ListManager of child.
clear_list(parent, child) remove all elements that once appended from the ListManager of child.

all data must be specified at launch-time. We can't dynamically allocate memory within a GLSL kernel.

Hmm, what do you mean by "dynamically allocating memory within a GLSL kernel"? Metal doesn't dynamically allocate memory inside the kernel. I thought if you allocate the buffer of the correct size at CPU side, and fill in data whose memory layout is identical to your custom GLSL struct, then the kernel will be able to read that data?

Is these two args specified by taichi, not metal only?

Nope, they are actually Metal only. I didn't not generate a unique kernel for every (parent, child) pair, so I have to pass in their IDs.

Let me explain, please see if I'm telling correct:

Exactly.. And you may also noticed that clear_list() doesn't not really clear anything, just resets the next counter to zero.

I found you made a memcpy to listman and it contains a whole copy of that data?

Right. This way we can simulate a "generic" list container.

simulate a "generic" list container.

Now I get more puzzled, why we need a container to store duplicated values?

Then the kernel will be able to read that data?

Yes, it will.

I didn't not generate a unique kernel for every (parent, child) pair.

I realized this requires getting parent/child SNode by their id? Sounds like another way is to generate a whole new clear_list when called for different (parent, child) pair.

Why is

import taichi as ti

ti.init(arch=ti.opengl)
ti.misc.util.set_gdb_trigger(False)

x = ti.var(ti.i32)
y = ti.var(ti.i32, shape=())
z = ti.var(ti.i32, shape=())
ti.root.bitmasked(ti.i, 4).place(x)

@ti.kernel
def func():
    y[None] = ti.is_active(x, 0)
    x[0] = 1
    z[None] = ti.is_active(x, 0)

func()
print(x[0])
print(y[None])
print(z[None])

resulted in RuntimeError: [ir.h:flatten@2304] ti.is_active only works on pointer, hash or bitmasked nodes.?

Because x is a i32 instead of bitmasked. How about using x.parent()?

Because x is a i32 instead of bitmasked. How about using x.parent()?

Cool! Now I realized that there's a distinguish between SNode and DataType's.
Now, I found ti.deactivate(x.parent(), 0) results in a segfault...

Why there is only a Warning place in https://taichi.readthedocs.io/en/latest/sparse.html? I can't find the keyword deactivate anywhere.

Right, to be clear: x is a place SNode with data type i32. x.parent() is the bitmasked node you want.

Why there is only a Warning place in https://taichi.readthedocs.io/en/latest/sparse.html? I can't find the keyword deactivate anywhere.

I haven't got a chance to do the documentation there...

I haven't got a chance to do the documentation there...

Do we have an issue opened to track this? I thought it's a key feature of taichi?

Too bad, a new comer never learn the knowledge of activate/deactivate unless he/she ask the developers how-to-play, or go view the tests/python to find out the usage.

[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-ifga5h6u
[Taichi] sandbox prepared
[Taichi] version 0.5.10, cpu only, commit ad4d7ae0, python 3.8.2
[E 04/10/20 00:18:34.723] Received signal 11 (Segmentation fault)


***********************************
* Taichi Compiler Stack Traceback *
***********************************
/tmp/taichi-ifga5h6u/taichi_core.so: taichi::Logger::error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)
/tmp/taichi-ifga5h6u/taichi_core.so: taichi::signal_handler(int)
/usr/lib/libc.so.6(+0x3bd70) [0x7f48af03cd70]
[0x7f48a36dd018]

I still don't understand why ti.deactivate(x.parent(), 0) causes a segfault on x64... Is this the wrong usage?
Stack backtrace terminated unexpectedly, may because call to a bad function pointer?

I tried x64 and it works. Could you try x64 as well?

I'll track the documentation issues in https://github.com/taichi-dev/taichi/issues/736

Yes, I'm running on x64 and got segfault. While on opengl no fault but wrong answer.

import taichi as ti

#ti.init(arch=ti.opengl)
ti.misc.util.set_gdb_trigger(False)

x = ti.var(ti.i32)
y = ti.var(ti.i32, shape=())
z = ti.var(ti.i32, shape=())
ti.root.bitmasked(ti.i, 4).place(x)

@ti.kernel
def func():
    y[None] = ti.is_active(x.parent(), 0)
    x[0] = 233
    z[None] = ti.is_active(x.parent(), 0)

func()
print(x[0])
print(y[None])
print(z[None])

Oh! Sorry, I found the problem, it's because ti.init must be called before any invocation, and I just commented to disable opengl to x64, that's not right. Maybe we can have better inform ti.init?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

g1n0st picture g1n0st  路  3Comments

archibate picture archibate  路  3Comments

Xayahp picture Xayahp  路  3Comments

liaopeiyuan picture liaopeiyuan  路  3Comments

yuanming-hu picture yuanming-hu  路  4Comments