Concisely describe the proposed feature
I'd like to support bitmasked() on the Metal backend. This decorator is the easiest sparsity feature to support, because it does not require dynamic memory allocation on the device side.
Describe the solution you'd like (if any)
I have a seemingly working solution in https://github.com/k-ye/taichi/tree/mtlbit. It (seems to) works on a toy example, but I need more tests (maybe good to see if it can work on a modified taichi_sparse.py).
I tried to follow LLVM's runtime system as much as possible, e.g. runtime's listgen and snode's activate, is_active, etc. One thing that trips me is that, it is still not entirely clear to me how the coordinate refinement works.
That said, I think I found a simpler approach that fits under Metal backend's current implementation. Instead of storing Element inside the list for each SNode https://github.com/taichi-dev/taichi/blob/b6c6c1ee10fae81007544a6de66dcaf3659dbba0/taichi/runtime/llvm/runtime.cpp#L438-L442
The Element for the Metal backend looks like this:
struct ListgenElement {
int32_t loop_index = 0;
int32_t root_mem_offset = 0; // used by is_active()
};
Inside a struct_for kernel, we iterate through the ListManager of that SNode, and loop_index here can be used as the "thread_id". The rest of the stmt IRs already knows how to map this loop_index into the correct leaf node through a series of OffsetAndExtractBitsStmt and SNodeLookupStmt. Do you see any problem with this approach?
Additional comments
Inside a
struct_forkernel, we iterate through theListManagerof that SNode, andloop_indexhere can be used as the "thread_id". The rest of the stmt IRs already knows how to map thisloop_indexinto the correct leaf node through a series ofOffsetAndExtractBitsStmtandSNodeLookupStmt. Do you see any problem with this approach?
I think it's a space-time-scalability tradeoff issue:
loop_index.My suggestion is to stick to your Metal implementation for now, and gradually switch to the LLVM style for consistent behavior on different backends.
Meanwhile, I'll document the confusing refine_coordinate function (#597)
My suggestion is to stick to your Metal implementation for now, and gradually switch to the LLVM style for consistent behavior on different backends.
I've implemented refine_coordinate now that I get what it's doing 馃槃
Here's the ported taichi_sparse.py using bitmasked().

It looks a bit different from the original example, note the screen-burning effect (娈嬪奖?? what's the english word of it...) I'm not sure if this is due to the difference between bitmasked and pointer, or a bug in my implementation. However, when i switched to x64 and ran it again, it just crashed...
I'll do some clean up and break down the implementation for review.
BTW, I have a question about non-power-of-two sizes. Say we have the following hierarchy:
root.dense(ti.i, 3).dense(ti.i, 5)
Inside Taichi, it will be padded to POT, i.e. S1.n=4, S2.n=8.
Then, for an index at 11, logically it should belong to (S1@2, S2@1) because 2 * 5 + 1 = 11. However, due to the padding, it seems that this is actually located at (S1@1, S2@3) (1 * 8 + 3 = 11). Is this expected?
Then, for an index at
11, logically it should belong to(S1@2, S2@1)because2 * 5 + 1 = 11. However, due to the padding, it seems that this is actually located at(S1@1, S2@3)(1 * 8 + 3 = 11). Is this expected?
Yes, it is expected. Note that refine_coordinates limits all the operands within it to be powers of two. This is for performance considerations. Otherwise, we'll end up with very expensive integer division and mod...
Maybe the screen-burning is because bitmasked blocks are not filled-with-zero when re-activated?
In garbage collection, we zero-fill the deactivated blocks to make sure they are 0 when reactivated. GC is invoked after offload statements with deactivation.
(Note that we are assuming no activation/deactivation on the same SNode can happen within the same offloaded task...Currently there's no compile-time check for this though. https://github.com/taichi-dev/taichi/issues/607)
Maybe the screen-burning is because bitmasked blocks are not filled-with-zero when re-activated?
You are right :)! I skipped GC tasks completely on Metal, but didn't realize it also zero-filled the elements. After doing a .fill(0) it looks correct:

A few optimizations to consider:
refine_coordinates for each SNode. This way all the bits manipulation's operands can be baked in at compile time. It also avoids the access to extractors in the global memory.Some cleanups:
platforms/metal to backends/metal #667 metal::MetalX to just metal::XSuperseded by #678
Most helpful comment
I've implemented
refine_coordinatenow that I get what it's doing 馃槃Here's the ported
taichi_sparse.pyusingbitmasked().It looks a bit different from the original example, note the screen-burning effect (娈嬪奖?? what's the english word of it...) I'm not sure if this is due to the difference between
bitmaskedandpointer, or a bug in my implementation. However, when i switched tox64and ran it again, it just crashed...I'll do some clean up and break down the implementation for review.
BTW, I have a question about non-power-of-two sizes. Say we have the following hierarchy:
root.dense(ti.i, 3).dense(ti.i, 5)Inside Taichi, it will be padded to POT, i.e.
S1.n=4, S2.n=8.Then, for an index at
11, logically it should belong to(S1@2, S2@1)because2 * 5 + 1 = 11. However, due to the padding, it seems that this is actually located at(S1@1, S2@3)(1 * 8 + 3 = 11). Is this expected?