Taichi: Support basic sparsity SNode on Metal

Created on 14 Mar 2020 · 8Comments · Source: taichi-dev/taichi

Concisely describe the proposed feature

I'd like to support bitmasked() on the Metal backend. This decorator is the easiest sparsity feature to support, because it does not require dynamic memory allocation on the device side.

Describe the solution you'd like (if any)

I have a seemingly working solution in https://github.com/k-ye/taichi/tree/mtlbit. It (seems to) works on a toy example, but I need more tests (maybe good to see if it can work on a modified taichi_sparse.py).

I tried to follow LLVM's runtime system as much as possible, e.g. runtime's listgen and snode's activate, is_active, etc. One thing that trips me is that, it is still not entirely clear to me how the coordinate refinement works.

That said, I think I found a simpler approach that fits under Metal backend's current implementation. Instead of storing Element inside the list for each SNode https://github.com/taichi-dev/taichi/blob/b6c6c1ee10fae81007544a6de66dcaf3659dbba0/taichi/runtime/llvm/runtime.cpp#L438-L442

The Element for the Metal backend looks like this:

    struct ListgenElement {
      int32_t loop_index = 0;
      int32_t root_mem_offset = 0;  // used by is_active()
    };

Inside a struct_for kernel, we iterate through the ListManager of that SNode, and loop_index here can be used as the "thread_id". The rest of the stmt IRs already knows how to map this loop_index into the correct leaf node through a series of OffsetAndExtractBitsStmt and SNodeLookupStmt. Do you see any problem with this approach?

Additional comments

The entire runtime structs and methods are emitted as strings, making it very painful to maintain now... I'll probably switch to a source code file-based solution eventually.
Be super careful about the coronavirus & please stay at home!!

feature request mac

Source

k-ye

👍3 ❤1

Most helpful comment

My suggestion is to stick to your Metal implementation for now, and gradually switch to the LLVM style for consistent behavior on different backends.

I've implemented refine_coordinate now that I get what it's doing 😄

Here's the ported taichi_sparse.py using bitmasked().

sparse

It looks a bit different from the original example, note the screen-burning effect (残影?? what's the english word of it...) I'm not sure if this is due to the difference between bitmasked and pointer, or a bug in my implementation. However, when i switched to x64 and ran it again, it just crashed...

I'll do some clean up and break down the implementation for review.

BTW, I have a question about non-power-of-two sizes. Say we have the following hierarchy:

root.dense(ti.i, 3).dense(ti.i, 5)

Inside Taichi, it will be padded to POT, i.e. S1.n=4, S2.n=8.

Then, for an index at 11, logically it should belong to (S1@2, S2@1) because 2 * 5 + 1 = 11. However, due to the padding, it seems that this is actually located at (S1@1, S2@3) (1 * 8 + 3 = 11). Is this expected?

k-ye on 16 Mar 2020

🚀1 😄1 👍1

All 8 comments

Inside a struct_for kernel, we iterate through the ListManager of that SNode, and loop_index here can be used as the "thread_id". The rest of the stmt IRs already knows how to map this loop_index into the correct leaf node through a series of OffsetAndExtractBitsStmt and SNodeLookupStmt. Do you see any problem with this approach?

I think it's a space-time-scalability tradeoff issue:

The LLVM runtime approach stored an i32 per coordinate. Currently, taichi supports up to 8-D tensors. This means the extreme case would support 2^256 (background) voxels. The downside is more storage/bandwidth consumption per list element.
Your Metal implementation saves space but limits the total amount of background voxels to 2^32. The struct-for loops will also have to recompute the physical coordinates according to the compressed i32 loop_index.

My suggestion is to stick to your Metal implementation for now, and gradually switch to the LLVM style for consistent behavior on different backends.

Meanwhile, I'll document the confusing refine_coordinate function (#597)

yuanming-hu on 14 Mar 2020

👍1

My suggestion is to stick to your Metal implementation for now, and gradually switch to the LLVM style for consistent behavior on different backends.

I've implemented refine_coordinate now that I get what it's doing 😄

Here's the ported taichi_sparse.py using bitmasked().

sparse

I'll do some clean up and break down the implementation for review.

BTW, I have a question about non-power-of-two sizes. Say we have the following hierarchy:

root.dense(ti.i, 3).dense(ti.i, 5)

Inside Taichi, it will be padded to POT, i.e. S1.n=4, S2.n=8.

k-ye on 16 Mar 2020

🚀1 😄1 👍1

Then, for an index at 11, logically it should belong to (S1@2, S2@1) because 2 * 5 + 1 = 11. However, due to the padding, it seems that this is actually located at (S1@1, S2@3) (1 * 8 + 3 = 11). Is this expected?

Yes, it is expected. Note that refine_coordinates limits all the operands within it to be powers of two. This is for performance considerations. Otherwise, we'll end up with very expensive integer division and mod...

yuanming-hu on 16 Mar 2020

👍1

Maybe the screen-burning is because bitmasked blocks are not filled-with-zero when re-activated?

In garbage collection, we zero-fill the deactivated blocks to make sure they are 0 when reactivated. GC is invoked after offload statements with deactivation.

(Note that we are assuming no activation/deactivation on the same SNode can happen within the same offloaded task...Currently there's no compile-time check for this though. https://github.com/taichi-dev/taichi/issues/607)

yuanming-hu on 16 Mar 2020

👍1

Maybe the screen-burning is because bitmasked blocks are not filled-with-zero when re-activated?

You are right :)! I skipped GC tasks completely on Metal, but didn't realize it also zero-filled the elements. After doing a .fill(0) it looks correct:

sparse

k-ye on 16 Mar 2020

A few optimizations to consider:

[ ] Generate refine_coordinates for each SNode. This way all the bits manipulation's operands can be baked in at compile time. It also avoids the access to extractors in the global memory.
[x] Have a better way to run the hierarchical listgen. Right now each thread covers one SNode, and if that SNode has a big branch-out factor, then we are likely under-utilizing the GPU resources. Figure out a way so that the workload can be balanced in a more fine-grained way.

k-ye on 19 Mar 2020

Some cleanups:

[x] Move from platforms/metal to backends/metal #667
[x] Rename classes from metal::MetalX to just metal::X

k-ye on 27 Mar 2020

Superseded by #678

k-ye on 29 Mar 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

[bug] Testing rerun with failure no longer works with pytest 6.1.0

yuanming-hu · 3Comments

NAN check and throw exception on debug mode

g1n0st · 3Comments

[Lang] tensor as local temporary variable / do we have tensor slice support?

archibate · 4Comments

[Backend] DirectX (D3D11, HLSL) Backend

quadpixels · 3Comments

[Doc] Improve documentation

yuanming-hu · 3Comments