Taichi: [Bug] Parallel assignment going wrong

Created on 11 Jul 2020 · 6 Comments · Source: taichi-dev/taichi

The following code should print 1, 2, 3 on every line. Instead, it prints 0, 0, 0.
This seems to happen only since I updated to v0.6.18.

import taichi as ti


ti.init(arch=ti.cpu)
real = ti.f32
mat = ti.var(real, shape=(3, 16))


@ti.kernel
def do_something():
    for r in range(4):
        c = r * 4
        for i in ti.static(range(4)):
            mat[0, c], mat[1, c], mat[2, c] = 1.0, 2.0, 3.0
            c += 1
    for c in range(16):
        print(mat[0, c], mat[1, c], mat[2, c])


if __name__ == "__main__":
    do_something()
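
For comparison, here is a plain-Python sketch of what the kernel is expected to compute. It does not use Taichi at all; `expected_mat` is a hypothetical helper name, and the inner Python loop stands in for the four unrolled copies that `ti.static(range(4))` produces.

```python
# Plain-Python reference for the intended result (no Taichi involved).
ROWS, COLS = 3, 16


def expected_mat():
    """Return the 3x16 matrix the kernel in the report is meant to fill."""
    mat = [[0.0] * COLS for _ in range(ROWS)]
    for r in range(4):
        c = r * 4
        # ti.static(range(4)) unrolls the body four times; a normal loop
        # gives the same sequence of writes here.
        for _ in range(4):
            mat[0][c], mat[1][c], mat[2][c] = 1.0, 2.0, 3.0
            c += 1
    return mat


if __name__ == "__main__":
    mat = expected_mat()
    for c in range(COLS):
        print(mat[0][c], mat[1][c], mat[2][c])
```

Every column is written exactly once, so each of the 16 printed lines should read 1.0 2.0 3.0.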

potential bug

squarefk

All 6 comments

Minimal reproducible program (MRP):

import taichi as ti
ti.init(arch=ti.cpu, advanced_optimization=False, print_ir=True)
x = ti.var(ti.i32, shape=4)

@ti.kernel
def func():
    c = 0
    for i in ti.static(range(4)):
        x[c] = 1
        c += 1

func()
print(x.to_numpy())

IR (printed with print_ir=True):

[I 07/12/20 10:38:27.916] [compile_to_offloads.cpp:operator()@24] Offloaded:
kernel {
  $0 = offloaded  
  body {
    <i32 x1> $1 = alloca
    <i32 x1> $2 = const [0]
    <i32 x1> $3 : local store [$1 <- $2]
    <i32 x1> $4 = const [1]
    <i32 x1> $5 = local load [ [$1[0]]]
    <i32*x1> $6 = global ptr [S2place_i32], index [$5] activate=true
    <i32*x1> $7 : global store [$6 <- $4]  # <--- 1
    <i32 x1> $8 = alloca
    <i32 x1> $9 = atomic add($1, $4)
    <i32 x1> $10 : local store [$8 <- $9]
    <i32 x1> $11 = local load [ [$1[0]]]
    <i32*x1> $12 = global ptr [S2place_i32], index [$11] activate=true
    <i32*x1> $13 : global store [$12 <- $4]  # <--- 2
    <i32 x1> $14 = alloca
    <i32 x1> $15 = atomic add($1, $4)
    <i32 x1> $16 : local store [$14 <- $15]
    <i32 x1> $17 = local load [ [$1[0]]]
    <i32*x1> $18 = global ptr [S2place_i32], index [$17] activate=true
    <i32*x1> $19 : global store [$18 <- $4]  # <--- 3
    <i32 x1> $20 = alloca
    <i32 x1> $21 = atomic add($1, $4)
    <i32 x1> $22 : local store [$20 <- $21]
    <i32 x1> $23 = local load [ [$1[0]]]
    <i32*x1> $24 = global ptr [S2place_i32], index [$23] activate=true
    <i32*x1> $25 : global store [$24 <- $4]  # <--- 4
    <i32 x1> $26 = alloca
    <i32 x1> $27 = atomic add($1, $4)
    <i32 x1> $28 : local store [$26 <- $27]
  }
}
[I 07/12/20 10:38:27.917] [compile_to_offloads.cpp:operator()@24] Optimized by CFG:
kernel {
  $0 = offloaded  
  body {
    <i32 x1> $1 = alloca
    <i32 x1> $2 = const [0]
    <i32 x1> $3 : local store [$1 <- $2]
    <i32 x1> $4 = const [1]
    <i32*x1> $5 = global ptr [S2place_i32], index [$2] activate=true
    <i32*x1> $6 : global store [$5 <- $4]  # <--- 1
    <i32 x1> $7 = atomic add($1, $4)
    <i32 x1> $8 = local load [ [$1[0]]]
    <i32*x1> $9 = global ptr [S2place_i32], index [$8] activate=true
    <i32 x1> $10 = atomic add($1, $4)
    <i32 x1> $11 = local load [ [$1[0]]]
    <i32*x1> $12 = global ptr [S2place_i32], index [$11] activate=true
    <i32 x1> $13 = atomic add($1, $4)
    <i32 x1> $14 = local load [ [$1[0]]]
    <i32*x1> $15 = global ptr [S2place_i32], index [$14] activate=true
    <i32*x1> $16 : global store [$15 <- $4]  # <--- 4
  }
}

@xumingkuan

archibate on 12 Jul 2020

👍2
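
To make the two IR dumps above easier to compare, the sketch below replays them by hand in plain Python. It is a literal reading of the printed statements, not captured Taichi output; the function names are hypothetical, and the comments map each step back to the `$n` statements in the dumps.

```python
def run_offloaded():
    """Replay the 'Offloaded' IR: four loads of c, four global stores."""
    x = [0, 0, 0, 0]
    c = 0                  # $1 = alloca; $3: local store [$1 <- 0]
    for _ in range(4):     # the static loop is already unrolled in the IR
        idx = c            # local load [$1]
        x[idx] = 1         # global store (marked 1..4 in the dump)
        c += 1             # atomic add($1, 1)
    return x


def run_cfg_optimized():
    """Replay the 'Optimized by CFG' IR: only stores 1 and 4 remain."""
    x = [0, 0, 0, 0]
    c = 0                  # $3: local store [$1 <- 0]
    x[0] = 1               # $5/$6: index is the constant $2 = 0
    c += 1                 # $7
    idx = c                # $8 (pointer $9 is computed, but its store is gone)
    c += 1                 # $10
    idx = c                # $11 (pointer $12 is computed, but its store is gone)
    c += 1                 # $13
    idx = c                # $14
    x[idx] = 1             # $16: global store marked 4
    return x


if __name__ == "__main__":
    print(run_offloaded())      # [1, 1, 1, 1]
    print(run_cfg_optimized())  # [1, 0, 0, 1]
```

Read this way, the CFG pass dropped the stores marked 2 and 3 even though they target different elements of x.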

I see. A systematic solution would be to implement value_diff and use it to improve alias_analysis, but that will take a lot of time. I'll write a hotfix for now.

xumingkuan on 12 Jul 2020

🚀1
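
As a rough illustration of the idea in the comment above, the sketch below shows how knowing the constant difference between two indices lets an alias check answer "definitely different" instead of guessing. The thread does not show how Taichi implements value_diff or alias_analysis, so everything here (`Index`, `alias`, `eliminate_dead_stores`) is a hypothetical toy model, not Taichi's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Index:
    """An index of the form base + offset (hypothetical model)."""
    base: str    # symbolic name of a runtime value, e.g. the initial c
    offset: int  # known constant difference from that base


def alias(a: Index, b: Index) -> str:
    """Classify two indices as 'same', 'different', or 'uncertain'."""
    if a.base == b.base:
        # Same base: the constant difference decides the answer.
        return "same" if a.offset == b.offset else "different"
    # Unrelated bases: nothing can be proven either way.
    return "uncertain"


def eliminate_dead_stores(stores):
    """Drop a store only if a later store definitely hits the same index.

    `stores` is a list of Index values in program order, and the model
    assumes there are no loads in between.
    """
    kept = []
    for i, store in enumerate(stores):
        overwritten = any(alias(store, later) == "same" for later in stores[i + 1:])
        if not overwritten:
            kept.append(store)
    return kept


if __name__ == "__main__":
    # x[c], x[c + 1], x[c + 2], x[c + 3]: all four stores must survive.
    print(eliminate_dead_stores([Index("c0", k) for k in range(4)]))
    # x[c] written twice: only the last store is needed.
    print(eliminate_dead_stores([Index("c0", 0), Index("c0", 0)]))
```

The conservative rule is the point of the sketch: a store is removed only when the overwrite is proven, and "uncertain" is never treated as "same".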

Btw, I thought CFG was part of advanced_optimization, and I did turn that off?

archibate on 12 Jul 2020

> Btw, I thought CFG was part of advanced_optimization, and I did turn that off?

Oh, we have a CFG optimization pass even when advanced_optimization=False...

xumingkuan on 12 Jul 2020

Let's design optimization levels later. If there were 3 levels, which level do you think CFG would belong to?

archibate on 12 Jul 2020

> Let's design optimization levels later. If there were 3 levels, which level do you think CFG would belong to?

I don't know... Currently, as discussed in https://github.com/taichi-dev/taichi/pull/1470#issue-447815733, it's non-trivial (and probably even ill-defined) to implement something like optimization_level=0. We're also still refactoring the IR. Maybe we should design optimization levels once our IR is more mature.

xumingkuan on 12 Jul 2020

👍1
