Taichi: [test] `test_ad_atomic.py::test_ad_reduce` sometimes fails

Created on 20 Apr 2020 · 11 Comments · Source: taichi-dev/taichi

Describe the bug
test_ad_atomic.py::test_ad_reduce sometimes fails on AppVeyor, and it often passes when rerun.

Log/Screenshots
This part looks odd to me: $6 = get child [S0root->S5dense] $5 and $11 = get child [S0root->S3dense] $5 both take $5 as input. Can one statement link to two SNodes?

kernel {
  $0 = offloaded range_for(0, 16) block_dim=adaptive {
    <i32 x1> $1 = loop index 0
    <i32 x1> $2 = bit_extract($1 + 0, 0~4)
    <gen*x1> $3 = get root
    <i32 x1> $4 = const [0]
    <gen*x1> $5 = [S0root][root]::lookup($3, $4) activate = false
    <gen*x1> $6 = get child [S0root->S5dense] $5
    <i32 x1> $7 = bit_extract($2 + 0, 0~4)
    <gen*x1> $8 = [S5dense][dense]::lookup($6, $7) activate = false
    <f32*x1> $9 = get child [S5dense->S6place_f32] $8
    <f32 x1> $10 = global load $9
    <gen*x1> $11 = get child [S0root->S3dense] $5
    <gen*x1> $12 = [S3dense][dense]::lookup($11, $4) activate = false
    <f32*x1> $13 = get child [S3dense->S4place_f32] $12
    <f32 x1> $14 = global load $13
    <f32 x1> $15 = mul $14 $10
    <f32 x1> $16 = add $15 $15
    <f32*x1> $17 = get child [S5dense->S7place_f32] $8
    <f32 x1> $18 = atomic add($17, $16)
  }
}
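For readability, here is a hand translation of the IR above into plain Python. It is only a sketch, not Taichi output: the names x, x_grad and loss_grad are assumptions about which field each place node holds (S6 and S7 under S5dense, S4 under S3dense), and bit_extract is modelled as a simple shift-and-mask.

```python
def bit_extract(value, begin, end):
    """Return bits [begin, end) of value, modelling the IR's bit_extract."""
    return (value >> begin) & ((1 << (end - begin)) - 1)


def grad_kernel(x, x_grad, loss_grad):
    # $0 = offloaded range_for(0, 16)
    for loop_index in range(16):
        i = bit_extract(loop_index, 0, 4)  # $2 and $7
        v10 = x[i]                         # $10 = global load (S6)
        v14 = loss_grad[0]                 # $14 = global load (S4), index $4 = 0
        v15 = v14 * v10                    # $15 = mul $14 $10
        v16 = v15 + v15                    # $16 = add $15 $15
        x_grad[i] += v16                   # $18 = atomic add (S7)


x = [float(i) for i in range(16)]
x_grad = [0.0] * 16
grad_kernel(x, x_grad, [1.0])
```

Read this way, each iteration computes x_grad[i] += 2 * loss_grad * x[i], which is what the gradient of a sum of squares would look like. The S3dense pointer is only read from, never written to.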

Additional comments
https://github.com/taichi-dev/taichi/blob/9a3bb30bbb06bb38c52d87a4c34541aa5a874281/tests/python/test_ad_atomic.py#L14

I'm not sure if it's allowed to call place two times in one statement.

Labels: potential bug, welcome contribution

Most helpful comment

I think $5 here means a pointer acquired from the offset 0 from the root:

    <gen*x1> $3 = get root
    <i32 x1> $4 = const [0]
    <gen*x1> $5 = [S0root][root]::lookup($3, $4) activate = false

Correct.

And I don't understand why $5 can be used in two get child statements with different snodes here. I think S5dense and S3dense cannot have the same address.

    <gen*x1> $6 = get child [S0root->S5dense] $5
    <gen*x1> $11 = get child [S0root->S3dense] $5

$6 and $11 do not have the same address. Note that each element of S0root has an instance of S1, S3 and S5.
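One way to picture this is to treat an element of S0root as a struct whose members are S1, S3 and S5, so that get child just adds a per-child offset to the same parent pointer. The sketch below models that idea in plain Python. The offsets and the base address are made up for illustration; the real layout is chosen by the compiler.

```python
# Hypothetical byte offsets of each child inside one S0root element.
CHILD_OFFSET = {"S1dense": 0, "S3dense": 4, "S5dense": 8}


def get_child(parent_ptr, child):
    """Model of `get child [S0root->child] parent_ptr`: parent plus offset."""
    return parent_ptr + CHILD_OFFSET[child]


root_element = 0x1000                     # stands in for $5
p6 = get_child(root_element, "S5dense")   # stands in for $6
p11 = get_child(root_element, "S3dense")  # stands in for $11
```

Both calls take the same parent pointer, yet p6 and p11 differ because each child has its own offset within the parent element.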

All 11 comments

@yuanming-hu What do you think?

Thanks for looking into this. This has been a nasty issue for a while :-)

We should remove https://github.com/taichi-dev/taichi/blob/104699524cfefc2ef8bbde5d12ec7990b8a6eda6/python/taichi/lang/snode.py#L41

to avoid this confusing syntax here. Not sure if this is the cause though.

To debug, take a look at ti.get_runtime().prog.print_snode_tree()

ti.get_runtime().prog.print_snode_tree():

S0root
  S1dense
    S2place_f32
  S3dense
    S4place_f32
  S5dense
    S6place_f32
    S7place_f32

After I changed
https://github.com/taichi-dev/taichi/blob/9a3bb30bbb06bb38c52d87a4c34541aa5a874281/tests/python/test_ad_atomic.py#L14
to

ti.root.place(loss, loss.grad)
ti.root.dense(ti.i, N).place(x, x.grad)

With that change, neither the SNode tree nor the IR changed.

Thanks for the info. I took another look at the script and have no idea why it fails. We should also try to reproduce this issue reliably: if I remember correctly, this test only fails, with a small probability, when I run all the tests together (which takes a couple of minutes), and it passes when I run it alone.

Also, it only fails on Windows.
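To get a rough failure rate, the whole suite could be rerun in a loop while counting non-zero exit codes. The helper below is a minimal sketch of that idea. The function name count_failures is hypothetical, and the command you pass in (for example a pytest invocation over tests/python) is an assumption about how the suite is launched.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run cmd `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Assumed invocation of the full suite; adjust to however tests are run.
SUITE_CMD = [sys.executable, "-m", "pytest", "tests/python", "-q"]
```

Calling count_failures(SUITE_CMD, 20) on the Windows machine would then report how many of 20 full runs failed, which gives a baseline to compare against after any attempted fix.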

Do you also think that $6 = get child [S0root->S5dense] $5 together with $11 = get child [S0root->S3dense] $5 is probably the cause of the failure, or does it look fine to you?

I found it still exists if I set advanced_optimization to false:

kernel {
  $0 = offloaded range_for(0, 16) block_dim=adaptive {
    <i32 x1> $1 = const [0]
    <i32 x1> $2 = loop index 0
    <i32 x1> $3 = bit_extract($2 + 0, 0~4)
    <i32 x1> $4 = const [1]
    <i32 x1> $5 = mul $3 $4
    <i32 x1> $6 = add $1 $5
    <f32 x1> $7 = alloca
    <f32 x1> $8 = alloca
    <gen*x1> $9 = get root
    <i32 x1> $10 = linearized(ind {}, stride {})
    <gen*x1> $11 = [S0root][root]::lookup($9, $10) activate = false
    <gen*x1> $12 = get child [S0root->S5dense] $11
    <i32 x1> $13 = bit_extract($6 + 0, 0~4)
    <i32 x1> $14 = linearized(ind {$13}, stride {16})
    <gen*x1> $15 = [S5dense][dense]::lookup($12, $14) activate = false
    <f32*x1> $16 = get child [S5dense->S6place_f32] $15
    <f32 x1> $17 = global load $16
    <gen*x1> $18 = get child [S0root->S3dense] $11
    <gen*x1> $19 = [S3dense][dense]::lookup($18, $10) activate = false
    <f32*x1> $20 = get child [S3dense->S4place_f32] $19
    <f32 x1> $21 = global load $20
    <f32 x1> $22 = local load [ [$8[0]]]
    <f32 x1> $23 = add $22 $21
    <f32 x1> $24 = mul $23 $17
    <f32 x1> $25 = local load [ [$7[0]]]
    <f32 x1> $26 = add $25 $24
    <f32 x1> $27 = add $26 $24
    <f32*x1> $28 = get child [S5dense->S7place_f32] $15
    <f32 x1> $29 = atomic add($28, $27)
  }
}

($12 = get child [S0root->S5dense] $11 and $18 = get child [S0root->S3dense] $11 here)

Do you also think that $6 = get child [S0root->S5dense] $5 with $11 = get child [S0root->S3dense] $5 is probably the cause of failure

Can one statement link to two SNodes?

Could you provide greater detail? I'm not sure if I understand what link means here. Thanks.

I think $5 here means a pointer acquired from the offset 0 from the root:

    <gen*x1> $3 = get root
    <i32 x1> $4 = const [0]
    <gen*x1> $5 = [S0root][root]::lookup($3, $4) activate = false

And I don't understand why $5 can be used in two get child statements with different snodes here. I think S5dense and S3dense cannot have the same address.

    <gen*x1> $6 = get child [S0root->S5dense] $5
    <gen*x1> $11 = get child [S0root->S3dense] $5

Could this be related to an uninitialized variable? I found that #633 was caused by num_groups not being reset to 1 after each kernel invocation, which caused a random crash.
Also, I heard that Windows initializes memory differently from Linux: it fills the stack with 0xcc and the heap with 0xcd, which produces the famous 烫烫烫 and 屯屯屯 garbage text, see https://blog.csdn.net/mig_davidli/article/details/37507731.
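The fill patterns are easy to check in plain Python. The sketch below decodes the two byte patterns as GBK text and reinterprets them as little-endian f32 values, to show what an uninitialized float accumulator would contain under that assumption. As far as I know these fills come from MSVC debug builds, so whether they apply to the failing CI configuration is not established here.

```python
import struct

STACK_FILL = bytes([0xCC] * 4)  # fill pattern reported for uninitialized stack
HEAP_FILL = bytes([0xCD] * 4)   # fill pattern reported for uninitialized heap

# Decoded as GBK, each pair of fill bytes becomes one Chinese character.
stack_text = STACK_FILL.decode("gbk")
heap_text = HEAP_FILL.decode("gbk")

# Reinterpreted as a 32-bit float, each pattern is a large negative number.
stack_f32 = struct.unpack("<f", STACK_FILL)[0]
heap_f32 = struct.unpack("<f", HEAP_FILL)[0]
```

If a gradient buffer started from either value instead of zero, an atomic add on top of it would land far from the expected result, so a comparison in the test would fail.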
