Describe the bug
test_ad_atomic.py::test_ad_reduce sometimes fails on AppVeyor, and it often passes when rerun.
Log/Screenshots
I find it weird here : $6 = get child [S0root->S5dense] $5 and $11 = get child [S0root->S3dense] $5. Can one statement link to two SNodes?
kernel {
$0 = offloaded range_for(0, 16) block_dim=adaptive {
<i32 x1> $1 = loop index 0
<i32 x1> $2 = bit_extract($1 + 0, 0~4)
<gen*x1> $3 = get root
<i32 x1> $4 = const [0]
<gen*x1> $5 = [S0root][root]::lookup($3, $4) activate = false
<gen*x1> $6 = get child [S0root->S5dense] $5
<i32 x1> $7 = bit_extract($2 + 0, 0~4)
<gen*x1> $8 = [S5dense][dense]::lookup($6, $7) activate = false
<f32*x1> $9 = get child [S5dense->S6place_f32] $8
<f32 x1> $10 = global load $9
<gen*x1> $11 = get child [S0root->S3dense] $5
<gen*x1> $12 = [S3dense][dense]::lookup($11, $4) activate = false
<f32*x1> $13 = get child [S3dense->S4place_f32] $12
<f32 x1> $14 = global load $13
<f32 x1> $15 = mul $14 $10
<f32 x1> $16 = add $15 $15
<f32*x1> $17 = get child [S5dense->S7place_f32] $8
<f32 x1> $18 = atomic add($17, $16)
}
}
Additional comments
https://github.com/taichi-dev/taichi/blob/9a3bb30bbb06bb38c52d87a4c34541aa5a874281/tests/python/test_ad_atomic.py#L14
I'm not sure if it's allowed to call place two times in one statement.
@yuanming-hu What do you think?
Thanks for looking into this. This has been a nasty issue for a while :-)
We should remove https://github.com/taichi-dev/taichi/blob/104699524cfefc2ef8bbde5d12ec7990b8a6eda6/python/taichi/lang/snode.py#L41
to avoid this confusing syntax here. Not sure if this is the cause though.
To debug, take a look at ti.get_runtime().prog.print_snode_tree()
ti.get_runtime().prog.print_snode_tree():
S0root
S1dense
S2place_f32
S3dense
S4place_f32
S5dense
S6place_f32
S7place_f32
After I changed
https://github.com/taichi-dev/taichi/blob/9a3bb30bbb06bb38c52d87a4c34541aa5a874281/tests/python/test_ad_atomic.py#L14
to
ti.root.place(loss, loss.grad)
ti.root.dense(ti.i, N).place(x, x.grad)
, both the snode tree and the IR didn't change.
Thanks for the info. I took another look at the script and had no idea why it fails. We should also try to stably reproduce this issue: if I remember correctly this test will only fail with, a small probability, if I run all the tests together (which takes a couple of minutes), and it will pass if I run this one alone.
Also, it only fails on Windows.
Do you also think that $6 = get child [S0root->S5dense] $5 with $11 = get child [S0root->S3dense] $5 is probably the cause of failure, or it looks good to you?
I found it still exists if I set advanced_optimization to false:
kernel {
$0 = offloaded range_for(0, 16) block_dim=adaptive {
<i32 x1> $1 = const [0]
<i32 x1> $2 = loop index 0
<i32 x1> $3 = bit_extract($2 + 0, 0~4)
<i32 x1> $4 = const [1]
<i32 x1> $5 = mul $3 $4
<i32 x1> $6 = add $1 $5
<f32 x1> $7 = alloca
<f32 x1> $8 = alloca
<gen*x1> $9 = get root
<i32 x1> $10 = linearized(ind {}, stride {})
<gen*x1> $11 = [S0root][root]::lookup($9, $10) activate = false
<gen*x1> $12 = get child [S0root->S5dense] $11
<i32 x1> $13 = bit_extract($6 + 0, 0~4)
<i32 x1> $14 = linearized(ind {$13}, stride {16})
<gen*x1> $15 = [S5dense][dense]::lookup($12, $14) activate = false
<f32*x1> $16 = get child [S5dense->S6place_f32] $15
<f32 x1> $17 = global load $16
<gen*x1> $18 = get child [S0root->S3dense] $11
<gen*x1> $19 = [S3dense][dense]::lookup($18, $10) activate = false
<f32*x1> $20 = get child [S3dense->S4place_f32] $19
<f32 x1> $21 = global load $20
<f32 x1> $22 = local load [ [$8[0]]]
<f32 x1> $23 = add $22 $21
<f32 x1> $24 = mul $23 $17
<f32 x1> $25 = local load [ [$7[0]]]
<f32 x1> $26 = add $25 $24
<f32 x1> $27 = add $26 $24
<f32*x1> $28 = get child [S5dense->S7place_f32] $15
<f32 x1> $29 = atomic add($28, $27)
}
}
($12 = get child [S0root->S5dense] $11 and $18 = get child [S0root->S3dense] $11 here)
Do you also think that
$6 = get child [S0root->S5dense] $5with$11 = get child [S0root->S3dense] $5is probably the cause of failure
Can one statement link to two
SNodes?
Could you provide greater detail? I'm not sure if I understand what link means here. Thanks.
I think $5 here means a pointer acquired from the offset 0 from the root:
<gen*x1> $3 = get root
<i32 x1> $4 = const [0]
<gen*x1> $5 = [S0root][root]::lookup($3, $4) activate = false
And I don't understand why $5 can be used in two get child statements with different snodes here. I think S5dense and S3dense cannot have the same address.
<gen*x1> $6 = get child [S0root->S5dense] $5
<gen*x1> $11 = get child [S0root->S3dense] $5
I think
$5here means a pointer acquired from the offset0from the root:<gen*x1> $3 = get root <i32 x1> $4 = const [0] <gen*x1> $5 = [S0root][root]::lookup($3, $4) activate = false
Correct.
And I don't understand why
$5can be used in twoget childstatements with different snodes here. I thinkS5denseandS3densecannot have the same address.<gen*x1> $6 = get child [S0root->S5dense] $5 <gen*x1> $11 = get child [S0root->S3dense] $5
$6 and $11 do not have the same address. Note that each element of S0root has an instance of S1, S3 and S5.
Another mysterious crash: https://ci.appveyor.com/project/yuanming-hu/taichi/builds/32373295/job/jbdelwb6ch2yj2g6#L9430
This may related to variable-not-initialized? I found #633 caused by num_groups not reset to 1 after each kernel invocation, and cause a random crash.
Also I heard win has a different memory initialize mechanism from linux: they initialize stack as 0xcc, and heap as 0xcd, causing the famous 烫烫烫 and 屯屯屯 codes, see https://blog.csdn.net/mig_davidli/article/details/37507731.
Most helpful comment
Correct.
$6 and $11 do not have the same address. Note that each element of
S0roothas an instance of S1, S3 and S5.