Concisely describe the proposed feature
With the new extensions introduced by #581, there is a lot of room to optimize the IR. I also found some feasible optimizations that are not directly related to the new extension. For example, in this fragment of IR,
...
<f32 x1> $5 = alloca
if $26 {
...
} else {
...
}
if $26 {
...
} else {
...
}
<f32 x1> $83 = local load [ [$5[0]]] (the only statement about $5)
...
we could merge the two if's together, change $83 to const [0], and then delete $5.
A list of optimizations I have done or am going to do:
-1 & a, 0 | a (#827)
linearized (#509)
alloca's to const [0] (#662)
if's with identical condition (#668)
if's (thanks to @archibate's discussion) (#727)
WholeKernelCSE pass (#727, #1082)
WhileControlStmt with cond == const [1] (#829)
DIE for stack pop (#1324)
Additional comments
For benchmarking, we may want to introduce a temporary boolean variable as a switch for these optimizations.
Some nice slides: https://courses.cs.washington.edu/courses/cse401/08wi/lecture/opt-mark.v2.pdf
@yuanming-hu please assign me. It seems that I can't assign myself...
Awesome!! This is vitally important for improving run-time performance & reducing compilation time. Thanks for taking charge of this.
Merge adjacent if's with identical condition
What if these if's contain statements with side effects, like x = x + 1? E.g.
if (cond) x++;
if (cond) x++;
We want to obtain:
if (cond) { x++; x++; }
and the duplicated x++ can be dealt with in later passes.
Merge identical local loads if no statements between them modify the variable even if there are if's
What if the two local loads are in different blocks? E.g.
if (cond) {
print 'yes';
x = local load 233;
} else {
print 'no';
x = local load 233;
}
What if a statement appears once in the IR but runs multiple times? Should we optimize it? E.g.
while (cond) {
x = local load 233
... (no changes stored to 233)
}
We may move this out of the while.
First add an analysis pass to detect whether a block stores to an address.
Merge adjacent if's with identical condition
What if these if's contain statements with side effects, like x = x + 1? E.g.
if (cond) x++;
if (cond) x++;
We want to obtain:
if (cond) { x++; x++; }
and the duplicated x++ can be dealt with in later passes.
Exactly.
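As a rough illustration of the merge agreed on here, the following is a minimal sketch on a toy IR, not the actual Taichi pass. The tuple layout ("if", cond, then_body, else_body) and the function name merge_adjacent_ifs are hypothetical. It assumes cond is an immutable IR value (like $26 above), so the body of the first if cannot change the outcome of the second.

```python
def merge_adjacent_ifs(block):
    """Merge neighbouring ("if", cond, then, else) statements that share cond."""
    merged = []
    for stmt in block:
        prev = merged[-1] if merged else None
        if (isinstance(stmt, tuple) and stmt[0] == "if"
                and isinstance(prev, tuple) and prev[0] == "if"
                and prev[1] == stmt[1]):
            # Same condition value: concatenate the bodies branch by branch,
            # keeping the original statement order (side effects included).
            merged[-1] = ("if", prev[1], prev[2] + stmt[2], prev[3] + stmt[3])
        else:
            merged.append(stmt)
    return merged


# The x++ example from this thread (toy encoding, empty else-branches).
block = [("if", "cond", ["x++"], []), ("if", "cond", ["x++"], [])]
result = merge_adjacent_ifs(block)
```

Both x++ statements survive inside one if; deduplicating or combining them is left to later passes, as discussed.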
Merge identical local loads if no statements between them modify the variable even if there are if's
What if the two local loads are in different blocks? E.g.
if (cond) {
print 'yes';
x = local load 233;
} else {
print 'no';
x = local load 233;
}
This is non-trivial. We could analyze the common code fragment of true-branch and the false-branch, and put them outside the if, but I don't know if it would make a great difference.
What if a statement appears once in the IR but runs multiple times? Should we optimize it? E.g.
while (cond) {
x = local load 233
... (no changes stored to 233)
}
We may move this out of the while.
If cond is false, does moving it out have side effects?
First add an analysis pass to detect whether a block stores to an address.
To merge identical local loads if no statements between them modify the variable, this is not necessary: I think directly searching for modifications when we find a local load fits the code frame better. Maybe we can add this pass later if necessary.
If cond is false, does moving it out have side effects?
No, it's just a load that is never used; it will be optimized out by later passes.
How about first make:
if (cond) {
print 'yes';
x = local load 233;
} else {
print 'no';
x = local load 233;
}
to become:
if (cond) print 'yes'; else print 'no';
if (cond) xxx; else xxx;
since cond is a constant IR value, and the second if can be safely optimized out.
I just thought about a situation:
if (cond) {
print 'yes';
x = local load 233;
print 'yes';
} else {
print 'no';
x = local load 233;
print 'no';
}
I can't tell if the following is more efficient than the above:
if (cond) print 'yes'; else print 'no';
x = local load 233;
if (cond) print 'yes'; else print 'no';
(especially when the common code fragment is short relative to the rest of the branch)
We can restrict this optimization to only the first statement and the last statement of the body of if.
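To make the restricted version concrete, here is a small sketch on plain lists of statement strings. It is only an illustration: hoist_common_ends is a hypothetical name, and string equality stands in for a real statement-equivalence test.

```python
def hoist_common_ends(then_body, else_body):
    """Split off an identical first and/or last statement of two branches.

    Returns (before, then_rest, else_rest, after): statements to place
    before the if, the reduced branches, and statements to place after it.
    """
    before, after = [], []
    then_rest, else_rest = list(then_body), list(else_body)
    if then_rest and else_rest and then_rest[0] == else_rest[0]:
        before.append(then_rest.pop(0))
        else_rest.pop(0)
    if then_rest and else_rest and then_rest[-1] == else_rest[-1]:
        after.append(then_rest.pop())
        else_rest.pop()
    return before, then_rest, else_rest, after


# A shared leading load, followed by branch-specific prints.
split = hoist_common_ends(
    ["x = local load 233", "print 'yes'"],
    ["x = local load 233", "print 'no'"],
)
```

A common statement in the middle of both branches (as in the print/load/print example above) is deliberately left alone, which avoids splitting one if into two.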
@yuanming-hu What do https://github.com/taichi-dev/taichi/blob/aa90e319be3b599085495f88b660f4e987a08134/taichi/ir/ir.h#L1637 mean?
May I just ignore them when merging two adjacent if's?
Quick answer for now: yes. I'll document this in greater detail later. You don't have to worry about that until we start doing vectorization.
I just found a piece of IR:
<i32 x1> $8 = const [0]
...
if $19 {
...
<i32 x1> $25 = const [0]
...
} else {
...
<i32 x1> $40 = const [0]
...
}
I think we could optimize them all to $8. Currently void visit(ConstStmt*) searches only the statements before the current statement in the same block, so $25 cannot find $8 because they are not in the same basic block.
There are two ways to do this optimization:
Which do you think is better?
I think 2 is better. At compile time it's hard to judge whether $25 or $40 will be after $8, but it's sure that $8 is before $25 and $40.
Shall this pass (identical ConstStmt elimination) still be in BasicBlockSimplify? It won't stay within one basic block, so maybe I should implement it in Simplify?
Let's add a WholeKernelCSE (common subexpression elimination) pass then.
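A minimal sketch of the idea for constants only, on a toy nested IR. The tuple encodings and the name find_duplicate_consts are hypothetical, and data types are ignored (everything is treated as i32). A constant defined in an enclosing block is visible inside nested blocks, but a constant defined inside one branch must not leak into the other branch or into later outer statements.

```python
def find_duplicate_consts(block, visible=None, replace=None):
    """Map each redundant const id to an earlier, still-visible const id."""
    visible = dict(visible or {})  # copy: inner definitions must not leak out
    replace = {} if replace is None else replace
    for stmt in block:
        if stmt[0] == "const":
            _, stmt_id, value = stmt
            if value in visible:
                replace[stmt_id] = visible[value]
            else:
                visible[value] = stmt_id
        elif stmt[0] == "if":
            _, _cond, then_block, else_block = stmt
            find_duplicate_consts(then_block, visible, replace)
            find_duplicate_consts(else_block, visible, replace)
    return replace


# The $8 / $25 / $40 fragment from this thread.
kernel = [
    ("const", "$8", 0),
    ("if", "$19", [("const", "$25", 0)], [("const", "$40", 0)]),
]
replacements = find_duplicate_consts(kernel)
```

Here both $25 and $40 map to $8, matching the observation that $8 is certainly before both of them while neither of them is certainly before the other.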
For checking if the first statements (which can be container statements) in both branches of if are exactly the same, shall we add a function like bool same_statements(IRNode *root1, IRNode *root2) in ir.h and implement it using visitors in taichi/analysis/?
Very good question. I need to think about this a little bit. One very important IR functionality is to test if two IRNodes are equivalent. IRNode can be not only one statement but also a hierarchy. We might need to use some hashing here.
A few things to think about here
std::vector<Stmt **> Stmt::operands that keeps track of all operands of a statement in a unified manner, but the special fields (e.g. BinaryOpType BinaryOpStmt::op_type)
Expressions since they only live in the frontend.
There are 3 kinds of solutions I thought about. Denote the number of statements in the container IRNode we want to test by $n$ (if it's not a container, then n=1).
1. ... IRNode's are the same, and O(n) time in the worst case to find two IRNode's are different. I think in most cases, we can find two IRNode's are different in O(1).
2. ... IRNode's are different in O(1) in expectation, but we still need O(n) time to find two IRNode's are the same, because binary DNAs' length is Ω(n).
3. ... IRNode's are the same in O(log(n)) with some fancy data structures.
To me, I prefer the 1st solution. I think it unacceptable to spend O(depth) more time whenever modifying statements, just to avoid the worst-case O(n) time finding if two IRNode's are different: we modify statements far more often than checking if two IRNode's are equivalent.
If there is a stage that statements don't change anymore, we can build data structures for comparing IRNode's then.
Thanks for the detailed analysis. I agree with your decision and we should probably go with the 1st solution.
Meanwhile, a very easy-to-implement (and slightly hacky) way to test if two statements are equivalent:
re_id pass to minimize the statement indices
print_ir to convert the statements to an std::string
This should work for most cases (assuming the print_ir pass is doing a correct job) and can probably be implemented within 20 LoC.
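For illustration only, the renumber-then-print idea can be modelled on a toy statement list as below. This does not call the real re_id or print_ir passes; canonical_text, same_by_printing, and the (id, op, operands) tuple layout are hypothetical stand-ins.

```python
def canonical_text(stmts):
    """Renumber statement ids from $0 in definition order, then print."""
    new_id = {}
    lines = []
    for stmt_id, op, operands in stmts:
        new_id[stmt_id] = f"${len(new_id)}"
        # Operands defined outside the fragment keep their original names.
        args = " ".join(new_id.get(x, x) for x in operands)
        lines.append(f"{new_id[stmt_id]} = {op} {args}".rstrip())
    return "\n".join(lines)


def same_by_printing(a, b):
    """Two fragments are treated as equivalent if their canonical text matches."""
    return canonical_text(a) == canonical_text(b)


frag_a = [("$25", "const [0]", []), ("$26", "add", ["$25", "$8"])]
frag_b = [("$40", "const [0]", []), ("$41", "add", ["$40", "$8"])]
frag_c = [("$40", "const [0]", []), ("$41", "mul", ["$40", "$8"])]
```

frag_a and frag_b print identically after renumbering, while frag_c differs in the operator. The cost is always proportional to the fragment size, even when the two fragments differ in their first statement.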
Thanks for the hacky way, but I want to implement a reject-fast solution. I think most of the queries will be of different IRNode's.
Maybe I can implement a visitor to visit one of the IRNode's, while storing the corresponding IRNode in the visitor class?
Sounds good. I champion your decision :-)
Maybe I can implement a visitor to visit one of the IRNode's, while storing the corresponding IRNode in the visitor class?
Right, you have to use one IRNode to guide the other.
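A sketch of that lock-step traversal on toy tree nodes, assuming each node is a hypothetical (kind, fields, children) tuple. The real implementation would be a C++ visitor, and it would also need to match up statement ids between the two trees, which this sketch leaves out.

```python
def same_node(a, b):
    """Walk `a` and use it to guide the walk over `b`; stop at the first mismatch."""
    kind_a, fields_a, children_a = a
    kind_b, fields_b, children_b = b
    # Cheap local checks first, so most unequal pairs are rejected right away.
    if kind_a != kind_b or fields_a != fields_b or len(children_a) != len(children_b):
        return False
    # all() over a generator short-circuits on the first differing child.
    return all(same_node(x, y) for x, y in zip(children_a, children_b))


load = ("local_load", ("233",), [])
if_yes = ("if", ("cond",), [("print", ("yes",), []), load])
if_no = ("if", ("cond",), [("print", ("no",), []), load])
```

Comparing if_yes with if_no stops at the first child, without ever visiting the load; only two equal trees cost a full traversal.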
I wonder if this IR is valid:
<f32 x1> $238 = alloca
<f32 x1> $197 = alloca
<f32 x1> $239 : local store [$238 <- $197]
<f32 x1> $199 = ...
<f32 x1> $200 : local store [$197 <- $199]
<f32 x1> $242 = local load [ [$238[0]]]
<f32 x1> $218 = local load [ [$242[0]]]
It causes simplify.cpp to crash because the alloca here
https://github.com/taichi-dev/taichi/blob/24e76a14e3ebfc4a8ee7cc2b36d44030a75e226a/taichi/transforms/simplify.cpp#L479
is not an AllocaStmt when we are visiting $218.
Good question. LocalLoad must take Allocas as inputs. $218 is invalid.
So shall we add TI_ASSERT(...->is<AllocaStmt>()); to LocalAddress::var and LocalStoreStmt::ptr in their constructors?
Oh no, it's causing assertion failure even in the initial IR.
So shall we add TI_ASSERT(...->is<AllocaStmt>()); to LocalAddress::var and LocalStoreStmt::ptr in their constructors?
Good idea.
Oh no, it's causing assertion failure even in the initial IR.
Could you share with me more details?
Test case: test_ad_if.py test_ad_if_mutable
Part of the change set:
LocalAddress(Stmt *var, int offset) : var(var), offset(offset) {
std::cout << "local address" << std::endl;
TI_ASSERT(var->is<AllocaStmt>());
}
...
void flatten(VecStatement &ret) override {
std::cout << "from flatten" << std::endl;
ret.push_back(std::make_unique<LocalLoadStmt>(
LocalAddress(current_block->lookup_var(id), 0)));
stmt = ret.back().get();
}
Output:
Before preprocessing:
@ti.kernel
def func(i: ti.i32):
t = x[i]
if t > 0:
y[i] = t
else:
y[i] = 2 * t
After preprocessing:
def func():
i = ti.decl_scalar_arg(ti.i32)
t = ti.expr_init(ti.subscript(x, i))
if 1:
__cond = ti.chain_compare([t, 0], ['Gt'])
ti.core.begin_frontend_if(ti.Expr(__cond).ptr)
ti.core.begin_frontend_if_true()
ti.subscript(y, i).assign(t)
ti.core.pop_scope()
ti.core.begin_frontend_if_false()
ti.subscript(y, i).assign(2 * t)
ti.core.pop_scope()
[I 04/06/20 18:22:47.127] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a9f5d9347feda29776c658d0949d74f7>::operator ()@17]
Initial IR:
==========
kernel {
$0 = alloca @tmp4
@tmp4 = gbl load #@tmp0[arg[0]]
$2 = alloca @tmp5
@tmp5 = @tmp4
$4 = alloca @tmp6
@tmp6 = 0
$6 = alloca @tmp7
@tmp7 = 1
if (@tmp7 & (@tmp5 > @tmp6)) {
#@tmp2[arg[0]] = @tmp4
} else {
#@tmp2[arg[0]] = (@tmp4 * 2)
}
}
==========
from flatten
local address
local address
[E 04/06/20 18:22:47.129] [taichi/ir/ir.h:taichi::lang::LocalAddress::LocalAddress@1687] var->is<AllocaStmt>()
Still trying to find where the second local address comes from. Compiling ir.h takes minutes.
Maybe I should do the assertion only when var != nullptr?
Maybe I should do the assertion only when var != nullptr?
I assume LocalAddress'es must not have null pointers, but it would be good to be defensive.
Actually, there's a piece of Windows debugging infrastructure we can add here: could you help integrate this piece of code into taichi/system/traceback.cpp? It will give you a stack of function calls. Currently on Windows you only get an error message instead of a call stack when things crash. This makes debugging hard. Feel free to open up an issue/draft PR to track this.
#include <intrin.h>
#include <dbghelp.h>
#include <cstdio>
#include <vector>
#include <string>
#include <sstream>
#include "taichi/platform/windows/windows.h"
#pragma comment(lib, "dbghelp.lib")
// https://gist.github.com/rioki/85ca8295d51a5e0b7c56e5005b0ba8b4
//
// Debug Helpers
//
// Copyright (c) 2015 - 2017 Sean Farrell <[email protected]>
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.
//
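// DBG_TRACE is used below but not defined in this snippet; forward it to dbg::trace.
#define DBG_TRACE(...) ::dbg::trace(__VA_ARGS__)
namespace dbg {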
inline void trace(const char *msg, ...) {
char buff[1024];
va_list args;
va_start(args, msg);
vsnprintf(buff, 1024, msg, args);
OutputDebugStringA(buff);
va_end(args);
}
inline std::string basename(const std::string &file) {
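// Use size_type (not unsigned int) so the comparison with npos is reliable on 64-bit builds.
std::string::size_type i = file.find_last_of("\\/");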
if (i == std::string::npos) {
return file;
} else {
return file.substr(i + 1);
}
}
struct StackFrame {
DWORD64 address;
std::string name;
std::string module;
unsigned int line;
std::string file;
};
inline std::vector<StackFrame> stack_trace() {
#if _WIN64
DWORD machine = IMAGE_FILE_MACHINE_AMD64;
#else
DWORD machine = IMAGE_FILE_MACHINE_I386;
#endif
HANDLE process = GetCurrentProcess();
HANDLE thread = GetCurrentThread();
if (SymInitialize(process, NULL, TRUE) == FALSE) {
DBG_TRACE(__FUNCTION__ ": Failed to call SymInitialize.");
return std::vector<StackFrame>();
}
SymSetOptions(SYMOPT_LOAD_LINES);
CONTEXT context = {};
context.ContextFlags = CONTEXT_FULL;
RtlCaptureContext(&context);
#if _WIN64
STACKFRAME frame = {};
frame.AddrPC.Offset = context.Rip;
frame.AddrPC.Mode = AddrModeFlat;
frame.AddrFrame.Offset = context.Rbp;
frame.AddrFrame.Mode = AddrModeFlat;
frame.AddrStack.Offset = context.Rsp;
frame.AddrStack.Mode = AddrModeFlat;
#else
STACKFRAME frame = {};
frame.AddrPC.Offset = context.Eip;
frame.AddrPC.Mode = AddrModeFlat;
frame.AddrFrame.Offset = context.Ebp;
frame.AddrFrame.Mode = AddrModeFlat;
frame.AddrStack.Offset = context.Esp;
frame.AddrStack.Mode = AddrModeFlat;
#endif
bool first = true;
std::vector<StackFrame> frames;
while (StackWalk(machine, process, thread, &frame, &context, NULL,
SymFunctionTableAccess, SymGetModuleBase, NULL)) {
StackFrame f = {};
f.address = frame.AddrPC.Offset;
#if _WIN64
DWORD64 moduleBase = 0;
#else
DWORD moduleBase = 0;
#endif
moduleBase = SymGetModuleBase(process, frame.AddrPC.Offset);
char moduelBuff[MAX_PATH];
if (moduleBase &&
GetModuleFileNameA((HINSTANCE)moduleBase, moduelBuff, MAX_PATH)) {
f.module = basename(moduelBuff);
} else {
f.module = "Unknown Module";
}
#if _WIN64
DWORD64 offset = 0;
#else
DWORD offset = 0;
#endif
char symbolBuffer[sizeof(IMAGEHLP_SYMBOL) + 255];
PIMAGEHLP_SYMBOL symbol = (PIMAGEHLP_SYMBOL)symbolBuffer;
symbol->SizeOfStruct = (sizeof IMAGEHLP_SYMBOL) + 255;
symbol->MaxNameLength = 254;
if (SymGetSymFromAddr(process, frame.AddrPC.Offset, &offset, symbol)) {
f.name = symbol->Name;
} else {
DWORD error = GetLastError();
DBG_TRACE(__FUNCTION__ ": Failed to resolve address 0x%X: %u\n",
frame.AddrPC.Offset, error);
f.name = "Unknown Function";
}
IMAGEHLP_LINE line;
line.SizeOfStruct = sizeof(IMAGEHLP_LINE);
DWORD offset_ln = 0;
if (SymGetLineFromAddr(process, frame.AddrPC.Offset, &offset_ln, &line)) {
f.file = line.FileName;
f.line = line.LineNumber;
} else {
DWORD error = GetLastError();
DBG_TRACE(__FUNCTION__ ": Failed to resolve line for 0x%X: %u\n",
frame.AddrPC.Offset, error);
f.line = 0;
}
if (!first) {
frames.push_back(f);
}
first = false;
}
SymCleanup(process);
return frames;
}
inline void handle_assert(const char *func, const char *cond) {
std::stringstream buff;
buff << func << ": Assertion '" << cond << "' failed! \n";
buff << "\n";
std::vector<StackFrame> stack = stack_trace();
buff << "Callstack: \n";
for (unsigned int i = 0; i < stack.size(); i++) {
buff << "0x" << std::hex << stack[i].address << ": " << stack[i].name << "("
<< std::dec << stack[i].line << ") in " << stack[i].module << "\n";
}
// please replace with std::printf
MessageBoxA(NULL, buff.str().c_str(), "Assert Failed", MB_OK | MB_ICONSTOP);
abort();
}
} // namespace dbg
I just checked that var is nullptr in the second local address.
I just located that
<f32 x1> $242 = local load [ [$238[0]]]
<f32 x1> $218 = local load [ [$242[0]]]
is introduced in make_adjoint... Debugging.
This is in BackupSSA and I printed it and found that all auto allocas are indeed allocas, at least at that place -- otherwise it should trigger assertion failure.
I suspect the problem is in MakeAdjoint. Please check #726 when you are available (the output is so long that I opened a new issue for it).
Sounds good. I'm occupied until 11:59 PM but I'll take a look after that time.
Currently, the following $47 cannot be eliminated:
<i32 x1> $2 = alloca
if $22 {
<i32 x1> $47 : local store [$2 <- $46]
}
(nothing related to $2)
This is because $47 doesn't know that $2 will never be loaded.
There are 5 cases like this in test_ad_if_mutable, so we can reduce the number of statements by at least 10 (eliminating local store & alloca).
Describe the solution you'd like (if any)
I want to implement a pass that analyzes allocas (for each alloca, do store forwarding and useless local store elimination), but I don't know if I should implement it in a new pass or in an existing pass.
(global tmp vars may be similar, but the Stmts are different so they can't be implemented together)
I find BasicBlockSimplify's function quite limited -- there are 3 of its main functions (common subexpression elimination, store forwarding, useless local store elimination) I want to upgrade.
LocalLoadSearcher, LocalStoreSearcher, LocalStoreForwarder may be necessary for the upgraded common subexpression elimination/store forwarding/useless local store elimination passes. Shall we move them to analysis/?
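To illustrate the simplest part of such a pass (removing stores to an alloca that is never loaded anywhere in the kernel), here is a sketch on a toy nested IR. The tuple encodings and function names are hypothetical, and it assumes allocas are only ever accessed through local load and local store statements.

```python
def loaded_allocas(block):
    """Collect every alloca id that some local load reads, in any nested block."""
    loaded = set()
    for stmt in block:
        if stmt[0] == "load":            # ("load", id, alloca)
            loaded.add(stmt[2])
        elif stmt[0] == "if":            # ("if", cond, then_block, else_block)
            loaded |= loaded_allocas(stmt[2]) | loaded_allocas(stmt[3])
    return loaded


def drop_dead_stores(block, loaded):
    """Remove allocas that are never loaded, together with all stores to them."""
    out = []
    for stmt in block:
        if stmt[0] == "alloca" and stmt[1] not in loaded:
            continue
        if stmt[0] == "store" and stmt[2] not in loaded:   # ("store", id, alloca, value)
            continue
        if stmt[0] == "if":
            stmt = ("if", stmt[1],
                    drop_dead_stores(stmt[2], loaded),
                    drop_dead_stores(stmt[3], loaded))
        out.append(stmt)
    return out


# The $2 / $47 fragment from this thread.
kernel = [("alloca", "$2"), ("if", "$22", [("store", "$47", "$2", "$46")], [])]
cleaned = drop_dead_stores(kernel, loaded_allocas(kernel))
```

Both the store $47 and the alloca $2 disappear, which is the two-statements-per-case saving mentioned above. The now-empty if would be left for another pass to remove.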
Currently, the following $47 cannot be eliminated:
<i32 x1> $2 = alloca
if $22 {
<i32 x1> $47 : local store [$2 <- $46]
}
(nothing related to $2)
This is because $47 doesn't know that $2 will never be loaded.
There are 5 cases like this in test_ad_if_mutable, so we can reduce the number of statements by at least 10 (eliminating local store & alloca).
Describe the solution you'd like (if any)
I want to implement a pass that analyzes allocas (for each alloca, do store forwarding and useless local store elimination), but I don't know if I should implement it in a new pass or in an existing pass.
(global tmp vars may be similar, but the Stmts are different so they can't be implemented together)
I find BasicBlockSimplify's function quite limited -- there are 3 of its main functions (common subexpression elimination, store forwarding, useless local store elimination) I want to upgrade.
Thanks for spotting this. A new pass sounds better since no existing pass does this. Also, I think the logic of this pass would be complex enough to justify its own existence.
LocalLoadSearcher, LocalStoreSearcher, LocalStoreForwarder may be necessary for the upgraded common subexpression elimination/store forwarding/useless local store elimination passes. Shall we move them to analysis/?
Sounds good!!
I want to make use of AlgSimp::alg_is_one to eliminate $6 in this case:
<i32 x1> $5 = const [1]
$6 : while control nullptr, $5
(We can eliminate it even if mask is not nullptr, right?)
But should it be in the alg_simp pass?
alg_is_one
maybe alg_is_non_zero_constant :)
alg_simp pass?
I thought this is about control flow, not the algebraic expression level, so maybe it's not really related?
maybe alg_is_non_zero_constant :)
Yes... It may be clearer if the type is u1.
BTW what's the behavior of if 0.1 or while control ..., 0.1?
I found
<i32 x1> $10 = const [1]
<i32 x1> $11 = cmp_gt $6 $9
<i32 x1> $12 = bit_and $10 $11
in some IRs, but it's hard to optimize if there are neither boolean types (u1) nor logical operations (logic_and).
Maybe another way to optimize it is to change this from expr_init(True) to expr_init(-1)...
https://github.com/taichi-dev/taichi/blob/532ea3340e8c8201c97c768110be907038df7a17/python/taichi/lang/impl.py#L110
My feeling is that we should systematically fix this after we have u1 introduced...

The geometric mean of the optimization factor on the number of statements among all tests is 1.068 now.
Cool! I assume bigger means more optimized in the table. I'm curious about which test gives you < 0.75 number, and which are > 1.5?
(PS: it's almost always good to use xlabel and ylabel and title to make your plots easier to understand.)
Tests with > 1.5 boost:
test_ad_if__test_ad_if : 1.5348837209302326
test_ad_if__test_ad_if_mutable : 2.0485436893203883
test_ad_if__test_ad_if_parallel : 1.9245283018867925
test_ad_if__test_ad_if_parallel_complex : 1.625
test_continue__test_kernel_continue : 1.5844155844155845
(test_ad_if__test_ad_if_mutable should have been optimized from 105 statements to 26 statements, but there are other kernels causing ~100 statements in total that can hardly be optimized.)
Tests that become much worse (< 0.75):
test_tensor_dimensionality__test_dimensionality : 0.7463768115942029
test_tensor_reflection__test_POT : 0.7272727272727273
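For reference, a geometric mean like the 1.068 figure quoted earlier can be computed as below. This sketch applies it only to the seven ratios listed above, so its result is not the all-tests figure. How each ratio is defined is an assumption here: the thread only establishes that bigger means more optimized.

```python
import math


def geometric_mean(ratios):
    """exp of the mean of the logs; every ratio must be positive."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Only the ratios quoted in this thread (a subset of all tests).
listed = [
    1.5348837209302326, 2.0485436893203883, 1.9245283018867925, 1.625,
    1.5844155844155845, 0.7463768115942029, 0.7272727272727273,
]
subset_mean = geometric_mean(listed)
```

Unlike an arithmetic mean, a 2x improvement and a 0.5x regression cancel out exactly, which is why it suits ratios like these.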
Thanks for the report. The bad news is that we have overfit to the test_ad_if series; the good news is there is still a lot of room to improve here...
I just found that test_tensor_reflection__test_POT has no kernels in it. Figuring out what's wrong...
For test_tensor_dimensionality, it's indeed optimized:
before:
kernel {
$0 = offloaded range_for(0, 256) block_dim=adaptive {
<i32 x1> $1 = const [0]
<i32 x1> $2 = loop index 0
<i32 x1> $3 = bit_extract($2 + 0, 7~8)
<i32 x1> $4 = const [1]
<i32 x1> $5 = mul $3 $4
<i32 x1> $6 = add $1 $5
<i32 x1> $7 = bit_extract($2 + 0, 6~7)
<i32 x1> $8 = mul $7 $4
<i32 x1> $9 = add $1 $8
<i32 x1> $10 = bit_extract($2 + 0, 5~6)
<i32 x1> $11 = mul $10 $4
<i32 x1> $12 = add $1 $11
<i32 x1> $13 = bit_extract($2 + 0, 4~5)
<i32 x1> $14 = mul $13 $4
<i32 x1> $15 = add $1 $14
<i32 x1> $16 = bit_extract($2 + 0, 3~4)
<i32 x1> $17 = mul $16 $4
<i32 x1> $18 = add $1 $17
<i32 x1> $19 = bit_extract($2 + 0, 2~3)
<i32 x1> $20 = mul $19 $4
<i32 x1> $21 = add $1 $20
<i32 x1> $22 = bit_extract($2 + 0, 1~2)
<i32 x1> $23 = mul $22 $4
<i32 x1> $24 = add $1 $23
<i32 x1> $25 = bit_extract($2 + 0, 0~1)
<i32 x1> $26 = mul $25 $4
<i32 x1> $27 = add $1 $26
<i32 x1> $28 = add $6 $9
<i32 x1> $29 = add $28 $12
<i32 x1> $30 = add $29 $15
<i32 x1> $31 = add $30 $18
<i32 x1> $32 = add $31 $21
<i32 x1> $33 = add $32 $24
<i32 x1> $34 = add $33 $27
<gen*x1> $35 = get root
<i32 x1> $36 = linearized(ind {}, stride {})
<gen*x1> $37 = [S0root][root]::lookup($35, $36) activate = false
<gen*x1> $38 = get child [S0root->S1dense] $37
<i32 x1> $39 = bit_extract($6 + 0, 0~1)
<i32 x1> $40 = bit_extract($9 + 0, 0~1)
<i32 x1> $41 = bit_extract($12 + 0, 0~1)
<i32 x1> $42 = bit_extract($15 + 0, 0~1)
<i32 x1> $43 = bit_extract($18 + 0, 0~1)
<i32 x1> $44 = bit_extract($21 + 0, 0~1)
<i32 x1> $45 = bit_extract($24 + 0, 0~1)
<i32 x1> $46 = bit_extract($27 + 0, 0~1)
<i32 x1> $47 = linearized(ind {$39, $40, $41, $42, $43, $44, $45, $46},
stride {2, 2, 2, 2, 2, 2, 2, 2})
<gen*x1> $48 = [S1dense][dense]::lookup($38, $47) activate = false
<i32*x1> $49 = get child [S1dense->S2place_i32] $48
<i32 x1> $50 = atomic add($49, $34)
<i32*x1> $51 = get child [S1dense->S3place_i32] $48
<i32 x1> $52 = atomic add($51, $6)
}
}
after:
kernel {
$0 = offloaded range_for(0, 256) block_dim=adaptive {
<i32 x1> $1 = loop index 0
<i32 x1> $2 = bit_extract($1 + 0, 7~8)
<i32 x1> $3 = bit_extract($1 + 0, 6~7)
<i32 x1> $4 = bit_extract($1 + 0, 5~6)
<i32 x1> $5 = bit_extract($1 + 0, 4~5)
<i32 x1> $6 = bit_extract($1 + 0, 3~4)
<i32 x1> $7 = bit_extract($1 + 0, 2~3)
<i32 x1> $8 = bit_extract($1 + 0, 1~2)
<i32 x1> $9 = bit_extract($1 + 0, 0~1)
<i32 x1> $10 = add $2 $3
<i32 x1> $11 = add $10 $4
<i32 x1> $12 = add $11 $5
<i32 x1> $13 = add $12 $6
<i32 x1> $14 = add $13 $7
<i32 x1> $15 = add $14 $8
<i32 x1> $16 = add $15 $9
<gen*x1> $17 = get root
<i32 x1> $18 = const [0]
<gen*x1> $19 = [S0root][root]::lookup($17, $18) activate = false
<gen*x1> $20 = get child [S0root->S1dense] $19
<i32 x1> $21 = bit_extract($2 + 0, 0~1)
<i32 x1> $22 = bit_extract($3 + 0, 0~1)
<i32 x1> $23 = bit_extract($4 + 0, 0~1)
<i32 x1> $24 = bit_extract($5 + 0, 0~1)
<i32 x1> $25 = bit_extract($6 + 0, 0~1)
<i32 x1> $26 = bit_extract($7 + 0, 0~1)
<i32 x1> $27 = bit_extract($8 + 0, 0~1)
<i32 x1> $28 = bit_extract($9 + 0, 0~1)
<i32 x1> $29 = const [2]
<i32 x1> $30 = mul $27 $29
<i32 x1> $31 = add $28 $30
<i32 x1> $32 = const [4]
<i32 x1> $33 = mul $26 $32
<i32 x1> $34 = add $31 $33
<i32 x1> $35 = const [8]
<i32 x1> $36 = mul $25 $35
<i32 x1> $37 = add $34 $36
<i32 x1> $38 = const [16]
<i32 x1> $39 = mul $24 $38
<i32 x1> $40 = add $37 $39
<i32 x1> $41 = const [32]
<i32 x1> $42 = mul $23 $41
<i32 x1> $43 = add $40 $42
<i32 x1> $44 = const [64]
<i32 x1> $45 = mul $22 $44
<i32 x1> $46 = add $43 $45
<i32 x1> $47 = const [128]
<i32 x1> $48 = mul $21 $47
<i32 x1> $49 = add $46 $48
<gen*x1> $50 = [S1dense][dense]::lookup($20, $49) activate = false
<i32*x1> $51 = get child [S1dense->S2place_i32] $50
<i32 x1> $52 = atomic add($51, $16)
<i32*x1> $53 = get child [S1dense->S3place_i32] $50
<i32 x1> $54 = atomic add($53, $2)
}
}
It's just lowering linearize causing too many statements.
Well... the $21-$28 here is just the same as $2-$9, isn't it?
It's just lowering linearize causing too many statements.
I see :-) People sometimes use a cost model to assign, say, linearize a higher weight.
Well... the $21-$28 here is just the same as $2-$9, isn't it?
Right, we can add a special optimization for a bitextract that takes as input another bitextract.
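A sketch of how two nested bit extracts could be fused into one, written on plain Python integers rather than IR statements. It assumes that bit_extract(x, a~b) in the dump means "bits [a, b) of x, shifted down to bit 0"; the helper names bit_extract and fuse are hypothetical.

```python
def bit_extract(x, begin, end):
    """Bits [begin, end) of x, shifted down to bit 0."""
    return (x >> begin) & ((1 << (end - begin)) - 1)


def fuse(outer, inner):
    """Range on x equivalent to bit_extract(bit_extract(x, *inner), *outer)."""
    (outer_begin, outer_end), (inner_begin, inner_end) = outer, inner
    # The outer range is relative to the inner result, and the inner result
    # has no bits at or above inner_end - inner_begin, so clamp to inner_end.
    begin = min(inner_begin + outer_begin, inner_end)
    end = min(inner_begin + outer_end, inner_end)
    return begin, end


# $21 = bit_extract($2 + 0, 0~1) with $2 = bit_extract($1 + 0, 7~8)
fused = fuse((0, 1), (7, 8))
```

For the case in the dump, the fused range is the inner range itself, so $21 would be the same value as $2 and could be replaced by it.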
I just found that test_tensor_reflection__test_POT has no kernels in it. Figuring out what's wrong...
Maybe we shouldn't have printed stats in the destructor of Program. We compile some kernels after Program finalizes, so we've got some dislocations on statistics and test names.
Look at this log (the Windows fatal exception always appears at the beginning, and it doesn't seem to matter):
C:\Users\xmk\Desktop\taichi\tests\python>pytest -s test_tensor_dimensionality.py
============================================================================== test session starts ===============================================================================
platform win32 -- Python 3.7.4, pytest-5.0.1, py-1.8.0, pluggy-0.12.0
rootdir: C:\Users\xmk\Desktop\taichi\tests\python
plugins: arraydiff-0.3, doctestplus-0.3.0, openfiles-0.3.2, remotedata-0.3.2
collecting ... [Taichi] mode=development
[Taichi] <dev mode>, supported archs: [cpu only], commit d7610347, python 3.7.4
Windows fatal exception: code 0xc0000138
Current thread 0x00002798 (most recent call first):
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 1043 in create_module
File "<frozen importlib._bootstrap>", line 583 in module_from_spec
File "<frozen importlib._bootstrap>", line 670 in _load_unlocked
File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 983 in _find_and_load
File "D:\Anaconda3\lib\site-packages\torch\__init__.py", line 81 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 728 in exec_module
File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 983 in _find_and_load
File "C:\Users\xmk\Desktop\taichi\python\taichi\lang\util.py", line 7 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 728 in exec_module
File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 983 in _find_and_load
File "C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py", line 2 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 728 in exec_module
File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 983 in _find_and_load
File "C:\Users\xmk\Desktop\taichi\python\taichi\lang\impl.py", line 3 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 728 in exec_module
File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 983 in _find_and_load
File "C:\Users\xmk\Desktop\taichi\python\taichi\lang\__init__.py", line 1 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 728 in exec_module
File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 983 in _find_and_load
File "C:\Users\xmk\Desktop\taichi\python\taichi\__init__.py", line 17 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 728 in exec_module
File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 983 in _find_and_load
File "C:\Users\xmk\Desktop\taichi\tests\python\test_tensor_dimensionality.py", line 1 in <module>
File "D:\Anaconda3\lib\site-packages\_pytest\assertion\rewrite.py", line 149 in exec_module
File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 983 in _find_and_load
File "D:\Anaconda3\lib\site-packages\py\_path\local.py", line 701 in pyimport
File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 498 in _importtestmodule
File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 431 in _getobj
File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 250 in obj
File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 446 in _inject_setup_module_fixture
File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 434 in collect
File "D:\Anaconda3\lib\site-packages\_pytest\runner.py", line 247 in <lambda>
File "D:\Anaconda3\lib\site-packages\_pytest\runner.py", line 220 in from_call
File "D:\Anaconda3\lib\site-packages\_pytest\runner.py", line 247 in pytest_make_collect_report
File "D:\Anaconda3\lib\site-packages\pluggy\callers.py", line 187 in _multicall
File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 81 in <lambda>
File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 87 in _hookexec
File "D:\Anaconda3\lib\site-packages\pluggy\hooks.py", line 289 in __call__
File "D:\Anaconda3\lib\site-packages\_pytest\runner.py", line 368 in collect_one_node
File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 717 in genitems
File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 498 in _perform_collect
File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 459 in perform_collect
File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 266 in pytest_collection
File "D:\Anaconda3\lib\site-packages\pluggy\callers.py", line 187 in _multicall
File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 81 in <lambda>
File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 87 in _hookexec
File "D:\Anaconda3\lib\site-packages\pluggy\hooks.py", line 289 in __call__
File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 256 in _main
File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 213 in wrap_session
File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 250 in pytest_cmdline_main
File "D:\Anaconda3\lib\site-packages\pluggy\callers.py", line 187 in _multicall
File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 81 in <lambda>
File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 87 in _hookexec
File "D:\Anaconda3\lib\site-packages\pluggy\hooks.py", line 289 in __call__
File "D:\Anaconda3\lib\site-packages\_pytest\config\__init__.py", line 74 in main
File "D:\Anaconda3\Scripts\pytest-script.py", line 10 in <module>
collected 1 item
test_tensor_dimensionality.py Running test on arch=Arch.x64
[T 04/22/20 19:03:48.499] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:48.500] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:48.501] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:48.502] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:48.503] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:48.503] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb6ad10) arch=x64 initialized.
[T 04/22/20 19:03:48.511] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:49.220] [llvm_context.cpp:taichi::lang::compile_runtime_bitcode@137] Compiling runtime module bitcode...
[T 04/22/20 19:03:49.617] [llvm_context.cpp:taichi::lang::compile_runtime_bitcode@152] runtime module bitcode compiled.
[T 04/22/20 19:03:49.892] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:49.893] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000742 s
[T 04/22/20 19:03:49.894] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 32 B
[T 04/22/20 19:03:49.916] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:49.916] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:49.947] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:49.948] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:49.951] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:49.963] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:49.964] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:49.965] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:49.978] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:49.980] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:49.981] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:49.987] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:49.987] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:49.988] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:49.990] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:49.990] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:49.991] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:50.003] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:50.034] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c4_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:50.118] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements : 58.00
[T 04/22/20 19:03:50.127] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb6ad10) finalized.
[T 04/22/20 19:03:50.132] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:50.133] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:50.134] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:50.134] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:50.135] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:50.135] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb66010) arch=x64 initialized.
[T 04/22/20 19:03:50.141] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:50.397] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:50.398] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000711 s
[T 04/22/20 19:03:50.398] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 64 B
[T 04/22/20 19:03:50.423] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:50.423] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:50.443] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:50.444] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.444] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:50.451] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:50.451] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.454] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:50.456] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:50.458] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.459] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:50.461] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:50.461] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.462] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:50.466] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:50.466] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:50.469] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:50.483] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:50.525] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c6_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:50.610] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements : 74.00
[T 04/22/20 19:03:50.617] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb66010) finalized.
[T 04/22/20 19:03:50.622] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:50.622] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:50.623] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:50.627] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:50.628] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:50.628] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb68b60) arch=x64 initialized.
[T 04/22/20 19:03:50.635] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:50.880] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:50.881] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000716 s
[T 04/22/20 19:03:50.883] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 128 B
[T 04/22/20 19:03:50.905] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:50.905] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:50.926] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:50.927] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.928] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:50.934] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:50.935] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.937] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:50.938] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:50.942] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.943] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:50.944] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:50.945] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.947] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:50.949] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:50.952] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:50.953] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:50.963] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:50.995] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c8_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:51.075] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements : 90.00
[T 04/22/20 19:03:51.083] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb68b60) finalized.
[T 04/22/20 19:03:51.089] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:51.089] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:51.090] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:51.096] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:51.097] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:51.101] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb6b1e0) arch=x64 initialized.
[T 04/22/20 19:03:51.106] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:51.341] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:51.341] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000685 s
[T 04/22/20 19:03:51.342] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 256 B
[T 04/22/20 19:03:51.360] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:51.361] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:51.387] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:51.388] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.389] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:51.391] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:51.392] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.392] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:51.394] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:51.395] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.396] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:51.397] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:51.398] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.399] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:51.404] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:51.405] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:51.405] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:51.416] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:51.448] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c10_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:51.541] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements : 106.00
[T 04/22/20 19:03:51.550] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb6b1e0) finalized.
[T 04/22/20 19:03:51.556] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:51.557] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:51.558] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:51.558] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:51.559] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:51.559] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb68b60) arch=x64 initialized.
[T 04/22/20 19:03:51.565] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:51.808] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:51.808] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000695 s
[T 04/22/20 19:03:51.809] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 512 B
[T 04/22/20 19:03:51.825] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:51.826] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:51.843] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:51.844] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.849] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:51.861] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:51.863] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.864] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:51.877] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:51.879] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.880] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:51.885] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:51.885] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.886] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:51.888] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:51.889] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:51.890] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:51.907] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:51.944] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c12_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:52.027] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements : 122.00
[T 04/22/20 19:03:52.035] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb68b60) finalized.
[T 04/22/20 19:03:52.040] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:52.041] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:52.042] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:52.048] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:52.049] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:52.053] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb6a370) arch=x64 initialized.
[T 04/22/20 19:03:52.059] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:52.300] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:52.301] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000793 s
[T 04/22/20 19:03:52.301] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 1024 B
[T 04/22/20 19:03:52.319] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:52.320] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:52.338] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:52.339] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.340] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:52.346] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:52.347] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.349] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:52.351] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:52.355] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.356] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:52.358] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:52.359] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.360] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:52.366] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:52.367] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:52.367] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:52.381] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:52.421] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c14_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:52.508] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements : 138.00
[T 04/22/20 19:03:52.516] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb6a370) finalized.
[T 04/22/20 19:03:52.521] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:52.521] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:52.523] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:52.527] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:52.530] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:52.530] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb68b60) arch=x64 initialized.
[T 04/22/20 19:03:52.540] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:52.779] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:52.780] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000759 s
[T 04/22/20 19:03:52.781] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 2048 B
[T 04/22/20 19:03:52.797] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:52.797] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:52.818] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:52.820] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.820] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:52.834] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:52.835] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.836] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:52.849] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:52.851] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.852] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:52.865] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:52.866] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.868] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:52.874] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:52.875] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:52.876] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113] Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:52.891] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:52.934] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c16_0...
.
============================================================================ 1 passed in 5.47 seconds ============================================================================
The statistics for the final kernel compiled (Compiling kernel fill_c16_0) are mistakenly written to test_tensor_reflection__test_POT when running ti test.
@yuanming-hu do you have any ideas to fix this? Having tests and statistics mismatched looks bad.
We compile some kernels after Program finalizes, so the statistics end up attributed to the wrong test names.
Interesting - I think the real issue here is kernels should always be compiled when a program exists. I'm looking into test_tensor_dimensionality - do you have an idea of how a kernel is compiled without a program?
Oh, the kernel is compiled with a program, but it seems that the last Program's destructor is never called.
Maybe that is because Windows fatal exception: code 0xc0000138 happens before the program's dtor? Note that the outputs are not guaranteed to be sorted in wall-clock time when you have multiple output streams...
If I use ti test -v rather than pytest -s, Windows fatal exception doesn't happen. Would you please print something in Program's constructors and destructors to double-check if the last Program is not destructed?
Setting the environment variable TI_LOG_LEVEL=trace will do the job.
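To double-check from a trace log whether every Program was finalized, something like the following could help. It is only a sketch: it assumes the log contains lines of the form "Program (0x...) arch=... initialized." and "Program (0x...) finalized." as in the output above, and the helper name unfinalized_programs is made up for illustration. Since addresses are reused across Programs, it pairs each finalize with the earliest still-live initialization at that address.

```python
import re

# Patterns assumed from the trace output pasted in this thread.
INIT_RE = re.compile(r"Program \((0x[0-9a-fA-F]+)\) arch=\S+ initialized\.")
FINAL_RE = re.compile(r"Program \((0x[0-9a-fA-F]+)\) finalized\.")


def unfinalized_programs(log_text):
    """Return addresses of Programs that were initialized but never finalized.

    Hypothetical helper: walks the log in order, so a reused address is
    handled by matching each 'finalized' line to one live 'initialized' line.
    """
    live = []
    for line in log_text.splitlines():
        m = INIT_RE.search(line)
        if m:
            live.append(m.group(1))
            continue
        m = FINAL_RE.search(line)
        if m and m.group(1) in live:
            live.remove(m.group(1))
    return live
```

Running it over the captured output of TI_LOG_LEVEL=trace pytest -s would list any Program whose finalizer never logged. Keep in mind that lines from different output streams may be interleaved out of order.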
With TI_LOG_LEVEL=trace, it looks like each test ends with Compiling kernel ... rather than Program finalizing...
BTW, the number of Program constructions and destructions can exceed the number of tests in files like test_tensor_dimensionality.py, yet we record the statistics of only one of those Programs.
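To see how many statistics blocks a single test actually produces, a rough sketch like the one below can aggregate them from the captured output. It assumes the "current_test: ..." and "codegen_... : value" line shapes shown in the log above. The function name stats_per_test and the choice to sum values across Programs are illustrative assumptions, not what ti test currently does.

```python
import re
from collections import defaultdict

# Line shapes assumed from the pasted output, e.g.
#   current_test: test_tensor_dimensionality.py::test_dimensionality (call)
#   codegen_statements : 58.00
TEST_RE = re.compile(r"^current_test:\s*(.+?)\s*$")
STAT_RE = re.compile(r"^(codegen_\w+)\s*:\s*([0-9.]+)\s*$")


def stats_per_test(log_text):
    """Sum every codegen_* statistic under the most recent current_test line."""
    totals = defaultdict(lambda: defaultdict(float))
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        m = TEST_RE.match(line)
        if m:
            current = m.group(1)
            continue
        m = STAT_RE.match(line)
        if m and current is not None:
            totals[current][m.group(1)] += float(m.group(2))
    # Convert nested defaultdicts to plain dicts for easier printing/comparison.
    return {test: dict(stats) for test, stats in totals.items()}
```

Summing is just one possible policy. Recording only the last Program's numbers, or keeping a list per test, would be equally easy to express with the same parsing.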
On Linux it looks fine to me:
pytest -s test_tensor_dimensionality.py
=================================================== test session starts ====================================================
platform linux -- Python 3.6.9, pytest-5.4.1, py-1.8.0, pluggy-0.13.0
rootdir: /home/yuanming/repos/taichi/tests/python
plugins: arraydiff-0.3, forked-1.1.3, remotedata-0.3.2, xdist-1.31.0, doctestplus-0.4.0, openfiles-0.4.0
collecting ... [Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-le6az2sv
[Taichi] sandbox prepared
[T 04/22/20 20:17:34.188] [cuda_driver.cpp:CUDADriver@44] CUDA driver API (v10.2) loaded.
[Taichi] <dev mode>, supported archs: [cpu, cuda, opengl], commit 01391957, python 3.6.9
collected 1 item
test_tensor_dimensionality.py Running test on arch=Arch.x64
[T 04/22/20 20:17:34.399] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:34.399] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:34.399] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:34.399] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:34.400] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:34.400] [program.cpp:Program@133] Program (0x56273a7f02d0) arch=x64 initialized.
[T 04/22/20 20:17:34.401] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:34.406] [llvm_context.cpp:compile_runtime_bitcode@137] Compiling runtime module bitcode...
[T 04/22/20 20:17:34.622] [llvm_context.cpp:compile_runtime_bitcode@152] runtime module bitcode compiled.
[T 04/22/20 20:17:34.731] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:34.731] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 3.29e-05 s
[T 04/22/20 20:17:34.731] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 32 B
[T 04/22/20 20:17:34.739] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:34.739] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:34.749] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:34.749] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:34.749] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f4170974000
[D 04/22/20 20:17:34.750] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:34.750] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:34.750] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f4170977000
[D 04/22/20 20:17:34.751] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:34.751] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:34.751] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f417097a000
[D 04/22/20 20:17:34.752] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:34.752] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:34.752] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f417097d000
[D 04/22/20 20:17:34.753] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:34.753] [memory_pool.cpp:daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 20:17:34.753] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f4170980000
[T 04/22/20 20:17:34.758] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:34.772] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c4_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:34.814] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:34.815] [program.cpp:finalize@512] Program (0x56273a7f02d0) finalized.
[T 04/22/20 20:17:34.817] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:34.829] [cuda_context.cpp:CUDAContext@25] Using CUDA device [id=0]: GeForce GTX 1080 Ti
[T 04/22/20 20:17:34.829] [cuda_context.cpp:CUDAContext@33] CUDA Device Compute Capability: 6.1
[T 04/22/20 20:17:34.875] [cuda_context.cpp:CUDAContext@38] Total memory 10.91 GB; free memory 9.74 GB
[T 04/22/20 20:17:34.875] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:34.876] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:34.876] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:34.876] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:34.876] [program.cpp:Program@133] Program (0x56273abbbac0) arch=cuda initialized.
[T 04/22/20 20:17:34.877] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:34.986] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:34.986] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:34.986] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:34.986] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:34.995] [llvm_context.cpp:compile_runtime_bitcode@137] Compiling runtime module bitcode...
[T 04/22/20 20:17:35.218] [llvm_context.cpp:compile_runtime_bitcode@152] runtime module bitcode compiled.
[T 04/22/20 20:17:35.414] [jit_cuda.cpp:add_module@90] PTX size: 82.11KB
[T 04/22/20 20:17:35.414] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:35.419] [jit_cuda.cpp:add_module@97] CUDA module load time : 4.472970962524414ms
[T 04/22/20 20:17:35.419] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:35.419] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 6.01e-05 s
[T 04/22/20 20:17:35.419] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 32 B
[T 04/22/20 20:17:35.428] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:35.428] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:35.437] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:35.437] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:35.437] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:35.439] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:35.439] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:35.439] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:35.440] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:35.440] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:35.440] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:35.441] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:35.441] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:35.441] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:35.442] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:35.442] [memory_pool.cpp:daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 20:17:35.442] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:35.460] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c6_0...
[T 04/22/20 20:17:35.646] [jit_cuda.cpp:add_module@90] PTX size: 83.62KB
[T 04/22/20 20:17:35.646] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:35.647] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4448890686035156ms
[D 04/22/20 20:17:35.647] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c6_0_kernel_4_range_for<<<896, 64>>>
[T 04/22/20 20:17:35.647] [jit_cuda.cpp:lookup_function@47] Kernel fill_c6_0_kernel_4_range_for compilation time: 0.0011920928955078125ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:35.661] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:35.661] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:35.670] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:35.670] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:35.670] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:35.670] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:35.670] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:35.670] [program.cpp:Program@133] Program (0x56273abbbac0) arch=opengl initialized.
[T 04/22/20 20:17:35.671] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:35.780] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:35.780] [program.cpp:materialize_layout@298] OpenGL root buffer size: 32 B
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@3] [glsl] Found GL_ARB_compute_shader
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@4] [glsl] Found GL_ARB_gpu_shader_int64
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@5] [glsl] Found GL_NV_shader_atomic_float
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@6] [glsl] Found GL_NV_shader_atomic_float64
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@7] [glsl] Found GL_NV_shader_atomic_int64
[T 04/22/20 20:17:35.928] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:35.928] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:35.928] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:35.928] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:35.929] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c8_0...
[D 04/22/20 20:17:35.931] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c8_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };
void fill_c8_00()
{ // range for
// range known at compile time
int _tid = int(gl_GlobalInvocationID.x);
if (_tid >= 4) return;
int _itv = 0 + _tid * 1;
int G = _itv;
int H = (((0 + G) >> 1) & ((1 << 1) - 1));
int L = (((0 + G) >> 0) & ((1 << 1) - 1));
int O = H + L;
int S = 0;
int Gm = 0;
int U = S + 32 * Gm; // S0
int V = U + 0; // S1
int W = (((0 + H) >> 0) & ((1 << 1) - 1));
int X = (((0 + L) >> 0) & ((1 << 1) - 1));
int Go = 1;
int Gp = X * Go;
int Gq = Gm + Gp;
int Gr = 2;
int Gs = W * Gr;
int Gt = Gq + Gs;
int Z = V + 8 * Gt; // S1
int Aq = Z + 0; // S2
int As = atomicAdd(_data_i32_[Aq >> 2], O);
int AE = Z + 4; // S3
int AG = atomicAdd(_data_i32_[AE >> 2], H);
}
void main()
{
fill_c8_00();
}
layout(local_size_x = 4 /* 1, 4 */, local_size_y = 1, local_size_z = 1) in;
[T 04/22/20 20:17:35.931] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:35.931] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:35.931] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:35.931] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:35.932] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:35.932] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:35.932] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:35.932] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:35.932] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:35.933] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:35.936] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:35.936] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:35.936] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:35.936] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:35.936] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:35.936] [program.cpp:Program@133] Program (0x56273b70b7e0) arch=x64 initialized.
[T 04/22/20 20:17:35.937] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:36.047] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:36.047] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 2.22e-05 s
[T 04/22/20 20:17:36.047] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 64 B
[T 04/22/20 20:17:36.055] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:36.055] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:36.064] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:36.064] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:36.064] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:36.065] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:36.065] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:36.065] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:36.067] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:36.067] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:36.067] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:36.068] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:36.068] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:36.068] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:36.069] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:36.069] [memory_pool.cpp:daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 20:17:36.069] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:36.074] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:36.088] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c10_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:36.126] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:36.127] [program.cpp:finalize@512] Program (0x56273b70b7e0) finalized.
[T 04/22/20 20:17:36.129] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:36.129] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:36.129] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:36.129] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:36.129] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:36.129] [program.cpp:Program@133] Program (0x56273abbbac0) arch=cuda initialized.
[T 04/22/20 20:17:36.130] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:36.241] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:36.241] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:36.241] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:36.241] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:36.438] [jit_cuda.cpp:add_module@90] PTX size: 82.16KB
[T 04/22/20 20:17:36.438] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:36.438] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4589557647705078ms
[T 04/22/20 20:17:36.439] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:36.439] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 6.41e-05 s
[T 04/22/20 20:17:36.439] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 64 B
[T 04/22/20 20:17:36.447] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:36.447] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:36.457] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:36.457] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:36.457] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:36.458] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:36.458] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:36.458] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:36.459] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:36.459] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:36.459] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:36.460] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:36.460] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:36.460] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:36.461] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:36.461] [memory_pool.cpp:daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 20:17:36.461] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:36.479] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c12_0...
[T 04/22/20 20:17:36.665] [jit_cuda.cpp:add_module@90] PTX size: 83.76KB
[T 04/22/20 20:17:36.665] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:36.665] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4951953887939453ms
[D 04/22/20 20:17:36.665] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c12_0_kernel_10_range_for<<<896, 64>>>
[T 04/22/20 20:17:36.665] [jit_cuda.cpp:lookup_function@47] Kernel fill_c12_0_kernel_10_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:36.680] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:36.680] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:36.689] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:36.689] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:36.689] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:36.689] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:36.689] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:36.689] [program.cpp:Program@133] Program (0x56273b70b7e0) arch=opengl initialized.
[T 04/22/20 20:17:36.690] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:36.802] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:36.802] [program.cpp:materialize_layout@298] OpenGL root buffer size: 64 B
[T 04/22/20 20:17:36.803] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:36.803] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:36.803] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:36.803] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:36.806] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c14_0...
[D 04/22/20 20:17:36.808] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c14_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };
void fill_c14_00()
{ // range for
// range known at compile time
int _tid = int(gl_GlobalInvocationID.x);
if (_tid >= 8) return;
int _itv = 0 + _tid * 1;
int G = _itv;
int H = (((0 + G) >> 2) & ((1 << 1) - 1));
int L = (((0 + G) >> 1) & ((1 << 1) - 1));
int O = (((0 + G) >> 0) & ((1 << 1) - 1));
int R = H + L;
int S = R + O;
int X = 0;
int OS = 0;
int Z = X + 64 * OS; // S0
int Aq = Z + 0; // S1
int Ar = (((0 + H) >> 0) & ((1 << 1) - 1));
int As = (((0 + L) >> 0) & ((1 << 1) - 1));
int At = (((0 + O) >> 0) & ((1 << 1) - 1));
int OU = 1;
int OV = At * OU;
int OW = OS + OV;
int OX = 2;
int OY = As * OX;
int OZ = OW + OY;
int P0 = 4;
int P1 = Ar * P0;
int P2 = OZ + P1;
int Av = Aq + 8 * P2; // S1
int Aw = Av + 0; // S2
int Ay = atomicAdd(_data_i32_[Aw >> 2], S);
int AM = Av + 4; // S3
int AO = atomicAdd(_data_i32_[AM >> 2], H);
}
void main()
{
fill_c14_00();
}
layout(local_size_x = 8 /* 1, 8 */, local_size_y = 1, local_size_z = 1) in;
[T 04/22/20 20:17:36.808] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:36.808] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:36.808] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:36.808] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:36.808] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:36.808] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:36.808] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:36.808] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:36.809] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:36.809] [program.cpp:finalize@512] Program (0x56273b70b7e0) finalized.
[T 04/22/20 20:17:36.812] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:36.812] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:36.812] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:36.812] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:36.812] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:36.812] [program.cpp:Program@133] Program (0x56273c08bd40) arch=x64 initialized.
[T 04/22/20 20:17:36.813] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:36.925] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:36.925] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 1.91e-05 s
[T 04/22/20 20:17:36.925] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 128 B
[T 04/22/20 20:17:36.933] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:36.933] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:36.943] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:36.943] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:36.943] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:36.944] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:36.944] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:36.944] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:36.945] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:36.945] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:36.945] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:36.946] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:36.946] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:36.946] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:36.947] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:36.947] [memory_pool.cpp:daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 20:17:36.947] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:36.952] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:36.967] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c16_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:37.007] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:37.008] [program.cpp:finalize@512] Program (0x56273c08bd40) finalized.
[T 04/22/20 20:17:37.010] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:37.010] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:37.010] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:37.010] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.010] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.010] [program.cpp:Program@133] Program (0x56273c2e8cc0) arch=cuda initialized.
[T 04/22/20 20:17:37.011] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:37.123] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:37.123] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:37.123] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.123] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.316] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:37.316] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:37.317] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4658699035644531ms
[T 04/22/20 20:17:37.317] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:37.317] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 7.20e-05 s
[T 04/22/20 20:17:37.317] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 128 B
[T 04/22/20 20:17:37.326] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:37.326] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:37.336] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:37.336] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:37.336] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:37.337] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:37.337] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:37.337] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:37.338] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:37.338] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:37.339] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:37.340] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:37.340] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:37.340] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:37.341] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:37.341] [memory_pool.cpp:daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 20:17:37.341] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:37.359] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c18_0...
[T 04/22/20 20:17:37.548] [jit_cuda.cpp:add_module@90] PTX size: 83.82KB
[T 04/22/20 20:17:37.548] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:37.549] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.476837158203125ms
[D 04/22/20 20:17:37.549] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c18_0_kernel_16_range_for<<<896, 64>>>
[T 04/22/20 20:17:37.549] [jit_cuda.cpp:lookup_function@47] Kernel fill_c18_0_kernel_16_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:37.564] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:37.565] [program.cpp:finalize@512] Program (0x56273c2e8cc0) finalized.
[T 04/22/20 20:17:37.573] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:37.573] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:37.573] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:37.573] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.573] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.573] [program.cpp:Program@133] Program (0x56273abbbac0) arch=opengl initialized.
[T 04/22/20 20:17:37.574] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:37.687] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:37.687] [program.cpp:materialize_layout@298] OpenGL root buffer size: 128 B
[T 04/22/20 20:17:37.688] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:37.688] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:37.688] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:37.688] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:37.690] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c20_0...
[D 04/22/20 20:17:37.692] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c20_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };
void fill_c20_00()
{ // range for
// range known at compile time
int _tid = int(gl_GlobalInvocationID.x);
if (_tid >= 16) return;
int _itv = 0 + _tid * 1;
int G = _itv;
int H = (((0 + G) >> 3) & ((1 << 1) - 1));
int L = (((0 + G) >> 2) & ((1 << 1) - 1));
int O = (((0 + G) >> 1) & ((1 << 1) - 1));
int R = (((0 + G) >> 0) & ((1 << 1) - 1));
int U = H + L;
int V = U + O;
int W = V + R;
int As = 0;
int Z2 = 0;
int Au = As + 128 * Z2; // S0
int Av = Au + 0; // S1
int Aw = (((0 + H) >> 0) & ((1 << 1) - 1));
int Ax = (((0 + L) >> 0) & ((1 << 1) - 1));
int Ay = (((0 + O) >> 0) & ((1 << 1) - 1));
int Az = (((0 + R) >> 0) & ((1 << 1) - 1));
int Z4 = 1;
int Z5 = Az * Z4;
int Z6 = Z2 + Z5;
int Z7 = 2;
int Z8 = Ay * Z7;
int Z9 = Z6 + Z8;
int Za = 4;
int Zb = Ax * Za;
int Zc = Z9 + Zb;
int Zd = 8;
int Ze = Aw * Zd;
int Zf = Zc + Ze;
int AB = Av + 8 * Zf; // S1
int AC = AB + 0; // S2
int AE = atomicAdd(_data_i32_[AC >> 2], W);
int AU = AB + 4; // S3
int AW = atomicAdd(_data_i32_[AU >> 2], H);
}
void main()
{
fill_c20_00();
}
layout(local_size_x = 16 /* 1, 16 */, local_size_y = 1, local_size_z = 1) in;
[T 04/22/20 20:17:37.692] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:37.692] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:37.692] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:37.692] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:37.693] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:37.693] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:37.693] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:37.693] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:37.693] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:37.694] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:37.696] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:37.696] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:37.696] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:37.696] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.697] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.697] [program.cpp:Program@133] Program (0x56273c08bd40) arch=x64 initialized.
[T 04/22/20 20:17:37.698] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:37.810] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:37.810] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 3.31e-05 s
[T 04/22/20 20:17:37.810] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 256 B
[T 04/22/20 20:17:37.818] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:37.818] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:37.827] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:37.827] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:37.827] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:37.829] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:37.829] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:37.829] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:37.830] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:37.830] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:37.830] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:37.831] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:37.831] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:37.831] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:37.832] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:37.832] [memory_pool.cpp:daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 20:17:37.832] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:37.837] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:37.852] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c22_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:37.894] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:37.894] [program.cpp:finalize@512] Program (0x56273c08bd40) finalized.
[T 04/22/20 20:17:37.896] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:37.896] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:37.896] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:37.897] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.897] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.897] [program.cpp:Program@133] Program (0x56273b13cd50) arch=cuda initialized.
[T 04/22/20 20:17:37.898] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:38.009] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:38.009] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:38.009] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.009] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:38.206] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:38.206] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:38.207] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.43392181396484375ms
[T 04/22/20 20:17:38.207] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:38.207] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 6.70e-05 s
[T 04/22/20 20:17:38.207] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 256 B
[T 04/22/20 20:17:38.215] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:38.215] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:38.225] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:38.225] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:38.225] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:38.226] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:38.226] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:38.226] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:38.227] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:38.227] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:38.227] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:38.228] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:38.228] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:38.228] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:38.229] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:38.229] [memory_pool.cpp:daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 20:17:38.229] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:38.247] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c24_0...
[T 04/22/20 20:17:38.432] [jit_cuda.cpp:add_module@90] PTX size: 83.88KB
[T 04/22/20 20:17:38.433] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:38.433] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4620552062988281ms
[D 04/22/20 20:17:38.433] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c24_0_kernel_22_range_for<<<896, 64>>>
[T 04/22/20 20:17:38.433] [jit_cuda.cpp:lookup_function@47] Kernel fill_c24_0_kernel_22_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:38.448] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:38.448] [program.cpp:finalize@512] Program (0x56273b13cd50) finalized.
[T 04/22/20 20:17:38.457] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:38.457] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:38.457] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:38.457] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.457] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:38.457] [program.cpp:Program@133] Program (0x56273b8e1b00) arch=opengl initialized.
[T 04/22/20 20:17:38.458] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:38.570] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:38.570] [program.cpp:materialize_layout@298] OpenGL root buffer size: 256 B
[T 04/22/20 20:17:38.571] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:38.571] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:38.571] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:38.571] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:38.574] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c26_0...
[D 04/22/20 20:17:38.576] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c26_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };
void fill_c26_00()
{ // range for
// range known at compile time
int _tid = int(gl_GlobalInvocationID.x);
if (_tid >= 32) return;
int _itv = 0 + _tid * 1;
int G = _itv;
int H = (((0 + G) >> 4) & ((1 << 1) - 1));
int L = (((0 + G) >> 3) & ((1 << 1) - 1));
int O = (((0 + G) >> 2) & ((1 << 1) - 1));
int R = (((0 + G) >> 1) & ((1 << 1) - 1));
int U = (((0 + G) >> 0) & ((1 << 1) - 1));
int X = H + L;
int Y = X + O;
int Z = Y + R;
int Aq = Z + U;
int Ax = 0;
int AAQ = 0;
int Az = Ax + 256 * AAQ; // S0
int AA = Az + 0; // S1
int AB = (((0 + H) >> 0) & ((1 << 1) - 1));
int AC = (((0 + L) >> 0) & ((1 << 1) - 1));
int AD = (((0 + O) >> 0) & ((1 << 1) - 1));
int AE = (((0 + R) >> 0) & ((1 << 1) - 1));
int AF = (((0 + U) >> 0) & ((1 << 1) - 1));
int AAS = 1;
int AAT = AF * AAS;
int AAU = AAQ + AAT;
int AAV = 2;
int AAW = AE * AAV;
int AAX = AAU + AAW;
int AAY = 4;
int AAZ = AD * AAY;
int AB0 = AAX + AAZ;
int AB1 = 8;
int AB2 = AC * AB1;
int AB3 = AB0 + AB2;
int AB4 = 16;
int AB5 = AB * AB4;
int AB6 = AB3 + AB5;
int AH = AA + 8 * AB6; // S1
int AI = AH + 0; // S2
int AK = atomicAdd(_data_i32_[AI >> 2], Aq);
int B2 = AH + 4; // S3
int B4 = atomicAdd(_data_i32_[B2 >> 2], H);
}
void main()
{
fill_c26_00();
}
layout(local_size_x = 32 /* 1, 32 */, local_size_y = 1, local_size_z = 1) in;
[T 04/22/20 20:17:38.576] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:38.576] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:38.576] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:38.576] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:38.577] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:38.577] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:38.577] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:38.577] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:38.578] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:38.579] [program.cpp:finalize@512] Program (0x56273b8e1b00) finalized.
[T 04/22/20 20:17:38.581] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:38.581] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:38.581] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:38.581] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.581] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:38.581] [program.cpp:Program@133] Program (0x56273b5dbb30) arch=x64 initialized.
[T 04/22/20 20:17:38.582] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:38.692] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:38.692] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 3.29e-05 s
[T 04/22/20 20:17:38.692] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 512 B
[T 04/22/20 20:17:38.701] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:38.701] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:38.710] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:38.710] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:38.710] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:38.711] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:38.711] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:38.711] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:38.712] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:38.712] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:38.712] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:38.713] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:38.713] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:38.713] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:38.714] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:38.714] [memory_pool.cpp:daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 20:17:38.714] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:38.720] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:38.735] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c28_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:38.776] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:38.777] [program.cpp:finalize@512] Program (0x56273b5dbb30) finalized.
[T 04/22/20 20:17:38.779] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:38.779] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:38.779] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:38.779] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.779] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:38.779] [program.cpp:Program@133] Program (0x56273b2d9800) arch=cuda initialized.
[T 04/22/20 20:17:38.780] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:38.892] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:38.892] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:38.892] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.892] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:39.085] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:39.085] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:39.085] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.46515464782714844ms
[T 04/22/20 20:17:39.086] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:39.086] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 8.51e-05 s
[T 04/22/20 20:17:39.086] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 512 B
[T 04/22/20 20:17:39.094] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:39.094] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:39.104] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:39.104] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:39.104] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:39.105] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:39.105] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:39.105] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:39.106] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:39.106] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:39.106] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:39.107] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:39.107] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:39.107] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:39.108] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:39.108] [memory_pool.cpp:daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 20:17:39.108] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:39.127] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c30_0...
[T 04/22/20 20:17:39.317] [jit_cuda.cpp:add_module@90] PTX size: 83.93KB
[T 04/22/20 20:17:39.317] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:39.317] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.47397613525390625ms
[D 04/22/20 20:17:39.317] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c30_0_kernel_28_range_for<<<896, 64>>>
[T 04/22/20 20:17:39.317] [jit_cuda.cpp:lookup_function@47] Kernel fill_c30_0_kernel_28_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:39.334] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:39.335] [program.cpp:finalize@512] Program (0x56273b2d9800) finalized.
[T 04/22/20 20:17:39.355] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:39.355] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:39.355] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:39.355] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:39.355] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:39.355] [program.cpp:Program@133] Program (0x56273b541240) arch=opengl initialized.
[T 04/22/20 20:17:39.357] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:39.473] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:39.473] [program.cpp:materialize_layout@298] OpenGL root buffer size: 512 B
[T 04/22/20 20:17:39.474] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:39.474] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:39.474] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:39.474] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:39.477] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c32_0...
[D 04/22/20 20:17:39.480] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c32_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };
void fill_c32_00()
{ // range for
// range known at compile time
int _tid = int(gl_GlobalInvocationID.x);
if (_tid >= 64) return;
int _itv = 0 + _tid * 1;
int G = _itv;
int H = (((0 + G) >> 5) & ((1 << 1) - 1));
int L = (((0 + G) >> 4) & ((1 << 1) - 1));
int O = (((0 + G) >> 3) & ((1 << 1) - 1));
int R = (((0 + G) >> 2) & ((1 << 1) - 1));
int U = (((0 + G) >> 1) & ((1 << 1) - 1));
int X = (((0 + G) >> 0) & ((1 << 1) - 1));
int Aq = H + L;
int Ar = Aq + O;
int As = Ar + R;
int At = As + U;
int Au = At + X;
int AC = 0;
int AOi = 0;
int AE = AC + 512 * AOi; // S0
int AF = AE + 0; // S1
int AG = (((0 + H) >> 0) & ((1 << 1) - 1));
int AH = (((0 + L) >> 0) & ((1 << 1) - 1));
int AI = (((0 + O) >> 0) & ((1 << 1) - 1));
int AJ = (((0 + R) >> 0) & ((1 << 1) - 1));
int AK = (((0 + U) >> 0) & ((1 << 1) - 1));
int AL = (((0 + X) >> 0) & ((1 << 1) - 1));
int AOk = 1;
int AOl = AL * AOk;
int AOm = AOi + AOl;
int AOn = 2;
int AOo = AK * AOn;
int AOp = AOm + AOo;
int AOq = 4;
int AOr = AJ * AOq;
int AOs = AOp + AOr;
int AOt = 8;
int AOu = AI * AOt;
int AOv = AOs + AOu;
int AOw = 16;
int AOx = AH * AOw;
int AOy = AOv + AOx;
int AOz = 32;
int AOA = AG * AOz;
int AOB = AOy + AOA;
int AN = AF + 8 * AOB; // S1
int AO = AN + 0; // S2
int AQ = atomicAdd(_data_i32_[AO >> 2], Au);
int Ba = AN + 4; // S3
int Bc = atomicAdd(_data_i32_[Ba >> 2], H);
}
void main()
{
fill_c32_00();
}
layout(local_size_x = 64 /* 1, 64 */, local_size_y = 1, local_size_z = 1) in;
[T 04/22/20 20:17:39.480] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:39.480] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:39.480] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:39.480] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:39.481] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:39.481] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:39.481] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:39.481] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:39.484] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:39.484] [program.cpp:finalize@512] Program (0x56273b541240) finalized.
[T 04/22/20 20:17:39.487] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:39.487] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:39.487] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:39.487] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:39.487] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:39.487] [program.cpp:Program@133] Program (0x56273b007f00) arch=x64 initialized.
[T 04/22/20 20:17:39.488] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:39.601] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:39.601] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 3.19e-05 s
[T 04/22/20 20:17:39.601] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 1024 B
[T 04/22/20 20:17:39.609] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:39.609] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:39.619] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:39.619] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:39.619] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:39.621] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:39.621] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:39.621] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:39.622] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:39.622] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:39.622] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:39.623] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:39.623] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:39.623] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:39.624] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:39.624] [memory_pool.cpp:daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 20:17:39.624] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:39.629] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:39.646] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c34_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:39.689] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:39.689] [program.cpp:finalize@512] Program (0x56273b007f00) finalized.
[T 04/22/20 20:17:39.692] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:39.692] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:39.692] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:39.692] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:39.693] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:39.693] [program.cpp:Program@133] Program (0x56273c07b960) arch=cuda initialized.
[T 04/22/20 20:17:39.694] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:39.807] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:39.807] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:39.807] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:39.807] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.002] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:40.002] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:40.002] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.5109310150146484ms
[T 04/22/20 20:17:40.003] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:40.003] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 8.61e-05 s
[T 04/22/20 20:17:40.003] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 1024 B
[T 04/22/20 20:17:40.011] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:40.011] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:40.021] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:40.021] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:40.021] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:40.022] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:40.022] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:40.022] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:40.023] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:40.023] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:40.023] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:40.024] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:40.024] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:40.024] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:40.025] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:40.025] [memory_pool.cpp:daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 20:17:40.025] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:40.046] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c36_0...
[T 04/22/20 20:17:40.233] [jit_cuda.cpp:add_module@90] PTX size: 83.99KB
[T 04/22/20 20:17:40.233] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:40.234] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.5099773406982422ms
[D 04/22/20 20:17:40.234] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c36_0_kernel_34_range_for<<<896, 64>>>
[T 04/22/20 20:17:40.234] [jit_cuda.cpp:lookup_function@47] Kernel fill_c36_0_kernel_34_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:40.251] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:40.252] [program.cpp:finalize@512] Program (0x56273c07b960) finalized.
[T 04/22/20 20:17:40.261] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:40.261] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:40.262] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:40.262] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:40.262] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.262] [program.cpp:Program@133] Program (0x56273abbbac0) arch=opengl initialized.
[T 04/22/20 20:17:40.263] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:40.377] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:40.377] [program.cpp:materialize_layout@298] OpenGL root buffer size: 1024 B
[T 04/22/20 20:17:40.378] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:40.378] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:40.378] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:40.378] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:40.384] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c38_0...
[D 04/22/20 20:17:40.388] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c38_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };
void fill_c38_00()
{ // range for
// range known at compile time
int _tid = int(gl_GlobalInvocationID.x);
if (_tid >= 128) return;
int _itv = 0 + _tid * 1;
int G = _itv;
int H = (((0 + G) >> 6) & ((1 << 1) - 1));
int L = (((0 + G) >> 5) & ((1 << 1) - 1));
int O = (((0 + G) >> 4) & ((1 << 1) - 1));
int R = (((0 + G) >> 3) & ((1 << 1) - 1));
int U = (((0 + G) >> 2) & ((1 << 1) - 1));
int X = (((0 + G) >> 1) & ((1 << 1) - 1));
int Aq = (((0 + G) >> 0) & ((1 << 1) - 1));
int At = H + L;
int Au = At + O;
int Av = Au + R;
int Aw = Av + U;
int Ax = Aw + X;
int Ay = Ax + Aq;
int AH = 0;
int B3o = 0;
int AJ = AH + 1024 * B3o; // S0
int AK = AJ + 0; // S1
int AL = (((0 + H) >> 0) & ((1 << 1) - 1));
int AM = (((0 + L) >> 0) & ((1 << 1) - 1));
int AN = (((0 + O) >> 0) & ((1 << 1) - 1));
int AO = (((0 + R) >> 0) & ((1 << 1) - 1));
int AP = (((0 + U) >> 0) & ((1 << 1) - 1));
int AQ = (((0 + X) >> 0) & ((1 << 1) - 1));
int AR = (((0 + Aq) >> 0) & ((1 << 1) - 1));
int B3q = 1;
int B3r = AR * B3q;
int B3s = B3o + B3r;
int B3t = 2;
int B3u = AQ * B3t;
int B3v = B3s + B3u;
int B3w = 4;
int B3x = AP * B3w;
int B3y = B3v + B3x;
int B3z = 8;
int B3A = AO * B3z;
int B3B = B3y + B3A;
int B3C = 16;
int B3D = AN * B3C;
int B3E = B3B + B3D;
int B3F = 32;
int B3G = AM * B3F;
int B3H = B3E + B3G;
int B3I = 64;
int B3J = AL * B3I;
int B3K = B3H + B3J;
int AT = AK + 8 * B3K; // S1
int AU = AT + 0; // S2
int AW = atomicAdd(_data_i32_[AU >> 2], Ay);
int Bi = AT + 4; // S3
int Bk = atomicAdd(_data_i32_[Bi >> 2], H);
}
void main()
{
fill_c38_00();
}
layout(local_size_x = 128 /* 1, 128 */, local_size_y = 1, local_size_z = 1) in;
[T 04/22/20 20:17:40.388] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:40.388] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:40.388] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:40.388] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:40.389] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:40.389] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:40.389] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:40.389] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:40.397] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:40.398] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:40.401] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:40.401] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:40.401] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:40.401] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:40.401] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.401] [program.cpp:Program@133] Program (0x56273b13f180) arch=x64 initialized.
[T 04/22/20 20:17:40.402] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:40.516] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:40.516] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 2.10e-05 s
[T 04/22/20 20:17:40.516] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 2048 B
[T 04/22/20 20:17:40.524] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:40.524] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:40.534] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:40.534] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:40.534] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:40.535] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:40.535] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:40.535] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:40.536] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:40.536] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:40.536] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:40.537] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:40.537] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:40.537] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:40.538] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:40.538] [memory_pool.cpp:daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 20:17:40.538] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:40.544] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:40.562] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c40_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:40.607] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:40.608] [program.cpp:finalize@512] Program (0x56273b13f180) finalized.
[T 04/22/20 20:17:40.610] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:40.610] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:40.610] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:40.610] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:40.610] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.610] [program.cpp:Program@133] Program (0x56273c6e92b0) arch=cuda initialized.
[T 04/22/20 20:17:40.611] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:40.723] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:40.723] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:40.723] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:40.723] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.922] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:40.922] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:40.923] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4589557647705078ms
[T 04/22/20 20:17:40.923] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:40.923] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 7.30e-05 s
[T 04/22/20 20:17:40.923] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 2048 B
[T 04/22/20 20:17:40.931] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:40.931] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:40.941] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:40.941] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:40.941] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:40.942] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:40.942] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:40.942] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:40.943] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:40.943] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:40.943] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:40.944] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:40.944] [memory_pool.cpp:daemon@111] Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 20:17:40.944] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:40.945] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:40.945] [memory_pool.cpp:daemon@111] Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 20:17:40.945] [memory_pool.cpp:daemon@113] Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:40.966] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c42_0...
[T 04/22/20 20:17:41.153] [jit_cuda.cpp:add_module@90] PTX size: 84.04KB
[T 04/22/20 20:17:41.153] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:41.153] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.45800209045410156ms
[D 04/22/20 20:17:41.154] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c42_0_kernel_40_range_for<<<896, 64>>>
[T 04/22/20 20:17:41.154] [jit_cuda.cpp:lookup_function@47] Kernel fill_c42_0_kernel_40_range_for compilation time: 0.014781951904296875ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:41.171] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:41.172] [program.cpp:finalize@512] Program (0x56273c6e92b0) finalized.
[T 04/22/20 20:17:41.181] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:41.181] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:41.181] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:41.181] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:41.181] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:41.181] [program.cpp:Program@133] Program (0x56273c85ee90) arch=opengl initialized.
[T 04/22/20 20:17:41.182] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:41.298] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:41.298] [program.cpp:materialize_layout@298] OpenGL root buffer size: 2048 B
[T 04/22/20 20:17:41.299] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:41.299] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:41.299] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:41.299] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:41.314] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c44_0...
[D 04/22/20 20:17:41.318] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c44_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };
void fill_c44_00()
{ // range for
// range known at compile time
int _tid = int(gl_GlobalInvocationID.x);
if (_tid >= 256) return;
int _itv = 0 + _tid * 1;
int G = _itv;
int H = (((0 + G) >> 7) & ((1 << 1) - 1));
int L = (((0 + G) >> 6) & ((1 << 1) - 1));
int O = (((0 + G) >> 5) & ((1 << 1) - 1));
int R = (((0 + G) >> 4) & ((1 << 1) - 1));
int U = (((0 + G) >> 3) & ((1 << 1) - 1));
int X = (((0 + G) >> 2) & ((1 << 1) - 1));
int Aq = (((0 + G) >> 1) & ((1 << 1) - 1));
int At = (((0 + G) >> 0) & ((1 << 1) - 1));
int Aw = H + L;
int Ax = Aw + O;
int Ay = Ax + R;
int Az = Ay + U;
int AA = Az + X;
int AB = AA + Aq;
int AC = AB + At;
int AM = 0;
int Bk8 = 0;
int AO = AM + 2048 * Bk8; // S0
int AP = AO + 0; // S1
int AQ = (((0 + H) >> 0) & ((1 << 1) - 1));
int AR = (((0 + L) >> 0) & ((1 << 1) - 1));
int AS = (((0 + O) >> 0) & ((1 << 1) - 1));
int AT = (((0 + R) >> 0) & ((1 << 1) - 1));
int AU = (((0 + U) >> 0) & ((1 << 1) - 1));
int AV = (((0 + X) >> 0) & ((1 << 1) - 1));
int AW = (((0 + Aq) >> 0) & ((1 << 1) - 1));
int AX = (((0 + At) >> 0) & ((1 << 1) - 1));
int Bka = 1;
int Bkb = AX * Bka;
int Bkc = Bk8 + Bkb;
int Bkd = 2;
int Bke = AW * Bkd;
int Bkf = Bkc + Bke;
int Bkg = 4;
int Bkh = AV * Bkg;
int Bki = Bkf + Bkh;
int Bkj = 8;
int Bkk = AU * Bkj;
int Bkl = Bki + Bkk;
int Bkm = 16;
int Bkn = AT * Bkm;
int Bko = Bkl + Bkn;
int Bkp = 32;
int Bkq = AS * Bkp;
int Bkr = Bko + Bkq;
int Bks = 64;
int Bkt = AR * Bks;
int Bku = Bkr + Bkt;
int Bkv = 128;
int Bkw = AQ * Bkv;
int Bkx = Bku + Bkw;
int AZ = AP + 8 * Bkx; // S1
int B0 = AZ + 0; // S2
int B2 = atomicAdd(_data_i32_[B0 >> 2], AC);
int Bq = AZ + 4; // S3
int Bs = atomicAdd(_data_i32_[Bq >> 2], H);
}
void main()
{
fill_c44_00();
}
layout(local_size_x = 256 /* 1, 256 */, local_size_y = 1, local_size_z = 1) in;
[T 04/22/20 20:17:41.318] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:41.318] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:41.318] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:41.318] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:41.319] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:41.319] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:41.319] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:41.319] [opengl_api.cpp:link@129] glLinkProgram OUT
.
==================================================== 1 passed in 7.31s =====================================================
[T 04/22/20 20:17:41.374] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:41.375] [program.cpp:finalize@512] Program (0x56273c85ee90) finalized.
Looks like the only difference is the final two lines -- on Windows I've never seen them.
Anyway, as long as they are mismatched in the same way with/without advanced optimization, we can still get a plausible comparison...
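As a side note on the generated OpenGL kernels in this log: the index arithmetic in `fill_c38_00` splits `G` into single bits (`H`, `L`, `O`, ...), masks each bit again, and then re-sums them with weights 1, 2, 4, ..., 64, so the linearized index `B3K` appears to equal `G` itself. Below is a minimal Python sketch of that round trip. The function name `relinearize` and its parameters are hypothetical and only mirror the shader arithmetic; they are not part of Taichi.

```python
def relinearize(g: int, num_bits: int = 7) -> int:
    """Mimic the shader: split g into bits, mask each again, re-sum with weights."""
    # One bit per axis, most significant first (H, L, O, R, U, X, Aq in the shader).
    bits = [(g >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]
    # The shader masks every extracted bit a second time (AL, AM, ...); a no-op.
    bits = [(b >> 0) & 1 for b in bits]
    index = 0
    # Weights 1, 2, 4, ... are applied starting from the least significant bit.
    for position, b in enumerate(reversed(bits)):
        index += b * (1 << position)
    return index


# For every thread id that passes the `_tid >= 128` check, the result is g itself.
round_trip_is_identity = all(relinearize(g) == g for g in range(128))
```

If this holds in general, the whole chain could be folded to `AT = AK + 8 * G`, which is the kind of redundancy the linearization-related optimizations target.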
@yuanming-hu I found an issue when doing CSE for global pointers:
Case: test_ad_if_parallel_complex
Before (good):
[I 06/30/20 20:38:44.108] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Simplified I:
kernel {
<f32 x1> $0 = const [1.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<f32 x1> $3 = const [0.0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<i32 x1> $6 = loop $5 index 0
<f32 x1> $7 = alloca
<f32 x1> $8 : local store [$7 <- $3]
<f32*x1> $9 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $10 = global load $9
<i32 x1> $11 = cmp_gt $10 $3
<i32 x1> $12 = bit_and $11 $1
$13 : if $12 {
<f32*x1> $14 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $15 = global load $14
<f32 x1> $16 = div $0 $15
<f32 x1> $17 : local store [$7 <- $16]
}
<f32 x1> $18 = local load [ [$7[0]]]
<f32*x1> $19 = global ptr [S4place_f32], index [$6] activate=true
<f32*x1> $20 : global store [$19 <- $18]
}
}
[I 06/30/20 20:38:44.110] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Gradient:
kernel {
<f32 x1> $0 = const [1.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<f32 x1> $3 = const [0.0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<f32 x1> $6 = alloca
<f32 x1> $7 = alloca
<f32 x1> $8 = alloca
<f32 x1> $9 = alloca
<i32 x1> $10 = loop $5 index 0
<f32 x1> $11 = stack alloc (max_size=16)
<f32 x1> $12 : stack push $11, val = $3
<f32*x1> $13 = global ptr [S2place_f32], index [$10] activate=true
<f32 x1> $14 = global load $13
<i32 x1> $15 = cmp_gt $14 $3
<i32 x1> $16 = bit_and $15 $1
$17 : if $16 {
<f32*x1> $18 = global ptr [S2place_f32], index [$10] activate=true
<f32 x1> $19 = global load $18
<f32 x1> $20 : local store [$6 <- $19]
<f32 x1> $21 = div $0 $19
<f32 x1> $22 : stack push $11, val = $21
}
<f32*x1> $23 = global ptr [S4place_f32], index [$10] activate=true
<f32*x1> $24 = global ptr [S6place_f32], index [$10] activate=true
<f32 x1> $25 = global load $24
<f32 x1> $26 : stack acc adj $11, val = $25
$27 : if $16 {
<f32 x1> $28 = stack load top adj $11
<f32 x1> $29 = local load [ [$9[0]]]
<f32 x1> $30 = add $29 $28
<f32 x1> $31 : local store [$9 <- $30]
<f32 x1> $32 : stack pop $11
<f32 x1> $33 = local load [ [$6[0]]]
<f32 x1> $34 = div $30 $33
<f32 x1> $35 = local load [ [$8[0]]]
<f32 x1> $36 = add $35 $34
<f32 x1> $37 : local store [$8 <- $36]
<f32 x1> $38 = mul $33 $33
<f32 x1> $39 = div $30 $38
<f32 x1> $40 = neg $39
<f32 x1> $41 = local load [ [$7[0]]]
<f32 x1> $42 = add $41 $40
<f32 x1> $43 : local store [$7 <- $42]
<f32*x1> $44 = global ptr [S5place_f32], index [$10] activate=true
<f32 x1> $45 = atomic add($44, $42)
}
<f32*x1> $46 = global ptr [S5place_f32], index [$10] activate=true
<f32 x1> $47 = atomic add($46, $3)
<f32 x1> $48 : stack pop $11
}
}
After (bad, with some debug output in full_simplify()):
[I 06/30/20 20:43:33.360] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Simplified I:
kernel {
<f32 x1> $0 = const [1.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<f32 x1> $3 = const [0.0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<i32 x1> $6 = loop $5 index 0
<f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $8 = global load $7
<i32 x1> $9 = cmp_gt $8 $3
<i32 x1> $10 = bit_and $9 $1
<f32 x1> $11 = div $0 $8
<f32 x1> $12 = select($10, $11, $3)
<f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
<f32*x1> $14 : global store [$13 <- $12]
}
}
before simplify
kernel {
<f32 x1> $0 = const [1.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<f32 x1> $3 = const [0.0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<i32 x1> $6 = loop $5 index 0
<f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $8 = global load $7
<i32 x1> $9 = cmp_gt $8 $3
<i32 x1> $10 = bit_and $9 $1
<f32 x1> $11 = div $0 $8
<f32 x1> $12 = select($10, $11, $3)
<f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
<f32*x1> $14 : global store [$13 <- $12]
}
}
after simplify
kernel {
<f32 x1> $0 = const [1.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<f32 x1> $3 = const [0.0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<i32 x1> $6 = loop $5 index 0
<f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $8 = global load $7
<i32 x1> $9 = cmp_gt $8 $3
<i32 x1> $10 = bit_and $9 $1
<f32 x1> $11 = div $0 $8
<f32 x1> $12 = select($10, $11, $3)
<f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
<f32*x1> $14 : global store [$13 <- $12]
}
}
after cse
kernel {
<f32 x1> $0 = const [1.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<f32 x1> $3 = const [0.0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<i32 x1> $6 = loop $5 index 0
<f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $8 = global load $7
<i32 x1> $9 = cmp_gt $8 $3
<i32 x1> $10 = bit_and $9 $1
<f32 x1> $11 = div $0 $8
<f32 x1> $12 = select($10, $11, $3)
<f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
<f32*x1> $14 : global store [$13 <- $12]
}
}
before simplify
kernel {
<f32 x1> $205 = const [0.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<f32 x1> $3 = const [0.0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<f32 x1> $229 = alloca
<f32 x1> $220 = alloca
<f32 x1> $214 = alloca
<f32 x1> $208 = alloca
<f32 x1> $201 = alloca
<i32 x1> $6 = loop $5 index 0
<f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $8 = global load $7
<i32 x1> $9 = cmp_gt $8 $3
<i32 x1> $10 = bit_and $9 $1
<f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
<f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
<f32 x1> $200 = global load $199
<f32 x1> $202 = local load [ [$201[0]]]
<f32 x1> $203 = add $202 $200
<f32 x1> $204 : local store [$201 <- $203]
<f32 x1> $206 = local load [ [$201[0]]]
<f32 x1> $207 = select($10, $206, $205)
<f32 x1> $209 = local load [ [$208[0]]]
<f32 x1> $210 = add $209 $207
<f32 x1> $211 : local store [$208 <- $210]
<f32 x1> $212 = local load [ [$201[0]]]
<f32 x1> $213 = select($10, $205, $212)
<f32 x1> $215 = local load [ [$214[0]]]
<f32 x1> $216 = add $215 $213
<f32 x1> $217 : local store [$214 <- $216]
<f32 x1> $218 = local load [ [$208[0]]]
<f32 x1> $219 = div $218 $8
<f32 x1> $221 = local load [ [$220[0]]]
<f32 x1> $222 = add $221 $219
<f32 x1> $223 : local store [$220 <- $222]
<f32 x1> $224 = mul $8 $8
<f32 x1> $225 = local load [ [$208[0]]]
<f32 x1> $227 = div $225 $224
<f32 x1> $228 = neg $227
<f32 x1> $230 = local load [ [$229[0]]]
<f32 x1> $231 = add $230 $228
<f32 x1> $232 : local store [$229 <- $231]
<f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
<f32 x1> $234 = local load [ [$229[0]]]
<f32 x1> $235 = atomic add($233, $234)
}
}
after simplify
kernel {
<f32 x1> $205 = const [0.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<f32 x1> $3 = const [0.0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<i32 x1> $6 = loop $5 index 0
<f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $8 = global load $7
<i32 x1> $9 = cmp_gt $8 $3
<i32 x1> $10 = bit_and $9 $1
<f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
<f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
<f32 x1> $200 = global load $199
<f32 x1> $236 = const [0.0]
<f32 x1> $203 = add $236 $200
<f32 x1> $207 = select($10, $203, $205)
<f32 x1> $237 = const [0.0]
<f32 x1> $210 = add $237 $207
<f32 x1> $224 = mul $8 $8
<f32 x1> $227 = div $210 $224
<f32 x1> $228 = neg $227
<f32 x1> $240 = const [0.0]
<f32 x1> $231 = add $240 $228
<f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
<f32 x1> $235 = atomic add($233, $231)
}
}
after cse
kernel {
<f32 x1> $205 = const [0.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<i32 x1> $6 = loop $5 index 0
<f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $8 = global load $7
<i32 x1> $9 = cmp_gt $8 $205
<i32 x1> $10 = bit_and $9 $1
<f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
<f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
<f32 x1> $200 = global load $199
<f32 x1> $203 = add $205 $200
<f32 x1> $207 = select($10, $203, $205)
<f32 x1> $210 = add $205 $207
<f32 x1> $224 = mul $8 $8
<f32 x1> $227 = div $210 $224
<f32 x1> $228 = neg $227
<f32 x1> $231 = add $205 $228
<f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
<f32 x1> $235 = atomic add($233, $231)
}
}
before simplify
kernel {
<f32 x1> $205 = const [0.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<i32 x1> $6 = loop $5 index 0
<f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $8 = global load $7
<i32 x1> $9 = cmp_gt $8 $205
<i32 x1> $10 = bit_and $9 $1
<f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
<f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
<f32 x1> $200 = global load $199
<f32 x1> $207 = select($10, $200, $205)
<f32 x1> $224 = mul $8 $8
<f32 x1> $227 = div $207 $224
<f32 x1> $228 = neg $227
<f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
<f32 x1> $235 = atomic add($233, $228)
}
}
after simplify
kernel {
<f32 x1> $205 = const [0.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<i32 x1> $6 = loop $5 index 0
<f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $8 = global load $7
<i32 x1> $9 = cmp_gt $8 $205
<i32 x1> $10 = bit_and $9 $1
<f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
<f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
<f32 x1> $200 = global load $199
<f32 x1> $207 = select($10, $200, $205)
<f32 x1> $224 = mul $8 $8
<f32 x1> $227 = div $207 $224
<f32 x1> $228 = neg $227
<f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
<f32 x1> $235 = atomic add($233, $228)
}
}
after cse
kernel {
<f32 x1> $205 = const [0.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<i32 x1> $6 = loop $5 index 0
<f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $8 = global load $7
<i32 x1> $9 = cmp_gt $8 $205
<i32 x1> $10 = bit_and $9 $1
<f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
<f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
<f32 x1> $200 = global load $199
<f32 x1> $207 = select($10, $200, $205)
<f32 x1> $224 = mul $8 $8
<f32 x1> $227 = div $207 $224
<f32 x1> $228 = neg $227
<f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
<f32 x1> $235 = atomic add($233, $228)
}
}
before simplify
kernel {
<f32 x1> $205 = const [0.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<i32 x1> $6 = loop $5 index 0
<f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $8 = global load $7
<i32 x1> $9 = cmp_gt $8 $205
<i32 x1> $10 = bit_and $9 $1
<f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
<f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
<f32 x1> $200 = global load $199
<f32 x1> $207 = select($10, $200, $205)
<f32 x1> $224 = mul $8 $8
<f32 x1> $227 = div $207 $224
<f32 x1> $228 = neg $227
<f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
<f32 x1> $235 = atomic add($233, $228)
}
}
after simplify
kernel {
<f32 x1> $205 = const [0.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<i32 x1> $6 = loop $5 index 0
<f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $8 = global load $7
<i32 x1> $9 = cmp_gt $8 $205
<i32 x1> $10 = bit_and $9 $1
<f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
<f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
<f32 x1> $200 = global load $199
<f32 x1> $207 = select($10, $200, $205)
<f32 x1> $224 = mul $8 $8
<f32 x1> $227 = div $207 $224
<f32 x1> $228 = neg $227
<f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
<f32 x1> $235 = atomic add($233, $228)
}
}
after cse
kernel {
<f32 x1> $205 = const [0.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<i32 x1> $4 = const [2]
$5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
<i32 x1> $6 = loop $5 index 0
<f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
<f32 x1> $8 = global load $7
<i32 x1> $9 = cmp_gt $8 $205
<i32 x1> $10 = bit_and $9 $1
<f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
<f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
<f32 x1> $200 = global load $199
<f32 x1> $207 = select($10, $200, $205)
<f32 x1> $224 = mul $8 $8
<f32 x1> $227 = div $207 $224
<f32 x1> $228 = neg $227
<f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
<f32 x1> $235 = atomic add($233, $228)
}
}
[I 06/30/20 20:43:33.386] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Gradient:
kernel {
<f32 x1> $0 = const [0.0]
<i32 x1> $1 = const [1]
<i32 x1> $2 = const [0]
<i32 x1> $3 = const [2]
$4 : for in range($2, $3) (vectorize 1) block_dim=adaptive {
<i32 x1> $5 = loop $4 index 0
<f32*x1> $6 = global ptr [S2place_f32], index [$5] activate=true
<f32 x1> $7 = global load $6
<i32 x1> $8 = cmp_gt $7 $0
<i32 x1> $9 = bit_and $8 $1
<f32*x1> $10 = global ptr [S4place_f32], index [$5] activate=true
<f32*x1> $11 = global ptr [S6place_f32], index [$5] activate=true
<f32 x1> $12 = global load $11
<f32 x1> $13 = select($9, $12, $0)
<f32 x1> $14 = mul $7 $7
<f32 x1> $15 = div $13 $14
<f32 x1> $16 = neg $15
<f32*x1> $17 = global ptr [S5place_f32], index [$5] activate=true
<f32 x1> $18 = atomic add($17, $16)
}
}
I think that although the two IRs at the Simplified I stage differ, both look correct. After the Gradient stage, however, the latter becomes wrong.
Final IR:
Good:
[I 06/30/20 20:38:44.166] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Simplified III:
kernel {
$0 = offloaded range_for(0, 2) block_dim=adaptive
body {
<f32 x1> $1 = const [1.0]
<f32 x1> $2 = alloca
<i32 x1> $3 = loop $0 index 0
<f32 x1> $4 = stack alloc (max_size=16)
<f32 x1> $5 = const [0.0]
<f32 x1> $6 : stack push $4, val = $5
<gen*x1> $7 = get root
<i32 x1> $8 = const [0]
<gen*x1> $9 = [S0root][root]::lookup($7, $8) activate = false
<gen*x1> $10 = get child [S0root->S1dense] $9
<i32 x1> $11 = const [1]
<gen*x1> $12 = [S1dense][dense]::lookup($10, $3) activate = false
<f32*x1> $13 = get child [S1dense->S2place_f32] $12
<f32 x1> $14 = global load $13
<i32 x1> $15 = cmp_gt $14 $5
<i32 x1> $16 = bit_and $15 $11
$17 : if $16 {
<f32 x1> $18 = global load $13
<f32 x1> $19 : local store [$2 <- $18]
<f32 x1> $20 = div $1 $18
<f32 x1> $21 : stack push $4, val = $20
}
<gen*x1> $22 = get child [S0root->S3dense] $9
<gen*x1> $23 = [S3dense][dense]::lookup($22, $3) activate = false
<f32*x1> $24 = get child [S3dense->S6place_f32] $23
<f32 x1> $25 = global load $24
<f32 x1> $26 : stack acc adj $4, val = $25
<f32 x1> $27 = stack load top adj $4
<f32 x1> $28 = local load [ [$2[0]]]
<f32 x1> $29 = mul $28 $28 <--- probably 0*0
<f32 x1> $30 = div $27 $29 <--- nan
<f32 x1> $31 = neg $30 <--- nan
<f32*x1> $32 = get child [S1dense->S5place_f32] $12
<f32 x1> $33 = global load $32
<f32 x1> $34 = add $33 $31 <--- nan
$35 : if $16 { <--- good!
<f32*x1> $36 : global store [$32 <- $34]
}
<f32 x1> $37 = global load $32
<f32 x1> $38 : global store [$32 <- $37]
}
}
Bad (nan):
[I 06/30/20 20:43:33.481] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Simplified III:
kernel {
$0 = offloaded range_for(0, 2) block_dim=adaptive
body {
<i32 x1> $1 = loop $0 index 0
<gen*x1> $2 = get root
<i32 x1> $3 = const [0]
<gen*x1> $4 = [S0root][root]::lookup($2, $3) activate = false
<gen*x1> $5 = get child [S0root->S1dense] $4
<i32 x1> $6 = const [1]
<gen*x1> $7 = [S1dense][dense]::lookup($5, $1) activate = false
<f32*x1> $8 = get child [S1dense->S2place_f32] $7
<f32 x1> $9 = global load $8
<f32 x1> $10 = const [0.0]
<i32 x1> $11 = cmp_gt $9 $10
<i32 x1> $12 = bit_and $11 $6
<gen*x1> $13 = get child [S0root->S3dense] $4
<gen*x1> $14 = [S3dense][dense]::lookup($13, $1) activate = false
<f32*x1> $15 = get child [S3dense->S6place_f32] $14
<f32 x1> $16 = global load $15
<f32 x1> $17 = select($12, $16, $10)
<f32 x1> $18 = mul $9 $9 <--- probably 0*0
<f32 x1> $19 = div $17 $18 <--- nan
<f32 x1> $20 = neg $19 <--- nan
<f32*x1> $21 = get child [S1dense->S5place_f32] $7
<f32 x1> $22 = global load $21
<f32 x1> $23 = add $22 $20 <--- nan
<f32 x1> $24 : global store [$21 <- $23] <--- bad
}
}
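To make the difference between the two final IRs concrete, here is a minimal Python sketch of the arithmetic as I read it from the dumps above. It is a simplification: the function names and the `ieee_div` helper are hypothetical, the input is used directly instead of going through the alloca/stack, and only the guarded store versus the `select` placed before the `div` is modelled.

```python
import math


def ieee_div(a: float, b: float) -> float:
    """Divide like f32 hardware would: 0/0 gives nan, x/0 gives +-inf (Python raises instead)."""
    if b == 0.0:
        if a == 0.0 or math.isnan(a):
            return math.nan
        return math.copysign(math.inf, a) * math.copysign(1.0, b)
    return a / b


def grad_guarded(x: float, adj_y: float, grad_x: float) -> float:
    """Good IR: the division always runs, but the store is guarded by the condition."""
    cond = x > 0.0
    contribution = -ieee_div(adj_y, x * x)  # may be nan or inf when x == 0
    if cond:
        grad_x = grad_x + contribution  # corresponds to `if $16 { global store }`
    return grad_x


def grad_select(x: float, adj_y: float, grad_x: float) -> float:
    """Bad IR: select() masks only the numerator, then the result is stored unconditionally."""
    cond = x > 0.0
    masked = adj_y if cond else 0.0  # select(cond, adj_y, 0.0)
    return grad_x + (-ieee_div(masked, x * x))  # 0.0 / 0.0 is nan when x == 0


good_at_zero = grad_guarded(0.0, 1.0, 0.0)
bad_at_zero = grad_select(0.0, 1.0, 0.0)
```

Under these assumptions both versions agree whenever the condition holds (for example `x = 2.0`), but at `x = 0.0` the guarded version leaves the gradient untouched while the `select` version writes nan, which matches the `<--- bad` line above.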