Taichi: Advanced optimization

Created on 25 Mar 2020 · 64 Comments · Source: taichi-dev/taichi

Concisely describe the proposed feature
With the new extensions introduced by #581, there is a lot of room to optimize the IR. I also found some feasible optimizations that are not directly related to the new extensions. For example, in this fragment of IR,

...
<f32 x1> $5 = alloca
if $26 {
  ...
} else {
  ...
}
if $26 {
  ...
} else {
  ...
}
<f32 x1> $83 = local load [ [$5[0]]] (the only statement about $5)
...

we could merge the two if's together, change $83 to const [0], and then delete $5.
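
For illustration only, here is a minimal Python sketch of the "merge adjacent if's with identical condition" step on a toy IR. It is not Taichi's C++ IR: the `If` class and `merge_adjacent_ifs` are hypothetical names, and the sketch assumes the condition is an immutable SSA value (like $26 above), so it cannot change between the two if's.

```python
from dataclasses import dataclass, field


@dataclass
class If:
    """Toy stand-in for an if statement; not Taichi's IfStmt."""
    cond: str                                   # SSA name of the condition, e.g. "$26"
    true_body: list = field(default_factory=list)
    false_body: list = field(default_factory=list)


def merge_adjacent_ifs(block):
    """Merge neighbouring If nodes that test the same condition value."""
    out = []
    for stmt in block:
        prev = out[-1] if out else None
        if isinstance(stmt, If) and isinstance(prev, If) and prev.cond == stmt.cond:
            # Nothing sits between the two if's, so their bodies can be concatenated.
            prev.true_body.extend(stmt.true_body)
            prev.false_body.extend(stmt.false_body)
        elif isinstance(stmt, If):
            # Copy so the input block is left untouched.
            out.append(If(stmt.cond, list(stmt.true_body), list(stmt.false_body)))
        else:
            out.append(stmt)
    return out


block = ["$5 = alloca", If("$26", ["a"], ["b"]), If("$26", ["c"], ["d"]), "$83 = local load $5"]
merged = merge_adjacent_ifs(block)
print(merged)
```

The side effects inside the bodies keep their relative order, which is why the merge is safe even when the bodies are not pure.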

A list of optimizations I have done and am going to do:

  • [x] Basic algebraic simplification (#472, #502)
  • [x] Better algebraic simplification: -1 & a, 0 | a (#827)
  • [x] Lower linearized (#509)
  • [x] Variable optimization

    • [x] Dive into container statements to find local loads/stores for optimization (merge identical local loads, delete local stores if there are no following loads, etc.) (#662)

    • [x] Dive into container statements to merge identical global loads (#857)

    • [x] Optimize local loads of new alloca's to const [0] (#662)

    • [x] Local store elimination and forwarding (#788, #858, #859)

    • [x] Global store elimination and forwarding (#857)

  • [x] Merge adjacent if's with identical condition (#668)
  • [x] Move common statements in both branches outside if's (thanks to @archibate's discussion) (#727)
  • [x] Add a WholeKernelCSE pass (#727, #1082)
  • [x] Eliminate WhileControlStmt with cond == const [1] (#829)
  • [x] Eliminate assertions with non-zero const conditions (#877)
  • [x] Improve optimization for OffsetAndExtractBitsStmt (#851)
  • [x] DIE for stack pop (#1324)
  • [ ] Allocate stacks with sizes on demand
  • [x] Extract consts to top-level after offloading (#897)

Additional comments
For benchmarking, we may want to introduce a temporary boolean variable as a switch for the optimizations.

Some nice slides: https://courses.cs.washington.edu/courses/cse401/08wi/lecture/opt-mark.v2.pdf

feature request

All 64 comments

@yuanming-hu please assign me. It seems that I can't assign myself...

Awesome!! This is vitally important for improving run-time performance & reducing compilation time. Thanks for taking charge of this.

Merge adjacent if's with identical condition

What if these if's contain statements with side effects, like x = x + 1? E.g.:

if (cond) x++;
if (cond) x++;

We want to obtain:

if (cond) { x++; x++; }

and the duplicated x++ can be dealt with in other, lower passes.

Merge identical local loads if no statements between them modify the variable even if there are if's

What if the two local loads are in different blocks? E.g.:

if (cond) {
print 'yes';
x = local load 233;
} else {
print 'no';
x = local load 233;
}

What if a statement appears once in the IR but runs multiple times? Should we optimize it? E.g.:

while (cond) {
x = local load 233
... (no changes stored to 233)
}

We may move this out of the while.

First, add an analysis pass to detect whether a block stores to an address.

Merge adjacent if's with identical condition

What if these if's contain statements with side effects, like x = x + 1? E.g.:

if (cond) x++;
if (cond) x++;

We want to obtain:

if (cond) { x++; x++; }

and the duplicated x++ can be dealt with in other, lower passes.

Exactly.

Merge identical local loads if no statements between them modify the variable even if there are if's

What if the two local loads are in different blocks? E.g.:

if (cond) {
print 'yes';
x = local load 233;
} else {
print 'no';
x = local load 233;
}

This is non-trivial. We could find the code fragments common to the true branch and the false branch and move them outside the if, but I don't know if it would make a great difference.

What if a statement appears once in the IR but runs multiple times? Should we optimize it? E.g.:

while (cond) {
x = local load 233
... (no changes stored to 233)
}

We may move this out of the while.

If cond is false, does moving it out have side effects?

First, add an analysis pass to detect whether a block stores to an address.

To merge identical local loads if no statements between them modify the variable, this is not necessary: I think directly searching for modifications when we find a local load fits the code frame better. Maybe we can add this pass later if necessary.

If cond is false, does moving it out have side effects?

No, it's just a load that is never used; it will be optimized out by other, lower passes.

How about first rewriting:

if (cond) {
print 'yes';
x = local load 233;
} else {
print 'no';
x = local load 233;
}

to become:

if (cond) print 'yes'; else print 'no';
if (cond) xxx; else xxx;

since cond is a constant IR value, and the second if can be safely optimized out.

I just thought about a situation:

if (cond) {
  print 'yes';
  x = local load 233;
  print 'yes';
} else {
  print 'no';
  x = local load 233;
  print 'no';
}

I can't tell if the following is more efficient than the above:

if (cond) print 'yes'; else print 'no';
x = local load 233;
if (cond) print 'yes'; else print 'no';

(especially when the common code fragment is short relative to the rest of each branch)

We can restrict this optimization to only the first statement and the last statement of the body of if.
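
A rough Python sketch of that restriction, for illustration only. Statements are opaque strings here, `hoist_common_ends` is a hypothetical name, and the sketch deliberately skips the dependency and side-effect checks a real pass would need before moving a statement across the if.

```python
def hoist_common_ends(true_body, false_body):
    """Hoist only the shared first statement and the shared last statement.

    Returns (before, true_rest, false_rest, after).
    """
    before, after = [], []
    t, f = list(true_body), list(false_body)
    if t and f and t[0] == f[0]:
        # Identical first statement in both branches: run it before the if.
        before.append(t.pop(0))
        f.pop(0)
    if t and f and t[-1] == f[-1]:
        # Identical last statement in both branches: run it after the if.
        after.append(t.pop())
        f.pop()
    return before, t, f, after


result = hoist_common_ends(
    ["x = local load 233", "print 'yes'"],
    ["x = local load 233", "print 'no'"],
)
print(result)
```

Because only the two ends are inspected, the check costs two comparisons per if, and the print/load/print situation above is left as it is.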

@yuanming-hu What do https://github.com/taichi-dev/taichi/blob/aa90e319be3b599085495f88b660f4e987a08134/taichi/ir/ir.h#L1637 mean?

May I just ignore them when merging two adjacent if's?

Quick answer for now: yes. I'll document this in greater detail later. You don't have to worry about that until we start doing vectorization.

I just found a piece of IR:

<i32 x1> $8 = const [0]
...
if $19 {
  ...
  <i32 x1> $25 = const [0]
  ...
} else {
  ...
  <i32 x1> $40 = const [0]
  ...
}

I think we could optimize them all to $8. Currently void visit(ConstStmt*) searches statements before the current statement, and so $25 cannot find $8 as they are not in the same basic block.

There are two ways to do this optimization:

  1. Search statements after the current statement (say $8) instead, and dive into container statements to replace them with $8.
  2. Search statements before the current statement (say $25), and do this recursively for parent blocks.

Which do you think is better?

I think 2 is better. At compile time it's hard to judge whether $25 or $40 will be after $8, but it is certain that $8 is before $25 and $40.

Should this pass (identical ConstStmt elimination) still be in BasicBlockSimplify? It won't be in one basic block, so maybe I should implement it in Simplify?

Let's add a WholeKernelCSE (common subexpression elimination) pass then.
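
To make option 2 concrete, here is a small Python sketch (a toy model, not the actual WholeKernelCSE pass). `Block`, `find_earlier_const` and `visit_const` are hypothetical names, and the sketch assumes statements are visited in program order, so everything already recorded in a parent block precedes the nested container being visited.

```python
class Block:
    """Toy block that remembers the consts already visited in it."""

    def __init__(self, parent=None):
        self.parent = parent
        self.consts = {}  # (dtype, value) -> name of the first statement defining it


def find_earlier_const(block, key):
    """Search the current block, then its parents, for an identical const."""
    while block is not None:
        if key in block.consts:
            return block.consts[key]
        block = block.parent
    return None


def visit_const(block, name, key, replacements):
    earlier = find_earlier_const(block, key)
    if earlier is None:
        block.consts[key] = name        # first occurrence: keep it
    else:
        replacements[name] = earlier    # duplicate: reuse the earlier statement


root = Block()
true_branch, false_branch = Block(root), Block(root)
replacements = {}
visit_const(root, "$8", ("i32", 0), replacements)
visit_const(true_branch, "$25", ("i32", 0), replacements)
visit_const(false_branch, "$40", ("i32", 0), replacements)
print(replacements)
```

Note that a const first seen inside one branch is not visible from the sibling branch, which matches the "search the parents" direction of option 2.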

For checking if the first statements (which can be container statements) in both branches of if are exactly the same, shall we add a function like bool same_statements(IRNode *root1, IRNode *root2) in ir.h and implement it using visitors in taichi/analysis/?

Very good question. I need to think about this a little bit. One very important IR functionality is to test if two IRNodes are equivalent. IRNode can be not only one statement but also a hierarchy. We might need to use some hashing here.

A few things to think about here

  • We have to support not only a single statement but also a container with multiple statements.
  • There are many statements to support, each with special fields. We do have a common std::vector<Stmt **> Stmt::operands that keeps track of all operands of a statement in a unified manner, but the special fields (e.g. BinaryOpType BinaryOpStmt::op_type) are not covered by it and have to be compared separately.
  • We don't have to worry about Expressions since they only live in the frontend.
  • Binary DNA
  • (Advanced) Reject fast.

There are 3 kinds of solutions I thought about. Denote the number of statements in the container IRNode we want to test by n (if it's not a container, then n = 1).

  1. Do nothing more when modifying statements. Then it takes O(n) time to find that two IRNode's are the same, and O(n) time in the worst case to find that two IRNode's are different. I think in most cases we can find that two IRNode's are different in O(1).
  2. Spend O(depth) more time when modifying statements, where "depth" means the number of container statements directly or indirectly containing the modified statement. We can update the Binary DNA and its hash in O(1) for each container statement. (Note that even if we only set a boolean variable to tell whether the container statement is modified, it still takes O(1) for each container statement!) So we can find that two IRNode's are different in O(1) in expectation, but we still need O(n) time to find that two IRNode's are the same, because the length of a Binary DNA is Ω(n).
  3. Spend O(depth * log(n)) more time when modifying statements. Then we can find that two IRNode's are the same in O(log(n)) with some fancy data structures.

I prefer the 1st solution. I think it is unacceptable to spend O(depth) more time whenever modifying statements, just to avoid the worst-case O(n) time for finding that two IRNode's are different: we modify statements far more often than we check whether two IRNode's are equivalent.

If there is a stage at which statements no longer change, we can build data structures for comparing IRNode's then.

Thanks for the detailed analysis. I agree with your decision and we should probably go with the 1st solution.

Meanwhile, a very easy-to-implement (and slightly hacky) way to test if two statements are equivalent:

  • First do a re_id pass to minimize the statement indices
  • Then use print_ir to convert the statements to an std::string
  • Then compare if the two strings are equal

This should work for most cases (assuming the print_ir pass is doing a correct job) and can probably be implemented within 20 LoC.
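
The three steps above could look roughly like this in Python, on a toy statement list. These are stand-ins written for this sketch, not Taichi's real re_id and print_ir passes: a statement is a (name, op, operands) tuple, and renumbered names use a "%" prefix so they cannot collide with names defined outside the fragment.

```python
def re_id(stmts):
    """Renumber statement names to %0, %1, ... in order of definition."""
    mapping = {}
    out = []
    for name, op, operands in stmts:
        # Operands defined outside the fragment keep their original names.
        new_operands = [mapping.get(o, o) for o in operands]
        mapping[name] = f"%{len(mapping)}"
        out.append((mapping[name], op, new_operands))
    return out


def print_ir(stmts):
    """Print one statement per line."""
    return "\n".join(f"{name} = {op} {' '.join(operands)}" for name, op, operands in stmts)


def same_by_printing(a, b):
    """Equivalence test: renumber, print, compare the strings."""
    return print_ir(re_id(a)) == print_ir(re_id(b))


a = [("$10", "const", ["1"]), ("$11", "add", ["$10", "$3"])]
b = [("$40", "const", ["1"]), ("$41", "add", ["$40", "$3"])]
c = [("$40", "const", ["2"]), ("$41", "add", ["$40", "$3"])]
print(same_by_printing(a, b), same_by_printing(a, c))
```

The cost is the full O(n) printing of both fragments on every query, even when the very first statement already differs.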

Thanks for the hacky way, but I want to implement a reject-fast solution. I think most of the queries will be of different IRNode's.

Maybe I can implement a visitor to visit one of the IRNode's, while storing the corresponding IRNode in the visitor class?

Sounds good. I champion your decision :-)

Maybe I can implement a visitor to visit one of the IRNode's, while storing the corresponding IRNode in the visitor class?

Right, you have to use one IRNode to guide the other.
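
A hedged Python sketch of the "one IRNode guides the other" idea with reject-fast behavior. `Node` and `same_statements` here are toy definitions for this sketch, not the function later added to Taichi, and operand remapping between the two trees is left out.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """Toy IR node: a kind, its special fields, and nested statements."""
    kind: str
    fields: tuple = ()
    children: list = field(default_factory=list)


def same_statements(a, b):
    """Walk `a` and `b` in lockstep and return False at the first mismatch."""
    # Cheap local checks first, so different nodes are rejected early.
    if a.kind != b.kind or a.fields != b.fields:
        return False
    if len(a.children) != len(b.children):
        return False
    # all() stops at the first pair of children that differs.
    return all(same_statements(x, y) for x, y in zip(a.children, b.children))


if_a = Node("if", ("$26",), [Node("store", ("$2", "$46")), Node("load", ("$2",))])
if_b = Node("if", ("$26",), [Node("store", ("$2", "$46")), Node("load", ("$2",))])
if_c = Node("if", ("$26",), [Node("store", ("$2", "$47")), Node("load", ("$2",))])
print(same_statements(if_a, if_b), same_statements(if_a, if_c))
```

Equal trees still cost O(n), as in solution 1 above, but most unequal pairs are rejected after a few comparisons.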

I wonder if this IR is valid:

<f32 x1> $238 = alloca
<f32 x1> $197 = alloca
<f32 x1> $239 : local store [$238 <- $197]
<f32 x1> $199 = ...
<f32 x1> $200 : local store [$197 <- $199]
<f32 x1> $242 = local load [ [$238[0]]]
<f32 x1> $218 = local load [ [$242[0]]]

It causes simplify.cpp to crash because the alloca here
https://github.com/taichi-dev/taichi/blob/24e76a14e3ebfc4a8ee7cc2b36d44030a75e226a/taichi/transforms/simplify.cpp#L479
is not an AllocaStmt when we are visiting $218.

Good question. LocalLoad must take Allocas as inputs. $218 is invalid.

So shall we add TI_ASSERT(...->is<AllocaStmt>()); to LocalAddress::var and LocalStoreStmt::ptr in their constructors?

Oh no, it's causing assertion failure even in the initial IR.

So shall we add TI_ASSERT(...->is<AllocaStmt>()); to LocalAddress::var and LocalStoreStmt::ptr in their constructors?

Good idea.

Oh no, it's causing assertion failure even in the initial IR.

Could you share with me more details?

Test case: test_ad_if.py test_ad_if_mutable

Part of the change set:

  LocalAddress(Stmt *var, int offset) : var(var), offset(offset) {
    std::cout << "local address" << std::endl;
    TI_ASSERT(var->is<AllocaStmt>());
  }
...
  void flatten(VecStatement &ret) override {
    std::cout << "from flatten" << std::endl;
    ret.push_back(std::make_unique<LocalLoadStmt>(
        LocalAddress(current_block->lookup_var(id), 0)));
    stmt = ret.back().get();
  }

Output:

Before preprocessing:
@ti.kernel
def func(i: ti.i32):
    t = x[i]
    if t > 0:
        y[i] = t
    else:
        y[i] = 2 * t

After preprocessing:
def func():
  i = ti.decl_scalar_arg(ti.i32)
  t = ti.expr_init(ti.subscript(x, i))
  if 1:
    __cond = ti.chain_compare([t, 0], ['Gt'])
    ti.core.begin_frontend_if(ti.Expr(__cond).ptr)
    ti.core.begin_frontend_if_true()
    ti.subscript(y, i).assign(t)
    ti.core.pop_scope()
    ti.core.begin_frontend_if_false()
    ti.subscript(y, i).assign(2 * t)
    ti.core.pop_scope()

[I 04/06/20 18:22:47.127] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a9f5d9347feda29776c658d0949d74f7>::operator ()@17] Initial IR:
==========
kernel {
  $0 = alloca @tmp4
  @tmp4 = gbl load #@tmp0[arg[0]]
  $2 = alloca @tmp5
  @tmp5 = @tmp4
  $4 = alloca @tmp6
  @tmp6 = 0
  $6 = alloca @tmp7
  @tmp7 = 1
  if (@tmp7 & (@tmp5 > @tmp6)) {
    #@tmp2[arg[0]] = @tmp4
  } else {
    #@tmp2[arg[0]] = (@tmp4 * 2)
  }
}
==========
from flatten
local address
local address
[E 04/06/20 18:22:47.129] [taichi/ir/ir.h:taichi::lang::LocalAddress::LocalAddress@1687] var->is<AllocaStmt>()

Still trying to find where the second local address comes from. Compiling ir.h takes minutes.

Maybe I should do the assertion only when var != nullptr?

Maybe I should do the assertion only when var != nullptr?

I assume LocalAddress'es must not have null pointers, but it would be good to be defensive.

Actually, there's a piece of Windows debugging infrastructure we can add here: could you help integrate this piece of code into taichi/system/traceback.cpp? It will give you a stack of function calls. Currently on Windows you only get an error message instead of a call stack when things crash. This makes debugging hard. Feel free to open up an issue/draft PR to track this.

#include <intrin.h>
#include <dbghelp.h>
#include <cstdarg>  // va_list, va_start, va_end used by dbg::trace
#include <cstdio>
#include <vector>
#include <string>
#include <sstream>

#include "taichi/platform/windows/windows.h"

#pragma comment(lib, "dbghelp.lib")


//  https://gist.github.com/rioki/85ca8295d51a5e0b7c56e5005b0ba8b4
//
//  Debug Helpers
//
// Copyright (c) 2015 - 2017 Sean Farrell <[email protected]>
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.
//

namespace dbg {
inline void trace(const char *msg, ...) {
  char buff[1024];

  va_list args;
  va_start(args, msg);
  vsnprintf(buff, 1024, msg, args);

  OutputDebugStringA(buff);

  va_end(args);
}

inline std::string basename(const std::string &file) {
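  // Use size_type: a narrower unsigned int can never compare equal to npos on 64-bit builds.
  std::string::size_type i = file.find_last_of("\\/");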
  if (i == std::string::npos) {
    return file;
  } else {
    return file.substr(i + 1);
  }
}

struct StackFrame {
  DWORD64 address;
  std::string name;
  std::string module;
  unsigned int line;
  std::string file;
};

inline std::vector<StackFrame> stack_trace() {
#if _WIN64
  DWORD machine = IMAGE_FILE_MACHINE_AMD64;
#else
  DWORD machine = IMAGE_FILE_MACHINE_I386;
#endif
  HANDLE process = GetCurrentProcess();
  HANDLE thread = GetCurrentThread();

  if (SymInitialize(process, NULL, TRUE) == FALSE) {
    DBG_TRACE(__FUNCTION__ ": Failed to call SymInitialize.");
    return std::vector<StackFrame>();
  }

  SymSetOptions(SYMOPT_LOAD_LINES);

  CONTEXT context = {};
  context.ContextFlags = CONTEXT_FULL;
  RtlCaptureContext(&context);

#if _WIN64
  STACKFRAME frame = {};
  frame.AddrPC.Offset = context.Rip;
  frame.AddrPC.Mode = AddrModeFlat;
  frame.AddrFrame.Offset = context.Rbp;
  frame.AddrFrame.Mode = AddrModeFlat;
  frame.AddrStack.Offset = context.Rsp;
  frame.AddrStack.Mode = AddrModeFlat;
#else
  STACKFRAME frame = {};
  frame.AddrPC.Offset = context.Eip;
  frame.AddrPC.Mode = AddrModeFlat;
  frame.AddrFrame.Offset = context.Ebp;
  frame.AddrFrame.Mode = AddrModeFlat;
  frame.AddrStack.Offset = context.Esp;
  frame.AddrStack.Mode = AddrModeFlat;
#endif

  bool first = true;

  std::vector<StackFrame> frames;
  while (StackWalk(machine, process, thread, &frame, &context, NULL,
                   SymFunctionTableAccess, SymGetModuleBase, NULL)) {
    StackFrame f = {};
    f.address = frame.AddrPC.Offset;

#if _WIN64
    DWORD64 moduleBase = 0;
#else
    DWORD moduleBase = 0;
#endif

    moduleBase = SymGetModuleBase(process, frame.AddrPC.Offset);

    char moduelBuff[MAX_PATH];
    if (moduleBase &&
        GetModuleFileNameA((HINSTANCE)moduleBase, moduelBuff, MAX_PATH)) {
      f.module = basename(moduelBuff);
    } else {
      f.module = "Unknown Module";
    }
#if _WIN64
    DWORD64 offset = 0;
#else
    DWORD offset = 0;
#endif
    char symbolBuffer[sizeof(IMAGEHLP_SYMBOL) + 255];
    PIMAGEHLP_SYMBOL symbol = (PIMAGEHLP_SYMBOL)symbolBuffer;
    symbol->SizeOfStruct = sizeof(IMAGEHLP_SYMBOL) + 255;
    symbol->MaxNameLength = 254;

    if (SymGetSymFromAddr(process, frame.AddrPC.Offset, &offset, symbol)) {
      f.name = symbol->Name;
    } else {
      DWORD error = GetLastError();
      DBG_TRACE(__FUNCTION__ ": Failed to resolve address 0x%X: %u\n",
                frame.AddrPC.Offset, error);
      f.name = "Unknown Function";
    }

    IMAGEHLP_LINE line;
    line.SizeOfStruct = sizeof(IMAGEHLP_LINE);

    DWORD offset_ln = 0;
    if (SymGetLineFromAddr(process, frame.AddrPC.Offset, &offset_ln, &line)) {
      f.file = line.FileName;
      f.line = line.LineNumber;
    } else {
      DWORD error = GetLastError();
      DBG_TRACE(__FUNCTION__ ": Failed to resolve line for 0x%X: %u\n",
                frame.AddrPC.Offset, error);
      f.line = 0;
    }

    if (!first) {
      frames.push_back(f);
    }
    first = false;
  }

  SymCleanup(process);

  return frames;
}

inline void handle_assert(const char *func, const char *cond) {
  std::stringstream buff;
  buff << func << ": Assertion '" << cond << "' failed! \n";
  buff << "\n";

  std::vector<StackFrame> stack = stack_trace();
  buff << "Callstack: \n";
  for (unsigned int i = 0; i < stack.size(); i++) {
    buff << "0x" << std::hex << stack[i].address << ": " << stack[i].name << "("
         << std::dec << stack[i].line << ") in " << stack[i].module << "\n";
  }

  // please replace with std::printf
  MessageBoxA(NULL, buff.str().c_str(), "Assert Failed", MB_OK | MB_ICONSTOP);
  abort();
}

}  // namespace dbg

I just checked that var is nullptr in the second local address.

I just located that

<f32 x1> $242 = local load [ [$238[0]]]
<f32 x1> $218 = local load [ [$242[0]]]

is introduced in make_adjoint... Debugging.

This is in BackupSSA and I printed it and found that all auto allocas are indeed allocas, at least at that place -- otherwise it should trigger assertion failure.

I suspect the problem is in MakeAdjoint. Please check #726 when you are available (the output is so long that I opened a new issue for it).

Sounds good. I'm occupied until 11:59 PM but I'll take a look after that time.

Currently, the following $47 cannot be eliminated:

<i32 x1> $2 = alloca
if $22 {
  <i32 x1> $47 : local store [$2 <- $46]
}
(nothing related to $2)

This is because $47 doesn't know that $2 will never be loaded.

There are 5 cases like this in test_ad_if_mutable, so we can reduce the number of statements by at least 10 (eliminating local store & alloca).
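
The case above can be sketched in Python on a toy IR. This is an illustration under strong assumptions, not the planned pass: statements are tuples, an alloca is only ever used by local loads and stores, and "never loaded" is checked over the whole kernel. `loaded_vars` and `strip_dead_stores` are hypothetical names.

```python
def loaded_vars(block):
    """Collect every alloca that is loaded anywhere, diving into if bodies."""
    found = set()
    for stmt in block:
        if stmt[0] == "load":
            found.add(stmt[1])
        elif stmt[0] == "if":
            found |= loaded_vars(stmt[2])
    return found


def strip_dead_stores(block, live):
    """Drop allocas that are never loaded, together with the stores into them."""
    out = []
    for stmt in block:
        if stmt[0] in ("alloca", "store") and stmt[1] not in live:
            continue  # nothing ever reads this variable
        if stmt[0] == "if":
            stmt = ("if", stmt[1], strip_dead_stores(stmt[2], live))
        out.append(stmt)
    return out


# ("alloca", var), ("store", var, value), ("load", var), ("if", cond, body)
kernel = [("alloca", "$2"), ("if", "$22", [("store", "$2", "$46")])]
optimized = strip_dead_stores(kernel, loaded_vars(kernel))
print(optimized)
```

Both the local store and the alloca disappear, which is the two-statements-per-case saving mentioned above; the now-empty if is left for another pass.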

Describe the solution you'd like (if any)
I want to implement a pass that analyzes allocas (for each alloca, do store forwarding and useless local store elimination), but I don't know if I should implement it in a new pass or in an existing pass.

(global tmp vars may be similar, but the Stmts are different so they can't be implemented together)

I find BasicBlockSimplify's function quite limited -- there are 3 of its main functions (common subexpression elimination, store forwarding, useless local store elimination) I want to upgrade.

LocalLoadSearcher, LocalStoreSearcher, LocalStoreForwarder may be necessary for the upgraded common subexpression elimination/store forwarding/useless local store elimination passes. Shall we move them to analysis/?

Thanks for spotting this. A new pass sounds better since no existing pass does this. Also I think the logic of this pass would be complex enough to justify the existence of itself.

LocalLoadSearcher, LocalStoreSearcher, LocalStoreForwarder may be necessary for the upgraded common subexpression elimination/store forwarding/useless local store elimination passes. Shall we move them to analysis/?

Sounds good!!

I want to make use of AlgSimp::alg_is_one to eliminate $6 in this case:

<i32 x1> $5 = const [1]
$6 : while control nullptr, $5

(We can eliminate it even if mask is not nullptr, right?)

But should it be in the alg_simp pass?

alg_is_one

maybe alg_is_non_zero_constant :)

alg_simp pass?

I thought this is about control flow, not the algebraic expression level, so maybe it's not really related?

maybe alg_is_non_zero_constant :)

Yes... It may be clearer if the type is u1.
BTW what's the behavior of if 0.1 or while control ..., 0.1?

I found

<i32 x1> $10 = const [1]
<i32 x1> $11 = cmp_gt $6 $9
<i32 x1> $12 = bit_and $10 $11

in some IRs, but it's hard to optimize if there are neither boolean types (u1) nor logical operations (logic_and).

Maybe another way to optimize it is to change this from expr_init(True) to expr_init(-1)...
https://github.com/taichi-dev/taichi/blob/532ea3340e8c8201c97c768110be907038df7a17/python/taichi/lang/impl.py#L110
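
A quick Python check of why the constant matters (plain integer arithmetic, nothing Taichi-specific; `is_identity_mask` is a name made up for this sketch). With all bits set, -1 & a equals a for every integer a, which is the "-1 & a" rule in the list at the top. With 1, the identity only holds when a is already known to be 0 or 1, and the simplifier cannot know that without a boolean type.

```python
def is_identity_mask(mask, values):
    """True if `mask & v == v` for every sample value."""
    return all((mask & v) == v for v in values)


samples = [0, 1, 2, 5, -1, -7, 2**31 - 1]
all_ones_ok = is_identity_mask(-1, samples)    # -1 has every bit set
one_ok = is_identity_mask(1, samples)          # 1 keeps only the lowest bit
one_ok_on_bools = is_identity_mask(1, [0, 1])  # fine if the operand is 0 or 1
print(all_ones_ok, one_ok, one_ok_on_bools)
```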

My feeling is that we should systematically fix this after we have u1 introduced...

[Figure: benchmark20200422]

The geometric mean of the optimization factor on the number of statements among all tests is 1.068 now.
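
For reference, a small Python sketch of how such a number can be computed. It assumes the optimization factor of a test is (statements before) / (statements after), so bigger means more optimized; the test names and counts below are hypothetical, not the real benchmark data.

```python
import math


def geometric_mean(factors):
    """exp of the mean of the logs, i.e. the n-th root of the product."""
    return math.exp(sum(math.log(f) for f in factors) / len(factors))


# Hypothetical (before, after) statement counts for three tests.
counts = {"test_a": (100, 50), "test_b": (40, 40), "test_c": (30, 60)}
factors = [before / after for before, after in counts.values()]
overall = geometric_mean(factors)
print(factors, overall)
```

The geometric mean is used rather than the arithmetic mean so that a 2x improvement and a 2x regression cancel out, as they do in this example.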

Cool! I assume bigger means more optimized in the table. I'm curious which tests give you a number < 0.75, and which are > 1.5?

(PS: it's almost always good to use xlabel and ylabel and title to make your plots easier to understand.)

Tests with > 1.5 boost:

test_ad_if__test_ad_if : 1.5348837209302326
test_ad_if__test_ad_if_mutable : 2.0485436893203883
test_ad_if__test_ad_if_parallel : 1.9245283018867925
test_ad_if__test_ad_if_parallel_complex : 1.625
test_continue__test_kernel_continue : 1.5844155844155845

(test_ad_if__test_ad_if_mutable should have been optimized from 105 statements to 26 statements, but there are other kernels causing ~100 statements in total that can hardly be optimized.)

Tests that become much worse (< 0.75):

test_tensor_dimensionality__test_dimensionality : 0.7463768115942029
test_tensor_reflection__test_POT : 0.7272727272727273

Thanks for the report. The bad news is that we have overfit to the test_ad_if series; the good news is there is still a lot of room to improve here...

I just found that test_tensor_reflection__test_POT has no kernels in it. Figuring out what's wrong...

For test_tensor_dimensionality, it's indeed optimized:
before:

kernel {
  $0 = offloaded range_for(0, 256) block_dim=adaptive {
    <i32 x1> $1 = const [0]
    <i32 x1> $2 = loop index 0
    <i32 x1> $3 = bit_extract($2 + 0, 7~8)
    <i32 x1> $4 = const [1]
    <i32 x1> $5 = mul $3 $4
    <i32 x1> $6 = add $1 $5
    <i32 x1> $7 = bit_extract($2 + 0, 6~7)
    <i32 x1> $8 = mul $7 $4
    <i32 x1> $9 = add $1 $8
    <i32 x1> $10 = bit_extract($2 + 0, 5~6)
    <i32 x1> $11 = mul $10 $4
    <i32 x1> $12 = add $1 $11
    <i32 x1> $13 = bit_extract($2 + 0, 4~5)
    <i32 x1> $14 = mul $13 $4
    <i32 x1> $15 = add $1 $14
    <i32 x1> $16 = bit_extract($2 + 0, 3~4)
    <i32 x1> $17 = mul $16 $4
    <i32 x1> $18 = add $1 $17
    <i32 x1> $19 = bit_extract($2 + 0, 2~3)
    <i32 x1> $20 = mul $19 $4
    <i32 x1> $21 = add $1 $20
    <i32 x1> $22 = bit_extract($2 + 0, 1~2)
    <i32 x1> $23 = mul $22 $4
    <i32 x1> $24 = add $1 $23
    <i32 x1> $25 = bit_extract($2 + 0, 0~1)
    <i32 x1> $26 = mul $25 $4
    <i32 x1> $27 = add $1 $26
    <i32 x1> $28 = add $6 $9
    <i32 x1> $29 = add $28 $12
    <i32 x1> $30 = add $29 $15
    <i32 x1> $31 = add $30 $18
    <i32 x1> $32 = add $31 $21
    <i32 x1> $33 = add $32 $24
    <i32 x1> $34 = add $33 $27
    <gen*x1> $35 = get root
    <i32 x1> $36 = linearized(ind {}, stride {})
    <gen*x1> $37 = [S0root][root]::lookup($35, $36) activate = false
    <gen*x1> $38 = get child [S0root->S1dense] $37
    <i32 x1> $39 = bit_extract($6 + 0, 0~1)
    <i32 x1> $40 = bit_extract($9 + 0, 0~1)
    <i32 x1> $41 = bit_extract($12 + 0, 0~1)
    <i32 x1> $42 = bit_extract($15 + 0, 0~1)
    <i32 x1> $43 = bit_extract($18 + 0, 0~1)
    <i32 x1> $44 = bit_extract($21 + 0, 0~1)
    <i32 x1> $45 = bit_extract($24 + 0, 0~1)
    <i32 x1> $46 = bit_extract($27 + 0, 0~1)
    <i32 x1> $47 = linearized(ind {$39, $40, $41, $42, $43, $44, $45, $46}, stride {2, 2, 2, 2, 2, 2, 2, 2})
    <gen*x1> $48 = [S1dense][dense]::lookup($38, $47) activate = false
    <i32*x1> $49 = get child [S1dense->S2place_i32] $48
    <i32 x1> $50 = atomic add($49, $34)
    <i32*x1> $51 = get child [S1dense->S3place_i32] $48
    <i32 x1> $52 = atomic add($51, $6)
  }
}

after:

kernel {
  $0 = offloaded range_for(0, 256) block_dim=adaptive {
    <i32 x1> $1 = loop index 0
    <i32 x1> $2 = bit_extract($1 + 0, 7~8)
    <i32 x1> $3 = bit_extract($1 + 0, 6~7)
    <i32 x1> $4 = bit_extract($1 + 0, 5~6)
    <i32 x1> $5 = bit_extract($1 + 0, 4~5)
    <i32 x1> $6 = bit_extract($1 + 0, 3~4)
    <i32 x1> $7 = bit_extract($1 + 0, 2~3)
    <i32 x1> $8 = bit_extract($1 + 0, 1~2)
    <i32 x1> $9 = bit_extract($1 + 0, 0~1)
    <i32 x1> $10 = add $2 $3
    <i32 x1> $11 = add $10 $4
    <i32 x1> $12 = add $11 $5
    <i32 x1> $13 = add $12 $6
    <i32 x1> $14 = add $13 $7
    <i32 x1> $15 = add $14 $8
    <i32 x1> $16 = add $15 $9
    <gen*x1> $17 = get root
    <i32 x1> $18 = const [0]
    <gen*x1> $19 = [S0root][root]::lookup($17, $18) activate = false
    <gen*x1> $20 = get child [S0root->S1dense] $19
    <i32 x1> $21 = bit_extract($2 + 0, 0~1)
    <i32 x1> $22 = bit_extract($3 + 0, 0~1)
    <i32 x1> $23 = bit_extract($4 + 0, 0~1)
    <i32 x1> $24 = bit_extract($5 + 0, 0~1)
    <i32 x1> $25 = bit_extract($6 + 0, 0~1)
    <i32 x1> $26 = bit_extract($7 + 0, 0~1)
    <i32 x1> $27 = bit_extract($8 + 0, 0~1)
    <i32 x1> $28 = bit_extract($9 + 0, 0~1)
    <i32 x1> $29 = const [2]
    <i32 x1> $30 = mul $27 $29
    <i32 x1> $31 = add $28 $30
    <i32 x1> $32 = const [4]
    <i32 x1> $33 = mul $26 $32
    <i32 x1> $34 = add $31 $33
    <i32 x1> $35 = const [8]
    <i32 x1> $36 = mul $25 $35
    <i32 x1> $37 = add $34 $36
    <i32 x1> $38 = const [16]
    <i32 x1> $39 = mul $24 $38
    <i32 x1> $40 = add $37 $39
    <i32 x1> $41 = const [32]
    <i32 x1> $42 = mul $23 $41
    <i32 x1> $43 = add $40 $42
    <i32 x1> $44 = const [64]
    <i32 x1> $45 = mul $22 $44
    <i32 x1> $46 = add $43 $45
    <i32 x1> $47 = const [128]
    <i32 x1> $48 = mul $21 $47
    <i32 x1> $49 = add $46 $48
    <gen*x1> $50 = [S1dense][dense]::lookup($20, $49) activate = false
    <i32*x1> $51 = get child [S1dense->S2place_i32] $50
    <i32 x1> $52 = atomic add($51, $16)
    <i32*x1> $53 = get child [S1dense->S3place_i32] $50
    <i32 x1> $54 = atomic add($53, $2)
  }
}

It's just that lowering linearized produces many statements.
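
To see where the extra statements come from, here is a Python sketch of the arithmetic the lowered IR performs. The weights are read off the "after" dump above (1, 2, 4, ..., 128 for eight indices with stride 2); `lower_linearized` is a name chosen for this sketch, not the actual pass.

```python
def lower_linearized(indices, strides):
    """Sum of index * weight, where the weight is the product of the later strides."""
    total, weight = 0, 1
    for index, stride in zip(reversed(indices), reversed(strides)):
        total += index * weight   # one mul and one add per index in the IR
        weight *= stride          # one new const per index in the IR
    return total


def horner(indices, strides):
    """The same value in nested form: ((i0 * s1 + i1) * s2 + i2) ..."""
    total = 0
    for index, stride in zip(indices, strides):
        total = total * stride + index
    return total


bits = [1, 0, 0, 0, 0, 0, 0, 1]
value = lower_linearized(bits, [2] * 8)
print(value)
```

In the dump, the single linearized statement ($47 before) becomes the 21 statements $29 to $49 after: a const, a mul and an add for each of the seven indices whose weight is not 1.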

Well... the $21-$28 here is just the same as $2-$9, isn't it?

It's just that lowering linearized produces many statements.

I see :-) People sometimes use a cost model to assign, say, linearize a higher weight.

Well... the $21-$28 here is just the same as $2-$9, isn't it?

Right, we can add a special optimization for a bitextract that takes as input another bitextract.
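
A possible shape for that special case, sketched in Python. It assumes that begin~end in the dumps denotes the half-open bit range [begin, end) and that the "+ 0" offset is zero; `bit_extract` and `fuse_bit_extract` are helper names made up for this sketch.

```python
def bit_extract(x, begin, end):
    """Bits [begin, end) of a non-negative integer x."""
    return (x >> begin) & ((1 << (end - begin)) - 1)


def fuse_bit_extract(inner, outer):
    """Fold bit_extract(bit_extract(x, *inner), *outer) into one (begin, end) range of x."""
    (a, b), (c, d) = inner, outer
    begin = a + c
    end = min(b, a + d)            # the inner result has no bits at or above b - a
    return begin, max(begin, end)  # an empty range extracts 0


fused = fuse_bit_extract((7, 8), (0, 1))
print(fused)
```

For the dump above this folds bit_extract(bit_extract($1 + 0, 7~8) + 0, 0~1) back into the range 7~8, so $21 would be recognized as identical to $2.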

I just found that test_tensor_reflection__test_POT has no kernels in it. Figuring out what's wrong...

Maybe we shouldn't have printed stats in the destructor of Program. We compile some kernels after Program finalizes, so the statistics and the test names are misaligned.

Look at this log (the Windows fatal exception always appears at the beginning, and it doesn't seem to matter):

C:\Users\xmk\Desktop\taichi\tests\python>pytest -s test_tensor_dimensionality.py
============================================================================== test session starts ===============================================================================
platform win32 -- Python 3.7.4, pytest-5.0.1, py-1.8.0, pluggy-0.12.0
rootdir: C:\Users\xmk\Desktop\taichi\tests\python
plugins: arraydiff-0.3, doctestplus-0.3.0, openfiles-0.3.2, remotedata-0.3.2
collecting ... [Taichi] mode=development
[Taichi] <dev mode>, supported archs: [cpu only], commit d7610347, python 3.7.4
Windows fatal exception: code 0xc0000138

Current thread 0x00002798 (most recent call first):
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 1043 in create_module
  File "<frozen importlib._bootstrap>", line 583 in module_from_spec
  File "<frozen importlib._bootstrap>", line 670 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "D:\Anaconda3\lib\site-packages\torch\__init__.py", line 81 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "C:\Users\xmk\Desktop\taichi\python\taichi\lang\util.py", line 7 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py", line 2 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "C:\Users\xmk\Desktop\taichi\python\taichi\lang\impl.py", line 3 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "C:\Users\xmk\Desktop\taichi\python\taichi\lang\__init__.py", line 1 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "C:\Users\xmk\Desktop\taichi\python\taichi\__init__.py", line 17 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "C:\Users\xmk\Desktop\taichi\tests\python\test_tensor_dimensionality.py", line 1 in <module>
  File "D:\Anaconda3\lib\site-packages\_pytest\assertion\rewrite.py", line 149 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "D:\Anaconda3\lib\site-packages\py\_path\local.py", line 701 in pyimport
  File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 498 in _importtestmodule
  File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 431 in _getobj
  File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 250 in obj
  File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 446 in _inject_setup_module_fixture
  File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 434 in collect
  File "D:\Anaconda3\lib\site-packages\_pytest\runner.py", line 247 in <lambda>
  File "D:\Anaconda3\lib\site-packages\_pytest\runner.py", line 220 in from_call
  File "D:\Anaconda3\lib\site-packages\_pytest\runner.py", line 247 in pytest_make_collect_report
  File "D:\Anaconda3\lib\site-packages\pluggy\callers.py", line 187 in _multicall
  File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 81 in <lambda>
  File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 87 in _hookexec
  File "D:\Anaconda3\lib\site-packages\pluggy\hooks.py", line 289 in __call__
  File "D:\Anaconda3\lib\site-packages\_pytest\runner.py", line 368 in collect_one_node
  File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 717 in genitems
  File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 498 in _perform_collect
  File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 459 in perform_collect
  File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 266 in pytest_collection
  File "D:\Anaconda3\lib\site-packages\pluggy\callers.py", line 187 in _multicall
  File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 81 in <lambda>
  File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 87 in _hookexec
  File "D:\Anaconda3\lib\site-packages\pluggy\hooks.py", line 289 in __call__
  File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 256 in _main
  File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 213 in wrap_session
  File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 250 in pytest_cmdline_main
  File "D:\Anaconda3\lib\site-packages\pluggy\callers.py", line 187 in _multicall
  File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 81 in <lambda>
  File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 87 in _hookexec
  File "D:\Anaconda3\lib\site-packages\pluggy\hooks.py", line 289 in __call__
  File "D:\Anaconda3\lib\site-packages\_pytest\config\__init__.py", line 74 in main
  File "D:\Anaconda3\Scripts\pytest-script.py", line 10 in <module>
collected 1 item

test_tensor_dimensionality.py Running test on arch=Arch.x64
[T 04/22/20 19:03:48.499] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:48.500] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:48.501] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:48.502] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:48.503] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:48.503] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb6ad10) arch=x64 initialized.
[T 04/22/20 19:03:48.511] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:49.220] [llvm_context.cpp:taichi::lang::compile_runtime_bitcode@137] Compiling runtime module bitcode...
[T 04/22/20 19:03:49.617] [llvm_context.cpp:taichi::lang::compile_runtime_bitcode@152] runtime module bitcode compiled.
[T 04/22/20 19:03:49.892] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:49.893] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000742 s
[T 04/22/20 19:03:49.894] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 32 B
[T 04/22/20 19:03:49.916] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:49.916] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:49.947] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:49.948] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:49.951] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:49.963] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:49.964] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:49.965] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:49.978] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:49.980] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:49.981] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:49.987] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:49.987] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:49.988] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:49.990] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:49.990] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:49.991] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:50.003] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:50.034] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c4_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:50.118] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements  : 58.00
[T 04/22/20 19:03:50.127] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb6ad10) finalized.
[T 04/22/20 19:03:50.132] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:50.133] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:50.134] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:50.134] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:50.135] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:50.135] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb66010) arch=x64 initialized.
[T 04/22/20 19:03:50.141] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:50.397] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:50.398] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000711 s
[T 04/22/20 19:03:50.398] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 64 B
[T 04/22/20 19:03:50.423] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:50.423] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:50.443] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:50.444] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.444] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:50.451] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:50.451] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.454] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:50.456] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:50.458] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.459] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:50.461] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:50.461] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.462] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:50.466] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:50.466] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:50.469] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:50.483] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:50.525] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c6_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:50.610] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements  : 74.00
[T 04/22/20 19:03:50.617] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb66010) finalized.
[T 04/22/20 19:03:50.622] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:50.622] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:50.623] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:50.627] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:50.628] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:50.628] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb68b60) arch=x64 initialized.
[T 04/22/20 19:03:50.635] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:50.880] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:50.881] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000716 s
[T 04/22/20 19:03:50.883] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 128 B
[T 04/22/20 19:03:50.905] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:50.905] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:50.926] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:50.927] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.928] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:50.934] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:50.935] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.937] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:50.938] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:50.942] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.943] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:50.944] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:50.945] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.947] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:50.949] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:50.952] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:50.953] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:50.963] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:50.995] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c8_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:51.075] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements  : 90.00
[T 04/22/20 19:03:51.083] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb68b60) finalized.
[T 04/22/20 19:03:51.089] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:51.089] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:51.090] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:51.096] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:51.097] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:51.101] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb6b1e0) arch=x64 initialized.
[T 04/22/20 19:03:51.106] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:51.341] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:51.341] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000685 s
[T 04/22/20 19:03:51.342] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 256 B
[T 04/22/20 19:03:51.360] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:51.361] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:51.387] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:51.388] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.389] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:51.391] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:51.392] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.392] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:51.394] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:51.395] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.396] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:51.397] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:51.398] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.399] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:51.404] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:51.405] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:51.405] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:51.416] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:51.448] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c10_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:51.541] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements  : 106.00
[T 04/22/20 19:03:51.550] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb6b1e0) finalized.
[T 04/22/20 19:03:51.556] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:51.557] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:51.558] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:51.558] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:51.559] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:51.559] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb68b60) arch=x64 initialized.
[T 04/22/20 19:03:51.565] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:51.808] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:51.808] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000695 s
[T 04/22/20 19:03:51.809] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 512 B
[T 04/22/20 19:03:51.825] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:51.826] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:51.843] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:51.844] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.849] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:51.861] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:51.863] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.864] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:51.877] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:51.879] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.880] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:51.885] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:51.885] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.886] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:51.888] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:51.889] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:51.890] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:51.907] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:51.944] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c12_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:52.027] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements  : 122.00
[T 04/22/20 19:03:52.035] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb68b60) finalized.
[T 04/22/20 19:03:52.040] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:52.041] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:52.042] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:52.048] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:52.049] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:52.053] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb6a370) arch=x64 initialized.
[T 04/22/20 19:03:52.059] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:52.300] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:52.301] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000793 s
[T 04/22/20 19:03:52.301] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 1024 B
[T 04/22/20 19:03:52.319] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:52.320] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:52.338] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:52.339] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.340] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:52.346] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:52.347] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.349] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:52.351] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:52.355] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.356] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:52.358] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:52.359] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.360] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:52.366] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:52.367] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:52.367] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:52.381] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:52.421] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c14_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:52.508] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements  : 138.00
[T 04/22/20 19:03:52.516] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb6a370) finalized.
[T 04/22/20 19:03:52.521] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:52.521] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:52.523] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:52.527] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:52.530] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:52.530] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb68b60) arch=x64 initialized.
[T 04/22/20 19:03:52.540] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:52.779] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:52.780] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000759 s
[T 04/22/20 19:03:52.781] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 2048 B
[T 04/22/20 19:03:52.797] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:52.797] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:52.818] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:52.820] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.820] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:52.834] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:52.835] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.836] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:52.849] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:52.851] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.852] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:52.865] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:52.866] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.868] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:52.874] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:52.875] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:52.876] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:52.891] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:52.934] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c16_0...
.

============================================================================ 1 passed in 5.47 seconds ============================================================================

When running ti test, the statistics of the last kernel compiled above (fill_c16_0) are mistakenly written to test_tensor_reflection__test_POT.

@yuanming-hu do you have any ideas on how to fix this? Having tests and statistics mismatched looks bad.

We compile some kernels after the Program finalizes, so the statistics end up attributed to the wrong test names.

Interesting - I think the real issue here is that kernels should always be compiled while a program exists. I'm looking into test_tensor_dimensionality - do you have any idea how a kernel could be compiled without a program?

Oh, the kernel is compiled with a program, but it seems that the last Program's destructor function is never called.

Maybe that is because Windows fatal exception: code 0xc0000138 happens before the program's dtor? Note that the outputs are not guaranteed to be sorted in wall-clock time when you have multiple output streams...
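To take the interleaving of output streams out of the picture, the Taichi log lines can be re-sorted by their own timestamps. A minimal sketch, assuming the `[T 04/22/20 19:03:52.820]` prefix seen in the logs here is month/day/two-digit-year with millisecond resolution; `sort_by_timestamp` is a hypothetical helper, not part of Taichi. Lines without a timestamp (pytest output, the Windows fatal exception message) are dropped, so their position relative to the trace lines stays unknown.

```python
import re
from datetime import datetime

# Prefix of a Taichi log line as it appears in this thread, e.g.
#   [T 04/22/20 19:03:52.820] [program.cpp:...] message
# Assumption: the date is month/day/two-digit year.
STAMP_RE = re.compile(r"^\[[A-Z] (\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\]")


def sort_by_timestamp(lines):
    """Return only the timestamped lines, ordered by their timestamp.

    The sort is stable, so lines sharing a millisecond keep their
    original relative order.
    """
    stamped = []
    for line in lines:
        match = STAMP_RE.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S.%f")
            stamped.append((when, line))
    stamped.sort(key=lambda pair: pair[0])
    return [line for _, line in stamped]
```

If the `Program finalizing...` line still comes before the last `Compiling kernel ...` line after sorting, the mismatch is not just an artifact of stream buffering.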

If I use ti test -v rather than pytest -s, the Windows fatal exception doesn't happen. Would you please print something in Program's constructors and destructors to double-check whether the last Program is destructed?

Setting the environment variable TI_LOG_LEVEL=trace will do the job.
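For a repeatable run, the variable can also be set from a small wrapper script. This is only a sketch: `trace_env` and `run_with_trace` are hypothetical helper names, and merging stderr into stdout is my own choice to keep everything in one stream (buffering inside the child process may still affect ordering).

```python
import os
import subprocess
import sys


def trace_env(base_env):
    """Return a copy of base_env with TI_LOG_LEVEL set to trace."""
    env = dict(base_env)
    env["TI_LOG_LEVEL"] = "trace"
    return env


def run_with_trace(test_file):
    """Run `pytest -s <test_file>` with trace logging and capture the output.

    stderr is redirected into stdout so that only one stream is captured.
    """
    cmd = [sys.executable, "-m", "pytest", "-s", test_file]
    return subprocess.run(
        cmd,
        env=trace_env(os.environ),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
```

For example, `run_with_trace("test_tensor_dimensionality.py").stdout` would hold the combined log as one string.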

With TI_LOG_LEVEL=trace, it looks like each test ends with Compiling kernel ... rather than Program finalizing...

BTW, the number of Program constructions and destructions can be greater than the number of tests in files like test_tensor_dimensionality.py. However, we only record the statistics of one of those Programs.
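One way to check whether every Program gets finalized is to pair up the `initialized` and `finalized` trace lines by pointer. A rough sketch, assuming the message text matches the trace output in this thread; `unfinalized_programs` is a hypothetical helper. Addresses can be reused by later Programs, so the script tracks which ones are currently live instead of just counting.

```python
import re

# Message formats as they appear in the trace logs in this thread, e.g.
#   Program (0x56273a7f02d0) arch=x64 initialized.
#   Program (0x56273a7f02d0) finalized.
INIT_RE = re.compile(r"Program \((0x[0-9a-fA-F]+)\) arch=\w+ initialized\.")
FINAL_RE = re.compile(r"Program \((0x[0-9a-fA-F]+)\) finalized\.")


def unfinalized_programs(log_text):
    """Return addresses of Programs that were initialized but never finalized."""
    live = []
    for line in log_text.splitlines():
        match = INIT_RE.search(line)
        if match:
            live.append(match.group(1))
            continue
        match = FINAL_RE.search(line)
        if match and match.group(1) in live:
            # The address may be reused later, so remove only one entry.
            live.remove(match.group(1))
    return live
```

An empty result means every constructed Program was finalized before the log ended; a non-empty one points at the Program whose destructor never ran.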

On Linux it looks fine to me:

pytest -s test_tensor_dimensionality.py
=================================================== test session starts ====================================================
platform linux -- Python 3.6.9, pytest-5.4.1, py-1.8.0, pluggy-0.13.0
rootdir: /home/yuanming/repos/taichi/tests/python
plugins: arraydiff-0.3, forked-1.1.3, remotedata-0.3.2, xdist-1.31.0, doctestplus-0.4.0, openfiles-0.4.0
collecting ... [Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-le6az2sv
[Taichi] sandbox prepared
[T 04/22/20 20:17:34.188] [cuda_driver.cpp:CUDADriver@44] CUDA driver API (v10.2) loaded.
[Taichi] <dev mode>, supported archs: [cpu, cuda, opengl], commit 01391957, python 3.6.9
collected 1 item                                                                                                           

test_tensor_dimensionality.py Running test on arch=Arch.x64
[T 04/22/20 20:17:34.399] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:34.399] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:34.399] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:34.399] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:34.400] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:34.400] [program.cpp:Program@133] Program (0x56273a7f02d0) arch=x64 initialized.
[T 04/22/20 20:17:34.401] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:34.406] [llvm_context.cpp:compile_runtime_bitcode@137] Compiling runtime module bitcode...
[T 04/22/20 20:17:34.622] [llvm_context.cpp:compile_runtime_bitcode@152] runtime module bitcode compiled.
[T 04/22/20 20:17:34.731] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:34.731] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 3.29e-05 s
[T 04/22/20 20:17:34.731] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 32 B
[T 04/22/20 20:17:34.739] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:34.739] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:34.749] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:34.749] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:34.749] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f4170974000
[D 04/22/20 20:17:34.750] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:34.750] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:34.750] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f4170977000
[D 04/22/20 20:17:34.751] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:34.751] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:34.751] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f417097a000
[D 04/22/20 20:17:34.752] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:34.752] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:34.752] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f417097d000
[D 04/22/20 20:17:34.753] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:34.753] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:34.753] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f4170980000
[T 04/22/20 20:17:34.758] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:34.772] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c4_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:34.814] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:34.815] [program.cpp:finalize@512] Program (0x56273a7f02d0) finalized.
[T 04/22/20 20:17:34.817] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:34.829] [cuda_context.cpp:CUDAContext@25] Using CUDA device [id=0]: GeForce GTX 1080 Ti
[T 04/22/20 20:17:34.829] [cuda_context.cpp:CUDAContext@33] CUDA Device Compute Capability: 6.1
[T 04/22/20 20:17:34.875] [cuda_context.cpp:CUDAContext@38] Total memory 10.91 GB; free memory 9.74 GB
[T 04/22/20 20:17:34.875] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:34.876] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:34.876] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:34.876] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:34.876] [program.cpp:Program@133] Program (0x56273abbbac0) arch=cuda initialized.
[T 04/22/20 20:17:34.877] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:34.986] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:34.986] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:34.986] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:34.986] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:34.995] [llvm_context.cpp:compile_runtime_bitcode@137] Compiling runtime module bitcode...
[T 04/22/20 20:17:35.218] [llvm_context.cpp:compile_runtime_bitcode@152] runtime module bitcode compiled.
[T 04/22/20 20:17:35.414] [jit_cuda.cpp:add_module@90] PTX size: 82.11KB
[T 04/22/20 20:17:35.414] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:35.419] [jit_cuda.cpp:add_module@97] CUDA module load time : 4.472970962524414ms
[T 04/22/20 20:17:35.419] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:35.419] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 6.01e-05 s
[T 04/22/20 20:17:35.419] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 32 B
[T 04/22/20 20:17:35.428] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:35.428] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:35.437] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:35.437] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:35.437] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:35.439] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:35.439] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:35.439] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:35.440] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:35.440] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:35.440] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:35.441] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:35.441] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:35.441] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:35.442] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:35.442] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:35.442] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:35.460] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c6_0...
[T 04/22/20 20:17:35.646] [jit_cuda.cpp:add_module@90] PTX size: 83.62KB
[T 04/22/20 20:17:35.646] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:35.647] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4448890686035156ms
[D 04/22/20 20:17:35.647] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c6_0_kernel_4_range_for<<<896, 64>>>
[T 04/22/20 20:17:35.647] [jit_cuda.cpp:lookup_function@47] Kernel fill_c6_0_kernel_4_range_for compilation time: 0.0011920928955078125ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:35.661] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:35.661] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:35.670] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:35.670] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:35.670] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:35.670] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:35.670] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:35.670] [program.cpp:Program@133] Program (0x56273abbbac0) arch=opengl initialized.
[T 04/22/20 20:17:35.671] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:35.780] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:35.780] [program.cpp:materialize_layout@298] OpenGL root buffer size: 32 B
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@3] [glsl] Found GL_ARB_compute_shader
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@4] [glsl] Found GL_ARB_gpu_shader_int64
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@5] [glsl] Found GL_NV_shader_atomic_float
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@6] [glsl] Found GL_NV_shader_atomic_float64
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@7] [glsl] Found GL_NV_shader_atomic_int64
[T 04/22/20 20:17:35.928] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:35.928] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:35.928] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:35.928] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:35.929] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c8_0...
[D 04/22/20 20:17:35.931] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c8_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };

void fill_c8_00()
{ // range for
  // range known at compile time
  int _tid = int(gl_GlobalInvocationID.x);
  if (_tid >= 4) return;
  int _itv = 0 + _tid * 1;
    int G = _itv;
    int H = (((0 + G) >> 1) & ((1 << 1) - 1));
    int L = (((0 + G) >> 0) & ((1 << 1) - 1));
    int O = H + L;
    int S = 0;
    int Gm = 0;
    int U = S + 32 * Gm; // S0
    int V = U + 0; // S1
    int W = (((0 + H) >> 0) & ((1 << 1) - 1));
    int X = (((0 + L) >> 0) & ((1 << 1) - 1));
    int Go = 1;
    int Gp = X * Go;
    int Gq = Gm + Gp;
    int Gr = 2;
    int Gs = W * Gr;
    int Gt = Gq + Gs;
    int Z = V + 8 * Gt; // S1
    int Aq = Z + 0; // S2
    int As = atomicAdd(_data_i32_[Aq >> 2], O);
    int AE = Z + 4; // S3
    int AG = atomicAdd(_data_i32_[AE >> 2], H);
}

void main()
{
  fill_c8_00();
}
layout(local_size_x = 4 /* 1, 4 */, local_size_y = 1, local_size_z = 1) in;

[T 04/22/20 20:17:35.931] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:35.931] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:35.931] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:35.931] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:35.932] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:35.932] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:35.932] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:35.932] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:35.932] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:35.933] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:35.936] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:35.936] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:35.936] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:35.936] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:35.936] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:35.936] [program.cpp:Program@133] Program (0x56273b70b7e0) arch=x64 initialized.
[T 04/22/20 20:17:35.937] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:36.047] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:36.047] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 2.22e-05 s
[T 04/22/20 20:17:36.047] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 64 B
[T 04/22/20 20:17:36.055] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:36.055] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:36.064] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:36.064] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.064] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:36.065] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:36.065] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.065] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:36.067] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:36.067] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.067] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:36.068] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:36.068] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.068] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:36.069] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:36.069] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:36.069] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:36.074] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:36.088] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c10_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:36.126] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:36.127] [program.cpp:finalize@512] Program (0x56273b70b7e0) finalized.
[T 04/22/20 20:17:36.129] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:36.129] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:36.129] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:36.129] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:36.129] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:36.129] [program.cpp:Program@133] Program (0x56273abbbac0) arch=cuda initialized.
[T 04/22/20 20:17:36.130] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:36.241] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:36.241] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:36.241] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:36.241] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:36.438] [jit_cuda.cpp:add_module@90] PTX size: 82.16KB
[T 04/22/20 20:17:36.438] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:36.438] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4589557647705078ms
[T 04/22/20 20:17:36.439] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:36.439] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 6.41e-05 s
[T 04/22/20 20:17:36.439] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 64 B
[T 04/22/20 20:17:36.447] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:36.447] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:36.457] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:36.457] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.457] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:36.458] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:36.458] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.458] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:36.459] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:36.459] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.459] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:36.460] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:36.460] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.460] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:36.461] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:36.461] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:36.461] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:36.479] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c12_0...
[T 04/22/20 20:17:36.665] [jit_cuda.cpp:add_module@90] PTX size: 83.76KB
[T 04/22/20 20:17:36.665] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:36.665] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4951953887939453ms
[D 04/22/20 20:17:36.665] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c12_0_kernel_10_range_for<<<896, 64>>>
[T 04/22/20 20:17:36.665] [jit_cuda.cpp:lookup_function@47] Kernel fill_c12_0_kernel_10_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:36.680] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:36.680] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:36.689] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:36.689] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:36.689] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:36.689] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:36.689] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:36.689] [program.cpp:Program@133] Program (0x56273b70b7e0) arch=opengl initialized.
[T 04/22/20 20:17:36.690] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:36.802] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:36.802] [program.cpp:materialize_layout@298] OpenGL root buffer size: 64 B
[T 04/22/20 20:17:36.803] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:36.803] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:36.803] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:36.803] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:36.806] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c14_0...
[D 04/22/20 20:17:36.808] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c14_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };

void fill_c14_00()
{ // range for
  // range known at compile time
  int _tid = int(gl_GlobalInvocationID.x);
  if (_tid >= 8) return;
  int _itv = 0 + _tid * 1;
    int G = _itv;
    int H = (((0 + G) >> 2) & ((1 << 1) - 1));
    int L = (((0 + G) >> 1) & ((1 << 1) - 1));
    int O = (((0 + G) >> 0) & ((1 << 1) - 1));
    int R = H + L;
    int S = R + O;
    int X = 0;
    int OS = 0;
    int Z = X + 64 * OS; // S0
    int Aq = Z + 0; // S1
    int Ar = (((0 + H) >> 0) & ((1 << 1) - 1));
    int As = (((0 + L) >> 0) & ((1 << 1) - 1));
    int At = (((0 + O) >> 0) & ((1 << 1) - 1));
    int OU = 1;
    int OV = At * OU;
    int OW = OS + OV;
    int OX = 2;
    int OY = As * OX;
    int OZ = OW + OY;
    int P0 = 4;
    int P1 = Ar * P0;
    int P2 = OZ + P1;
    int Av = Aq + 8 * P2; // S1
    int Aw = Av + 0; // S2
    int Ay = atomicAdd(_data_i32_[Aw >> 2], S);
    int AM = Av + 4; // S3
    int AO = atomicAdd(_data_i32_[AM >> 2], H);
}

void main()
{
  fill_c14_00();
}
layout(local_size_x = 8 /* 1, 8 */, local_size_y = 1, local_size_z = 1) in;

[T 04/22/20 20:17:36.808] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:36.808] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:36.808] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:36.808] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:36.808] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:36.808] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:36.808] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:36.808] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:36.809] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:36.809] [program.cpp:finalize@512] Program (0x56273b70b7e0) finalized.
[T 04/22/20 20:17:36.812] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:36.812] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:36.812] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:36.812] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:36.812] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:36.812] [program.cpp:Program@133] Program (0x56273c08bd40) arch=x64 initialized.
[T 04/22/20 20:17:36.813] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:36.925] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:36.925] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 1.91e-05 s
[T 04/22/20 20:17:36.925] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 128 B
[T 04/22/20 20:17:36.933] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:36.933] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:36.943] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:36.943] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.943] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:36.944] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:36.944] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.944] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:36.945] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:36.945] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.945] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:36.946] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:36.946] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.946] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:36.947] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:36.947] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:36.947] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:36.952] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:36.967] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c16_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:37.007] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:37.008] [program.cpp:finalize@512] Program (0x56273c08bd40) finalized.
[T 04/22/20 20:17:37.010] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:37.010] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:37.010] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:37.010] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.010] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.010] [program.cpp:Program@133] Program (0x56273c2e8cc0) arch=cuda initialized.
[T 04/22/20 20:17:37.011] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:37.123] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:37.123] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:37.123] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.123] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.316] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:37.316] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:37.317] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4658699035644531ms
[T 04/22/20 20:17:37.317] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:37.317] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 7.20e-05 s
[T 04/22/20 20:17:37.317] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 128 B
[T 04/22/20 20:17:37.326] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:37.326] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:37.336] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:37.336] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.336] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:37.337] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:37.337] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.337] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:37.338] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:37.338] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.339] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:37.340] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:37.340] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.340] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:37.341] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:37.341] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:37.341] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:37.359] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c18_0...
[T 04/22/20 20:17:37.548] [jit_cuda.cpp:add_module@90] PTX size: 83.82KB
[T 04/22/20 20:17:37.548] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:37.549] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.476837158203125ms
[D 04/22/20 20:17:37.549] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c18_0_kernel_16_range_for<<<896, 64>>>
[T 04/22/20 20:17:37.549] [jit_cuda.cpp:lookup_function@47] Kernel fill_c18_0_kernel_16_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:37.564] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:37.565] [program.cpp:finalize@512] Program (0x56273c2e8cc0) finalized.
[T 04/22/20 20:17:37.573] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:37.573] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:37.573] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:37.573] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.573] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.573] [program.cpp:Program@133] Program (0x56273abbbac0) arch=opengl initialized.
[T 04/22/20 20:17:37.574] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:37.687] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:37.687] [program.cpp:materialize_layout@298] OpenGL root buffer size: 128 B
[T 04/22/20 20:17:37.688] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:37.688] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:37.688] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:37.688] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:37.690] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c20_0...
[D 04/22/20 20:17:37.692] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c20_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };

void fill_c20_00()
{ // range for
  // range known at compile time
  int _tid = int(gl_GlobalInvocationID.x);
  if (_tid >= 16) return;
  int _itv = 0 + _tid * 1;
    int G = _itv;
    int H = (((0 + G) >> 3) & ((1 << 1) - 1));
    int L = (((0 + G) >> 2) & ((1 << 1) - 1));
    int O = (((0 + G) >> 1) & ((1 << 1) - 1));
    int R = (((0 + G) >> 0) & ((1 << 1) - 1));
    int U = H + L;
    int V = U + O;
    int W = V + R;
    int As = 0;
    int Z2 = 0;
    int Au = As + 128 * Z2; // S0
    int Av = Au + 0; // S1
    int Aw = (((0 + H) >> 0) & ((1 << 1) - 1));
    int Ax = (((0 + L) >> 0) & ((1 << 1) - 1));
    int Ay = (((0 + O) >> 0) & ((1 << 1) - 1));
    int Az = (((0 + R) >> 0) & ((1 << 1) - 1));
    int Z4 = 1;
    int Z5 = Az * Z4;
    int Z6 = Z2 + Z5;
    int Z7 = 2;
    int Z8 = Ay * Z7;
    int Z9 = Z6 + Z8;
    int Za = 4;
    int Zb = Ax * Za;
    int Zc = Z9 + Zb;
    int Zd = 8;
    int Ze = Aw * Zd;
    int Zf = Zc + Ze;
    int AB = Av + 8 * Zf; // S1
    int AC = AB + 0; // S2
    int AE = atomicAdd(_data_i32_[AC >> 2], W);
    int AU = AB + 4; // S3
    int AW = atomicAdd(_data_i32_[AU >> 2], H);
}

void main()
{
  fill_c20_00();
}
layout(local_size_x = 16 /* 1, 16 */, local_size_y = 1, local_size_z = 1) in;

[T 04/22/20 20:17:37.692] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:37.692] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:37.692] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:37.692] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:37.693] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:37.693] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:37.693] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:37.693] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:37.693] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:37.694] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:37.696] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:37.696] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:37.696] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:37.696] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.697] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.697] [program.cpp:Program@133] Program (0x56273c08bd40) arch=x64 initialized.
[T 04/22/20 20:17:37.698] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:37.810] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:37.810] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 3.31e-05 s
[T 04/22/20 20:17:37.810] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 256 B
[T 04/22/20 20:17:37.818] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:37.818] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:37.827] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:37.827] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.827] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:37.829] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:37.829] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.829] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:37.830] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:37.830] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.830] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:37.831] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:37.831] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.831] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:37.832] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:37.832] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:37.832] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:37.837] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:37.852] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c22_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:37.894] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:37.894] [program.cpp:finalize@512] Program (0x56273c08bd40) finalized.
[T 04/22/20 20:17:37.896] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:37.896] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:37.896] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:37.897] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.897] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.897] [program.cpp:Program@133] Program (0x56273b13cd50) arch=cuda initialized.
[T 04/22/20 20:17:37.898] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:38.009] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:38.009] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:38.009] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.009] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:38.206] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:38.206] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:38.207] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.43392181396484375ms
[T 04/22/20 20:17:38.207] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:38.207] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 6.70e-05 s
[T 04/22/20 20:17:38.207] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 256 B
[T 04/22/20 20:17:38.215] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:38.215] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:38.225] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:38.225] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.225] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:38.226] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:38.226] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.226] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:38.227] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:38.227] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.227] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:38.228] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:38.228] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.228] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:38.229] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:38.229] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:38.229] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:38.247] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c24_0...
[T 04/22/20 20:17:38.432] [jit_cuda.cpp:add_module@90] PTX size: 83.88KB
[T 04/22/20 20:17:38.433] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:38.433] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4620552062988281ms
[D 04/22/20 20:17:38.433] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c24_0_kernel_22_range_for<<<896, 64>>>
[T 04/22/20 20:17:38.433] [jit_cuda.cpp:lookup_function@47] Kernel fill_c24_0_kernel_22_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:38.448] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:38.448] [program.cpp:finalize@512] Program (0x56273b13cd50) finalized.
[T 04/22/20 20:17:38.457] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:38.457] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:38.457] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:38.457] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.457] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:38.457] [program.cpp:Program@133] Program (0x56273b8e1b00) arch=opengl initialized.
[T 04/22/20 20:17:38.458] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:38.570] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:38.570] [program.cpp:materialize_layout@298] OpenGL root buffer size: 256 B
[T 04/22/20 20:17:38.571] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:38.571] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:38.571] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:38.571] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:38.574] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c26_0...
[D 04/22/20 20:17:38.576] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c26_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };

void fill_c26_00()
{ // range for
  // range known at compile time
  int _tid = int(gl_GlobalInvocationID.x);
  if (_tid >= 32) return;
  int _itv = 0 + _tid * 1;
    int G = _itv;
    int H = (((0 + G) >> 4) & ((1 << 1) - 1));
    int L = (((0 + G) >> 3) & ((1 << 1) - 1));
    int O = (((0 + G) >> 2) & ((1 << 1) - 1));
    int R = (((0 + G) >> 1) & ((1 << 1) - 1));
    int U = (((0 + G) >> 0) & ((1 << 1) - 1));
    int X = H + L;
    int Y = X + O;
    int Z = Y + R;
    int Aq = Z + U;
    int Ax = 0;
    int AAQ = 0;
    int Az = Ax + 256 * AAQ; // S0
    int AA = Az + 0; // S1
    int AB = (((0 + H) >> 0) & ((1 << 1) - 1));
    int AC = (((0 + L) >> 0) & ((1 << 1) - 1));
    int AD = (((0 + O) >> 0) & ((1 << 1) - 1));
    int AE = (((0 + R) >> 0) & ((1 << 1) - 1));
    int AF = (((0 + U) >> 0) & ((1 << 1) - 1));
    int AAS = 1;
    int AAT = AF * AAS;
    int AAU = AAQ + AAT;
    int AAV = 2;
    int AAW = AE * AAV;
    int AAX = AAU + AAW;
    int AAY = 4;
    int AAZ = AD * AAY;
    int AB0 = AAX + AAZ;
    int AB1 = 8;
    int AB2 = AC * AB1;
    int AB3 = AB0 + AB2;
    int AB4 = 16;
    int AB5 = AB * AB4;
    int AB6 = AB3 + AB5;
    int AH = AA + 8 * AB6; // S1
    int AI = AH + 0; // S2
    int AK = atomicAdd(_data_i32_[AI >> 2], Aq);
    int B2 = AH + 4; // S3
    int B4 = atomicAdd(_data_i32_[B2 >> 2], H);
}

void main()
{
  fill_c26_00();
}
layout(local_size_x = 32 /* 1, 32 */, local_size_y = 1, local_size_z = 1) in;

[T 04/22/20 20:17:38.576] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:38.576] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:38.576] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:38.576] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:38.577] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:38.577] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:38.577] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:38.577] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:38.578] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:38.579] [program.cpp:finalize@512] Program (0x56273b8e1b00) finalized.
[T 04/22/20 20:17:38.581] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:38.581] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:38.581] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:38.581] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.581] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:38.581] [program.cpp:Program@133] Program (0x56273b5dbb30) arch=x64 initialized.
[T 04/22/20 20:17:38.582] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:38.692] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:38.692] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 3.29e-05 s
[T 04/22/20 20:17:38.692] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 512 B
[T 04/22/20 20:17:38.701] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:38.701] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:38.710] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:38.710] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.710] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:38.711] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:38.711] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.711] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:38.712] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:38.712] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.712] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:38.713] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:38.713] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.713] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:38.714] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:38.714] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:38.714] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:38.720] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:38.735] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c28_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:38.776] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:38.777] [program.cpp:finalize@512] Program (0x56273b5dbb30) finalized.
[T 04/22/20 20:17:38.779] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:38.779] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:38.779] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:38.779] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.779] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:38.779] [program.cpp:Program@133] Program (0x56273b2d9800) arch=cuda initialized.
[T 04/22/20 20:17:38.780] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:38.892] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:38.892] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:38.892] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.892] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:39.085] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:39.085] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:39.085] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.46515464782714844ms
[T 04/22/20 20:17:39.086] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:39.086] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 8.51e-05 s
[T 04/22/20 20:17:39.086] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 512 B
[T 04/22/20 20:17:39.094] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:39.094] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:39.104] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:39.104] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.104] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:39.105] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:39.105] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.105] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:39.106] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:39.106] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.106] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:39.107] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:39.107] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.107] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:39.108] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:39.108] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:39.108] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:39.127] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c30_0...
[T 04/22/20 20:17:39.317] [jit_cuda.cpp:add_module@90] PTX size: 83.93KB
[T 04/22/20 20:17:39.317] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:39.317] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.47397613525390625ms
[D 04/22/20 20:17:39.317] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c30_0_kernel_28_range_for<<<896, 64>>>
[T 04/22/20 20:17:39.317] [jit_cuda.cpp:lookup_function@47] Kernel fill_c30_0_kernel_28_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:39.334] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:39.335] [program.cpp:finalize@512] Program (0x56273b2d9800) finalized.
[T 04/22/20 20:17:39.355] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:39.355] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:39.355] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:39.355] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:39.355] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:39.355] [program.cpp:Program@133] Program (0x56273b541240) arch=opengl initialized.
[T 04/22/20 20:17:39.357] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:39.473] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:39.473] [program.cpp:materialize_layout@298] OpenGL root buffer size: 512 B
[T 04/22/20 20:17:39.474] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:39.474] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:39.474] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:39.474] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:39.477] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c32_0...
[D 04/22/20 20:17:39.480] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c32_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };

void fill_c32_00()
{ // range for
  // range known at compile time
  int _tid = int(gl_GlobalInvocationID.x);
  if (_tid >= 64) return;
  int _itv = 0 + _tid * 1;
    int G = _itv;
    int H = (((0 + G) >> 5) & ((1 << 1) - 1));
    int L = (((0 + G) >> 4) & ((1 << 1) - 1));
    int O = (((0 + G) >> 3) & ((1 << 1) - 1));
    int R = (((0 + G) >> 2) & ((1 << 1) - 1));
    int U = (((0 + G) >> 1) & ((1 << 1) - 1));
    int X = (((0 + G) >> 0) & ((1 << 1) - 1));
    int Aq = H + L;
    int Ar = Aq + O;
    int As = Ar + R;
    int At = As + U;
    int Au = At + X;
    int AC = 0;
    int AOi = 0;
    int AE = AC + 512 * AOi; // S0
    int AF = AE + 0; // S1
    int AG = (((0 + H) >> 0) & ((1 << 1) - 1));
    int AH = (((0 + L) >> 0) & ((1 << 1) - 1));
    int AI = (((0 + O) >> 0) & ((1 << 1) - 1));
    int AJ = (((0 + R) >> 0) & ((1 << 1) - 1));
    int AK = (((0 + U) >> 0) & ((1 << 1) - 1));
    int AL = (((0 + X) >> 0) & ((1 << 1) - 1));
    int AOk = 1;
    int AOl = AL * AOk;
    int AOm = AOi + AOl;
    int AOn = 2;
    int AOo = AK * AOn;
    int AOp = AOm + AOo;
    int AOq = 4;
    int AOr = AJ * AOq;
    int AOs = AOp + AOr;
    int AOt = 8;
    int AOu = AI * AOt;
    int AOv = AOs + AOu;
    int AOw = 16;
    int AOx = AH * AOw;
    int AOy = AOv + AOx;
    int AOz = 32;
    int AOA = AG * AOz;
    int AOB = AOy + AOA;
    int AN = AF + 8 * AOB; // S1
    int AO = AN + 0; // S2
    int AQ = atomicAdd(_data_i32_[AO >> 2], Au);
    int Ba = AN + 4; // S3
    int Bc = atomicAdd(_data_i32_[Ba >> 2], H);
}

void main()
{
  fill_c32_00();
}
layout(local_size_x = 64 /* 1, 64 */, local_size_y = 1, local_size_z = 1) in;

[T 04/22/20 20:17:39.480] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:39.480] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:39.480] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:39.480] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:39.481] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:39.481] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:39.481] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:39.481] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:39.484] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:39.484] [program.cpp:finalize@512] Program (0x56273b541240) finalized.
[T 04/22/20 20:17:39.487] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:39.487] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:39.487] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:39.487] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:39.487] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:39.487] [program.cpp:Program@133] Program (0x56273b007f00) arch=x64 initialized.
[T 04/22/20 20:17:39.488] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:39.601] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:39.601] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 3.19e-05 s
[T 04/22/20 20:17:39.601] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 1024 B
[T 04/22/20 20:17:39.609] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:39.609] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:39.619] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:39.619] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.619] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:39.621] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:39.621] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.621] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:39.622] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:39.622] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.622] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:39.623] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:39.623] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.623] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:39.624] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:39.624] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:39.624] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:39.629] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:39.646] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c34_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:39.689] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:39.689] [program.cpp:finalize@512] Program (0x56273b007f00) finalized.
[T 04/22/20 20:17:39.692] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:39.692] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:39.692] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:39.692] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:39.693] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:39.693] [program.cpp:Program@133] Program (0x56273c07b960) arch=cuda initialized.
[T 04/22/20 20:17:39.694] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:39.807] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:39.807] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:39.807] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:39.807] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.002] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:40.002] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:40.002] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.5109310150146484ms
[T 04/22/20 20:17:40.003] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:40.003] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 8.61e-05 s
[T 04/22/20 20:17:40.003] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 1024 B
[T 04/22/20 20:17:40.011] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:40.011] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:40.021] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:40.021] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.021] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:40.022] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:40.022] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.022] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:40.023] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:40.023] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.023] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:40.024] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:40.024] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.024] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:40.025] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:40.025] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:40.025] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:40.046] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c36_0...
[T 04/22/20 20:17:40.233] [jit_cuda.cpp:add_module@90] PTX size: 83.99KB
[T 04/22/20 20:17:40.233] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:40.234] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.5099773406982422ms
[D 04/22/20 20:17:40.234] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c36_0_kernel_34_range_for<<<896, 64>>>
[T 04/22/20 20:17:40.234] [jit_cuda.cpp:lookup_function@47] Kernel fill_c36_0_kernel_34_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:40.251] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:40.252] [program.cpp:finalize@512] Program (0x56273c07b960) finalized.
[T 04/22/20 20:17:40.261] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:40.261] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:40.262] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:40.262] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:40.262] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.262] [program.cpp:Program@133] Program (0x56273abbbac0) arch=opengl initialized.
[T 04/22/20 20:17:40.263] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:40.377] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:40.377] [program.cpp:materialize_layout@298] OpenGL root buffer size: 1024 B
[T 04/22/20 20:17:40.378] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:40.378] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:40.378] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:40.378] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:40.384] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c38_0...
[D 04/22/20 20:17:40.388] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c38_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };

void fill_c38_00()
{ // range for
  // range known at compile time
  int _tid = int(gl_GlobalInvocationID.x);
  if (_tid >= 128) return;
  int _itv = 0 + _tid * 1;
    int G = _itv;
    int H = (((0 + G) >> 6) & ((1 << 1) - 1));
    int L = (((0 + G) >> 5) & ((1 << 1) - 1));
    int O = (((0 + G) >> 4) & ((1 << 1) - 1));
    int R = (((0 + G) >> 3) & ((1 << 1) - 1));
    int U = (((0 + G) >> 2) & ((1 << 1) - 1));
    int X = (((0 + G) >> 1) & ((1 << 1) - 1));
    int Aq = (((0 + G) >> 0) & ((1 << 1) - 1));
    int At = H + L;
    int Au = At + O;
    int Av = Au + R;
    int Aw = Av + U;
    int Ax = Aw + X;
    int Ay = Ax + Aq;
    int AH = 0;
    int B3o = 0;
    int AJ = AH + 1024 * B3o; // S0
    int AK = AJ + 0; // S1
    int AL = (((0 + H) >> 0) & ((1 << 1) - 1));
    int AM = (((0 + L) >> 0) & ((1 << 1) - 1));
    int AN = (((0 + O) >> 0) & ((1 << 1) - 1));
    int AO = (((0 + R) >> 0) & ((1 << 1) - 1));
    int AP = (((0 + U) >> 0) & ((1 << 1) - 1));
    int AQ = (((0 + X) >> 0) & ((1 << 1) - 1));
    int AR = (((0 + Aq) >> 0) & ((1 << 1) - 1));
    int B3q = 1;
    int B3r = AR * B3q;
    int B3s = B3o + B3r;
    int B3t = 2;
    int B3u = AQ * B3t;
    int B3v = B3s + B3u;
    int B3w = 4;
    int B3x = AP * B3w;
    int B3y = B3v + B3x;
    int B3z = 8;
    int B3A = AO * B3z;
    int B3B = B3y + B3A;
    int B3C = 16;
    int B3D = AN * B3C;
    int B3E = B3B + B3D;
    int B3F = 32;
    int B3G = AM * B3F;
    int B3H = B3E + B3G;
    int B3I = 64;
    int B3J = AL * B3I;
    int B3K = B3H + B3J;
    int AT = AK + 8 * B3K; // S1
    int AU = AT + 0; // S2
    int AW = atomicAdd(_data_i32_[AU >> 2], Ay);
    int Bi = AT + 4; // S3
    int Bk = atomicAdd(_data_i32_[Bi >> 2], H);
}

void main()
{
  fill_c38_00();
}
layout(local_size_x = 128 /* 1, 128 */, local_size_y = 1, local_size_z = 1) in;

[T 04/22/20 20:17:40.388] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:40.388] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:40.388] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:40.388] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:40.389] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:40.389] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:40.389] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:40.389] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:40.397] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:40.398] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:40.401] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:40.401] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:40.401] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:40.401] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:40.401] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.401] [program.cpp:Program@133] Program (0x56273b13f180) arch=x64 initialized.
[T 04/22/20 20:17:40.402] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:40.516] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:40.516] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 2.10e-05 s
[T 04/22/20 20:17:40.516] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 2048 B
[T 04/22/20 20:17:40.524] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:40.524] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:40.534] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:40.534] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.534] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:40.535] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:40.535] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.535] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:40.536] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:40.536] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.536] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:40.537] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:40.537] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.537] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:40.538] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:40.538] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:40.538] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:40.544] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:40.562] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c40_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:40.607] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:40.608] [program.cpp:finalize@512] Program (0x56273b13f180) finalized.
[T 04/22/20 20:17:40.610] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:40.610] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:40.610] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:40.610] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:40.610] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.610] [program.cpp:Program@133] Program (0x56273c6e92b0) arch=cuda initialized.
[T 04/22/20 20:17:40.611] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:40.723] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:40.723] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:40.723] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:40.723] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.922] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:40.922] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:40.923] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4589557647705078ms
[T 04/22/20 20:17:40.923] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:40.923] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 7.30e-05 s
[T 04/22/20 20:17:40.923] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 2048 B
[T 04/22/20 20:17:40.931] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:40.931] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:40.941] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:40.941] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.941] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:40.942] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:40.942] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.942] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:40.943] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:40.943] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.943] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:40.944] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:40.944] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.944] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:40.945] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:40.945] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:40.945] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:40.966] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c42_0...
[T 04/22/20 20:17:41.153] [jit_cuda.cpp:add_module@90] PTX size: 84.04KB
[T 04/22/20 20:17:41.153] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:41.153] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.45800209045410156ms
[D 04/22/20 20:17:41.154] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c42_0_kernel_40_range_for<<<896, 64>>>
[T 04/22/20 20:17:41.154] [jit_cuda.cpp:lookup_function@47] Kernel fill_c42_0_kernel_40_range_for compilation time: 0.014781951904296875ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:41.171] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:41.172] [program.cpp:finalize@512] Program (0x56273c6e92b0) finalized.
[T 04/22/20 20:17:41.181] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:41.181] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:41.181] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:41.181] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:41.181] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:41.181] [program.cpp:Program@133] Program (0x56273c85ee90) arch=opengl initialized.
[T 04/22/20 20:17:41.182] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:41.298] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:41.298] [program.cpp:materialize_layout@298] OpenGL root buffer size: 2048 B
[T 04/22/20 20:17:41.299] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:41.299] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:41.299] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:41.299] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:41.314] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c44_0...
[D 04/22/20 20:17:41.318] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c44_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };

void fill_c44_00()
{ // range for
  // range known at compile time
  int _tid = int(gl_GlobalInvocationID.x);
  if (_tid >= 256) return;
  int _itv = 0 + _tid * 1;
    int G = _itv;
    int H = (((0 + G) >> 7) & ((1 << 1) - 1));
    int L = (((0 + G) >> 6) & ((1 << 1) - 1));
    int O = (((0 + G) >> 5) & ((1 << 1) - 1));
    int R = (((0 + G) >> 4) & ((1 << 1) - 1));
    int U = (((0 + G) >> 3) & ((1 << 1) - 1));
    int X = (((0 + G) >> 2) & ((1 << 1) - 1));
    int Aq = (((0 + G) >> 1) & ((1 << 1) - 1));
    int At = (((0 + G) >> 0) & ((1 << 1) - 1));
    int Aw = H + L;
    int Ax = Aw + O;
    int Ay = Ax + R;
    int Az = Ay + U;
    int AA = Az + X;
    int AB = AA + Aq;
    int AC = AB + At;
    int AM = 0;
    int Bk8 = 0;
    int AO = AM + 2048 * Bk8; // S0
    int AP = AO + 0; // S1
    int AQ = (((0 + H) >> 0) & ((1 << 1) - 1));
    int AR = (((0 + L) >> 0) & ((1 << 1) - 1));
    int AS = (((0 + O) >> 0) & ((1 << 1) - 1));
    int AT = (((0 + R) >> 0) & ((1 << 1) - 1));
    int AU = (((0 + U) >> 0) & ((1 << 1) - 1));
    int AV = (((0 + X) >> 0) & ((1 << 1) - 1));
    int AW = (((0 + Aq) >> 0) & ((1 << 1) - 1));
    int AX = (((0 + At) >> 0) & ((1 << 1) - 1));
    int Bka = 1;
    int Bkb = AX * Bka;
    int Bkc = Bk8 + Bkb;
    int Bkd = 2;
    int Bke = AW * Bkd;
    int Bkf = Bkc + Bke;
    int Bkg = 4;
    int Bkh = AV * Bkg;
    int Bki = Bkf + Bkh;
    int Bkj = 8;
    int Bkk = AU * Bkj;
    int Bkl = Bki + Bkk;
    int Bkm = 16;
    int Bkn = AT * Bkm;
    int Bko = Bkl + Bkn;
    int Bkp = 32;
    int Bkq = AS * Bkp;
    int Bkr = Bko + Bkq;
    int Bks = 64;
    int Bkt = AR * Bks;
    int Bku = Bkr + Bkt;
    int Bkv = 128;
    int Bkw = AQ * Bkv;
    int Bkx = Bku + Bkw;
    int AZ = AP + 8 * Bkx; // S1
    int B0 = AZ + 0; // S2
    int B2 = atomicAdd(_data_i32_[B0 >> 2], AC);
    int Bq = AZ + 4; // S3
    int Bs = atomicAdd(_data_i32_[Bq >> 2], H);
}

void main()
{
  fill_c44_00();
}
layout(local_size_x = 256 /* 1, 256 */, local_size_y = 1, local_size_z = 1) in;

[T 04/22/20 20:17:41.318] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:41.318] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:41.318] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:41.318] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:41.319] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:41.319] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:41.319] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:41.319] [opengl_api.cpp:link@129] glLinkProgram OUT
.

==================================================== 1 passed in 7.31s =====================================================
[T 04/22/20 20:17:41.374] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:41.375] [program.cpp:finalize@512] Program (0x56273c85ee90) finalized.

Looks like the only difference is the final two lines (the "Program finalizing..." and "finalized" messages printed after the pytest summary). I've never seen them on Windows.

Anyway, as long as they are mismatched in the same way with/without advanced optimization, we can still get a plausible comparison...
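
A side note on the OpenGL kernel source printed in the log above: in fill_c38_00 the index arithmetic appears to split the loop index G into 7 bits, mask each bit again, and then re-linearize the bits back into a single offset. Below is a minimal Python sketch of that arithmetic, assuming I am reading the GLSL correctly. The function name shader_indices and its return layout are made up for illustration and are not part of Taichi.

```python
def shader_indices(g):
    """Replay the integer arithmetic of fill_c38_00 for one loop index g (0 <= g < 128)."""
    shifts = range(6, -1, -1)
    # First extraction (H, L, O, R, U, X, Aq): the 7 bits of g, most significant first.
    bits = [(g >> s) & 1 for s in shifts]
    # Second extraction (AL..AR): masks a single bit again, which changes nothing.
    again = [(b >> 0) & 1 for b in bits]
    # Re-linearization (B3K): weights 64, 32, ..., 1.
    linear = sum(b << s for b, s in zip(again, shifts))
    byte_offset = 8 * linear  # AT, with AH = 0 and B3o = 0
    # Returns (B3K, Ay, index used by the first atomicAdd, index used by the second).
    return linear, sum(bits), byte_offset >> 2, (byte_offset + 4) >> 2


# Over the whole launch range the re-linearized offset equals the loop index itself.
for g in range(128):
    linear, popcount, slot_a, slot_b = shader_indices(g)
    assert linear == g
    assert popcount == bin(g).count("1")
    assert (slot_a, slot_b) == (2 * g, 2 * g + 1)
```

If this reading is right, the whole extract/re-linearize chain folds down to G for this layout, so it gives a simple reference to compare against when checking what the optimized IR computes.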

@yuanming-hu I found an issue when doing CSE for global pointers:
Case: test_ad_if_parallel_complex
Before (good):

[I 06/30/20 20:38:44.108] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Simplified I:
kernel {
  <f32 x1> $0 = const [1.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32 x1> $7 = alloca
    <f32 x1> $8 : local store [$7 <- $3]
    <f32*x1> $9 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $10 = global load $9
    <i32 x1> $11 = cmp_gt $10 $3
    <i32 x1> $12 = bit_and $11 $1
    $13 : if $12 {
      <f32*x1> $14 = global ptr [S2place_f32], index [$6] activate=true
      <f32 x1> $15 = global load $14
      <f32 x1> $16 = div $0 $15
      <f32 x1> $17 : local store [$7 <- $16]
    }
    <f32 x1> $18 = local load [ [$7[0]]]
    <f32*x1> $19 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $20 : global store [$19 <- $18]
  }
}
[I 06/30/20 20:38:44.110] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Gradient:
kernel {
  <f32 x1> $0 = const [1.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <f32 x1> $6 = alloca
    <f32 x1> $7 = alloca
    <f32 x1> $8 = alloca
    <f32 x1> $9 = alloca
    <i32 x1> $10 = loop $5 index 0
    <f32 x1> $11 = stack alloc (max_size=16)
    <f32 x1> $12 : stack push $11, val = $3
    <f32*x1> $13 = global ptr [S2place_f32], index [$10] activate=true
    <f32 x1> $14 = global load $13
    <i32 x1> $15 = cmp_gt $14 $3
    <i32 x1> $16 = bit_and $15 $1
    $17 : if $16 {
      <f32*x1> $18 = global ptr [S2place_f32], index [$10] activate=true
      <f32 x1> $19 = global load $18
      <f32 x1> $20 : local store [$6 <- $19]
      <f32 x1> $21 = div $0 $19
      <f32 x1> $22 : stack push $11, val = $21
    }
    <f32*x1> $23 = global ptr [S4place_f32], index [$10] activate=true
    <f32*x1> $24 = global ptr [S6place_f32], index [$10] activate=true
    <f32 x1> $25 = global load $24
    <f32 x1> $26 : stack acc adj $11, val = $25
    $27 : if $16 {
      <f32 x1> $28 = stack load top adj $11
      <f32 x1> $29 = local load [ [$9[0]]]
      <f32 x1> $30 = add $29 $28
      <f32 x1> $31 : local store [$9 <- $30]
      <f32 x1> $32 : stack pop $11
      <f32 x1> $33 = local load [ [$6[0]]]
      <f32 x1> $34 = div $30 $33
      <f32 x1> $35 = local load [ [$8[0]]]
      <f32 x1> $36 = add $35 $34
      <f32 x1> $37 : local store [$8 <- $36]
      <f32 x1> $38 = mul $33 $33
      <f32 x1> $39 = div $30 $38
      <f32 x1> $40 = neg $39
      <f32 x1> $41 = local load [ [$7[0]]]
      <f32 x1> $42 = add $41 $40
      <f32 x1> $43 : local store [$7 <- $42]
      <f32*x1> $44 = global ptr [S5place_f32], index [$10] activate=true
      <f32 x1> $45 = atomic add($44, $42)
    }
    <f32*x1> $46 = global ptr [S5place_f32], index [$10] activate=true
    <f32 x1> $47 = atomic add($46, $3)
    <f32 x1> $48 : stack pop $11
  }
}
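
To make the difference concrete, here is a small Python sketch of the two gradient shapes. It reflects only my reading of the dumps: in the "good" gradient above, the division by x * x stays inside the if, while in the "bad" dumps that follow, the if has become a select and the division runs unconditionally. The helper names (ieee_div, grad_guarded, grad_select) are hypothetical, and ieee_div only emulates IEEE-754 division because plain Python raises on a zero divisor.

```python
import math


def ieee_div(a, b):
    """Float division following IEEE-754 for a zero divisor (Python would raise)."""
    if b == 0.0:
        if a == 0.0 or math.isnan(a):
            return math.nan
        return math.copysign(math.inf, a) * math.copysign(1.0, b)
    return a / b


def grad_guarded(x, dy):
    """Shape of the 'good' gradient: the division only happens when x > 0."""
    dx = 0.0
    if x > 0.0:
        dx += -ieee_div(dy, x * x)
    return dx


def grad_select(x, dy):
    """Shape of the 'bad' gradient: select first, then an unconditional division."""
    adj = dy if x > 0.0 else 0.0   # select(cond, dy, 0.0)
    return -ieee_div(adj, x * x)   # neg(div(adj, x * x))


# Both forms agree away from zero, but only the guarded one stays finite at x == 0.
assert grad_guarded(2.0, 1.0) == grad_select(2.0, 1.0) == -0.25
assert grad_guarded(0.0, 1.0) == 0.0
assert math.isnan(grad_select(0.0, 1.0))
```

So if the test feeds an input with x == 0, the select form would accumulate NaN into the gradient where the guarded form adds nothing. This is an interpretation of the IR, not something I have confirmed by stepping through the test.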

After (bad, with some debug output in full_simplify()):

[I 06/30/20 20:43:33.360] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Simplified I:
kernel {
  <f32 x1> $0 = const [1.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $3
    <i32 x1> $10 = bit_and $9 $1
    <f32 x1> $11 = div $0 $8
    <f32 x1> $12 = select($10, $11, $3)
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $14 : global store [$13 <- $12]
  }
}
before simplify
kernel {
  <f32 x1> $0 = const [1.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $3
    <i32 x1> $10 = bit_and $9 $1
    <f32 x1> $11 = div $0 $8
    <f32 x1> $12 = select($10, $11, $3)
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $14 : global store [$13 <- $12]
  }
}
after simplify
kernel {
  <f32 x1> $0 = const [1.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $3
    <i32 x1> $10 = bit_and $9 $1
    <f32 x1> $11 = div $0 $8
    <f32 x1> $12 = select($10, $11, $3)
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $14 : global store [$13 <- $12]
  }
}
after cse
kernel {
  <f32 x1> $0 = const [1.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $3
    <i32 x1> $10 = bit_and $9 $1
    <f32 x1> $11 = div $0 $8
    <f32 x1> $12 = select($10, $11, $3)
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $14 : global store [$13 <- $12]
  }
}
before simplify
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <f32 x1> $229 = alloca
    <f32 x1> $220 = alloca
    <f32 x1> $214 = alloca
    <f32 x1> $208 = alloca
    <f32 x1> $201 = alloca
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $3
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $202 = local load [ [$201[0]]]
    <f32 x1> $203 = add $202 $200
    <f32 x1> $204 : local store [$201 <- $203]
    <f32 x1> $206 = local load [ [$201[0]]]
    <f32 x1> $207 = select($10, $206, $205)
    <f32 x1> $209 = local load [ [$208[0]]]
    <f32 x1> $210 = add $209 $207
    <f32 x1> $211 : local store [$208 <- $210]
    <f32 x1> $212 = local load [ [$201[0]]]
    <f32 x1> $213 = select($10, $205, $212)
    <f32 x1> $215 = local load [ [$214[0]]]
    <f32 x1> $216 = add $215 $213
    <f32 x1> $217 : local store [$214 <- $216]
    <f32 x1> $218 = local load [ [$208[0]]]
    <f32 x1> $219 = div $218 $8
    <f32 x1> $221 = local load [ [$220[0]]]
    <f32 x1> $222 = add $221 $219
    <f32 x1> $223 : local store [$220 <- $222]
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $225 = local load [ [$208[0]]]
    <f32 x1> $227 = div $225 $224
    <f32 x1> $228 = neg $227
    <f32 x1> $230 = local load [ [$229[0]]]
    <f32 x1> $231 = add $230 $228
    <f32 x1> $232 : local store [$229 <- $231]
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $234 = local load [ [$229[0]]]
    <f32 x1> $235 = atomic add($233, $234)
  }
}
after simplify
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $3
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $236 = const [0.0]
    <f32 x1> $203 = add $236 $200
    <f32 x1> $207 = select($10, $203, $205)
    <f32 x1> $237 = const [0.0]
    <f32 x1> $210 = add $237 $207
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $210 $224
    <f32 x1> $228 = neg $227
    <f32 x1> $240 = const [0.0]
    <f32 x1> $231 = add $240 $228
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $231)
  }
}
after cse
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $205
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $203 = add $205 $200
    <f32 x1> $207 = select($10, $203, $205)
    <f32 x1> $210 = add $205 $207
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $210 $224
    <f32 x1> $228 = neg $227
    <f32 x1> $231 = add $205 $228
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $231)
  }
}
before simplify
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $205
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $207 = select($10, $200, $205)
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $207 $224
    <f32 x1> $228 = neg $227
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $228)
  }
}
after simplify
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $205
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $207 = select($10, $200, $205)
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $207 $224
    <f32 x1> $228 = neg $227
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $228)
  }
}
after cse
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $205
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $207 = select($10, $200, $205)
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $207 $224
    <f32 x1> $228 = neg $227
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $228)
  }
}
before simplify
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $205
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $207 = select($10, $200, $205)
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $207 $224
    <f32 x1> $228 = neg $227
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $228)
  }
}
after simplify
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $205
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $207 = select($10, $200, $205)
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $207 $224
    <f32 x1> $228 = neg $227
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $228)
  }
}
after cse
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $205
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $207 = select($10, $200, $205)
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $207 $224
    <f32 x1> $228 = neg $227
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $228)
  }
}
[I 06/30/20 20:43:33.386] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Gradient:
kernel {
  <f32 x1> $0 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $3 = const [2]
  $4 : for in range($2, $3) (vectorize 1) block_dim=adaptive {
    <i32 x1> $5 = loop $4 index 0
    <f32*x1> $6 = global ptr [S2place_f32], index [$5] activate=true
    <f32 x1> $7 = global load $6
    <i32 x1> $8 = cmp_gt $7 $0
    <i32 x1> $9 = bit_and $8 $1
    <f32*x1> $10 = global ptr [S4place_f32], index [$5] activate=true
    <f32*x1> $11 = global ptr [S6place_f32], index [$5] activate=true
    <f32 x1> $12 = global load $11
    <f32 x1> $13 = select($9, $12, $0)
    <f32 x1> $14 = mul $7 $7
    <f32 x1> $15 = div $13 $14
    <f32 x1> $16 = neg $15
    <f32*x1> $17 = global ptr [S5place_f32], index [$5] activate=true
    <f32 x1> $18 = atomic add($17, $16)
  }
}

I think that although the IRs at Simplified I are different, both look fine. After Gradient, however, the latter becomes wrong.

Final IR:
Good:

[I 06/30/20 20:38:44.166] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Simplified III:
kernel {
  $0 = offloaded range_for(0, 2) block_dim=adaptive
  body {
    <f32 x1> $1 = const [1.0]
    <f32 x1> $2 = alloca
    <i32 x1> $3 = loop $0 index 0
    <f32 x1> $4 = stack alloc (max_size=16)
    <f32 x1> $5 = const [0.0]
    <f32 x1> $6 : stack push $4, val = $5
    <gen*x1> $7 = get root
    <i32 x1> $8 = const [0]
    <gen*x1> $9 = [S0root][root]::lookup($7, $8) activate = false
    <gen*x1> $10 = get child [S0root->S1dense] $9
    <i32 x1> $11 = const [1]
    <gen*x1> $12 = [S1dense][dense]::lookup($10, $3) activate = false
    <f32*x1> $13 = get child [S1dense->S2place_f32] $12
    <f32 x1> $14 = global load $13
    <i32 x1> $15 = cmp_gt $14 $5
    <i32 x1> $16 = bit_and $15 $11
    $17 : if $16 {
      <f32 x1> $18 = global load $13
      <f32 x1> $19 : local store [$2 <- $18]
      <f32 x1> $20 = div $1 $18
      <f32 x1> $21 : stack push $4, val = $20
    }
    <gen*x1> $22 = get child [S0root->S3dense] $9
    <gen*x1> $23 = [S3dense][dense]::lookup($22, $3) activate = false
    <f32*x1> $24 = get child [S3dense->S6place_f32] $23
    <f32 x1> $25 = global load $24
    <f32 x1> $26 : stack acc adj $4, val = $25
    <f32 x1> $27 = stack load top adj $4
    <f32 x1> $28 = local load [ [$2[0]]]
    <f32 x1> $29 = mul $28 $28   <--- probably 0*0
    <f32 x1> $30 = div $27 $29   <--- nan
    <f32 x1> $31 = neg $30   <--- nan
    <f32*x1> $32 = get child [S1dense->S5place_f32] $12
    <f32 x1> $33 = global load $32
    <f32 x1> $34 = add $33 $31   <--- nan
    $35 : if $16 {   <--- good!
      <f32*x1> $36 : global store [$32 <- $34]
    }
    <f32 x1> $37 = global load $32
    <f32 x1> $38 : global store [$32 <- $37]
  }
}

Bad (nan):

[I 06/30/20 20:43:33.481] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Simplified III:
kernel {
  $0 = offloaded range_for(0, 2) block_dim=adaptive
  body {
    <i32 x1> $1 = loop $0 index 0
    <gen*x1> $2 = get root
    <i32 x1> $3 = const [0]
    <gen*x1> $4 = [S0root][root]::lookup($2, $3) activate = false
    <gen*x1> $5 = get child [S0root->S1dense] $4
    <i32 x1> $6 = const [1]
    <gen*x1> $7 = [S1dense][dense]::lookup($5, $1) activate = false
    <f32*x1> $8 = get child [S1dense->S2place_f32] $7
    <f32 x1> $9 = global load $8
    <f32 x1> $10 = const [0.0]
    <i32 x1> $11 = cmp_gt $9 $10
    <i32 x1> $12 = bit_and $11 $6
    <gen*x1> $13 = get child [S0root->S3dense] $4
    <gen*x1> $14 = [S3dense][dense]::lookup($13, $1) activate = false
    <f32*x1> $15 = get child [S3dense->S6place_f32] $14
    <f32 x1> $16 = global load $15
    <f32 x1> $17 = select($12, $16, $10)
    <f32 x1> $18 = mul $9 $9   <--- probably 0*0
    <f32 x1> $19 = div $17 $18  <--- nan
    <f32 x1> $20 = neg $19   <--- nan
    <f32*x1> $21 = get child [S1dense->S5place_f32] $7
    <f32 x1> $22 = global load $21
    <f32 x1> $23 = add $22 $20   <--- nan
    <f32 x1> $24 : global store [$21 <- $23]   <--- bad
  }
}
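
To make the difference between the two final IRs concrete, here is a minimal Python sketch of what each one appears to compute per loop index. It is only a model read off the dumps above, not Taichi code: the function names `grad_good` and `grad_bad`, the helper `ieee_div`, and the scalar arguments are all hypothetical, and `ieee_div` exists only because plain Python raises on division by zero where the generated code would produce inf or nan.

```python
import math


def ieee_div(a: float, b: float) -> float:
    """Divide like IEEE 754 hardware: return inf/nan instead of raising."""
    if b == 0.0:
        if a == 0.0 or math.isnan(a):
            return math.nan
        return math.copysign(math.inf, a) * math.copysign(1.0, b)
    return a / b


def grad_good(x: float, adj_y: float, grad_x: float) -> float:
    """Model of the 'Good' IR: the value may be inf/nan, but the store is guarded."""
    cond = x > 0.0
    # The alloca is zero-initialised and only written inside the branch.
    x_local = x if cond else 0.0
    contrib = -ieee_div(adj_y, x_local * x_local)
    if cond:  # corresponds to the guarded global store
        grad_x = grad_x + contrib
    return grad_x


def grad_bad(x: float, adj_y: float, grad_x: float) -> float:
    """Model of the 'Bad' IR: only the numerator is selected, the store is unconditional."""
    cond = x > 0.0
    numerator = adj_y if cond else 0.0  # corresponds to the select
    contrib = -ieee_div(numerator, x * x)  # 0/0 when x == 0
    return grad_x + contrib  # corresponds to the unguarded global store


if __name__ == "__main__":
    for x in (2.0, 0.0, -1.0):
        print(x, grad_good(x, 1.0, 0.0), grad_bad(x, 1.0, 0.0))
```

In this model the two versions agree whenever `x` is non-zero and diverge only at `x == 0`, where the select zeroes the numerator but the division still yields nan and nothing stops it from being written back.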