Environment: Ubuntu 18.04 + ROCm 2.2 + TVM (built from current master with ROCM = ON)
I verified that the TVM library detects and links against ROCm, and the tuning procedure itself runs. However, executing tvm.build(s, arg_bufs, 'rocm', name='matmul') fails with the following error:
WARNING:autotvm:Too many errors happen in the tuning. Now is in debug mode
Finish loading 500 records
DEBUG:autotvm:Finish loading 500 records
Cannot find config for target=rocm, workload=('tvm_matmul_tune_op', 4, 256, 256). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=rocm, workload=('tvm_matmul_tune_op', 4, 256, 256). A fallback configuration is used, which may bring great performance regression.
Best config:
,None,None
[14:47:54] /host/docker/matmul_tvm/tvm/src/pass/vectorize_loop.cc:362: Detect vector condition in Vectorized Loop, scalarizing...
[14:47:54] /host/docker/matmul_tvm/tvm/src/pass/vectorize_loop.cc:362: Detect vector condition in Vectorized Loop, scalarizing...
Traceback (most recent call last):
File "matmul_autotvm.py", line 260, in <module>
search_matmul_config(4, 256, 256, 500) # m, k, n, num_trials
File "matmul_autotvm.py", line 165, in search_matmul_config
func = tvm.build(s, arg_bufs, 'rocm', name='matmul')
File "/host/docker/matmul_tvm/tvm/python/tvm/build_module.py", line 617, in build
fhost, mdev = _build_for_device(flist, tar, target_host)
File "/host/docker/matmul_tvm/tvm/python/tvm/build_module.py", line 484, in _build_for_device
mdev = codegen.build_module(fdevice, str(target)) if fdevice else None
File "/host/docker/matmul_tvm/tvm/python/tvm/codegen.py", line 36, in build_module
return _Build(lowered_func, target)
File "/host/docker/matmul_tvm/tvm/python/tvm/_ffi/_ctypes/function.py", line 206, in __call__
raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
[bt] (2) /host/docker/matmul_tvm/tvm/build_rocm/libtvm.so(TVMFuncCall+0x61) [0x7f9598de3f01]
[bt] (1) /host/docker/matmul_tvm/tvm/build_rocm/libtvm.so(+0x14b2e9) [0x7f95986992e9]
[bt] (0) /host/docker/matmul_tvm/tvm/build_rocm/libtvm.so(+0x231aaa) [0x7f959877faaa]
File "/host/docker/matmul_tvm/tvm/src/codegen/codegen.cc", line 46
TVMError: Check failed: bf != nullptr: Target rocm is not enabled
All target TVM libraries link to ROCm successfully:
/host/docker/matmul_tvm/tvm/build# ldd libtvm.so
linux-vdso.so.1 (0x00007fff995ed000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fc8f0e65000)
libhip_hcc.so => /opt/rocm/lib/libhip_hcc.so (0x00007fc8f0560000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fc8f01d7000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fc8efe39000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fc8efc21000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc8ef830000)
/lib64/ld-linux-x86-64.so.2 (0x00007fc8f1d7f000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc8ef611000)
libhc_am.so => /opt/rocm/lib/libhc_am.so (0x00007fc8ef3b1000)
libmcwamp.so => /opt/rocm/lib/libmcwamp.so (0x00007fc8ef197000)
libhsa-runtime64.so.1 => /opt/rocm/hsa/lib/libhsa-runtime64.so.1 (0x00007fc8eeee0000)
libhsakmt.so.1 => /opt/rocm/lib/libhsakmt.so.1 (0x00007fc8eecbc000)
libelf.so.1 => /usr/lib/x86_64-linux-gnu/libelf.so.1 (0x00007fc8eeaa2000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fc8ee89a000)
libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007fc8ee68f000)
libpci.so.3 => /lib/x86_64-linux-gnu/libpci.so.3 (0x00007fc8ee482000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fc8ee265000)
libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007fc8ee04a000)
libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007fc8ede2c000)
I printed all the registered providers:
For TVM built with CUDA on, the TVM runtime detects the following providers:
[codegen.build_aocl] [codegen.build_opengl] [codegen.build_cuda] [codegen.build_metal] [codegen.build_opencl] [codegen.build_aocl_sw_emu] [codegen.build_sdaccel] [codegen.build_c] [codegen.build_stackvm]
and no errors for the whole tuning procedure.
For TVM built with OpenCL on, the TVM runtime detects the following providers:
[codegen.build_aocl] [codegen.build_opengl] [codegen.build_metal] [codegen.build_opencl] [codegen.build_aocl_sw_emu] [codegen.build_sdaccel] [codegen.build_c] [codegen.build_stackvm]
and no errors for the whole tuning procedure.
For TVM built with ROCm on, the TVM runtime detects the following providers:
[codegen.build_aocl] [codegen.build_opengl] [codegen.build_metal] [codegen.build_opencl] [codegen.build_aocl_sw_emu] [codegen.build_sdaccel] [codegen.build_c] [codegen.build_stackvm]
So it's clear that codegen.build_rocm is not detected.
Have you enabled LLVM? The rocm target won't be enabled if USE_LLVM is off (even if USE_ROCM is on)
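The dependency between the two flags can be checked before rebuilding. The following is a minimal sketch, assuming config.cmake contains lines of the form set(USE_ROCM ON) and set(USE_LLVM ON); the helper names parse_cmake_flags and rocm_codegen_expected are hypothetical and not part of TVM.

```python
import re

# Hypothetical helpers, not part of TVM: inspect the text of config.cmake.
_SET_RE = re.compile(r"^\s*set\(\s*(USE_\w+)\s+([^\s)]+)\s*\)", re.MULTILINE)
_OFF_VALUES = {"OFF", "0", "FALSE", "NO"}


def parse_cmake_flags(text):
    """Return {flag_name: enabled} for every uncommented set(USE_* value) line."""
    flags = {}
    for name, value in _SET_RE.findall(text):
        # Any value other than an explicit "off" (e.g. ON or a llvm-config path) counts as enabled.
        flags[name] = value.strip('"').upper() not in _OFF_VALUES
    return flags


def rocm_codegen_expected(text):
    """The rocm target is only expected when both USE_ROCM and USE_LLVM are enabled."""
    flags = parse_cmake_flags(text)
    return flags.get("USE_ROCM", False) and flags.get("USE_LLVM", False)


sample = "set(USE_ROCM ON)\nset(USE_LLVM OFF)\n"
print(rocm_codegen_expected(sample))  # False
```

To check a real build, pass the contents of build/config.cmake, for example rocm_codegen_expected(open("config.cmake").read()).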
@masahi After enabling LLVM, both codegen.build_rocm and codegen.build_llvm are detected in the providers list. However, the failure changed from the exception above to a segmentation fault:
(tuning..)
(tuning..)
(tuning..)
(tuning..)
'gfx906' is not a recognized processor for this target (ignoring processor)
'gfx906' is not a recognized processor for this target (ignoring processor)
No: 97 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1555835059.0408077) [('tile_k', [2, 4, 512]), ('tile_y', [32, 16, 2, 4]), ('tile_x', [4, 16, 1, 64]), ('auto_unroll_max_step', 1500)],,None,34382977
No: 98 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1555835059.1273813) [('tile_k', [64, 64, 1]), ('tile_y', [64, 4, 4, 4]), ('tile_x', [16, 1, 1, 256]), ('auto_unroll_max_step', 512)],,None,17407578
No: 99 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1555835059.1275487) [('tile_k', [32, 1, 128]), ('tile_y', [1, 8, 8, 64]), ('tile_x', [1, 16, 4, 64]), ('auto_unroll_max_step', 1500)],,None,34940157
No: 100 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(TimeoutError(),), error_no=6, all_cost=10, timestamp=1555835059.12769) [('tile_k', [64, 8, 8]), ('tile_y', [1, 2048, 2, 1]), ('tile_x', [4, 16, 1, 64]), ('auto_unroll_max_step', 512)],,None,15529098
Finish loading 668 records
Cannot find config for target=rocm, workload=('tvm_matmul_tune_op', 4096, 4096, 4096). A fallback configuration is used, which may bring great performance regression.
Best config:
,None,None
[08:24:19] /host/docker/matmul_tvm/tune_rocm/tvm/src/pass/vectorize_loop.cc:362: Detect vector condition in Vectorized Loop, scalarizing...
[08:24:19] /host/docker/matmul_tvm/tune_rocm/tvm/src/pass/vectorize_loop.cc:362: Detect vector condition in Vectorized Loop, scalarizing...
[codegen.build_aocl] [codegen.build_opengl] [codegen.build_metal] [codegen.build_opencl] [codegen.build_aocl_sw_emu] [codegen.build_sdaccel] [codegen.build_c] [codegen.build_stackvm] [codegen.build_rocm] [codegen.build_nvptx] [codegen.build_llvm] ..
'gfx906' is not a recognized processor for this target (ignoring processor)
'gfx906' is not a recognized processor for this target (ignoring processor)
Segmentation fault (core dumped)
Is your card really gfx906? Then I think you need a fairly recent LLVM.
@masahi Yes, all other ROCm apps run well. I am using LLVM 6.0. Can you suggest a specific version that you have tested with ROCm 2.x? I tried some other LLVM versions, but they did not work well with TVM.
I have only used LLVM 6.0 with my gfx803 card. If other rocm apps are working, you can try the same LLVM version that comes with your rocm installation.
@masahi I upgraded to llvm-8.0 this time, but TVM throws a different exception:
[13:11:17] /host/docker/matmul_tvm/tune_rocm/tvm/src/pass/vectorize_loop.cc:362: Detect vector condition in Vectorized Loop, scalarizing...
[13:11:17] /host/docker/matmul_tvm/tune_rocm/tvm/src/pass/vectorize_loop.cc:362: Detect vector condition in Vectorized Loop, scalarizing...
Traceback (most recent call last):
File "matmul_autotvm.py", line 259, in <module>
search_matmul_config(64, 1024, 1024, 100) # m, k, n, num_trials
File "matmul_autotvm.py", line 178, in search_matmul_config
func(a, b, c)
File "/host/docker/matmul_tvm/tune_rocm/tvm/python/tvm/_ffi/function.py", line 144, in __call__
return f(*args)
File "/host/docker/matmul_tvm/tune_rocm/tvm/python/tvm/_ffi/_ctypes/function.py", line 206, in __call__
raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
[bt] (3) /host/docker/matmul_tvm/tune_rocm/tvm/build/libtvm.so(TVMFuncCall+0x61) [0x7f16cf74cd51]
[bt] (2) /host/docker/matmul_tvm/tune_rocm/tvm/build/libtvm.so(+0x95b917) [0x7f16cf7bc917]
[bt] (1) /host/docker/matmul_tvm/tune_rocm/tvm/build/libtvm.so(+0x95b64f) [0x7f16cf7bc64f]
[bt] (0) /host/docker/matmul_tvm/tune_rocm/tvm/build/libtvm.so(+0x1470a3) [0x7f16cefa80a3]
File "/host/docker/matmul_tvm/tune_rocm/tvm/src/runtime/rocm/rocm_module.cc", line 98
File "/host/docker/matmul_tvm/tune_rocm/tvm/src/runtime/module_util.cc", line 73
ROCMError: Check failed: ret == 0 (-1 vs. 0) : hipModuleGetFunction matmul_kernel0 failed with error: hipErrorNotFound
@masahi Seems like TVM is not working well with ROCm 2.2
hmm I don't know what's happening. It's been a while since I used rocm backend. Does a very simple example like vector add work?
@masahi Can you provide such a sample? Maybe the code I am using is not correct either.
However, the same code works for the CUDA and OpenCL backends; it only fails for the ROCm backend.
https://github.com/dmlc/tvm/blob/master/tutorials/tensor_expr_get_started.py
You can replace cuda with rocm.
@masahi Also failed with the same error:
vecadd# python3 example.py
<class 'tvm.tensor.Tensor'>
Traceback (most recent call last):
File "example.py", line 155, in <module>
fadd(a, b, c)
File "/host/docker/matmul_tvm/tune_rocm/tvm/python/tvm/_ffi/function.py", line 144, in __call__
return f(*args)
File "/host/docker/matmul_tvm/tune_rocm/tvm/python/tvm/_ffi/_ctypes/function.py", line 206, in __call__
raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
[bt] (3) /host/docker/matmul_tvm/tune_rocm/tvm/build/libtvm.so(TVMFuncCall+0x61) [0x7fb25ce7ed51]
[bt] (2) /host/docker/matmul_tvm/tune_rocm/tvm/build/libtvm.so(+0x95b917) [0x7fb25ceee917]
[bt] (1) /host/docker/matmul_tvm/tune_rocm/tvm/build/libtvm.so(+0x95b64f) [0x7fb25ceee64f]
[bt] (0) /host/docker/matmul_tvm/tune_rocm/tvm/build/libtvm.so(+0x1470a3) [0x7fb25c6da0a3]
File "/host/docker/matmul_tvm/tune_rocm/tvm/src/runtime/rocm/rocm_module.cc", line 98
File "/host/docker/matmul_tvm/tune_rocm/tvm/src/runtime/module_util.cc", line 73
ROCMError: Check failed: ret == 0 (-1 vs. 0) : hipModuleGetFunction myadd_kernel0 failed with error: hipErrorNotFound
Ok thanks. I will take a look
hi, @masahi @ghostplant
I had the same problem and found a solution that works for me, so I would like to share my experience. The hipErrorNotFound error occurs when your environment is ROCm 2.3 (not compatible with LLVM 6 or 7) or when you use LLVM 8+.
The reason is that LLVM uses -mattr=+code-object-v3 by default for HSACO generation, but the ROCm hipModuleLoad function only accepts an HSACO with code object v2.
This means you need to pass -mattr=-code-object-v3 explicitly. Note the value given to -mattr: the plus sign + is replaced by a minus sign -.
@masahi My error occurs with LLVM 8+. I am curious why TVM for ROCm has to use a separate LLVM, considering that hcc from ROCm is already based on a modified LLVM (clang 9).
thanks @fundamat for the info, this is very helpful.
@ghostplant From ROCm we only use the HIP runtime. For codegen, all we need is LLVM's AMDGPU backend. The hcc compiler is for compiling HIP code (a CUDA-like language supported by AMD). We don't generate HIP; we generate LLVM IR directly, so we don't need hcc.
@masahi Thanks, hope to get it fixed soon for ROCm 2.3
hi @fundamat can you point me to the relevant documentation, so that I can understand your finding?
@masahi
You can find more information here: https://llvm.org/docs/AMDGPUUsage.html#code-object-metadata
I simply added -mattr=-code-object-v3 in BuildAMDGPU at codegen_amdgpu.cc:182:
config << "-mtriple=amdgcn-amd-amdhsa-hcc -mcpu=gfx"
<< DetectROCMComputeVersion(target) << " -mattr=-code-object-v3 "
<< target.substr(4, target.length() - 4);
Changing the target also works; there is no need to change the codegen.
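The codegen lines quoted above append everything after the leading rocm in the target string to the LLVM options, which is why the flag can be supplied through the target instead. A minimal sketch of building such a target string follows; with_code_object_v2 is a hypothetical helper, and whether a given TVM revision forwards the option this way is an assumption based on that snippet.

```python
def with_code_object_v2(target="rocm"):
    """Return a rocm target string that disables code object v3.

    Hypothetical helper: appends -mattr=-code-object-v3 unless the
    attribute is already present in the target string.
    """
    parts = target.split()
    if not parts or parts[0] != "rocm":
        raise ValueError("expected a target string starting with 'rocm'")
    if "+code-object-v3" in target:
        # Flip an explicit request for v3 into a request for v2.
        return target.replace("+code-object-v3", "-code-object-v3")
    if "-code-object-v3" in target:
        return target
    return target + " -mattr=-code-object-v3"


target = with_code_object_v2("rocm")
print(target)  # rocm -mattr=-code-object-v3
# The resulting string would replace 'rocm' in the build call from the first post:
# func = tvm.build(s, arg_bufs, target, name='matmul')
```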
@fundamat After applying your patch, I tried the tuning again. The hipErrorNotFound error is solved, but it is followed by the segmentation fault I described earlier. (The official tutorial example also segfaults.)
[08:32:40] /host/docker/matmul_tvm/tune_rocm/tvm/src/pass/vectorize_loop.cc:362: Detect vector condition in Vectorized Loop, scalarizing...
[08:32:40] /host/docker/matmul_tvm/tune_rocm/tvm/src/pass/vectorize_loop.cc:362: Detect vector condition in Vectorized Loop, scalarizing...
Segmentation fault (core dumped)
/root $
I am using LLVM-8 (the current latest stable version) and ROCm-2.3. I didn't use LLVM-9 because it is not stable.
Can you share your specific library version for LLVM and ROCm?
@ghostplant
I have no idea about the segmentation fault, but I'd be happy to share the library versions I use:
LLVM-9 (manually build)
ROCm 2.3
@fundamat I tried building the latest LLVM-9 on Ubuntu 18.04 with gcc-7, but the build failed halfway through.
@msakai Is there a Dockerfile that provides a reproducible TVM for ROCm environment? I only see ones for cpu/cuda/opencl; rocm is not available.
No sorry, our ROCm support is limited (not many people are interested) and I'm not familiar with docker. You can add one if you like.
@masahi If I can build TVM for ROCm without problems, I can write one. However, my TVM + ROCm environment is not working, so the reason I am asking for a ROCm Dockerfile is just to get an environment where TVM + ROCm runs without failures.
I've tested ROCm 2.3 + LLVM 6.0 on my gfx803 card. The vector add test and VGG inference work, but for some workloads TVM crashes with
### HCC STATUS_CHECK Error: HSA_STATUS_ERROR_INVALID_ISA (0x100f) at file:mcwamp_hsa.cpp line:1195
In particular, the resnet workload doesn't work. I'll try upgrading LLVM.
@fundamat What is your ld.lld version? Is tutorial/relay_quick_start.py working for you?
@ghostplant Does segfault still occur on vector add? Please try the simplest test first.
I have the same issue. VGG and standard ResNet work, but AutoTVM cannot find a good schedule because this error blocks most of the candidates.
I am using LLVM 9 + ROCM 2.3
@masahi Yes, what I tested is the simplest one. You can try this Dockerfile on your host; it segfaults on mine:
FROM rocm/dev-ubuntu-18.04
ENV HIP_PLATFORM hcc
ENV PATH $PATH:/opt/rocm/bin:/usr/local/rocm/bin
RUN apt-get update && apt install -y --no-install-recommends git ca-certificates \
python3-pip python3-wheel python3-setuptools python3-pytest python3-dev \
vim less netcat-openbsd inetutils-ping curl patch iproute2 \
g++ libpci3 libnuma-dev make file libelf-dev libboost-system-dev libboost-filesystem-dev \
miopen-hip miopengemm hipblas rocrand hipsparse hip-thrust cxlactivitylogger rocfft rocprofiler-dev llvm-7-dev cmake \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /root
RUN git clone https://github.com/dmlc/tvm --branch master --depth 1 --single-branch --recursive && \
cd tvm && git submodule init && git submodule update
RUN mkdir -p tvm/build && cd tvm/build && cp ../cmake/config.cmake . && \
sed -i 's/ROCM OFF/ROCM ON/g' config.cmake && \
sed -i 's/LLVM OFF/LLVM ON/g' config.cmake && \
cmake .. && make -j16
RUN pip3 install numpy tornado psutil xgboost numpy decorator attrs
ENV TVM_HOME=/root/tvm
ENV PYTHONPATH=$TVM_HOME/python:$TVM_HOME/topi/python:$TVM_HOME/nnvm/python
RUN curl -Ls https://github.com/dmlc/tvm/raw/master/tutorials/tensor_expr_get_started.py | sed 's/cuda/rocm/g' > tensor_expr_get_started.py
To build and run:
# [DOCKER BUILD]
docker build -t tvm-rocm --network=host .
# [RUN TEST]
docker run -it --privileged --network=host tvm-rocm python3 tensor_expr_get_started.py
And this is my output:
/root/docker$ docker run -it --privileged --network=host tvm-rocm python3 tensor_expr_get_started.py
<class 'tvm.tensor.Tensor'>
Segmentation fault (core dumped)
It seems you need a newer LLVM. That error comes from LLVM, not TVM.
@masahi OK, I updated to LLVM 7, 8, and 9 in turn (updated in my scripts). The gfx906 warning is gone, but they all output Segmentation fault (core dumped).
@masahi I rebuilt llvm-9 from source and the segfault is solved, so the problem is that installing llvm-9 from binary packages causes the segfault.
After that, I tried tuning applications again. A very small group of applications run successfully, while the majority raise various Python-level TVM exceptions. I also confirmed that all of the same cases run well with cuda / opencl.
@masahi @anijain2305 I have another question. Source code generation after tuning does not seem to work for the ROCm platform: TVM codegen for other platforms such as CUDA and OpenCL outputs the respective kernel source code, but TVM codegen for ROCm outputs nothing.
you can use get_source("llvm") or get_source("asm").
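A minimal sketch of collecting both forms is shown here. It assumes func is the module returned by tvm.build(..., 'rocm', ...) and that the device code is the first entry of func.imported_modules, as in the tensor_expr_get_started.py tutorial linked earlier; dump_device_sources is a hypothetical helper name.

```python
def dump_device_sources(func, formats=("llvm", "asm")):
    """Return {format: source text} for the device module of a built function.

    Assumption: the device module is func.imported_modules[0] and it
    supports get_source(fmt) for each requested format.
    """
    dev_module = func.imported_modules[0]
    return {fmt: dev_module.get_source(fmt) for fmt in formats}


# Usage with a function built for the rocm target:
# sources = dump_device_sources(func)
# print(sources["llvm"])  # LLVM IR
# print(sources["asm"])   # AMDGPU assembly
```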
@masahi OK, but how can I compile the llvm/asm-level source code after getting it?
@masahi For asm, I think I can follow https://gpuopen.com/amdgcn-assembly/, but what about code in LLVM format?
The output of get_source is not meant to be combined by other means. What do you want to do?
@masahi I want to tune an op with specific config, get the best-working kernel source code, and freeze it as a hipModule.
given that this question has evolved to more general discussions, I would recommend starting a conversation on https://discuss.tvm.ai/ :)
@masahi Do you know why CUDA/OpenCL platform can get C source, but ROCm platform can't?
Because the CUDA and OpenCL backends generate kernel source as C strings, while the ROCm backend generates LLVM IR.
@antinucleon Actually, most of my tuning trials end in this failure, which does not happen with the CUDA backend:
error: local memory limit exceeded (73728) in default_function_kernel0
Do you know the solution to this error?
This is not clean, but you can modify this block for rocm target.
if 'cuda' in self.task.target.keys or 'opencl' in self.task.target.keys:
    remote = request_remote(self.key, self.host, self.port)
    ctx = remote.context(str(self.task.target), 0)
    max_dims = ctx.max_thread_dimensions
    kwargs['check_gpu'] = {
        'max_shared_memory_per_block': ctx.max_shared_memory_per_block,
        'max_threads_per_block': ctx.max_threads_per_block,
        'max_thread_x': max_dims[0],
        'max_thread_y': max_dims[1],
        'max_thread_z': max_dims[2],
    }
For rocm, max_shared_memory_per_block should be 48KB, and max threads per block should be 256. Don't forget to add "if 'rocm' in self.task.target.keys".
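As a rough sketch of the rocm branch suggested here, the constraints could be built as follows. Only the two limits named above are included; the per-dimension thread limits are left out because the thread does not give values for them, and whether the verifier accepts a dictionary with only these two keys is an assumption. rocm_check_gpu is a hypothetical helper name.

```python
# Limits suggested in this thread for rocm.
ROCM_LIMITS = {
    "max_shared_memory_per_block": 48 * 1024,  # 48KB expressed in bytes
    "max_threads_per_block": 256,
}


def rocm_check_gpu(target_keys):
    """Return the check_gpu constraints for a rocm target, or None for other targets."""
    if "rocm" not in target_keys:
        return None
    return dict(ROCM_LIMITS)  # copy so callers cannot mutate the shared defaults


print(rocm_check_gpu(["rocm", "gpu"]))
# In the block above this would be used roughly as:
# if 'rocm' in self.task.target.keys:
#     kwargs['check_gpu'] = rocm_check_gpu(self.task.target.keys)
```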
@masahi Do you know whether this warning matters for tuning?
DEBUG:autotvm:No: 1000 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError('Traceback (most recent call last):\n [bt] (1) /opt/tvm/build/libtvm.so(TVMFuncCall+0x65) [0x7fc0db08f915]\n [bt] (0) /opt/tvm/build/libtvm.so(+0x8f2e8b) [0x7fc0db08be8b]\n File "/opt/tvm/python/tvm/_ffi/_ctypes/function.py", line 71, in cfun\n rv = local_pyfunc(*pyargs)\n File "/opt/tvm/python/tvm/autotvm/measure/measure_methods.py", line 595, in verify_pass\n raise InstantiationError("Skipped because of invalid gpu kernel")\ntvm.autotvm.task.space.InstantiationError: Skipped because of invalid gpu kernel',),), error_no=1, all_cost=0.02058720588684082, timestamp=1557040253.856357) [('tile_k', [16, 2, 64]), ('tile_y', [1, 2, 32, 32]), ('tile_x', [1, 1024, 1, 2]), ('auto_unroll_max_step', 1500)],,None,12857110
Too many errors happen in the tuning. Now is in debug mode
WARNING:autotvm:Too many errors happen in the tuning. Now is in debug mode
If you specify constraints on shared memory size, number of threads, etc., verify_pass rejects kernel configurations that do not satisfy those constraints (invalid configurations are still visited because of the random nature of autotvm). I think the warning you got is the result of such rejections.
As long as autotvm visits configurations that satisfy your constraints, it should be able to make progress.
I also recommend making 'auto_unroll_max_step' smaller. 1500 is too big and puts pressure on register usage. 125-256 should be good.
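One way to apply this is to cap the candidate values before they are handed to the tuning template. The sketch below is illustrative only: the candidate list is an assumption, small_unroll_candidates is a hypothetical helper, and the commented define_knob line assumes cfg is the autotvm config object of the template.

```python
UNROLL_LIMIT = 256  # upper end of the range suggested above


def small_unroll_candidates(candidates, limit=UNROLL_LIMIT):
    """Keep candidates between 0 and `limit`, sorted and de-duplicated."""
    return sorted({v for v in candidates if 0 <= v <= limit})


# The input values here are illustrative assumptions, not taken from the tuning script.
candidates = small_unroll_candidates([0, 64, 128, 256, 512, 1500])
print(candidates)  # [0, 64, 128, 256]
# Inside the tuning template:
# cfg.define_knob("auto_unroll_max_step", candidates)
```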
@masahi Thanks, I updated this parameter, though the tuning results did not get better.
@tqchen I cannot sign up for TVM Discuss. My email provider is qq.com and I never receive the activation email; there is nothing in the junk folder either.