Incubator-mxnet: MKLDNN incompatibility with large tensor (dim >= 2^32) data

Created on 14 Feb 2020 · 10 Comments · Source: apache/incubator-mxnet

Description

While testing individual ops for large tensor (dimension >= 2^32) input functionality, I found an error in MKLDNN. In 3rdparty/mkldnn/src/cpu/gemm/gemm.cpp, the function msan_unpoison_matrix (whose assertion is at line 43) takes several parameters, including M, which receives the first dimension of the input data. M is declared as an int, so when 2^32 is passed in as the first dimension of the input data, the M > 0 check in that assertion fails: a 32-bit int cannot represent 2^32, and the value wraps around to 0.
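The truncation described above can be observed without MKLDNN. The following is a minimal sketch that uses Python's `ctypes.c_int32` as a stand-in for a 32-bit C++ `int` (an assumption: the width of `int` is platform-dependent, although it is 32 bits on the x86_64 Linux system used here). `narrow_to_int32` is a hypothetical helper name and is not part of MXNet or MKLDNN.

```python
import ctypes


def narrow_to_int32(value):
    """Return the value a 32-bit signed int would hold after narrowing.

    ctypes integer types do no overflow checking, so the high bits are
    simply dropped. This is assumed to mirror what happens to M here.
    """
    return ctypes.c_int32(value).value


m = 2**32                      # first dimension from the repro script
print(narrow_to_int32(m))      # 0, the value the M > 0 check would see
print(narrow_to_int32(m) > 0)  # False, so the assertion would fire
```

Any first dimension that is an exact multiple of 2^32 narrows to 0 in this model; other values above 2^31 - 1 narrow to some unrelated, possibly negative, number.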

Note that this error occurs whenever MKLDNN is enabled - whether the BLAS engine is MKL, OpenBLAS, or none. When MKLDNN is disabled, the error does not occur.
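The feature lists in this report mark each build flag with ✔ or ✖. As a rough sketch, such a list can be parsed to check whether a given build has MKLDNN enabled. `enabled_features` is a hypothetical helper written for this illustration, and the two strings are shortened excerpts of the lists below, not complete copies.

```python
def enabled_features(feature_list):
    """Parse a '✔ NAME, ✖ NAME, ...' string into the set of enabled names."""
    enabled = set()
    for entry in feature_list.split(","):
        parts = entry.split()
        if len(parts) == 2 and parts[0] == "✔":
            enabled.add(parts[1])
    return enabled


# Shortened excerpts of two feature lists from this report.
failing = "✖ BLAS_OPEN, ✖ BLAS_MKL, ✔ LAPACK, ✔ MKLDNN, ✔ INT64_TENSOR_SIZE"
passing = "✔ BLAS_OPEN, ✖ BLAS_MKL, ✔ LAPACK, ✖ MKLDNN, ✔ INT64_TENSOR_SIZE"

print("MKLDNN" in enabled_features(failing))  # True
print("MKLDNN" in enabled_features(passing))  # False
```

On a live installation, the same information should be available from `mxnet.runtime.Features()`; the string parser above is only meant for inspecting pasted lists like the ones in this report.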

All tests were run on the latest master, building from source.

Environment

----------Python Info----------
Version      : 3.6.6
Compiler     : GCC 7.2.0
Build        : ('default', 'Jun 28 2018 17:14:51')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.3.1
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/ubuntu/mxnet/python/mxnet
Num GPUs     : 0
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.4.0-1098-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-47-40
release      : 4.4.0-1098-aws
version      : #109-Ubuntu SMP Fri Nov 8 09:30:18 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:              7
CPU MHz:               2500.000
BogoMIPS:              5000.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-23,48-71
NUMA node1 CPU(s):     24-47,72-95
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku

Steps to reproduce

Create a Python script with the following content:

import mxnet as mx

print(mx.nd.FullyConnected(data=mx.nd.random_normal(shape=(2**32,1)), weight=mx.nd.random_normal(shape=(1,1)), bias=mx.nd.random_normal(shape=(1,)), flatten=False, num_hidden=1))

and run it with Python 3.
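To see which dimensions in the script above cannot pass through a 32-bit `int` parameter, a small check like the following may help. It is only a sketch: `INT32_MAX` and `dims_exceeding_int32` are hypothetical names, the code does not call MXNet, and it assumes the affected parameters are 32-bit signed integers.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed int can hold


def dims_exceeding_int32(shape):
    """Return the indices of dimensions that do not fit in a 32-bit int."""
    return [i for i, dim in enumerate(shape) if dim > INT32_MAX]


data_shape = (2**32, 1)  # shape of `data` in the script above
weight_shape = (1, 1)    # shape of `weight` in the script above

print(dims_exceeding_int32(data_shape))    # [0]
print(dims_exceeding_int32(weight_shape))  # []
```

Only the first dimension of `data` is flagged, which matches the report that M is the parameter involved in the failure.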

Failing Environments and Errors

BLAS = None, MKLDNN = enabled

Feature List

✖ CUDA, ✖ CUDNN, ✖ NCCL, ✖ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✖ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✔ MKLDNN, ✖ OPENCV, ✖ CAFFE, ✖ PROFILER, ✖ DIST_KVSTORE, ✖ CXX14, ✔ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✔ DEBUG, ✖ TVM_OP

Error

python3: /home/ubuntu/mxnet/3rdparty/mkldnn/src/cpu/gemm/gemm.cpp:43: void dnnl::impl::cpu::msan_unpoison_matrix(void*, int, int, int, size_t): Assertion `C != nullptr && M > 0 && N > 0 && LDC >= M && typesize' failed.
Aborted (core dumped)

BLAS = MKL, MKLDNN = enabled

Feature List

✖ CUDA, ✖ CUDNN, ✖ NCCL, ✖ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✖ BLAS_OPEN, ✖ BLAS_ATLAS, ✔ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✔ MKLDNN, ✖ OPENCV, ✖ CAFFE, ✖ PROFILER, ✖ DIST_KVSTORE, ✖ CXX14, ✔ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✔ DEBUG, ✖ TVM_OP

Error

python3: /home/ubuntu/mxnet/3rdparty/mkldnn/src/cpu/gemm/gemm.cpp:43: void dnnl::impl::cpu::msan_unpoison_matrix(void*, int, int, int, size_t): Assertion `C != nullptr && M > 0 && N > 0 && LDC >= M && typesize' failed.
Aborted (core dumped)

BLAS = OpenBLAS, MKLDNN = enabled

Feature List

✖ CUDA, ✖ CUDNN, ✖ NCCL, ✖ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✔ MKLDNN, ✖ OPENCV, ✖ CAFFE, ✖ PROFILER, ✖ DIST_KVSTORE, ✖ CXX14, ✔ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✔ DEBUG, ✖ TVM_OP

Error

python3: /home/ubuntu/mxnet/3rdparty/mkldnn/src/cpu/gemm/gemm.cpp:43: void dnnl::impl::cpu::msan_unpoison_matrix(void*, int, int, int, size_t): Assertion `C != nullptr && M > 0 && N > 0 && LDC >= M && typesize' failed.
Aborted (core dumped)

Successful Environments and Outputs

BLAS = None, MKLDNN = disabled

Feature List

✖ CUDA, ✖ CUDNN, ✖ NCCL, ✖ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✖ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✖ OPENCV, ✖ CAFFE, ✖ PROFILER, ✖ DIST_KVSTORE, ✖ CXX14, ✔ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✔ DEBUG, ✖ TVM_OP

Output

[[1.1367434]
 [1.1367434]
 [1.1367434]
 ...
 [1.1367434]
 [1.1367434]
 [1.1367434]]
<NDArray 4294967296x1 @cpu(0)>

BLAS = MKL, MKLDNN = disabled

Feature List

✖ CUDA, ✖ CUDNN, ✖ NCCL, ✖ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✖ BLAS_OPEN, ✖ BLAS_ATLAS, ✔ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✖ OPENCV, ✖ CAFFE, ✖ PROFILER, ✖ DIST_KVSTORE, ✖ CXX14, ✔ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✔ DEBUG, ✖ TVM_OP

Output

[[1.1367434]
 [1.1367434]
 [1.1367434]
 ...
 [1.1367434]
 [1.1367434]
 [1.1367434]]
<NDArray 4294967296x1 @cpu(0)>

BLAS = OpenBLAS, MKLDNN = disabled

Feature List

✖ CUDA, ✖ CUDNN, ✖ NCCL, ✖ CUDA_RTC, ✖ TENSORRT, ✔ CPU_SSE, ✔ CPU_SSE2, ✔ CPU_SSE3, ✔ CPU_SSE4_1, ✔ CPU_SSE4_2, ✖ CPU_SSE4A, ✔ CPU_AVX, ✖ CPU_AVX2, ✔ OPENMP, ✖ SSE, ✔ F16C, ✖ JEMALLOC, ✔ BLAS_OPEN, ✖ BLAS_ATLAS, ✖ BLAS_MKL, ✖ BLAS_APPLE, ✔ LAPACK, ✖ MKLDNN, ✖ OPENCV, ✖ CAFFE, ✖ PROFILER, ✖ DIST_KVSTORE, ✖ CXX14, ✔ INT64_TENSOR_SIZE, ✖ SIGNAL_HANDLER, ✔ DEBUG, ✖ TVM_OP

Output

[[1.1367434]
 [1.1367434]
 [1.1367434]
 ...
 [1.1367434]
 [1.1367434]
 [1.1367434]]
<NDArray 4294967296x1 @cpu(0)>

Bug MKLDNN

Source

connorgoggins

All 10 comments

@mxnet-label-bot add [MKLDNN]

access2rohit on 14 Feb 2020

@PatricZhao Could your team please take a look at this? Thanks!

apeforest on 14 Feb 2020

👍2

@connorgoggins thanks for bringing this up

@PatricZhao @TaoLv it looks like BLAS = MKL/OpenBLAS/none (native MXNet) with MKLDNN = OFF supports GEMM on int64, but with MKLDNN enabled it does not. If this is not a known issue with MKLDNN, can you please take a look?

access2rohit on 14 Feb 2020

👍1

Thank you for reporting the issue. I will take a look at this. My initial thought is that MKL-DNN itself has supported int64 shapes since the v1.0 upgrade, while I don't think the current integration of MKL/OpenBLAS supports int64 GEMM.

TaoLv on 15 Feb 2020

I can reproduce the crash.

TaoLv on 15 Feb 2020

👍2

@TaoLv thanks for taking a look!

access2rohit on 16 Feb 2020

@access2rohit @connorgoggins This was confirmed to be a bug in the DNNL library. We still need to wait for the next release of the library to get the fix.

TaoLv on 6 Mar 2020

This issue is resolved in oneDNN v1.4.