Currently, MXNet only supports tensor sizes smaller than 2^31 elements. To support large tensors, users need to recompile MXNet with the USE_INT64_TENSOR_SIZE compiler flag set to ON.
Large tensors are often used in applications such as recommendation systems with sparse embedding matrices and graph neural network frameworks such as DGL.
To provide a better user experience, we would like to turn on this compiler flag by default so that the MXNet binary release supports large tensors out of the box.
Large tensor support is already implemented in the MXNet backend and C API. Over 80 operators have been tested and more are being tested.
There was performance degradation in a few operators such as transpose, and it has been fixed (https://github.com/apache/incubator-mxnet/pull/16104).
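As a quick sanity check of such a build, a minimal Python sketch (assuming the INT64_TENSOR_SIZE feature flag is exposed through mx.runtime.Features() and that roughly 2 GB of free memory is available for the int8 array below):

```python
import mxnet as mx

# Check whether this MXNet binary was compiled with int64 tensor size support.
features = mx.runtime.Features()
print("INT64_TENSOR_SIZE enabled:", features.is_enabled("INT64_TENSOR_SIZE"))

# Try allocating a tensor with more than 2^31 elements (int8 keeps this ~2 GB).
x = mx.nd.ones(shape=(2**31 + 1,), dtype='int8')
print(x.size)  # 2147483649 on a large-tensor build; an int32 build cannot represent this size
```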
int64/int32 P50 records the 50th-percentile inference runtime.
% Diff: runtime speedup of the int64 build vs the int32 build.
Thus a positive value means inference time is reduced when int64 is used as the tensor index type.
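For reference, the Diff column is computed as 1 - (int64 P50 / int32 P50), the same formula stated explicitly later in this thread; a tiny illustrative sketch using the resnext101_64x4d gluon row:

```python
def pct_diff(p50_int64, p50_int32):
    """Positive means the int64 build has lower latency (is faster)."""
    return (1 - p50_int64 / p50_int32) * 100

# resnext101_64x4d, gluon mode
print(round(pct_diff(47.34253883, 49.46685), 2))  # 4.29
```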
Model | Mode | int64 P50 (ms) | int32 P50 (ms) | Diff (%)
-- | -- | -- | -- | --
resnext101_64x4d | gluon | 47.34253883 | 49.46685 | 4.29%
resnext101_64x4d | module | 28.83672714 | 28.48792 | -1.22%
resnext50 | gluon | 17.14539528 | 18.05592 | 5.04%
resnext50 | module | 10.05506516 | 9.636641 | -4.34%
nin | gluon | 2.574443817 | 2.608061 | 1.29%
nin | module | 2.432107925 | 2.737761 | 11.16%
resnet18 | gluon | 3.895759583 | 3.638268 | -7.08%
resnet18 | module | 2.954959869 | 3.182888 | 7.16%
wavernn | gluon | 262.9389763 | 256.5546 | -2.49%
caffenet | gluon | 2.930879593 | 3.087759 | 5.08%
caffenet | module | 3.169536591 | 3.225327 | 1.73%
vgg19 | gluon | 14.18304443 | 13.89098 | -2.10%
vgg19 | module | 13.80157471 | 14.33492 | 3.72%
maskrcnn | gluon | 2340.852737 | 2391.741 | 2.13%
maskrcnn | module | 1943.515778 | 1926.38 | -0.89%
superres | gluon | 17.39168167 | 18.00895 | 3.43%
superres | module | 16.98470116 | 17.26198 | 1.61%
resnet101 | gluon | 18.73707771 | 18.4412 | -1.60%
resnet101 | module | 16.66593552 | 14.78386 | -12.73%
vgg16 | gluon | 12.403965 | 16.2611 | 23.72%
vgg16 | module | 17.93074608 | 11.83605 | -51.49%
yolov3 | gluon | 22.96686172 | 23.01311 | 0.20%
yolov3 | module | 18.57829094 | 20.05506 | 7.36%
ssd | gluon | 17.17400551 | 16.73698 | -2.61%
ssd | module | 13.98611069 | 14.00757 | 0.15%
rnn | gluon | 28.2740593 | 28.92017 | 2.23%
rnn | module | 19.32096481 | 28.63479 | 32.53%
a3c | gluon | 0.928401947 | 0.94223 | 1.47%
a3c | module | 0.673055649 | 0.858545 | 21.61%
squeezenetv10 | gluon | 4.072666168 | 4.251957 | 4.22%
squeezenetv10 | module | 3.686189651 | 3.818274 | 3.46%
resnet152 | gluon | 25.8705616 | 27.65441 | 6.45%
resnet152 | module | 20.5206871 | 21.03257 | 2.43%
resnet34 | gluon | 6.978273392 | 7.166862 | 2.63%
resnet34 | module | 5.693674088 | 5.653858 | -0.70%
squeezenetv11 | gluon | 3.037929535 | 3.165722 | 4.04%
squeezenetv11 | module | 2.890110016 | 2.983332 | 3.12%
resnext101 | gluon | 29.1929245 | 27.65107 | -5.58%
resnext101 | module | 15.9804821 | 17.51709 | 8.77%
bert | gluon | 44.32678223 | 43.77675 | -1.26%
bert | module | 43.85828972 | 45.38655 | 3.37%
resnet50 | gluon | 10.39171219 | 10.31256 | -0.77%
resnet50 | module | 9.351491928 | 8.312941 | -12.49%
fasterrcnn | gluon | 1041.807413 | 1061.532 | 1.86%
fasterrcnn | module | 702.3141384 | 703.7232 | 0.20%
inception | gluon | 7.934331894 | 8.714437 | 8.95%
inception | module | 5.178928375 | 5.363703 | 3.44%
Average | gluon | n/a | n/a | 0.69%
Average | module | n/a | n/a | -0.37%
Model | int64 Samples/Second | int32 Samples/Second | Percentage Change
-- | -- | -- | --
xception | 67.51961 | 68.61849 | -1.60%
resnet50_v2 | 299.0174 | 299.1728 | -0.05%
gnmt | 7.65 | 7.675 | -0.33%
vgg16 | 228.4218 | 230.0739 | -0.72%
bert | 38.1 | 46.7 | -18.42%
yolo3_darknet53_custom | 31.6145 | 40.65 | -22.23%
inceptionv3 | 225.4025 | 227.1884 | -0.79%
se_resnet152_v1 | 123.7371 | 124.1493 | -0.33%
word_language_model | 15651.19 | 15524.71 | 0.81%
*mobilenet0.25_cifar10 | 56.6609205 | 60.5992765 | 6.50%
resnet101_v1 | 176.6355 | 177.3132 | -0.38%
squeezenet1.0 | 790.7722 | 790.1395 | 0.08%
mobilenetv2_0.75 | 680.4143 | 672.2202 | 1.22%
ssd | 66.2365 | 67.56 | -1.96%
Average | | | -3.44%
* measures speed instead of throughput
Thanks to @JonTanS for running the profiler; we have pinpointed the performance degradation to the operators broadcast_axis (from 138 ms to 177 ms) and MXNDArraySyncCopyToCPU (from 592 ms to 679 ms).
Running the operator-level profiler, we could identify the 2.2x performance drop in the broadcast_axis operator.
w/o USE_INT64_TENSOR_SIZE flag:
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 2.7753}]}]
w/ USE_INT64_TENSOR_SIZE flag:
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 6.3178}]}]
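The dumps above look like output from MXNet's OpPerf utility; a minimal manual sketch that times the same broadcast_axis input (shapes taken from the profiler output; warmup and repeat counts are arbitrary) would be:

```python
import time
import mxnet as mx

data = mx.nd.ones((1, 1024, 1), ctx=mx.gpu(0))

# Warm up so allocation and kernel setup are not measured.
for _ in range(10):
    mx.nd.broadcast_axis(data, axis=(0, 2), size=(1024, 8))
mx.nd.waitall()

runs = 100
start = time.perf_counter()
for _ in range(runs):
    out = mx.nd.broadcast_axis(data, axis=(0, 2), size=(1024, 8))
mx.nd.waitall()  # block until all asynchronous GPU work has finished
print("avg forward time (ms):", (time.perf_counter() - start) / runs * 1000)
```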
The root cause: too many div/mul/mod ALU operations on the indices, whose type changed from int32 to int64. The kernel below shows the per-element index computation:
template<typename OP>
struct broadcast_kernel {
  template<typename IType, typename OType>
  MSHADOW_XINLINE static void Map(index_t i,
                                  IType *input,
                                  OType *output,
                                  mshadow::Shape<MXNET_SPECIAL_MAX_NDIM> in_shape,
                                  mshadow::Shape<MXNET_SPECIAL_MAX_NDIM> out_shape,
                                  const OpReqType req,
                                  const uint32_t ndim) {
    size_t in_stride = 1;
    size_t out_stride = 1;
    index_t idx = i;
    index_t in_idx = i;
    // Walk the dimensions from innermost to outermost, converting the flat
    // output index i into the corresponding flat input index. Every iteration
    // does a mod and a div on idx, which are index_t operations and therefore
    // widen from 32-bit to 64-bit arithmetic when USE_INT64_TENSOR_SIZE is ON.
    for (int iter = ndim - 1; iter >= 0; --iter) {
      size_t dim_idx = idx % out_shape[iter];
      in_idx -= dim_idx * out_stride;
      if (in_shape[iter] != 1) {
        // Only advance the input index along dimensions that are not broadcast.
        in_idx += dim_idx * in_stride;
      }
      idx /= out_shape[iter];
      in_stride *= in_shape[iter];
      out_stride *= out_shape[iter];
    }
    KERNEL_ASSIGN(output[i], req, OP::Map(input[in_idx]));
  }
};
Add LT support to ops found via OpPerf
NN optimizers and 1 activation https://github.com/apache/incubator-mxnet/pull/17444 [Merged]
Random, Sample, PDF ops : https://github.com/apache/incubator-mxnet/pull/17445 [Merged]
[OpPerf] : Indexing Ops https://github.com/apache/incubator-mxnet/pull/16253 [Merged]
[OpPerf] : Neural Network Loss Ops https://github.com/apache/incubator-mxnet/pull/17482 [Merged]
[OpPerf] : Consolidate array manipulation related operators #17487
Inference benchmarks comparing the LT + MKL build with just MKL enabled.
All times are in ms.
% Diff is calculated as 1 - (P50 with LT / P50 without LT).
A positive number means a speed increase with LT, a negative number means a speed decrease.
Model | Mode | P50 w/ LT (ms) | P50 No LT (ms) | Percentage Difference
-- | -- | -- | -- | --
resnext101_64x4d | gluon | 47.34253883 | 49.46685 | 4.29%
resnext101_64x4d | module | 28.83672714 | 28.48792 | -1.22%
resnext50 | gluon | 17.14539528 | 18.05592 | 5.04%
resnext50 | module | 10.05506516 | 9.636641 | -4.34%
nin | gluon | 2.574443817 | 2.608061 | 1.29%
nin | module | 2.432107925 | 2.737761 | 11.16%
resnet18 | gluon | 3.895759583 | 3.638268 | -7.08%
resnet18 | module | 2.954959869 | 3.182888 | 7.16%
wavernn | gluon | 262.9389763 | 256.5546 | -2.49%
caffenet | gluon | 2.930879593 | 3.087759 | 5.08%
caffenet | module | 3.169536591 | 3.225327 | 1.73%
vgg19 | gluon | 14.18304443 | 13.89098 | -2.10%
vgg19 | module | 13.80157471 | 14.33492 | 3.72%
maskrcnn | gluon | 2340.852737 | 2391.741 | 2.13%
maskrcnn | module | 1943.515778 | 1926.38 | -0.89%
superres | gluon | 17.39168167 | 18.00895 | 3.43%
superres | module | 16.98470116 | 17.26198 | 1.61%
resnet101 | gluon | 18.73707771 | 18.4412 | -1.60%
resnet101 | module | 16.66593552 | 14.78386 | -12.73%
vgg16 | gluon | 12.403965 | 16.2611 | 23.72%
vgg16 | module | 17.93074608 | 11.83605 | -51.49%
yolov3 | gluon | 22.96686172 | 23.01311 | 0.20%
yolov3 | module | 18.57829094 | 20.05506 | 7.36%
ssd | gluon | 17.17400551 | 16.73698 | -2.61%
ssd | module | 13.98611069 | 14.00757 | 0.15%
rnn | gluon | 28.2740593 | 28.92017 | 2.23%
rnn | module | 19.32096481 | 28.63479 | 32.53%
a3c | gluon | 0.928401947 | 0.94223 | 1.47%
a3c | module | 0.673055649 | 0.858545 | 21.61%
squeezenetv10 | gluon | 4.072666168 | 4.251957 | 4.22%
squeezenetv10 | module | 3.686189651 | 3.818274 | 3.46%
resnet152 | gluon | 25.8705616 | 27.65441 | 6.45%
resnet152 | module | 20.5206871 | 21.03257 | 2.43%
resnet34 | gluon | 6.978273392 | 7.166862 | 2.63%
resnet34 | module | 5.693674088 | 5.653858 | -0.70%
squeezenetv11 | gluon | 3.037929535 | 3.165722 | 4.04%
squeezenetv11 | module | 2.890110016 | 2.983332 | 3.12%
resnext101 | gluon | 29.1929245 | 27.65107 | -5.58%
resnext101 | module | 15.9804821 | 17.51709 | 8.77%
bert | gluon | 44.32678223 | 43.77675 | -1.26%
bert | module | 43.85828972 | 45.38655 | 3.37%
resnet50 | gluon | 10.39171219 | 10.31256 | -0.77%
resnet50 | module | 9.351491928 | 8.312941 | -12.49%
fasterrcnn | gluon | 1041.807413 | 1061.532 | 1.86%
fasterrcnn | module | 702.3141384 | 703.7232 | 0.20%
inception | gluon | 7.934331894 | 8.714437 | 8.95%
inception | module | 5.178928375 | 5.363703 | 3.44%
drmm | gluon | 837.1179104 | 614.3708 | -36.26%
drmm | module | 830.9795856 | 607.6496 | -36.75%
Average Percentage Change over all numbers:
Gluon: 0.69%
Module: -0.37%
Training benchmarks comparing the LT + MKL build with just MKL enabled.
Speed is measured in seconds per epoch.
GPU memory is measured in MB.
Note: samples/second moves in the opposite direction, so I have multiplied those percentages by -1. A quick explanation: samples/second should go up, so without the sign flip a positive percentage change would mean fewer samples/second and a negative change would mean more samples/second.
Model | Speed P50 LT | Speed P50 No LT | GPU Memory LT | GPU Memory No LT | Samples/Second P50 LT | Samples/Second P50 no LT | Speed Percentage Change | GPU Memory Percentage Change | Samples/Second Percentage Change
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
xception | 19247.12517 | 18935.02989 | 15304 | 15320 | 67.51961 | 68.61849 | -1.65% | 0.10% | -1.60%
resnet50_v2 | 4342.953992 | 4342.899322 | 6892 | 6762 | 299.0174 | 299.1728 | 0.00% | -1.92% | -0.05%
gnmt | N/A | N/A | 4244 | 4112 | 7.65 | 7.675 | N/A | -3.21% | -0.33%
vgg16 | 5680.658345 | 5641.058277 | 9480 | 9496 | 228.4218 | 230.0739 | -0.70% | 0.17% | -0.72%
bert | 20.66 | 16.8 | 4684 | 4050 | 38.1 | 46.7 | -22.98% | -15.65% | -18.42%
yolo3_darknet53_custom | 517.4205 | 454.908 | 7304 | 12436 | 31.6145 | 40.65 | -13.74% | 41.27% | -22.23%
inceptionv3 | 5765.122603 | 5723.867063 | 8318 | 8304 | 225.4025 | 227.1884 | -0.72% | -0.17% | -0.79%
se_resnet152_v1 | 10497.33863 | 10465.23692 | 11290 | 10568 | 123.7371 | 124.1493 | -0.31% | -6.83% | -0.33%
word_language_model | 141.125 | 142.3 | 8846 | 7426 | 15651.19 | 15524.71 | 0.83% | -19.12% | 0.81%
mobilenet0.25_cifar10 | 56.6609205 | 60.5992765 | 1234 | 1134 | N/A | N/A | 6.50% | -8.82% | N/A
resnet101_v1 | 7354.353666 | 7329.202738 | 8118 | 8022 | 176.6355 | 177.3132 | -0.34% | -1.20% | -0.38%
squeezenet1.0 | 1677.752777 | 1678.684668 | 3770 | 3590 | 790.7722 | 790.1395 | 0.06% | -5.01% | 0.08%
mobilenetv2_0.75 | 1938.194231 | 1968.429737 | 5078 | 5008 | 680.4143 | 672.2202 | 1.54% | -1.40% | 1.22%
ssd | 424.28 | 254.9485 | 4702 | 4592 | 66.2365 | 67.56 | -66.42% | -2.40% | -1.96%
Average Percentage Change:
Speed: -7.53%
GPU Memory: -1.73%
Samples / Second: -3.44%
@jonatan1626 thanks for the update. Does -22.98% mean 22.98% slower?
@eric-haibin-lin Yes, I am calculating this as: 1 - (LT+MKL value / MKL-only value).
For the samples/sec I am doing the above and then multiplying by -1.
In your description, "A negative percentage change means there are more samples/second." Doesn't that mean a negative percentage is faster?
@apeforest Oh sorry, I'm multiplying only the samples/second column by -1 to keep the meaning consistent with everything else. The rest of the columns already follow the convention of a positive percentage meaning improvement and a negative percentage meaning degradation.
For example if MKL_LT gives 66 samples/sec and MKL gives 70 samples/sec that will be:
1-(66/70) or 6%. Because it's positive, we think that it's better but actually it's worse because the throughput has gone down.
On the other hand if MKL_LT gives 74 samples/sec and MKL gives 70 samples/sec that will be:
1-(74/70) or -5%. Because it's negative, we think it's worse but actually it's better because our throughput has gone up.
So I multiply by -1 to give it the same meaning as the rest of the percentages, where positive is better and negative is worse.
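A small illustrative sketch of this sign convention, using the numbers from the example above:

```python
def pct_change(lt_value, no_lt_value, higher_is_better=False):
    """1 - (LT / no-LT), sign-flipped for throughput-style metrics so that a
    positive result always means the large-tensor (LT) build is better."""
    diff = 1 - lt_value / no_lt_value
    return -diff * 100 if higher_is_better else diff * 100

# Latency-style metric (lower is better): positive means LT is faster.
print(round(pct_change(47.34, 49.47), 2))  # ~4.3

# Throughput-style metric (higher is better): 66 vs 70 samples/sec.
print(round(pct_change(66, 70, higher_is_better=True), 2))  # ~-5.7, i.e. LT is worse
```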
The slowdown for BERT (-22.98%) is quite significant. We will need to mitigate this before moving forward.
Thanks to @JonTanS for running the profiler; we have pinpointed the performance degradation to the operators broadcast_axis (from 138 ms to 177 ms) and MXNDArraySyncCopyToCPU (from 592 ms to 679 ms).
Running the operator-level profiler, we could also identify the performance drop in broadcast_axis alone.
w/o USE_INT64_TENSOR_SIZE flag:
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 2.7753}]}]
w/ USE_INT64_TENSOR_SIZE flag:
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 6.3178}]}]
Also, as I looked into the implementation of the broadcast_axis operator, many modulo and multiplication operations on the indices are involved. The next step will be to find an optimal implementation of broadcast_axis that reduces the ALU operations on indices in the kernel.
@szha @eric-haibin-lin @apeforest
With current master and the new broadcast_axis changes, on a p3.16xl single-GPU training run.
BERT run command:
python3 run_pretraining.py --data='./part-0000.train' --data_eval='./part-0000.train' --num_steps 100 --lr 1e-4 --optimizer lamb --accumulate 1 --raw --gpus 0 --num_dataset_workers 2 --num_batch_workers 1 --circle_length 1 --total_batch_size 4 --total_batch_size_eval 4 --log_interval 10
Results:
| Code Version | throughput avg (samples/sec) | throughput p50 | throughput p90 | total time (training only, ignoring evaluation steps) |
|--------------|------------------------------|----------------|----------------|-------------------------------------------------------|
| master LT | 24.38k | 25.50k | 28.47k | 134.8 sec |
| master | 25.90k | 25.90k | 27.82k | 131.9 sec |
| new LT | 25.87k | 25.80k | 28.00k | 127.3 sec |
| new | 25.92k | 25.80k | 27.80k | 131.5 sec |
"new" refers to mxnet code with optimized broadcast_axis.
"master" refers to mxnet master branch code
"LT" refers to of the build was done after enabling large tensor.
@access2rohit This result is a little surprising. In the earlier benchmark results provided by @JonTanS, there is a ~18% degradation in BERT training when large tensor (LT) compiler flag is turned on:
bert | 38.1 | 46.7 | -18.42%
-- | -- | -- | --
However, from your result, even without your latest speedup in the broadcast_axis operator, there is very little difference when the LT flag is on:
master LT | 24.38k | 25.50k | 28.47k | 134.8 sec
-- | -- | -- | -- | --
master | 25.90k | 25.90k | 27.82k | 131.9 sec
Could you provide more insights?
@apeforest The profiling done by @JonTanS was done a while back, using MXNet 1.6 in November. These results use the current master branch of MXNet, and the BERT scripts have changed too. If there are newer settings for running BERT on a single node, they are not available on the GluonNLP site. If @eric-haibin-lin or @szhengac can verify whether my BERT setup is correct and provide proper tuning params to run BERT on a single node, I will re-run the benchmarks and update the results here.
PR https://github.com/apache/incubator-mxnet/pull/17882 fixes the regression in SSD. Following are the new results for the SSD run:
Code | SSD 1-epoch time (sec) | % speedup/slowdown w.r.t. Master (large tensor disabled)
-- | -- | --
Master (large tensor disabled) | 226 | 0
Master (large tensor enabled) | 335 | 48.23% slowdown
Master + CPU Optimized broadcast_axis (large tensor disabled) | 130 | 42.5% speedup
Master + CPU Optimized broadcast_axis (large tensor enabled) | 184 | 18.5% speedup
@apeforest @sandeep-krishnamurthy @szha @zheng-da
PRs to enable Large Tensor Support as default in master are divided into two stages:
Stage 1: Unix CPU/GPU and Windows CPU/GPU https://github.com/apache/incubator-mxnet/pull/18625
Stage 2: All remaining platforms https://github.com/apache/incubator-mxnet/pull/18626
Once the above two PRs are merged, MXNet will support large tensors on CPU/GPU (depending on global/device memory) on master.
Currently, Large Tensor Support works on all operators implemented in MXNet, and MKLDNN also supports int64. Kernels written inside MXNet, both generic (cpu/gpu) and GPU-specific CUDA kernels, support large tensors depending on device memory.
BLAS and LAPACK libs were not considered while defining the scope of the project. Currently the following BLAS and LAPACK implementations are supported inside MXNet:
openBLAS (Default)
MKL
ATLAS
Apple Accelerate
Upon investigation, openBLAS needs to be built with a specific flag to support int64_t signatures, and MKL supports long long int signatures (in which case reinterpret_cast<>() is needed for casting pointers, since int64_t is treated as long int* rather than long long int* by MKL). Additionally, the LAPACK and BLAS wrappers need to be updated from int to int64_t.
Initially, openBLAS can be supported since it is used by default and in the PyPI wheels as well, thus not breaking any default behaviour for customers. Users of other BLAS and LAPACK implementations won't face issues as long as they don't use large tensors. Additional error messages will be added for the case where large tensors are used while the BLAS implementation is not openBLAS, until that BLAS library is made to work with MXNet's large tensor support.
NOTE: currently openBLAS works correctly with smaller inputs (within the range of int32) but will truncate parameters passed with larger values, which will result in either SIGSEGV (mostly) or garbage values (which will eventually cause a SIGSEGV in a bigger script).
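Purely as an illustration of that failure mode (hypothetical shapes; this needs roughly 18 GB of free memory, assumes dot on dense 2-D float inputs dispatches to the BLAS gemm path, and on a build without an int64-capable BLAS it may crash rather than raise a clean error):

```python
import mxnet as mx

n = 2**31 + 1  # one dimension larger than a 32-bit int can represent

# Each operand is ~8.6 GB in float32. The dimension n is passed down to the
# BLAS library; with a 32-bit-int BLAS interface it gets truncated, which can
# lead to SIGSEGV or silently wrong results.
a = mx.nd.ones((1, n))
b = mx.nd.ones((n, 1))
c = mx.nd.dot(a, b)
c.wait_to_read()
print("dot completed, result shape:", c.shape)  # (1, 1) if the int64 BLAS path works
```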
@sandeep-krishnamurthy @leezu @szha @zheng-da
Thanks @access2rohit for the summary.
Is the plan for enabling Large Tensor Support in the following order?
Do you see this order of execution as okay, @access2rohit @leezu @szha @zheng-da?
Has the large tensor for numpy array been supported?
@access2rohit can correct me, but a few of them are supported as they use the same kernels under the hood. The scope of this issue was mainly the NDArray API when it got started. After these are done, the remaining NumPy ops will also be supported.
Make openBLAS compatible with Large Tensor support and merge the PR for Enabling Large Tensor Support so that default PyPI users of MXNet can already benefit from the new capability. This will actually cover the largest user base of MXNet.
yes
Has the large tensor for numpy array been supported?
Upon inspecting the NumPy files inside MXNet, they are using index_t for iterating over elements in their own kernels and use the NDArray ones for the rest, where we ensured index_t is used where required. For kernels using BLAS, I will update them in the same PR that makes the MXNet wrappers for openBLAS int64 compatible.
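A quick hypothetical smoke test for the NumPy front end (assuming an LT-enabled build and about 4 GB of free memory; per the discussion below, some NumPy-front-end paths may not handle this yet):

```python
import mxnet as mx

# Allocate a numpy-frontend array with more than 2^31 elements (int8 keeps each
# array ~2 GB) and run a simple element-wise op plus an indexing access on it.
x = mx.np.ones((2**31 + 1,), dtype='int8')
y = x + 1
print(x.size)   # expected: 2147483649
print(y[-1])    # expected: 2
```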
NOTE: currently openBLAS works correctly with smaller inputs (within the range of int32) but will truncate parameters passed with larger values, which will result in either SIGSEGV (mostly) or garbage values (which will eventually cause a SIGSEGV in a bigger script).
I'm a little concerned that we don't have a correct integration of BLAS and LAPACK: calls into BLAS kernels may get crashes or corrupt results. But I think @sandeep-krishnamurthy's point
Make openBLAS compatible with Large Tensor support and merge the PR for Enabling Large Tensor Support so that default PyPI users of MXNet can already benefit from the new capability. This will actually cover the largest user base of MXNet.
refers to fixing this? If so, I'm fine with the order of execution. Thank you @access2rohit for the hard work on this feature
Upon investigation, openBLAS needs to be built with a specific flag to support int64_t signatures, and MKL supports long long int signatures (in which case reinterpret_cast<>() is needed for casting pointers, since int64_t is treated as long int* rather than long long int* by MKL). Additionally, the LAPACK and BLAS wrappers need to be updated from int to int64_t.
@leezu yes, that's what I meant.
I think the numpy frontend hasn't supported large tensors yet. I started working on it here https://github.com/apache/incubator-mxnet/pull/18368 but I haven't found the time to finish migrating all the tests. @access2rohit would you be able to help out and take that over?