Currently, MXNet only supports tensor sizes smaller than 2^31 elements. To support large tensors, users need to recompile MXNet with the USE_INT64_TENSOR_SIZE compiler flag set to ON.
Large tensors are often used in applications such as recommendation systems with sparse embedding matrices and graph neural network frameworks such as DGL.
To provide a better user experience, we would like to turn on this compiler flag by default so that the MXNet binary release supports large tensors out of the box.
Large tensor support is already implemented in the MXNet backend and C API. Over 80 operators have been tested and more are being tested.
There was performance degradation in a few operators such as transpose, and it has been fixed (https://github.com/apache/incubator-mxnet/pull/16104).
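As a quick sanity check of such a build, a minimal Python sketch (assuming the INT64_TENSOR_SIZE feature flag is exposed through mx.runtime.Features() and that roughly 2 GB of free memory is available for the int8 array below):

```python
import mxnet as mx

# Check whether this MXNet binary was compiled with int64 tensor size support.
features = mx.runtime.Features()
print("INT64_TENSOR_SIZE enabled:", features.is_enabled("INT64_TENSOR_SIZE"))

# Try allocating a tensor with more than 2^31 elements (int8 keeps this ~2 GB).
x = mx.nd.ones(shape=(2**31 + 1,), dtype='int8')
print(x.size)  # 2147483649 on a large-tensor build; an int32 build cannot represent this size
```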
int64/int32 P50 records the 50th-percentile inference runtime.
% Diff: runtime speedup of the int64 build vs the int32 build.
Thus a positive value means inference time is reduced when int64 is used as the tensor index type.
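For reference, the Diff column is computed as 1 - (int64 P50 / int32 P50), the same formula stated explicitly later in this thread; a tiny illustrative sketch using the resnext101_64x4d gluon row:

```python
def pct_diff(p50_int64, p50_int32):
    """Positive means the int64 build has lower latency (is faster)."""
    return (1 - p50_int64 / p50_int32) * 100

# resnext101_64x4d, gluon mode
print(round(pct_diff(47.34253883, 49.46685), 2))  # 4.29
```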
Model | Mode | int64 P50 (ms) | int32 P50 (ms) | Diff (%)
-- | -- | -- | -- | --
resnext101_64x4d | gluon | 47.34253883 | 49.46685 | 4.29%
resnext101_64x4d | module | 28.83672714 | 28.48792 | -1.22%
resnext50 | gluon | 17.14539528 | 18.05592 | 5.04%
resnext50 | module | 10.05506516 | 9.636641 | -4.34%
nin | gluon | 2.574443817 | 2.608061 | 1.29%
nin | module | 2.432107925 | 2.737761 | 11.16%
resnet18 | gluon | 3.895759583 | 3.638268 | -7.08%
resnet18 | module | 2.954959869 | 3.182888 | 7.16%
wavernn | gluon | 262.9389763 | 256.5546 | -2.49%
caffenet | gluon | 2.930879593 | 3.087759 | 5.08%
caffenet | module | 3.169536591 | 3.225327 | 1.73%
vgg19 | gluon | 14.18304443 | 13.89098 | -2.10%
vgg19 | module | 13.80157471 | 14.33492 | 3.72%
maskrcnn | gluon | 2340.852737 | 2391.741 | 2.13%
maskrcnn | module | 1943.515778 | 1926.38 | -0.89%
superres | gluon | 17.39168167 | 18.00895 | 3.43%
superres | module | 16.98470116 | 17.26198 | 1.61%
resnet101 | gluon | 18.73707771 | 18.4412 | -1.60%
resnet101 | module | 16.66593552 | 14.78386 | -12.73%
vgg16 | gluon | 12.403965 | 16.2611 | 23.72%
vgg16 | module | 17.93074608 | 11.83605 | -51.49%
yolov3 | gluon | 22.96686172 | 23.01311 | 0.20%
yolov3 | module | 18.57829094 | 20.05506 | 7.36%
ssd | gluon | 17.17400551 | 16.73698 | -2.61%
ssd | module | 13.98611069 | 14.00757 | 0.15%
rnn | gluon | 28.2740593 | 28.92017 | 2.23%
rnn | module | 19.32096481 | 28.63479 | 32.53%
a3c | gluon | 0.928401947 | 0.94223 | 1.47%
a3c | module | 0.673055649 | 0.858545 | 21.61%
squeezenetv10 | gluon | 4.072666168 | 4.251957 | 4.22%
squeezenetv10 | module | 3.686189651 | 3.818274 | 3.46%
resnet152 | gluon | 25.8705616 | 27.65441 | 6.45%
resnet152 | module | 20.5206871 | 21.03257 | 2.43%
resnet34 | gluon | 6.978273392 | 7.166862 | 2.63%
resnet34 | module | 5.693674088 | 5.653858 | -0.70%
squeezenetv11 | gluon | 3.037929535 | 3.165722 | 4.04%
squeezenetv11 | module | 2.890110016 | 2.983332 | 3.12%
resnext101 | gluon | 29.1929245 | 27.65107 | -5.58%
resnext101 | module | 15.9804821 | 17.51709 | 8.77%
bert | gluon | 44.32678223 | 43.77675 | -1.26%
bert | module | 43.85828972 | 45.38655 | 3.37%
resnet50 | gluon | 10.39171219 | 10.31256 | -0.77%
resnet50 | module | 9.351491928 | 8.312941 | -12.49%
fasterrcnn | gluon | 1041.807413 | 1061.532 | 1.86%
fasterrcnn | module | 702.3141384 | 703.7232 | 0.20%
inception | gluon | 7.934331894 | 8.714437 | 8.95%
inception | module | 5.178928375 | 5.363703 | 3.44%
Average | gluon | n/a | n/a | 0.69%
Average | module | n/a | n/a | -0.37%
Model | int64 Samples/Second | int32 Samples/Second | Percentage Change
-- | -- | -- | --
xception | 67.51961 | 68.61849 | -1.60%
resnet50_v2 | 299.0174 | 299.1728 | -0.05%
gnmt | 7.65 | 7.675 | -0.33%
vgg16 | 228.4218 | 230.0739 | -0.72%
bert | 38.1 | 46.7 | -18.42%
yolo3_darknet53_custom | 31.6145 | 40.65 | -22.23%
inceptionv3 | 225.4025 | 227.1884 | -0.79%
se_resnet152_v1 | 123.7371 | 124.1493 | -0.33%
word_language_model | 15651.19 | 15524.71 | 0.81%
*mobilenet0.25_cifar10 | 56.6609205 | 60.5992765 | 6.50%
resnet101_v1 | 176.6355 | 177.3132 | -0.38%
squeezenet1.0 | 790.7722 | 790.1395 | 0.08%
mobilenetv2_0.75 | 680.4143 | 672.2202 | 1.22%
ssd | 66.2365 | 67.56 | -1.96%
Average | | | -3.44%
* measures speed instead of throughput
Thanks to @JonTanS for running the profiler; we have pinpointed the performance degradation to the operators broadcast_axis (from 138 ms to 177 ms) and MXNDArraySyncCopyToCPU (from 592 ms to 679 ms).
Running the operator-level profiler, we could identify the 2.2x performance drop in the broadcast_axis operator.
w/o USE_INT64_TENSOR_SIZE flag:
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 2.7753}]}]
w/ USE_INT64_TENSOR_SIZE flag:
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 6.3178}]}]
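The dumps above look like output from MXNet's OpPerf utility; a minimal manual sketch that times the same broadcast_axis input (shapes taken from the profiler output; warmup and repeat counts are arbitrary) would be:

```python
import time
import mxnet as mx

data = mx.nd.ones((1, 1024, 1), ctx=mx.gpu(0))

# Warm up so allocation and kernel setup are not measured.
for _ in range(10):
    mx.nd.broadcast_axis(data, axis=(0, 2), size=(1024, 8))
mx.nd.waitall()

runs = 100
start = time.perf_counter()
for _ in range(runs):
    out = mx.nd.broadcast_axis(data, axis=(0, 2), size=(1024, 8))
mx.nd.waitall()  # block until all asynchronous GPU work has finished
print("avg forward time (ms):", (time.perf_counter() - start) / runs * 1000)
```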
The root cause: too many div/mul/mod ALU operations on the indices, whose type changed from int32 to int64. The kernel below shows the per-element index computation:
template<typename OP>
struct broadcast_kernel {
  template<typename IType, typename OType>
  MSHADOW_XINLINE static void Map(index_t i,
                                  IType *input,
                                  OType *output,
                                  mshadow::Shape<MXNET_SPECIAL_MAX_NDIM> in_shape,
                                  mshadow::Shape<MXNET_SPECIAL_MAX_NDIM> out_shape,
                                  const OpReqType req,
                                  const uint32_t ndim) {
    size_t in_stride = 1;
    size_t out_stride = 1;
    index_t idx = i;
    index_t in_idx = i;
    // Walk the dimensions from innermost to outermost, converting the flat
    // output index i into the corresponding flat input index. Every iteration
    // does a mod and a div on idx, which are index_t operations and therefore
    // widen from 32-bit to 64-bit arithmetic when USE_INT64_TENSOR_SIZE is ON.
    for (int iter = ndim - 1; iter >= 0; --iter) {
      size_t dim_idx = idx % out_shape[iter];
      in_idx -= dim_idx * out_stride;
      if (in_shape[iter] != 1) {
        // Only advance the input index along dimensions that are not broadcast.
        in_idx += dim_idx * in_stride;
      }
      idx /= out_shape[iter];
      in_stride *= in_shape[iter];
      out_stride *= out_shape[iter];
    }
    KERNEL_ASSIGN(output[i], req, OP::Map(input[in_idx]));
  }
};
Add LT support to ops found via OpPerf
NN optimizers and 1 activation https://github.com/apache/incubator-mxnet/pull/17444 [Merged]
Random, Sample, PDF ops : https://github.com/apache/incubator-mxnet/pull/17445 [Merged]
[OpPerf] : Indexing Ops https://github.com/apache/incubator-mxnet/pull/16253 [Merged]
[OpPerf] : Neural Network Loss Ops https://github.com/apache/incubator-mxnet/pull/17482 [Merged]
[OpPerf] : Consolidate array manipulation related operators #17487
Inference benchmarks comparing the LT + MKL build with just MKL enabled.
All times are in ms.
% Diff is calculated as 1 - (P50 with LT / P50 without LT).
A positive number means a speed increase with LT, a negative number means a speed decrease.
Model | Mode | P50 w/ LT (ms) | P50 No LT (ms) | Percentage Difference
-- | -- | -- | -- | --
resnext101_64x4d | gluon | 47.34253883 | 49.46685 | 4.29%
resnext101_64x4d | module | 28.83672714 | 28.48792 | -1.22%
resnext50 | gluon | 17.14539528 | 18.05592 | 5.04%
resnext50 | module | 10.05506516 | 9.636641 | -4.34%
nin | gluon | 2.574443817 | 2.608061 | 1.29%
nin | module | 2.432107925 | 2.737761 | 11.16%
resnet18 | gluon | 3.895759583 | 3.638268 | -7.08%
resnet18 | module | 2.954959869 | 3.182888 | 7.16%
wavernn | gluon | 262.9389763 | 256.5546 | -2.49%
caffenet | gluon | 2.930879593 | 3.087759 | 5.08%
caffenet | module | 3.169536591 | 3.225327 | 1.73%
vgg19 | gluon | 14.18304443 | 13.89098 | -2.10%
vgg19 | module | 13.80157471 | 14.33492 | 3.72%
maskrcnn | gluon | 2340.852737 | 2391.741 | 2.13%
maskrcnn | module | 1943.515778 | 1926.38 | -0.89%
superres | gluon | 17.39168167 | 18.00895 | 3.43%
superres | module | 16.98470116 | 17.26198 | 1.61%
resnet101 | gluon | 18.73707771 | 18.4412 | -1.60%
resnet101 | module | 16.66593552 | 14.78386 | -12.73%
vgg16 | gluon | 12.403965 | 16.2611 | 23.72%
vgg16 | module | 17.93074608 | 11.83605 | -51.49%
yolov3 | gluon | 22.96686172 | 23.01311 | 0.20%
yolov3 | module | 18.57829094 | 20.05506 | 7.36%
ssd | gluon | 17.17400551 | 16.73698 | -2.61%
ssd | module | 13.98611069 | 14.00757 | 0.15%
rnn | gluon | 28.2740593 | 28.92017 | 2.23%
rnn | module | 19.32096481 | 28.63479 | 32.53%
a3c | gluon | 0.928401947 | 0.94223 | 1.47%
a3c | module | 0.673055649 | 0.858545 | 21.61%
squeezenetv10 | gluon | 4.072666168 | 4.251957 | 4.22%
squeezenetv10 | module | 3.686189651 | 3.818274 | 3.46%
resnet152 | gluon | 25.8705616 | 27.65441 | 6.45%
resnet152 | module | 20.5206871 | 21.03257 | 2.43%
resnet34 | gluon | 6.978273392 | 7.166862 | 2.63%
resnet34 | module | 5.693674088 | 5.653858 | -0.70%
squeezenetv11 | gluon | 3.037929535 | 3.165722 | 4.04%
squeezenetv11 | module | 2.890110016 | 2.983332 | 3.12%
resnext101 | gluon | 29.1929245 | 27.65107 | -5.58%
resnext101 | module | 15.9804821 | 17.51709 | 8.77%
bert | gluon | 44.32678223 | 43.77675 | -1.26%
bert | module | 43.85828972 | 45.38655 | 3.37%
resnet50 | gluon | 10.39171219 | 10.31256 | -0.77%
resnet50 | module | 9.351491928 | 8.312941 | -12.49%
fasterrcnn | gluon | 1041.807413 | 1061.532 | 1.86%
fasterrcnn | module | 702.3141384 | 703.7232 | 0.20%
inception | gluon | 7.934331894 | 8.714437 | 8.95%
inception | module | 5.178928375 | 5.363703 | 3.44%
drmm | gluon | 837.1179104 | 614.3708 | -36.26%
drmm | module | 830.9795856 | 607.6496 | -36.75%
Average Percentage Change over all numbers:
Gluon: 0.69%
Module: -0.37%
Training benchmarks comparing the LT + MKL build with just MKL enabled.
Speed is measured in seconds per epoch.
GPU memory is measured in MB.
Note: samples/second moves in the opposite direction, so I have multiplied those percentages by -1. A quick explanation: samples/second should go up, so without the sign flip a positive percentage change would mean fewer samples/second and a negative change would mean more samples/second.
Model | Speed P50 LT | Speed P50 No LT | GPU Memory LT | GPU Memory No LT | Samples/Second P50 LT | Samples/Second P50 no LT | Speed Percentage Change | GPU Memory Percentage Change | Samples/Second Percentage Change
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
xception | 19247.12517 | 18935.02989 | 15304 | 15320 | 67.51961 | 68.61849 | -1.65% | 0.10% | -1.60%
resnet50_v2 | 4342.953992 | 4342.899322 | 6892 | 6762 | 299.0174 | 299.1728 | 0.00% | -1.92% | -0.05%
gnmt | N/A | N/A | 4244 | 4112 | 7.65 | 7.675 | N/A | -3.21% | -0.33%
vgg16 | 5680.658345 | 5641.058277 | 9480 | 9496 | 228.4218 | 230.0739 | -0.70% | 0.17% | -0.72%
bert | 20.66 | 16.8 | 4684 | 4050 | 38.1 | 46.7 | -22.98% | -15.65% | -18.42%
yolo3_darknet53_custom | 517.4205 | 454.908 | 7304 | 12436 | 31.6145 | 40.65 | -13.74% | 41.27% | -22.23%
inceptionv3 | 5765.122603 | 5723.867063 | 8318 | 8304 | 225.4025 | 227.1884 | -0.72% | -0.17% | -0.79%
se_resnet152_v1 | 10497.33863 | 10465.23692 | 11290 | 10568 | 123.7371 | 124.1493 | -0.31% | -6.83% | -0.33%
word_language_model | 141.125 | 142.3 | 8846 | 7426 | 15651.19 | 15524.71 | 0.83% | -19.12% | 0.81%
mobilenet0.25_cifar10 | 56.6609205 | 60.5992765 | 1234 | 1134 | N/A | N/A | 6.50% | -8.82% | N/A
resnet101_v1 | 7354.353666 | 7329.202738 | 8118 | 8022 | 176.6355 | 177.3132 | -0.34% | -1.20% | -0.38%
squeezenet1.0 | 1677.752777 | 1678.684668 | 3770 | 3590 | 790.7722 | 790.1395 | 0.06% | -5.01% | 0.08%
mobilenetv2_0.75 | 1938.194231 | 1968.429737 | 5078 | 5008 | 680.4143 | 672.2202 | 1.54% | -1.40% | 1.22%
ssd | 424.28 | 254.9485 | 4702 | 4592 | 66.2365 | 67.56 | -66.42% | -2.40% | -1.96%
Average Percentage Change:
Speed: -7.53%
GPU Memory: -1.73%
Samples / Second: -3.44%
@jonatan1626 thanks for the update. Does -22.98% mean 22.98% slower?
@eric-haibin-lin Yes, I am calculating this as: 1 - (LT+MKL value / MKL-only value).
For the samples/sec I am doing the above and then multiplying by -1.
In your description, "A negative percentage change means there are more samples/second." Doesn't that mean a negative percentage is faster?
@apeforest Oh sorry, I'm multiplying only the samples/second column by -1 to keep the meaning consistent with everything else. The rest of the columns already follow the convention of a positive percentage meaning improvement and a negative percentage meaning degradation.
For example if MKL_LT gives 66 samples/sec and MKL gives 70 samples/sec that will be:
1-(66/70) or 6%. Because it's positive, we think that it's better but actually it's worse because the throughput has gone down.
On the other hand if MKL_LT gives 74 samples/sec and MKL gives 70 samples/sec that will be:
1-(74/70) or -5%. Because it's negative, we think it's worse but actually it's better because our throughput has gone up.
So I multiply by -1 to give it the same meaning as the rest of the percentages, where positive is better and negative is worse.
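A small illustrative sketch of this sign convention, using the numbers from the example above:

```python
def pct_change(lt_value, no_lt_value, higher_is_better=False):
    """1 - (LT / no-LT), sign-flipped for throughput-style metrics so that a
    positive result always means the large-tensor (LT) build is better."""
    diff = 1 - lt_value / no_lt_value
    return -diff * 100 if higher_is_better else diff * 100

# Latency-style metric (lower is better): positive means LT is faster.
print(round(pct_change(47.34, 49.47), 2))  # ~4.3

# Throughput-style metric (higher is better): 66 vs 70 samples/sec.
print(round(pct_change(66, 70, higher_is_better=True), 2))  # ~-5.7, i.e. LT is worse
```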
The slowdown for BERT (-22.98%) is quite significant. We will need to mitigate this before moving forward.
Thanks to @JonTanS for running the profiler; we have pinpointed the performance degradation to the operators broadcast_axis (from 138 ms to 177 ms) and MXNDArraySyncCopyToCPU (from 592 ms to 679 ms).
Running the operator-level profiler, we could also identify the performance drop in broadcast_axis alone.
w/o USE_INT64_TENSOR_SIZE flag:
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 2.7753}]}]
w/ USE_INT64_TENSOR_SIZE flag:
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 6.3178}]}]
Also, as I looked into the implementation of the broadcast_axis operator, many modulo and multiplication operations on the indices are involved. The next step will be to find an optimal implementation of broadcast_axis that reduces the ALU operations on indices in the kernel.
@szha @eric-haibin-lin @apeforest
With current master and the new broadcast_axis changes, on a p3.16xl single-GPU training run.
BERT run command:
python3 run_pretraining.py --data='./part-0000.train' --data_eval='./part-0000.train' --num_steps 100 --lr 1e-4 --optimizer lamb --accumulate 1 --raw --gpus 0 --num_dataset_workers 2 --num_batch_workers 1 --circle_length 1 --total_batch_size 4 --total_batch_size_eval 4 --log_interval 10
Results:
| Code Version | throughput avg (samples/sec) | throughput p50 | throughput p90 | total time (training only, ignoring evaluation steps) |
|--------------|------------------------------|----------------|----------------|-------------------------------------------------------|
| master LT | 24.38k | 25.50k | 28.47k | 134.8 sec |
| master | 25.90k | 25.90k | 27.82k | 131.9 sec |
| new LT | 25.87k | 25.80k | 28.00k | 127.3 sec |
| new | 25.92k | 25.80k | 27.80k | 131.5 sec |
"new" refers to mxnet code with optimized broadcast_axis.
"master" refers to mxnet master branch code
"LT" refers to of the build was done after enabling large tensor.
@access2rohit This result is a little surprising. In the earlier benchmark results provided by @JonTanS, there is a ~18% degradation in BERT training when large tensor (LT) compiler flag is turned on:
bert | 38.1 | 46.7 | -18.42%
-- | -- | -- | --
However, from your result, even without your latest speedup in the broadcast_axis operator, there is very little difference when the LT flag is on:
master LT | 24.38k | 25.50k | 28.47k | 134.8 sec
-- | -- | -- | -- | --
master | 25.90k | 25.90k | 27.82k | 131.9 sec
Could you provide more insights?
@apeforest The profiling done by @JonTanS was done a while back, using MXNet 1.6 in November. These results use the current master branch of MXNet, and the BERT scripts have changed too. If there are newer settings for running BERT on a single node, they are not available on the GluonNLP site. If @eric-haibin-lin or @szhengac can verify whether my BERT setup is correct and provide proper tuning params to run BERT on a single node, I will re-run the benchmarks and update the results here.
PR https://github.com/apache/incubator-mxnet/pull/17882 fixes the regression in SSD. Following are the new results for the SSD run:
Code | SSD 1-epoch time (sec) | % speedup/slowdown w.r.t. Master (large tensor disabled)
-- | -- | --
Master (large tensor disabled) | 226 | 0
Master (large tensor enabled) | 335 | 48.23% slowdown
Master + CPU Optimized broadcast_axis (large tensor disabled) | 130 | 42.5% speedup
Master + CPU Optimized broadcast_axis (large tensor enabled) | 184 | 18.5% speedup
@apeforest @sandeep-krishnamurthy @szha @zheng-da
PRs to enable Large Tensor Support as default in master are divided into two stages:
Stage 1: Unix CPU/GPU and Windows CPU/GPU https://github.com/apache/incubator-mxnet/pull/18625
Stage 2: All remaining platforms https://github.com/apache/incubator-mxnet/pull/18626
Once the above two PRs are merged, MXNet will support large tensors on CPU/GPU (depending on global/device memory) on master.
Currently, Large Tensor Support works on all operators implemented in MXNet, and MKLDNN also supports int64. Kernels written inside MXNet, both generic (cpu/gpu) and GPU-specific CUDA kernels, support large tensors depending on device memory.
BLAS and LAPACK libs were not considered while defining the scope of the project. Currently the following BLAS and LAPACK implementations are supported inside MXNet:
openBLAS (Default)
MKL
ATLAS
Apple Accelerate
Upon investigation, openBLAS needs to be built with a specific flag to support int64_t signatures, and MKL supports long long int signatures (in which case reinterpret_cast<>() is needed for casting pointers, since int64_t is treated as long int* rather than long long int* by MKL). Additionally, the LAPACK and BLAS wrappers need to be updated from int to int64_t.
Initially, openBLAS can be supported since it is used by default and in the PyPI wheels as well, thus not breaking any default behaviour for customers. Users of other BLAS and LAPACK implementations won't face issues as long as they don't use large tensors. Additional error messages will be added for the case where large tensors are used while the BLAS implementation is not openBLAS, until that BLAS library is made to work with MXNet's large tensor support.
NOTE: currently openBLAS works correctly with smaller inputs (within the range of int32) but will truncate parameters passed with larger values, which will result in either SIGSEGV (mostly) or garbage values (which will eventually cause a SIGSEGV in a bigger script).
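Purely as an illustration of that failure mode (hypothetical shapes; this needs roughly 18 GB of free memory, assumes dot on dense 2-D float inputs dispatches to the BLAS gemm path, and on a build without an int64-capable BLAS it may crash rather than raise a clean error):

```python
import mxnet as mx

n = 2**31 + 1  # one dimension larger than a 32-bit int can represent

# Each operand is ~8.6 GB in float32. The dimension n is passed down to the
# BLAS library; with a 32-bit-int BLAS interface it gets truncated, which can
# lead to SIGSEGV or silently wrong results.
a = mx.nd.ones((1, n))
b = mx.nd.ones((n, 1))
c = mx.nd.dot(a, b)
c.wait_to_read()
print("dot completed, result shape:", c.shape)  # (1, 1) if the int64 BLAS path works
```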
@sandeep-krishnamurthy @leezu @szha @zheng-da
Thanks @access2rohit for the summary.
Is the plan for enabling Large Tensor Support in the following order?
Do you see this order of execution as okay, @access2rohit @leezu @szha @zheng-da?
Has the large tensor for numpy array been supported?
@access2rohit can correct me, but a few of them are supported as they use the same kernels under the hood. The scope of this issue was mainly the NDArray API when it got started. After these are done, the remaining NumPy ops will also be supported.
Make openBLAS compatible with Large Tensor support and merge the PR for Enabling Large Tensor Support so that default PyPI users of MXNet can already benefit from the new capability. This will actually cover the largest user base of MXNet.
yes
Has the large tensor for numpy array been supported?
Upon inspecting the NumPy files inside MXNet, they are using index_t for iterating over elements in their own kernels and use the NDArray ones for the rest, where we ensured index_t is used where required. For kernels using BLAS, I will update them in the same PR that makes the MXNet wrappers for openBLAS int64 compatible.
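A quick hypothetical smoke test for the NumPy front end (assuming an LT-enabled build and about 4 GB of free memory; per the discussion below, some NumPy-front-end paths may not handle this yet):

```python
import mxnet as mx

# Allocate a numpy-frontend array with more than 2^31 elements (int8 keeps each
# array ~2 GB) and run a simple element-wise op plus an indexing access on it.
x = mx.np.ones((2**31 + 1,), dtype='int8')
y = x + 1
print(x.size)   # expected: 2147483649
print(y[-1])    # expected: 2
```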
NOTE: currently openBLAS works correctly with smaller inputs (within the range of int32) but will truncate parameters passed with larger values, which will result in either SIGSEGV (mostly) or garbage values (which will eventually cause a SIGSEGV in a bigger script).
I'm a little concerned that we don't have a correct integration of BLAS and LAPACK: calls into BLAS kernels may get crashes or corrupt results. But I think @sandeep-krishnamurthy's point
Make openBLAS compatible with Large Tensor support and merge the PR for Enabling Large Tensor Support so that default PyPI users of MXNet can already benefit from the new capability. This will actually cover the largest user base of MXNet.
refers to fixing this? If so, I'm fine with the order of execution. Thank you @access2rohit for the hard work on this feature
Upon investigation, openBLAS needs to be built with a specific flag to support int64_t signatures, and MKL supports long long int signatures (in which case reinterpret_cast<>() is needed for casting pointers, since int64_t is treated as long int* rather than long long int* by MKL). Additionally, the LAPACK and BLAS wrappers need to be updated from int to int64_t.
@leezu yes, that's what I meant.
I think the numpy frontend hasn't supported large tensors yet. I started working on it here https://github.com/apache/incubator-mxnet/pull/18368 but I haven't found the time to finish migrating all the tests. @access2rohit would you be able to help out and take that over?