Apex: Mixed precision training slow

Created on 24 May 2019 · 11 Comments · Source: NVIDIA/apex

Hi,

I'm trying to fine-tune BERT using the "Bert fine-tuning" example.

My problem is: after enabling apex, GPU memory usage is reduced, but training takes about 1.3 times as long as before.

My GPU is a V100 (16 GB, CUDA 9, cuDNN 7), and the PyTorch version is 1.0.

Is it a problem with my hardware?

fatmelon

All 11 comments

A single V100, or multiple? Also, what level of device utilization are you achieving? For a quick-and-dirty (by no means definitive) check, try watch -n 0.5 nvidia-smi from another terminal while you run BERT, and see what fraction of device memory you are using.

We've got some people right now working on optimizing BERT specifically. I'll let you know if we observe similar behavior, and detail whatever best practices we discover.

mcarilli on 24 May 2019

A single V100.
opt_level = "O1"

| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   63C    P0   218W / 250W |  13495MiB / 16160MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
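The fraction of device memory in use, which the first reply asks about, can be read off these figures. Below is a minimal sketch that parses one line of `nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits`. The helper name `parse_gpu_query` is hypothetical, and the sample line is assembled by hand from the numbers in the output above rather than captured from a real run.

```python
def parse_gpu_query(line: str) -> dict:
    """Parse one 'used_mib, total_mib, util_percent' CSV line.

    Assumes the line comes from nvidia-smi with
    --query-gpu=memory.used,memory.total,utilization.gpu
    --format=csv,noheader,nounits (one line per GPU).
    """
    used_mib, total_mib, util_percent = (float(x) for x in line.split(","))
    return {
        "memory_fraction": used_mib / total_mib,
        "gpu_util_percent": util_percent,
    }


# Hand-built sample using the figures reported above (not captured output).
sample_line = "13495, 16160, 100"
stats = parse_gpu_query(sample_line)
print(f"memory in use: {stats['memory_fraction']:.1%}, "
      f"GPU utilization: {stats['gpu_util_percent']:.0f}%")
```

With the figures above, roughly 83.5% of device memory is in use and the GPU reports 100% utilization. As the first reply notes, this is a quick-and-dirty check rather than a definitive one.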

fatmelon on 25 May 2019

Update:

I tried the code on a 2080 Ti with Docker, and everything works fine. Memory usage is reduced and training is also faster. I hope this helps to find the problem.

fatmelon on 27 May 2019

Use CUDA 10; Tensor Core computation is much faster with it (2080 Ti).
Also, the CUDA version your PyTorch build was compiled against should match the installed CUDA version.

seongwook-ham on 2 Jun 2019

Recently, I was helping optimize an internal version of BERT with @sharatht. We're using Amp with opt_level=O1, so all GEMMs are patched to cast inputs and weights such that each GEMM itself runs in FP16 (the weights are stored in FP32, but cast to FP16 on entrance to torch.mm functions). However, Tensor Cores additionally require that participating dimensions of a GEMM are multiples of 8 (otherwise cublas will fall back to a slower, non-Tensor Core enabled kernel, even if the input and weight entering the GEMM are FP16).

We noticed that the dictionary size was not a multiple of 8, which prevented Tensor Core use for FP16 GEMMs in a particular linear decoder layer, causing that layer to take an annoyingly long time, even with Amp.

See https://github.com/NVIDIA/apex/issues/221#issuecomment-478084841. Bert is not rnn-based, but the same concepts apply (to enable Tensor Core use with Amp, you should make sure any dimensions that participate in GEMMs are multiples of 8).
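As a rough illustration of the multiple-of-8 rule described above, here is a minimal sketch that checks whether the dimensions of a GEMM qualify and rounds a vocabulary size up to the next multiple of 8. The helper names are hypothetical. The sizes are illustrative only: 30522 is the bert-base-uncased vocabulary size, 768 is its hidden size, and the batch shape is made up. They are not the internal model discussed in this thread.

```python
def pad_to_multiple(size: int, multiple: int = 8) -> int:
    """Round `size` up to the nearest multiple of `multiple`."""
    return ((size + multiple - 1) // multiple) * multiple


def gemm_dims_are_multiples(m: int, n: int, k: int, multiple: int = 8) -> bool:
    """Return True if every dimension of an (m x k) @ (k x n) GEMM
    is a multiple of `multiple` (the condition described above)."""
    return all(dim % multiple == 0 for dim in (m, n, k))


# Illustrative sizes only (assumptions, not taken from this thread).
tokens_per_batch = 32 * 128   # batch size * sequence length
hidden_size = 768
vocab_size = 30522

# The decoder projects hidden states to vocabulary logits:
# (tokens_per_batch x hidden_size) @ (hidden_size x vocab_size)
before = gemm_dims_are_multiples(tokens_per_batch, vocab_size, hidden_size)

padded_vocab_size = pad_to_multiple(vocab_size)
after = gemm_dims_are_multiples(tokens_per_batch, padded_vocab_size, hidden_size)

print(f"vocab {vocab_size}: all dims multiples of 8? {before}")
print(f"vocab {padded_vocab_size}: all dims multiples of 8? {after}")
```

In a real model, the embedding and decoder layers would be built with the padded size, and the extra entries would simply go unused. How best to do that depends on the model code, so treat this as a starting point.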

mcarilli on 5 Jun 2019

I have the same issue. My model is this one https://github.com/kenshohara/3D-ResNets-PyTorch

Activating O1 with apex gives degraded performance on a 2080 Ti compared to a 1080 Ti. But using a dumb .half() everywhere shows that the 2080 Ti is indeed faster.

hyperfraise on 18 Jul 2019

Hi @hyperfraise,

No script in your repo seems to import apex.
Could you add the script you are using to profile the code and let us know how to reproduce the issue?

ptrblck on 18 Jul 2019

Please see this link for reproducible code: https://github.com/hyperfraise/Apex-bench

hyperfraise on 19 Jul 2019

I think this is related to this https://github.com/pytorch/pytorch/issues/22961

hyperfraise on 19 Jul 2019

After profiling with torch.autograd.profiler.profile, I observed the following issue: a significant amount of time is spent on the CPU side during CudnnConvolutionBackward, cudnn_convolution_backward, CudnnBatchNormBackward, and cudnn_batch_norm_backward. Note that I am using half precision (via apex), and my network uses 3D convolution operations. I use cuDNN 7.6.1, CUDA 10.0, and PyTorch 1.1.0. The GPU is an RTX 2080 Ti.

In contrast, a dumb approach which uses .half() only spends a tiny fraction of this time on the CPU side.
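For readers who want to collect a comparable profile, here is a minimal sketch of wrapping one forward/backward step in torch.autograd.profiler.profile. The tiny 3D-convolution model, the tensor shapes, and the helper name `profile_step` are placeholders chosen for illustration, not the network or script used for the tables in this comment. The sketch runs on CPU if no GPU is available, in which case the CUDA columns are not collected. It also assumes a PyTorch version that provides nn.Flatten.

```python
import torch
import torch.nn as nn
from torch.autograd import profiler


def profile_step(model, inputs, targets, use_cuda=False):
    """Profile a single forward/backward pass and return the profiler."""
    criterion = nn.CrossEntropyLoss()
    # Only request CUDA timings when a GPU is actually in use.
    kwargs = {"use_cuda": True} if use_cuda else {}
    with profiler.profile(**kwargs) as prof:
        loss = criterion(model(inputs), targets)
        loss.backward()
    return prof


device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model: a small 3D conv block, not the network from this thread.
model = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm3d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(8, 4),
).to(device)

inputs = torch.randn(2, 3, 4, 8, 8, device=device)   # (N, C, D, H, W)
targets = torch.randint(0, 4, (2,), device=device)

prof = profile_step(model, inputs, targets, use_cuda=(device == "cuda"))

# Aggregate events by name and sort by time spent on the CPU side.
report = prof.key_averages().table(sort_by="self_cpu_time_total")
print(report)
```

Comparing the CPU and CUDA columns for the convolution and batch-norm backward entries, once with plain .half() and once with apex, is one way to reproduce the comparison made here.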

RTX 2080 ti with torch half

------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                  Self CPU total %   Self CPU total      CPU total %        CPU total     CPU time avg     CUDA total %       CUDA total    CUDA time avg  Number of Calls
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
torch::autograd::GraphRoot                      0.01%         30.060us            0.01%         30.060us         30.060us            0.00%          8.320us          8.320us                1
NllLossBackward                                 0.08%        253.392us            0.08%        253.392us        253.392us            0.00%        246.368us        246.368us                1
nll_loss_backward                               0.06%        177.542us            0.06%        177.542us        177.542us            0.00%        176.064us        176.064us                1
LogSoftmaxBackward                              0.03%         92.631us            0.03%         92.631us         92.631us            0.00%         92.160us         92.160us                1
_log_softmax_backward_data                      0.02%         75.321us            0.02%         75.321us         75.321us            0.00%         77.152us         77.152us                1
AddmmBackward                                   0.09%        272.563us            0.09%        272.563us        272.563us            0.01%        272.544us        272.544us                1
unsigned short                                  0.01%         19.150us            0.01%         19.150us         19.150us            0.00%         18.592us         18.592us                1
mm                                              0.04%        123.522us            0.04%        123.522us        123.522us            0.00%        125.408us        125.408us                1
unsigned short                                  0.00%         12.120us            0.00%         12.120us         12.120us            0.00%         12.288us         12.288us                1
mm                                              0.02%         56.040us            0.02%         56.040us         56.040us            0.00%         57.376us         57.376us                1
unsigned short                                  0.00%          7.751us            0.00%          7.751us          7.751us            0.00%          7.168us          7.168us                1
sum                                             0.03%         89.521us            0.03%         89.521us         89.521us            0.00%         90.368us         90.368us                1
view                                            0.00%         15.110us            0.00%         15.110us         15.110us            0.00%         15.488us         15.488us                1
torch::autograd::AccumulateGrad                 0.01%         20.210us            0.01%         20.210us         20.210us            0.00%         20.352us         20.352us                1
TBackward                                       0.01%         16.341us            0.01%         16.341us         16.341us            0.00%         16.096us         16.096us                1
unsigned short                                  0.00%          7.851us            0.00%          7.851us          7.851us            0.00%          7.712us          7.712us                1
torch::autograd::AccumulateGrad                 0.00%          5.730us            0.00%          5.730us          5.730us            0.00%          4.960us          4.960us                1
ViewBackward                                    0.01%         36.970us            0.01%         36.970us         36.970us            0.00%         36.576us         36.576us                1
reshape                                         0.01%         28.080us            0.01%         28.080us         28.080us            0.00%         28.512us         28.512us                1
as_strided                                      0.00%          7.000us            0.00%          7.000us          7.000us            0.00%          7.680us          7.680us                1
AdaptiveAvgPool3DBackward                       0.02%         77.891us            0.02%         77.891us         77.891us            0.02%        808.960us        808.960us                1
adaptive_avg_pool3d_backward                    0.02%         64.461us            0.02%         64.461us         64.461us            0.02%        800.512us        800.512us                1
ReluBackward1                                   0.02%         59.111us            0.02%         59.111us         59.111us            0.00%         40.960us         40.960us                1
threshold_backward                              0.01%         42.751us            0.01%         42.751us         42.751us            0.00%         38.304us         38.304us                1
AddBackward0                                    0.00%          4.440us            0.00%          4.440us          4.440us            0.00%          1.632us          1.632us                1
NativeBatchNormBackward                         0.03%        103.371us            0.03%        103.371us        103.371us            0.00%         74.496us         74.496us                1
native_batch_norm_backward                      0.02%         75.431us            0.02%         75.431us         75.431us            0.00%         71.680us         71.680us                1
torch::autograd::AccumulateGrad                 0.00%          6.361us            0.00%          6.361us          6.361us            0.00%          0.704us          0.704us                1
torch::autograd::AccumulateGrad                 0.00%          4.970us            0.00%          4.970us          4.970us            0.00%          1.824us          1.824us                1
CudnnConvolutionBackward                        0.69%          2.191ms            0.69%          2.191ms          2.191ms            0.04%          2.274ms          2.274ms                1
cudnn_convolution_backward                      0.68%          2.171ms            0.68%          2.171ms          2.171ms            0.04%          2.271ms          2.271ms                1
torch::autograd::AccumulateGrad                 0.00%          6.710us            0.00%          6.710us          6.710us            0.00%          0.929us          0.929us                1
ReluBackward1                                   0.01%         46.211us            0.01%         46.211us         46.211us            0.00%         26.592us         26.592us                1
threshold_backward                              0.01%         33.381us            0.01%         33.381us         33.381us            0.00%         22.880us         22.880us                1
NativeBatchNormBackward                         0.02%         65.761us            0.02%         65.761us         65.761us            0.00%         43.584us         43.584us                1
native_batch_norm_backward                      0.01%         46.851us            0.01%         46.851us         46.851us            0.00%         40.960us         40.960us                1
torch::autograd::AccumulateGrad                 0.00%          6.090us            0.00%          6.090us          6.090us            0.00%          1.729us          1.729us                1
torch::autograd::AccumulateGrad                 0.00%          4.590us            0.00%          4.590us          4.590us            0.00%          0.832us          0.832us                1
CudnnConvolutionBackward                        0.47%          1.495ms            0.47%          1.495ms          1.495ms            0.03%          1.626ms          1.626ms                1
cudnn_convolution_backward                      0.46%          1.479ms            0.46%          1.479ms          1.479ms            0.03%          1.622ms          1.622ms                1
torch::autograd::AccumulateGrad                 0.00%          6.580us            0.00%          6.580us          6.580us            0.00%          2.048us          2.048us                1
ReluBackward1                                   0.01%         43.341us            0.01%         43.341us         43.341us            0.00%         22.688us         22.688us                1
threshold_backward                              0.01%         31.021us            0.01%         31.021us         31.021us            0.00%         19.136us         19.136us                1
NativeBatchNormBackward                         0.02%         64.161us            0.02%         64.161us         64.161us            0.00%         40.320us         40.320us                1
native_batch_norm_backward                      0.01%         45.981us            0.01%         45.981us         45.981us            0.00%         37.312us         37.312us                1
torch::autograd::AccumulateGrad                 0.00%         10.750us            0.00%         10.750us         10.750us            0.00%          2.048us          2.048us                1
torch::autograd::AccumulateGrad                 0.00%          4.750us            0.00%          4.750us          4.750us            0.00%          1.504us          1.504us                1
CudnnConvolutionBackward                        0.06%        187.662us            0.06%        187.662us        187.662us            0.03%          1.384ms          1.384ms                1
cudnn_convolution_backward                      0.05%        173.032us            0.05%        173.032us        173.032us            0.03%          1.381ms          1.381ms                1
add                                             0.01%         40.201us            0.01%         40.201us         40.201us            0.00%         34.528us         34.528us                1
torch::autograd::AccumulateGrad                 0.00%          6.110us            0.00%          6.110us          6.110us            0.00%          0.832us          0.832us                1
ReluBackward1                                   0.01%         37.130us            0.01%         37.130us         37.130us            0.00%         45.057us         45.057us                1
threshold_backward                              0.01%         25.600us            0.01%         25.600us         25.600us            0.00%         42.592us         42.592us                1
AddBackward0                                    0.00%          4.000us            0.00%          4.000us          4.000us            0.00%          1.761us          1.761us                1
NativeBatchNormBackward                         0.02%         57.550us            0.02%         57.550us         57.550us            0.00%         76.607us         76.607us                1
native_batch_norm_backward                      0.01%         39.830us            0.01%         39.830us         39.830us            0.00%         75.008us         75.008us                1
torch::autograd::AccumulateGrad                 0.00%          6.060us            0.00%          6.060us          6.060us            0.00%          1.695us          1.695us                1
torch::autograd::AccumulateGrad                 0.00%          4.720us            0.00%          4.720us          4.720us            0.00%          0.736us          0.736us                1
CudnnConvolutionBackward                        0.05%        153.481us            0.05%        153.481us        153.481us            0.03%          1.411ms          1.411ms                1
cudnn_convolution_backward                      0.04%        134.891us            0.04%        134.891us        134.891us            0.03%          1.408ms          1.408ms                1
torch::autograd::AccumulateGrad                 0.00%          6.150us            0.00%          6.150us          6.150us            0.00%          1.568us          1.568us                1
ReluBackward1                                   0.01%         46.971us            0.01%         46.971us         46.971us            0.00%         27.487us         27.487us                1
threshold_backward                              0.01%         31.490us            0.01%         31.490us         31.490us            0.00%         26.111us         26.111us                1
NativeBatchNormBackward                         0.02%         64.061us            0.02%         64.061us         64.061us            0.00%         47.104us         47.104us                1
native_batch_norm_backward                      0.01%         38.801us            0.01%         38.801us         38.801us            0.00%         44.353us         44.353us                1
torch::autograd::AccumulateGrad                 0.00%          5.890us            0.00%          5.890us          5.890us            0.00%          1.695us          1.695us                1
torch::autograd::AccumulateGrad                 0.00%          4.540us            0.00%          4.540us          4.540us            0.00%          0.896us          0.896us                1
CudnnConvolutionBackward                        0.43%          1.358ms            0.43%          1.358ms          1.358ms            0.03%          1.624ms          1.624ms                1
cudnn_convolution_backward                      0.42%          1.343ms            0.42%          1.343ms          1.343ms            0.03%          1.620ms          1.620ms                1
torch::autograd::AccumulateGrad                 0.00%          6.400us            0.00%          6.400us          6.400us            0.00%          1.663us          1.663us                1
ReluBackward1                                   0.02%         49.950us            0.02%         49.950us         49.950us            0.00%         27.553us         27.553us                1
threshold_backward                              0.01%         37.140us            0.01%         37.140us         37.140us            0.00%         24.575us         24.575us                1
NativeBatchNormBackward                         0.02%         63.521us            0.02%         63.521us         63.521us            0.00%         43.391us         43.391us                1
native_batch_norm_backward                      0.01%         45.331us            0.01%         45.331us         45.331us            0.00%         41.119us         41.119us                1
torch::autograd::AccumulateGrad                 0.00%          6.310us            0.00%          6.310us          6.310us            0.00%          1.664us          1.664us                1
torch::autograd::AccumulateGrad                 0.00%          4.830us            0.00%          4.830us          4.830us            0.00%          0.896us          0.896us                1
CudnnConvolutionBackward                        0.04%        135.992us            0.04%        135.992us        135.992us            0.03%          1.393ms          1.393ms                1
cudnn_convolution_backward                      0.04%        118.831us            0.04%        118.831us        118.831us            0.03%          1.389ms          1.389ms                1
add                                             0.01%         28.780us            0.01%         28.780us         28.780us            0.00%         35.008us         35.008us                1
torch::autograd::AccumulateGrad                 0.00%          6.130us            0.00%          6.130us          6.130us            0.00%          1.951us          1.951us                1
ReluBackward1                                   0.01%         42.411us            0.01%         42.411us         42.411us            0.00%         46.943us         46.943us                1
threshold_backward                              0.01%         30.281us            0.01%         30.281us         30.281us            0.00%         44.770us         44.770us                1
AddBackward0                                    0.00%          4.210us            0.00%          4.210us          4.210us            0.00%          2.049us          2.049us                1
NativeBatchNormBackward                         0.02%         62.710us            0.02%         62.710us         62.710us            0.00%         80.287us         80.287us                1
native_batch_norm_backward                      0.01%         44.850us            0.01%         44.850us         44.850us            0.00%         78.209us         78.209us                1
torch::autograd::AccumulateGrad                 0.00%          5.570us            0.00%          5.570us          5.570us            0.00%          1.920us          1.920us                1
torch::autograd::AccumulateGrad                 0.00%          4.750us            0.00%          4.750us          4.750us            0.00%          0.672us          0.672us                1
CudnnConvolutionBackward                        0.17%        544.115us            0.17%        544.115us        544.115us            0.11%          5.790ms          5.790ms                1
cudnn_convolution_backward                      0.17%        528.815us            0.17%        528.815us        528.815us            0.11%          5.787ms          5.787ms                1
torch::autograd::AccumulateGrad                 0.00%         15.000us            0.00%         15.000us         15.000us            0.00%          1.822us          1.822us                1
NativeBatchNormBackward                         0.02%         66.350us            0.02%         66.350us         66.350us            0.00%         76.543us         76.543us                1
native_batch_norm_backward                      0.01%         46.760us            0.01%         46.760us         46.760us            0.00%         74.848us         74.848us                1
torch::autograd::AccumulateGrad                 0.00%          5.750us            0.00%          5.750us          5.750us            0.00%          1.537us          1.537us                1
torch::autograd::AccumulateGrad                 0.00%          5.000us            0.00%          5.000us          5.000us            0.00%          2.049us          2.049us                1
CudnnConvolutionBackward                        0.04%        130.121us            0.04%        130.121us        130.121us            0.03%          1.412ms          1.412ms                1
cudnn_convolution_backward                      0.04%        115.561us            0.04%        115.561us        115.561us            0.03%          1.409ms          1.409ms                1
torch::autograd::AccumulateGrad                 0.00%          5.880us            0.00%          5.880us          5.880us            0.00%          2.049us          2.049us                1
ReluBackward1                                   0.01%         46.741us            0.01%         46.741us         46.741us            0.00%         31.264us         31.264us                1
threshold_backward                              0.01%         35.161us            0.01%         35.161us         35.161us            0.00%         29.727us         29.727us                1
NativeBatchNormBackward                         0.02%         60.841us            0.02%         60.841us         60.841us            0.00%         46.176us         46.176us                1
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 318.945ms
CUDA time total: 5.090s

RTX 2080 ti with apex half

------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                                  Self CPU total %   Self CPU total      CPU total %        CPU total     CPU time avg     CUDA total %       CUDA total    CUDA time avg  Number of Calls
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
to                                              0.00%          4.650us            0.00%          4.650us          4.650us            0.00%          3.904us          3.904us                1
is_floating_point                               0.00%          2.270us            0.00%          2.270us          2.270us            0.00%          2.048us          2.048us                1
mul                                             0.00%         41.471us            0.00%         41.471us         41.471us            0.00%         41.568us         41.568us                1
torch::autograd::GraphRoot                      0.00%         29.271us            0.00%         29.271us         29.271us            0.00%          7.616us          7.616us                1
MulBackward0                                    0.00%        162.214us            0.00%        162.214us        162.214us            0.00%        157.504us        157.504us                1
mul                                             0.00%        107.863us            0.00%        107.863us        107.863us            0.00%        108.768us        108.768us                1
NllLossBackward                                 0.00%        159.613us            0.00%        159.613us        159.613us            0.00%        159.744us        159.744us                1
nll_loss_backward                               0.00%        129.183us            0.00%        129.183us        129.183us            0.00%        128.640us        128.640us                1
LogSoftmaxBackward                              0.00%         78.661us            0.00%         78.661us         78.661us            0.00%         77.856us         77.856us                1
_log_softmax_backward_data                      0.00%         64.331us            0.00%         64.331us         64.331us            0.00%         65.536us         65.536us                1
torch::autograd::CopyBackwards                  0.00%         85.042us            0.00%         85.042us         85.042us            0.00%         85.376us         85.376us                1
to                                              0.00%         63.722us            0.00%         63.722us         63.722us            0.00%         64.833us         64.833us                1
empty                                           0.00%          8.841us            0.00%          8.841us          8.841us            0.00%          9.376us          9.376us                1
AddmmBackward                                   0.00%        233.895us            0.00%        233.895us        233.895us            0.00%        233.792us        233.792us                1
unsigned short                                  0.00%         21.271us            0.00%         21.271us         21.271us            0.00%         21.792us         21.792us                1
mm                                              0.00%         95.212us            0.00%         95.212us         95.212us            0.00%         98.176us         98.176us                1
unsigned short                                  0.00%          8.100us            0.00%          8.100us          8.100us            0.00%          8.160us          8.160us                1
mm                                              0.00%         52.021us            0.00%         52.021us         52.021us            0.00%         53.152us         53.152us                1
unsigned short                                  0.00%         12.400us            0.00%         12.400us         12.400us            0.00%         12.288us         12.288us                1
sum                                             0.00%         88.282us            0.00%         88.282us         88.282us            0.00%         88.736us         88.736us                1
view                                            0.00%         14.390us            0.00%         14.390us         14.390us            0.00%         14.336us         14.336us                1
TBackward                                       0.00%         20.000us            0.00%         20.000us         20.000us            0.00%         18.976us         18.976us                1
unsigned short                                  0.00%         10.590us            0.00%         10.590us         10.590us            0.00%         10.240us         10.240us                1
torch::autograd::CopyBackwards                  0.00%         60.421us            0.00%         60.421us         60.421us            0.00%         59.520us         59.520us                1
to                                              0.00%         49.261us            0.00%         49.261us         49.261us            0.00%         50.176us         50.176us                1
empty                                           0.00%          8.110us            0.00%          8.110us          8.110us            0.00%          8.192us          8.192us                1
torch::autograd::AccumulateGrad                 0.00%         11.930us            0.00%         11.930us         11.930us            0.00%         12.000us         12.000us                1
torch::autograd::CopyBackwards                  0.00%         51.741us            0.00%         51.741us         51.741us            0.00%         52.512us         52.512us                1
to                                              0.00%         37.550us            0.00%         37.550us         37.550us            0.00%         38.400us         38.400us                1
empty                                           0.00%          9.070us            0.00%          9.070us          9.070us            0.00%          8.705us          8.705us                1
torch::autograd::AccumulateGrad                 0.00%          6.170us            0.00%          6.170us          6.170us            0.00%          6.144us          6.144us                1
ViewBackward                                    0.00%         44.671us            0.00%         44.671us         44.671us            0.00%         43.840us         43.840us                1
reshape                                         0.00%         31.741us            0.00%         31.741us         31.741us            0.00%         32.320us         32.320us                1
as_strided                                      0.00%          9.930us            0.00%          9.930us          9.930us            0.00%          8.128us          8.128us                1
AdaptiveAvgPool3DBackward                       0.00%         85.322us            0.00%         85.322us         85.322us            0.02%        810.496us        810.496us                1
adaptive_avg_pool3d_backward                    0.00%         71.082us            0.00%         71.082us         71.082us            0.02%        800.768us        800.768us                1
ReluBackward1                                   0.00%         60.452us            0.00%         60.452us         60.452us            0.00%         42.432us         42.432us                1
threshold_backward                              0.00%         38.450us            0.00%         38.450us         38.450us            0.00%         38.880us         38.880us                1
AddBackward0                                    0.00%          4.560us            0.00%          4.560us          4.560us            0.00%          2.048us          2.048us                1
CudnnBatchNormBackward                          0.03%          1.231ms            0.03%          1.231ms          1.231ms            0.01%        579.008us        579.008us                1
contiguous                                      0.00%          5.170us            0.00%          5.170us          5.170us            0.00%          0.640us          0.640us                1
cudnn_batch_norm_backward                       0.02%          1.190ms            0.02%          1.190ms          1.190ms            0.01%        573.280us        573.280us                1
torch::autograd::AccumulateGrad                 0.00%          7.030us            0.00%          7.030us          7.030us            0.00%          6.176us          6.176us                1
torch::autograd::AccumulateGrad                 0.00%          5.290us            0.00%          5.290us          5.290us            0.00%          5.408us          5.408us                1
CudnnConvolutionBackward                        0.02%          1.134ms            0.02%          1.134ms          1.134ms            0.04%          1.837ms          1.837ms                1
cudnn_convolution_backward                      0.02%          1.116ms            0.02%          1.116ms          1.116ms            0.04%          1.825ms          1.825ms                1
torch::autograd::CopyBackwards                  0.00%         64.342us            0.00%         64.342us         64.342us            0.00%         35.457us         35.457us                1
to                                              0.00%         52.031us            0.00%         52.031us         52.031us            0.00%         32.769us         32.769us                1
empty                                           0.00%         10.860us            0.00%         10.860us         10.860us            0.00%          1.760us          1.760us                1
torch::autograd::AccumulateGrad                 0.00%          6.470us            0.00%          6.470us          6.470us            0.00%          0.640us          0.640us                1
ReluBackward1                                   0.00%         47.741us            0.00%         47.741us         47.741us            0.00%         26.272us         26.272us                1
threshold_backward                              0.00%         35.231us            0.00%         35.231us         35.231us            0.00%         22.752us         22.752us                1
CudnnBatchNormBackward                          0.00%         79.771us            0.00%         79.771us         79.771us            0.00%         34.911us         34.911us                1
contiguous                                      0.00%          4.840us            0.00%          4.840us          4.840us            0.00%          2.049us          2.049us                1
cudnn_batch_norm_backward                       0.00%         51.131us            0.00%         51.131us         51.131us            0.00%         28.673us         28.673us                1
torch::autograd::AccumulateGrad                 0.00%         10.820us            0.00%         10.820us         10.820us            0.00%          2.048us          2.048us                1
torch::autograd::AccumulateGrad                 0.00%          5.210us            0.00%          5.210us          5.210us            0.00%          1.504us          1.504us                1
CudnnConvolutionBackward                        0.03%          1.502ms            0.03%          1.502ms          1.502ms            0.03%          1.608ms          1.608ms                1
cudnn_convolution_backward                      0.03%          1.486ms            0.03%          1.486ms          1.486ms            0.03%          1.605ms          1.605ms                1
torch::autograd::CopyBackwards                  0.00%         60.001us            0.00%         60.001us         60.001us            0.00%         16.384us         16.384us                1
to                                              0.00%         48.501us            0.00%         48.501us         48.501us            0.00%         13.409us         13.409us                1
empty                                           0.00%          9.650us            0.00%          9.650us          9.650us            0.00%          0.576us          0.576us                1
torch::autograd::AccumulateGrad                 0.00%          6.470us            0.00%          6.470us          6.470us            0.00%          1.504us          1.504us                1
ReluBackward1                                   0.00%         51.271us            0.00%         51.271us         51.271us            0.00%         25.887us         25.887us                1
threshold_backward                              0.00%         35.340us            0.00%         35.340us         35.340us            0.00%         22.369us         22.369us                1
CudnnBatchNormBackward                          0.00%         78.302us            0.00%         78.302us         78.302us            0.00%         32.385us         32.385us                1
contiguous                                      0.00%          4.910us            0.00%          4.910us          4.910us            0.00%          2.048us          2.048us                1
cudnn_batch_norm_backward                       0.00%         49.581us            0.00%         49.581us         49.581us            0.00%         25.150us         25.150us                1
torch::autograd::AccumulateGrad                 0.00%          6.150us            0.00%          6.150us          6.150us            0.00%          0.960us          0.960us                1
torch::autograd::AccumulateGrad                 0.00%          8.380us            0.00%          8.380us          8.380us            0.00%          1.792us          1.792us                1
CudnnConvolutionBackward                        0.00%        180.934us            0.00%        180.934us        180.934us            0.03%          1.368ms          1.368ms                1
cudnn_convolution_backward                      0.00%        167.003us            0.00%        167.003us        167.003us            0.03%          1.365ms          1.365ms                1
add                                             0.00%         33.070us            0.00%         33.070us         33.070us            0.00%         37.184us         37.184us                1
torch::autograd::CopyBackwards                  0.01%        534.221us            0.01%        534.221us        534.221us            0.00%         37.280us         37.280us                1
to                                              0.01%        522.121us            0.01%        522.121us        522.121us            0.00%         34.816us         34.816us                1
empty                                           0.01%        469.580us            0.01%        469.580us        469.580us            0.00%          2.048us          2.048us                1
torch::autograd::AccumulateGrad                 0.00%          6.831us            0.00%          6.831us          6.831us            0.00%          0.800us          0.800us                1
ReluBackward1                                   0.00%         41.371us            0.00%         41.371us         41.371us            0.00%         47.104us         47.104us                1
threshold_backward                              0.00%         29.670us            0.00%         29.670us         29.670us            0.00%         43.393us         43.393us                1
AddBackward0                                    0.00%          4.590us            0.00%          4.590us          4.590us            0.00%          1.471us          1.471us                1
CudnnBatchNormBackward                          0.00%         86.972us            0.00%         86.972us         86.972us            0.00%         56.735us         56.735us                1
contiguous                                      0.00%          4.840us            0.00%          4.840us          4.840us            0.00%          2.048us          2.048us                1
cudnn_batch_norm_backward                       0.00%         51.351us            0.00%         51.351us         51.351us            0.00%         49.632us         49.632us                1
torch::autograd::AccumulateGrad                 0.00%          6.170us            0.00%          6.170us          6.170us            0.00%          2.048us          2.048us                1
torch::autograd::AccumulateGrad                 0.00%          8.880us            0.00%          8.880us          8.880us            0.00%          1.504us          1.504us                1
CudnnConvolutionBackward                        0.00%        144.583us            0.00%        144.583us        144.583us            0.03%          1.407ms          1.407ms                1
cudnn_convolution_backward                      0.00%        131.113us            0.00%        131.113us        131.113us            0.03%          1.404ms          1.404ms                1
torch::autograd::CopyBackwards                  0.00%         58.572us            0.00%         58.572us         58.572us            0.00%         34.880us         34.880us                1
to                                              0.00%         47.011us            0.00%         47.011us         47.011us            0.00%         32.544us         32.544us                1
empty                                           0.00%         14.510us            0.00%         14.510us         14.510us            0.00%          2.049us          2.049us                1
torch::autograd::AccumulateGrad                 0.00%          6.430us            0.00%          6.430us          6.430us            0.00%          1.888us          1.888us                1
ReluBackward1                                   0.00%         39.851us            0.00%         39.851us         39.851us            0.00%         28.960us         28.960us                1
threshold_backward                              0.00%         28.581us            0.00%         28.581us         28.581us            0.00%         26.623us         26.623us                1
CudnnBatchNormBackward                          0.00%         76.212us            0.00%         76.212us         76.212us            0.00%         35.744us         35.744us                1
contiguous                                      0.00%          4.850us            0.00%          4.850us          4.850us            0.00%          2.047us          2.047us                1
cudnn_batch_norm_backward                       0.00%         48.611us            0.00%         48.611us         48.611us            0.00%         30.016us         30.016us                1
torch::autograd::AccumulateGrad                 0.00%          5.940us            0.00%          5.940us          5.940us            0.00%          1.504us          1.504us                1
torch::autograd::AccumulateGrad                 0.00%          8.201us            0.00%          8.201us          8.201us            0.00%          2.049us          2.049us                1
CudnnConvolutionBackward                        0.03%          1.334ms            0.03%          1.334ms          1.334ms            0.03%          1.620ms          1.620ms                1
cudnn_convolution_backward                      0.03%          1.317ms            0.03%          1.317ms          1.317ms            0.03%          1.617ms          1.617ms                1
------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
Self CPU time total: 4.869s
CUDA time total: 5.038s
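
The table above lists one row per call, so it is hard to see at a glance which ops dominate. A rough way to read a dump like this is to sum the CUDA time per op name. Below is a minimal sketch, assuming each row has the op name followed by nine columns in the order Self CPU %, Self CPU total, CPU %, CPU total, CPU avg, CUDA %, CUDA total, CUDA avg, call count (the header is not shown in this part of the output, so that order is an assumption). `to_microseconds` and `cuda_time_by_op` are hypothetical helper names, not part of PyTorch or Apex.

```python
import re
from collections import defaultdict

# Multipliers to convert the profiler's time suffixes to microseconds.
_UNIT_TO_US = {"us": 1.0, "ms": 1e3, "s": 1e6}


def to_microseconds(text):
    """Convert a time string such as '1.231ms' or '52.512us' to microseconds."""
    match = re.fullmatch(r"([0-9.]+)(us|ms|s)", text)
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_TO_US[unit]


def cuda_time_by_op(table_text):
    """Sum the 'CUDA total' column per op name.

    Assumes 10 whitespace-separated fields per row, with the op name
    first, 'CUDA total' at index 7 and the call count last.
    Header, separator, and summary lines are skipped.
    """
    totals = defaultdict(float)
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) != 10 or not fields[-1].isdigit():
            continue  # not a data row
        totals[fields[0]] += to_microseconds(fields[7])
    return dict(totals)


# Three rows copied from the table above.
SAMPLE = """
CudnnConvolutionBackward  0.02%  1.134ms  0.02%  1.134ms  1.134ms  0.04%  1.837ms  1.837ms  1
CudnnConvolutionBackward  0.03%  1.502ms  0.03%  1.502ms  1.502ms  0.03%  1.608ms  1.608ms  1
ReluBackward1             0.00%  60.452us 0.00%  60.452us 60.452us 0.00%  42.432us 42.432us 1
"""

# Print ops sorted by total CUDA time, largest first.
for name, total_us in sorted(cuda_time_by_op(SAMPLE).items(), key=lambda kv: -kv[1]):
    print(f"{name:<30} {total_us / 1e3:10.3f} ms")
```

To use it on a full run, paste the whole profiler table into a string (or read it from a file) and pass it to `cuda_time_by_op`. Comparing the sorted totals between the FP32 run and the `opt_level = "O1"` run may help show which ops account for the slowdown.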