Hi,
I'm trying to fine-tune BERT using Bert fine-tuning.
My problem is: after enabling apex, GPU memory usage is reduced, but training takes about 1.3 times as long as before.
My GPU is a V100 (16 GB, CUDA 9, cuDNN 7), and my PyTorch version is 1.0.
Is it a problem with my hardware?
A single V100, or multiple? Also, what level of device utilization are you achieving? For a quick-and-dirty (by no means definitive) check, try watch -n 0.5 nvidia-smi from another terminal while you run BERT, and see what fraction of device memory you are using.
We've got some people right now working on optimizing BERT specifically. I'll let you know if we observe similar behavior, and detail whatever best practices we discover.
A single V100.
opt_level = "O1"
| NVIDIA-SMI 396.26 Driver Version: 396.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:31:00.0 Off | 0 |
| N/A 63C P0 218W / 250W | 13495MiB / 16160MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
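For anyone who wants the same quick check in script form, here is a minimal sketch. It assumes nvidia-smi is on the PATH and uses its CSV query mode; the helper names parse_gpu_line and sample_gpus are hypothetical, not part of any library.

```python
import subprocess

# nvidia-smi CSV query: one line per GPU, e.g. "100, 13495, 16160"
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Turn one CSV line into utilization (%) and the fraction of memory in use."""
    util, used, total = (float(field) for field in line.split(","))
    return {"util_pct": util, "mem_fraction": used / total}


def sample_gpus():
    """Run nvidia-smi once and return one dict per visible GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

With the numbers from the output above, parse_gpu_line("100, 13495, 16160") reports 100% utilization and about 84% of device memory in use.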
Update:
I tried the code on a 2080 Ti in Docker, and everything works fine. Memory usage is reduced and training is also faster. I hope this helps in finding the problem.
Use CUDA 10; it is much faster for Tensor Core computation (2080 Ti).
Also, the CUDA version your PyTorch build targets should match the installed CUDA version.
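A minimal sketch of that version comparison, assuming you read the build version from torch.version.cuda and the installed toolkit version from the output of nvcc --version. The helper names are hypothetical, and comparing only major.minor is my assumption about what "matched" should mean here.

```python
def major_minor(version):
    """Return (major, minor) from a version string such as '10.0.130'."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def cuda_versions_match(build_version, installed_version):
    """Compare the CUDA version PyTorch was built with against the installed toolkit.

    build_version is what torch.version.cuda returns (None on CPU-only builds).
    """
    if build_version is None:
        return False
    return major_minor(build_version) == major_minor(installed_version)
```

For example, cuda_versions_match("9.0", "10.0.130") is False, which would flag a PyTorch build for CUDA 9 running next to a CUDA 10 toolkit.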
Recently, I was helping optimize an internal version of BERT with @sharatht. We're using Amp with opt_level=O1, so all GEMMs are patched to cast inputs and weights such that each GEMM itself runs in FP16 (the weights are stored in FP32, but cast to FP16 on entry to torch.mm functions). However, Tensor Cores additionally require that the participating dimensions of a GEMM are multiples of 8 (otherwise cuBLAS will fall back to a slower, non-Tensor Core kernel, even if the input and weight entering the GEMM are FP16).
We noticed that the dictionary size was not a multiple of 8, which prevented Tensor Core use for FP16 GEMMs in a particular linear decoder layer, causing that layer to take an annoyingly long time, even with Amp.
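One way to address this is to round the dictionary size up to the next multiple of 8 and leave the extra entries unused. A minimal sketch follows; pad_to_multiple is a hypothetical helper, and 30522 is only an illustrative value (the vocabulary size of the public bert-base-uncased model), not necessarily the size in the internal model mentioned here.

```python
def pad_to_multiple(n, multiple=8):
    """Round n up to the nearest multiple (Tensor Core GEMMs want multiples of 8)."""
    return ((n + multiple - 1) // multiple) * multiple


vocab_size = 30522                      # illustrative value, not from this thread
padded_vocab_size = pad_to_multiple(vocab_size)
print(vocab_size, "->", padded_vocab_size)
```

The padded size would then be used for the embedding and decoder dimensions, with the extra token ids never appearing in the data.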
See https://github.com/NVIDIA/apex/issues/221#issuecomment-478084841. BERT is not RNN-based, but the same concepts apply (to enable Tensor Core use with Amp, you should make sure any dimensions that participate in GEMMs are multiples of 8).
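To find which dimensions break that rule, a small diagnostic along these lines may help. It is only a sketch: find_misaligned is a hypothetical helper, and the dimension names and sizes in the usage note are made up for illustration.

```python
def find_misaligned(dims, multiple=8):
    """Return the names of GEMM dimensions that are not multiples of `multiple`.

    dims maps a descriptive name to a size, e.g. the token count per batch,
    the input features, and the output features of a linear layer.
    """
    return [name for name, size in dims.items() if size % multiple != 0]
```

For example, find_misaligned({"tokens": 4096, "hidden": 768, "vocab": 30522}) returns ["vocab"], pointing at the decoder's output dimension as the one to pad.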
I have the same issue. My model is this one https://github.com/kenshohara/3D-ResNets-PyTorch
Activating O1 in apex gives worse performance on a 2080 Ti than on a 1080 Ti. But naively calling .half() everywhere shows that the 2080 Ti is indeed faster.
Hi @hyperfraise,
None of the scripts in your repo seems to import apex.
Could you add the script you are using to profile the code and let us know how to reproduce the issue?
Please see this link for reproducible code: https://github.com/hyperfraise/Apex-bench
I think this is related to https://github.com/pytorch/pytorch/issues/22961
After profiling with torch.autograd.profiler.profile, I observed the following: a significant amount of time is spent on the CPU side in CudnnConvolutionBackward, cudnn_convolution_backward, CudnnBatchNormBackward, and cudnn_batch_norm_backward. Note that I am using half precision (via apex), and my network uses 3D convolutions. I use cuDNN 7.6.1, CUDA 10.0, and PyTorch 1.1.0. The GPU is an RTX 2080 Ti.
In contrast, the naive approach of calling .half() everywhere spends only a tiny fraction of that time on the CPU side.
------------------------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls
------------------------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
torch::autograd::GraphRoot 0.01% 30.060us 0.01% 30.060us 30.060us 0.00% 8.320us 8.320us 1
NllLossBackward 0.08% 253.392us 0.08% 253.392us 253.392us 0.00% 246.368us 246.368us 1
nll_loss_backward 0.06% 177.542us 0.06% 177.542us 177.542us 0.00% 176.064us 176.064us 1
LogSoftmaxBackward 0.03% 92.631us 0.03% 92.631us 92.631us 0.00% 92.160us 92.160us 1
_log_softmax_backward_data 0.02% 75.321us 0.02% 75.321us 75.321us 0.00% 77.152us 77.152us 1
AddmmBackward 0.09% 272.563us 0.09% 272.563us 272.563us 0.01% 272.544us 272.544us 1
unsigned short 0.01% 19.150us 0.01% 19.150us 19.150us 0.00% 18.592us 18.592us 1
mm 0.04% 123.522us 0.04% 123.522us 123.522us 0.00% 125.408us 125.408us 1
unsigned short 0.00% 12.120us 0.00% 12.120us 12.120us 0.00% 12.288us 12.288us 1
mm 0.02% 56.040us 0.02% 56.040us 56.040us 0.00% 57.376us 57.376us 1
unsigned short 0.00% 7.751us 0.00% 7.751us 7.751us 0.00% 7.168us 7.168us 1
sum 0.03% 89.521us 0.03% 89.521us 89.521us 0.00% 90.368us 90.368us 1
view 0.00% 15.110us 0.00% 15.110us 15.110us 0.00% 15.488us 15.488us 1
torch::autograd::AccumulateGrad 0.01% 20.210us 0.01% 20.210us 20.210us 0.00% 20.352us 20.352us 1
TBackward 0.01% 16.341us 0.01% 16.341us 16.341us 0.00% 16.096us 16.096us 1
unsigned short 0.00% 7.851us 0.00% 7.851us 7.851us 0.00% 7.712us 7.712us 1
torch::autograd::AccumulateGrad 0.00% 5.730us 0.00% 5.730us 5.730us 0.00% 4.960us 4.960us 1
ViewBackward 0.01% 36.970us 0.01% 36.970us 36.970us 0.00% 36.576us 36.576us 1
reshape 0.01% 28.080us 0.01% 28.080us 28.080us 0.00% 28.512us 28.512us 1
as_strided 0.00% 7.000us 0.00% 7.000us 7.000us 0.00% 7.680us 7.680us 1
AdaptiveAvgPool3DBackward 0.02% 77.891us 0.02% 77.891us 77.891us 0.02% 808.960us 808.960us 1
adaptive_avg_pool3d_backward 0.02% 64.461us 0.02% 64.461us 64.461us 0.02% 800.512us 800.512us 1
ReluBackward1 0.02% 59.111us 0.02% 59.111us 59.111us 0.00% 40.960us 40.960us 1
threshold_backward 0.01% 42.751us 0.01% 42.751us 42.751us 0.00% 38.304us 38.304us 1
AddBackward0 0.00% 4.440us 0.00% 4.440us 4.440us 0.00% 1.632us 1.632us 1
NativeBatchNormBackward 0.03% 103.371us 0.03% 103.371us 103.371us 0.00% 74.496us 74.496us 1
native_batch_norm_backward 0.02% 75.431us 0.02% 75.431us 75.431us 0.00% 71.680us 71.680us 1
torch::autograd::AccumulateGrad 0.00% 6.361us 0.00% 6.361us 6.361us 0.00% 0.704us 0.704us 1
torch::autograd::AccumulateGrad 0.00% 4.970us 0.00% 4.970us 4.970us 0.00% 1.824us 1.824us 1
CudnnConvolutionBackward 0.69% 2.191ms 0.69% 2.191ms 2.191ms 0.04% 2.274ms 2.274ms 1
cudnn_convolution_backward 0.68% 2.171ms 0.68% 2.171ms 2.171ms 0.04% 2.271ms 2.271ms 1
torch::autograd::AccumulateGrad 0.00% 6.710us 0.00% 6.710us 6.710us 0.00% 0.929us 0.929us 1
ReluBackward1 0.01% 46.211us 0.01% 46.211us 46.211us 0.00% 26.592us 26.592us 1
threshold_backward 0.01% 33.381us 0.01% 33.381us 33.381us 0.00% 22.880us 22.880us 1
NativeBatchNormBackward 0.02% 65.761us 0.02% 65.761us 65.761us 0.00% 43.584us 43.584us 1
native_batch_norm_backward 0.01% 46.851us 0.01% 46.851us 46.851us 0.00% 40.960us 40.960us 1
torch::autograd::AccumulateGrad 0.00% 6.090us 0.00% 6.090us 6.090us 0.00% 1.729us 1.729us 1
torch::autograd::AccumulateGrad 0.00% 4.590us 0.00% 4.590us 4.590us 0.00% 0.832us 0.832us 1
CudnnConvolutionBackward 0.47% 1.495ms 0.47% 1.495ms 1.495ms 0.03% 1.626ms 1.626ms 1
cudnn_convolution_backward 0.46% 1.479ms 0.46% 1.479ms 1.479ms 0.03% 1.622ms 1.622ms 1
torch::autograd::AccumulateGrad 0.00% 6.580us 0.00% 6.580us 6.580us 0.00% 2.048us 2.048us 1
ReluBackward1 0.01% 43.341us 0.01% 43.341us 43.341us 0.00% 22.688us 22.688us 1
threshold_backward 0.01% 31.021us 0.01% 31.021us 31.021us 0.00% 19.136us 19.136us 1
NativeBatchNormBackward 0.02% 64.161us 0.02% 64.161us 64.161us 0.00% 40.320us 40.320us 1
native_batch_norm_backward 0.01% 45.981us 0.01% 45.981us 45.981us 0.00% 37.312us 37.312us 1
torch::autograd::AccumulateGrad 0.00% 10.750us 0.00% 10.750us 10.750us 0.00% 2.048us 2.048us 1
torch::autograd::AccumulateGrad 0.00% 4.750us 0.00% 4.750us 4.750us 0.00% 1.504us 1.504us 1
CudnnConvolutionBackward 0.06% 187.662us 0.06% 187.662us 187.662us 0.03% 1.384ms 1.384ms 1
cudnn_convolution_backward 0.05% 173.032us 0.05% 173.032us 173.032us 0.03% 1.381ms 1.381ms 1
add 0.01% 40.201us 0.01% 40.201us 40.201us 0.00% 34.528us 34.528us 1
torch::autograd::AccumulateGrad 0.00% 6.110us 0.00% 6.110us 6.110us 0.00% 0.832us 0.832us 1
ReluBackward1 0.01% 37.130us 0.01% 37.130us 37.130us 0.00% 45.057us 45.057us 1
threshold_backward 0.01% 25.600us 0.01% 25.600us 25.600us 0.00% 42.592us 42.592us 1
AddBackward0 0.00% 4.000us 0.00% 4.000us 4.000us 0.00% 1.761us 1.761us 1
NativeBatchNormBackward 0.02% 57.550us 0.02% 57.550us 57.550us 0.00% 76.607us 76.607us 1
native_batch_norm_backward 0.01% 39.830us 0.01% 39.830us 39.830us 0.00% 75.008us 75.008us 1
torch::autograd::AccumulateGrad 0.00% 6.060us 0.00% 6.060us 6.060us 0.00% 1.695us 1.695us 1
torch::autograd::AccumulateGrad 0.00% 4.720us 0.00% 4.720us 4.720us 0.00% 0.736us 0.736us 1
CudnnConvolutionBackward 0.05% 153.481us 0.05% 153.481us 153.481us 0.03% 1.411ms 1.411ms 1
cudnn_convolution_backward 0.04% 134.891us 0.04% 134.891us 134.891us 0.03% 1.408ms 1.408ms 1
torch::autograd::AccumulateGrad 0.00% 6.150us 0.00% 6.150us 6.150us 0.00% 1.568us 1.568us 1
ReluBackward1 0.01% 46.971us 0.01% 46.971us 46.971us 0.00% 27.487us 27.487us 1
threshold_backward 0.01% 31.490us 0.01% 31.490us 31.490us 0.00% 26.111us 26.111us 1
NativeBatchNormBackward 0.02% 64.061us 0.02% 64.061us 64.061us 0.00% 47.104us 47.104us 1
native_batch_norm_backward 0.01% 38.801us 0.01% 38.801us 38.801us 0.00% 44.353us 44.353us 1
torch::autograd::AccumulateGrad 0.00% 5.890us 0.00% 5.890us 5.890us 0.00% 1.695us 1.695us 1
torch::autograd::AccumulateGrad 0.00% 4.540us 0.00% 4.540us 4.540us 0.00% 0.896us 0.896us 1
CudnnConvolutionBackward 0.43% 1.358ms 0.43% 1.358ms 1.358ms 0.03% 1.624ms 1.624ms 1
cudnn_convolution_backward 0.42% 1.343ms 0.42% 1.343ms 1.343ms 0.03% 1.620ms 1.620ms 1
torch::autograd::AccumulateGrad 0.00% 6.400us 0.00% 6.400us 6.400us 0.00% 1.663us 1.663us 1
ReluBackward1 0.02% 49.950us 0.02% 49.950us 49.950us 0.00% 27.553us 27.553us 1
threshold_backward 0.01% 37.140us 0.01% 37.140us 37.140us 0.00% 24.575us 24.575us 1
NativeBatchNormBackward 0.02% 63.521us 0.02% 63.521us 63.521us 0.00% 43.391us 43.391us 1
native_batch_norm_backward 0.01% 45.331us 0.01% 45.331us 45.331us 0.00% 41.119us 41.119us 1
torch::autograd::AccumulateGrad 0.00% 6.310us 0.00% 6.310us 6.310us 0.00% 1.664us 1.664us 1
torch::autograd::AccumulateGrad 0.00% 4.830us 0.00% 4.830us 4.830us 0.00% 0.896us 0.896us 1
CudnnConvolutionBackward 0.04% 135.992us 0.04% 135.992us 135.992us 0.03% 1.393ms 1.393ms 1
cudnn_convolution_backward 0.04% 118.831us 0.04% 118.831us 118.831us 0.03% 1.389ms 1.389ms 1
add 0.01% 28.780us 0.01% 28.780us 28.780us 0.00% 35.008us 35.008us 1
torch::autograd::AccumulateGrad 0.00% 6.130us 0.00% 6.130us 6.130us 0.00% 1.951us 1.951us 1
ReluBackward1 0.01% 42.411us 0.01% 42.411us 42.411us 0.00% 46.943us 46.943us 1
threshold_backward 0.01% 30.281us 0.01% 30.281us 30.281us 0.00% 44.770us 44.770us 1
AddBackward0 0.00% 4.210us 0.00% 4.210us 4.210us 0.00% 2.049us 2.049us 1
NativeBatchNormBackward 0.02% 62.710us 0.02% 62.710us 62.710us 0.00% 80.287us 80.287us 1
native_batch_norm_backward 0.01% 44.850us 0.01% 44.850us 44.850us 0.00% 78.209us 78.209us 1
torch::autograd::AccumulateGrad 0.00% 5.570us 0.00% 5.570us 5.570us 0.00% 1.920us 1.920us 1
torch::autograd::AccumulateGrad 0.00% 4.750us 0.00% 4.750us 4.750us 0.00% 0.672us 0.672us 1
CudnnConvolutionBackward 0.17% 544.115us 0.17% 544.115us 544.115us 0.11% 5.790ms 5.790ms 1
cudnn_convolution_backward 0.17% 528.815us 0.17% 528.815us 528.815us 0.11% 5.787ms 5.787ms 1
torch::autograd::AccumulateGrad 0.00% 15.000us 0.00% 15.000us 15.000us 0.00% 1.822us 1.822us 1
NativeBatchNormBackward 0.02% 66.350us 0.02% 66.350us 66.350us 0.00% 76.543us 76.543us 1
native_batch_norm_backward 0.01% 46.760us 0.01% 46.760us 46.760us 0.00% 74.848us 74.848us 1
torch::autograd::AccumulateGrad 0.00% 5.750us 0.00% 5.750us 5.750us 0.00% 1.537us 1.537us 1
torch::autograd::AccumulateGrad 0.00% 5.000us 0.00% 5.000us 5.000us 0.00% 2.049us 2.049us 1
CudnnConvolutionBackward 0.04% 130.121us 0.04% 130.121us 130.121us 0.03% 1.412ms 1.412ms 1
cudnn_convolution_backward 0.04% 115.561us 0.04% 115.561us 115.561us 0.03% 1.409ms 1.409ms 1
torch::autograd::AccumulateGrad 0.00% 5.880us 0.00% 5.880us 5.880us 0.00% 2.049us 2.049us 1
ReluBackward1 0.01% 46.741us 0.01% 46.741us 46.741us 0.00% 31.264us 31.264us 1
threshold_backward 0.01% 35.161us 0.01% 35.161us 35.161us 0.00% 29.727us 29.727us 1
NativeBatchNormBackward 0.02% 60.841us 0.02% 60.841us 60.841us 0.00% 46.176us 46.176us 1
------------------------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Self CPU time total: 318.945ms
CUDA time total: 5.090s
------------------------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls
------------------------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
to 0.00% 4.650us 0.00% 4.650us 4.650us 0.00% 3.904us 3.904us 1
is_floating_point 0.00% 2.270us 0.00% 2.270us 2.270us 0.00% 2.048us 2.048us 1
mul 0.00% 41.471us 0.00% 41.471us 41.471us 0.00% 41.568us 41.568us 1
torch::autograd::GraphRoot 0.00% 29.271us 0.00% 29.271us 29.271us 0.00% 7.616us 7.616us 1
MulBackward0 0.00% 162.214us 0.00% 162.214us 162.214us 0.00% 157.504us 157.504us 1
mul 0.00% 107.863us 0.00% 107.863us 107.863us 0.00% 108.768us 108.768us 1
NllLossBackward 0.00% 159.613us 0.00% 159.613us 159.613us 0.00% 159.744us 159.744us 1
nll_loss_backward 0.00% 129.183us 0.00% 129.183us 129.183us 0.00% 128.640us 128.640us 1
LogSoftmaxBackward 0.00% 78.661us 0.00% 78.661us 78.661us 0.00% 77.856us 77.856us 1
_log_softmax_backward_data 0.00% 64.331us 0.00% 64.331us 64.331us 0.00% 65.536us 65.536us 1
torch::autograd::CopyBackwards 0.00% 85.042us 0.00% 85.042us 85.042us 0.00% 85.376us 85.376us 1
to 0.00% 63.722us 0.00% 63.722us 63.722us 0.00% 64.833us 64.833us 1
empty 0.00% 8.841us 0.00% 8.841us 8.841us 0.00% 9.376us 9.376us 1
AddmmBackward 0.00% 233.895us 0.00% 233.895us 233.895us 0.00% 233.792us 233.792us 1
unsigned short 0.00% 21.271us 0.00% 21.271us 21.271us 0.00% 21.792us 21.792us 1
mm 0.00% 95.212us 0.00% 95.212us 95.212us 0.00% 98.176us 98.176us 1
unsigned short 0.00% 8.100us 0.00% 8.100us 8.100us 0.00% 8.160us 8.160us 1
mm 0.00% 52.021us 0.00% 52.021us 52.021us 0.00% 53.152us 53.152us 1
unsigned short 0.00% 12.400us 0.00% 12.400us 12.400us 0.00% 12.288us 12.288us 1
sum 0.00% 88.282us 0.00% 88.282us 88.282us 0.00% 88.736us 88.736us 1
view 0.00% 14.390us 0.00% 14.390us 14.390us 0.00% 14.336us 14.336us 1
TBackward 0.00% 20.000us 0.00% 20.000us 20.000us 0.00% 18.976us 18.976us 1
unsigned short 0.00% 10.590us 0.00% 10.590us 10.590us 0.00% 10.240us 10.240us 1
torch::autograd::CopyBackwards 0.00% 60.421us 0.00% 60.421us 60.421us 0.00% 59.520us 59.520us 1
to 0.00% 49.261us 0.00% 49.261us 49.261us 0.00% 50.176us 50.176us 1
empty 0.00% 8.110us 0.00% 8.110us 8.110us 0.00% 8.192us 8.192us 1
torch::autograd::AccumulateGrad 0.00% 11.930us 0.00% 11.930us 11.930us 0.00% 12.000us 12.000us 1
torch::autograd::CopyBackwards 0.00% 51.741us 0.00% 51.741us 51.741us 0.00% 52.512us 52.512us 1
to 0.00% 37.550us 0.00% 37.550us 37.550us 0.00% 38.400us 38.400us 1
empty 0.00% 9.070us 0.00% 9.070us 9.070us 0.00% 8.705us 8.705us 1
torch::autograd::AccumulateGrad 0.00% 6.170us 0.00% 6.170us 6.170us 0.00% 6.144us 6.144us 1
ViewBackward 0.00% 44.671us 0.00% 44.671us 44.671us 0.00% 43.840us 43.840us 1
reshape 0.00% 31.741us 0.00% 31.741us 31.741us 0.00% 32.320us 32.320us 1
as_strided 0.00% 9.930us 0.00% 9.930us 9.930us 0.00% 8.128us 8.128us 1
AdaptiveAvgPool3DBackward 0.00% 85.322us 0.00% 85.322us 85.322us 0.02% 810.496us 810.496us 1
adaptive_avg_pool3d_backward 0.00% 71.082us 0.00% 71.082us 71.082us 0.02% 800.768us 800.768us 1
ReluBackward1 0.00% 60.452us 0.00% 60.452us 60.452us 0.00% 42.432us 42.432us 1
threshold_backward 0.00% 38.450us 0.00% 38.450us 38.450us 0.00% 38.880us 38.880us 1
AddBackward0 0.00% 4.560us 0.00% 4.560us 4.560us 0.00% 2.048us 2.048us 1
CudnnBatchNormBackward 0.03% 1.231ms 0.03% 1.231ms 1.231ms 0.01% 579.008us 579.008us 1
contiguous 0.00% 5.170us 0.00% 5.170us 5.170us 0.00% 0.640us 0.640us 1
cudnn_batch_norm_backward 0.02% 1.190ms 0.02% 1.190ms 1.190ms 0.01% 573.280us 573.280us 1
torch::autograd::AccumulateGrad 0.00% 7.030us 0.00% 7.030us 7.030us 0.00% 6.176us 6.176us 1
torch::autograd::AccumulateGrad 0.00% 5.290us 0.00% 5.290us 5.290us 0.00% 5.408us 5.408us 1
CudnnConvolutionBackward 0.02% 1.134ms 0.02% 1.134ms 1.134ms 0.04% 1.837ms 1.837ms 1
cudnn_convolution_backward 0.02% 1.116ms 0.02% 1.116ms 1.116ms 0.04% 1.825ms 1.825ms 1
torch::autograd::CopyBackwards 0.00% 64.342us 0.00% 64.342us 64.342us 0.00% 35.457us 35.457us 1
to 0.00% 52.031us 0.00% 52.031us 52.031us 0.00% 32.769us 32.769us 1
empty 0.00% 10.860us 0.00% 10.860us 10.860us 0.00% 1.760us 1.760us 1
torch::autograd::AccumulateGrad 0.00% 6.470us 0.00% 6.470us 6.470us 0.00% 0.640us 0.640us 1
ReluBackward1 0.00% 47.741us 0.00% 47.741us 47.741us 0.00% 26.272us 26.272us 1
threshold_backward 0.00% 35.231us 0.00% 35.231us 35.231us 0.00% 22.752us 22.752us 1
CudnnBatchNormBackward 0.00% 79.771us 0.00% 79.771us 79.771us 0.00% 34.911us 34.911us 1
contiguous 0.00% 4.840us 0.00% 4.840us 4.840us 0.00% 2.049us 2.049us 1
cudnn_batch_norm_backward 0.00% 51.131us 0.00% 51.131us 51.131us 0.00% 28.673us 28.673us 1
torch::autograd::AccumulateGrad 0.00% 10.820us 0.00% 10.820us 10.820us 0.00% 2.048us 2.048us 1
torch::autograd::AccumulateGrad 0.00% 5.210us 0.00% 5.210us 5.210us 0.00% 1.504us 1.504us 1
CudnnConvolutionBackward 0.03% 1.502ms 0.03% 1.502ms 1.502ms 0.03% 1.608ms 1.608ms 1
cudnn_convolution_backward 0.03% 1.486ms 0.03% 1.486ms 1.486ms 0.03% 1.605ms 1.605ms 1
torch::autograd::CopyBackwards 0.00% 60.001us 0.00% 60.001us 60.001us 0.00% 16.384us 16.384us 1
to 0.00% 48.501us 0.00% 48.501us 48.501us 0.00% 13.409us 13.409us 1
empty 0.00% 9.650us 0.00% 9.650us 9.650us 0.00% 0.576us 0.576us 1
torch::autograd::AccumulateGrad 0.00% 6.470us 0.00% 6.470us 6.470us 0.00% 1.504us 1.504us 1
ReluBackward1 0.00% 51.271us 0.00% 51.271us 51.271us 0.00% 25.887us 25.887us 1
threshold_backward 0.00% 35.340us 0.00% 35.340us 35.340us 0.00% 22.369us 22.369us 1
CudnnBatchNormBackward 0.00% 78.302us 0.00% 78.302us 78.302us 0.00% 32.385us 32.385us 1
contiguous 0.00% 4.910us 0.00% 4.910us 4.910us 0.00% 2.048us 2.048us 1
cudnn_batch_norm_backward 0.00% 49.581us 0.00% 49.581us 49.581us 0.00% 25.150us 25.150us 1
torch::autograd::AccumulateGrad 0.00% 6.150us 0.00% 6.150us 6.150us 0.00% 0.960us 0.960us 1
torch::autograd::AccumulateGrad 0.00% 8.380us 0.00% 8.380us 8.380us 0.00% 1.792us 1.792us 1
CudnnConvolutionBackward 0.00% 180.934us 0.00% 180.934us 180.934us 0.03% 1.368ms 1.368ms 1
cudnn_convolution_backward 0.00% 167.003us 0.00% 167.003us 167.003us 0.03% 1.365ms 1.365ms 1
add 0.00% 33.070us 0.00% 33.070us 33.070us 0.00% 37.184us 37.184us 1
torch::autograd::CopyBackwards 0.01% 534.221us 0.01% 534.221us 534.221us 0.00% 37.280us 37.280us 1
to 0.01% 522.121us 0.01% 522.121us 522.121us 0.00% 34.816us 34.816us 1
empty 0.01% 469.580us 0.01% 469.580us 469.580us 0.00% 2.048us 2.048us 1
torch::autograd::AccumulateGrad 0.00% 6.831us 0.00% 6.831us 6.831us 0.00% 0.800us 0.800us 1
ReluBackward1 0.00% 41.371us 0.00% 41.371us 41.371us 0.00% 47.104us 47.104us 1
threshold_backward 0.00% 29.670us 0.00% 29.670us 29.670us 0.00% 43.393us 43.393us 1
AddBackward0 0.00% 4.590us 0.00% 4.590us 4.590us 0.00% 1.471us 1.471us 1
CudnnBatchNormBackward 0.00% 86.972us 0.00% 86.972us 86.972us 0.00% 56.735us 56.735us 1
contiguous 0.00% 4.840us 0.00% 4.840us 4.840us 0.00% 2.048us 2.048us 1
cudnn_batch_norm_backward 0.00% 51.351us 0.00% 51.351us 51.351us 0.00% 49.632us 49.632us 1
torch::autograd::AccumulateGrad 0.00% 6.170us 0.00% 6.170us 6.170us 0.00% 2.048us 2.048us 1
torch::autograd::AccumulateGrad 0.00% 8.880us 0.00% 8.880us 8.880us 0.00% 1.504us 1.504us 1
CudnnConvolutionBackward 0.00% 144.583us 0.00% 144.583us 144.583us 0.03% 1.407ms 1.407ms 1
cudnn_convolution_backward 0.00% 131.113us 0.00% 131.113us 131.113us 0.03% 1.404ms 1.404ms 1
torch::autograd::CopyBackwards 0.00% 58.572us 0.00% 58.572us 58.572us 0.00% 34.880us 34.880us 1
to 0.00% 47.011us 0.00% 47.011us 47.011us 0.00% 32.544us 32.544us 1
empty 0.00% 14.510us 0.00% 14.510us 14.510us 0.00% 2.049us 2.049us 1
torch::autograd::AccumulateGrad 0.00% 6.430us 0.00% 6.430us 6.430us 0.00% 1.888us 1.888us 1
ReluBackward1 0.00% 39.851us 0.00% 39.851us 39.851us 0.00% 28.960us 28.960us 1
threshold_backward 0.00% 28.581us 0.00% 28.581us 28.581us 0.00% 26.623us 26.623us 1
CudnnBatchNormBackward 0.00% 76.212us 0.00% 76.212us 76.212us 0.00% 35.744us 35.744us 1
contiguous 0.00% 4.850us 0.00% 4.850us 4.850us 0.00% 2.047us 2.047us 1
cudnn_batch_norm_backward 0.00% 48.611us 0.00% 48.611us 48.611us 0.00% 30.016us 30.016us 1
torch::autograd::AccumulateGrad 0.00% 5.940us 0.00% 5.940us 5.940us 0.00% 1.504us 1.504us 1
torch::autograd::AccumulateGrad 0.00% 8.201us 0.00% 8.201us 8.201us 0.00% 2.049us 2.049us 1
CudnnConvolutionBackward 0.03% 1.334ms 0.03% 1.334ms 1.334ms 0.03% 1.620ms 1.620ms 1
cudnn_convolution_backward 0.03% 1.317ms 0.03% 1.317ms 1.317ms 0.03% 1.617ms 1.617ms 1
------------------------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Self CPU time total: 4.869s
CUDA time total: 5.038s
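To put the two summaries side by side, the totals can be turned into a CPU-to-CUDA time ratio. This is a rough sketch with hypothetical helper names; it only parses the time strings as the profiler prints them (us, ms, s).

```python
UNIT_TO_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(text):
    """Convert a profiler time string such as '318.945ms' to seconds."""
    text = text.strip()
    for unit in ("us", "ms", "s"):      # two-letter suffixes are checked before 's'
        if text.endswith(unit):
            return float(text[: -len(unit)]) * UNIT_TO_SECONDS[unit]
    raise ValueError("unrecognized time string: " + repr(text))


def cpu_to_cuda_ratio(self_cpu_total, cuda_total):
    """Ratio of total self CPU time to total CUDA time for one profile."""
    return to_seconds(self_cpu_total) / to_seconds(cuda_total)


first_table = cpu_to_cuda_ratio("318.945ms", "5.090s")
second_table = cpu_to_cuda_ratio("4.869s", "5.038s")
```

The first table comes out at roughly 6% and the second at roughly 97%. The second table appears to be the apex run, judging by its CopyBackwards cast entries, so CPU-side time there is almost as large as the CUDA time itself.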
@OValery16 I have the same problem. Have you solved it? It seems that the GPU is not used during the backward pass.