I have a model modified from resnet50: I just removed the last avgpool & fc layers.
I found that if I keep changing the input size during training, the forward pass becomes very slow.
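(For context, the actual model definition isn't shown below; here is a minimal sketch of one common way to drop those two layers, assuming torchvision's resnet50 — names and details are illustrative only:)

import torch.nn as nn
from torchvision.models import resnet50 as tv_resnet50

def resnet50():
    # keep conv1 ... layer4, drop the trailing avgpool & fc
    backbone = tv_resnet50()
    return nn.Sequential(*list(backbone.children())[:-2])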
Minimal code:
import time
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
import numpy as np
from torch.autograd import Variable
# ... remove avgpool & fc from resnet50 here
net = resnet50()
net.cuda()
net = torch.nn.DataParallel(net, device_ids=range(torch.cuda.device_count()))
cudnn.benchmark = True
for i in range(10):
    h = np.random.randint(400, 600)
    w = np.random.randint(400, 600)
    # or fix h = w = 600
    x = Variable(torch.randn(1, 3, h, w)).cuda()
    t1 = time.time()
    y = net(x)
    t2 = time.time()
    print(t2 - t1)
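(A side note on the timing itself: CUDA calls are asynchronous, so wrapping time.time() around net(x) without synchronizing can misreport individual iterations. A sketch of a more accurate per-iteration measurement, using the same x and net as above:)

torch.cuda.synchronize()  # finish any pending GPU work before starting the timer
t1 = time.time()
y = net(x)
torch.cuda.synchronize()  # wait until the forward pass has actually completed
t2 = time.time()
print(t2 - t1)

Either way, the trend in the numbers below is the same.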
Output with fixed h = w = 600:
3.14512705803
0.11568403244
0.0255229473114
0.0228650569916
0.0235478878021
0.0225219726562
0.0436158180237
0.0222969055176
0.0223350524902
0.0227248668671
Output with random h, w each iteration:
3.12573313713
0.670918941498
2.32590889931
2.3486700058
2.31507301331
0.593285083771
0.68169093132
2.34181690216
0.597991943359
1.74615192413
I also trained with CPU only, and both cases work fine there. So I think the issue might be related to some CUDA overhead. Any ideas how to fix this?
Do you set cudnn.benchmark=True anywhere in your code? That is probably the culprit.
As @fmassa says here: https://discuss.pytorch.org/t/pytorch-performance/3079/7?u=smth
In benchmark mode, for each input size, cudnn will perform a bunch of computations to infer the fastest algorithm for that specific case, and caches the result. This brings some overhead, and if your input dimensions change all the time, using benchmark will actually slow down things because of this overhead.
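(In practice that leaves two options: turn benchmark mode off when input sizes vary, or keep it on and constrain the inputs to a small set of fixed sizes so the cached algorithms get reused. A rough sketch; the bucket sizes are just an example, not from the original post:)

import torch.backends.cudnn as cudnn

# Option 1: input sizes change every iteration -> disable benchmark mode
cudnn.benchmark = False

# Option 2: keep benchmark mode, but resize/pad inputs to a few fixed sizes,
# so cudnn benchmarks each size once and then reuses the cached algorithm.
cudnn.benchmark = True
BUCKETS = [448, 512, 576, 640]  # example sizes only

def round_up_to_bucket(size):
    # smallest bucket that still fits the requested size
    return min(b for b in BUCKETS if b >= size)

Option 1 is what ended up being used below; option 2 is worth considering when the shapes only take a handful of values.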
Cool. I commented the line out, and both cases run fine now.
Thanks.