I have a model modified from resnet50: I just removed the last avgpool & fc layers.
I found that if I keep changing the input size during training, the forward pass becomes very slow.
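(For context, the actual model definition isn't shown below; here is a minimal sketch of one common way to drop those two layers, assuming torchvision's resnet50 — names and details are illustrative only:)

import torch.nn as nn
from torchvision.models import resnet50 as tv_resnet50

def resnet50():
    # keep conv1 ... layer4, drop the trailing avgpool & fc
    backbone = tv_resnet50()
    return nn.Sequential(*list(backbone.children())[:-2])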
Minimal code:
import time
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
import numpy as np
from torch.autograd import Variable
# ... remove avgpool & fc from resnet50 here
net = resnet50()
net.cuda()
net = torch.nn.DataParallel(net, device_ids=range(torch.cuda.device_count()))
cudnn.benchmark = True
for i in range(10):
    h = np.random.randint(400, 600)
    w = np.random.randint(400, 600)
    # or fix h = w = 600
    x = Variable(torch.randn(1, 3, h, w)).cuda()
    t1 = time.time()
    y = net(x)
    t2 = time.time()
    print(t2 - t1)
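(A side note on the timing itself: CUDA calls are asynchronous, so wrapping time.time() around net(x) without synchronizing can misreport individual iterations. A sketch of a more accurate per-iteration measurement, using the same x and net as above:)

torch.cuda.synchronize()  # finish any pending GPU work before starting the timer
t1 = time.time()
y = net(x)
torch.cuda.synchronize()  # wait until the forward pass has actually completed
t2 = time.time()
print(t2 - t1)

Either way, the trend in the numbers below is the same.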
Output with fixed h = w = 600:
3.14512705803
0.11568403244
0.0255229473114
0.0228650569916
0.0235478878021
0.0225219726562
0.0436158180237
0.0222969055176
0.0223350524902
0.0227248668671
Output with random h, w each iteration:
3.12573313713
0.670918941498
2.32590889931
2.3486700058
2.31507301331
0.593285083771
0.68169093132
2.34181690216
0.597991943359
1.74615192413
I also trained with CPU only, and both cases work fine there. So I think the issue might be related to some CUDA overhead. Any ideas how to fix this?
Do you set cudnn.benchmark=True anywhere in your code? That is probably the culprit.
As @fmassa says here: https://discuss.pytorch.org/t/pytorch-performance/3079/7?u=smth
In benchmark mode, for each input size, cudnn will perform a bunch of computations to infer the fastest algorithm for that specific case, and caches the result. This brings some overhead, and if your input dimensions change all the time, using benchmark will actually slow down things because of this overhead.
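(In practice that leaves two options: turn benchmark mode off when input sizes vary, or keep it on and constrain the inputs to a small set of fixed sizes so the cached algorithms get reused. A rough sketch; the bucket sizes are just an example, not from the original post:)

import torch.backends.cudnn as cudnn

# Option 1: input sizes change every iteration -> disable benchmark mode
cudnn.benchmark = False

# Option 2: keep benchmark mode, but resize/pad inputs to a few fixed sizes,
# so cudnn benchmarks each size once and then reuses the cached algorithm.
cudnn.benchmark = True
BUCKETS = [448, 512, 576, 640]  # example sizes only

def round_up_to_bucket(size):
    # smallest bucket that still fits the requested size
    return min(b for b in BUCKETS if b >= size)

Option 1 is what ended up being used below; option 2 is worth considering when the shapes only take a handful of values.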
Cool. I commented the line out, and both cases run fine now.
Thanks.