Pytorch-cyclegan-and-pix2pix: Out of memory?

Created on 20 Jun 2017 · 5 Comments · Source: junyanz/pytorch-CycleGAN-and-pix2pix

I trained CycleGAN on an Nvidia Tesla K80 GPU under Ubuntu with batchSize=1,
but I got an "out of memory" error.
Is there anything I have missed? How much memory does this model use?

Edit: I tested the same thing on another machine with an Nvidia Titan X, Ubuntu, and batchSize=1, and got the same error.

I ran:
python train.py --dataroot ./datasets/horse2zebra --name horse2zebra_cyclegan --model cycle_gan

The messages I got:

batchSize: 1
beta1: 0.5
checkpoints_dir: ./checkpoints
continue_train: False
dataroot: ./datasets/horse2zebra
dataset_mode: unaligned
display_freq: 100
display_id: 1
display_winsize: 256
fineSize: 256
gpu_ids: [0]
identity: 0.0
input_nc: 3
isTrain: True
lambda_A: 10.0
lambda_B: 10.0
loadSize: 286
lr: 0.0002
max_dataset_size: inf
model: cycle_gan
nThreads: 1
n_layers_D: 3
name: horse2zebra_cyclegan
ndf: 64
ngf: 64
niter: 100
niter_decay: 100
no_flip: False
no_html: False
no_lsgan: False
norm: instance
output_nc: 3
phase: train
pool_size: 50
print_freq: 100
resize_or_crop: resize_and_crop
save_epoch_freq: 5
save_latest_freq: 5000
serial_batches: False
use_dropout: False
which_direction: AtoB
which_epoch: latest
which_model_netD: basic
which_model_netG: resnet_9blocks
-------------- End ----------------
CustomDatasetDataLoader
dataset [UnalignedDataset] was created
#training images = 1067
cycle_gan
---------- Networks initialized -------------
ResnetGenerator (
  (model): Sequential (
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
    (1): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (2): ReLU (inplace)
    (3): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (4): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (5): ReLU (inplace)
    (6): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (7): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (8): ReLU (inplace)
    (9): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (10): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (11): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (12): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (13): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (14): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (15): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (16): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (17): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (18): ConvTranspose2d(256, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
    (19): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (20): ReLU (inplace)
    (21): ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
    (22): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (23): ReLU (inplace)
    (24): Conv2d(64, 3, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
    (25): Tanh ()
  )
)
Total number of parameters: 11388675
ResnetGenerator (
  (model): Sequential (
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
    (1): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (2): ReLU (inplace)
    (3): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (4): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (5): ReLU (inplace)
    (6): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (7): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (8): ReLU (inplace)
    (9): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (10): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (11): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (12): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (13): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (14): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (15): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (16): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (17): ResnetBlock (
      (conv_block): Sequential (
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
        (2): ReLU (inplace)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (18): ConvTranspose2d(256, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
    (19): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (20): ReLU (inplace)
    (21): ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
    (22): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
    (23): ReLU (inplace)
    (24): Conv2d(64, 3, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
    (25): Tanh ()
  )
)
Total number of parameters: 11388675
NLayerDiscriminator (
  (model): Sequential (
    (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (1): LeakyReLU (0.2, inplace)
    (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (3): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (4): LeakyReLU (0.2, inplace)
    (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (6): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (7): LeakyReLU (0.2, inplace)
    (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
    (9): InstanceNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (10): LeakyReLU (0.2, inplace)
    (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
  )
)
Total number of parameters: 2766529
NLayerDiscriminator (
  (model): Sequential (
    (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (1): LeakyReLU (0.2, inplace)
    (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (3): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
    (4): LeakyReLU (0.2, inplace)
    (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (6): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (7): LeakyReLU (0.2, inplace)
    (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
    (9): InstanceNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (10): LeakyReLU (0.2, inplace)
    (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
  )
)
Total number of parameters: 2766529
-----------------------------------------------
model [CycleGANModel] was created
create web directory ./checkpoints/horse2zebra_cyclegan/web...
THCudaCheck FAIL file=/home/liyh/pytorch/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 25, in <module>
    model.optimize_parameters()
  File "/home/liyh/projects/pytorch_implementation/CycleGAN/models/cycle_gan_model.py", line 158, in optimize_parameters
    self.backward_G()
  File "/home/liyh/projects/pytorch_implementation/CycleGAN/models/cycle_gan_model.py", line 144, in backward_G
    self.rec_A = self.netG_B.forward(self.fake_B)
  File "/home/liyh/projects/pytorch_implementation/CycleGAN/models/networks.py", line 170, in forward
    return nn.parallel.data_parallel(self.model, input, self.gpu_ids)
  File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 103, in data_parallel
    return module(*inputs[0], **module_kwargs[0])
  File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/modules/container.py", line 64, in forward
    input = module(input)
  File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 237, in forward
    self.padding, self.dilation, self.groups)
  File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/functional.py", line 41, in conv2d
    return f(input, weight, bias)
RuntimeError: cuda runtime error (2) : out of memory at /home/liyh/pytorch/torch/lib/THC/generic/THCStorage.cu:66

lyhangustc

Most helpful comment

I solved it by running with a smaller loadSize and fineSize:
python train.py --dataroot ./datasets/horse2zebra --name horse2zebra_cyclegan --model cycle_gan --pool_size 50 --loadSize 128 --fineSize 128 --batchSize 1
And the memory used is 4441MB on a Tesla K80 GPU.

I also ran on four Tesla K80 GPUs:
python train.py --dataroot ./datasets/horse2zebra --name horse2zebra_Q_cyclegan --model cycle_gan --pool_size 50 --loadSize 128 --fineSize 128 --batchSize 16 --gpu_ids=0,1,2,3
And the memory used is about 4972MB on each GPU.
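
As a rough sanity check on why halving fineSize helps, the convolution outputs of one generator can be counted from the ResnetGenerator printout in the question. This is only a sketch under loud assumptions: it counts the outputs of the convolution layers for a single forward pass of one generator in float32 with batch size 1, and it ignores normalization outputs, gradients, optimizer state, the second generator, both discriminators, and any cuDNN workspace. It therefore does not predict the MB figures reported above. The function names are hypothetical and not part of the repository.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a Conv2d layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding, output_padding):
    """Spatial output size of a ConvTranspose2d layer (no dilation)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def generator_conv_output_elements(image_size, n_blocks=9):
    """Count conv output elements for one image passed through the
    resnet_9blocks generator, using the layer shapes from the printout.
    Assumption: only convolution outputs are counted."""
    total = 0
    s = conv_out(image_size, 7, 1, 3)        # (0) Conv2d(3, 64, 7x7)
    total += 64 * s * s
    s = conv_out(s, 3, 2, 1)                 # (3) Conv2d(64, 128, stride 2)
    total += 128 * s * s
    s = conv_out(s, 3, 2, 1)                 # (6) Conv2d(128, 256, stride 2)
    total += 256 * s * s
    for _ in range(n_blocks):                # (9)-(17) two 3x3 convs per block
        total += 2 * 256 * s * s
    s = conv_transpose_out(s, 3, 2, 1, 1)    # (18) ConvTranspose2d(256, 128)
    total += 128 * s * s
    s = conv_transpose_out(s, 3, 2, 1, 1)    # (21) ConvTranspose2d(128, 64)
    total += 64 * s * s
    s = conv_out(s, 7, 1, 3)                 # (24) Conv2d(64, 3, 7x7)
    total += 3 * s * s
    return total


for fine_size in (256, 128):
    elements = generator_conv_output_elements(fine_size)
    mib = elements * 4 / 2**20               # 4 bytes per float32 value
    print(fine_size, elements, round(mib, 2))
```

The count scales with the number of pixels, so going from fineSize 256 to 128 cuts this part of the footprint by a factor of four.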

lyhangustc on 20 Jun 2017

👍3

All 5 comments

lyhangustc on 20 Jun 2017

👍3

@lyhangustc On my GTX 1080 GPU, it takes 2.8 GB to train a horse2zebra model on 256x256 images. I think a K80 or Titan X should have enough memory for 256x256 models. I wonder if it is related to your GPU settings (e.g. ECC on/off).

junyanz on 20 Jun 2017

@lyhangustc When you change pool_size, loadSize, and fineSize in training, should you change them in testing as well?
Also, can you please explain the difference between loadSize and fineSize, and what pool_size does?

Thanks

isalirezag on 16 Aug 2017

@isalirezag We first (1) read the image, (2) resize it to (loadSize, loadSize), and (3) crop a random patch of size (fineSize, fineSize). It is a common form of data augmentation.
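
The resize-then-crop steps can be sketched as plain arithmetic on the crop box. This is a minimal illustration, not the repository's code: the function name `resize_and_crop_box` is hypothetical, and it only picks where the crop lands after the image has already been resized to loadSize x loadSize.

```python
import random


def resize_and_crop_box(load_size, fine_size, rng=random):
    """Return (left, top, right, bottom) of a random fine_size x fine_size
    crop inside an image already resized to load_size x load_size."""
    if fine_size > load_size:
        raise ValueError("fine_size must not exceed load_size")
    max_offset = load_size - fine_size      # 30 for the defaults 286 and 256
    left = rng.randint(0, max_offset)       # randint is inclusive on both ends
    top = rng.randint(0, max_offset)
    return (left, top, left + fine_size, top + fine_size)


rng = random.Random(0)                      # seeded only to make the demo repeatable
print(resize_and_crop_box(286, 256, rng))   # defaults from the option dump
print(resize_and_crop_box(128, 128, rng))   # the smaller setting used above
```

With loadSize equal to fineSize, as in the 128/128 run, the offset range collapses to zero, so the crop is always the whole resized image and only the resize has an effect.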

junyanz on 29 Dec 2017

I installed PyTorch from source and ran into the same problem. After I uninstalled it and reinstalled PyTorch with conda, training ran successfully:
conda install pytorch torchvision -c pytorch