I trained CycleGAN on an Nvidia Tesla K80 GPU (Ubuntu, batchSize=1),
but I got an "out of memory" error.
Is there anything I have missed? How much GPU memory does this model need?
Edited: I tested the same thing on another machine with an Nvidia Titan X (Ubuntu, batchSize=1) and got the same error.
I ran:
python train.py --dataroot ./datasets/horse2zebra --name horse2zebra_cyclegan --model cycle_gan
The messages I got:
batchSize: 1
beta1: 0.5
checkpoints_dir: ./checkpoints
continue_train: False
dataroot: ./datasets/horse2zebra
dataset_mode: unaligned
display_freq: 100
display_id: 1
display_winsize: 256
fineSize: 256
gpu_ids: [0]
identity: 0.0
input_nc: 3
isTrain: True
lambda_A: 10.0
lambda_B: 10.0
loadSize: 286
lr: 0.0002
max_dataset_size: inf
model: cycle_gan
nThreads: 1
n_layers_D: 3
name: horse2zebra_cyclegan
ndf: 64
ngf: 64
niter: 100
niter_decay: 100
no_flip: False
no_html: False
no_lsgan: False
norm: instance
output_nc: 3
phase: train
pool_size: 50
print_freq: 100
resize_or_crop: resize_and_crop
save_epoch_freq: 5
save_latest_freq: 5000
serial_batches: False
use_dropout: False
which_direction: AtoB
which_epoch: latest
which_model_netD: basic
which_model_netG: resnet_9blocks
-------------- End ----------------
CustomDatasetDataLoader
dataset [UnalignedDataset] was created
#training images = 1067
cycle_gan
---------- Networks initialized -------------
ResnetGenerator (
(model): Sequential (
(0): Conv2d(3, 64, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
(1): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(4): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
(5): ReLU (inplace)
(6): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(7): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(8): ReLU (inplace)
(9): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(10): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(11): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(12): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(13): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(14): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(15): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(16): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(17): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(18): ConvTranspose2d(256, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
(19): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
(20): ReLU (inplace)
(21): ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
(22): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
(23): ReLU (inplace)
(24): Conv2d(64, 3, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
(25): Tanh ()
)
)
Total number of parameters: 11388675
ResnetGenerator (
(model): Sequential (
(0): Conv2d(3, 64, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
(1): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(4): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
(5): ReLU (inplace)
(6): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(7): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(8): ReLU (inplace)
(9): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(10): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(11): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(12): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(13): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(14): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(15): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(16): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(17): ResnetBlock (
(conv_block): Sequential (
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(2): ReLU (inplace)
(3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
)
)
(18): ConvTranspose2d(256, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
(19): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
(20): ReLU (inplace)
(21): ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
(22): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
(23): ReLU (inplace)
(24): Conv2d(64, 3, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3))
(25): Tanh ()
)
)
Total number of parameters: 11388675
NLayerDiscriminator (
(model): Sequential (
(0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(1): LeakyReLU (0.2, inplace)
(2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(3): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
(4): LeakyReLU (0.2, inplace)
(5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(6): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(7): LeakyReLU (0.2, inplace)
(8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
(9): InstanceNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
(10): LeakyReLU (0.2, inplace)
(11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
)
)
Total number of parameters: 2766529
NLayerDiscriminator (
(model): Sequential (
(0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(1): LeakyReLU (0.2, inplace)
(2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(3): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
(4): LeakyReLU (0.2, inplace)
(5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
(6): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
(7): LeakyReLU (0.2, inplace)
(8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
(9): InstanceNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
(10): LeakyReLU (0.2, inplace)
(11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
)
)
Total number of parameters: 2766529
-----------------------------------------------
model [CycleGANModel] was created
create web directory ./checkpoints/horse2zebra_cyclegan/web...
THCudaCheck FAIL file=/home/liyh/pytorch/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
File "train.py", line 25, in <module>
model.optimize_parameters()
File "/home/liyh/projects/pytorch_implementation/CycleGAN/models/cycle_gan_model.py", line 158, in optimize_parameters
self.backward_G()
File "/home/liyh/projects/pytorch_implementation/CycleGAN/models/cycle_gan_model.py", line 144, in backward_G
self.rec_A = self.netG_B.forward(self.fake_B)
File "/home/liyh/projects/pytorch_implementation/CycleGAN/models/networks.py", line 170, in forward
return nn.parallel.data_parallel(self.model, input, self.gpu_ids)
File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 103, in data_parallel
return module(*inputs[0], **module_kwargs[0])
File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/modules/container.py", line 64, in forward
input = module(input)
File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 237, in forward
self.padding, self.dilation, self.groups)
File "/home/liyh/anaconda2/envs/pytorch/lib/python2.7/site-packages/torch/nn/functional.py", line 41, in conv2d
return f(input, weight, bias)
RuntimeError: cuda runtime error (2) : out of memory at /home/liyh/pytorch/torch/lib/THC/generic/THCStorage.cu:66
I solved it by running with smaller loadSize and fineSize:
python train.py --dataroot ./datasets/horse2zebra --name horse2zebra_cyclegan --model cycle_gan --pool_size 50 --loadSize 128 --fineSize 128 --batchSize 1
And the memory used is 4441MB on a Tesla K80 GPU.
I also ran on four Tesla K80 GPUs:
python train.py --dataroot ./datasets/horse2zebra --name horse2zebra_Q_cyclegan --model cycle_gan --pool_size 50 --loadSize 128 --fineSize 128 --batchSize 16 --gpu_ids=0,1,2,3
And the memory used is about 4972MB on each GPU.
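For a rough sense of why smaller loadSize/fineSize helps: every layer in the printed ResnetGenerator is convolutional, so activation memory scales with the number of input pixels. The sketch below is only a back-of-the-envelope count under my own assumptions. It tallies the conv and InstanceNorm outputs of a single generator forward pass, using the layer list from the log above. The function name and the choice of which tensors to count are mine, not part of the repo, and it ignores the second generator, the discriminators, gradients, and cuDNN workspace, so the absolute numbers will be far below what nvidia-smi reports.

```python
def generator_activation_elems(size, ngf=64, n_blocks=9):
    """Rough count of activation elements for one forward pass of the
    resnet_9blocks generator on a (3, size, size) input, batch size 1.
    Counts only conv / InstanceNorm outputs (the ReLUs are in-place)."""
    full = size * size                # resolution after the stride-1 layers
    half = (size // 2) ** 2           # after the first stride-2 conv
    quarter = (size // 4) ** 2        # after the second stride-2 conv

    total = 0
    total += 2 * ngf * full                   # 7x7 conv + norm, 64 channels
    total += 2 * (2 * ngf) * half             # downsample to 128 channels
    total += 2 * (4 * ngf) * quarter          # downsample to 256 channels
    total += n_blocks * 4 * (4 * ngf) * quarter  # each block: conv, norm, conv, norm
    total += 2 * (2 * ngf) * half             # upsample to 128 channels
    total += 2 * ngf * full                   # upsample to 64 channels
    total += 2 * 3 * full                     # final 7x7 conv + tanh
    return total


for size in (256, 128):
    elems = generator_activation_elems(size)
    print(size, elems, "%.1f MB as float32" % (elems * 4 / 2.0 ** 20))
```

Going from 256 to 128 cuts this count to a quarter, which is the main reason the smaller crop fits. The parameter count (about 11.4M per generator) does not change with image size.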
@lyhangustc On my GTX 1080 GPU, it takes 2.8 GB to train a horse2zebra model on 256x256 images. I think a K80 or Titan X should have enough memory for 256x256 models. I wonder if it is related to your GPU settings (e.g. ECC on/off).
@lyhangustc When you change pool_size, loadSize, and fineSize in training, should you change them in testing as well?
Also, can you please explain the difference between loadSize and fineSize, and what pool_size does?
Thanks
@isalirezag We (1) read the image, (2) resize it to (loadSize, loadSize), and (3) crop a random patch of (fineSize, fineSize). It is a common data augmentation.
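To make the resize-then-crop step concrete, here is a minimal sketch of how the crop window could be chosen. It is not the repo's actual data loader: the helper name random_crop_box is hypothetical, and it only computes the crop coordinates rather than touching image data. The defaults from the log above (loadSize=286, fineSize=256) leave 30 pixels of slack in each direction.

```python
import random


def random_crop_box(load_size, fine_size, rng=random):
    """Pick a random (left, top, right, bottom) window of fine_size x fine_size
    inside an image that was already resized to load_size x load_size."""
    if fine_size > load_size:
        raise ValueError("fineSize must not exceed loadSize")
    max_offset = load_size - fine_size        # slack available for the random shift
    left = rng.randint(0, max_offset)         # randint is inclusive on both ends
    top = rng.randint(0, max_offset)
    return left, top, left + fine_size, top + fine_size


# Example with the default options printed in the log above.
box = random_crop_box(286, 256, random.Random(0))
print(box)
```

With loadSize equal to fineSize the slack is zero, so the "random" crop is always the whole resized image.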
I installed PyTorch from source and ran into the same problem. After uninstalling it and reinstalling PyTorch with conda (conda install pytorch torchvision -c pytorch), it ran successfully.