Fairseq: memory increases every batch, leading to out of memory after some epochs

Created on 27 Jul 2018 · 11 Comments · Source: pytorch/fairseq

I use the following script to train a model, and I find that memory increases on every batch, which eventually leads to an out-of-memory error after some epochs.

The training script is as follows:
sbatch --partition airesearch_middle --job-name mem_fairseq-py --gres gpu:4 --cpus-per-task 10
--nodes 1 --ntasks-per-node 1
--wrap "srun --output ${savedir}/train.log.node%t --error ${savedir}/train.stderr.node%t.%j
python train.py $DATA
--distributed-world-size 4
--save-dir=${savedir}
--update-freq 16
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000
--lr 0.0005 --min-lr 1e-09
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1
--max-tokens 1000 "

The script used to log memory is as follows:

#!/usr/bin/env python
import os
import psutil

def print_mem(itnum, bnum):
    # Return the resident set size (RSS) of the current process in MB.
    pid = os.getpid()
    py = psutil.Process(pid)
    memoryUse = py.memory_info()[0] / 2. ** 20
    return 'iteration: {} batchnum {} memory use: {}MB'.format(itnum, bnum, memoryUse)
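
For context, a plausible call site for this helper (hypothetical names, not fairseq's actual training loop) would be inside the per-epoch batch loop:

# Hypothetical call site for print_mem (not fairseq's actual training loop).
epoch = 1
batches = range(1000)                  # placeholder for the real batch iterator
for bnum, batch in enumerate(batches):
    # ... run the real forward/backward/update step here ...
    if bnum % 16 == 15:                # matches the log below: batchnum 15, 31, 47, ...
        print(print_mem(epoch, bnum))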

The training log is as follows:
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, arch='transformer_vaswani_wmt_en_de_big', attention_dropout=0.0, clip_norm=0.0, criterion='label_smoothed_cross_entropy', data='data-bin/wmt14_en_de_joined_dict', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, device_id=0, distributed_backend='nccl', distributed_init_method='tcp://localhost:16693', distributed_port=-1, distributed_rank=0, distributed_world_size=4, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fp16=False, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=1000, max_update=0, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, optimizer='adam', raw_text=False, relu_dropout=0.0, restore_file='checkpoint_last.pt', save_dir='./checkpoints/transformer_vaswani_wmt_en_de_big/mem_single_node_multi_4gpus/', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', train_subset='train', update_freq=[16], valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
| [en] dictionary: 32768 types
| [de] dictionary: 32768 types
Load dataset splits
| data-bin/wmt14_en_de_joined_dict train 4528446 examples
| data-bin/wmt14_en_de_joined_dict valid 3000 examples
Build model and criterion
| model transformer_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 209911808
| training on 4 GPUs
| max tokens per GPU = 1000 and max sentences per GPU = None
iteration: 1 batchnum 15 memory use: 3760.8203125MB
iteration: 1 batchnum 31 memory use: 3761.0703125MB
iteration: 1 batchnum 47 memory use: 3761.12109375MB
iteration: 1 batchnum 63 memory use: 3761.234375MB
iteration: 1 batchnum 79 memory use: 3761.3203125MB
iteration: 1 batchnum 95 memory use: 3761.40625MB
iteration: 1 batchnum 111 memory use: 3761.453125MB
iteration: 1 batchnum 127 memory use: 3761.5546875MB
iteration: 1 batchnum 143 memory use: 3761.62890625MB
iteration: 1 batchnum 159 memory use: 3761.69140625MB
iteration: 1 batchnum 175 memory use: 3761.7578125MB
iteration: 1 batchnum 191 memory use: 3761.87109375MB
iteration: 1 batchnum 207 memory use: 3761.92578125MB
iteration: 1 batchnum 223 memory use: 3762.00390625MB
iteration: 1 batchnum 239 memory use: 3762.08203125MB
iteration: 1 batchnum 255 memory use: 3762.16015625MB
iteration: 1 batchnum 271 memory use: 3762.2421875MB
iteration: 1 batchnum 287 memory use: 3762.3203125MB
iteration: 1 batchnum 303 memory use: 3762.390625MB
iteration: 1 batchnum 319 memory use: 3762.47265625MB
iteration: 1 batchnum 335 memory use: 3762.5546875MB
iteration: 1 batchnum 351 memory use: 3762.63671875MB
iteration: 1 batchnum 367 memory use: 3762.7109375MB
iteration: 1 batchnum 383 memory use: 3762.75390625MB
iteration: 1 batchnum 399 memory use: 3762.828125MB
iteration: 1 batchnum 415 memory use: 3762.87890625MB
iteration: 1 batchnum 431 memory use: 3762.98828125MB
iteration: 1 batchnum 447 memory use: 3763.06640625MB
iteration: 1 batchnum 463 memory use: 3763.12890625MB
iteration: 1 batchnum 479 memory use: 3763.17578125MB
iteration: 1 batchnum 495 memory use: 3763.265625MB
iteration: 1 batchnum 511 memory use: 3763.33203125MB
iteration: 1 batchnum 527 memory use: 3763.4296875MB
iteration: 1 batchnum 543 memory use: 3763.47265625MB
iteration: 1 batchnum 559 memory use: 3763.5625MB
iteration: 1 batchnum 575 memory use: 3763.640625MB
iteration: 1 batchnum 591 memory use: 3763.72265625MB
iteration: 1 batchnum 607 memory use: 3763.81640625MB
iteration: 1 batchnum 623 memory use: 3763.8828125MB
iteration: 1 batchnum 639 memory use: 3763.94921875MB
iteration: 1 batchnum 655 memory use: 3764.02734375MB
iteration: 1 batchnum 671 memory use: 3764.125MB
iteration: 1 batchnum 687 memory use: 3764.16015625MB
iteration: 1 batchnum 703 memory use: 3764.3046875MB
iteration: 1 batchnum 719 memory use: 3764.33984375MB
iteration: 1 batchnum 735 memory use: 3764.4296875MB
iteration: 1 batchnum 751 memory use: 3764.51171875MB
iteration: 1 batchnum 767 memory use: 3764.578125MB
iteration: 1 batchnum 783 memory use: 3764.66015625MB
iteration: 1 batchnum 799 memory use: 3764.72265625MB
iteration: 1 batchnum 815 memory use: 3764.796875MB
iteration: 1 batchnum 831 memory use: 3764.86328125MB
iteration: 1 batchnum 847 memory use: 3764.9375MB
iteration: 1 batchnum 863 memory use: 3765.02734375MB
iteration: 1 batchnum 879 memory use: 3765.125MB
iteration: 1 batchnum 895 memory use: 3765.17578125MB
iteration: 1 batchnum 911 memory use: 3765.2890625MB
iteration: 1 batchnum 927 memory use: 3765.3359375MB
iteration: 1 batchnum 943 memory use: 3765.375MB
iteration: 1 batchnum 959 memory use: 3765.44140625MB
iteration: 1 batchnum 975 memory use: 3765.5078125MB
iteration: 1 batchnum 991 memory use: 3765.58984375MB
| epoch 001: 1000 / 46210 loss=14.347, nll_loss=14.185, ppl=18624.06, wps=16255, ups=0.3, wpb=47786, bsz=1585, num_updates=62, lr=7.84845e-06, gnorm=3.188, clip=100%, oom=0, wall=182
iteration: 1 batchnum 1007 memo

Most helpful comment

After using torch.cuda.memory_allocated() and torch.cuda.memory_cached() to log GPU memory, I find that most of the GPU memory held by PyTorch is unoccupied cached memory. I call torch.cuda.empty_cache() to release that memory after each batch finishes, and the memory no longer increases.
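
A minimal sketch of this workaround, assuming a generic PyTorch training loop (the model, optimizer, and data here are placeholders, not fairseq's own objects):

import torch
import torch.nn as nn

# Placeholder setup; the real issue uses fairseq's transformer and Adam optimizer.
model = nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

for bnum in range(100):                       # placeholder batch loop
    x = torch.randn(64, 512, device='cuda')   # placeholder batch
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Log how much memory is actually in use vs. held by the caching allocator.
    allocated = torch.cuda.memory_allocated() / 2 ** 20
    cached = torch.cuda.memory_cached() / 2 ** 20   # memory_reserved() in newer PyTorch
    print('batch {}: allocated {:.1f}MB, cached {:.1f}MB'.format(bnum, allocated, cached))

    # Return unoccupied cached blocks to the driver after each batch,
    # as described in the comment above.
    torch.cuda.empty_cache()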

All 11 comments

What version of PyTorch are you using? I've noticed the same issue (different fairseq model, self_att_wp) this morning after upgrading to 0.4.1, while the same model trained fine on 0.4.0, so I think it's a recent regression.

@mjc14, that's measuring system memory, not GPU memory, correct?

@hmc-cs-mdrissi, are you also referring to system memory or GPU memory?

@myleott, I'm referring to GPU memory. I end up getting a CUDA out-of-memory error after a while, even if I decrease num_tokens.

Edit: another thing that could be the underlying source: when upgrading to 0.4.1, I also upgraded CUDA from 9.0 to 9.2 and cuDNN to 7.1.4.

Edit 2: downgrading still gives me many more memory warnings than yesterday, but it no longer crashes. I downgraded CUDA 9.2 -> 9.0 and PyTorch 0.4.1 -> 0.4.0, so my guess is the memory leak lies in one of those two. The only other things that could be relevant are that I removed opencv (I doubt an image library would be used here) and upgraded my NVIDIA drivers. My computer is currently inaccessible, but later today I'll try downgrading the NVIDIA drivers to see if that fixes the issue.

@myleott Yes, it's system memory, not GPU memory. My PyTorch is 0.4.0.

After downgrading everything, no more memory issues. So at least one of PyTorch 0.4.1, CUDA 9.2, and the 396.45 NVIDIA drivers has a GPU memory leak/regression on the self_att_wp model. The downgraded versions are PyTorch 0.4.0, CUDA 9.0, and the 384.130 NVIDIA drivers.

Yeah, I also meet the same problem; the memory seems to grow when the batch number gets bigger.
E.g. if I use batch = 100 and run many batches, the memory grows only once, but if I use a larger batch number like 120, the memory grows.
This may not be right, just my impression.
I use the pytorch docker image 0.4.1-cuda9-cudnn7-devel,
nvidia-driver version: 390.12

I tried the 0.4 pytorch docker image, but it does not help.

After using torch.cuda.memory_allocated() and torch.cuda.memory_cached() to log GPU memory, I find that most of the GPU memory held by PyTorch is unoccupied cached memory. I call torch.cuda.empty_cache() to release that memory after each batch finishes, and the memory no longer increases.


This seems to fix the issue, but I wonder if it adds overhead, and whether there is a proper way to reuse the allocated cache?

torch.cuda.empty_cache() solves the issue for me too, thank you ocean1992. I still wonder why that happens. Does anyone have a hypothesis?

It's not working for me. I still have tensors stored in GPU memory that are not being cleared out between batches, and I don't store them anywhere as a variable, so I don't know what the issue is. Is it necessary, after calling to('cuda') on a batch of tensors and feeding them into a model, to then call to('cpu') on the output or something like that? If I try to use the CPU instead, that also eats up all 32 GB of my system RAM, so clearly calling a resnet model on batches of tensors is keeping intermediate results from different layers of the model?

I'm new to Pytorch, so probably I've made a simple mistake somewhere, like forgetting to deactivate some caching mechanism.

Edit: never mind, I figured it out: I hadn't disabled gradient computations, and I hadn't put the model in eval mode.
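
For reference, a minimal sketch of the inference pattern that edit describes, using a torchvision resnet as a stand-in for the commenter's model (the input batches are placeholders):

import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).cuda()
model.eval()                     # put dropout/batch-norm layers in inference mode

# Placeholder data: in practice this would be a DataLoader over real images.
batches = [torch.randn(8, 3, 224, 224) for _ in range(4)]

features = []
with torch.no_grad():            # disable autograd so intermediate activations are freed
    for batch in batches:
        out = model(batch.cuda())
        features.append(out.cpu())   # move results off the GPU if they need to be kept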

@ctivanovich did you ever solve the problem? I am having the exact same problem and I'm not sure what I am doing wrong here. GPU usage is blowing up after every batch and then clearing up.
