Transformers: Multi-GPU Trainer: each process uses more memory than a 1-GPU job

Created on 1 Oct 2020 · 4 comments · Source: huggingface/transformers

I tried to run an 8-GPU training job and it OOM'd, so I investigated whether it could run on 1 GPU. It could!

So here are the two commands. The first reports 14814MiB in use on GPU 0; the second reports 15594MiB on each GPU.

This doesn't happen in PyTorch Lightning (PL), which leads me to believe that distributed_scalars is to blame, but I am not sure. Has anyone run into this?
cc @sgugger @patil-suraj
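
For reference, a quick way to confirm what each process actually allocates (a sketch, not part of finetune_trainer.py; the helper name is made up) is to log peak CUDA memory per rank after a few steps:

import torch
import torch.distributed as dist

def log_peak_memory(tag=""):
    # Peak memory held by tensors through the caching allocator; nvidia-smi
    # reports more because it also counts the CUDA context and allocator cache.
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    peak_mib = torch.cuda.max_memory_allocated() / 2**20
    print(f"[rank {rank}] {tag} peak allocated: {peak_mib:.0f} MiB")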

Delta between the two commands

- CUDA_VISIBLE_DEVICES=0 python
+ CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2

Command 1

CUDA_VISIBLE_DEVICES=0 python finetune_trainer.py \
    --model_name_or_path student_pegasus_cnn_12_2 \
    --data_dir cnn_dm \
    --output_dir dpx_cnn_12_2_pl_comb_noDO --overwrite_output_dir --freeze_embeds \
    --learning_rate=3e-5 \
    --warmup_steps 500 --sortish_sampler \
    --gradient_accumulation_steps=4 \
    --per_device_train_batch_size=4 --per_device_eval_batch_size=8 --eval_beams 2 \
    --num_train_epochs=5 \
    --save_steps 3000 --eval_steps 3000 \
    --logging_first_step \
    --max_target_length 56 --val_max_target_length 142 --test_max_target_length 142 \
    --do_train --do_eval --do_predict --evaluate_during_training \
    --predict_with_generate --load_best_model_at_end

Command 2

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 finetune_trainer.py \
    --model_name_or_path student_pegasus_cnn_12_2 \
    --data_dir cnn_dm \
    --output_dir dpx_cnn_12_2_pl_comb_noDO --overwrite_output_dir --freeze_embeds \
    --learning_rate=3e-5 \
    --warmup_steps 500 --sortish_sampler \
    --gradient_accumulation_steps=4 \
    --per_device_train_batch_size=4 --per_device_eval_batch_size=8 --eval_beams 2 \
    --num_train_epochs=5 \
    --save_steps 3000 --eval_steps 3000 \
    --logging_first_step \
    --max_target_length 56 --val_max_target_length 142 --test_max_target_length 142 \
    --do_train --do_eval --do_predict --evaluate_during_training \
    --predict_with_generate --load_best_model_at_end
Labels: Distributed Training / Models

All 4 comments

This seems to be a duplicate of #7169. I'm finishing something up and will investigate now that I have a proper multi-GPU setup.

Yes, most likely a duplicate.
One clue might be the call to DataParallel right before the call to DistributedDataParallel.
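
If that is the culprit, the intended wrapping logic would look roughly like this sketch (not the actual Trainer code): in a distributed run each process should go straight to DistributedDataParallel, and DataParallel should only be used for a single process driving several GPUs.

import torch

def wrap_model(model, args):
    if args.local_rank != -1:
        # One process per GPU: DistributedDataParallel only.
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank
        )
    elif args.n_gpu > 1:
        # Single process driving several GPUs: plain DataParallel.
        model = torch.nn.DataParallel(model)
    return model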

Investigation seems to lead to this conclusion: it is normal to see slightly more memory use per GPU in distributed mode, since PyTorch keeps two copies of the gradients in that case. See this issue.
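
If that explanation holds, the overhead should be roughly one extra copy of the trainable gradients per process. A back-of-the-envelope check (a sketch; the model path is the one from the commands above):

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("student_pegasus_cnn_12_2")
grad_bytes = sum(p.numel() * p.element_size() for p in model.parameters() if p.requires_grad)
# DDP keeps flattened gradient buckets in addition to each param.grad, so the
# extra per-GPU memory is roughly one more fp32 copy of the trainable gradients.
print(f"extra gradient copy: ~{grad_bytes / 2**20:.0f} MiB")

Comparing that number with the observed gap (15594 - 14814 = 780MiB) gives a rough sanity check.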

You are correct, thank you for investigating. The difference was my use of --adafactor, which saves about 3GB.
I added that option to Seq2SeqTrainer. Take it if you want it :)
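
For anyone who wants the saving without the flag, here is a sketch of what --adafactor effectively switches to (the keyword values are assumptions, not necessarily the Trainer's defaults): Adafactor stores factored second-moment statistics, so its optimizer state is far smaller than AdamW's two full-size moment tensors.

from transformers import AutoModelForSeq2SeqLM
from transformers.optimization import Adafactor

model = AutoModelForSeq2SeqLM.from_pretrained("student_pegasus_cnn_12_2")
optimizer = Adafactor(
    model.parameters(),
    lr=3e-5,                 # matches --learning_rate above
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
# The optimizer can be handed to Seq2SeqTrainer through its `optimizers`
# argument, e.g. Seq2SeqTrainer(..., optimizers=(optimizer, lr_scheduler)).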
