PyTorch Lightning: save_checkpoint consumes many GBs of GPU RAM

Created on 12 Nov 2020 · 6 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

This is a split-off from https://github.com/huggingface/transformers/issues/8403, where a much larger memory leak is discussed (it has to do with native AMP usage in particular circumstances). (Note: there was also an attempt to move that discussion to https://github.com/PyTorchLightning/pytorch-lightning/issues/4614, but the original thread prevailed.)

While investigating that, I found another, unrelated, strangely large GPU memory requirement in PL. It happens during save_checkpoint: stepping through it with a debugger, I observe GPU memory growing from 2GB (at the end of training) to 5GB.

It's very difficult to debug PL due to the dozens of indirections :( Here is the stack trace (in v1.0.6):

dump_checkpoint, checkpoint_connector.py:337
save_checkpoint, checkpoint_connector.py:389
save_checkpoint, properties.py:207
_save_model, model_checkpoint.py:333
_update_best_and_save, model_checkpoint.py:583
_save_top_k_checkpoints, model_checkpoint.py:534
save_checkpoint, model_checkpoint.py:232
on_validation_end, model_checkpoint.py:186
on_validation_end, callback_hook.py:177
call_hook, trainer.py:833
on_evaluation_end, evaluation_loop.py:109
run_evaluation, trainer.py:620
run_training_epoch, training_loop.py:589
train, trainer.py:493
train_or_test, accelerator.py:74
train, gpu_accelerator.py:63
fit, trainer.py:444
generic_train, lightning_base.py:398
main, finetune.py:413
<module>, finetune.py:446

So GPU RAM grows by about 1GB through dump_checkpoint, and then jumps by another 2GB at:

        # give the model a chance to dump a few things
        model.on_save_checkpoint(checkpoint)

Could this be done on CPU instead? This is checkpoint saving; in theory it shouldn't need any extra GPU RAM.

Ideally this function should need 0 extra GPU RAM. Otherwise one may have a successful training run and yet the program OOMs at the very end during checkpoint saving, which is far from ideal; and if it's an intermediate save, it's definitely a problem.
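To illustrate the idea (a minimal sketch only, not PL's actual implementation, and assuming the extra allocations come from building the checkpoint dict on the GPU; save_checkpoint_on_cpu, model, optimizer and filepath are hypothetical placeholders), the checkpoint tensors could be copied to CPU before serialization so torch.save never needs additional GPU memory:

    import torch

    def save_checkpoint_on_cpu(model, optimizer, filepath):
        # Copy every model tensor to CPU before building the checkpoint dict,
        # so that serialization allocates no extra GPU memory.
        checkpoint = {
            "state_dict": {k: v.detach().cpu() for k, v in model.state_dict().items()},
            # Note: optimizer state may still contain CUDA tensors; a complete
            # solution would move those to CPU as well.
            "optimizer": optimizer.state_dict(),
        }
        torch.save(checkpoint, filepath)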

The setup where I discovered this is identical to the one in the first post of https://github.com/huggingface/transformers/issues/8403, which some of you have already replicated while debugging https://github.com/PyTorchLightning/pytorch-lightning/issues/4614.

Environment

* CUDA:
        - GPU:
                - GeForce RTX 3090
                - GeForce GTX 1070 Ti
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.19.2
        - pyTorch_debug:     False
        - pyTorch_version:   1.6.0
        - pytorch-lightning: 1.0.6
        - tqdm:              4.51.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.8.5
        - version:           #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020
Labels: bug / fix, help wanted

All 6 comments

@stas00 mind sharing an example to reproduce, and/or some more details about your training:

  • which distributed backend you used
  • have you used AMP for training

it would also be interesting to know how much memory it took on GPU/CPU while training with batch size 1

As the OP mentions, this is a split-off from https://github.com/huggingface/transformers/issues/8403 (partially replayed at https://github.com/PyTorchLightning/pytorch-lightning/issues/4614), where a huge memory leak was discovered. While trying to isolate the cause by stepping through with a debugger, I found this one as well, and it was suggested I post it separately. So all the reproduction details are in https://github.com/huggingface/transformers/issues/8403; if it would help to re-paste them here, please let me know. I know at least @SeanNaren has already set up the reproducible case as described in https://github.com/PyTorchLightning/pytorch-lightning/issues/4614 while trying to help debug the issue.

Oh I see: in this issue I had linked to the issue you created, which has no reproduction details. I've fixed that to link to the source.

which distributed backend you used

None - single GPU (gpu_accelerator.py)

have you used AMP for training

--fp16 (native amp)

it would also be interesting to know how much memory it took on GPU/CPU while training with batch size 1

As the OP says, it was 2GB at the end of training and it jumped to 5GB during save_checkpoint; hence this issue. dump_checkpoint is where the majority of this allocation happened. Note, I haven't tried gc.collect() + torch.cuda.empty_cache() after save_checkpoint finished, so it's possible that some of this was allocator cache (a quick way to check is sketched below); but even if it is, a trainer that allocates 2.5x more GPU RAM once training is over will certainly OOM on a tight setup.
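For reference, a generic PyTorch snippet (not specific to PL) that distinguishes memory held by live tensors from memory merely cached by the allocator, which is what the gc.collect() + empty_cache() check above would reveal:

    import gc
    import torch

    def report_gpu_memory(tag=""):
        # memory_allocated(): bytes held by live tensors
        # memory_reserved():  bytes held by the caching allocator (includes cache)
        alloc_mb = torch.cuda.memory_allocated() >> 20
        reserved_mb = torch.cuda.memory_reserved() >> 20
        print(f"{tag}: allocated={alloc_mb}MB reserved={reserved_mb}MB")

    report_gpu_memory("before cleanup")
    gc.collect()                 # drop unreachable Python objects still holding tensors
    torch.cuda.empty_cache()     # return cached blocks to the driver
    report_gpu_memory("after cleanup")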

If you need any other details please let me know.

Thank you!

@SeanNaren how far have you got with this leak debugging?

AFAIK @SeanNaren only worked on and reproduced the main part of https://github.com/huggingface/transformers/issues/8403, which has now been diagnosed: the problem comes neither from PL nor from transformers, but from PyTorch's autocast caching mechanism. That said, it's very likely that whatever the solution turns out to be, one or both projects will have to add some extra code to prevent such situations in the future. So we are waiting for the PyTorch developers to tell us what's going on and how to move forward.
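As an aside, and purely as an illustration rather than a confirmed fix: PyTorch does expose torch.clear_autocast_cache(), which drops the fp16 weight copies that autocast caches, so if that cache turns out to be the culprit a workaround might look roughly like this (the placement is hypothetical):

    import torch

    # Hypothetical placement: after the autocast region of a training/validation
    # step, drop the fp16 parameter copies that autocast has cached ...
    torch.clear_autocast_cache()
    # ... and optionally hand the now-unused cached blocks back to the driver.
    torch.cuda.empty_cache()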

@stas00 is this tied to fp16 or completely separate?

Good question, @SeanNaren - it appears to be fp16 related.

Here is a quick reproduction of the problem w/o needing any complex debugger setup.

  1. I applied this patch:
diff --git a/pytorch_lightning/callbacks/model_checkpoint.py b/pytorch_lightning/callbacks/model_checkpoint.py
index d257e1ea..3ec381e0 100644
--- a/pytorch_lightning/callbacks/model_checkpoint.py
+++ b/pytorch_lightning/callbacks/model_checkpoint.py
@@ -181,7 +181,10 @@ class ModelCheckpoint(Callback):
         """
         checkpoints can be saved at the end of the val loop
         """
+        print(f"before: peak {torch.cuda.memory_stats()['allocated_bytes.all.peak']>>20}MB")
         self.save_checkpoint(trainer, pl_module)
+        print(f"after: peak {torch.cuda.memory_stats()['allocated_bytes.all.peak']>>20}MB")
+        torch.cuda.reset_peak_memory_stats()

     def on_save_checkpoint(self, trainer, pl_module) -> Dict[str, Any]:
         return {
  2. Now run with apex:
cd examples/seq2seq
PYTHONPATH="../../src" CUDA_VISIBLE_DEVICES=0 python  -W ignore finetune.py --learning_rate 3e-5 --gpus 1 --do_train --val_check_interval 1 --num_train_epochs 1 --freeze_encoder --freeze_embeds --data_dir cnn_dm --max_target_length 142 --val_max_target_length 142 --train_batch_size 1 --eval_batch_size 1 --gradient_accumulation_steps 1 --model_name_or_path sshleifer/student_cnn_12_6 --tokenizer_name facebook/bart-large --warmup_steps 1 --output_dir distilbart-cnn-12-6 --overwrite_output_dir --num_sanity_val_steps 0 --n_train 1 --n_val 1 --fp16 --amp_backend=apex
Epoch 0: 
before: peak 1547MB
after: peak 4279MB

Epoch 0: 
before: peak 1162MB
after: peak 1162MB

As you can see, the first time on_validation_end runs it allocates an additional ~2.7GB.

(I removed the tqdm noise)

With native AMP I can't easily test at the moment, since it consumes 10x the GPU RAM and the peak is already at 19201MB, so a 3GB temporary fluctuation won't show up on the radar; it would require a different kind of memory tracing (a sketch follows below). But since the problem is easy to see with apex, that's probably good enough for now.
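For completeness, a sketch of that different kind of tracing (a hypothetical helper, not what produced the numbers above): reset the peak counter immediately before the save, so an earlier, much higher training peak cannot mask the transient spike caused by saving.

    import torch

    def measure_save_peak(save_fn, *args, **kwargs):
        # Reset the peak counter right before the save so the (much higher)
        # training peak does not hide the transient spike caused by saving.
        torch.cuda.reset_peak_memory_stats()
        base = torch.cuda.memory_allocated()
        save_fn(*args, **kwargs)
        peak = torch.cuda.max_memory_allocated()
        print(f"transient peak during save: {(peak - base) >> 20}MB above the pre-save level")

    # e.g. inside ModelCheckpoint.on_validation_end:
    #     measure_save_peak(self.save_checkpoint, trainer, pl_module)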

Without fp16 I get:

Epoch 0: 
before: peak 2464MB
after: peak 2464MB
Epoch 0: 1
before: peak 2323MB
after: peak 2323MB

So no peak at all for save_checkpoint.

So it does point to fp16 as the culprit.
