On trying to fine-tune either T5 or BART models for summarization, I was repeatedly hitting OOM in the latest code, whereas it used to work fine for me earlier, at least on Google Colab.
On checking the startup scripts and the latest commits, I saw that optimizations for native PyTorch fp16 support were added recently. After removing the fp16 parameter from the script, it started working as expected.
Could you check whether this is a real issue, or just a dangling parameter that needs to be removed?
Thanks
@sshleifer @patil-suraj
try passing --fp16 --fp16_opt_level=O1
that is a relevant default that has changed. I have also experienced some torch 1.6 issues, so would love to know if that helps.
Semi-relatedly, a good practice is to run
!pip freeze | grep transformers
!pip freeze | grep torch
at the top of colab so that when you go back you can know what version you were on.
Thanks for the quick reply! I tried this and it didn't work though :(
Removing the fp16 parameter for now and fine-tuning.
Will keep the colab advice in mind :)
+1 on this. Using fp16 with opt level O1 or O2 causes OOM in both cases, even with batch size 1. Without fp16, fine-tuning works.
Torch 1.6.0, transformers 3.0.2, Linux, V100 GPU.
This is a torch 1.6 issue.
I haven't gotten anything working well with torch 1.6 + fp16.
torch 1.5.1 with apex installed works well for me.
I tried running fp16 training with amp_backend=apex and amp_backend=native (passing them as additional args), and the latter does much better in terms of power consumption, but memory consumption is the same for both (per the wandb GPU graphs). However, both of them OOM during the validation step. This may have something to do with beam search, since my validation batch size is 1.
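For reference, a minimal sketch of the two configurations being compared (assuming a PyTorch Lightning version where amp_backend/amp_level are Trainer arguments and a single GPU is available; this is only an illustration, not the exact finetune.py wiring):

from pytorch_lightning import Trainer

# apex backend at opt level O1 vs. PyTorch-native AMP, both at 16-bit precision
apex_trainer = Trainer(gpus=1, precision=16, amp_backend="apex", amp_level="O1")
native_trainer = Trainer(gpus=1, precision=16, amp_backend="native")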

Can you try torch 1.5.1 ?
It succeeds with 1.5.1, and power and temperature are in line with the native backend.
However, the process failing during generation on 1.6.0 suggests that some optimization is missing during the generation steps, which causes the OOM.
Another thing to note that might be related: validation (400 samples) takes 5x the time of one epoch of training (2400 samples). Even accounting for beam size (4x), it is much slower.
Interesting! I would definitely be open to a PR here if you have a fix in mind!
Thanks! I have a couple ideas and will try them out and create a PR if any of them works.
I think the problem is that the _generative_step method calls _step internally, causing two forward passes within each validation step. Also, model.generate is inherently slower than an eval forward pass, even with num_beams=1, by about 30-60x. But this is a different problem from the OOM issue on 1.6.0. Maybe this should be split into a separate issue?
The problem is with model.generate, which causes OOM on PyTorch 1.6. I switched to a custom validation_step that only uses _step and does not call model.generate; it succeeds and is fast. The drawback is that I cannot use beam search during validation, and I have to keep do_predict set to False so the test step does not execute. All of these are acceptable limitations to me in exchange for faster validation, no OOM during validation, and being able to use native fp16 with PyTorch 1.6.0.
I'm happy to create a PR for it if it makes sense to check it in.
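For concreteness, a rough sketch of the workaround described above (class and attribute names such as SummarizationModule, _step, and loss_names follow examples/seq2seq/finetune.py, and the import assumes that file is on the path; treat this as an illustration rather than the exact PR):

from finetune import SummarizationModule

class FastValSummarizationModule(SummarizationModule):
    # Override validation_step so validation only runs a teacher-forced forward
    # pass via _step and never calls model.generate / beam search.
    def validation_step(self, batch, batch_idx):
        loss_tensors = self._step(batch)
        return dict(zip(self.loss_names, loss_tensors))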
That PR would be interesting. More interesting would be figuring out why generate OOMs in these conditions.
Definitely the question for why generate OOMs is interesting but one I haven't found an answer for yet. I suggested a workaround in #7004 using the fix I described earlier.
OK, I'm gonna try to fix the underlying issue today/tomorrow, and if I fail, we'll move to your PR.
Thanks!
Does anyone have a snippet that replicates the OOM outside of colab?
I have no trouble running examples/seq2seq/test_bash_script.py on self hosted hardware in torch 1.6.
The issue wasn't on Colab but on AWS.
What was your command/hardware?
Command: python script.py ... with a bunch of args (I have a custom wrapper that inherits from SummarizationModule for initialization and adds extra args; I did not modify train / eval / test in it, so it should be identical to running python finetune.py from finetune.sh).
GPU: V100-SXM2-16GB.
I can replicate on v100 with cuda 10.1, torch 1.6, python 3.7.
The problem is that during the first call to generate (during the validation sanity check) the untrained model generates config.max_length tokens, causing OOM.
Easiest fix is adding --num_sanity_val_steps=0 to your command. LMK if that works.
The linked PR above lets the user limit how many tokens are generated during validation, which may be independently helpful.
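To illustrate the point (a hedged sketch, not the exact code path in finetune.py): an untrained or randomly-initialized decoder rarely emits EOS, so generate runs all the way to config.max_length, and memory grows roughly with num_beams times the generated length. Capping the length during validation bounds that. Here, batch is assumed to be a tokenized validation batch and eval_max_gen_length is an illustrative cap analogous to the PR's flag:

import torch

eval_max_gen_length = 56  # illustrative cap for validation-time generation
with torch.no_grad():
    generated_ids = model.generate(
        batch["input_ids"].to("cuda"),
        num_beams=1,
        max_length=min(eval_max_gen_length, model.config.max_length),
    )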
Hmm, that makes sense. I'll also note that in the screenshots I attached earlier, the OOM occurred at the end of the first epoch during validation, so setting that new flag should help with that. It is a tricky trade-off, though, to set a max_length for the generate steps that differs from the model's expected output length. I prefer using a forward pass's output (my PR #7004) as a substitute for the runtime output, since it uses the correct max_length, rather than a shorter output that happens to fit in memory at the time.
However, this still does not explain the average generate time being 30-60x longer per batch (with equal batch sizes for training and validation).
Lastly, the same model does generate without producing an OOM at run-time on similar hardware, producing up to max_length tokens, which continues to baffle me.
I don't have any slowdown after the validation sanity check in my replication. Maybe I haven't found your bug.
I don't understand:
> Lastly, the same model does generate without producing an OOM at run-time on similar hardware, producing up to max_length tokens, which continues to baffle me.
Do you have a snippet that does not involve finetune.py (just calls generate) that OOMs/is way slower?
> I don't have any slowdown after the validation sanity check in my replication. Maybe I haven't found your bug.
Or maybe it got resolved between when I tested it and this version. No worries.
> Do you have a snippet that does not involve finetune.py (just calls generate) that OOMs/is way slower?
This might not matter anymore if the previous issue is fixed during training, since this one happens specifically at runtime. Regardless, here's an example I have that is much slower.
%%timeit
with torch.no_grad():
    generated_ids = model.generate(
        tokenized_input["input_ids"].to("cuda"),
        skip_special_tokens=True, clean_up_tokenization_spaces=False,
        num_beams=3, top_p=0.9, repetition_penalty=10,
        decoder_start_token_id=model.config.decoder_start_token_id,
        max_length=model.config.max_length)
which produces:
10.4 s ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The sequence produced is of length 311. While the output sequence length is long (max possible is 768), 10 seconds is still quite a lot.
can you send a full working example that I can copy paste and try in different torch versions?
Sure! I'm using a fine-tuned model and a custom dataset, so I changed the snippet below to bart-large and removed the lines where the dataset is queried. Everything else is the same.
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
import torch

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
model = model.to("cuda")
model = model.eval()

tokenized_input = tokenizer(..., return_tensors="pt", max_length=model.config.max_position_embeddings)

with torch.no_grad():
    generated_ids = model.generate(
        tokenized_input["input_ids"].to("cuda"),
        skip_special_tokens=True, clean_up_tokenization_spaces=False,
        num_beams=3, top_p=0.9, repetition_penalty=10,
        decoder_start_token_id=model.config.decoder_start_token_id,
        max_length=model.config.max_length)
I'm running this in a notebook so I can time profile the generate step.
I've spent quite some time today trying various combinations of generate arguments. The increased delay comes from a large num_beams, which also leads the model to produce longer outputs, compounding the generate time (roughly num_beams * max_length).
In conclusion, this doesn't appear to be a bug, but rather a property of generate simply being expensive in both time and memory.
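For anyone who wants to measure this themselves, here is a small hedged sketch (assuming model and input_ids are already on the GPU) that times a single generate call for different beam counts:

import time
import torch

def time_generate(model, input_ids, num_beams, max_length):
    # rough wall-clock timing of one generate call
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        model.generate(input_ids, num_beams=num_beams, max_length=max_length)
    torch.cuda.synchronize()
    return time.time() - start

# e.g. compare greedy decoding against beam search on the same input:
# for beams in (1, 3, 6):
#     print(beams, time_generate(model, input_ids, beams, model.config.max_length))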
@sshleifer, I have the same issue, and I am using the latest version, which includes the PR you provided. I set eval_max_gen_length to 30 and am still getting OOM during the sanity check. Do I also have to set num_sanity_val_steps=0?
@vikigenius: I don't think setting num_sanity_val_steps to 0 will solve it, since it will only delay what's bound to happen during the validation step later.
--num_sanity_val_steps=0 fixes it for me.
generate (without a super high config.min_length / eval_max_gen_length) doesn't OOM for me.
--eval_num_beams=2 may also help save memory.
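As a quick sanity check before launching a long run, it can also help to inspect the length-related config defaults that drive validation-time generate memory (a minimal sketch; the attributes shown are standard config fields, and the checkpoint name is just an example):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/bart-large")
print("max_length:", config.max_length)
print("min_length:", config.min_length)
print("num_beams:", config.num_beams)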