On trying to fine-tune either T5 or BART models for summarization, I was repeatedly hitting OOM in the latest code, whereas it used to work fine for me earlier, at least on Google Colab.
On checking the startup scripts and the latest commits, I saw that optimizations for native PyTorch fp16 support were added recently. After removing the fp16 parameter from the script, it started working as expected.
Could you check whether this is a real issue, or just a dangling parameter that needs to be removed?
Thanks
@sshleifer @patil-suraj
try passing --fp16 --fp16_opt_level=O1
that is a relevant default that has changed. I have also experienced some torch 1.6 issues, so would love to know if that helps.
Semi-relatedly, a good practice is to run
!pip freeze | grep transformers
!pip freeze | grep torch
at the top of colab so that when you go back you can know what version you were on.
Thanks for the quick reply! I tried this and it didn't work though :(
Removing the fp16 parameter for now and fine-tuning.
Will keep the colab advice in mind :)
+1 on this. Using fp16 with opt level O1 or O2 causes OOM in both cases, even with batch size 1. Without fp16, fine-tuning works.
Torch 1.6.0, transformers 3.0.2, Linux, V100 GPU.
This is a torch 1.6 issue.
I haven't gotten anything working well with torch 1.6 + fp16.
torch 1.5.1 with apex installed works well for me.
I tried running fp16 training with amp_backend=apex and amp_backend=native (passing them as additional args), and the latter does much better in terms of power consumption, but memory consumption is the same for both (per the wandb GPU graphs). However, both of them OOM during the validation step. This may have something to do with beam search, since my validation batch size is 1.
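For reference, a minimal sketch of the two configurations being compared (assuming a PyTorch Lightning version where amp_backend/amp_level are Trainer arguments and a single GPU is available; this is only an illustration, not the exact finetune.py wiring):

from pytorch_lightning import Trainer

# apex backend at opt level O1 vs. PyTorch-native AMP, both at 16-bit precision
apex_trainer = Trainer(gpus=1, precision=16, amp_backend="apex", amp_level="O1")
native_trainer = Trainer(gpus=1, precision=16, amp_backend="native")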

Can you try torch 1.5.1 ?
It succeeds with 1.5.1, and power and temperature are in line with the native backend.
However, the process failing during generation on 1.6.0 suggests that some optimization is missing during the generation steps, which causes the OOM.
Another thing to note that might be related: validation (400 samples) takes 5x the time of one epoch of training (2400 samples). Even accounting for beam size (4x), it is much slower.
Interesting! I would definitely be open to a PR here if you have a fix in mind!
Thanks! I have a couple ideas and will try them out and create a PR if any of them works.
I think the problem is that the _generative_step method calls _step internally, causing two forward passes within each validation step. Also, model.generate is inherently slower than an eval forward pass, even with num_beams=1, by about 30-60x. But this is a different problem from the OOM issue on 1.6.0. Maybe this should be split into a separate issue?
The problem is with model.generate, which causes OOM on PyTorch 1.6. I switched to a custom validation_step that only uses _step and does not call model.generate; it succeeds and is fast. The drawback is that I cannot use beam search during validation, and I have to keep do_predict set to False so the test step does not execute. All of these are acceptable limitations to me in exchange for faster validation, no OOM during validation, and being able to use native fp16 with PyTorch 1.6.0.
I'm happy to create a PR for it if it makes sense to check it in.
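For concreteness, a rough sketch of the workaround described above (class and attribute names such as SummarizationModule, _step, and loss_names follow examples/seq2seq/finetune.py, and the import assumes that file is on the path; treat this as an illustration rather than the exact PR):

from finetune import SummarizationModule

class FastValSummarizationModule(SummarizationModule):
    # Override validation_step so validation only runs a teacher-forced forward
    # pass via _step and never calls model.generate / beam search.
    def validation_step(self, batch, batch_idx):
        loss_tensors = self._step(batch)
        return dict(zip(self.loss_names, loss_tensors))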
That PR would be interesting. More interesting would be figuring out why generate OOMs in these conditions.
Definitely the question for why generate OOMs is interesting but one I haven't found an answer for yet. I suggested a workaround in #7004 using the fix I described earlier.
OK, I'm gonna try to fix the underlying issue today/tomorrow, and if I fail, we'll move to your PR.
Thanks!
Does anyone have a snippet that replicates the OOM outside of colab?
I have no trouble running examples/seq2seq/test_bash_script.py on self hosted hardware in torch 1.6.
The issue wasn't on Colab but on AWS.
What was your command/hardware?
Command: python script.py ... with a bunch of args (I have a custom wrapper that inherits from SummarizationModule for initialization and adds extra args; I did not modify train / eval / test in it, so it should be identical to running python finetune.py from finetune.sh).
GPU: V100-SXM2-16GB.
I can replicate on v100 with cuda 10.1, torch 1.6, python 3.7.
The problem is that during the first call to generate (during the validation sanity check) the untrained model generates config.max_length tokens, causing OOM.
Easiest fix is adding --num_sanity_val_steps=0 to your command. LMK if that works.
The linked PR above lets the user limit how many tokens are generated during validation, which may be independently helpful.
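To illustrate the point (a hedged sketch, not the exact code path in finetune.py): an untrained or randomly-initialized decoder rarely emits EOS, so generate runs all the way to config.max_length, and memory grows roughly with num_beams times the generated length. Capping the length during validation bounds that. Here, batch is assumed to be a tokenized validation batch and eval_max_gen_length is an illustrative cap analogous to the PR's flag:

import torch

eval_max_gen_length = 56  # illustrative cap for validation-time generation
with torch.no_grad():
    generated_ids = model.generate(
        batch["input_ids"].to("cuda"),
        num_beams=1,
        max_length=min(eval_max_gen_length, model.config.max_length),
    )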
Hmm, that makes sense. I'll also note that in the screenshots I attached earlier, the OOM occurred at the end of the first epoch during validation, so setting that new flag should help with that. It is a tricky trade-off, though, to set a max_length for the generate steps that differs from the model's expected output length. I prefer using a forward pass's output (my PR #7004) as a substitute for the runtime output, since it uses the correct max_length, rather than a shorter output that happens to fit in memory at the time.
However, this still does not explain the average generate time being 30-60x longer per batch (with equal batch sizes for training and validation).
Lastly, the same model does generate without producing an OOM at run-time on similar hardware, producing up to max_length tokens, which continues to baffle me.
I don't have any slowdown after the validation sanity check in my replication. Maybe I haven't found your bug.
I don't understand:
> Lastly, the same model does generate without producing an OOM at run-time on similar hardware, producing up to max_length tokens, which continues to baffle me.
Do you have a snippet that does not involve finetune.py (just calls generate) that OOMs/is way slower?
> I don't have any slowdown after the validation sanity check in my replication. Maybe I haven't found your bug.
Or maybe it got resolved between when I tested it and this version. No worries.
> Do you have a snippet that does not involve finetune.py (just calls generate) that OOMs/is way slower?
This might not matter anymore if the previous issue is fixed during training, since this one happens specifically at runtime. Regardless, here's an example I have that is much slower.
%%timeit
with torch.no_grad():
    generated_ids = model.generate(
        tokenized_input["input_ids"].to("cuda"),
        skip_special_tokens=True, clean_up_tokenization_spaces=False,
        num_beams=3, top_p=0.9, repetition_penalty=10,
        decoder_start_token_id=model.config.decoder_start_token_id,
        max_length=model.config.max_length)
which produces:
10.4 s ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The sequence produced is of length 311. While the output sequence length is long (max possible is 768), 10 seconds is still quite a lot.
can you send a full working example that I can copy paste and try in different torch versions?
Sure! I'm using a fine-tuned model and a custom dataset, so I changed the snippet below to bart-large and removed the lines where the dataset is queried. Everything else is the same.
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
import torch

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
model = model.to("cuda")
model = model.eval()

tokenized_input = tokenizer(..., return_tensors="pt", max_length=model.config.max_position_embeddings)

with torch.no_grad():
    generated_ids = model.generate(
        tokenized_input["input_ids"].to("cuda"),
        skip_special_tokens=True, clean_up_tokenization_spaces=False,
        num_beams=3, top_p=0.9, repetition_penalty=10,
        decoder_start_token_id=model.config.decoder_start_token_id,
        max_length=model.config.max_length)
I'm running this in a notebook so I can time profile the generate step.
I've spent quite some time today trying various combinations of generate arguments. The increased delay comes from a large num_beams, which also leads the model to produce longer outputs, compounding the generate time (roughly num_beams * max_length).
In conclusion, this doesn't appear to be a bug, but rather a property of generate simply being expensive in both time and memory.
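For anyone who wants to measure this themselves, here is a small hedged sketch (assuming model and input_ids are already on the GPU) that times a single generate call for different beam counts:

import time
import torch

def time_generate(model, input_ids, num_beams, max_length):
    # rough wall-clock timing of one generate call
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        model.generate(input_ids, num_beams=num_beams, max_length=max_length)
    torch.cuda.synchronize()
    return time.time() - start

# e.g. compare greedy decoding against beam search on the same input:
# for beams in (1, 3, 6):
#     print(beams, time_generate(model, input_ids, beams, model.config.max_length))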
@sshleifer, I have the same issue, and I am using the latest version, which includes the PR you provided. I set eval_max_gen_length to 30 and am still getting OOM during the sanity check. Do I also have to set num_sanity_val_steps=0?
@vikigenius: I don't think setting num_sanity_val_steps to 0 will solve it, since it will only delay what's bound to happen during the validation step later.
--num_sanity_val_steps=0 fixes it for me.
generate (without a super high config.min_length / eval_max_gen_length) doesn't OOM for me.
--eval_num_beams=2 may also help save memory.
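As a quick sanity check before launching a long run, it can also help to inspect the length-related config defaults that drive validation-time generate memory (a minimal sketch; the attributes shown are standard config fields, and the checkpoint name is just an example):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/bart-large")
print("max_length:", config.max_length)
print("min_length:", config.min_length)
print("num_beams:", config.num_beams)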