transformers version: 3.2.0

@sgugger As far as I saw, you were the one who worked on the PR implementing Funnel Transformer.

Model I am using: Funnel Transformer
The problem arises when using:
fp16=True and fp16_opt_level='O1' with the Trainer (the same settings work for me with a roberta-base model, so nvidia APEX is properly installed/configured).

The task I am working on is:
FunnelForSequenceClassification using my own custom dataset:

# some code to load data from CSV
# ...
# wrapper around PyTorch for holding datasets
class IMDbDataset(torch.utils.data.Dataset):
    # same code as in the Huggingface docs
    # ...
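    # (sketch, not part of the original report: assuming the wrapper matches the
    # Hugging Face custom-datasets tutorial, its body would be roughly)
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # one example = dict of tensors from the encodings, plus the label under "labels"
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)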
# load tokenizer
tokenizer = FunnelTokenizer.from_pretrained('funnel-transformer/large-base')
# tokenize texts
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
# training args used
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    #learning_rate=35e-6,
    weight_decay=0.01,               # strength of weight decay
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log every 10 steps
    fp16=True,                       # enable mixed precision training
    fp16_opt_level='O1'              # here I tried both O1 and O2 with the same result
)
model = FunnelForSequenceClassification.from_pretrained(
    'funnel-transformer/large-base',
    return_dict=True,
    num_labels=max(train_labels) + 1
)
trainer = Trainer(
    model=model,                  # the instantiated 🤗 Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset      # evaluation dataset
)
trainer.train()
trainer.save_model('funnel')
Steps to reproduce the behavior:
Stacktrace:
File "funnel.py", line 89, in <module>
trainer.train()
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/trainer.py", line 741, in train
tr_loss += self.training_step(model, inputs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/trainer.py", line 1046, in training_step
loss = self.compute_loss(model, inputs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/trainer.py", line 1070, in compute_loss
outputs = model(**inputs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/modeling_funnel.py", line 1263, in forward
return_dict=return_dict,
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/modeling_funnel.py", line 950, in forward
return_dict=return_dict,
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/modeling_funnel.py", line 655, in forward
layer_output = layer(query, key, value, attention_inputs, output_attentions=output_attentions)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/modeling_funnel.py", line 602, in forward
attn = self.attention(query, key, value, attention_inputs, output_attentions=output_attentions)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/modeling_funnel.py", line 548, in forward
content_score = torch.einsum("bind,bjnd->bnij", q_head + r_w_bias, k_head)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/functional.py", line 292, in einsum
return _VF.einsum(equation, operands)
RuntimeError: Expected object of scalar type Float but got scalar type Half for argument #2 'mat2' in call to _th_bmm
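For what it's worth, the failure seems to reduce to torch.einsum being handed operands of different dtypes under apex O1. A minimal sketch of the kind of call that produces this error (made-up shapes, not taken from my script; needs a GPU):

import torch

q = torch.randn(2, 8, 4, 16, device="cuda")         # float32, like q_head + r_w_bias
k = torch.randn(2, 8, 4, 16, device="cuda").half()  # float16, like k_head under apex O1
# RuntimeError: Expected object of scalar type Float but got scalar type Half ...
scores = torch.einsum("bind,bjnd->bnij", q, k)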
This seems like a very similar issue.
We should be able to train the model with mixed precision to use VRAM more efficiently.
Thanks for flagging!
I think I have found the cause for this. The model runs fine on my end in half precision once the fix is applied.
Thanks for the quick fix! Unfortunately, I checked out that branch (and installed from source) and I still get the issue, now at this line: https://github.com/huggingface/transformers/blob/624cb37b38574566522072c19659b4cff60b98f9/src/transformers/modeling_funnel.py#L544
Edit (attached new stacktrace):
File "funnel.py", line 90, in <module>
trainer.train()
File "/root/transformers/src/transformers/trainer.py", line 743, in train
tr_loss += self.training_step(model, inputs)
File "/root/transformers/src/transformers/trainer.py", line 1050, in training_step
loss = self.compute_loss(model, inputs)
File "/root/transformers/src/transformers/trainer.py", line 1074, in compute_loss
outputs = model(**inputs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/transformers/src/transformers/modeling_funnel.py", line 1269, in forward
return_dict=return_dict,
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/transformers/src/transformers/modeling_funnel.py", line 955, in forward
return_dict=return_dict,
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/transformers/src/transformers/modeling_funnel.py", line 651, in forward
layer_output = layer(query, key, value, attention_inputs, output_attentions=output_attentions)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/transformers/src/transformers/modeling_funnel.py", line 598, in forward
attn = self.attention(query, key, value, attention_inputs, output_attentions=output_attentions)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/transformers/src/transformers/modeling_funnel.py", line 544, in forward
content_score = torch.einsum("bind,bjnd->bnij", q_head + r_w_bias, k_head)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/functional.py", line 292, in einsum
return _VF.einsum(equation, operands)
RuntimeError: Expected object of scalar type Float but got scalar type Half for argument #2 'mat2' in call to _th_bmm
What got me past this error was calling .float() on all the tensor arguments to torch.einsum() (roughly as in the sketch after the stacktrace below), but then I ran into this issue:
File "funnel.py", line 90, in <module>
trainer.train()
File "/root/transformers/src/transformers/trainer.py", line 743, in train
tr_loss += self.training_step(model, inputs)
File "/root/transformers/src/transformers/trainer.py", line 1062, in training_step
scaled_loss.backward()
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: expected dtype Float but got dtype Long (validate_dtype at /opt/conda/conda-bld/pytorch_1591914880026/work/aten/src/ATen/native/TensorIterator.cpp:143)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f49c1e64b5e in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::TensorIterator::compute_types() + 0xce3 (0x7f49ea00c113 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: at::TensorIterator::build() + 0x44 (0x7f49ea00eaf4 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::native::mse_loss_backward_out(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long) + 0x193 (0x7f49e9e5c043 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0xdfc047 (0x7f49c30ba047 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::mse_loss_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, long) + 0x172 (0x7f49e9e64782 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xdfc2ff (0x7f49c30ba2ff in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xe20c26 (0x7f49ea294c26 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x27fd3cb (0x7f49ebc713cb in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0xe20c26 (0x7f49ea294c26 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::autograd::generated::MseLossBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1f7 (0x7f49eba78e67 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x2ae7df5 (0x7f49ebf5bdf5 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f49ebf590f3 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f49ebf59ed2 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f49ebf52549 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f49ef4a2638 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0xc819d (0x7f49f1cfd19d in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #17: <unknown function> + 0x76db (0x7f4a0a4186db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #18: clone + 0x3f (0x7f4a0a141a3f in /lib/x86_64-linux-gnu/libc.so.6)
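For reference, the .float() workaround mentioned above amounted to forcing every einsum operand to float32; a sketch using a hypothetical helper (einsum_fp32 is my own name, not anything in the library):

import torch

def einsum_fp32(equation, *operands):
    # cast every operand to float32 before calling einsum, so mixed
    # Half/Float inputs (as in the stacktrace above) no longer crash
    return torch.einsum(equation, *(t.float() for t in operands))

# e.g. inside the Funnel attention forward one would replace
#   content_score = torch.einsum("bind,bjnd->bnij", q_head + r_w_bias, k_head)
# with
#   content_score = einsum_fp32("bind,bjnd->bnij", q_head + r_w_bias, k_head)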
Okay, it turns out the first issue with torch.einsum was on PyTorch's side, as the function did not accept mixed-precision inputs. After updating to PyTorch 1.6.0 and recompiling nvidia APEX, I'm stuck with:
File "funnel.py", line 90, in <module>
trainer.train()
File "/root/transformers/src/transformers/trainer.py", line 743, in train
tr_loss += self.training_step(model, inputs)
File "/root/transformers/src/transformers/trainer.py", line 1059, in training_step
self.scaler.scale(loss).backward()
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Found dtype Long but expected Float
Exception raised from compute_types at /opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/ATen/native/TensorIterator.cpp:183 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f6b6fede77d in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::TensorIterator::compute_types(at::TensorIteratorConfig const&) + 0x259 (0x7f6ba2f35ca9 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: at::TensorIterator::build(at::TensorIteratorConfig&) + 0x6b (0x7f6ba2f3944b in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::TensorIterator::TensorIterator(at::TensorIteratorConfig&) + 0xdd (0x7f6ba2f39abd in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::native::mse_loss_backward_out(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long) + 0x18a (0x7f6ba2d9e71a in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xd1d610 (0x7f6b71061610 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: at::native::mse_loss_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, long) + 0x90 (0x7f6ba2d9b140 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xd1d6b0 (0x7f6b710616b0 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0xd3f936 (0x7f6b71083936 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: at::mse_loss_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, long) + 0x119 (0x7f6ba325dda9 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2b5e8c9 (0x7f6ba4eb68c9 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x7f60d6 (0x7f6ba2b4e0d6 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: at::mse_loss_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, long) + 0x119 (0x7f6ba325dda9 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::generated::MseLossBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1af (0x7f6ba4df252f in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x30d1017 (0x7f6ba5429017 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f6ba5424860 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f6ba5425401 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f6ba541d579 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f6ba974c99a in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0xc819d (0x7f6bac27e19d in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #20: <unknown function> + 0x76db (0x7f6bc49996db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #21: clone + 0x3f (0x7f6bc46c2a3f in /lib/x86_64-linux-gnu/libc.so.6)
Because I had reduced my dataset to load it faster while checking various fixes, I was accidentally passing only a single training label (one class) to my classifier, so num_labels=max(train_labels)+1 presumably came out as 1 and the model fell back to a regression (MSE) loss, which expects float labels and explains the MseLossBackward dtype error above. After fixing this the model started training; however, the loss is always reported as nan.
Is this an issue? I double-checked, and running without mixed precision correctly reports the loss, and I can see it decreasing between log statements.
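As an aside, a quick sanity check on the labels would have caught the single-class subset early; a sketch (illustrative only, not part of my script):

from collections import Counter

label_counts = Counter(train_labels)
num_labels = max(train_labels) + 1
# make sure every class 0..num_labels-1 actually occurs in the (reduced) training set
assert len(label_counts) == num_labels, (
    f"expected {num_labels} classes, found {len(label_counts)}: {dict(label_counts)}"
)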
I can reproduce the loss being nan and will try to investigate the source of this bug. Note that starting with PyTorch 1.6, apex is not used anymore for mixed precision training, since PyTorch has native support for it.
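(For context, the native path that Trainer uses with fp16=True on PyTorch >= 1.6 boils down to torch.cuda.amp; a rough sketch, assuming model, optimizer and train_loader are already set up:)

import torch

scaler = torch.cuda.amp.GradScaler()
for batch in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in mixed precision
        loss = model(**batch).loss
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()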
I have found the reason (and why I wasn't managing to fine-tune a model on some GLUE task yesterday). It turns out I was matching the authors' implementation exactly, but in transformers we put 1 in the attention mask for tokens that are not masked... stupid me.
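(To illustrate the convention in question: the tokenizer produces attention masks with 1 for real tokens and 0 for padding, e.g. in a sketch like the following, where the printed values are only indicative.)

from transformers import FunnelTokenizer

tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/large-base")
enc = tokenizer(["a short text", "a slightly longer piece of text"], padding=True)
print(enc["attention_mask"])  # 1 = real token (not masked), 0 = padding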
Good to know I don't have to build APEX next time ;)
I just pulled the latest commit from your branch and can confirm loss is no longer nan.
Great job and thanks for the assistance!