transformers version: 3.2.0

@sgugger As far as I saw, you were the one who worked on the PR implementing Funnel Transformer.

Model I am using: Funnel Transformer
The problem arises when using:
fp16=True and fp16_opt_level='O1' with the Trainer (the same settings work for me with a roberta-base model, so nvidia APEX is properly installed/configured).

The task I am working on is:
FunnelForSequenceClassification using my own custom dataset:

# some code to load data from CSV
# ...
# wrapper around PyTorch for holding datasets
class IMDbDataset(torch.utils.data.Dataset):
    # same code as in the Huggingface docs
    # ...
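    # (sketch, not part of the original report: assuming the wrapper matches the
    # Hugging Face custom-datasets tutorial, its body would be roughly)
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # one example = dict of tensors from the encodings, plus the label under "labels"
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)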
# load tokenizer
tokenizer = FunnelTokenizer.from_pretrained('funnel-transformer/large-base')
# tokenize texts
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
# training args used
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    #learning_rate=35e-6,
    weight_decay=0.01,               # strength of weight decay
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log every 10 steps
    fp16=True,                       # enable mixed precision training
    fp16_opt_level='O1'              # here I tried both O1 and O2 with the same result
)
model = FunnelForSequenceClassification.from_pretrained(
    'funnel-transformer/large-base',
    return_dict=True,
    num_labels=max(train_labels) + 1
)
trainer = Trainer(
    model=model,                  # the instantiated 🤗 Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset      # evaluation dataset
)
trainer.train()
trainer.save_model('funnel')
Steps to reproduce the behavior:
Stacktrace:
File "funnel.py", line 89, in <module>
trainer.train()
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/trainer.py", line 741, in train
tr_loss += self.training_step(model, inputs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/trainer.py", line 1046, in training_step
loss = self.compute_loss(model, inputs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/trainer.py", line 1070, in compute_loss
outputs = model(**inputs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/modeling_funnel.py", line 1263, in forward
return_dict=return_dict,
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/modeling_funnel.py", line 950, in forward
return_dict=return_dict,
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/modeling_funnel.py", line 655, in forward
layer_output = layer(query, key, value, attention_inputs, output_attentions=output_attentions)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/modeling_funnel.py", line 602, in forward
attn = self.attention(query, key, value, attention_inputs, output_attentions=output_attentions)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/transformers/modeling_funnel.py", line 548, in forward
content_score = torch.einsum("bind,bjnd->bnij", q_head + r_w_bias, k_head)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/functional.py", line 292, in einsum
return _VF.einsum(equation, operands)
RuntimeError: Expected object of scalar type Float but got scalar type Half for argument #2 'mat2' in call to _th_bmm
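For what it's worth, the failure seems to reduce to torch.einsum being handed operands of different dtypes under apex O1. A minimal sketch of the kind of call that produces this error (made-up shapes, not taken from my script; needs a GPU):

import torch

q = torch.randn(2, 8, 4, 16, device="cuda")         # float32, like q_head + r_w_bias
k = torch.randn(2, 8, 4, 16, device="cuda").half()  # float16, like k_head under apex O1
# RuntimeError: Expected object of scalar type Float but got scalar type Half ...
scores = torch.einsum("bind,bjnd->bnij", q, k)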
This seems like a very similar issue.
We should be able to train the model with mixed precision to use VRAM more efficiently.
Thanks for flagging!
I think I have found the cause for this. The model runs fine on my end in half precision once the fix is applied.
Thanks for the quick fix! Unfortunately, I checked out that branch (and installed from source) and I still get the issue, now at this line: https://github.com/huggingface/transformers/blob/624cb37b38574566522072c19659b4cff60b98f9/src/transformers/modeling_funnel.py#L544
Edit (attached new stacktrace):
File "funnel.py", line 90, in <module>
trainer.train()
File "/root/transformers/src/transformers/trainer.py", line 743, in train
tr_loss += self.training_step(model, inputs)
File "/root/transformers/src/transformers/trainer.py", line 1050, in training_step
loss = self.compute_loss(model, inputs)
File "/root/transformers/src/transformers/trainer.py", line 1074, in compute_loss
outputs = model(**inputs)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/transformers/src/transformers/modeling_funnel.py", line 1269, in forward
return_dict=return_dict,
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/transformers/src/transformers/modeling_funnel.py", line 955, in forward
return_dict=return_dict,
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/transformers/src/transformers/modeling_funnel.py", line 651, in forward
layer_output = layer(query, key, value, attention_inputs, output_attentions=output_attentions)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/transformers/src/transformers/modeling_funnel.py", line 598, in forward
attn = self.attention(query, key, value, attention_inputs, output_attentions=output_attentions)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/transformers/src/transformers/modeling_funnel.py", line 544, in forward
content_score = torch.einsum("bind,bjnd->bnij", q_head + r_w_bias, k_head)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/functional.py", line 292, in einsum
return _VF.einsum(equation, operands)
RuntimeError: Expected object of scalar type Float but got scalar type Half for argument #2 'mat2' in call to _th_bmm
What got me past this error was calling .float() on all the tensor arguments to torch.einsum() (roughly as in the sketch after the stacktrace below), but then I ran into this issue:
File "funnel.py", line 90, in <module>
trainer.train()
File "/root/transformers/src/transformers/trainer.py", line 743, in train
tr_loss += self.training_step(model, inputs)
File "/root/transformers/src/transformers/trainer.py", line 1062, in training_step
scaled_loss.backward()
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: expected dtype Float but got dtype Long (validate_dtype at /opt/conda/conda-bld/pytorch_1591914880026/work/aten/src/ATen/native/TensorIterator.cpp:143)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f49c1e64b5e in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::TensorIterator::compute_types() + 0xce3 (0x7f49ea00c113 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: at::TensorIterator::build() + 0x44 (0x7f49ea00eaf4 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::native::mse_loss_backward_out(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long) + 0x193 (0x7f49e9e5c043 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0xdfc047 (0x7f49c30ba047 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::mse_loss_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, long) + 0x172 (0x7f49e9e64782 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xdfc2ff (0x7f49c30ba2ff in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xe20c26 (0x7f49ea294c26 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x27fd3cb (0x7f49ebc713cb in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0xe20c26 (0x7f49ea294c26 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::autograd::generated::MseLossBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1f7 (0x7f49eba78e67 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x2ae7df5 (0x7f49ebf5bdf5 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f49ebf590f3 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f49ebf59ed2 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f49ebf52549 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f49ef4a2638 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0xc819d (0x7f49f1cfd19d in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #17: <unknown function> + 0x76db (0x7f4a0a4186db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #18: clone + 0x3f (0x7f4a0a141a3f in /lib/x86_64-linux-gnu/libc.so.6)
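For reference, the .float() workaround mentioned above amounted to forcing every einsum operand to float32; a sketch using a hypothetical helper (einsum_fp32 is my own name, not anything in the library):

import torch

def einsum_fp32(equation, *operands):
    # cast every operand to float32 before calling einsum, so mixed
    # Half/Float inputs (as in the stacktrace above) no longer crash
    return torch.einsum(equation, *(t.float() for t in operands))

# e.g. inside the Funnel attention forward one would replace
#   content_score = torch.einsum("bind,bjnd->bnij", q_head + r_w_bias, k_head)
# with
#   content_score = einsum_fp32("bind,bjnd->bnij", q_head + r_w_bias, k_head)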
Okay, it turns out the first issue with torch.einsum was on PyTorch's side, as the function did not accept mixed-precision inputs. After updating to PyTorch 1.6.0 and recompiling nvidia APEX, I'm stuck with:
File "funnel.py", line 90, in <module>
trainer.train()
File "/root/transformers/src/transformers/trainer.py", line 743, in train
tr_loss += self.training_step(model, inputs)
File "/root/transformers/src/transformers/trainer.py", line 1059, in training_step
self.scaler.scale(loss).backward()
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/root/anaconda/envs/ai/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Found dtype Long but expected Float
Exception raised from compute_types at /opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/ATen/native/TensorIterator.cpp:183 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f6b6fede77d in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::TensorIterator::compute_types(at::TensorIteratorConfig const&) + 0x259 (0x7f6ba2f35ca9 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: at::TensorIterator::build(at::TensorIteratorConfig&) + 0x6b (0x7f6ba2f3944b in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::TensorIterator::TensorIterator(at::TensorIteratorConfig&) + 0xdd (0x7f6ba2f39abd in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::native::mse_loss_backward_out(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long) + 0x18a (0x7f6ba2d9e71a in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0xd1d610 (0x7f6b71061610 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: at::native::mse_loss_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, long) + 0x90 (0x7f6ba2d9b140 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xd1d6b0 (0x7f6b710616b0 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0xd3f936 (0x7f6b71083936 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: at::mse_loss_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, long) + 0x119 (0x7f6ba325dda9 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2b5e8c9 (0x7f6ba4eb68c9 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x7f60d6 (0x7f6ba2b4e0d6 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: at::mse_loss_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, long) + 0x119 (0x7f6ba325dda9 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::generated::MseLossBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1af (0x7f6ba4df252f in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x30d1017 (0x7f6ba5429017 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f6ba5424860 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f6ba5425401 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f6ba541d579 in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f6ba974c99a in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0xc819d (0x7f6bac27e19d in /root/anaconda/envs/ai/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #20: <unknown function> + 0x76db (0x7f6bc49996db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #21: clone + 0x3f (0x7f6bc46c2a3f in /lib/x86_64-linux-gnu/libc.so.6)
Because I had reduced my dataset to load it faster while checking various fixes, I was accidentally passing only a single training label (one class) to my classifier, so num_labels=max(train_labels)+1 presumably came out as 1 and the model fell back to a regression (MSE) loss, which expects float labels and explains the MseLossBackward dtype error above. After fixing this the model started training; however, the loss is always reported as nan.
Is this an issue? I double-checked, and running without mixed precision correctly reports the loss, and I can see it decreasing between log statements.
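As an aside, a quick sanity check on the labels would have caught the single-class subset early; a sketch (illustrative only, not part of my script):

from collections import Counter

label_counts = Counter(train_labels)
num_labels = max(train_labels) + 1
# make sure every class 0..num_labels-1 actually occurs in the (reduced) training set
assert len(label_counts) == num_labels, (
    f"expected {num_labels} classes, found {len(label_counts)}: {dict(label_counts)}"
)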
I can reproduce the loss being nan and will try to investigate the source of this bug. Note that starting with PyTorch 1.6, apex is not used anymore for mixed precision training, since PyTorch has native support for it.
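(For context, the native path that Trainer uses with fp16=True on PyTorch >= 1.6 boils down to torch.cuda.amp; a rough sketch, assuming model, optimizer and train_loader are already set up:)

import torch

scaler = torch.cuda.amp.GradScaler()
for batch in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in mixed precision
        loss = model(**batch).loss
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()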
I have found the reason (and why I wasn't managing to fine-tune a model on some GLUE task yesterday). It turns out I was matching the authors' implementation exactly, but in transformers we put 1 in the attention mask for tokens that are not masked... stupid me.
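(To illustrate the convention in question: the tokenizer produces attention masks with 1 for real tokens and 0 for padding, e.g. in a sketch like the following, where the printed values are only indicative.)

from transformers import FunnelTokenizer

tokenizer = FunnelTokenizer.from_pretrained("funnel-transformer/large-base")
enc = tokenizer(["a short text", "a slightly longer piece of text"], padding=True)
print(enc["attention_mask"])  # 1 = real token (not masked), 0 = padding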
Good to know I don't have to build APEX next time ;)
I just pulled the latest commit from your branch and can confirm loss is no longer nan.
Great job and thanks for the assistance!