Model I am using (Bert, XLNet....): CamemBERT, but this probably applies to all MLMs.
Language I am using the model on (English, Chinese....): French
The problem arises when using:
https://github.com/huggingface/transformers/blob/master/examples/run_lm_finetuning.py is also impacted. Basically, the masking procedure raises a device-side assertion error when I try to run something akin to:
model(inputs, masked_lm_labels=labels)
I pinpointed the error to the fact that masking values to be ignored in the labels with -100, like here in the run_lm_finetuning.py script, is probably deprecated. The documentation is unclear on the subject, as it says:
masked_lm_labels: (optional) torch.LongTensor of shape (batch_size, sequence_length):
Labels for computing the masked language modeling loss. Indices should be in [-1, 0, ..., config.vocab_size] (see input_ids docstring) Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
As you can see, the information is contradictory: on one hand it says indices should be in [-1, 0, ..., config.vocab_size], but on the other hand it says, just like the script does, that tokens with index -100 are ignored. I tried, and using -1 does indeed work.
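For context, the masked LM loss boils down to a cross-entropy criterion whose ignore_index decides which label values are skipped. The following is only an illustrative sketch of that mechanism (not the library's actual code), showing why the choice between -1 and -100 matters:

import torch

vocab_size = 32005                         # CamemBERT vocabulary size
logits = torch.randn(4, vocab_size)        # fake prediction scores for 4 tokens
labels = torch.tensor([5, -100, 7, -100])  # -100 marks positions to ignore

# With ignore_index=-100 (what the script and the newer code expect),
# the -100 entries are simply skipped and a loss comes out.
loss = torch.nn.CrossEntropyLoss(ignore_index=-100)(logits, labels)

# With ignore_index=-1 (what the older releases use), the same labels hit the
# ClassNLLCriterion assertion, because -100 is neither ignored nor a valid class.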
The task I am working on is:
Steps to reproduce the behavior:
import torch
from transformers import CamembertForMaskedLM
model = CamembertForMaskedLM.from_pretrained(
"camembert-base", cache_dir="models/pretrained_camembert"
)
inputs = torch.full((30, 1), 4).to(torch.long)  # dummy batch of 30 single-token sequences
labels = inputs.clone()
labels[10] = -100  # mark one position as "to be ignored" by the loss
model(inputs, masked_lm_labels=labels)
This gives:
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed. at /pytorch/aten/src/THNN/generic/ClassNLLCriterion.c:97
If you run it on GPU, a similar assertion error is raised.
Expected behavior: the call should return a loss.
Okay, my bad, it seems this was actually intentional: this commit was merged and integrated in either version 2.2.2 or 2.3, causing the error on my version. It seems the current proper way to do this is indeed to specify -100 as the ignore index.
The doc is unclear though: the sentence "Indices should be in [-1, 0, ..., config.vocab_size]" should read "Indices should be in [-100, 0, ..., config.vocab_size]".
Anyway, cheers. I PRed the documentation fix everywhere it's needed if you want to have a look, but regardless, feel free to close this issue.
@LysandreJik merged the PR for the doc. However, I just realized that I incorrectly assumed the commit was part of 2.3 or 2.2.2, based on the merge date of the uniformisation commit. It is currently only on the master branch and not in any tagged version, which means anyone who hits the above bug should switch to -1 until that is the case (a sketch of that workaround follows the error log below). Here is the error I got when training on GPU, by the way:
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [31,0,0]
Assertion `t >= 0 && t < n_classes` failed.
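For anyone stuck on a released version in the meantime, here is a minimal sketch of the -1 workaround mentioned above: the same reproduction as before, with only the ignore value changed (this assumes a transformers release installed from PyPI, i.e. without the uniformisation commit):

import torch
from transformers import CamembertForMaskedLM

model = CamembertForMaskedLM.from_pretrained(
    "camembert-base", cache_dir="models/pretrained_camembert"
)
inputs = torch.full((30, 1), 4).to(torch.long)
labels = inputs.clone()
labels[10] = -1  # released versions still expect -1 as the ignore value
outputs = model(inputs, masked_lm_labels=labels)
loss = outputs[0]  # a scalar masked LM loss, as expected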
Thanks for figuring this out!
This was a hair-pulling bug, due to the fact that the conda package from the pytorch channel has the updated version while the PyPI package with the same release tag does not... I was wondering why index masking for BERT labels was having such issues between the conda version 1.3.1 and the pip version 1.3.1 (they're labeled as the same version D:)
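When a conda and a pip install carry the same version label but different code, a quick check of which install is actually being imported can save a lot of time. A small sketch:

import transformers

# The version string alone can be misleading; the module path reveals
# whether the imported package came from the conda env or from pip.
print(transformers.__version__)
print(transformers.__file__)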
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello, thanks for sharing.
I also want to fine-tune the pretrained CamemBERT model on an MLM task, in order to later extract sentence embeddings for clustering. I am a bit confused about how to use the Trainer to fine-tune.
Should I create the masked_lm_labels myself, with indices in [-100, 0, ..., config.vocab_size]? But then how do I know which word is masked?
Could you share a piece of code if it isn't too much trouble. Thank you in advance.
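For reference, with the Trainer API the masked_lm_labels do not need to be built by hand: DataCollatorForLanguageModeling applies the random masking and fills the positions that are not selected for masking with -100, so the loss is only computed on masked tokens. A rough sketch, assuming a recent transformers version; the file path and hyperparameters below are placeholders:

from transformers import (
    CamembertForMaskedLM,
    CamembertTokenizer,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertForMaskedLM.from_pretrained("camembert-base")

# Placeholder corpus: one training sentence per line.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="train.txt", block_size=128
)

# Randomly masks 15% of the tokens and sets the labels of all other
# positions to -100, which the loss ignores.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="camembert-finetuned-mlm",
    num_train_epochs=1,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()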