I initialized the tokenizer and the model like this:
import torch
from transformers import BertModel, BertTokenizer

def load_bert_score_model(bert="bert-base-multilingual-cased", num_layers=8):
    assert bert in bert_types  # bert_types: list of supported checkpoints, defined elsewhere
    tokenizer = BertTokenizer.from_pretrained(bert, do_lower_case=True)
    model = BertModel.from_pretrained(bert)
    model.eval()
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model.to(device)
    # drop unused encoder layers, keeping only the first num_layers
    model.encoder.layer = torch.nn.ModuleList([layer for layer in model.encoder.layer[:num_layers]])
    return model, tokenizer
So I'm passing do_lower_case=True, but I get this warning:
The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
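For reference, I call it roughly like this (a minimal sketch; the example sentence is arbitrary):

model, tokenizer = load_bert_score_model()   # the warning is printed here, during from_pretrained
print(tokenizer.tokenize("Héllo Wörld"))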
Hi! You seem to be loading a cased model (such as bert-base-multilingual-cased), but you're passing do_lower_case=True to your tokenizer, which lowercases every character and strips accents.
The model you specified was trained with uppercase and lowercase characters as well as accent marks, so you should feed it such characters too. If you only want lowercase characters, it would be better to use an uncased model (such as bert-base-multilingual-uncased).
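Concretely, the two consistent setups would look something like this (a rough sketch, not part of the original bert_score code; the example text is arbitrary):

from transformers import BertTokenizer

# Option 1: keep the cased checkpoint and leave do_lower_case at its default (False),
# so case and accents are preserved, matching how the model was trained.
cased_tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# Option 2: if lowercased, accent-stripped input is what you want, use the uncased checkpoint.
uncased_tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-uncased", do_lower_case=True)

# The difference on accented text:
text = "Héllo Wörld"
print(cased_tokenizer.tokenize(text))    # keeps case and accents
print(uncased_tokenizer.tokenize(text))  # lowercases and strips accents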
@LysandreJik that is correct, thank you.