Model I am using: Albert
Language I am using the model on (English, Chinese ...): Chinese
The problem arises when following the instructions on https://huggingface.co/models,
e.g. when loading the "voidful/albert_chinese_tiny" model:
AutoTokenizer.from_pretrained('voidful/albert_chinese_tiny')
will raise
Model name 'voidful/albert_chinese_tiny' was not found in tokenizers model name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). We assumed 'voidful/albert_chinese_tiny' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.
Hi, could you specify which version of transformers you're running?
I encountered the same problem when using Albert. @voidful
AutoTokenizer.from_pretrained('voidful/albert_chinese_xxlarge')
will raise
04/04/2020 14:21:28 - INFO - Model name 'voidful/albert_chinese_xxlarge' not found in model shortcut name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). Assuming 'voidful/albert_chinese_xxlarge' is a path, a model identifier, or url to a directory containing tokenizer files.
Traceback (most recent call last):
File "preprocess.py", line 353, in <module>
main()
File "preprocess.py", line 303, in main
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path, do_lower_case=not args.cased, cache_dir=args.cache_dir)
File "/data0/username/anaconda3/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 192, in from_pretrained
return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/data0/username/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 393, in from_pretrained
return cls._from_pretrained(*inputs, **kwargs)
File "/data0/username/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 496, in _from_pretrained
list(cls.vocab_files_names.values()),
OSError: Model name 'voidful/albert_chinese_xxlarge' was not found in tokenizers model name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). We assumed 'voidful/albert_chinese_xxlarge' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.
>>> torch.__version__
'1.3.1'
>>> transformers.__version__
'2.7.0'
@LysandreJik @WiseDoge
The problem is that the model type is different from the tokenizer type:
e.g. the model uses the Albert model type while the tokenizer is a BERT tokenizer, so the AutoTokenizer class won't know about it.
You should let users specify the tokenizer class or tokenizer model type if necessary (see the sketch below).
Waiting for confirmation or feature requests.
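In the meantime, a minimal workaround sketch, assuming you pick the tokenizer class by hand instead of relying on AutoTokenizer (AlbertModel here just stands in for whichever head you need):

from transformers import BertTokenizer, AlbertModel

# The albert_chinese repos ship a BERT-style vocab.txt rather than spiece.model,
# so the tokenizer has to be loaded explicitly as a BertTokenizer.
tokenizer = BertTokenizer.from_pretrained('voidful/albert_chinese_tiny')
model = AlbertModel.from_pretrained('voidful/albert_chinese_tiny')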
Thank you. I used the BERT tokenizer instead, and it works.
Since sentencepiece is not used in the albert_chinese models,
you have to call BertTokenizer instead of AlbertTokenizer! We can verify this with a MaskedLM example:
import torch
from torch.nn.functional import softmax
from transformers import BertTokenizer, AlbertForMaskedLM

pretrained = 'voidful/albert_chinese_large'
# The repo has a BERT-style vocab, so load the tokenizer as BertTokenizer.
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = AlbertForMaskedLM.from_pretrained(pretrained)

inputtext = "今天[MASK]情很好"
# Encode once and locate the [MASK] position via the tokenizer's mask_token_id
# (id 103 in this vocab) instead of hard-coding it.
encoded = tokenizer.encode(inputtext, add_special_tokens=True)
maskpos = encoded.index(tokenizer.mask_token_id)
input_ids = torch.tensor(encoded).unsqueeze(0)  # batch size 1

# On transformers 2.x the MLM labels keyword is masked_lm_labels (later renamed to labels).
outputs = model(input_ids, masked_lm_labels=input_ids)
loss, prediction_scores = outputs[:2]

logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).data.tolist()
predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token, logit_prob[predicted_index])
Result: 心 0.9422469735145569
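The same check can also be written with the fill-mask pipeline; a minimal sketch, assuming the pipeline is given the explicitly constructed BertTokenizer alongside the hub model name:

from transformers import pipeline, BertTokenizer

pretrained = 'voidful/albert_chinese_large'
# The model is resolved from the hub name; the tokenizer is passed explicitly
# because AutoTokenizer cannot pick the right class for this repo.
fill_mask = pipeline('fill-mask', model=pretrained,
                     tokenizer=BertTokenizer.from_pretrained(pretrained))
for candidate in fill_mask('今天[MASK]情很好'):
    print(candidate['sequence'], candidate['score'])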
Closing for now.