Transformers: cannot init tokenizer from a third-party model (ALBERT)

Created on 1 Apr 2020  ·  7 comments  ·  Source: huggingface/transformers

🐛 Bug

Information

Model I am using: ALBERT (voidful/albert_chinese_tiny)

Language I am using the model on: Chinese

The problem arises when using:

  • [x] the official example scripts (details below)

Following the instructions at:

https://huggingface.co/models

For example, using the "voidful/albert_chinese_tiny" model,

AutoTokenizer.from_pretrained('voidful/albert_chinese_tiny')

will raise:
Model name 'voidful/albert_chinese_tiny' was not found in tokenizers model name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). We assumed 'voidful/albert_chinese_tiny' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.
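
For context, AutoTokenizer resolves the tokenizer class from the model type in the repo's config.json; since the model type here is "albert", it instantiates AlbertTokenizer, which looks for a SentencePiece file (spiece.model) that this repo does not ship. A simplified sketch of the situation (not the actual library code; assumes transformers 2.x):

from transformers import AutoConfig, BertTokenizer

name = 'voidful/albert_chinese_tiny'
config = AutoConfig.from_pretrained(name)
print(config.model_type)  # "albert" -> AutoTokenizer picks AlbertTokenizer

# AlbertTokenizer expects spiece.model, which this repo lacks, hence the
# OSError above. The repo ships a BERT-style vocab.txt instead, so this works:
tokenizer = BertTokenizer.from_pretrained(name)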

Most helpful comment

Since SentencePiece is not used in the albert_chinese models, you have to use BertTokenizer instead of AlbertTokenizer. We can evaluate it with an example on MaskedLM:

Colab trial

import torch
from torch.nn.functional import softmax
from transformers import BertTokenizer, AlbertForMaskedLM

pretrained = 'voidful/albert_chinese_large'
# The albert_chinese repos ship a BERT-style vocab.txt, so load BertTokenizer
# rather than the SentencePiece-based AlbertTokenizer.
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = AlbertForMaskedLM.from_pretrained(pretrained)

inputtext = "今天[MASK]情很好"

# Encode once; 103 is the id of the [MASK] token in this vocabulary.
token_ids = tokenizer.encode(inputtext, add_special_tokens=True)
maskpos = token_ids.index(103)

input_ids = torch.tensor(token_ids).unsqueeze(0)  # batch size 1
outputs = model(input_ids, masked_lm_labels=input_ids)  # transformers 2.x API
loss, prediction_scores = outputs[:2]
logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).detach().tolist()
predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token, logit_prob[predicted_index])

Result: 心 0.9422469735145569
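
As a side note, hard-coding 103 works only because that happens to be the id of [MASK] in this vocabulary; looking the id up from the tokenizer is sturdier. A minimal variant, assuming the mask_token_id attribute available in transformers 2.x:

# Same as above, but without the magic number 103.
maskpos = token_ids.index(tokenizer.mask_token_id)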

All 7 comments

Hi, could you specify which version of transformers you're running?

I encountered the same problem when using Albert. @voidful

AutoTokenizer.from_pretrained('voidful/albert_chinese_xxlarge')

will raise

04/04/2020 14:21:28 - INFO - Model name 'voidful/albert_chinese_xxlarge' not found in model shortcut name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). Assuming 'voidful/albert_chinese_xxlarge' is a path, a model identifier, or url to a directory containing tokenizer files.
Traceback (most recent call last):
  File "preprocess.py", line 353, in <module>
    main()
  File "preprocess.py", line 303, in main
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path, do_lower_case=not args.cased, cache_dir=args.cache_dir)
  File "/data0/username/anaconda3/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 192, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/data0/username/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 393, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/data0/username/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 496, in _from_pretrained
    list(cls.vocab_files_names.values()),
OSError: Model name 'voidful/albert_chinese_xxlarge' was not found in tokenizers model name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). We assumed 'voidful/albert_chinese_xxlarge' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.
>>> torch.__version__
'1.3.1'
>>> transformers.__version__
'2.7.0'

@LysandreJik @WiseDoge
The problem is that the model type differs from the tokenizer type: the model uses the ALBERT model type while the tokenizer is a BERT tokenizer, so the AutoTokenizer class doesn't know about it.

You should let users specify the tokenizer class or tokenizer model type if necessary (see the sketch below).

Waiting for confirmation or a feature request.
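
One possible shape for this, sketched below: dispatch on a tokenizer_class hint in the model's config.json rather than on the model type. (The field is hypothetical here, though later transformers releases adopted exactly such a config.json field.)

from transformers import AutoConfig, AlbertTokenizer, BertTokenizer

def load_tokenizer(name):
    # Prefer an explicit tokenizer-class hint if the config carries one;
    # otherwise fall back to the class implied by the model type.
    config = AutoConfig.from_pretrained(name)
    if getattr(config, "tokenizer_class", None) == "BertTokenizer":
        return BertTokenizer.from_pretrained(name)
    return AlbertTokenizer.from_pretrained(name)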

Thank you. I used the BERT tokenizer instead, and it works.


Closing for now.

