Transformers: cannot init tokenizer from a third-party model (ALBERT)

Created on 1 Apr 2020  ·  7 comments  ·  Source: huggingface/transformers

🐛 Bug

Information

Model I am using: ALBERT (voidful/albert_chinese_tiny)

Language I am using the model on: Chinese

The problem arises when using:

  • [x] the official example scripts (details below)

Following the instructions at:

https://huggingface.co/models

For example, using the "voidful/albert_chinese_tiny" model,

AutoTokenizer.from_pretrained('voidful/albert_chinese_tiny')

will raise:
Model name 'voidful/albert_chinese_tiny' was not found in tokenizers model name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). We assumed 'voidful/albert_chinese_tiny' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.
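
For context, AutoTokenizer resolves the tokenizer class from the model type in the repo's config.json; since the model type here is "albert", it instantiates AlbertTokenizer, which looks for a SentencePiece file (spiece.model) that this repo does not ship. A simplified sketch of the situation (not the actual library code; assumes transformers 2.x):

from transformers import AutoConfig, BertTokenizer

name = 'voidful/albert_chinese_tiny'
config = AutoConfig.from_pretrained(name)
print(config.model_type)  # "albert" -> AutoTokenizer picks AlbertTokenizer

# AlbertTokenizer expects spiece.model, which this repo lacks, hence the
# OSError above. The repo ships a BERT-style vocab.txt instead, so this works:
tokenizer = BertTokenizer.from_pretrained(name)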

Most helpful comment

Since SentencePiece is not used in the albert_chinese models, you have to use BertTokenizer instead of AlbertTokenizer. We can evaluate it with an example on MaskedLM:

Colab trial

import torch
from torch.nn.functional import softmax
from transformers import BertTokenizer, AlbertForMaskedLM

pretrained = 'voidful/albert_chinese_large'
# The albert_chinese repos ship a BERT-style vocab.txt, so load BertTokenizer
# rather than the SentencePiece-based AlbertTokenizer.
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = AlbertForMaskedLM.from_pretrained(pretrained)

inputtext = "今天[MASK]情很好"

# Encode once; 103 is the id of the [MASK] token in this vocabulary.
token_ids = tokenizer.encode(inputtext, add_special_tokens=True)
maskpos = token_ids.index(103)

input_ids = torch.tensor(token_ids).unsqueeze(0)  # batch size 1
outputs = model(input_ids, masked_lm_labels=input_ids)  # transformers 2.x API
loss, prediction_scores = outputs[:2]
logit_prob = softmax(prediction_scores[0, maskpos], dim=-1).detach().tolist()
predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token, logit_prob[predicted_index])

Result: 心 0.9422469735145569
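
As a side note, hard-coding 103 works only because that happens to be the id of [MASK] in this vocabulary; looking the id up from the tokenizer is sturdier. A minimal variant, assuming the mask_token_id attribute available in transformers 2.x:

# Same as above, but without the magic number 103.
maskpos = token_ids.index(tokenizer.mask_token_id)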

All 7 comments

Hi, could you specify which version of transformers you're running?

I encountered the same problem when using Albert. @voidful

AutoTokenizer.from_pretrained('voidful/albert_chinese_xxlarge')

will raise

04/04/2020 14:21:28 - INFO - Model name 'voidful/albert_chinese_xxlarge' not found in model shortcut name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). Assuming 'voidful/albert_chinese_xxlarge' is a path, a model identifier, or url to a directory containing tokenizer files.
Traceback (most recent call last):
  File "preprocess.py", line 353, in <module>
    main()
  File "preprocess.py", line 303, in main
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path, do_lower_case=not args.cased, cache_dir=args.cache_dir)
  File "/data0/username/anaconda3/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 192, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/data0/username/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 393, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/data0/username/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 496, in _from_pretrained
    list(cls.vocab_files_names.values()),
OSError: Model name 'voidful/albert_chinese_xxlarge' was not found in tokenizers model name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). We assumed 'voidful/albert_chinese_xxlarge' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.
>>> torch.__version__
'1.3.1'
>>> transformers.__version__
'2.7.0'

@LysandreJik @WiseDoge
The problem is that the model type differs from the tokenizer type: the model uses the ALBERT model type while the tokenizer is a BERT tokenizer, so the AutoTokenizer class doesn't know about it.

You should let users specify the tokenizer class or tokenizer model type if necessary (see the sketch below).

Waiting for confirmation or a feature request.
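
One possible shape for this, sketched below: dispatch on a tokenizer_class hint in the model's config.json rather than on the model type. (The field is hypothetical here, though later transformers releases adopted exactly such a config.json field.)

from transformers import AutoConfig, AlbertTokenizer, BertTokenizer

def load_tokenizer(name):
    # Prefer an explicit tokenizer-class hint if the config carries one;
    # otherwise fall back to the class implied by the model type.
    config = AutoConfig.from_pretrained(name)
    if getattr(config, "tokenizer_class", None) == "BertTokenizer":
        return BertTokenizer.from_pretrained(name)
    return AlbertTokenizer.from_pretrained(name)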

Thank you. I used the BERT tokenizer instead, and it works.


Closing for now.

