Transformers: Cannot load 'bert-base-german-cased'

Created on 11 Jul 2019 · 6 Comments · Source: huggingface/transformers

tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')

Output:

Model name 'bert-base-german-cased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'bert-base-german-cased' was a path or url but couldn't find any file associated to this path or url.
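For context, this warning comes from the shortcut-name lookup in older pytorch-pretrained-bert releases, which resolved names through a hard-coded map. The sketch below mimics that logic; the map contents and URLs are illustrative stand-ins, not the library's actual values:

```python
# Hypothetical stand-in for the hard-coded shortcut-name map used by
# older pytorch-pretrained-bert releases; contents are illustrative only.
PRETRAINED_VOCAB_ARCHIVE_MAP = {
    "bert-base-uncased": "https://example.com/bert-base-uncased-vocab.txt",
    "bert-base-cased": "https://example.com/bert-base-cased-vocab.txt",
    # 'bert-base-german-cased' is absent, as in old releases
}

def resolve_vocab(name_or_path):
    """Mimic the lookup: known shortcut names map to a vocab URL;
    anything else falls through and is treated as a path or URL."""
    if name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP:
        return PRETRAINED_VOCAB_ARCHIVE_MAP[name_or_path]
    # The library then tries to read this as a local path or URL, which
    # is what produces the "was not found in model name list" warning.
    return name_or_path

print(resolve_vocab("bert-base-german-cased"))  # -> bert-base-german-cased
```

So on a release whose map lacks the entry, the name silently falls through to path handling and the load fails.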

All 6 comments

Hi @laifi,

I cannot reproduce this issue. Are you sure that you are running the latest code from the master branch? It looks suspicious to me that tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased') doesn't find the model.
Can you please check whether the corresponding line is present in your PRETRAINED_VOCAB_ARCHIVE_MAP?

For your second approach with downloaded files:

  • Be aware that model packaging recently changed from archives to individual files for the vocab, model, and config (see here). If you really want to download manually, you should download the .bin file, bert_config.json, and the vocab file into a folder called "bert-base-german-cased".
  • from_pretrained expects a model name or a path, not a .bin file. You should try: BertTokenizer.from_pretrained('YOUR_PATH_TO/bert-base-german-cased')

Hope that helps!
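To make the manual-download route above concrete, here is a minimal sketch of the expected folder layout, using empty stand-in files in place of the real downloads (the from_pretrained call is left commented out, since it needs the actual files):

```python
from pathlib import Path
import tempfile

# Local folder named after the model, as suggested above.
model_dir = Path(tempfile.mkdtemp()) / "bert-base-german-cased"
model_dir.mkdir()

# Stand-ins for the three downloaded files (names follow the
# pytorch-pretrained-bert conventions of the time):
expected = ["pytorch_model.bin", "bert_config.json", "vocab.txt"]
for name in expected:
    (model_dir / name).touch()

assert all((model_dir / name).exists() for name in expected)

# Pass the folder, not an individual .bin file:
# tokenizer = BertTokenizer.from_pretrained(str(model_dir))
print(sorted(p.name for p in model_dir.iterdir()))
```

The key point is that from_pretrained receives the directory path; it locates the vocab, config, and weights inside it by their conventional file names.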

Thank you @tholor, I installed the package with pip and I cannot find 'bert-base-german-cased' in PRETRAINED_VOCAB_ARCHIVE_MAP.
I then reinstalled the package from source, and now it's working.

@laifi I keep getting the same error that you got:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

I also tried reinstalling it; how did you fix it?

@shaked571, I just uninstalled the pip package and installed it again from source (try not to keep any cache for the package).
PS: the issue is fixed in the latest migration from pytorch-pretrained-bert to pytorch-transformers.
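When a reinstall does not seem to take effect, it can help to confirm which version is actually being imported, since a stale pip cache can leave an old copy on the path. A small sketch (the package names in the comment are just the ones discussed in this thread):

```python
import importlib

def installed_version(package):
    """Return the package's __version__ attribute, or None if the
    package is not importable or has no version attribute."""
    try:
        module = importlib.import_module(package)
    except ImportError:
        return None
    return getattr(module, "__version__", None)

# e.g. installed_version("pytorch_transformers") or installed_version("transformers")
print(installed_version("math"))  # stdlib math has no __version__ -> None
```

If the reported version predates the release that added 'bert-base-german-cased', the reinstall did not actually replace the package.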

Hi,
I also ran into the same issue when trying this piece of code in Google Colab.
tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')

Hi,
I also have the same issue. Using

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

solves the problem for me.
