Transformers: Cannot load 'bert-base-german-cased'

Created on 11 Jul 2019 · 6 Comments · Source: huggingface/transformers

tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')

Output:

Model name 'bert-base-german-cased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'bert-base-german-cased' was a path or url but couldn't find any file associated to this path or url.
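For context, this warning comes from the shortcut-name lookup in older pytorch-pretrained-bert releases, which resolved names through a hard-coded map. The sketch below mimics that logic; the map contents and URLs are illustrative stand-ins, not the library's actual values:

```python
# Hypothetical stand-in for the hard-coded shortcut-name map used by
# older pytorch-pretrained-bert releases; contents are illustrative only.
PRETRAINED_VOCAB_ARCHIVE_MAP = {
    "bert-base-uncased": "https://example.com/bert-base-uncased-vocab.txt",
    "bert-base-cased": "https://example.com/bert-base-cased-vocab.txt",
    # 'bert-base-german-cased' is absent, as in old releases
}

def resolve_vocab(name_or_path):
    """Mimic the lookup: known shortcut names map to a vocab URL;
    anything else falls through and is treated as a path or URL."""
    if name_or_path in PRETRAINED_VOCAB_ARCHIVE_MAP:
        return PRETRAINED_VOCAB_ARCHIVE_MAP[name_or_path]
    # The library then tries to read this as a local path or URL, which
    # is what produces the "was not found in model name list" warning.
    return name_or_path

print(resolve_vocab("bert-base-german-cased"))  # -> bert-base-german-cased
```

So on a release whose map lacks the entry, the name silently falls through to path handling and the load fails.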

All 6 comments

Hi @laifi,

I cannot reproduce this issue. Are you sure that you are running the latest code from the master branch? It looks suspicious to me that tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased') doesn't find the model.
Can you please check whether the corresponding line is present in your PRETRAINED_VOCAB_ARCHIVE_MAP?

For your second approach with downloaded files:

  • Be aware that model packaging recently changed from archives to individual files for the vocab, model, and config (see here). If you really want to download manually, you should download the .bin file, bert_config.json, and the vocab file into a folder called "bert-base-german-cased".
  • from_pretrained expects a model name or a path, not a .bin file. You should try: BertTokenizer.from_pretrained('YOUR_PATH_TO/bert-base-german-cased')

Hope that helps!
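To make the manual-download route above concrete, here is a minimal sketch of the expected folder layout, using empty stand-in files in place of the real downloads (the from_pretrained call is left commented out, since it needs the actual files):

```python
from pathlib import Path
import tempfile

# Local folder named after the model, as suggested above.
model_dir = Path(tempfile.mkdtemp()) / "bert-base-german-cased"
model_dir.mkdir()

# Stand-ins for the three downloaded files (names follow the
# pytorch-pretrained-bert conventions of the time):
expected = ["pytorch_model.bin", "bert_config.json", "vocab.txt"]
for name in expected:
    (model_dir / name).touch()

assert all((model_dir / name).exists() for name in expected)

# Pass the folder, not an individual .bin file:
# tokenizer = BertTokenizer.from_pretrained(str(model_dir))
print(sorted(p.name for p in model_dir.iterdir()))
```

The key point is that from_pretrained receives the directory path; it locates the vocab, config, and weights inside it by their conventional file names.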

Thank you @tholor, I installed the package with pip and I cannot find 'bert-base-german-cased' in PRETRAINED_VOCAB_ARCHIVE_MAP.
I then reinstalled the package from source, and now it's working.

@laifi I keep getting the same error that you got:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

I also tried reinstalling it; how did you fix it?

@shaked571, I just uninstalled the pip package and installed it again from source (try not to keep any cache for the package).
PS: the issue is fixed in the latest migration from pytorch-pretrained-bert to pytorch-transformers.
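When a reinstall does not seem to take effect, it can help to confirm which version is actually being imported, since a stale pip cache can leave an old copy on the path. A small sketch (the package names in the comment are just the ones discussed in this thread):

```python
import importlib

def installed_version(package):
    """Return the package's __version__ attribute, or None if the
    package is not importable or has no version attribute."""
    try:
        module = importlib.import_module(package)
    except ImportError:
        return None
    return getattr(module, "__version__", None)

# e.g. installed_version("pytorch_transformers") or installed_version("transformers")
print(installed_version("math"))  # stdlib math has no __version__ -> None
```

If the reported version predates the release that added 'bert-base-german-cased', the reinstall did not actually replace the package.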

Hi,
I also ran into the same issue when trying this piece of code in Google Colab.
tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')

Hi,
I also have the same issue. Using

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

solves the problem for me.
