Transformers: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Created on 24 Oct 2019 · 7 comments · Source: huggingface/transformers

When I load the pretrained model from a local bin file, I get a decoding error.

All 7 comments

Hi, could you provide more information, e.g. by following the issue template? Please tell us which model, which bin file, and which command you used.

tokenizer = BertTokenizer.from_pretrained("/home/liping/liping/bert/bert-base-cased-pytorch_model.bin")

XLNetModel.from_pretrained("/data2/liping/xlnet/xlnet-base-cased-pytorch_model.bin")
These two commands trigger the error.

@lipingbj With the latest versions of transformers you need to pass the path to the directory containing the PyTorch-compatible model, so in your example use:

tokenizer = BertTokenizer.from_pretrained("/home/liping/liping/bert/")

The following files must be located in that folder:

  • vocab.txt - vocabulary file
  • pytorch_model.bin - the PyTorch-compatible (and converted) model
  • config.json - json-based model configuration

Please make sure these files exist and, for example, rename bert-base-cased-pytorch_model.bin to pytorch_model.bin; see the sketch below.

That should work :)
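
For reference, a minimal sketch of loading from such a local directory (this is only a sketch: the path is the one from this thread, so adjust it to your setup, and it assumes the three files listed above are already in place):

from transformers import BertModel, BertTokenizer

model_dir = "/home/liping/liping/bert/"  # pass the directory, not the .bin file itself

tokenizer = BertTokenizer.from_pretrained(model_dir)  # reads vocab.txt
model = BertModel.from_pretrained(model_dir)          # reads config.json + pytorch_model.bin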

encoder_model = BertModel.from_pretrained("/home/liping/liping/bert/pytorch-bert-model")
tokenizer = BertTokenizer.from_pretrained("/home/liping/liping/bert/pytorch-bert-model")

vocab.txt, pytorch_model.bin, and config.json are all included in the directory bert/pytorch-bert-model.

OSError: Model name '/home/liping/liping/bert/pytorch-bert-model' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased). We assumed '/home/liping/liping/bert/pytorch-bert-model/config.json' was a path or url to a configuration file named config.json or a directory containing such a file but couldn't find any such file at this path or url.

As the error says, "We assumed '/home/liping/liping/bert/pytorch-bert-model/config.json' was a path or url to a configuration file named config.json or a directory containing such a file but couldn't find any such file at this path or url."

Your data does not seem to be in "/home/liping/liping/bert/pytorch-bert-model"
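
A quick sanity check (just a sketch; the directory path is taken from the error message above) to confirm the three required files are really where from_pretrained is looking:

import os

model_dir = "/home/liping/liping/bert/pytorch-bert-model"
for name in ("config.json", "vocab.txt", "pytorch_model.bin"):
    path = os.path.join(model_dir, name)
    print(name, "->", "found" if os.path.isfile(path) else "MISSING", path)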

Hello,

I'm trying to load BioBERT into PyTorch and I'm seeing a different error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

any hints? @LysandreJik

Can you show the code that you are running to load from pre-trained weights?
For example

model = BertForSequenceClassification.from_pretrained('/path/to/directory/containing/model_artifacts/')

As stefan-it mentioned above, the directory must contain the 3 required files.
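
A fuller sketch of that call (assuming the BioBERT checkpoint has already been converted to a PyTorch-compatible pytorch_model.bin and sits next to vocab.txt and config.json; the directory path below is hypothetical):

from transformers import BertForSequenceClassification, BertTokenizer

model_dir = "/path/to/biobert-pytorch/"  # hypothetical local directory with the 3 files
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertForSequenceClassification.from_pretrained(model_dir)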
