Transformers: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 3920: character maps to <undefined>

Created on 22 Nov 2018 · 2 comments · Source: huggingface/transformers

Installed pytorch-pretrained-BERT from source, Python 3.7, Windows 10

When I run the following snippet:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

# Load pre-trained model tokenizer (vocabulary)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

I get the following:


UnicodeDecodeError Traceback (most recent call last)
in <module>()
3
4 # Load pre-trained model tokenizer (vocabulary)
----> 5 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

~\Anaconda3\lib\site-packages\pytorch_pretrained_bert\tokenization.py in from_pretrained(cls, pretrained_model_name, do_lower_case)
139 vocab_file, resolved_vocab_file))
140 # Instantiate tokenizer.
--> 141 tokenizer = cls(resolved_vocab_file, do_lower_case)
142 except FileNotFoundError:
143 logger.error(

~\Anaconda3\lib\site-packages\pytorch_pretrained_bert\tokenization.py in __init__(self, vocab_file, do_lower_case)
93 "Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "
94 "model use tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)".format(vocab_file))
---> 95 self.vocab = load_vocab(vocab_file)
96 self.ids_to_tokens = collections.OrderedDict(
97 [(ids, tok) for tok, ids in self.vocab.items()])

~\Anaconda3\lib\site-packages\pytorch_pretrained_bert\tokenization.py in load_vocab(vocab_file)
68 with open(vocab_file, "r", encoding="utf8") as reader:
69 while True:
---> 70 token = convert_to_unicode(reader.readline())
71 if not token:
72 break

~\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 3920: character maps to <undefined>
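The cp1252 frame in the traceback shows that Python fell back to the Windows default codec when reading the vocabulary file. You can check which default codec your platform uses with a quick snippet (a generic illustration, not code from the library):

import locale
print(locale.getpreferredencoding())  # typically 'cp1252' on Windows, where this error occurs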


All 2 comments

I am facing the same problem.

Fixed it by changing line 68 of tokenization.py to: with open(vocab_file, "r", encoding="utf-8") as reader:
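For reference, here is a minimal sketch of load_vocab with the encoding made explicit, reconstructed from the traceback above (the convert_to_unicode call is omitted for brevity, so the actual library code may differ slightly):

import collections

def load_vocab(vocab_file):
    # Maps each token (one per line in the vocab file) to its row index.
    vocab = collections.OrderedDict()
    index = 0
    # encoding="utf-8" stops Python from using the platform default
    # codec (cp1252 on Windows), which cannot decode byte 0x90.
    with open(vocab_file, "r", encoding="utf-8") as reader:
        while True:
            token = reader.readline()
            if not token:
                break
            vocab[token.strip()] = index
            index += 1
    return vocab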

Thanks, it's fixed on master and will be included in the next release.
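Until that release is out, one possible workaround (assuming the repository named above) is to install the patched code directly from the master branch:

pip install --upgrade git+https://github.com/huggingface/pytorch-pretrained-BERT.git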
