I installed pytorch-pretrained-BERT from source on Python 3.7, Windows 10.
When I run the following snippet:
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
I get the following:
UnicodeDecodeError Traceback (most recent call last)
3
4 # Load pre-trained model tokenizer (vocabulary)
----> 5 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
~\Anaconda3\lib\site-packages\pytorch_pretrained_bert\tokenization.py in from_pretrained(cls, pretrained_model_name, do_lower_case)
139 vocab_file, resolved_vocab_file))
140 # Instantiate tokenizer.
--> 141 tokenizer = cls(resolved_vocab_file, do_lower_case)
142 except FileNotFoundError:
143 logger.error(
~\Anaconda3\lib\site-packages\pytorch_pretrained_bert\tokenization.py in __init__(self, vocab_file, do_lower_case)
93 "Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "
94 "model use tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)".format(vocab_file))
---> 95 self.vocab = load_vocab(vocab_file)
96 self.ids_to_tokens = collections.OrderedDict(
97 [(ids, tok) for tok, ids in self.vocab.items()])
~\Anaconda3\lib\site-packages\pytorch_pretrained_bert\tokenization.py in load_vocab(vocab_file)
68 with open(vocab_file, "r") as reader:
69 while True:
---> 70 token = convert_to_unicode(reader.readline())
71 if not token:
72 break
~\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 3920: character maps to <undefined>
I am facing the same problem.
I fixed it by changing line 68 of tokenization.py to: with open(vocab_file, "r", encoding="utf-8") as reader:
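For reference, a minimal sketch of how the patched load_vocab in tokenization.py could look with the utf-8 fix applied; the parts of the body not visible in the traceback above are approximated, and the library's convert_to_unicode helper is replaced by a plain readline for brevity:

import collections

def load_vocab(vocab_file):
    """Load a vocabulary file into an ordered token -> index mapping."""
    vocab = collections.OrderedDict()
    index = 0
    # Forcing utf-8 here avoids falling back to the Windows default
    # codec (cp1252), which is what raises the UnicodeDecodeError above.
    with open(vocab_file, "r", encoding="utf-8") as reader:
        while True:
            token = reader.readline()
            if not token:
                break
            vocab[token.strip()] = index
            index += 1
    return vocab

With this change the vocabulary file is read as UTF-8 regardless of the Windows locale.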
Thanks, it's fixed on master and will be included in the next release.