InferSent: Unicode Decode Error

Created on 6 Nov 2018 · 7 Comments · Source: facebookresearch/InferSent

I'm receiving an error when following the encoder demo notebook.

I'm using fastText instead of GloVe on a Windows 10 machine.


UnicodeDecodeError Traceback (most recent call last)
in ()
1 # Load embeddings of K most frequent words
2
----> 3 model.build_vocab_k_words(K=100000)

D:............projectInferSentmodels.py in build_vocab_k_words(self, K)
143 def build_vocab_k_words(self, K):
144 assert hasattr(self, 'w2v_path'), 'w2v path not set'
--> 145 self.word_vec = self.get_w2v_k(K)
146 print('Vocab size : %s' % (K))
147

D:............projectInferSentmodels.py in get_w2v_k(self, K)
121 word_vec = {}
122 with open(self.w2v_path) as f:
--> 123 for line in f:
124 word, vec = line.split(' ', 1)
125 if k <= K:

~\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7674: character maps to <undefined>

A quick search suggests a possible encoding issue at this line, but I'm not entirely certain.

All 7 comments

Never mind, installing Visual Studio C++ Build Tools 2015 and pip installing fasttext fixed this problem.

Alright, I take that back. On Windows there seems to be an issue with calling open() without specifying the encoding. PR not incoming. Please fix.
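For anyone who wants to check this on their own machine: when open() is called in text mode without an encoding argument, Python 3 picks a default that depends on the platform and locale. On Windows this is typically a legacy code page such as cp1252 or gbk rather than UTF-8, which matches the codecs named in the tracebacks in this thread. A minimal sketch for inspecting that default follows; the temporary file is only a placeholder and nothing here is taken from the InferSent code.

```python
import locale
import os
import tempfile

# The locale's preferred encoding, printed for comparison.
locale_encoding = locale.getpreferredencoding(False)
print("Locale preferred encoding:", locale_encoding)

# Create an empty placeholder file so we can open it in text mode.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # No encoding argument: the file object reports whichever default was chosen.
    with open(path) as f:
        handle_encoding = f.encoding
    # Explicit encoding: the default no longer matters.
    with open(path, encoding="utf-8") as f:
        explicit_encoding = f.encoding
finally:
    os.remove(path)

print("open() without encoding used:", handle_encoding)
print("open() with encoding='utf-8' used:", explicit_encoding)
```

If the first two values are something other than UTF-8, reading a UTF-8 word-vector file without an explicit encoding can fail in the way shown above.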

Hi, I have the same problem, but I'm using GloVe on Windows 7. Did you fix it? My error follows:

UnicodeDecodeError Traceback (most recent call last)
in ()
----> 1 infersent.build_vocab(sentences, tokenize=True)

D:\Code\jupyter\SQuAD-master\InferSent\models.py in build_vocab(self, sentences, tokenize)
137 assert hasattr(self, 'w2v_path'), 'w2v path not set'
138 word_dict = self.get_word_dict(sentences, tokenize)
--> 139 self.word_vec = self.get_w2v(word_dict)
140 print('Vocab size : %s' % (len(self.word_vec)))
141

D:\Code\jupyter\SQuAD-master\InferSent\models.py in get_w2v(self, word_dict)
108 word_vec = {}
109 with open(self.w2v_path) as f:
--> 110 for line in f:
111 word, vec = line.split(' ', 1)
112 if word in word_dict:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa2 in position 1389: illegal multibyte sequence

@chuzhifeng Unfortunately, this repo isn't open for PRs, so we can't do much except work around the issue. As a workaround, I modified models.py at lines 109 and 122. Both originally read with open(self.w2v_path) as f: and I changed them to with open(self.w2v_path, encoding="utf-8") as f:.
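To illustrate why the explicit encoding matters, here is a small self-contained sketch that reproduces the failure and the fix outside of InferSent. The sample file, the load_vectors helper, and the use of plain float lists are all hypothetical stand-ins; only the line.split(' ', 1) loop mirrors what the traceback shows. The second token is U+201D, whose UTF-8 bytes end in 0x9d, a byte that cp1252 does not define.

```python
import os
import tempfile

# Hypothetical stand-in for a word-vector file: "word v1 v2 ..." per line.
# U+201D (right double quotation mark) is encoded in UTF-8 as E2 80 9D.
sample = "hello 0.1 0.2\n\u201d 0.3 0.4\n"

fd, vec_path = tempfile.mkstemp(suffix=".vec")
os.close(fd)
with open(vec_path, "w", encoding="utf-8") as f:
    f.write(sample)


def load_vectors(path, encoding):
    """Read 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    word_vec = {}
    with open(path, encoding=encoding) as f:
        for line in f:
            word, vec = line.split(" ", 1)
            word_vec[word] = [float(x) for x in vec.split()]
    return word_vec


# Decoding the UTF-8 file as cp1252 hits the undefined byte 0x9d.
try:
    load_vectors(vec_path, "cp1252")
    cp1252_failed = False
except UnicodeDecodeError as err:
    cp1252_failed = True
    print("cp1252 failed:", err)

# Decoding with the encoding the file was actually written in works.
vectors = load_vectors(vec_path, "utf-8")
print("Loaded words:", sorted(vectors))

os.remove(vec_path)
```

The same reasoning applies to the gbk error reported above: the file is UTF-8, so any other default codec may reject some of its byte sequences.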

Yeah, after I changed this code it runs. Thanks!

Are you using Python 3? The solution proposed by Drappier only works for Python 3 users, but it is indeed the workaround.
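The reason for the Python 3 restriction is that the built-in open() in Python 2 does not accept an encoding argument. One option that nobody in this thread proposed, so treat it as a suggestion only: io.open accepts encoding on both Python 2.6+ and Python 3, where it is the same function as the built-in open(). A minimal sketch with a placeholder file:

```python
import io
import os
import tempfile

# Placeholder file standing in for a word-vector file.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# io.open takes an encoding argument on both Python 2 and Python 3.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"caf\u00e9 0.1 0.2\n")

with io.open(path, encoding="utf-8") as f:
    first_line = f.readline()

os.remove(path)

# The first token comes back as decoded text, not raw bytes.
word = first_line.split(u" ", 1)[0]
print(repr(word))
```

In models.py this would mean replacing open(self.w2v_path) with io.open(self.w2v_path, encoding="utf-8") and adding import io at the top, assuming the rest of the loading code copes with unicode strings on Python 2.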

I already changed the code but still get the error. Any idea how to fix it? Thanks in advance.
