Keras: UnicodeDecodeError for GloVe

Created on 1 Jul 2017 · 7 comments · Source: keras-team/keras

Hi,

when I'm using

embeddings_index = {}
glove_data = 'glove.6B.50d.txt'
f = open(glove_data)
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Loaded %s word vectors.' % len(embeddings_index))

I get the following error on the line for line in f:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-72-ad0473c921c9> in <module>()
      2 glove_data = 'glove.6B.50d.txt'
      3 f = open(glove_data)
----> 4 for line in f:
      5     values = line.split()
      6     word = values[0]

C:\Users\Leonard\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2273: character maps to <undefined>

Most helpful comment

Oh, it works now after I use

f = open(glove_data, encoding="utf8")

All 7 comments

Oh, it works now after I use

f = open(glove_data, encoding="utf8")
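
For reference, here is a complete version of the loading loop with the encoding passed explicitly. This is a sketch based on the snippet from the original post; it assumes 'glove.6B.50d.txt' sits in the working directory.

import numpy as np

embeddings_index = {}
glove_data = 'glove.6B.50d.txt'
# Decode the file as UTF-8 explicitly; on Windows the default is cp1252,
# which cannot decode some bytes in the GloVe file (e.g. 0x9d).
with open(glove_data, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]                                  # the token
        coefs = np.asarray(values[1:], dtype='float32')   # its 50-d vector
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))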

Hi,
I am experiencing the same problem. When using 'utf8' decoding I still get this error:
'charmap' codec can't decode byte 0x9d in position 3692: character maps to <undefined>
It seems "utf8" decoding does not resolve the problem. I am using Windows 10; do you know if there is a specific encoder/decoder for Windows 10? Thanks.
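
Note that a 'charmap' error comes from the cp1252 codec, which here is the Windows default, so it usually means the encoding argument never reached the open() call that actually raised the error. A minimal sketch, assuming the file might also contain a few stray non-UTF-8 bytes, adds errors='ignore' as a fallback:

import numpy as np

embeddings_index = {}
# encoding='utf-8' overrides the Windows default (cp1252);
# errors='ignore' skips any undecodable bytes instead of raising.
with open('glove.6B.50d.txt', encoding='utf-8', errors='ignore') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')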

'utf8' as suggested by leonardltk worked for me. Thanks!

Hi,
I am experiencing the same problem. When using 'utf8' decoding I still get this error:
'charmap' codec can't decode byte 0x9d in position 3692: character maps to <undefined>
It seems "utf8" decoding does not resolve the problem. I am using Windows 10; do you know if there is a specific encoder/decoder for Windows 10? Thanks.

Hi,
have you resolved the issue? I'm facing the same problem.

Try it:

import os
import numpy as np

word_embeddings = {}
with open(os.path.join('../input/glove6b50dtxt/glove.6B.50d.txt')) as f:
    # the with-block closes the file automatically, so no f.close() is needed
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

[image]
Help please.
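
If the attached error is the same UnicodeDecodeError as above, the encoding fix from earlier in the thread should apply here as well. A sketch, keeping the poster's Kaggle-style path (an assumption about their environment):

import os
import numpy as np

word_embeddings = {}
# pass encoding='utf-8' so the read does not depend on the platform's default codec
with open(os.path.join('../input/glove6b50dtxt/glove.6B.50d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word_embeddings[values[0]] = np.asarray(values[1:], dtype='float32')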

It worked with encoding="utf8"!!
