I have to use a word2vec model trained by my coworkers. The model is saved as a .bin file.
I installed gensim and tried to load the model, but the following error occurred:
In [1]: import gensim
In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data
I tried to load the model in both Python 2.7 and 3.5, and it failed the same way. So how can I load the model in gensim? Thanks.
A little more about this model, please. How was it trained / what was it trained with?
I have faced the same problem many times. load_word2vec_format has a flag for ignoring character decoding errors. Change it to ignore and then run it again. See here:
https://github.com/piskvorky/gensim/pull/466
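Roughly like this, using the same call as in the question (the flag I mean is unicode_errors):

import gensim

# Ignore undecodable bytes instead of raising UnicodeDecodeError.
# 'replace' also works if you prefer to keep a placeholder character.
model = gensim.models.Word2Vec.load_word2vec_format(
    '/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin',
    binary=True,
    unicode_errors='ignore')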
@cscorley Sorry for the scant information. The model I'm using was trained by my coworkers in Java. The words are mostly Chinese characters, but I have no idea what encoding was used when the model was trained.
@nick-magnini Thanks, that works when loading the model. However, does ignoring encoding errors affect how the model can be used?
@zfz it means the Java program clobbered the strings, so they are no longer utf8, leading to this exception.
There is no "clean fix" from gensim side -- either force the Java program to respect the utf8 encoding (preferable), or settle for ignoring/replacing invalid utf8 characters, using the unicode_errors flag like @nick-magnini says.
The non-UTF-8 error often comes from truncating multi-byte characters in the middle. Check the Java code; maybe you'll find such a "truncation constant" there and will be able to fix it. The C word2vec code by Mikolov has the same problem. If this is really the root cause, encoding your strings using some single-byte encoding (so that truncation doesn't produce invalid characters) would work too.
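A tiny illustration of what that truncation does (plain Python, not gensim-specific):

# A Chinese character takes 3 bytes in UTF-8, so cutting the byte string
# at a fixed length can split a character in the middle.
word = "中国".encode("utf-8")   # 6 bytes
truncated = word[:4]            # all of '中' plus 1 stray byte of '国'

try:
    truncated.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)                  # "... unexpected end of data"

print(truncated.decode("utf-8", errors="ignore"))   # '中' -- invalid tail dropped
print(truncated.decode("utf-8", errors="replace"))  # '中�' -- placeholder kept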
This is a common question, so I created an FAQ entry for it:
https://github.com/piskvorky/gensim/wiki/Recipes-&-FAQ#q10-loading-a-word2vec-model-fails-with-unicodedecodeerror-utf-8-codec-cant-decode-bytes-in-position-
CC @gojomo
Hello,
Sorry for posting even after you have created the FAQ.
I trained a model on tweets that had some undecodable Unicode characters. When I train it and query the model with similarity and doesnt_match, it works well enough.
Saving the model and opening it generates the problem discussed here, even with the unicode_errors='ignore' flag.
I checked whether sys.setdefaultencoding('utf8') changes anything, but it doesn't.
The unicode_errors='ignore' option should make it impossible for the exact same error to occur; perhaps you're getting some other very-similar error?
What does your code do, exactly, and what error stack are you receiving, exactly?
Also if you are in fact training your own model rather than reusing someone else's, what makes you prefer to use save_word2vec_format() and load_word2vec_format() instead of gensim's native save() and load()?
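For reference, the two pairs have to match; a minimal sketch, assuming a trained Word2Vec model named model and hypothetical file paths:

# Native gensim format: save() pairs with load().
model.save('my_model.gensim')
model2 = gensim.models.Word2Vec.load('my_model.gensim')

# Plain word2vec format: save_word2vec_format() pairs with load_word2vec_format().
model.save_word2vec_format('my_vectors.bin', binary=True)
model3 = gensim.models.Word2Vec.load_word2vec_format('my_vectors.bin', binary=True)

# Mixing them (e.g. save() then load_word2vec_format()) will fail or mis-decode.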
Hello @gojomo, thank you for the fast reply.
I had used save() to save the model and load_word2vec_format() to load it. That's where the problem was.
I am just experimenting with machine learning and NLP; that's why I had my own word embeddings. If I want to use word embeddings in an actual scenario, I will use one that is already trained.
Thank you very much for pointing out the native save and load methods.
It's very reasonable to do your own training – if you have enough data, it's often better than using someone else's vectors trained on less-relevant text! Glad to hear it was just a matter of using mismatched save/load formats.