fastText: Python implementation raises "'utf-8' codec can't decode" error

Created on 6 Jan 2019 · 2 Comments · Source: facebookresearch/fastText

I used the pretrained Korean bin file 'cc.ko.300.bin'.

But when I tested bin_to_vec.py, I got:

Traceback (most recent call last):
  File "bin_to_vec.py", line 30, in <module>
    words = f.get_words()
  File "/usr/local/lib/python3.5/dist-packages/fastText/FastText.py", line 170, in get_words
    pair = self.f.getVocab()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

My Docker container's charset is C.UTF-8.

Python bug

akakakakakaa

👍2
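
The message in the traceback can be reproduced without fastText. The sketch below is only an illustration: `raw_words` and `find_undecodable` are made-up names, and the byte strings are invented examples, not entries taken from cc.ko.300.bin. It shows that 0xed is a legitimate lead byte for many Hangul syllables, so the failure comes from the byte that follows it, as in the three-byte form of a lone surrogate.

```python
# Hypothetical vocabulary entries as raw bytes (not taken from cc.ko.300.bin).
# The first entry is valid Korean text whose first byte is also 0xed.
# The second is the three-byte form of the lone surrogate U+D800,
# which strict UTF-8 decoding rejects.
raw_words = ["한국어".encode("utf-8"), b"\xed\xa0\x80", b"fastText"]


def find_undecodable(entries):
    """Return (index, error message) for each entry that fails strict UTF-8 decoding."""
    failures = []
    for index, entry in enumerate(entries):
        try:
            entry.decode("utf-8")  # errors="strict" is the default
        except UnicodeDecodeError as exc:
            failures.append((index, str(exc)))
    return failures


for index, message in find_undecodable(raw_words):
    print(index, message)
```

Only the second entry is reported; the Korean word decodes cleanly even though it starts with the same byte.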

All 2 comments

Hi @akakakakakaa,

Thank you for reporting this issue! This is probably due to some invalid UTF-8 data that was not filtered out of the Common Crawl training data. We will try to fix this issue quickly.

Best,
Edouard.

EdouardGrave on 15 Jan 2019

👍1

Hi @akakakakakaa,
It looks like surrogate pair characters are treated as valid UTF-8 by the fastText binary and by fastText's Python bindings under Python 2, whereas Python 3's strict decoder rejects them.

With e13484bcb261cda51d33c4940ab5e207aba3ee79, you can now replace the line:

words = f.get_words()

with

words = f.get_words(on_unicode_error='replace')

in bin_to_vec.py. You can also use 'ignore'; the default is 'strict'.
That should unblock you.

Thank you for reporting the issue.
Best regards,
Onur

Celebio on 17 Apr 2019

👍4
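
To see what the three settings mentioned in the comment above would do to an affected word, here is a minimal sketch using only Python's built-in decoder. It assumes that `on_unicode_error` behaves like the `errors` argument of `bytes.decode`, which the option names suggest but this thread does not confirm. `decode_word` and the sample bytes are hypothetical and not part of fastText.

```python
# Made-up entry: an encoded lone surrogate followed by plain ASCII.
raw = b"\xed\xa0\x80abc"


def decode_word(data, on_unicode_error="strict"):
    """Hypothetical helper: decode raw bytes with one of the three handlers."""
    return data.decode("utf-8", errors=on_unicode_error)


results = {}
for handler in ("strict", "replace", "ignore"):
    try:
        results[handler] = decode_word(raw, handler)
    except UnicodeDecodeError as exc:
        # 'strict' raises, as in the original traceback.
        results[handler] = "error: " + exc.reason

for handler, value in results.items():
    print(handler, repr(value))
```

With 'replace', the invalid bytes become U+FFFD replacement characters, so an affected word stays visibly marked. With 'ignore', the invalid bytes are dropped silently, which can leave two distinct vocabulary entries looking identical.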
