fastText: Python implementation raises "'utf-8' codec can't decode" error

Created on 6 Jan 2019 · 2 Comments · Source: facebookresearch/fastText

I used the pretrained Korean bin file 'cc.ko.300.bin'.

But when I tested bin_to_vec.py, I got:

Traceback (most recent call last):
  File "bin_to_vec.py", line 30, in <module>
    words = f.get_words()
  File "/usr/local/lib/python3.5/dist-packages/fastText/FastText.py", line 170, in get_words
    pair = self.f.getVocab()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

My Docker container's charset is C.UTF-8.
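
For anyone trying to reproduce this without the model file, here is a minimal sketch in plain Python 3. It assumes the offending vocabulary entry begins with the bytes of a UTF-8-encoded surrogate; the `sample` bytes and the helper `describe_decode_failure` are hypothetical and are not taken from cc.ko.300.bin or from fastText itself.

```python
# Hypothetical sample: the bytes of an encoded surrogate code point.
# These are an assumption for illustration, NOT read from cc.ko.300.bin.
sample = b"\xed\xa0\x80"


def describe_decode_failure(raw):
    """Return a short description of why strict UTF-8 decoding fails, or 'ok'."""
    try:
        raw.decode("utf-8")  # 'strict' is the default error handler
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte the decoder rejected.
        return "byte 0x{:02x} in position {}: {}".format(
            raw[err.start], err.start, err.reason
        )
    return "ok"


print(describe_decode_failure(sample))
print(describe_decode_failure("한국어".encode("utf-8")))
```

The failure happens in the decoder itself, so it does not depend on the container's locale setting.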

Python bug

All 2 comments

Hi @akakakakakaa,

Thank you for reporting this issue! This is probably due to some invalid UTF-8 data that was not filtered out of the Common Crawl training data. We will try to fix this quickly.

Best,
Edouard.

Hi @akakakakakaa,
It looks like surrogate pair characters are accepted as valid UTF-8 by the fastText binary and by fastText's Python bindings under Python 2, but Python 3's strict UTF-8 decoder rejects them, which is why get_words() raises the error.

With commit e13484bcb261cda51d33c4940ab5e207aba3ee79, you can now replace the line:

words = f.get_words()

by

words = f.get_words(on_unicode_error='replace')

in bin_to_vec.py. You can also use 'ignore'; the default is 'strict'.
That should unblock you.

Thank you for reporting the issue.
Best regards,
Onur
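
To illustrate the difference between the three values, here is a rough sketch using Python's built-in `bytes.decode` error handlers, which share the names 'strict', 'replace' and 'ignore'. The helper `decode_word` and the sample bytes are hypothetical; this mirrors the `on_unicode_error` parameter name only and is not fastText's actual implementation.

```python
# Hypothetical vocabulary entry: one valid Hangul syllable followed by
# the bytes of an encoded surrogate (assumed for illustration only).
raw_word = "한".encode("utf-8") + b"\xed\xa0\x80"


def decode_word(raw, on_unicode_error="strict"):
    """Decode raw bytes as UTF-8 using the given error handler name."""
    return raw.decode("utf-8", errors=on_unicode_error)


# 'replace' substitutes U+FFFD for bytes that cannot be decoded.
replaced = decode_word(raw_word, "replace")
# 'ignore' silently drops bytes that cannot be decoded.
ignored = decode_word(raw_word, "ignore")

print(repr(replaced))
print(repr(ignored))

# 'strict' raises, which matches the traceback reported in this issue.
try:
    decode_word(raw_word)
except UnicodeDecodeError as err:
    print("strict failed:", err.reason)
```

Note that with 'replace' or 'ignore' the returned strings no longer match the original bytes, so two distinct vocabulary entries could end up looking the same after decoding.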
