I used the pretrained Korean bin file 'cc.ko.300.bin'.
But when I ran bin_to_vec.py, I got:
Traceback (most recent call last):
  File "bin_to_vec.py", line 30, in <module>
    words = f.get_words()
  File "/usr/local/lib/python3.5/dist-packages/fastText/FastText.py", line 170, in get_words
    pair = self.f.getVocab()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
My Docker container's charset is C.UTF-8.
Hi @akakakakakaa,
Thank you for reporting this issue! This is probably due to some invalid utf-8 data that was not filtered out of the common crawl training data. We will try to fix this issue rapidly.
Best,
Edouard.
Hi @akakakakakaa ,
It looks like surrogate characters are treated as valid UTF-8 by the fastText binary and by fastText's Python bindings under Python 2, whereas Python 3's strict UTF-8 decoding rejects them.
With commit e13484bcb261cda51d33c4940ab5e207aba3ee79, you can now replace the line:
words = f.get_words()
with
words = f.get_words(on_unicode_error='replace')
in bin_to_vec.py. You can also pass 'ignore'; the default is 'strict'.
That should unblock you.
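For reference, here is a minimal sketch of what the three settings do, using only Python 3's built-in bytes.decode and no fastText at all. The helper name decode_word and the sample bytes are made up for illustration, and it assumes the on_unicode_error value behaves like the standard errors argument of bytes.decode:

```python
# Hypothetical helper: mimics the three on_unicode_error choices with
# Python's standard error handlers for bytes.decode.
def decode_word(raw, on_unicode_error="strict"):
    return raw.decode("utf-8", errors=on_unicode_error)


# "abc" followed by a UTF-8-encoded surrogate (0xED 0xA0 0x80),
# which Python 3 refuses to decode in strict mode.
raw_word = b"abc" + b"\xed\xa0\x80"

try:
    decode_word(raw_word)  # default: strict
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = decode_word(raw_word, "replace")  # bad bytes become U+FFFD
ignored = decode_word(raw_word, "ignore")    # bad bytes are dropped

print(strict_failed, repr(replaced), repr(ignored))
```

With 'replace' the word stays in the vocabulary list with U+FFFD placeholders, while 'ignore' silently shortens it, so 'replace' is probably the safer choice if you want to spot the affected entries later.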
Thank you for reporting the issue.
Best regards,
Onur