gensim.scripts.word2vec2tensor TypeError: write() argument must be str, not bytes

Created on 7 Mar 2018 · 7 comments · Source: RaRe-Technologies/gensim

Python environment

Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) 
[GCC 7.2.0] on linux

How I created article_body_w2v_300.txt

import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("./data/article_body_corpus.txt")

model = Word2Vec(sentences, size=300, window=5, min_count=5, workers=4)

model.wv.save_word2vec_format("article_body_w2v_300.txt", binary=False)

Command I use to run gensim.scripts.word2vec2tensor

python -m gensim.scripts.word2vec2tensor -i article_body_w2v_300.txt -o meow/

Console output

word_embedding  python -m gensim.scripts.word2vec2tensor -i article_body_w2v_300.txt -o meow/
2018-03-07 16:30:29,484 - word2vec2tensor - INFO - running /home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/site-packages/gensim/scripts/word2vec2tensor.py -i article_body_w2v_300.txt -o meow/
2018-03-07 16:30:29,484 - utils_any2vec - INFO - loading projection weights from article_body_w2v_300.txt
2018-03-07 16:30:41,992 - utils_any2vec - INFO - loaded (56543, 300) matrix from article_body_w2v_300.txt
Traceback (most recent call last):
  File "/home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/site-packages/gensim/scripts/word2vec2tensor.py", line 93, in <module>
    word2vec2tensor(args.input, args.output, args.binary)
  File "/home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/site-packages/gensim/scripts/word2vec2tensor.py", line 73, in word2vec2tensor
    file_metadata.write(gensim.utils.to_utf8(word) + gensim.utils.to_utf8('\n'))
TypeError: write() argument must be str, not bytes
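
The failure can be reproduced without gensim. The following is a minimal sketch, assuming Python 3: a file opened in text mode ('w') accepts only str, so writing bytes (which is what gensim.utils.to_utf8 returns on Python 3) raises the same TypeError. The file name and the sample word are made up for illustration.

```python
import os
import tempfile

# Hypothetical stand-in for a vocabulary word; encode() yields bytes,
# which is also what gensim.utils.to_utf8 returns on Python 3.
word_bytes = u"caf\u00e9".encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "metadata.tsv")
error_message = None
with open(path, "w") as file_metadata:  # text mode expects str, not bytes
    try:
        file_metadata.write(word_bytes + b"\n")
    except TypeError as err:
        error_message = str(err)

print(error_message)
```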

Labels: bug, difficulty easy

All 7 comments

Hello @ttpro1995, thanks for the report. Can you try running gensim.scripts.word2vec2tensor with Python 2? I have some ideas about what is happening here.

On Python 2.7, it worked without any fix.

Python 2.7.14 |Anaconda, Inc.| (default, Dec  7 2017, 17:05:42) 
[GCC 7.2.0] on linux2

On Python 3.6, it needs a fix.

Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) 
[GCC 7.2.0] on linux

gensim.utils.to_utf8(word) returns bytes, but write() needs a str. So I added decode("utf-8"):

    with open(outfiletsv, 'w+') as file_vector:
        with open(outfiletsvmeta, 'w+') as file_metadata:
            for word in model.index2word:
                file_metadata.write(gensim.utils.to_utf8(word).decode("utf-8") + gensim.utils.to_utf8('\n'))
                vector_row = '\t'.join(str(x) for x in model[word])
                file_vector.write(vector_row + '\n')

With that change, it works on Python 3.6. However, this fix does not work on Python 2.7.
Running on Python 2.7 with decode("utf-8") gives:

 python gensim2tensor.py -i article_body_w2v_300.txt -o meow/                 
2018-03-08 16:30:51,521 - gensim2tensor - INFO - running gensim2tensor.py -i article_body_w2v_300.txt -o meow/
2018-03-08 16:30:51,521 - utils_any2vec - INFO - loading projection weights from article_body_w2v_300.txt
2018-03-08 16:31:08,798 - utils_any2vec - INFO - loaded (56543, 300) matrix from article_body_w2v_300.txt
Traceback (most recent call last):
  File "gensim2tensor.py", line 74, in <module>
    word2vec2tensor(args.input, args.output, args.binary)
  File "gensim2tensor.py", line 54, in word2vec2tensor
    file_metadata.write(gensim.utils.to_utf8(word).decode("utf-8") + gensim.utils.to_utf8('\n'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 1: ordinal not in range(128)
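
One way to get the same behaviour on both interpreters is to open the output files with io.open and an explicit encoding, then write text only. Note that on Python 3, concatenating a str with the bytes from to_utf8('\n') raises a TypeError of its own, so this sketch keeps everything as text. It is not the fix that was merged into gensim: write_embeddings_tsv, its arguments, and the output file names are hypothetical, and plain lists stand in for the model's vocabulary and vectors.

```python
import io
import os
import tempfile


def write_embeddings_tsv(words, vectors, out_dir):
    """Write one vector per line and one word per line, both as UTF-8 text.

    Hypothetical helper for illustration; not part of gensim.
    """
    tensor_path = os.path.join(out_dir, "tensor.tsv")
    metadata_path = os.path.join(out_dir, "metadata.tsv")
    # io.open behaves the same on Python 2 and 3: it takes an encoding
    # and accepts only text (unicode on 2, str on 3).
    with io.open(tensor_path, "w", encoding="utf-8") as file_vector, \
            io.open(metadata_path, "w", encoding="utf-8") as file_metadata:
        for word, vector in zip(words, vectors):
            if isinstance(word, bytes):
                # Normalise byte strings to text before writing.
                word = word.decode("utf-8")
            file_metadata.write(word + u"\n")
            file_vector.write(u"\t".join(u"%s" % x for x in vector) + u"\n")
    return tensor_path, metadata_path


out_dir = tempfile.mkdtemp()
tensor_path, metadata_path = write_embeddings_tsv(
    [u"\u00e0", b"cat"], [[0.1, 0.2], [0.3, 0.4]], out_dir
)
with io.open(metadata_path, encoding="utf-8") as f:
    metadata_lines = f.read().splitlines()
with io.open(tensor_path, encoding="utf-8") as f:
    tensor_lines = f.read().splitlines()
```

Because the encoding is set on the file object, the non-ASCII character u'\xe0' from the traceback above is encoded as UTF-8 instead of going through the default ASCII codec.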

@ttpro1995 aha, as expected. Big thanks, that's really a bug.

Hi, sorry, I deleted my comment because I saw that @AakaashRao had made a PR as well, but I will finish it up. I will submit a PR when it's done!

@vsocrates right now nobody is working on this fix, so again, feel free to submit a PR.

@menshikh-iv submitted. A quick note: I had to force the data type to be float64 in this line to pass the test with the test data we have now. Please review, thanks!
