Python environment
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
How I make article_body_w2v_300.txt
import gensim
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
sentences = LineSentence("./data/article_body_corpus.txt")
model = Word2Vec(sentences, size=300, window=5, min_count=5, workers=4)
model.wv.save_word2vec_format("article_body_w2v_300.txt", binary=False)
Command I use to run gensim.scripts.word2vec2tensor
python -m gensim.scripts.word2vec2tensor -i article_body_w2v_300.txt -o meow/
Console output
word_embedding python -m gensim.scripts.word2vec2tensor -i article_body_w2v_300.txt -o meow/
2018-03-07 16:30:29,484 - word2vec2tensor - INFO - running /home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/site-packages/gensim/scripts/word2vec2tensor.py -i article_body_w2v_300.txt -o meow/
2018-03-07 16:30:29,484 - utils_any2vec - INFO - loading projection weights from article_body_w2v_300.txt
2018-03-07 16:30:41,992 - utils_any2vec - INFO - loaded (56543, 300) matrix from article_body_w2v_300.txt
Traceback (most recent call last):
  File "/home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/site-packages/gensim/scripts/word2vec2tensor.py", line 93, in <module>
    word2vec2tensor(args.input, args.output, args.binary)
  File "/home/cpu11453local/anaconda3/envs/gensim/lib/python3.6/site-packages/gensim/scripts/word2vec2tensor.py", line 73, in word2vec2tensor
    file_metadata.write(gensim.utils.to_utf8(word) + gensim.utils.to_utf8('\n'))
TypeError: write() argument must be str, not bytes
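The failure above can be reproduced without gensim. The following is a minimal sketch, assuming Python 3; `to_utf8` here is a simplified stand-in for `gensim.utils.to_utf8` (it only encodes text to UTF-8 bytes), and `try_write_bytes_to_text_file` is a hypothetical helper name used for illustration.

```python
import os
import tempfile


def to_utf8(text):
    # Simplified stand-in for gensim.utils.to_utf8 (assumption):
    # encode a text string to UTF-8 bytes.
    return text.encode("utf-8")


def try_write_bytes_to_text_file(path, word):
    """Write bytes to a file opened in text mode.

    Returns the TypeError message if the write is rejected, else None.
    """
    with open(path, "w+") as handle:
        try:
            handle.write(to_utf8(word) + to_utf8("\n"))
        except TypeError as err:
            return str(err)
    return None


path = os.path.join(tempfile.mkdtemp(), "metadata.tsv")
error = try_write_bytes_to_text_file(path, "l\xe0")
print(error)
```

On Python 3 the text-mode file object accepts only `str`, so the write should be rejected with a message like the one in the traceback.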
Hello @ttpro1995, thanks for the report. Can you try running gensim.scripts.word2vec2tensor with Python 2? (I have an idea of what is happening here.)
On Python 2.7, it works without any fix.
Python 2.7.14 |Anaconda, Inc.| (default, Dec 7 2017, 17:05:42)
[GCC 7.2.0] on linux2
On Python 3.6, it needs a fix.
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
gensim.utils.to_utf8(word) returns bytes, but write() on a file opened in text mode needs str, so I added decode("utf-8"):
with open(outfiletsv, 'w+') as file_vector:
    with open(outfiletsvmeta, 'w+') as file_metadata:
        for word in model.index2word:
            file_metadata.write(gensim.utils.to_utf8(word).decode("utf-8") + gensim.utils.to_utf8('\n'))
            vector_row = '\t'.join(str(x) for x in model[word])
            file_vector.write(vector_row + '\n')
With that change, it works on Python 3.6. However, the same fix does not work on Python 2.7.
Running on Python 2.7 with decode("utf-8"):
python gensim2tensor.py -i article_body_w2v_300.txt -o meow/
2018-03-08 16:30:51,521 - gensim2tensor - INFO - running gensim2tensor.py -i article_body_w2v_300.txt -o meow/
2018-03-08 16:30:51,521 - utils_any2vec - INFO - loading projection weights from article_body_w2v_300.txt
2018-03-08 16:31:08,798 - utils_any2vec - INFO - loaded (56543, 300) matrix from article_body_w2v_300.txt
Traceback (most recent call last):
  File "gensim2tensor.py", line 74, in <module>
    word2vec2tensor(args.input, args.output, args.binary)
  File "gensim2tensor.py", line 54, in word2vec2tensor
    file_metadata.write(gensim.utils.to_utf8(word).decode("utf-8") + gensim.utils.to_utf8('\n'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 1: ordinal not in range(128)
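Both tracebacks come from mixing bytes and text at the file boundary. One way to make the write behave the same on Python 2.7 and 3.6 is to open the file with an explicit encoding via `io.open` and write text only. This is a sketch under that assumption, not necessarily the fix that was merged; `write_metadata` is a hypothetical helper, and the sample words are made up (the first contains the `\xe0` character from the error above).

```python
import io
import os
import tempfile


def write_metadata(path, words):
    """Write one word per line as UTF-8 text (hypothetical helper)."""
    # io.open accepts an explicit encoding on both Python 2 and Python 3,
    # so write() always receives text and the file layer does the encoding.
    with io.open(path, "w", encoding="utf-8") as handle:
        for word in words:
            handle.write(word + u"\n")


# Made-up sample vocabulary; u"l\xe0" has a non-ASCII character at position 1.
words = [u"l\xe0", u"word"]
path = os.path.join(tempfile.mkdtemp(), "metadata.tsv")
write_metadata(path, words)

# Read the file back with the same encoding to check the round trip.
with io.open(path, "r", encoding="utf-8") as handle:
    read_back = handle.read().splitlines()
```

With this approach neither `to_utf8` nor `decode` is needed at the call site, as long as the words are already text strings.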
@ttpro1995 aha, as expected. Big thanks, that really is a bug.
@vsocrates yes, you need to add a new test class to https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_scripts.py and write several tests for the function at https://github.com/RaRe-Technologies/gensim/blob/9c6db73919d032ab2f6ea35b3a9043e3b0d2aed5/gensim/scripts/word2vec2tensor.py#L51
Feel free to post a PR :+1:
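As a rough idea of what such a test class could check, here is a self-contained sketch. It does not call gensim: `write_tsv_pair` is a hypothetical stand-in for the script's output step, and the file names and sample vectors are invented for illustration. A real test in test_scripts.py would call the actual word2vec2tensor function instead.

```python
import io
import os
import tempfile
import unittest


def write_tsv_pair(vectors, tensor_path, metadata_path):
    # Hypothetical stand-in for the script's output step (assumption):
    # one tab-separated vector row and one metadata line per word.
    with io.open(tensor_path, "w", encoding="utf-8") as file_vector:
        with io.open(metadata_path, "w", encoding="utf-8") as file_metadata:
            for word, vector in vectors:
                file_metadata.write(word + u"\n")
                file_vector.write(u"\t".join(u"%s" % x for x in vector) + u"\n")


class TestTsvOutput(unittest.TestCase):
    def setUp(self):
        # Invented sample data; the first word is non-ASCII on purpose.
        self.vectors = [(u"l\xe0", [0.1, 0.2, 0.3]), (u"cat", [0.4, 0.5, 0.6])]
        out_dir = tempfile.mkdtemp()
        self.tensor_path = os.path.join(out_dir, "out_tensor.tsv")
        self.metadata_path = os.path.join(out_dir, "out_metadata.tsv")
        write_tsv_pair(self.vectors, self.tensor_path, self.metadata_path)

    def test_metadata_round_trips_non_ascii(self):
        with io.open(self.metadata_path, encoding="utf-8") as handle:
            words = handle.read().splitlines()
        self.assertEqual(words, [word for word, _ in self.vectors])

    def test_one_vector_row_per_word(self):
        with io.open(self.tensor_path, encoding="utf-8") as handle:
            rows = [line.split(u"\t") for line in handle.read().splitlines()]
        self.assertEqual(len(rows), len(self.vectors))
        self.assertTrue(all(len(row) == 3 for row in rows))


suite = unittest.TestLoader().loadTestsFromTestCase(TestTsvOutput)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Including a non-ASCII word in the fixture is the important part, since that is the input that triggered the failure reported in this issue.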
Hi, sorry I deleted my comment as I saw that @AakaashRao had made a PR as well, but I will finish it up. Will submit PR when it's done!
@vsocrates right now nobody is working on this fix, so again, feel free to submit a PR.
@menshikh-iv submitted. A quick note: I had to force the data type to be float64 in this line to pass the test with the test data we have now. Please review, thanks!