I'm using the gensim Word2Vec module to build a linguistic model of text documents.
I've recently found Google's Embedding Projector, described here, which looks like the attachment depicted here, showing an example of a word embedding PCA visualization. The paper behind this visualization is here.

I'm trying to export the word embedding in the TSV file format needed for the t-SNE and PCA views, that is, M lines of N feature columns of real numbers (a 200-dimensional feature vector in this specific case), but I'm not sure how to export my model data in this format.
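For reference, the Projector's tensors TSV is just one tab-separated vector per line, plus an optional metadata file holding one word label per line (values below are made up, truncated to 4 dimensions for illustration):
0.0013	-0.2741	0.0918	0.4371
0.1842	0.0330	-0.5102	0.0047
and the matching metadata file:
house
dog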
My training Python script that builds the model looks like the following, where I'm using a custom iterator, since I'm loading data from documents in streaming mode:
# gensim logging setup, taken from http://rare-technologies.com/word2vec-tutorial/
import logging
from gensim.models import Word2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info( "Training corpora [%s] language:%s..." % (corpora_filepath, corpora_language) )
# My Corpus
# deaccent words, tokenize sentences, keep stopwords, read at most 100 corpus files
corpus_iterator = TextCorpus( corpora_filepath, deaccent=True, tokenize=True, stopwords=False, maxFiles=100 )
# min_count = ignore all words with total frequency lower than this.
words_min_count = 1
# size = number of features, i.e. the dimensionality of the feature vectors.
features_size = 300
# window size
# window is the maximum distance between the current and predicted word within a sentence.
words_distance_window = 4
# if 1 (default), sort the vocabulary by descending frequency before assigning word indexes.
sorted_vocab = 1
# sg defines the training algorithm. By default (sg=0), CBOW is used. Otherwise (sg=1), skip-gram is employed.
train_algorithm = 1
# negative = if > 0, negative sampling will be used; the int specifies how many “noise words” should be drawn (usually between 5-20).
# Default is 5. If set to 0, no negative sampling is used.
negative_sampling = 5
# limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones.
# Every 10 million word types need about 1GB of RAM. Set to None for no limit (default).
max_vocab_memory_size = None
# for other parameters
# @see https://radimrehurek.com/gensim/models/word2vec.html
model = Word2Vec(corpus_iterator,
                 sg=train_algorithm,
                 min_count=words_min_count,
                 size=features_size,
                 window=words_distance_window,
                 sorted_vocab=sorted_vocab,
                 negative=negative_sampling,
                 max_vocab_size=max_vocab_memory_size)
#bigram = Phrases( tokens )
#print "bigram samples %s is %d" % (bigram, len(bigram))
#print "Training 2gram..."
#model.train( bigram[ tokens ] )
# save the model
model_name_prefix='TextDoc'
# model saved in the gensim native format
model_name = model_filepath+model_name_prefix+'-'+corpora_language
logger.info( "Saving trained model..." )
model.save( model_name )
logger.info( "Model saved to %s" % (model_name) )
# this model is a binary file in the original word2vec format
model_name = model_filepath+model_name_prefix+'-vectors-negative'+str(features_size)+'-'+corpora_language+'.bin'
# the vocabulary is a text file ordered by descending frequency:
# wordN frequency
# wordN-1 frequency
vocab_name = model_filepath+model_name_prefix+'-vocab-vectors-negative'+str(features_size)+'-'+corpora_language+'.txt'
logger.info( "Saving trained model and vocabulary word2vec format..." )
model.save_word2vec_format(model_name, fvocab=vocab_name, binary=True)
logger.info( "Binary model saved to %s vocabulary %s" % (model_name, vocab_name) )
My suggestion would be to store the model in the word2vec text format (not binary).
Then write a script that takes a file in this word2vec text format as input, and transforms it into a file in whatever format your other tool needs (TSV) on output.
This will also allow you to use your script on models created by other word2vec tools which share the same text format as gensim, such as the original word2vec C implementation by Mikolov.
If you write such general-purpose conversion script, please consider contributing it back to gensim.
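For example, a minimal sketch of such a converter (hypothetical code, not an existing gensim API; word2vec_txt_to_tsv is an illustrative name, and it assumes the standard word2vec text format: a "vocab_size dim" header line, then one space-separated "word v1 v2 ... vd" line per word):
import sys

def word2vec_txt_to_tsv(txt_path, tensors_path, metadata_path):
    # hypothetical converter: word2vec text format -> Projector TSV files
    with open(txt_path) as src, \
         open(tensors_path, 'w') as tensors, \
         open(metadata_path, 'w') as metadata:
        next(src)  # skip the "vocab_size dim" header line
        for line in src:
            parts = line.rstrip().split(' ')
            metadata.write(parts[0] + '\n')             # the word itself
            tensors.write('\t'.join(parts[1:]) + '\n')  # its vector, tab-separated

if __name__ == '__main__':
    word2vec_txt_to_tsv(sys.argv[1], sys.argv[2], sys.argv[3])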
Since this is not a bug or issue with gensim (what is your actual question or problem?), I'll close this issue. Please continue the discussion on the gensim mailing list.
@loretoparisi Thanks. This is a very useful feature suggestion. Being able to view gensim embeddings in the visualiser would be great!
cc @parthoiiitm
@tmylk thanks, the idea could be to have a save API that writes out this TSV file, something like model.save_word2vec_tsv_format. I'm pretty sure this could be done as @piskvorky described above, and I will try something like that; I'm not so deep inside gensim at this time to know how to do that, but I will try something and update here.
Oh, I didn't realize this was a feature request. Adding "wishlist" and "easy" labels.
I don't think this should be a model method though. Such format conversion tools are better served as stand-alone CLI scripts, without polluting the gensim core. Have a look at the existing "glove-to-word2vec" converter for inspiration.
@piskvorky @tmylk a possible approach - if we are looking at the right tensors data - seems to be:
import gensim

model = gensim.models.Word2Vec.load_word2vec_format(model_path, binary=True)
with open(tensorsfp, 'w+') as tensors:
    with open(metadatafp, 'w+') as metadata:
        for word in model.index2word:
            # one word per line in the metadata file...
            encoded = word.encode('utf-8')
            metadata.write(encoded + '\n')
            # ...and its tab-separated vector on the matching line in the tensors file
            vector_row = '\t'.join(map(str, model[word]))
            tensors.write(vector_row + '\n')
That will write out the requested metadata and tensors TSV files.
If you confirm this, we could add a script like that.
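As a quick sanity check (assuming the same tensorsfp / metadatafp paths and model from the snippet above), both files should end up with exactly one line per vocabulary word:
with open(tensorsfp) as tensors, open(metadatafp) as metadata:
    assert sum(1 for _ in tensors) == sum(1 for _ in metadata) == len(model.index2word)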
@tmylk I am ready to take this.
I would like to implement this. @loretoparisi can you guide through? :)
@markroxor Well, I'm waiting for confirmation from @piskvorky; then I will send the PR to put the script in gensim/scripts/ with instructions to build it, but let's wait for them first!
@loretoparisi Thanks, the approach is right. Looking forward to the script.
@tmylk super! I have sent the PR, https://github.com/RaRe-Technologies/gensim/pull/1051
Implemented in #1051
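For anyone landing here later: the script merged in that PR lives in gensim/scripts/ and can be run stand-alone, with an invocation along these lines (the exact flags may differ; check the script's docstring for current usage):
python -m gensim.scripts.word2vec2tensor --input vectors.bin --output my_model --binary 1
which writes the tensor and metadata TSV files using the given output prefix.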
Watch out: the for word in model.index2word line should use this instead:
model.wv.index2word
@sonictl that's correct, it should be updated to model.wv.index2word to reflect the new API.
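With the newer API the loop from the snippet above would become (same logic, only the attribute access changes; assumes model is a trained Word2Vec instance and Python 3 text-mode files):
for word in model.wv.index2word:
    metadata.write(word + '\n')
    tensors.write('\t'.join(map(str, model.wv[word])) + '\n')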