I'm using the gensim Word2Vec module to build a linguistic model of text documents.
I've recently found Google's Embedding Projector, described here, which looks like the attachment depicted here, showing an example of a word embedding PCA visualization. The paper behind this visualization is here.

I'm trying to export the word embedding in the TSV file format needed for the t-SNE and PCA views, that is, M lines of N feature columns of real numbers (a 200-dimensional feature vector in this specific case), but I'm not sure how to export my model data in this format.
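For reference, the Projector's tensors TSV is just one tab-separated vector per line, plus an optional metadata file holding one word label per line (values below are made up, truncated to 4 dimensions for illustration):
0.0013	-0.2741	0.0918	0.4371
0.1842	0.0330	-0.5102	0.0047
and the matching metadata file:
house
dog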
My training Python script that builds the model looks like the following, where I'm using a custom iterator, since I'm loading data from documents in streaming mode:
# gensim logging setup, taken from http://rare-technologies.com/word2vec-tutorial/
import logging
from gensim.models import Word2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info( "Training corpora [%s] language:%s..." % (corpora_filepath, corpora_language) )
# My Corpus
# deaccent words, tokenize sentences, keep stopwords, read at most 100 corpus files
corpus_iterator = TextCorpus( corpora_filepath, deaccent=True, tokenize=True, stopwords=False, maxFiles=100 )
# min_count = ignore all words with total frequency lower than this.
words_min_count = 1
# size = number of features, i.e. the dimensionality of the feature vectors.
features_size = 300
# window size
# window is the maximum distance between the current and predicted word within a sentence.
words_distance_window = 4
# if 1 (default), sort the vocabulary by descending frequency before assigning word indexes.
sorted_vocab = 1
# sg defines the training algorithm. By default (sg=0), CBOW is used. Otherwise (sg=1), skip-gram is employed.
train_algorithm = 1
# negative = if > 0, negative sampling will be used; the int specifies how many “noise words” should be drawn (usually between 5-20).
# Default is 5. If set to 0, no negative sampling is used.
negative_sampling = 5
# limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones.
# Every 10 million word types need about 1GB of RAM. Set to None for no limit (default).
max_vocab_memory_size = None
# for other parameters
# @see https://radimrehurek.com/gensim/models/word2vec.html
model = Word2Vec(corpus_iterator,
                 sg=train_algorithm,
                 min_count=words_min_count,
                 size=features_size,
                 window=words_distance_window,
                 sorted_vocab=sorted_vocab,
                 negative=negative_sampling,
                 max_vocab_size=max_vocab_memory_size)
#bigram = Phrases( tokens )
#print "bigram samples %s is %d" % (bigram, len(bigram))
#print "Training 2gram..."
#model.train( bigram[ tokens ] )
# save the model
model_name_prefix='TextDoc'
# model saved in the gensim native format
model_name = model_filepath+model_name_prefix+'-'+corpora_language
logger.info( "Saving trained model..." )
model.save( model_name )
logger.info( "Model saved to %s" % (model_name) )
# this model is a binary file in the original word2vec format
model_name = model_filepath+model_name_prefix+'-vectors-negative'+str(features_size)+'-'+corpora_language+'.bin'
# the vocabulary is a text file ordered by descending frequency:
# wordN frequency
# wordN-1 frequency
vocab_name = model_filepath+model_name_prefix+'-vocab-vectors-negative'+str(features_size)+'-'+corpora_language+'.txt'
logger.info( "Saving trained model and vocabulary word2vec format..." )
model.save_word2vec_format(model_name, fvocab=vocab_name, binary=True)
logger.info( "Binary model saved to %s vocabulary %s" % (model_name, vocab_name) )
My suggestion would be to store the model in the word2vec text format (not binary).
Then write a script that takes a file in this word2vec text format as input, and transforms it into a file in whatever format your other tool needs (TSV) on output.
This will also allow you to use your script on models created by other word2vec tools which share the same text format as gensim, such as the original word2vec C implementation by Mikolov.
If you write such general-purpose conversion script, please consider contributing it back to gensim.
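For example, a minimal sketch of such a converter (hypothetical code, not an existing gensim API; word2vec_txt_to_tsv is an illustrative name, and it assumes the standard word2vec text format: a "vocab_size dim" header line, then one space-separated "word v1 v2 ... vd" line per word):
import sys

def word2vec_txt_to_tsv(txt_path, tensors_path, metadata_path):
    # hypothetical converter: word2vec text format -> Projector TSV files
    with open(txt_path) as src, \
         open(tensors_path, 'w') as tensors, \
         open(metadata_path, 'w') as metadata:
        next(src)  # skip the "vocab_size dim" header line
        for line in src:
            parts = line.rstrip().split(' ')
            metadata.write(parts[0] + '\n')             # the word itself
            tensors.write('\t'.join(parts[1:]) + '\n')  # its vector, tab-separated

if __name__ == '__main__':
    word2vec_txt_to_tsv(sys.argv[1], sys.argv[2], sys.argv[3])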
Since this is not a bug or issue with gensim (what is your actual question or problem?), I'll close this issue. Please continue the discussion on the gensim mailing list.
@loretoparisi Thanks. This is a very useful feature suggestion. Being able to view gensim embeddings in the visualiser would be great!
cc @parthoiiitm
@tmylk thanks, the idea could be to have a save API that writes out this TSV file, something like model.save_word2vec_tsv_format. I'm pretty sure this could be done as @piskvorky described above, and I will try something like that; I'm not so deep inside gensim at this time to know how to do that, but I will try something and update here.
Oh, I didn't realize this was a feature request. Adding "wishlist" and "easy" labels.
I don't think this should be a model method though. Such format conversion tools are better served as stand-alone CLI scripts, without polluting the gensim core. Have a look at the existing "glove-to-word2vec" converter for inspiration.
@piskvorky @tmylk a possible approach - if we are looking at the right tensors data - seems to be:
import gensim

model = gensim.models.Word2Vec.load_word2vec_format(model_path, binary=True)
with open(tensorsfp, 'w+') as tensors:
    with open(metadatafp, 'w+') as metadata:
        for word in model.index2word:
            # one word per line in the metadata file...
            encoded = word.encode('utf-8')
            metadata.write(encoded + '\n')
            # ...and its tab-separated vector on the matching line in the tensors file
            vector_row = '\t'.join(map(str, model[word]))
            tensors.write(vector_row + '\n')
That will write out the requested metadata and tensors TSV files.
If you confirm this, we could add a script like that.
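As a quick sanity check (assuming the same tensorsfp / metadatafp paths and model from the snippet above), both files should end up with exactly one line per vocabulary word:
with open(tensorsfp) as tensors, open(metadatafp) as metadata:
    assert sum(1 for _ in tensors) == sum(1 for _ in metadata) == len(model.index2word)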
@tmylk I am ready to take this.
I would like to implement this. @loretoparisi can you guide through? :)
@markroxor Well, I'm waiting for confirmation from @piskvorky; then I will send the PR to put the script in gensim/scripts/ with instructions to build it, but let's wait for them first!
@loretoparisi Thanks, the approach is right. Looking forward to the script.
@tmylk super! I have sent the PR, https://github.com/RaRe-Technologies/gensim/pull/1051
Implemented in #1051
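For anyone landing here later: the script merged in that PR lives in gensim/scripts/ and can be run stand-alone, with an invocation along these lines (the exact flags may differ; check the script's docstring for current usage):
python -m gensim.scripts.word2vec2tensor --input vectors.bin --output my_model --binary 1
which writes the tensor and metadata TSV files using the given output prefix.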
Watch out: the for word in model.index2word line should use this instead:
model.wv.index2word
@sonictl that's correct, it should be updated to model.wv.index2word to reflect the new API.
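With the newer API the loop from the snippet above would become (same logic, only the attribute access changes; assumes model is a trained Word2Vec instance and Python 3 text-mode files):
for word in model.wv.index2word:
    metadata.write(word + '\n')
    tensors.write('\t'.join(map(str, model.wv[word])) + '\n')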