Recently I have been trying to use the Doc2Vec module provided by Gensim, but I am confused by some of its outputs. I am using Gensim 2.3.0 and Python 2.7.12.
I find that the document vector after training is exactly the same as the one before training. I use model.docvecs[0] or model.docvecs.doctag_syn0[0] to get a document vector.
At the same time, the model.docvecs.most_similar output changes significantly after training, and the result is correct.
I also find that for model.infer_vector, the 2-norm of the inferred vector grows as the 'steps' parameter increases.
Now I am not sure whether model.docvecs[0] or model.docvecs.doctag_syn0[0] is the correct way to get a document vector.
My Python code:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

def practice():
    # prepare documents as TaggedDocument objects
    # docs = ...
    # an article tag
    id_str = '706cf480-7fe0-11e7-941e-959224aefd7b'
    # initialize a model
    model = Doc2Vec(size=300, window=20, min_count=2, workers=8, alpha=0.025, min_alpha=0.01, dm=0)
    # build vocabulary
    model.build_vocab(docs)
    # get the initial document vector
    docvec1 = model.docvecs[0]
    docvecsyn1 = model.docvecs.doctag_syn0[0]
    # calculate most similar documents
    # (the model is not trained yet, so the results should be wrong)
    docsim1 = model.docvecs.most_similar(id_str)
    # train this model
    model.train(docs, total_examples=len(docs), epochs=20)
    # get the trained document vector
    docvec2 = model.docvecs[0]
    docvecsyn2 = model.docvecs.doctag_syn0[0]
    # calculate most similar documents
    # (we expect the results to be correct)
    docsim2 = model.docvecs.most_similar(id_str)
    # choose one document
    doc = docs[0].words
    # infer vectors with different 'steps' parameters
    infervec1 = model.infer_vector(doc, alpha=0.025, min_alpha=0.01, steps=1)
    infervec2 = model.infer_vector(doc, alpha=0.025, min_alpha=0.01, steps=10)
    infervec3 = model.infer_vector(doc, alpha=0.025, min_alpha=0.01, steps=100)
    infervec4 = model.infer_vector(doc, alpha=0.025, min_alpha=0.01, steps=1000)
    # print results
    # document vectors
    print('Document vector:')
    # the document vectors do not change after training
    print(docvec1[:5])
    print(docvec2[:5])
    print(docvecsyn1[:5])
    print(docvecsyn2[:5])
    # most similar documents
    print('Most similar:')
    # before training the result is wrong; after training it is correct
    print(docsim1[:2])
    print(docsim2[:2])
    # inferred vectors with different 'steps' parameters
    print('Inferred vector:')
    # they are quite different
    print(infervec1[:5])
    print(infervec2[:5])
    print(infervec3[:5])
    print(infervec4[:5])
    # norms of the inferred vectors
    print('Norm of inferred vector:')
    # the norm of the inferred vector grows for bigger steps
    print(np.linalg.norm(infervec1))
    print(np.linalg.norm(infervec2))
    print(np.linalg.norm(infervec3))
    print(np.linalg.norm(infervec4))

practice()
And the outputs:
Document vector:
[ 0.45329446 0.04028329 0.74391299 0.23470438 0.28504953]
[ 0.45329446 0.04028329 0.74391299 0.23470438 0.28504953]
[ 0.45329446 0.04028329 0.74391299 0.23470438 0.28504953]
[ 0.45329446 0.04028329 0.74391299 0.23470438 0.28504953]
Most similar:
[('0029ed60-7903-11e7-bf23-1fad2a3e2eee', 0.1123562902212143), ('000a1c60-469a-11e7-b51a-11cd5459406e', 0.104494109749794)]
[('e20eae20-8063-11e7-941e-959224aefd7b', 0.9975029230117798), ('00482ab0-6ad4-11e7-8c8e-312511c861bd', 0.4974672198295593)]
Inferred vector:
[ 0.18638353 0.01556012 0.50137675 0.14996666 0.24600163]
[ 0.58159155 0.01523605 0.77991062 0.21884988 0.303388 ]
[ 1.07764459 0.03167941 0.96692777 0.22491962 0.08230448]
[ 1.18006694 0.21795054 0.61040705 0.67637557 -0.62287807]
Norm of inferred vector:
5.25663
8.78524
11.4886
14.5209
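(A side note on the growing norms above: each additional 'steps' iteration applies more gradient updates to a vector that starts as a small random initialization, so the raw magnitude tends to keep growing. If only the direction matters, comparing unit-normalized vectors, which is effectively what cosine-based most_similar does, removes this effect. A minimal sketch with plain NumPy, not gensim code, and the `unit` helper is just illustrative:)

```python
import numpy as np

def unit(v):
    """Scale a vector to unit L2 norm (leave zero vectors alone)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# Two vectors with the same direction but very different magnitudes.
a = np.array([3.0, 4.0])
b = np.array([30.0, 40.0])

print(np.linalg.norm(unit(a)))   # 1.0
print(np.allclose(unit(a), unit(b)))  # True: identical after normalization
```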
It's likely that your docs iterable object doesn't support multiple iterations, so no training passes are occurring after the first vocabulary scan. This would usually be clear from logged output at the INFO (or, if that doesn't help, DEBUG) level: the timings and counts of training would show unexpected values.
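(For reference, a minimal sketch of this single-pass pitfall, using plain ints as stand-ins for TaggedDocument objects; the class name is illustrative, not a gensim API:)

```python
def doc_stream():
    # Stand-in for reading documents from disk.
    for i in range(3):
        yield i

# A bare generator supports only one pass: build_vocab() would consume
# it, leaving nothing for train() to iterate over.
gen = doc_stream()
first_pass = list(gen)   # [0, 1, 2]
second_pass = list(gen)  # [] -- already exhausted

# An object whose __iter__ returns a fresh generator restarts each pass,
# as does a plain list.
class RestartableCorpus(object):
    def __iter__(self):
        return doc_stream()

corpus = RestartableCorpus()
print(len(list(corpus)))  # 3
print(len(list(corpus)))  # 3 again
```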
My docs is just a Python list, and each element is a TaggedDocument object.
It should be pointed out that training (model.train) is succeeding, because the most_similar results after training (docsim2) are correct; I have read the tagged articles to verify.
Before training, the results (docsim1) are wrong, which is what we expect to see.
Training (model.train) takes about 1000 s for ~50000 documents (workers=8), with no errors and no warnings.
This time I define docs explicitly, so this code can be run directly. It shows that the most_similar results change after training, but the document vectors (model.docvecs[0]) are exactly the same as before training.
I am using Python 2.7.12 and Gensim 2.3.0 on Kubuntu 16.04.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def practice():
    article1 = [u'My', u'name', u'is', u'David', u'I', u'like', u'Playing', u'Soccer']
    article2 = [u'My', u'name', u'is', u'Jenny', u'I', u'love', u'Basketball']
    article3 = [u'Today', u'is', u'Monday']
    id1 = '1'
    id2 = '2'
    id3 = '3'
    doc1 = TaggedDocument(article1, [id1])
    doc2 = TaggedDocument(article2, [id2])
    doc3 = TaggedDocument(article3, [id3])
    docs = [doc1, doc2, doc3]
    # initialize a model
    model = Doc2Vec(size=5, window=1, min_count=1, workers=8, alpha=0.025, min_alpha=0.01, dm=0)
    # build vocabulary
    model.build_vocab(docs)
    # get the initial document vector and most similar articles
    # (before training, the results should be wrong)
    docvec1 = model.docvecs[0]
    docvecsyn1 = model.docvecs.doctag_syn0[0]
    docsim1 = model.docvecs.most_similar(id1)
    # train this model
    model.train(docs, total_examples=len(docs), epochs=100)
    # get the trained document vector and most similar articles
    # (after training, the results should be correct)
    docvec2 = model.docvecs[0]
    docvecsyn2 = model.docvecs.doctag_syn0[0]
    docsim2 = model.docvecs.most_similar(id1)
    # print results
    # document vectors
    print('Document vector:')
    # before training
    print('(Before training)')
    print(docvec1[:5])
    print(docvecsyn1[:5])
    # the document vectors do not change after training
    print('(After training, exactly the same.)')
    print(docvec2[:5])
    print(docvecsyn2[:5])
    # most similar documents
    print('\nMost similar:')
    # before training the result is wrong; after training it is correct
    print('(Before training)')
    print(docsim1[:2])
    print('(After training, significantly changed)')
    print(docsim2[:2])

practice()
and the outputs:
Document vector:
(Before training)
[ 0.08619376 -0.24537997 -0.23702955 -0.24228545 0.09785088]
[ 0.08619376 -0.24537997 -0.23702955 -0.24228545 0.09785088]
(After training, exactly the same.)
[ 0.08619376 -0.24537997 -0.23702955 -0.24228545 0.09785088]
[ 0.08619376 -0.24537997 -0.23702955 -0.24228545 0.09785088]
Most similar:
(Before training)
[('2', 0.574091911315918), ('3', 0.29283517599105835)]
(After training, significantly changed)
[('2', 0.9343106150627136), ('3', 0.7478965520858765)]
Process finished with exit code 0
The article* variables should preferably be of list-of-lists type:
article1 = [['My'],['name'],['is'],['David'],['I'],['like'],['Playing'],['Soccer']]
article2 = [['My'],['name'],['is'],['Jenny'],['I'],['love'],['Basketball']]
article3 = [['Today'],['is'],['Monday']]
@codehumanity - No, that's not the expected form for Word2Vec or Doc2Vec text examples, and I wouldn't expect data in that format to work if added to the code example above.
Python binds docvec1 to a reference (a NumPy view into the model's vector array) when you write docvec1 = model.docvecs[0], not to a copy, so it tracks the in-place training updates.
Use docvec1 = copy.copy(model.docvecs[0]) to take an independent snapshot.
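(To illustrate, a minimal sketch with a plain NumPy array standing in for model.docvecs.doctag_syn0: basic row indexing returns a view, so in-place training updates show through it, while .copy() takes an independent snapshot.)

```python
import numpy as np

# Stand-in for model.docvecs.doctag_syn0: one row per document tag.
doctag_syn0 = np.zeros((3, 5), dtype=np.float32)

view = doctag_syn0[0]             # basic indexing returns a view, not a copy
snapshot = doctag_syn0[0].copy()  # .copy() takes an independent snapshot

# Simulate an in-place training update of row 0.
doctag_syn0[0] += 1.0

print(view)      # reflects the update: all ones
print(snapshot)  # unchanged: all zeros
```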
Hi @ColdL, could you give me concrete data for reproducing your problem?
ping @ColdL
I think @ishida-titech is right: it happens because docvec1 and docvec2 point to the same underlying array. docsim1 and docsim2 are different because the most_similar() function returns a new list on every call, which also shows that the model was trained. Gensim works correctly here.
Thanks @strnam, I checked and @ishida-titech is absolutely right here. The problem was in the user code (not in Doc2Vec).