Recently I have been trying to use the Doc2Vec module provided by Gensim, but I am confused by some of its outputs. I am using Gensim 2.3.0 and Python 2.7.12.
I find that the document vector after training is exactly the same as the one before training. I use model.docvecs[0] or model.docvecs.doctag_syn0[0] to get a document vector.
At the same time, the model.docvecs.most_similar output changes significantly after training, and the result is correct.
I also find that for model.infer_vector, the 2-norm of the inferred vector grows as the 'steps' parameter increases.
Now I am not sure whether model.docvecs[0] or model.docvecs.doctag_syn0[0] is the correct way to get a document vector.
My Python code:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

def practice():
    # prepare documents as TaggedDocument objects
    # docs = ...
    # an article tag
    id_str = '706cf480-7fe0-11e7-941e-959224aefd7b'
    # initialize a model
    model = Doc2Vec(size=300, window=20, min_count=2, workers=8, alpha=0.025, min_alpha=0.01, dm=0)
    # build vocabulary
    model.build_vocab(docs)
    # get the initial document vector
    docvec1 = model.docvecs[0]
    docvecsyn1 = model.docvecs.doctag_syn0[0]
    # calculate most similar documents
    # (the model is not trained yet, so the results should be wrong)
    docsim1 = model.docvecs.most_similar(id_str)
    # train this model
    model.train(docs, total_examples=len(docs), epochs=20)
    # get the trained document vector
    docvec2 = model.docvecs[0]
    docvecsyn2 = model.docvecs.doctag_syn0[0]
    # calculate most similar documents
    # (we expect the results to be correct)
    docsim2 = model.docvecs.most_similar(id_str)
    # choose one document
    doc = docs[0].words
    # infer vectors with different 'steps' parameters
    infervec1 = model.infer_vector(doc, alpha=0.025, min_alpha=0.01, steps=1)
    infervec2 = model.infer_vector(doc, alpha=0.025, min_alpha=0.01, steps=10)
    infervec3 = model.infer_vector(doc, alpha=0.025, min_alpha=0.01, steps=100)
    infervec4 = model.infer_vector(doc, alpha=0.025, min_alpha=0.01, steps=1000)
    # print results
    # document vectors
    print('Document vector:')
    # the document vectors do not change after training
    print(docvec1[:5])
    print(docvec2[:5])
    print(docvecsyn1[:5])
    print(docvecsyn2[:5])
    # most similar documents
    print('Most similar:')
    # before training the result is wrong; after training it is correct
    print(docsim1[:2])
    print(docsim2[:2])
    # inferred vectors with different 'steps' parameters
    print('Inferred vector:')
    # they are quite different
    print(infervec1[:5])
    print(infervec2[:5])
    print(infervec3[:5])
    print(infervec4[:5])
    # norms of the inferred vectors
    print('Norm of inferred vector:')
    # the norm of the inferred vector grows for bigger steps
    print(np.linalg.norm(infervec1))
    print(np.linalg.norm(infervec2))
    print(np.linalg.norm(infervec3))
    print(np.linalg.norm(infervec4))

practice()
And the outputs:
Document vector:
[ 0.45329446 0.04028329 0.74391299 0.23470438 0.28504953]
[ 0.45329446 0.04028329 0.74391299 0.23470438 0.28504953]
[ 0.45329446 0.04028329 0.74391299 0.23470438 0.28504953]
[ 0.45329446 0.04028329 0.74391299 0.23470438 0.28504953]
Most similar:
[('0029ed60-7903-11e7-bf23-1fad2a3e2eee', 0.1123562902212143), ('000a1c60-469a-11e7-b51a-11cd5459406e', 0.104494109749794)]
[('e20eae20-8063-11e7-941e-959224aefd7b', 0.9975029230117798), ('00482ab0-6ad4-11e7-8c8e-312511c861bd', 0.4974672198295593)]
Inferred vector:
[ 0.18638353 0.01556012 0.50137675 0.14996666 0.24600163]
[ 0.58159155 0.01523605 0.77991062 0.21884988 0.303388 ]
[ 1.07764459 0.03167941 0.96692777 0.22491962 0.08230448]
[ 1.18006694 0.21795054 0.61040705 0.67637557 -0.62287807]
Norm of inferred vector:
5.25663
8.78524
11.4886
14.5209
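(A side note on the growing norms above: each additional 'steps' iteration applies more gradient updates to a vector that starts as a small random initialization, so the raw magnitude tends to keep growing. If only the direction matters, comparing unit-normalized vectors, which is effectively what cosine-based most_similar does, removes this effect. A minimal sketch with plain NumPy, not gensim code, and the `unit` helper is just illustrative:)

```python
import numpy as np

def unit(v):
    """Scale a vector to unit L2 norm (leave zero vectors alone)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# Two vectors with the same direction but very different magnitudes.
a = np.array([3.0, 4.0])
b = np.array([30.0, 40.0])

print(np.linalg.norm(unit(a)))   # 1.0
print(np.allclose(unit(a), unit(b)))  # True: identical after normalization
```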
It's likely that your docs iterable object doesn't support multiple iterations, so no training passes are occurring after the first vocabulary scan. This would usually be clear from logged output at the INFO (or, if that doesn't help, DEBUG) level: the timings and counts of training would show unexpected values.
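(For reference, a minimal sketch of this single-pass pitfall, using plain ints as stand-ins for TaggedDocument objects; the class name is illustrative, not a gensim API:)

```python
def doc_stream():
    # Stand-in for reading documents from disk.
    for i in range(3):
        yield i

# A bare generator supports only one pass: build_vocab() would consume
# it, leaving nothing for train() to iterate over.
gen = doc_stream()
first_pass = list(gen)   # [0, 1, 2]
second_pass = list(gen)  # [] -- already exhausted

# An object whose __iter__ returns a fresh generator restarts each pass,
# as does a plain list.
class RestartableCorpus(object):
    def __iter__(self):
        return doc_stream()

corpus = RestartableCorpus()
print(len(list(corpus)))  # 3
print(len(list(corpus)))  # 3 again
```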
My docs is just a Python list, and each element is a TaggedDocument object.
It should be pointed out that training (model.train) is succeeding, because the most_similar results after training (docsim2) are correct; I have read the tagged articles to verify.
Before training, the results (docsim1) are wrong, which is what we expect to see.
Training (model.train) takes about 1000 s for ~50000 documents (workers=8), with no errors and no warnings.
This time I define docs explicitly, so this code can be run directly. It shows that the most_similar results change after training, but the document vectors (model.docvecs[0]) are exactly the same as before training.
I am using Python 2.7.12 and Gensim 2.3.0 on Kubuntu 16.04.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def practice():
    article1 = [u'My', u'name', u'is', u'David', u'I', u'like', u'Playing', u'Soccer']
    article2 = [u'My', u'name', u'is', u'Jenny', u'I', u'love', u'Basketball']
    article3 = [u'Today', u'is', u'Monday']
    id1 = '1'
    id2 = '2'
    id3 = '3'
    doc1 = TaggedDocument(article1, [id1])
    doc2 = TaggedDocument(article2, [id2])
    doc3 = TaggedDocument(article3, [id3])
    docs = [doc1, doc2, doc3]
    # initialize a model
    model = Doc2Vec(size=5, window=1, min_count=1, workers=8, alpha=0.025, min_alpha=0.01, dm=0)
    # build vocabulary
    model.build_vocab(docs)
    # get the initial document vector and most similar articles
    # (before training, the results should be wrong)
    docvec1 = model.docvecs[0]
    docvecsyn1 = model.docvecs.doctag_syn0[0]
    docsim1 = model.docvecs.most_similar(id1)
    # train this model
    model.train(docs, total_examples=len(docs), epochs=100)
    # get the trained document vector and most similar articles
    # (after training, the results should be correct)
    docvec2 = model.docvecs[0]
    docvecsyn2 = model.docvecs.doctag_syn0[0]
    docsim2 = model.docvecs.most_similar(id1)
    # print results
    # document vectors
    print('Document vector:')
    # before training
    print('(Before training)')
    print(docvec1[:5])
    print(docvecsyn1[:5])
    # the document vectors do not change after training
    print('(After training, exactly the same.)')
    print(docvec2[:5])
    print(docvecsyn2[:5])
    # most similar documents
    print('\nMost similar:')
    # before training the result is wrong; after training it is correct
    print('(Before training)')
    print(docsim1[:2])
    print('(After training, significantly changed)')
    print(docsim2[:2])

practice()
and the outputs:
Document vector:
(Before training)
[ 0.08619376 -0.24537997 -0.23702955 -0.24228545 0.09785088]
[ 0.08619376 -0.24537997 -0.23702955 -0.24228545 0.09785088]
(After training, exactly the same.)
[ 0.08619376 -0.24537997 -0.23702955 -0.24228545 0.09785088]
[ 0.08619376 -0.24537997 -0.23702955 -0.24228545 0.09785088]
Most similar:
(Before training)
[('2', 0.574091911315918), ('3', 0.29283517599105835)]
(After training, significantly changed)
[('2', 0.9343106150627136), ('3', 0.7478965520858765)]
Process finished with exit code 0
The article* variables should preferably be of list-of-lists type:
article1 = [['My'],['name'],['is'],['David'],['I'],['like'],['Playing'],['Soccer']]
article2 = [['My'],['name'],['is'],['Jenny'],['I'],['love'],['Basketball']]
article3 = [['Today'],['is'],['Monday']]
@codehumanity - No, that's not the expected form for Word2Vec or Doc2Vec text examples, and I wouldn't expect data in that format to work if added to the code example above.
Python binds docvec1 to a reference (a NumPy view into the model's vector array) when you write docvec1 = model.docvecs[0], not to a copy, so it tracks the in-place training updates.
Use docvec1 = copy.copy(model.docvecs[0]) to take an independent snapshot.
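(To illustrate, a minimal sketch with a plain NumPy array standing in for model.docvecs.doctag_syn0: basic row indexing returns a view, so in-place training updates show through it, while .copy() takes an independent snapshot.)

```python
import numpy as np

# Stand-in for model.docvecs.doctag_syn0: one row per document tag.
doctag_syn0 = np.zeros((3, 5), dtype=np.float32)

view = doctag_syn0[0]             # basic indexing returns a view, not a copy
snapshot = doctag_syn0[0].copy()  # .copy() takes an independent snapshot

# Simulate an in-place training update of row 0.
doctag_syn0[0] += 1.0

print(view)      # reflects the update: all ones
print(snapshot)  # unchanged: all zeros
```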
Hi @ColdL, could you give me concrete data for reproducing your problem?
ping @ColdL
I think @ishida-titech is right: it happens because docvec1 and docvec2 point to the same underlying array. docsim1 and docsim2 are different because the most_similar() function returns a new list on every call, which also shows that the model was trained. Gensim works correctly here.
Thanks @strnam, I checked and @ishida-titech is absolutely right here. The problem was in the user code (not in Doc2Vec).