Gensim: Doc2Vec infer_vector() could (as option?) offer deterministic result

Created on 9 Sep 2015 · 46 comments · Source: RaRe-Technologies/gensim

Hello,

I recently updated to the current Gensim version and after this update the following situation arises when using infer_vector (doc2vec):

After loading a saved model I use the infer_vector method to infer the vector of a new sentence. However, inferring the vector of the same sentence again leads to very different results. Oddly, after closing my application and restarting it, the infer_vector method delivers the same sequence of vectors as in the previous session (the first vector is the same as the first vector in the first app run, and so on). I guess there is a bug somewhere: the newly inferred sentence influences the model, and that influence is not saved when the application stops, hence the same values again after a restart.

The problem occurs after updating to 12.1 from an older GitHub version (I guess it was 11.x). In the older version this problem did not occur. Unfortunately, I could not figure out what is going wrong.
Any ideas?

Greets Quhfus

Labels: difficulty hard, feature

Most helpful comment

Just want to quickly confirm what @gojomo suggested: By resetting (or simply reseeding) the random state of the model before each inference step, I get the same vector each time (I use DM and negative sampling to train the model). I.e. simply do:

model.random.seed(0)
model.infer_vector(words)

All 46 comments

If either negative-sampling (the negative parameter) or frequent-word-downsampling (the sample parameter) are part of your configuration, then both training and inference are at times using a random-number generator.

In some ways the code tries to seed this source so that runs using the exact same set of inputs in the exact same order will be identical – but even the slightest difference will change things somewhat. For infer_vector, we aren't currently trying to reset the RNG for determinism before each inference. So a second run is in fact using a slightly different set of random draws than the first one.

Are you using negative-sampling and/or frequent-word downsampling? If so, that's likely the cause of what you've observed. The vectors from subsequent runs should still be very similar, if the example text and model are sufficiently informative, and enough inference-gradient-descent has been done. (And since these techniques are dependent on some randomized choices and approximations, there's no single 'right' vector for particular inputs, just the "best found under the limits provided".)

If you need such determinism, you should be able to force it by explicitly resetting the model.random property to a freshly- and deterministically seeded RandomState instance, before each call to infer_vector(). (We could consider adding that as an option to the train_* methods, but it might not be the right default...)
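For illustration, that workaround might look like the following (a rough sketch only; model and tokens stand in for an already-loaded Doc2Vec model and a list of word tokens):

import numpy as np

model.random = np.random.RandomState(1)  # fresh, deterministically-seeded RNG
vec_a = model.infer_vector(tokens)

model.random = np.random.RandomState(1)  # reset again before the next call
vec_b = model.infer_vector(tokens)

assert np.allclose(vec_a, vec_b)  # identical random draws, so matching vectors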

Basically, I use Doc2Vec(..., min_count=3, size=1000, workers=8). Neither the negative nor the sample parameter is used. However, a look into the train_document_dm function shows that a random generator is always used here:

word_vocabs = [model.vocab[w] for w in doc_words if w in model.vocab and
                   model.vocab[w].sample_int > model.random.rand() * 2**32]

That is probably the explanation. However, I have pretty short documents for which I would like to infer vectors. Increasing the number of iterations in the inference process leads to more stable vectors (the difference between the resulting vectors decreases).
What do you suggest: increasing the number of iterations, or performing the inference multiple times with the same document and averaging the resulting vectors?

That code you've quoted is the pure-Python version; it'll only be running if the C extensions aren't working for you. (And since those make training about 100X faster, getting them to work should be a top priority.) The logic is similar in the C code, but also, if you're not using sampling, the vocab's sample_int will already be at its maximum, so the test always passes. So this won't be the cause of randomness between runs if you're not choosing to use subsampling.

However, I just remembered another source of randomness: the effective window size. Rather than using the fixed max window size you've configured (or left at the default of 5), some random number _up to_ 5 is used, as per the original Word2Vec paper. This helps weight closer words more than distant ones. The workaround I mentioned, explicitly re-seeding the model's random before each inference, will likely still help.

I suspect 15 iterations on one inference (with the alpha learning rate automatically decreasing through the 15 steps) is likely to be more efficient than (for example) averaging together 3 runs of 5 iterations each. It's less work (same 15 steps, but no extra averaging step) and more consistent with the idea that each gradient-descent step edges closer to the best-achievable values. But you might want to test both: which gives you a 'better' vector for your purposes?
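As a rough sketch of that comparison (assuming an existing model and a token list tokens; the step counts are just examples):

import numpy as np

# one longer inference run
vec_single = model.infer_vector(tokens, steps=15)

# versus averaging several shorter runs
runs = [model.infer_vector(tokens, steps=5) for _ in range(3)]
vec_averaged = np.mean(runs, axis=0)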

You may also want to try DBOW mode (dm=0). It's faster and creates better vectors for many purposes/datasets. (The window parameter also doesn't affect the DBOW vector training, so you might get reproducible inferences without the random-patching trick.)

Well, it is clear that the C code is significantly faster, but I used the Python code to look for the main cause of the mentioned issue (the C code is not really understandable to me).

I tried to train a small doc2vec model with DBOW (dm=0) and, as you said, the resulting vectors are identical. I have also found the code snippet responsible for the window randomness.
However, in my further experiments I will keep this randomness issue in mind and will probably use > 5 iterations in infer_vector to obtain similar vectors.

Since this comes up a bunch, I'm going to recast this issue as a feature request: infer_vector() should have an option and/or doc-comment explaining how to get deterministic results if desired.

@tmylk @cscorley we could also promote this concept (repeatability) into a top-level API.

Whether this means resetting some RNG seeds or whatever, is up to the particular algorithm. But if set, we should explicitly fail requests to train / transform things that could result in non-deterministic output (threading etc).

I'm not sure whether we should aim at bit-for-bit determinism, or np.allclose kind of determinism.

Note that we definitely don't want this to be a part of unit tests / functional tests though. Determinism is just an engineering hack -- if an algorithm is inherently unstable or non-deterministic, we don't want to silently cover this up by seeding an RNG and pretending it's not!

Hi @gojomo ,
I followed your advice and came up with this code, in which the model's results are repeatable. I post it here for others to enjoy. Not a formal proof but at least a POC :)

Hi @piskvorky ,
From a purely theoretical standpoint, determinism is an engineering hack - but from a business logic viewpoint, a document embedding model should return the same embedding for the same document. Not supporting this would deter people from using gensim for Doc2Vec imho.

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [_.strip() for _ in "the dog is cute. i love the cats".split(".")]
tagged = [TaggedDocument(words=sent.split(), tags=["SENT_%d" % i])
          for i, sent in enumerate(sentences)]

# class gensim.models.doc2vec.Doc2Vec(documents=None, size=300, alpha=0.025, window=8, min_count=5, max_vocab_size=None, sample=0, seed=1, workers=1, min_alpha=0.0001, dm=1, hs=1, negative=0, dbow_words=0, dm_mean=0, dm_concat=0, dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, **kwargs)

model = Doc2Vec(size=10, alpha=0.025, min_alpha=0.025, min_count=1, dm=0)  # use fixed learning rate
model.build_vocab(tagged)
for epoch in range(10):
    model.train(tagged)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

known_words = "love cute cats".split()
unknown_words = "astronaut kisses moon".split()
mixed_words = "the albatross is chicken".split()
for words in (known_words, unknown_words, mixed_words):
    v1 = model.infer_vector(words)
    for i in xrange(100):
        v2 = model.infer_vector(words)
        assert np.all(v1 == v2), "Failed on %s" % (" ".join(words))

Few users use the unit test directly, which is what I was talking about.

@guy4261

Thanks for posting the code. Can you please explain a little more for newbies in doc2vec about how to use the seed trick?

Currently I use the following:

modelloaded=Doc2Vec.load("trained_model")

u = re.sub('[^a-zA-Z]', ' ', d1).lower().split() 
v = re.sub('[^a-zA-Z]', ' ', d2).lower().split() 


u1 = modelloaded.infer_vector(u) 
v1 = modelloaded.infer_vector(v)

Where exactly do I use the seed here?

@gojomo @piskvorky
Also, can the vectors u1 and v1 be used to calculate cosine similarity between two unseen documents u and v, as follows?

dist=scipy.spatial.distance.cosine(u1, v1)

@guy4261 – The snippet you posted just seems to be a test case that will fail currently. Did you mean to include a pointer to other code that does the forced-seeding?

@ajaanbaahu – Yes, that's the hope if the model/inference-parameters are all working well.

@gojomo
This snippet works. Doc2Vec gets its seed in the constructor (that's why I pasted the constructor's default values in the snippet as well). I did not need any additional seeding apart from that. Seeing your comment I thought that maybe vectors won't be stable across executions, but it seems to be working. After running the snippet I posted, I did the following:

>>> model.save("chicken.model")
>>> model = Doc2Vec.load("chicken.model") #would reloading break this?
>>> for words in (known_words, unknown_words, mixed_words):
...     v1 = model.infer_vector(words)
...     for i in xrange(100):
...         v2 = model.infer_vector(words)
...         assert np.all(v1 == v2), "Failed on %s" % (" ".join(words))
... 
>>> # no assertion failed!
>>> words
['the', 'albatross', 'is', 'chicken']
>>> model.infer_vector(words)
array([-0.01258622, -0.03592065,  0.0203425 ,  0.01526347, -0.0373825 ,
        0.03724672, -0.0266174 ,  0.0049285 , -0.02761576,  0.01557967], dtype=float32)
>>> model = Doc2Vec.load("chicken.model")
>>> model.infer_vector(words) # after reload, vector is the same!
array([-0.01258622, -0.03592065,  0.0203425 ,  0.01526347, -0.0373825 ,
        0.03724672, -0.0266174 ,  0.0049285 , -0.02761576,  0.01557967], dtype=float32)
>>> exit() # let's quit the interpreter and try it in a fresh interpreter
guyrap@foomachine:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from gensim.models.doc2vec import Doc2Vec
Using gpu device 0: GeForce GTX 980
>>> model = Doc2Vec.load("chicken.model")
>>> words = ['the', 'albatross', 'is', 'chicken']
>>> model.infer_vector(words)
array([-0.01258622, -0.03592065,  0.0203425 ,  0.01526347, -0.0373825 ,
        0.03724672, -0.0266174 ,  0.0049285 , -0.02761576,  0.01557967], dtype=float32)
>>> # same vector!!!

@ajaanbaahu
No seed trick - the default seed supplied to Doc2Vec is enough. The 'trick' here is using DBOW (dm=0) and Hierarchical Softmax (hs=1, as opposed to using negative sampling) which makes the inference avoid all sorts of random steps you normally can't control.

Regarding your cosine comment – I use this as the base of a document-similarity system with sklearn's KNN model, so I need stable results to keep my outputs logical... So that's my motivation, and this is the solution I came up with.

@guy4261 – but where is the code that makes that snippet succeed?

@gojomo
?
Copying+pasting my snippet into the interpreter, it runs and no assertion fails.
The "success" I am aiming at is getting the same inferred vector for the same words list.
Or maybe I'm not getting your intention?

@guy4261 - Ah, I see, you're simply not using any of the features that introduce randomization. (Negative sampling, frequent-word downsampling, or a mode reliant on a window parameter.) I thought you meant you'd made changes to the code to use a deterministic seed for each infer_vector().

Note that using dm=1, or any negative value over 0, or any effective sample value will result in different results each time.

Got it.
A quick one while we're at it:
Is there a reason why the Doc2Vec constructor may accept both hs=1 and negative=5 (or any negative>0 for that matter) and still work? Or is it worth opening an issue (to reject an illogical combination of these two modes)?

It's following the example of Word2Vec, which was following the example of the original researchers' word2vec.c code, which allowed both modes active at once. In a sense, activating both means there are _two_ NNs, which share the same input-words'-vectors ('projection weights'), but different output layers. They're trained in an interleaved fashion, each micro-example provided to each in turn. Given the extra time and memory it takes, I'm not sure that combination actually wins for any non-small test cases, where negative-alone usually seems best. In other surrounding APIs, I've seen negative=0 as a short-hand to imply hs=1, but haven't considered forcing them to be exclusive in the core classes.

Just want to quickly confirm what @gojomo suggested: By resetting (or simply reseeding) the random state of the model before each inference step, I get the same vector each time (I use DM and negative sampling to train the model). I.e. simply do:

model.random.seed(0)
model.infer_vector(words)

So was the issue solved? Can I get deterministic results from doc2vec infer_vector() now? And how? I am trying to do:

import gensim.models as g 
start_alpha=0.01
infer_epoch=1000
model="\\apnews_dbow\\doc2vec.bin"
m = g.Doc2Vec.load(model)
text='this is a sample text'
vec=m.infer_vector(text,alpha=start_alpha, steps=infer_epoch)

but still, for the same sentence, "This is a sample sentence", I get different vectors every run.

@ayush488 - infer_vector() needs a list-of-tokens, not a string. You'll still get slightly-different vectors on each invocation in many modes, though, for the reasons discussed above. As the comment directly above yours observes, a forced-reseeding should provide deterministic results. Did you try it and it not work?
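For example, the combined advice might be applied like this (a sketch only; the model filename, alpha and steps values are placeholders):

import gensim.models as g

m = g.Doc2Vec.load("doc2vec.bin")

tokens = "this is a sample text".split()  # a list of tokens, not a raw string

m.random.seed(0)  # re-seed immediately before inferring
vec1 = m.infer_vector(tokens, alpha=0.01, steps=1000)

m.random.seed(0)  # ...and again before the next call
vec2 = m.infer_vector(tokens, alpha=0.01, steps=1000)
# vec1 and vec2 should now match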

Yes, I tried forced seeding such as m.random.seed(42) just below where I initialize m. But I was still getting different vectors.

@ayush488 You would need to re-seed it before every call to infer_vector(), as suggested above. Is that what you tried?

Can you send me a code snippet showing how to do this? I just did the seeding once, when I initialized the model.
I am doing something like this:

import gensim.models as g
start_alpha=0.01
infer_epoch=1000
model="\apnews_dbow\doc2vec.bin"
m = g.Doc2Vec.load(model)

m.random.seed(42)

text='this is a sample text'
vec=m.infer_vector(text,alpha=start_alpha, steps=infer_epoch)

Can you tell me on which line I should do the seeding?

@ayush488 - The re-seed() must precede every infer_vector(). Where are you performing a 2nd infer_vector() and seeing different results? (Put another re-seed() there.) You'd need to provide more detailed code showing what you've tried, and what still produces differing results, before we can make any more specific suggestion. (I have also answered your SO question; perhaps the explanation there will also help.)

Hi @gojomo, I was following this thread and tried your solution. Within the same runtime the re-seed does work, but it doesn't if you restart the runtime. Below is an extract of how I implemented it, based on my understanding of the above posts. I hope you might be able to advise.

self.__model = Doc2Vec(size=10,alpha=0.025,min_alpha=0.025,min_count=1,dm=0)
documents = load.get_doc(strFolderPathToDocs)
self.__model.build_vocab(documents)
self.__model.train(documents, total_words=self.__model.corpus_count, epochs=self.__model.iter)
self.__model.save(modelSaveFilenamePath)

After saving the model, I tried loading it and performing inference on the same sentence. Within the same runtime the vectors are the same, but they are not the same in the next runtime.

self.__model=Doc2Vec.load(modelFilenamePath)
self.__model.random.seed(0)
docList=self.__model.docvecs.most_similar([self.__model.infer_vector(doc)], topn=topn) #Deliberate call twice
self.__model.random.seed(0)
docList=self.__model.docvecs.most_similar([self.__model.infer_vector(doc)], topn=topn)

Runtime 1:
: [-0.00795109 -0.07249657 0.04675696 0.02082575 0.09766824]
: [-0.00795109 -0.07249657 0.04675696 0.02082575 0.09766824]

Runtime 2:
: [-0.06148661 0.07009725 -0.00176225 0.0591853 0.02832361]
: [-0.06148661 0.07009725 -0.00176225 0.0591853 0.02832361]

Is this behaviour expected? If yes, how would one typically gauge similarity for unseen documents without getting very different results?

Thank you.

@jax79sg Are you in Python 3.x? If so it's likely related to the new PYTHONHASHSEED option, which (if not explicitly set) randomizes string-hashing between runtime launches, to prevent a certain kind of denial-of-service attack possible against services with predictable string-hashing. You can specify PYTHONHASHSEED env var if you want classic deterministic string-hashing globally in your runtime; if for some reason you still want randomized general string hashing, you could use the hashfxn parameter to Word2Vec/Doc2Vec to supply a stable-across-relaunches hash function just for that model – see https://github.com/RaRe-Technologies/gensim/blob/5f9503cae42002ec2f639fbe2cba8ec6a42b3466/gensim/models/word2vec.py#L367.
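Sketched out, the two approaches might look like this (the md5-based function is only an illustration of a stable hash, not a recommended implementation, and documents is a placeholder corpus):

# Option 1: fix Python's built-in string hashing for the whole process:
#   $ PYTHONHASHSEED=0 python my_script.py

# Option 2: keep randomized built-in hashing, but give this model a
# stable-across-relaunches hash function via the hashfxn parameter.
import hashlib
from gensim.models.doc2vec import Doc2Vec

def stable_hash(astring):
    return int(hashlib.md5(astring.encode("utf-8")).hexdigest(), 16)

model = Doc2Vec(documents, size=100, dm=0, workers=1, hashfxn=stable_hash)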

Hi @gojomo, thank you for your reply. I tried with a Python 2.x interpreter and it's indeed working as expected. I tried both PYTHONHASHSEED and a rudimentary ord-based hash for hashfxn, and it worked on Python 3 as well. From what I can tell, hashfxn is used to hash both the text and the labels. What kind of effect/impact would hashfxn have on the running of the doc2vec algorithm?

Something else on my mind: even with non-deterministic string hashing between runtime launches, we probably shouldn't see such a drastic difference in results on every run. I was thinking it might be due to the unrealistically small dataset I am using, but I noted that many people get very different infer_vector results even with large datasets. In a practical scenario, should infer_vector be used?

Thank you.

@jax79sg The algorithm for training (and thus also inference, which is a constrained form of training) uses randomization – so there will typically be variance from run to run, or inference to inference. (From run to run, there's no essential relationship between where a word/doc winds up between runs – only the vectors within the same run have been co-trained, and are thus meaningfully comparable.)

In inference, if the model is good and sufficient effort is going into inference, each inference should wind up with similar (but not identical) vectors. If they're wildly different, the model may be poor or overfit (from too little data volume/variety or an oversized model), or insufficient inference effort may be occurring (which may be somewhat remediable with a larger steps parameter, especially for small texts).

Generally, users should be tolerant of the 'jitter' between runs – not trying to use forced determinism to pretend the algorithm is more stable in its vector assignments than it really is. But if the vectors are wildly different from run to run – not even usually each others' nearest neighbors – then something else is wrong with the application of the model/inference, because if it's doing anything meaningful with the word occurrences, and thus the underlying human-perceived meanings, such variance should be minimal given adequate data and parameter choices. (The 'sanity check' I usually recommend is re-inferring vectors for texts in the training set, then checking their nearest neighbors – if the same document is the most similar, or among the first few most similar, things are generally working as they should, and further evaluation of the inferred vectors is sensible – they might work really well! But if not, do further data collection or parameter tuning.)
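A sketch of that sanity check (assuming model is already trained and train_docs is the list of TaggedDocument objects it was trained on, each with a single tag):

import collections

ranks = []
for doc in train_docs:
    inferred = model.infer_vector(doc.words)
    sims = model.docvecs.most_similar([inferred], topn=len(model.docvecs))
    # position of the document's own tag among its nearest neighbors
    ranks.append([tag for tag, _sim in sims].index(doc.tags[0]))

print(collections.Counter(ranks))  # ideally most documents rank themselves 0th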

Sorry if this is slightly off topic, but I noticed that while Doc2Vec passed the sanity check for me,

# 27224 documents were most similar with themselves, 150 were 2nd most similar
[27224, 150, 21, 12, 3, 3, 0, 0, 0, 0]

the similarity between a sentence and itself was always in the 0.6-0.7 range. Additionally, the similarity between phrases that were pretty close (their words were similar under word2vec) e.g. 'buy insurance' and 'purchase coverage', were only in the 0.2-0.3 range, and could fluctuate as much as 0.1 between runs. On the other hand, when I did model.random.seed(0) before infer_vector, the similarities shot up to as much as 0.8. Is my model poor? The documents in the corpus are much longer than the phrases that I call infer_vector with.

@wjgan7 Are you calling infer_vector() with lists of tokens? (Strings will be misinterpreted as lists of single-character tokens.) Re-seeding might change the results of infer_vector() slightly, but if the text is meaningful (not just missing/single-char tokens) and sufficient inference is occurring (enough steps), the results should be very similar from run to run with different seeds. (Wild differences mean not much inference is happening, for whatever reason, so something is wrong in the model or choice of params.)

@gojomo Yeah it's a list of tokens, and you're right, it's probably an issue with params. Do you have any general advice for training on several hundred MB of Wikipedia articles (each used as a document in Doc2Vec) in order to determine the semantic relatedness of phrases/sentences?

@wjgan7 That's too generic of a query to offer any salient tips, and this forum/issue is best reserved for discussing the specific bug/feature-request. Open-ended queries for usage advice should go (with more context) to the project discussion list, https://groups.google.com/forum/#!forum/gensim.

I created a model from a corpus of documents (sentences) with doc2vec. So, when I infer a vector for a phrase identical to one contained in the corpus, I would like to get the same embedding. But this does not happen. How can I do this?
In practice, I would like to eliminate the indeterminism both in training and in inference. Is that possible?

@Villuck - read the above comments in this issue for an explanation of the indeterminism & possible workarounds. For initial training to be deterministic, you'd need to limit training to a single thread - which will be much, much slower.

But in general, because these algorithms inherently use randomness, it's often better to have your evaluation processes be tolerant of small jitter/variance between runs. (And if there's large variance, that's likely a hint that you have too little data, or other applicability or metaparameter shortfalls.) Forcing perfect determinism creates a false sense of precision.

Ok, I understand. But is it possible in some way to eliminate the indeterminism, to obtain the same embeddings every time?

@Villuck - It's already explained above for infer_vector(). For training, it requires limiting training to a single thread (workers=1). And it's usually a bad idea. Good luck!
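For completeness, a minimal sketch of such a single-threaded, fixed-seed setup (documents and the parameter values are placeholders; under Python 3.x the PYTHONHASHSEED environment variable must also be fixed, as discussed above):

from gensim.models.doc2vec import Doc2Vec

# workers=1 avoids thread-scheduling nondeterminism; seed fixes the RNG
model = Doc2Vec(documents, size=100, dm=0, seed=42, workers=1)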

It doesn't work. I create the model with this line:
model = gensim.models.doc2vec.Doc2Vec(size=dim_latent_factors, min_count=2, iter=260, dm=0, workers=1)
but the system returns different embeddings for the same doc on every run.

@Villuck - Did you also read the part about PYTHONHASHSEED in Python 3.x, above? In any case, if you have continued problems related to training you should ask (with full code excerpts demonstrating the problem) on the project discussion list, https://groups.google.com/forum/#!forum/gensim, not this issue tracker for a specific unrelated feature request.

@gojomo ok, thank you.

@jrieke Thank you! With random seed 0, I get the same result vectors, as I expected.

I tried setting the seed between infer_vector calls. It gives consistent results within a single program execution, but still differs when you re-run the program from the beginning. Any other means of locking in the random state between runs?

@bhomass - Did you also read the part about PYTHONHASHSEED in Python 3.x, above? And the parts about training non-determinism, if by "re-run the program from the beginning" you mean you're re-training a model?

No, no – miscommunication. I mean re-running the program to call infer_vector using an existing model.

@gojomo
infer_vector was generating 2 different vectors for the same sentence in a single run. After copying your snippet, it generates the same vector within a single run but a new vector in a new run. How do I fix that?

data = ["I do not love machine learning.",
"I love coding in python",
"I love building chatbots",
"they chat amagingly well"]

tagged_data = []
for i, _d in enumerate(data):
    tagged_data.append(TaggedDocument(words=_d.split(), tags=[str(i)]))

max_epochs = 100
vec_size = 10
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.025,
                min_count=1,
                dm=0)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")

@samisaga95 If you have a question, better to ask in the forum, https://groups.google.com/forum/#!forum/gensim. (I don't know what the relevance of your code snippet is to this feature-request.)

I was also suffering from the same problem of varying embeddings but this post on stack overflow solved it.
https://stackoverflow.com/questions/44443675/removing-randomization-of-vector-initialization-for-doc2vec

@ShrikanthSingh But as noted in my answer there, and here, and in the linked FAQ, it's generally a bad idea to rely on such deterministic-seeding for stable results.

If the "jitter" from run-to-run isn't something tolerated by your process, you likely have other problems in the model/parameters/data/process – and fixing that via deterministic seeding covers up the real problem. So I don't recommend such re-seeding.

