When I run the following code, I get a TypeError.
import cPickle as pickle
import re
with open('titles.pkl') as t:
    titles = pickle.load(t)
import gensim
from gensim.models.doc2vec import TaggedDocument, LabeledSentence, Doc2Vec
sentences = list(LabeledSentence(re.sub('[^a-zA-Z]', ' ', key).lower().split(), ("ID_" + str(value),)) for key, value in titles)
model = Doc2Vec(size=200, min_count=1, workers=16)
model.build_vocab(sentences)
model.alpha = 0.025
for epoch in range(20):
    model.train(sentences)
    model.alpha -= 0.001
    model.min_alpha = model.alpha
# this part of the code is in the documents for docvecs
docvec = model.docvecs[99]
sims = model.docvecs.most_similar(docvec)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/gensim_title_recommender/venv/local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 428, in most_similar
for doc, weight in positive + negative:
TypeError: 'numpy.float32' object is not iterable
That sample should work, no? titles is an array of arrays where each entry holds a document title and ID as `key, value`. My goal here is to be able to provide a new title and get back one of the original IDs, i.e., use what I understand are the similarity functions in doc2vec. Other calls to the model work:
sentence = u'This is my super cool title'
doc = re.sub('[^a-zA-Z]', ' ', sentence).lower().split()
model.most_similar(positive=doc)
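Pulled out as a helper, the cleanup both snippets apply to raw titles is just this (pure stdlib):

```python
import re

def tokenize_title(title):
    # Keep letters only, lowercase, split on whitespace --
    # the same cleanup used on each title at training time.
    return re.sub('[^a-zA-Z]', ' ', title).lower().split()

print(tokenize_title(u'This is my super cool title'))
# ['this', 'is', 'my', 'super', 'cool', 'title']
```

Note that apostrophes become spaces, so possessives split into separate tokens (e.g. "Kiermaier's" becomes 'kiermaier' and 's').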
So I believe I've built the model correctly, and that this is a bug.
In case it's relevant, this is on Ubuntu 14.04.
Requirements.txt:
behave>=1.2.4
boto>=2.32.0
pep8
python-dateutil
gevent
grequests
psycopg2
cython
scipy==0.15.1
gensim
Provisioning script:
#!/bin/bash
add-apt-repository -y ppa:rwky/redis
sudo DEBIAN_FRONTEND=noninteractive apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" update
sudo DEBIAN_FRONTEND=noninteractive apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" upgrade
apt-get install -y redis-server ntp python-software-properties software-properties-common python-virtualenv
apt-get install -y python-dev python-pip git gfortran libopenblas-dev liblapack-dev postgresql postgresql-contrib libpq-dev
pip install virtualenv
Try supplying the 'positive' example vector inside a list – [docvec]. (The code should, but doesn't yet, properly recognize a single ndarray the way it does a single int index or string tag.)
Ah, that got it!
>>> sims = model.docvecs.most_similar([docvec])
>>> sims
[('VID_29538736', 1.0), ('VID_29543021', 0.613202691078186), ('VID_29543020', 0.5868339538574219), ('VID_29538735', 0.5410645008087158), ('VID_29601774', 0.5222368240356445), ('VID_29464301', 0.5215977430343628), ('VID_29491783', 0.5171146392822266), ('VID_29463134', 0.5082449913024902), ('VID_29473381', 0.5064586400985718), ('VID_29616925', 0.5051187872886658)]
I guess those docs are a bit out of date is all.
So how would I query this model for a new document to try to get a similar doc by ID? Ideally, I'd do something like the query I had originally posted:
>>> sentence = u'Students get back in the classroom'
>>> doc = re.sub('[^a-zA-Z]', ' ', sentence).lower().split()
>>> model.docvecs.most_similar(positive=doc)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/gensim_title_recommender/venv/local/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 435, in most_similar
raise KeyError("doc '%s' not in trained set" % doc)
KeyError: u"doc 'everything' not in trained set"
I tried to follow the example for LSI, but that seems incorrect for doc2vec (i.e., the whole point is not to use the single-word corpus, but to look at documents as a whole). Ideally, I'd calculate a vector for that sentence, and then query for the most similar vector among the docvecs in the model (or get an error, if it has words the model has never seen before). I could do something like infer_vector, or calculate the vectors of individual words and then dot-product them all together, etc., but that seems wrong. The sentence above does give me some pretty interesting results from my model, but ultimately I want one of those VID strings I used in training.
>>> sent = u'Students get back in the classroom'
>>> doc = re.sub('[^a-zA-Z]', ' ', sent).lower().split()
>>> model.most_similar(positive=doc)
[(u'teachers', 0.44251883029937744), (u'class', 0.4367973506450653), (u'dorm', 0.40028077363967896), (u'kids', 0.3963126540184021), (u'apply', 0.39425891637802124), (u'freshmen', 0.3912411034107208), (u'to', 0.38409990072250366), (u'healthy', 0.38326704502105713), (u'helping', 0.3827580213546753), (u'start', 0.37929749488830566)]
Yes, you would use infer_vector() to evolve a reasonable model-compatible vector for the new document's tokens, and then supply that raw vector as the positive example. So roughly like:
new_doc_vec = model.infer_vector(doc)
model.docvecs.most_similar([new_doc_vec])
A good sanity check is to re-infer a vector for a document already in the model. The nearest doc will (usually) be the same document, if the model is well-trained and the documents are meaningfully distinguishable. ('Usually' because there's effectively some randomness in the process.)
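Under the hood, most_similar() is just ranking the stored doc vectors by cosine similarity against the query vector. A minimal pure-Python sketch of that ranking (the names and toy vectors here are illustrative, not gensim's internals):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar(query_vec, docvecs, topn=3):
    # Rank every stored doc vector by similarity to the query.
    sims = [(tag, cosine(query_vec, vec)) for tag, vec in docvecs.items()]
    return sorted(sims, key=lambda kv: -kv[1])[:topn]

docvecs = {
    'VID_1': [1.0, 0.0],
    'VID_2': [0.6, 0.8],
    'VID_3': [0.0, 1.0],
}
print(most_similar([1.0, 0.1], docvecs))
# [('VID_1', 0.995...), ('VID_2', 0.676...), ('VID_3', 0.099...)]
```

This is why the sanity check works: a well-re-inferred vector points almost exactly the same direction as the bulk-trained vector for the same document, so that document tops the ranking.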
Any words not part of the model's vocabulary – whether because they didn't make a min_count cutoff, or are all-new in a new inferred text – are simply ignored, as if they weren't part of the provided example text.
As you note, the 'paragraph vectors' training methods don't primarily compose a doc vector from word vectors. (Sometimes, such a straightforward sum or average of word vectors is used as a baseline against which to compare doc vectors from other methods.) Rather, the doc vector is induced based on its ability to predict words – sometimes even without any contribution from other word vectors (in plain DBOW mode), or mixed with word vectors that are not a precursor input but co-trained at the same time (the DM modes, or DBOW with interleaved dbow_words=1 training). Definitely try DBOW (dm=0) mode; it can be faster/better for many uses. (Don't take the Doc2Vec class defaults as best-practice recommendations; they're somewhat arbitrary, and in any case different datasets/end-purposes seem to benefit from different modes.)
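For comparison, the crude sum-or-average-of-word-vectors baseline mentioned above is trivial to sketch in pure Python (toy 2-d vectors here; real word vectors would come from a trained model, and this is explicitly *not* what Doc2Vec training does):

```python
def average_vector(words, word_vecs):
    # Average the vectors of in-vocabulary words; out-of-vocabulary
    # words are ignored, mirroring how unknown words are dropped.
    known = [word_vecs[w] for w in words if w in word_vecs]
    if not known:
        raise ValueError('no in-vocabulary words in input')
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

word_vecs = {'students': [1.0, 0.0], 'classroom': [0.0, 1.0]}
print(average_vector(['students', 'get', 'classroom'], word_vecs))
# [0.5, 0.5] -- 'get' is out-of-vocabulary and ignored
```

A doc vector produced by infer_vector() will generally behave differently from this average, because it's optimized to predict the document's words rather than composed from them.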
That's definitely working, thanks!
Would it make sense to mess around with the infer_vector function? I would expect that a training example that was fed into the model would return a high score (i.e., > .9), but I'm not seeing that. For instance, I see:
>>> titles[100]
[u"Brad Boxberger on Kevin Kiermaier's play", 29484932]
>>> sent = u"Brad Boxberger on Kevin Kiermaier's play"
>>> doc = re.sub('[^a-zA-Z]', ' ', sent).lower().split()
>>> inf_vec = model.infer_vector(doc)
>>> model.docvecs.most_similar([inf_vec])
[('VID_29619465', 0.7008367776870728), ('VID_29619425', 0.673180878162384), ('VID_29484932', 0.6729332208633423), ('VID_29583883', 0.6665624380111694), ('VID_29583877', 0.6541317701339722), ('VID_29619468', 0.6540207862854004), ('VID_29583870', 0.6502735614776611), ('VID_29585875', 0.6245695352554321), ('VID_29588914', 0.6218901872634888), ('VID_29578095', 0.6111888885498047)]
>>> for title in titles:
...     if title[1] == 29619465:
...         print title
...
[u'Play of the Week: Play 2', 29619465]
So it's not an entirely unreasonable retrieval (it got that it's a play, and a sports play), but the score is pretty low, and the right doc is the third in the list.
Do you think this is just a matter of tweaking model construction, or could something else be at play here? That's why I'm thinking of looking at infer_vector, especially since I'm not limiting terms during model construction, i.e., min_count=1.
Some thoughts:
- Tuning infer_vector's optional arguments might make the inference more consistent with the bulk-trained vectors (more likely to find a vector closest to the bulk vector for the same input).
- In plain DBOW mode, without the dbow_words=1 parameter, word vectors aren't trained at all (they'll be their randomized initial values at the end of training).

This was more than enough information to get me started, thanks!