Gensim: potential Doc2Vec feature: reverse inference, to synthesize doc/summary words

Created on 21 Apr 2019 · 4 comments · Source: RaRe-Technologies/gensim

Motivated by the SO question: https://stackoverflow.com/questions/55768598/interpret-the-doc2vec-vectors-clusters-representation/55779049#55779049

Doc2Vec could plausibly have a function that's reverse-inference: take a doc-vector, return a (ranked) list of words most-predicted by that input vector. It'd work highly analogously to Word2Vec.predict_output_word(). Such a list of words might be useful as a sort-of summary or label for a doc-vector.
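To make the proposal concrete, here is a minimal sketch of such reverse inference in pure NumPy. The arrays are made-up stand-ins: `output_weights` plays the role of a trained model's negative-sampling output layer (what gensim stores as `model.syn1neg`), and `vocab` plays the role of the vocabulary ordering; nothing here is an actual gensim API.

```python
import numpy as np

# Hypothetical stand-ins for a trained Doc2Vec model: in a real
# negative-sampling model the output-layer weights would come from
# `model.syn1neg` (one row per vocabulary word) and the word order
# from the model's vocabulary. Here both are faked with random data.
rng = np.random.default_rng(0)
vocab = ["apple", "banana", "cherry", "date", "elderberry"]
output_weights = rng.normal(size=(len(vocab), 8))  # stands in for syn1neg
doc_vector = rng.normal(size=8)                    # an inferred doc-vector

def predict_output_words(doc_vector, output_weights, vocab, topn=3):
    """Reverse inference: rank vocabulary words by how strongly the
    output layer predicts them from the doc-vector (softmax over the
    logits), mirroring the logic of Word2Vec.predict_output_word()."""
    logits = output_weights @ doc_vector
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    top = np.argsort(-probs)[:topn]
    return [(vocab[i], float(probs[i])) for i in top]

print(predict_output_words(doc_vector, output_weights, vocab))
```

The returned `(word, probability)` pairs, sorted by probability, would serve as the "sort-of summary" words for the doc-vector.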

Labels: Hacktoberfest, difficulty medium, feature, good first issue, wishlist

All 4 comments

Hi @gojomo, is this still open? Can I take it up?

@saraswatmks Yes, and further, you don't have to ask permission: a good PR submission will be reviewed/welcomed even without having declared your interest first. It's more important to "show working code" than "declare interest".

@gojomo Just to be sure, I have multiple implementations for this in mind:
Option 1:

  1. Take a doc-vector as input (with a sanity check on the vector length).
  2. Return a list of (word, score) tuples for the words most similar to the doc-vector.
  3. One thing to note: candidate words are drawn from the whole vocabulary.

Option 2:

  1. Instead of an input vector, ask for a doc-id.
  2. Get the vector for that doc-id.
  3. Filter to the words that occurred in that document and get their word vectors.
  4. Do the dot product only on those filtered word vectors. This way we ensure that we return only words which actually occurred in that doc-id.

What do you think? Which one should we go with? Or if there's another way this can be done, please let me know.

Re: Option 1 – It's not "most similar" words which are needed here. Rather, it's "most predicted". The logic and behavior should be highly analogous to Word2Vec.predict_output_word(), except taking a single vector rather than a list-of-words.

(Option 2 is impossible, as the model includes no record of the words which occurred in that doc-id.)
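The distinction between "most similar" and "most predicted" can be shown with a small made-up example. The arrays below are hypothetical stand-ins: `input_vectors` plays the role of the model's input word vectors (gensim's `model.wv.vectors`) and `output_weights` the negative-sampling output layer (`model.syn1neg`); in general the two rankings differ because they use different weight matrices and different scoring rules.

```python
import numpy as np

# Hypothetical stand-ins for a trained model's two weight matrices.
rng = np.random.default_rng(1)
vocab = ["red", "green", "blue", "cyan"]
input_vectors = rng.normal(size=(4, 6))    # stands in for model.wv.vectors
output_weights = rng.normal(size=(4, 6))   # stands in for model.syn1neg
doc_vec = rng.normal(size=6)

def most_similar(vec, mat, vocab, topn=2):
    """'Most similar': cosine similarity against the *input* word vectors."""
    sims = (mat @ vec) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(vec))
    return [vocab[i] for i in np.argsort(-sims)[:topn]]

def most_predicted(vec, mat, vocab, topn=2):
    """'Most predicted': rank by the *output-layer* logits, as
    predict_output_word()-style reverse inference would."""
    logits = mat @ vec
    return [vocab[i] for i in np.argsort(-logits)[:topn]]

print(most_similar(doc_vec, input_vectors, vocab))
print(most_predicted(doc_vec, output_weights, vocab))
```

The feature requested in this issue is the second kind of ranking, applied to a doc-vector.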

