Gensim: potential Doc2Vec feature: reverse inference, to synthesize doc/summary words

Created on 21 Apr 2019 · 4 comments · Source: RaRe-Technologies/gensim

Motivated by the SO question: https://stackoverflow.com/questions/55768598/interpret-the-doc2vec-vectors-clusters-representation/55779049#55779049

Doc2Vec could plausibly have a function that's reverse-inference: take a doc-vector, return a (ranked) list of words most-predicted by that input vector. It'd work highly analogously to Word2Vec.predict_output_word(). Such a list of words might be useful as a sort-of summary or label for a doc-vector.
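To make the proposal concrete, here is a minimal sketch of such reverse inference in pure NumPy. The arrays are made-up stand-ins: `output_weights` plays the role of a trained model's negative-sampling output layer (what gensim stores as `model.syn1neg`), and `vocab` plays the role of the vocabulary ordering; nothing here is an actual gensim API.

```python
import numpy as np

# Hypothetical stand-ins for a trained Doc2Vec model: in a real
# negative-sampling model the output-layer weights would come from
# `model.syn1neg` (one row per vocabulary word) and the word order
# from the model's vocabulary. Here both are faked with random data.
rng = np.random.default_rng(0)
vocab = ["apple", "banana", "cherry", "date", "elderberry"]
output_weights = rng.normal(size=(len(vocab), 8))  # stands in for syn1neg
doc_vector = rng.normal(size=8)                    # an inferred doc-vector

def predict_output_words(doc_vector, output_weights, vocab, topn=3):
    """Reverse inference: rank vocabulary words by how strongly the
    output layer predicts them from the doc-vector (softmax over the
    logits), mirroring the logic of Word2Vec.predict_output_word()."""
    logits = output_weights @ doc_vector
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    top = np.argsort(-probs)[:topn]
    return [(vocab[i], float(probs[i])) for i in top]

print(predict_output_words(doc_vector, output_weights, vocab))
```

The returned `(word, probability)` pairs, sorted by probability, would serve as the "sort-of summary" words for the doc-vector.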

Labels: Hacktoberfest, difficulty medium, feature, good first issue, wishlist

All 4 comments

Hi @gojomo, is this still open? Can I take it up?

@saraswatmks Yes, and further, you don't have to ask permission: a good PR submission will be reviewed/welcomed even without having declared your interest first. It's more important to "show working code" than "declare interest".

@gojomo Just to be sure, I have multiple implementations for this in mind:
Option 1:

  1. Take a doc-vector as input (with a sanity check on the vector length).
  2. Return a list of (word, score) tuples for the words most similar to the doc-vector.
  3. One thing to note: candidate words are drawn from the whole vocabulary.

Option 2:

  1. Instead of an input vector, ask for a doc-id.
  2. Get the vector for that doc-id.
  3. Filter to the words that occurred in that document and get their word vectors.
  4. Do the dot product only on those filtered word vectors. This way we ensure that we return only words which actually occurred in that doc-id.

What do you think? Which one should we go with? Or if there's another way this can be done, please let me know.

Re: Option 1 – It's not "most similar" words which are needed here. Rather, it's "most predicted". The logic and behavior should be highly analogous to Word2Vec.predict_output_word(), except taking a single vector rather than a list-of-words.

(Option 2 is impossible, as the model includes no record of the words which occurred in that doc-id.)
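The distinction between "most similar" and "most predicted" can be shown with a small made-up example. The arrays below are hypothetical stand-ins: `input_vectors` plays the role of the model's input word vectors (gensim's `model.wv.vectors`) and `output_weights` the negative-sampling output layer (`model.syn1neg`); in general the two rankings differ because they use different weight matrices and different scoring rules.

```python
import numpy as np

# Hypothetical stand-ins for a trained model's two weight matrices.
rng = np.random.default_rng(1)
vocab = ["red", "green", "blue", "cyan"]
input_vectors = rng.normal(size=(4, 6))    # stands in for model.wv.vectors
output_weights = rng.normal(size=(4, 6))   # stands in for model.syn1neg
doc_vec = rng.normal(size=6)

def most_similar(vec, mat, vocab, topn=2):
    """'Most similar': cosine similarity against the *input* word vectors."""
    sims = (mat @ vec) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(vec))
    return [vocab[i] for i in np.argsort(-sims)[:topn]]

def most_predicted(vec, mat, vocab, topn=2):
    """'Most predicted': rank by the *output-layer* logits, as
    predict_output_word()-style reverse inference would."""
    logits = mat @ vec
    return [vocab[i] for i in np.argsort(-logits)[:topn]]

print(most_similar(doc_vec, input_vectors, vocab))
print(most_predicted(doc_vec, output_weights, vocab))
```

The feature requested in this issue is the second kind of ranking, applied to a doc-vector.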

