fastText: Describe how sentence vectors are generated

Created on 6 Sep 2017 · 4 Comments · Source: facebookresearch/fastText

It would be great to have a better understanding of how the sentence vectors are generated. Superficially, there are similarities to Sent2Vec (https://github.com/epfml/sent2vec / paper) -- is that algorithm being used?

All 4 comments

Upon looking at the codebase a bit more, it seems like vectors are currently generated via simple averaging of word vectors?
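
To make "simple averaging" concrete, here is a minimal sketch in plain Python. It is not fastText's code: the `word_vectors` table, the function names, and the toy 2-dimensional vectors are all made up for illustration, and scaling each word vector to unit length before averaging is an assumption of this sketch that should be checked against the fastText source.

```python
import math


def l2_norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def average_sentence_vector(sentence, word_vectors, dim):
    """Average the unit-normalized vectors of the known words in a sentence."""
    total = [0.0] * dim
    count = 0
    for word in sentence.split():
        vec = word_vectors.get(word)
        if vec is None:
            continue  # toy lookup: out-of-vocabulary words are skipped
        norm = l2_norm(vec)
        if norm == 0.0:
            continue  # a zero vector carries no direction, so ignore it
        for i, x in enumerate(vec):
            total[i] += x / norm
        count += 1
    if count == 0:
        return total  # no usable words: return the zero vector
    return [x / count for x in total]


# Hypothetical word vectors, for illustration only.
word_vectors = {"hello": [3.0, 4.0], "world": [0.0, 2.0]}
sentence_vec = average_sentence_vector("hello big world", word_vectors, dim=2)
```

Here "big" has no vector and is skipped, so the result is the mean of the two unit-length vectors for "hello" and "world". A real fastText model would instead build a vector for unseen words from character n-grams, which this toy lookup does not attempt.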

@matthias-samwald Have you tried using sentence embeddings generated by sent2vec in any task, or done any kind of comparison with fastText embeddings?
Thanks!

@spate141 No, I have not done anything like that yet, but I probably will in the future (based on biomedical text).
In the meantime I have read up a little more and found that my original question was somewhat misguided. sent2vec does not (at least primarily) have any special functionality for deriving sentence vectors from existing word vector embeddings. Rather, it uses a different training objective already at the stage of training the word vectors.
In contrast, TF-IDF weighting of word vectors, or Smooth Inverse Frequency (SIF) weighting (https://github.com/PrincetonML/SIF), could be applied post hoc to normally trained word vectors, but both require word frequency information.
It seems like the superiority of these sentence embeddings over simple averaging or max-pooling of word vectors is a robust finding across several evaluation sets. Perhaps this could be a potential new feature for fastText?
I think I will start out working with sent2vec, since it is specialized for sentence embeddings and also works with word n-grams (while it might be extra work to apply SIF weighting to word n-grams).
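
As a rough illustration of why the post-hoc schemes need word frequency information, here is a small sketch of the SIF weighting step, in which each word gets the weight a / (a + p(w)) and p(w) is its relative frequency in some corpus. It covers only the weighted average; the second step of SIF, removing the projection onto the first principal component across sentences, is left out. The function names, the toy corpus, and the toy vectors are hypothetical, and `a=1e-3` is just a commonly used order of magnitude.

```python
from collections import Counter


def sif_weights(corpus_tokens, a=1e-3):
    """Map each word to a / (a + p(w)), with p(w) its relative corpus frequency."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {word: a / (a + count / total) for word, count in counts.items()}


def weighted_sentence_vector(tokens, word_vectors, weights, dim):
    """Weighted average over the tokens that have both a vector and a weight."""
    total = [0.0] * dim
    used = 0
    for tok in tokens:
        if tok not in word_vectors or tok not in weights:
            continue  # skip tokens we cannot weight or embed
        w = weights[tok]
        for i, x in enumerate(word_vectors[tok]):
            total[i] += w * x
        used += 1
    if used == 0:
        return total  # nothing usable: return the zero vector
    return [x / used for x in total]


# Toy corpus and toy 2-dimensional vectors, for illustration only.
corpus_tokens = "the cat sat on the mat the end".split()
weights = sif_weights(corpus_tokens)
word_vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
sentence_vec = weighted_sentence_vector(["the", "cat"], word_vectors, weights, dim=2)
```

Because "the" is more frequent than "cat" in the toy corpus, it receives a smaller weight and contributes less to the sentence vector. TF-IDF weighting would follow the same pattern with a different weight function.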

Hello @matthias-samwald,

Please see issue https://github.com/facebookresearch/fastText/issues/323, which discusses how we calculate sentence embeddings. It appears that you have already figured this out yourself; I just want to make sure it is referenced here. Implementing some of Sent2Vec's features could indeed end up on fastText's roadmap, if deemed relevant, so stay tuned. I'm closing this issue now, but please feel free to reopen it at any point if you don't consider it resolved.

Thanks,
Christian
