Gensim: CBOW model equivalent to the supervised learning model of fastText

Created on 19 Oct 2016 · 8 comments · Source: RaRe-Technologies/gensim

fastText is an "evolution" of word2vec: it contains new models for word embeddings, as well as models for learning the association document -> label, i.e. classification.

The latter can be implemented by reusing the Word2Vec class (CBOW only), defining an input layer of words and an output layer of labels, each with its own vocabulary. The concept of windows in training can be dropped. For the output computation, it is possible to implement the full softmax alongside the already present "approximations" (negative sampling and the Huffman tree).
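To make the architecture concrete, here is a minimal NumPy sketch of the idea (all names and sizes are illustrative assumptions, not the actual gensim or ShallowLearn code): a document's word vectors are averaged as in CBOW, and a full softmax over the label vocabulary replaces the usual word prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies: input words and output labels are independent.
word_vocab = {"gensim": 0, "fasttext": 1, "vectors": 2, "models": 3}
label_vocab = {"nlp": 0, "ml": 1}
dim = 16

# Input embeddings for words, output embeddings for labels: the usual
# CBOW layout, with the output vocabulary replaced by the label set.
W_in = rng.normal(scale=0.1, size=(len(word_vocab), dim))
W_out = rng.normal(scale=0.1, size=(len(label_vocab), dim))

def predict(words):
    # No window: the "context" is the whole document's bag of words.
    hidden = W_in[[word_vocab[w] for w in words]].mean(axis=0)
    scores = W_out @ hidden
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

print(predict(["gensim", "fasttext", "vectors"]))  # P(label | document)
```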

A LabeledWord2Vec class is already implemented at https://github.com/giacbrd/ShallowLearn/blob/master/shallowlearn/word2vec.py ; it should be ported and improved (it lacks negative sampling and some method implementations).

difficulty medium · feature

All 8 comments

For completeness, linking to the FastText wrapper: https://github.com/RaRe-Technologies/gensim/pull/847

I am still developing the code for this PR in https://github.com/giacbrd/ShallowLearn; I hope to start working on the fork ASAP.

@giacbrd any progress on the PR? Cheers.

I am going to release a more stable model in my project before Christmas; then I can port it to Gensim, which should be "easy"!
Cheers

I am finalizing the pull request. I am just thinking about how to design a better interface, but in general there is not much code.
Sorry, but my spare time in the last two months has been minimal.
Cheers

From a Gensim integration point of view, an API extending the existing FastText wrapper API would be preferable. The FastText wrapper API is not yet released, though, so it can still be changed.

Here I am doing something slightly different from re-implementing fastText. I have essentially written a variant of the Word2Vec model, with the goal of learning the mapping sets_of_words -> labels (i.e. text classification), where:

  • The output layer and its vocabulary are independent of the input layer
  • It is limited to CBOW
  • Given that the output layer usually has a pre-defined size (e.g. the labels in a text classification scenario), it is feasible to directly compute the softmax instead of its "approximations" (negative sampling and the Huffman tree). This gives three loss methods in total (see the sketch after this list)
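A hedged sketch of the last point (the function and variable names are my assumptions, not the PR's API): with a label vocabulary of only tens of entries, one exact softmax update costs O(n_labels × dim), so the sampling-based approximations become optional rather than necessary.

```python
import numpy as np

def full_softmax_update(hidden, label_idx, W_out, lr=0.05):
    """One exact cross-entropy update over a small label set: raise the
    score of the true label, lower all the others, no sampling needed."""
    scores = W_out @ hidden
    exp = np.exp(scores - scores.max())
    probs = exp / exp.sum()
    grad = probs.copy()
    grad[label_idx] -= 1.0                 # dL/dscores for cross-entropy
    W_out -= lr * np.outer(grad, hidden)   # update the label vectors
    # (the symmetric update of the input word vectors is omitted here)
    return -np.log(probs[label_idx])       # loss value, for monitoring

# Usage: 2 labels, 16-dimensional vectors, one document's averaged vector.
rng = np.random.default_rng(0)
W_out = rng.normal(scale=0.1, size=(2, 16))
loss = full_softmax_update(rng.normal(size=16), label_idx=0, W_out=W_out)
```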

The wrapper, for now, covers only word embedding applications, but yes, it could be extended with the "supervised learning" component of fastText.

See the pull request #1153
