Gensim: CBOW model equivalent to the supervised learning model of fastText

Created on 19 Oct 2016 · 8 comments · Source: RaRe-Technologies/gensim

fastText is an "evolution" of word2vec: it contains new models for word embeddings, as well as models for learning the association document -> label, i.e. classification.

The latter can be implemented by reusing the Word2Vec class (CBOW only), defining an input layer of words and an output layer of labels, each with its own vocabulary. The concept of windows in training can be dropped. For the output computation, it is possible to implement the full softmax alongside the already present "approximations" (negative sampling and the Huffman tree).
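To make the architecture concrete, here is a minimal NumPy sketch of the idea (all names and sizes are illustrative assumptions, not the actual gensim or ShallowLearn code): a document's word vectors are averaged as in CBOW, and a full softmax over the label vocabulary replaces the usual word prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies: input words and output labels are independent.
word_vocab = {"gensim": 0, "fasttext": 1, "vectors": 2, "models": 3}
label_vocab = {"nlp": 0, "ml": 1}
dim = 16

# Input embeddings for words, output embeddings for labels: the usual
# CBOW layout, with the output vocabulary replaced by the label set.
W_in = rng.normal(scale=0.1, size=(len(word_vocab), dim))
W_out = rng.normal(scale=0.1, size=(len(label_vocab), dim))

def predict(words):
    # No window: the "context" is the whole document's bag of words.
    hidden = W_in[[word_vocab[w] for w in words]].mean(axis=0)
    scores = W_out @ hidden
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

print(predict(["gensim", "fasttext", "vectors"]))  # P(label | document)
```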

A LabeledWord2Vec class is already implemented at https://github.com/giacbrd/ShallowLearn/blob/master/shallowlearn/word2vec.py ; it should be ported and improved (it lacks negative sampling and some method implementations).

difficulty medium · feature

All 8 comments

For completeness, linking to the FastText wrapper: https://github.com/RaRe-Technologies/gensim/pull/847

I am still developing the code for this PR in https://github.com/giacbrd/ShallowLearn; I hope to start working on the fork ASAP.

@giacbrd any progress on the PR? Cheers.

I am going to release a more stable model in my project before Christmas; then I can port it to Gensim, which should be "easy"!
Cheers

I am finalizing the pull request. I am just thinking about how to design a better interface, but in general there is not much code.
Sorry, but my spare time in the last two months has been minimal.
Cheers

From a Gensim integration point of view, an API extending the existing FastText wrapper API would be preferable. The FastText wrapper API is not yet released, though, so it can still be changed.

Here I am doing something slightly different from re-implementing fastText. I have essentially written a variant of the Word2Vec model, with the goal of learning the mapping sets_of_words -> labels (i.e. text classification), where:

  • The output layer and its vocabulary are independent of the input layer
  • It is limited to CBOW
  • Given that the output layer usually has a pre-defined size (e.g. the labels in a text classification scenario), it is feasible to directly compute the softmax instead of its "approximations" (negative sampling and the Huffman tree). This gives three loss methods in total (see the sketch after this list)
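A hedged sketch of the last point (the function and variable names are my assumptions, not the PR's API): with a label vocabulary of only tens of entries, one exact softmax update costs O(n_labels × dim), so the sampling-based approximations become optional rather than necessary.

```python
import numpy as np

def full_softmax_update(hidden, label_idx, W_out, lr=0.05):
    """One exact cross-entropy update over a small label set: raise the
    score of the true label, lower all the others, no sampling needed."""
    scores = W_out @ hidden
    exp = np.exp(scores - scores.max())
    probs = exp / exp.sum()
    grad = probs.copy()
    grad[label_idx] -= 1.0                 # dL/dscores for cross-entropy
    W_out -= lr * np.outer(grad, hidden)   # update the label vectors
    # (the symmetric update of the input word vectors is omitted here)
    return -np.log(probs[label_idx])       # loss value, for monitoring

# Usage: 2 labels, 16-dimensional vectors, one document's averaged vector.
rng = np.random.default_rng(0)
W_out = rng.normal(scale=0.1, size=(2, 16))
loss = full_softmax_update(rng.normal(size=16), label_idx=0, W_out=W_out)
```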

The wrapper, for now, covers only word embedding applications, but yes, it could be extended with the "supervised learning" component of fastText.

See the pull request #1153
