Currently, gensim has a wrapper for fastText. As discussed here, we need to implement the training code (subword n-grams, hashing trick) for unsupervised fastText in gensim, in Python. As fastText is only a slight modification of word2vec, we will need to refactor the word2vec code to properly reuse the overlapping code.
However, the original fastText outputs two files, .vec and .bin, following the C++ implementation's conventions. Should the Python implementation in gensim provide pkl format output?
This thread is intended to discuss and streamline all the requirements and deliverables regarding native fastText in gensim.
Hi @prakhar2b Let's simply use the pickle-style format from utils.SaveLoad we already use for word2vec models to persist the fastText models to disk.
Writing out the models in .bin format is a useful feature, but it comes later.
Also, as discussed, please look at the word2vec code in detail, figure out what is needed for fastText, formulate a clear plan of action and post it here. It should contain details about:

- the class design: should FastText subclass Word2Vec, should we create a common base class for Word2Vec and FastText, or should we use composition?

IMO, this design process is just as challenging and important as writing the code itself, and it would be good if you spent a good amount of time coming up with a clear plan.
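To make the design options above concrete, the common-base-class variant might look roughly like this. All names here (`BaseWordEmbeddingsModel`, `_train_sentence`) are hypothetical illustrations, not a committed gensim design:

```python
# Sketch of the common-base-class option; names are hypothetical,
# not a committed gensim design.

class BaseWordEmbeddingsModel:
    """Holds the shared training loop, vocabulary handling, etc."""

    def __init__(self, size=100, window=5):
        self.size = size
        self.window = window

    def train(self, sentences):
        # the shared outer loop lives in the base class
        for sentence in sentences:
            self._train_sentence(sentence)

    def _train_sentence(self, sentence):
        # each subclass supplies its own update step
        raise NotImplementedError


class Word2Vec(BaseWordEmbeddingsModel):
    def _train_sentence(self, sentence):
        pass  # plain word-vector updates


class FastText(BaseWordEmbeddingsModel):
    def _train_sentence(self, sentence):
        pass  # word-vector updates plus subword n-gram updates
```

The alternative, composition, would instead give `FastText` a `Word2Vec`-like component it delegates to; the trade-off is inheritance's tighter coupling versus composition's extra indirection.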
Awesome feature! Let me add that having FastText in gensim will open up other unsupervised possibilities, such as sent2vec in #1376.
Even though the gensim mission is more 'unsupervised', the addition of known-labels in the FastText-for-classification mode is such a small delta I would suggest it be in-scope. It's really just adding in another kind of known-data, during training, as possible 'target' outputs of the internal NN, that may make the resulting vectors better. (Potentially, even if not then using the resulting word-vecs for the exact same classification problem, the inclusion of these extra targets during training may have made the word-vecs better for other tasks.)
Also, it will otherwise be a constant exception-to-be-mentioned, in docs/support: "Yes gensim implements FastText except not FastText mode X".
Is this issue still open?
@dsouzadaniel yes, this is part of an ongoing Google Summer of Code project.
I looked further into the fasttext and word2vec code, and this is how I plan to approach it:
**1. Class structure / code reuse**

As fastText is a slight modification of word2vec, we will mostly reuse the word2vec training code, with very slight modifications. So I think we should create two modules: one holding the common overlapping code moved out of word2vec (it's better to decouple word2vec and fasttext as much as possible, IMO), and a second, fasttext.py, for fastText-exclusive code like subword n-grams and the hashing trick.
The training code in fasttext.cc / model.cc is very similar to the code in word2vec.py, such as the functions train_batch_cbow and train_batch_sg, and the functions for sampling. The n-gram code from dictionary.cc / matrix.cc needs to be written in Python in fasttext.py.
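For reference, the subword n-gram extraction and the hashing trick can be sketched in plain Python as below. This is an illustrative sketch, not gensim's final code: the function names are made up, and the hash mirrors the FNV-1a variant used by the C++ implementation (note the C++ code hashes signed chars, which can differ for non-ASCII input):

```python
# Sketch of fastText's subword n-grams and hashing trick.
# Function names are illustrative, not gensim's final API.

def compute_ngrams(word, minn=3, maxn=6):
    """Return character n-grams of `word`, with the < > boundary
    symbols used by fastText, for n in [minn, maxn]."""
    extended = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams

def ft_hash(ngram):
    """FNV-1a 32-bit hash, as fastText uses to map n-grams to buckets.
    (The C++ code hashes signed chars; this byte-wise version is a
    close sketch, not a guaranteed bit-exact match.)"""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h = (h ^ byte) * 16777619 & 0xFFFFFFFF
    return h

def ngram_bucket_indices(word, bucket=2000000, minn=3, maxn=6):
    """The hashing trick: instead of one row per n-gram, each n-gram is
    hashed into a fixed number of `bucket` rows of a shared matrix."""
    return [ft_hash(ng) % bucket for ng in compute_ngrams(word, minn, maxn)]

# e.g. compute_ngrams("cat", 3, 4) -> ['<ca', 'cat', 'at>', '<cat', 'cat>']
```

A word's vector is then the sum of its own vector and the bucket rows returned by `ngram_bucket_indices`, which is what lets fastText produce vectors for out-of-vocabulary words.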
**2. Integrating with the existing fastText wrapper**

IMO, it would be better to move the Python code (for loading, the hashing trick, etc.) from the wrapper into the native fasttext module, and then import that code in the wrapper, rather than the other way around.
**3. API**

I think the API should be similar to word2vec's, something like this:
```python
model = FastText(sentences, model, size, window, ...)
model.wv['example']  # similar to model.wv in word2vec
model.save(fname)
model = FastText.load(fname)
```
cc @jayantj @piskvorky
Sounds good -- it's a good idea to start with a PR that shows the new proposed package structure and refactoring. In clear (unoptimized) Python to start with, for concept clarity and to make discussions easier.
What is that model.ft in 3. API though? I'd prefer not to use obscure acronyms / variable names, unless it's really standard terminology. Isn't there something more descriptive?
@piskvorky ohh, model.ft was a mistake. I thought wv in model.wv stands for word2vec.
@gojomo yes, regarding fastText supervised classification, I think we should later incorporate labeledw2v #1153 into the fastText implementation from this PR. For now, just like Facebook's implementation, gensim's fasttext will have two params, skipgram and cbow; we will add a supervised param later with labeledw2v (in a different PR, maybe). That is the plan as of now.
Oh, I see. I'd say naming the variable wv was also unfortunate (word_vectors would be better).
@menshikh-iv how about we change the name to word_vectors, in all documentation, but keep wv as an alias to word_vectors too, for backward compatibility?
@prakhar2b People are reporting segfaults and limitations of the FB fastText implementation (e.g. no way to continue training). A clean, flexible, supported implementation in Python is long overdue, I'd say :)
@piskvorky yes, we can do this. Do you think the abbreviation wv is confusing our users?
I think so, yes. At least it is to me, and I am a user too :)
Re: un-abbreviating wv
To fully communicate its generality across all uses, the property could also be called token_vectors. Depending on the general style preference for/against abbreviations, it could be wordvecs or tokenvecs. (For Doc2Vec, the very KeyedVectors-like subcomponent that holds the doc-vectors is named docvecs.)
Aliases may need to be handled carefully given the SaveLoad/pickling approach, both across versions and to prevent duplicate writing of the same info. (Though perhaps, the syn0 -> wv.syn0 changes already paved the way for that.)
Good point on being careful with pickling! (Although I think (un)pickling handles such references correctly, it's worth double-checking.)
Possible alternatives: token_vectors, word_vectors, vectors (more generic/universal?), embeddings...?
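The alias-plus-pickling concern raised above can be addressed with a read-only property: because a property lives on the class rather than in the instance `__dict__`, pickling stores the vectors only once. A minimal sketch with illustrative names (not gensim's actual code, and `KeyedVectors` is stubbed out here):

```python
# Sketch of keeping `wv` as a backward-compatible alias of
# `word_vectors` without duplicating data in the pickle.
# Names are illustrative, not gensim's actual implementation.
import pickle

class KeyedVectors:
    """Stub standing in for the real vectors container."""
    def __init__(self):
        self.vocab = {}

class FastText:
    def __init__(self):
        # single source of truth; the only attribute pickle will store
        self.word_vectors = KeyedVectors()

    @property
    def wv(self):
        # deprecated alias: a property is defined on the class, not in
        # the instance __dict__, so the vectors are pickled exactly once
        return self.word_vectors

model = FastText()
assert model.wv is model.word_vectors  # same object, no copy

restored = pickle.loads(pickle.dumps(model))
assert restored.wv is restored.word_vectors  # alias survives a round-trip
assert 'wv' not in restored.__dict__         # vectors not stored twice
```

The same pattern would allow a deprecation warning to be emitted inside the property before eventually removing the alias.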
Current PR for this is #1525
Resolved in #1525