Currently, gensim has a wrapper for fastText. As discussed here, we need to implement the training code (subword n-grams, hashing trick) for unsupervised fastText in gensim, in Python. As fastText is only a slight modification of word2vec, we will need to refactor the word2vec code to properly reuse the overlapping code.
However, the original fastText outputs two files, .vec and .bin, following the C++ implementation's conventions. Should the Python implementation in gensim provide pkl format output?
This thread is intended to discuss and streamline all the requirements and deliverables regarding native fastText in gensim.
Hi @prakhar2b Let's simply use the pickle-style format from utils.SaveLoad we already use for word2vec models to persist the fastText models to disk.
Writing out the models in .bin format is a useful feature, but it comes later.
Also, as discussed, please look at the word2vec code in detail, figure out what is needed for fastText, formulate a clear plan of action and post it here. It should contain details about:

- the class design: should FastText subclass Word2Vec, should we create a common base class for Word2Vec and FastText, or should we use composition?

IMO, this design process is just as challenging and important as writing the code itself, and it would be good if you spent a good amount of time coming up with a clear plan.
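To make the design options above concrete, the common-base-class variant might look roughly like this. All names here (`BaseWordEmbeddingsModel`, `_train_sentence`) are hypothetical illustrations, not a committed gensim design:

```python
# Sketch of the common-base-class option; names are hypothetical,
# not a committed gensim design.

class BaseWordEmbeddingsModel:
    """Holds the shared training loop, vocabulary handling, etc."""

    def __init__(self, size=100, window=5):
        self.size = size
        self.window = window

    def train(self, sentences):
        # the shared outer loop lives in the base class
        for sentence in sentences:
            self._train_sentence(sentence)

    def _train_sentence(self, sentence):
        # each subclass supplies its own update step
        raise NotImplementedError


class Word2Vec(BaseWordEmbeddingsModel):
    def _train_sentence(self, sentence):
        pass  # plain word-vector updates


class FastText(BaseWordEmbeddingsModel):
    def _train_sentence(self, sentence):
        pass  # word-vector updates plus subword n-gram updates
```

The alternative, composition, would instead give `FastText` a `Word2Vec`-like component it delegates to; the trade-off is inheritance's tighter coupling versus composition's extra indirection.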
Awesome feature! Let me add that having FastText in gensim will open up other unsupervised possibilities, such as sent2vec in #1376.
Even though the gensim mission is more 'unsupervised', the addition of known-labels in the FastText-for-classification mode is such a small delta I would suggest it be in-scope. It's really just adding in another kind of known-data, during training, as possible 'target' outputs of the internal NN, that may make the resulting vectors better. (Potentially, even if not then using the resulting word-vecs for the exact same classification problem, the inclusion of these extra targets during training may have made the word-vecs better for other tasks.)
Also, it will otherwise be a constant exception-to-be-mentioned, in docs/support: "Yes gensim implements FastText except not FastText mode X".
Is this issue still open?
@dsouzadaniel yes, this is part of an ongoing Google Summer of Code project.
I looked further into the fasttext and word2vec code, and this is how I plan to approach it:
**1. Class structure / code reuse**

As fastText is a slight modification of word2vec, we will mostly reuse the word2vec training code, with very slight modifications. So I think we should create two modules: one holding the common overlapping code moved out of word2vec (it's better to decouple word2vec and fasttext as much as possible, IMO), and a second, fasttext.py, for fastText-exclusive code like subword n-grams and the hashing trick.
The training code in fasttext.cc / model.cc is very similar to the code in word2vec.py, such as the functions train_batch_cbow and train_batch_sg, and the functions for sampling. The n-gram code from dictionary.cc / matrix.cc needs to be written in Python in fasttext.py.
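For reference, the subword n-gram extraction and the hashing trick can be sketched in plain Python as below. This is an illustrative sketch, not gensim's final code: the function names are made up, and the hash mirrors the FNV-1a variant used by the C++ implementation (note the C++ code hashes signed chars, which can differ for non-ASCII input):

```python
# Sketch of fastText's subword n-grams and hashing trick.
# Function names are illustrative, not gensim's final API.

def compute_ngrams(word, minn=3, maxn=6):
    """Return character n-grams of `word`, with the < > boundary
    symbols used by fastText, for n in [minn, maxn]."""
    extended = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams

def ft_hash(ngram):
    """FNV-1a 32-bit hash, as fastText uses to map n-grams to buckets.
    (The C++ code hashes signed chars; this byte-wise version is a
    close sketch, not a guaranteed bit-exact match.)"""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h = (h ^ byte) * 16777619 & 0xFFFFFFFF
    return h

def ngram_bucket_indices(word, bucket=2000000, minn=3, maxn=6):
    """The hashing trick: instead of one row per n-gram, each n-gram is
    hashed into a fixed number of `bucket` rows of a shared matrix."""
    return [ft_hash(ng) % bucket for ng in compute_ngrams(word, minn, maxn)]

# e.g. compute_ngrams("cat", 3, 4) -> ['<ca', 'cat', 'at>', '<cat', 'cat>']
```

A word's vector is then the sum of its own vector and the bucket rows returned by `ngram_bucket_indices`, which is what lets fastText produce vectors for out-of-vocabulary words.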
**2. Integrating with the existing fastText wrapper**

IMO, it would be better to move the Python code (for loading, the hashing trick, etc.) from the wrapper into the native fasttext module, and then import that code in the wrapper, rather than the other way around.
**3. API**

I think the API should be similar to word2vec's, something like this:
```python
model = FastText(sentences, model, size, window, ...)
model.wv['example']  # similar to model.wv in word2vec
model.save(fname)
model = FastText.load(fname)
```
cc @jayantj @piskvorky
Sounds good -- it's a good idea to start with a PR that shows the new proposed package structure and refactoring. In clear (unoptimized) Python to start with, for concept clarity and to make discussions easier.
What is that model.ft in 3. API though? I'd prefer not to use obscure acronyms / variable names, unless it's really standard terminology. Isn't there something more descriptive?
@piskvorky ohh, model.ft was a mistake. I thought wv in model.wv stands for word2vec.
@gojomo yes, regarding fastText supervised classification, I think we should later incorporate labeledw2v #1153 into the fastText implementation from this PR. For now, just like Facebook's implementation, gensim's fasttext will have two params, skipgram and cbow; we will add a supervised param later with labeledw2v (in a different PR, maybe). That is the plan as of now.
Oh, I see. I'd say naming the variable wv was also unfortunate (word_vectors would be better).
@menshikh-iv how about we change the name to word_vectors, in all documentation, but keep wv as an alias to word_vectors too, for backward compatibility?
@prakhar2b People are reporting segfaults and limitations of the FB fastText implementation (e.g. no way to continue training). A clean, flexible, supported implementation in Python is long overdue, I'd say :)
@piskvorky yes, we can do this. Do you think the abbreviation wv is confusing our users?
I think so, yes. At least it is to me, and I am a user too :)
Re: un-abbreviating wv
To fully communicate its generality across all uses, the property could also be called token_vectors. Depending on the general style preference for/against abbreviations, it could be wordvecs or tokenvecs. (For Doc2Vec, the very KeyedVectors-like subcomponent that holds the doc-vectors is named docvecs.)
Aliases may need to be handled carefully given the SaveLoad/pickling approach, both across versions and to prevent duplicate writing of the same info. (Though perhaps, the syn0 -> wv.syn0 changes already paved the way for that.)
Good point on being careful with pickling! (Although I think (un)pickling handles such references correctly, it's worth double-checking.)
Possible alternatives: token_vectors, word_vectors, vectors (more generic/universal?), embeddings...?
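The alias-plus-pickling concern raised above can be addressed with a read-only property: because a property lives on the class rather than in the instance `__dict__`, pickling stores the vectors only once. A minimal sketch with illustrative names (not gensim's actual code, and `KeyedVectors` is stubbed out here):

```python
# Sketch of keeping `wv` as a backward-compatible alias of
# `word_vectors` without duplicating data in the pickle.
# Names are illustrative, not gensim's actual implementation.
import pickle

class KeyedVectors:
    """Stub standing in for the real vectors container."""
    def __init__(self):
        self.vocab = {}

class FastText:
    def __init__(self):
        # single source of truth; the only attribute pickle will store
        self.word_vectors = KeyedVectors()

    @property
    def wv(self):
        # deprecated alias: a property is defined on the class, not in
        # the instance __dict__, so the vectors are pickled exactly once
        return self.word_vectors

model = FastText()
assert model.wv is model.word_vectors  # same object, no copy

restored = pickle.loads(pickle.dumps(model))
assert restored.wv is restored.word_vectors  # alias survives a round-trip
assert 'wv' not in restored.__dict__         # vectors not stored twice
```

The same pattern would allow a deprecation warning to be emitted inside the property before eventually removing the alias.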
Current PR for this is #1525
Resolved in #1525