As mentioned on the mailing list, a user gets different results from the same pre-trained model depending on whether it is loaded with the gensim code or the original Facebook fastText code.
Download and build fastText:

```bash
wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
unzip v0.1.0.zip
cd fastText-0.1.0
make
```

Download the pre-trained `wiki.en` model (bin + text), then try to retrieve vectors with gensim:
```python
from gensim.models import FastText
m = FastText.load_fasttext_format("wiki.en.vec")
print(m["hello"])  # existent word
print(m["someundefinedword"])  # non-existent word
```
And with the original fastText binary (words are read from stdin):

```bash
./fasttext print-word-vectors ../wiki.en.bin
hello
someundefinedword
```
Expected result: identical vectors for both "hello" and "someundefinedword" from gensim and the Facebook binary.
Actual result: the vectors for "hello" match exactly, but those for "someundefinedword" differ.
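One way to check the discrepancy numerically (the helper below and the sample values are illustrative assumptions, not from the thread) is to parse the `print-word-vectors` CLI output and compare it against gensim's vector with `np.allclose`:

```python
import numpy as np

def parse_fasttext_line(line):
    """Parse one line of `fasttext print-word-vectors` output: 'word v1 v2 ...'."""
    parts = line.rstrip().split(" ")
    return parts[0], np.array(parts[1:], dtype=np.float32)

word, vec = parse_fasttext_line("hello 0.1 -0.2 0.3\n")  # toy line, not real output
print(word, vec.shape)

# Comparison sketch against a loaded gensim model `m`:
#   word, cli_vec = parse_fasttext_line(cli_output_line)
#   np.allclose(m[word], cli_vec, atol=1e-4)
# In the report above this holds for "hello" but fails for "someundefinedword".
```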
CC: @manneshiva
This behavior is expected, as pointed out by this comment in the unit test file. The vector for an OOV word in gensim is likely to be slightly different from the vector the original fastText implementation produces for the same OOV word. This is because the gensim code discards unused ngram vectors (to save memory), while the original implementation keeps all the buckets (and hence all ngrams). So it is possible that a new OOV word contains a few ngrams whose vectors are missing after the discarding. Such a case is highly unlikely (depending on the bucket size and vocab size) after PR #1916 (merged after the creation of this issue).
P.S.: The current code does not explicitly store the ngrams anymore, but only the hashes. It also discards unused hashes.
Thanks for the detailed explanation, @manneshiva :+1:
That we might throw out never-encountered (and thus never-trained) n-grams might be an acceptable optimization, for the case where a model is fully trained inside gensim. (After all, if needing arbitrary random untrained vectors for those n-grams later, they can be created later.)
But, to discard them from an externally-trained, loaded, static model, and thus get different vectors than the original FastText in what should be a completely deterministic process, strikes me as a deviation from expected behavior, and thus a bug, despite any explanation. Your thoughts, @piskvorky?
In general we want to stick to the original, yes.
@manneshiva what does "unused" mean for externally loaded models? How much do we gain by modifying the static model's default behaviour there?
I don't have enough intuition about the trade-offs, the pros and cons.
I guess the expected behavior is that, if I load a model, all the vectors in that model are actually used in the end. Deleting parts of a loaded binary model is not ideal, especially because it doesn't seem to be clearly pointed out anywhere. So the cons seem relatively clear: you are loading a model that may have been evaluated to have a certain quality, but under gensim you are no longer guaranteed that the quality will hold up. The pros are hard for me to judge; I'm not sure how hard it would be to adapt the code so this doesn't happen.
And like @piskvorky I'm also interested where the determination of "unused" subword vectors is actually made, since there's no training corpus in this setting.
Are there currently plans to change this?
In order to reduce the size of the ngrams matrix, the original fastText code (by Facebook) does not store a unique vector for each ngram but maps each ngram to a bucket using a hash function. The size of the ngrams matrix is thus limited to bucket_size x vector_size. Due to this, there is no guarantee that all the vectors corresponding to the buckets are used/trained. The actual number of vectors (from the ngrams matrix) that are trained/updated depends on the bucket size (default: 2 million) and the ngrams encountered in the training corpus.
Since the only ngram vectors that are actually trained are those of the ngrams of in-vocabulary words, "unused" buckets here refer to the hashes (bucket indices) that do not correspond to the hash of any ngram of an in-vocab word.
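The bucket mapping can be sketched as follows. This is a simplified, ASCII-oriented reimplementation of fastText's FNV-1a hash and character-ngram extraction (the real C++ code iterates UTF-8 codepoints and casts bytes through a signed char; the `used_buckets` helper name is my own, not a gensim API):

```python
def ft_hash(s: str) -> int:
    """FNV-1a 32-bit hash, as used by fastText (bytes cast like C++ int8_t)."""
    h = 2166136261
    for b in s.encode("utf-8"):
        if b >= 128:                # emulate the signed-char cast in the C++ code
            b -= 256
        h = ((h ^ (b & 0xFFFFFFFF)) * 16777619) & 0xFFFFFFFF
    return h

def char_ngrams(word: str, minn: int = 3, maxn: int = 6):
    """Character ngrams of the '<word>'-wrapped token (simplified, ASCII only)."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(minn, maxn + 1) for i in range(len(w) - n + 1)]

def used_buckets(vocab, bucket: int = 2_000_000):
    """Bucket indices actually hit by ngrams of in-vocabulary words."""
    return {ft_hash(ng) % bucket for word in vocab for ng in char_ngrams(word)}

print(len(char_ngrams("hello")))              # → 14 ngrams for "<hello>"
print(len(used_buckets(["hello", "world"])))  # far fewer than the 2M buckets
```

A real vocabulary hits many more buckets, but with the default 2M buckets and a small vocab, most bucket vectors are never touched by training, which is exactly what "unused" means here.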
There has been some discussion around this trade-off between reducing memory usage of the model (without compromising the quality of word vectors) vs deviating from "deterministic/expected behavior". The decision to discard "unused" ngram buckets was a part of the FastText wrapper in Gensim based on the memory profiling result seen in this comment. Also look at the discussions/memory profiling results in this issue.
IMO, it is best to provide the user with an option/parameter to either discard unused vectors (without loss in quality of word vectors) or not.
Ah, if it's just untrained garbage, we should definitely discard that. If someone relies on an exact reproducibility from random/untrained vectors, their app is broken anyway. That's not the kind of compatibility we care about.
@manneshiva I assume the determination of used/unused is straightforward? Or is any guessing/heuristics involved, any room for error?
I would think that if some other FastText implementation (like the original) saves out 'untrained garbage' in its serialized model, and loads-and-uses such noise in its subsequent OOV calculations after re-loading, so that it affects (reproducible) evaluations of the frozen model, then we're not really format compatible if we make the independent decision to discard that noise. And, we'll get a continuing tail of "what's up with this?" questions/bug-reports from that decision to be arbitrarily different in how we load a (frozen, completed, original-tool) model.
Along with reproducibility issues, another point of discussion over this was that there is a memory/speed trade-off involved (during loading). Quoting from a previous comment -
"For relatively small vocab sizes (~200k), the steady-state memory usage is 1.1 GB lower than it would be if we chose to keep all ngram vectors as is. (for 300-d vectors). This is at the cost of significantly increased loading time.
Conversely, for large vocab sizes (like for Wikipedia models), we don't reduce memory usage, while also causing much higher load times. (as @gojomo rightly pointed out)
In case the common use case is indeed loading large models, it might make sense to store ngram vectors as is, without trying to discard any unused ones."
For this, @manneshiva proposed a solution here which, IMO, gives us the best of both worlds, while by default providing exactly the same behaviour as the original fastText; this should reduce the follow-up questions/bug reports we get about this. The solution as @manneshiva described it -
"I feel we should give the user an option -- discard_unused_ngrams to save memory, which by default could be False. Since the memory saved for small vocab models is significant (owing to a fewer number of total used ngrams), this should be helpful for a user trying to load a small vocab model with limited RAM."
@manneshiva @piskvorky @gojomo @menshikh-iv do you see any potential issues with this approach? If not, I think we should go ahead with it.
@gojomo has raised a valid concern that, in theory, our heuristic for determining unused vectors could fail if the serialized model has had its vocabulary trimmed after training and before serializing. However, fastText doesn't do anything of the sort (and neither do any other models in my experience), so I think it's a reasonable assumption to make. A note/info-level log in the code could be useful.
I agree with https://github.com/RaRe-Technologies/gensim/issues/1261#issuecomment-351020115 and @jayantj (looks like the best variant).
We also definitely need to update the documentation in this place (to avoid user confusion).
@manneshiva @gojomo Hi, thanks for your answers. I am encountering a similar problem to the author's. I am confused about where these 'never-encountered (and thus never-trained) n-grams' come from. Is a random vector initialized for each bucket at the beginning? Thanks!
@Alice-Ke yes, all ngram vectors are initialized randomly before training. The gensim implementation throws away the ones that are not used by any of the known words. However, if you then look up the vector for a previously unknown word which contains such an ngram, gensim will return wrong results because it threw away the respective ngram vector. Further differences likely come from possibly incorrect handling of Unicode characters: https://github.com/RaRe-Technologies/gensim/issues/2059
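The effect can be seen with a toy example (invented numbers, not the real model): an OOV vector is an average over the bucket vectors its ngrams hash to, so a discarded (zeroed/missing) bucket shifts the result:

```python
import numpy as np

rng = np.random.default_rng(0)
buckets = rng.normal(size=(10, 4)).astype(np.float32)  # tiny stand-in for the 2M-bucket matrix
oov_bucket_ids = [2, 5, 7]                             # buckets the OOV word's ngrams hash to

full = buckets[oov_bucket_ids].mean(axis=0)            # what the original fastText returns

trimmed = buckets.copy()
trimmed[5] = 0.0                                       # bucket 5 was "unused" and discarded
reduced = trimmed[oov_bucket_ids].mean(axis=0)         # what a trimmed model would return

print(np.allclose(full, reduced))  # → False
```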