Fasttext: print-word-vectors returns zeros as embeddings for some words

Created on 11 Oct 2017 · 10 Comments · Source: facebookresearch/fastText

I am using a downloaded pretrained fastText model to generate embeddings for my vocabulary. After the latest commit,
ebbd3bfee59a319813214a5c50d7fabe8fb1e344, some words, especially those that are out of vocabulary for the model, get all-zero embeddings.
Here are two examples of the output:
$ echo 'saftware' | ./fasttext print-word-vectors wiki.en.bin
saftware 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

$ echo 'realli' | ./fasttext print-word-vectors wiki.en.bin
realli 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
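
When checking many words at once, it can help to flag all-zero rows automatically rather than by eye. The following is a minimal sketch that assumes only the output format shown above (a word followed by space-separated numbers); the function names are hypothetical and are not part of fastText.

```python
def parse_vector_line(line):
    """Split one 'word v1 v2 ...' output line into (word, list of floats)."""
    word, *values = line.split()
    return word, [float(v) for v in values]


def is_zero_vector(values):
    """True when the vector is non-empty and every component is exactly zero."""
    return bool(values) and all(v == 0.0 for v in values)


# Shortened sample line in the same shape as the output above.
word, vec = parse_vector_line("saftware 0 0 0 0")
flagged = is_zero_vector(vec)
```

Piping the output of print-word-vectors through a script like this would list the affected words without reading every row.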

All 10 comments

I think this situation is mentioned in the code here:

https://github.com/facebookresearch/fastText/blob/master/src/fasttext.cc#L160

But I still wonder: why haven't the pre-trained Wikipedia models for all languages been updated to version 12?

Hi @Stacy-D,

Thank you for reporting this bug! We are working on fixing this issue ASAP.

@filaPro: the bug is actually not there (fasttext.cc#L160 only applies to supervised models). The problem is that in the pre-trained word vectors in binary format, the pruneidx_size_ field is set to 0, while it should be -1 (as no pruning is used).
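
To see why a value of 0 rather than -1 could zero out OOV vectors, here is a toy model in Python. It is a sketch under assumptions, not fastText's actual code: it assumes that -1 means "no pruning table", that 0 means "a pruning table with no entries", and that an OOV vector is the average of the subword vectors that survive pruning. All names (`kept_subword_ids`, `oov_vector`, `prune_map`) are hypothetical.

```python
def kept_subword_ids(subword_ids, prune_size, prune_map=None):
    """Toy model of which subword ids survive a pruning table.

    prune_size == -1: no pruning, every id is kept.
    prune_size == 0: an empty pruning table, so nothing is kept.
    prune_size > 0: only ids listed in prune_map are kept (and remapped).
    """
    if prune_size < 0:
        return list(subword_ids)
    prune_map = prune_map or {}
    return [prune_map[i] for i in subword_ids if i in prune_map]


def oov_vector(subword_ids, table, dim, prune_size, prune_map=None):
    """Average the surviving subword vectors; all zeros if none survive."""
    ids = kept_subword_ids(subword_ids, prune_size, prune_map)
    vec = [0.0] * dim
    for i in ids:
        for d in range(dim):
            vec[d] += table[i][d]
    if ids:
        vec = [v / len(ids) for v in vec]
    return vec


# Two made-up subword vectors of dimension 2.
table = {0: [1.0, 2.0], 1: [3.0, 4.0]}
unpruned = oov_vector([0, 1], table, 2, prune_size=-1)
empty_table = oov_vector([0, 1], table, 2, prune_size=0)
```

In this model the same word gets a non-zero vector with -1 and an all-zero vector with 0, which matches the symptom reported above.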

Best,
Edouard Grave

Hey @Stacy-D,

We just updated the models, please give it a try and let us know if it works or not.

Thank you again for reporting this!

Christian

Hi @cpuhrsch,
Thanks for the updated models. I tried them and the problem seems to be fixed.
Best,
Stacy

The issue still persists when vectors are created afresh from text corpora.

Hello @technologistkj,

Thank you for your post. I tried reproducing your problem the following way:

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ ./word-vector-example.sh
[...]
$ echo "rL6AHV8PdI" | ./fasttext print-word-vectors result/fil9.bin
rL6AHV8PdI -0.012921 0.027566 0.02447 -0.028895 -0.076794 0.00024232 0.023369 -0.017054 0.034734 -0.050234 -0.011064 0.032803 0.015666 0.015034 0.071828 -0.034993 0.019008 0.017221 -0.024624 -0.00062403 -0.015722 -0.008082 -0.041755 0.037008 0.025829 0.0023138 -0.022022 -0.002246 0.015631 0.039966 0.004988 0.022385 0.0056453 -0.03398 -0.069502 0.021726 -0.017738 0.0020696 0.014315 0.0081895 -0.026444 -0.044003 0.03374 0.043305 -0.044605 -0.021455 -0.014389 0.0055447 0.0080337 -0.05985 0.016462 0.0030337 -0.0057783 -0.018836 -0.0083057 -0.056138 0.047519 0.018374 0.026927 0.066985 0.010327 0.001741 0.076185 -0.008449 0.016574 0.013783 0.038831 -0.010389 -0.038782 -0.053571 -0.019742 0.023117 0.010826 0.00070464 -0.021489 0.0010133 -0.02496 0.043632 0.01747 0.0034708 -0.011628 -0.0031117 0.0059641 0.025888 -0.026559 0.019224 0.017407 -0.0012481 -0.020946 0.012748 -0.028876 0.040704 -0.012078 -0.026164 -0.056953 0.010919 0.027313 -0.076724 0.0437 -0.044405

From what I can tell fastText does indeed produce vectors for OOV words.

Could you please post instructions on how to reproduce this issue on my end? Ideally within a docker image so that we can be sure to use the same environment. Please use either one of our test datasets or post the data you are using (if you are comfortable with this) so that I'm able to fully reproduce this issue on my end.

Thank you,
Christian

Christian,

I built models for Indic languages, e.g. Hindi or Telugu. Can you please check with those?

~/PROGRAMS/fastText/fasttext supervised -input ~/Downloads/data/wiki_data_tokenised_for_vector_creation/englishWikiAndNews5MSentences.txt -output semvec-eng -dim 108

Does the vector size matter? I ask because I pulled the latest code, rebuilt, and tried this:

git pull origin master
make

~/PROGRAMS/fastText/fasttext supervised -input ~/Downloads/data/wiki_data_tokenised_for_vector_creation/englishWikiAndNews5MSentences.txt -output semvec-eng -dim 108
Read 101M words
Number of words: 2226855
Number of labels: 0
Progress: 100.0% words/sec/thread: 5738344 lr: 0.000000 loss: 0.000000 ETA: 0h 0m

echo "rL6AHV8PdI" | ~/PROGRAMS/fastText/fasttext print-word-vectors semvec-eng.bin

Output is:

rL6AHV8PdI 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Hello @technologistkj,

Thank you for your comment. fastText uses subwords to retrieve vectors for words that are not part of the vocabulary. However, by default the supervised mode does not use subwords. You'll need to set the minn and maxn flags for this. I suggest you set -minn 3 and -maxn 6 and try this again.
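
As a rough illustration of why the n-gram length settings matter, the sketch below extracts character n-grams from a word wrapped in `<` and `>` boundary markers. It is a simplified stand-in, not fastText's implementation (which, among other things, handles UTF-8 and hashes the n-grams into buckets); the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(word, minn, maxn):
    """Character n-grams of '<word>' with lengths from minn to maxn inclusive.

    Returns an empty list when maxn <= 0, i.e. when subwords are disabled.
    """
    if maxn <= 0:
        return []
    wrapped = "<" + word + ">"
    grams = []
    for start in range(len(wrapped)):
        for n in range(max(minn, 1), maxn + 1):
            if start + n > len(wrapped):
                break
            grams.append(wrapped[start:start + n])
    return grams


no_subwords = char_ngrams("rL6AHV8PdI", 0, 0)    # subwords disabled
with_subwords = char_ngrams("rL6AHV8PdI", 3, 6)  # the suggested -minn 3 -maxn 6
```

With subwords disabled, an out-of-vocabulary word has no n-grams to draw a vector from. With lengths 3 to 6, it decomposes into pieces that can each contribute a learned vector.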

Thank you,
Christian

I am trying to get a word representation for an OOV word using fastText embeddings. fastText still gives all-zero embeddings for OOV words when the model is trained from a fresh text corpus.

I am using the following command:
./fastText/fasttext skipgram -input vehicles_keyword.txt -output vehicles_ft_model

Read 0M words
Number of words: 2016
Number of labels: 0
Progress: 100.0% words/sec/thread: 23458 lr: 0.000000 loss: 2.711548 eta: 0h0m

Then, when I try to find the embedding for nissssan, I get the following:
echo "nissssan" | ./fastText/fasttext print-word-vectors vehicles_ft_model.bin
nissssan 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Could you please help me with a solution?
