InferSent encoder demo with GloVe - Key Error

Created on 20 Aug 2018  ·  7 comments  ·  Source: facebookresearch/InferSent

I'm trying to run the demo.ipynb notebook in the encoder module with 300-dimensional GloVe vectors. I've run all the commands as detailed in the README and the notebook, but the model.encode call fails with the following error:

KeyError                                  Traceback (most recent call last)
<ipython-input-36-3fb4b1a1a3f7> in <module>()
----> 1 embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)
      2 print('nb sentences encoded : {0}'.format(len(embeddings)))

/ais/hal9000/vkpriya/InferSent-master/encoder/models.py in encode(self, sentences, bsize, tokenize, verbose)
    220         for stidx in range(0, len(sentences), bsize):
    221             batch = Variable(self.get_batch(
--> 222                         sentences[stidx:stidx + bsize]), volatile=True)
    223             if self.is_cuda():
    224                 batch = batch.cuda()

/ais/hal9000/vkpriya/InferSent-master/encoder/models.py in get_batch(self, batch)
    172         for i in range(len(batch)):
    173             for j in range(len(batch[i])):
--> 174                 embed[j, i, :] = self.word_vec[batch[i][j]]
    175 
    176         return torch.FloatTensor(embed)

KeyError: </s>

Should I explicitly add the </s> symbol to the word vector file?
Thanks!
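
For context, my setup roughly follows the demo cells; the paths and the sentence list below are just placeholders for what I actually use:

    import torch
    from models import InferSent

    # placeholders for wherever the pretrained model and my 300-dim GloVe file live
    MODEL_PATH = 'encoder/infersent1.pkl'
    W2V_PATH = 'path/to/my/glove.300d.txt'

    params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                    'pool_type': 'max', 'dpout_model': 0.0, 'version': 1}
    model = InferSent(params_model)
    model.load_state_dict(torch.load(MODEL_PATH))
    model.set_w2v_path(W2V_PATH)

    # in the demo the sentences come from samples.txt; this is just an example
    sentences = ['Everyone really likes the newest benefits.']
    model.build_vocab(sentences, tokenize=True)
    embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)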


All 7 comments

Did you find a fix? Maybe model.build_vocab(sentences, tokenize=True) helps?

EDIT: I have the same error. I tried building the vocab from the sentences and then model.update_vocab('</s>'), but it didn't work, so I'm not sure what to do.
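
For reference, update_vocab in encoder/models.py expects a list of sentences (it tokenizes them and only adds words it can actually find in the w2v file), so passing the raw token probably can't help, since </s> isn't in the vector file in the first place. A minimal sketch of the usual call:

    # update_vocab takes sentences, not individual tokens; words that are
    # absent from the w2v file (like '</s>') still won't get a vector here
    model.update_vocab(['A new example sentence.'], tokenize=True)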

I had this error using FastText vectors too.

I did this, but it isn't a great fix:

        for i in range(len(batch)):
            for j in range(len(batch[i])):
                # this next line here is my change
                if batch[i][j] != self.eos:
                    embed[j, i, :] = self.word_vec[batch[i][j]]

The error doesn't appear if I use the exact GloVe vectors as specified in the Readme - glove.840B.300d.txt. Probably </s> is missing in the others; maybe increasing the vocab size would help.

> The error doesn't appear if I use the exact GloVe vectors as specified in the Readme - glove.840B.300d.txt. Probably </s> is missing in the others; maybe increasing the vocab size would help.

It happens when a (real) word isn't found in your embedding. You then end up with </s>, which should never be in any embedding dictionary (it's supposed to be a non-word token).

What should probably happen is that it returns an average or random vector. In my "fix" it returns a zero vector (which is kind of OK too).

@nlothian Ah I see. Thanks!

Hi,
In demo.ipynb, if you use infersent1.pkl, please use the standard GloVe vectors (because that LSTM was trained with them), and use the fastText common-crawl vectors for infersent2.pkl. It is also important to specify the version in "params_model": version is "1" for infersent1.pkl and "2" for infersent2.pkl.
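Concretely, for infersent2.pkl the relevant part of the setup looks roughly like this (the paths are just examples of where the README places the files):

    # 'version' must match the checkpoint: 1 for infersent1.pkl (GloVe),
    # 2 for infersent2.pkl (fastText)
    params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                    'pool_type': 'max', 'dpout_model': 0.0, 'version': 2}
    model = InferSent(params_model)
    model.load_state_dict(torch.load('encoder/infersent2.pkl'))
    model.set_w2v_path('fastText/crawl-300d-2M.vec')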
Thanks
Alexis

Hi all,

This issue is due to the prepare_samples method (lines 193-201 of models.py), as follows:

        # filters words without w2v vectors
        for i in range(len(sentences)):
            s_f = [word for word in sentences[i] if word in self.word_vec]
            if not s_f:
                import warnings
                warnings.warn('No words in "%s" (idx=%s) have w2v vectors. \
                               Replacing by "</s>"..' % (sentences[i], i))
                s_f = [self.eos]
            sentences[i] = s_f

The end-of-sentence token </s> is used to represent the whole sentence when none of the tokens in the sentence has a corresponding vector.

As @nlothian suggests, the simplest solution might be to set the vector for EOS to the mean vector or a zero vector. Here, I add the following short snippet to the get_w2v method, just before it returns word_vec:

        # if the EOS token has no pretrained vector, fall back to the mean of all loaded vectors
        if self.eos not in word_vec:
            word_vec[self.eos] = np.mean(np.stack(list(word_vec.values()), axis=0), axis=0)

It should also work for fastText vectors, i.e. version 2.
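
A quick standalone check of that fallback with toy vectors (hypothetical example, only numpy needed):

    import numpy as np

    # toy table standing in for the loaded GloVe/fastText vectors
    word_vec = {'cat': np.array([1.0, 0.0]), 'dog': np.array([0.0, 1.0])}
    eos = '</s>'

    # same idea as the snippet above: back off to the mean of all known vectors
    if eos not in word_vec:
        word_vec[eos] = np.mean(np.stack(list(word_vec.values()), axis=0), axis=0)

    print(word_vec[eos])  # -> [0.5 0.5]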
