InferSent encoder demo with GloVe - Key Error

Created on 20 Aug 2018  ·  7 comments  ·  Source: facebookresearch/InferSent

I'm trying to run the demo.ipynb notebook in the encoder module with 300-dimensional GloVe vectors. I've run all the commands as detailed in the README and the notebook, but the model.encode call fails with the following error:

KeyError                                  Traceback (most recent call last)
<ipython-input-36-3fb4b1a1a3f7> in <module>()
----> 1 embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)
      2 print('nb sentences encoded : {0}'.format(len(embeddings)))

/ais/hal9000/vkpriya/InferSent-master/encoder/models.py in encode(self, sentences, bsize, tokenize, verbose)
    220         for stidx in range(0, len(sentences), bsize):
    221             batch = Variable(self.get_batch(
--> 222                         sentences[stidx:stidx + bsize]), volatile=True)
    223             if self.is_cuda():
    224                 batch = batch.cuda()

/ais/hal9000/vkpriya/InferSent-master/encoder/models.py in get_batch(self, batch)
    172         for i in range(len(batch)):
    173             for j in range(len(batch[i])):
--> 174                 embed[j, i, :] = self.word_vec[batch[i][j]]
    175 
    176         return torch.FloatTensor(embed)

KeyError: </s>

Should I explicitly add the </s> symbol to the word vector file?
Thanks!
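
For context, my setup roughly follows the demo cells; the paths and the sentence list below are just placeholders for what I actually use:

    import torch
    from models import InferSent

    # placeholders for wherever the pretrained model and my 300-dim GloVe file live
    MODEL_PATH = 'encoder/infersent1.pkl'
    W2V_PATH = 'path/to/my/glove.300d.txt'

    params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                    'pool_type': 'max', 'dpout_model': 0.0, 'version': 1}
    model = InferSent(params_model)
    model.load_state_dict(torch.load(MODEL_PATH))
    model.set_w2v_path(W2V_PATH)

    # in the demo the sentences come from samples.txt; this is just an example
    sentences = ['Everyone really likes the newest benefits.']
    model.build_vocab(sentences, tokenize=True)
    embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)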


All 7 comments

Did you find a fix? Maybe model.build_vocab(sentences, tokenize=True) helps?

EDIT: I have the same error. I tried building the vocab from the sentences and then model.update_vocab('</s>'), but it didn't work, so I'm not sure what to do.
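
For reference, update_vocab in encoder/models.py expects a list of sentences (it tokenizes them and only adds words it can actually find in the w2v file), so passing the raw token probably can't help, since </s> isn't in the vector file in the first place. A minimal sketch of the usual call:

    # update_vocab takes sentences, not individual tokens; words that are
    # absent from the w2v file (like '</s>') still won't get a vector here
    model.update_vocab(['A new example sentence.'], tokenize=True)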

I had this error using FastText vectors too.

I did this, but it isn't a great fix:

        for i in range(len(batch)):
            for j in range(len(batch[i])):
                # this next line here is my change
                if batch[i][j] != self.eos:
                    embed[j, i, :] = self.word_vec[batch[i][j]]

The error doesn't appear if I use the exact GloVe vectors as specified in the Readme - glove.840B.300d.txt. Probably </s> is missing in the others; maybe increasing the vocab size would help.

> The error doesn't appear if I use the exact GloVe vectors as specified in the Readme - glove.840B.300d.txt. Probably </s> is missing in the others; maybe increasing the vocab size would help.

It happens when a (real) word isn't found in your embedding. You then end up with </s>, which should never be in any embedding dictionary (it's supposed to be a non-word token).

What should probably happen is that it returns an average or random vector. In my "fix" it returns a zero vector (which is kind of OK too).

@nlothian Ah I see. Thanks!

Hi,
In demo.ipynb, if you use infersent1.pkl, please use the standard GloVe vectors (because that LSTM was trained with them), and use the fastText common-crawl vectors for infersent2.pkl. It is also important to specify the version in "params_model": version is "1" for infersent1.pkl and "2" for infersent2.pkl.
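Concretely, for infersent2.pkl the relevant part of the setup looks roughly like this (the paths are just examples of where the README places the files):

    # 'version' must match the checkpoint: 1 for infersent1.pkl (GloVe),
    # 2 for infersent2.pkl (fastText)
    params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                    'pool_type': 'max', 'dpout_model': 0.0, 'version': 2}
    model = InferSent(params_model)
    model.load_state_dict(torch.load('encoder/infersent2.pkl'))
    model.set_w2v_path('fastText/crawl-300d-2M.vec')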
Thanks
Alexis

Hi all,

This issue is due to the prepare_samples method (lines 193-201 of models.py), as follows:

        # filters words without w2v vectors
        for i in range(len(sentences)):
            s_f = [word for word in sentences[i] if word in self.word_vec]
            if not s_f:
                import warnings
                warnings.warn('No words in "%s" (idx=%s) have w2v vectors. \
                               Replacing by "</s>"..' % (sentences[i], i))
                s_f = [self.eos]
            sentences[i] = s_f

The end-of-sentence token </s> is used to represent the whole sentence when none of the tokens in the sentence has a corresponding vector.

As @nlothian suggests, the simplest solution might be to set the vector for EOS to the mean vector or a zero vector. Here, I add the following short snippet to the get_w2v method, just before it returns word_vec:

        # if the EOS token has no pretrained vector, fall back to the mean of all loaded vectors
        if self.eos not in word_vec:
            word_vec[self.eos] = np.mean(np.stack(list(word_vec.values()), axis=0), axis=0)

It should also work for fastText vectors, i.e. version 2.
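
A quick standalone check of that fallback with toy vectors (hypothetical example, only numpy needed):

    import numpy as np

    # toy table standing in for the loaded GloVe/fastText vectors
    word_vec = {'cat': np.array([1.0, 0.0]), 'dog': np.array([0.0, 1.0])}
    eos = '</s>'

    # same idea as the snippet above: back off to the mean of all known vectors
    if eos not in word_vec:
        word_vec[eos] = np.mean(np.stack(list(word_vec.values()), axis=0), axis=0)

    print(word_vec[eos])  # -> [0.5 0.5]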
