Allennlp: Use complete pretrained-embedding file for vocab creation

Created on 6 Apr 2019 · 3 Comments · Source: allenai/allennlp

Currently, the vocabulary contains only tokens that appear in the datasets given in the config, and these can be further pruned against a pretrained-embeddings file. As a result, the vocabulary ends up containing only tokens in the intersection of the datasets and the pretrained embeddings.

If a model trained this way is serving a demo, tokens that are in the pretrained file but not in the union of the datasets get mapped to UNK. These tokens would (arguably) get a better representation if all tokens in the pretrained-embeddings file were included in the vocab.

Could there be an option to include all tokens from the pretrained embeddings in the vocabulary? This should be a simple addition to the _extend function in the Vocabulary class.

They have seen some performance degradation due to this.
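
To make the request concrete, here is a minimal plain-Python sketch of the difference between the two behaviours. It does not use AllenNLP at all: `build_vocab`, `lookup`, the `include_all_pretrained` flag, and the `UNK` string are hypothetical names chosen for illustration, not the library's API, and the real change would presumably live in the Vocabulary class instead.

```python
from typing import Dict, List

UNK = "@@UNKNOWN@@"  # placeholder unknown-token string (assumption for this sketch)


def build_vocab(
    dataset_tokens: List[str],
    pretrained_tokens: List[str],
    include_all_pretrained: bool = False,  # hypothetical flag for the requested behaviour
) -> Dict[str, int]:
    """Map tokens to indices, with index 0 reserved for UNK."""
    pretrained = set(pretrained_tokens)
    vocab: Dict[str, int] = {UNK: 0}

    # Behaviour described in the issue: keep only dataset tokens that
    # also appear in the pretrained file (the intersection).
    for token in dataset_tokens:
        if token in pretrained and token not in vocab:
            vocab[token] = len(vocab)

    # Requested behaviour: additionally add every remaining pretrained token.
    if include_all_pretrained:
        for token in pretrained_tokens:
            if token not in vocab:
                vocab[token] = len(vocab)

    return vocab


def lookup(vocab: Dict[str, int], token: str) -> int:
    """Return the token's index, falling back to the UNK index."""
    return vocab.get(token, vocab[UNK])


dataset = ["the", "cat", "sat", "zzz"]
pretrained = ["the", "cat", "dog"]

pruned = build_vocab(dataset, pretrained)
full = build_vocab(dataset, pretrained, include_all_pretrained=True)
```

With these toy inputs, "dog" falls back to UNK under `pruned` but gets its own index under `full`, so at demo time it could use its pretrained vector. The cost is a larger vocabulary and embedding matrix, which is why an opt-in flag seems like the safer design.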
