Allennlp: Use complete pretrained-embedding file for vocab creation

Created on 6 Apr 2019 · 3 Comments · Source: allenai/allennlp

Currently, the vocabulary contains only tokens that appear in the datasets given in the config, and it can additionally be pruned based on a pretrained-embeddings file. As a result, the vocabulary contains only tokens in the intersection of the datasets and the pretrained embeddings.

If a model trained this way is serving a demo, tokens that are in the pretrained file but not in the union of the datasets get mapped to UNK. These tokens would have had an (arguably) better representation if all the tokens in the pretrained-embeddings file had been included in the vocab.
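To make the problem concrete, here is a minimal sketch in plain Python. It does not use AllenNLP; the helper names (build_pruned_vocab, lookup) and the toy token lists are made up for illustration, and "UNK" simply stands in for whatever out-of-vocabulary token the library uses.

```python
from typing import Iterable, List, Set

UNK = "UNK"  # placeholder for the library's out-of-vocabulary token


def build_pruned_vocab(dataset_tokens: Iterable[str],
                       pretrained_tokens: Iterable[str]) -> Set[str]:
    """Hypothetical helper: keep only dataset tokens that also have a pretrained vector."""
    return set(dataset_tokens) & set(pretrained_tokens)


def lookup(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Map every token outside the vocabulary to UNK."""
    return [tok if tok in vocab else UNK for tok in tokens]


# Toy data: "dog" has a pretrained vector but never appears in the datasets.
dataset_tokens = ["the", "cat", "sat"]
pretrained_tokens = ["the", "cat", "dog", "sat"]

pruned_vocab = build_pruned_vocab(dataset_tokens, pretrained_tokens)
full_vocab = pruned_vocab | set(pretrained_tokens)  # what this issue asks for

demo_input = ["the", "dog", "sat"]
mapped_pruned = lookup(demo_input, pruned_vocab)  # "dog" is lost
mapped_full = lookup(demo_input, full_vocab)      # "dog" is kept
```

With the pruned vocabulary, "dog" falls back to UNK at demo time even though a pretrained vector for it exists; with the full vocabulary it keeps its own entry.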

Can there be an option to include all tokens from the pretrained embeddings in the vocabulary? This should be a simple addition in the _extend function in the Vocabulary class.

They have seen some performance degradation due to this.

All 3 comments

I think @OyvindTafjord already added support for this: #1822.

... and there's also the --extend-vocab option in the evaluate command if you need it (#2501).

Thanks @HarshTrivedi. I missed the "min_pretrained_embeddings" param. It works.
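For anyone landing here later, a rough sketch of what the vocabulary section of a training config might look like with this parameter, written as a Python dict so it can be dumped to JSON. The file path and the count are placeholders, not values from this thread, and the exact semantics (as I understand it, keeping at least that many leading lines of the pretrained file for the namespace even if the words never occur in the data) should be checked against the Vocabulary docs for your AllenNLP version.

```python
import json

# Assumed shape of the "vocabulary" block in a training config.
# Path and count below are illustrative placeholders only.
vocabulary_config = {
    # Namespace -> pretrained embedding file to consult when building the vocab.
    "pretrained_files": {"tokens": "/path/to/pretrained_embeddings.txt.gz"},
    # Prune dataset tokens that have no pretrained vector.
    "only_include_pretrained_words": True,
    # Namespace -> minimum number of pretrained entries to keep regardless of
    # whether they appear in the datasets.
    "min_pretrained_embeddings": {"tokens": 100000},
}

config_fragment = json.dumps({"vocabulary": vocabulary_config}, indent=2)
print(config_fragment)
```

Setting the count at or above the number of lines in the embedding file should, under the assumption above, pull the whole file into the vocabulary.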
