Currently, the vocabulary created only contains tokens in the datasets given in the config, which can be pruned based on a pretrained-embeddings file. As a result, the vocabulary will only contain tokens in the intersection of datasets and pretrained-embeddings.
If a model trained this way is serving a demo, tokens that are in the pretrained-embeddings file but not in the union of the datasets get mapped to UNK. These tokens would have had an (arguably) better representation if all the tokens in the pretrained-embeddings file had been included in the vocab.
Could an option be added to include all tokens from the pretrained-embeddings file in the vocabulary? This should be a simple addition to the _extend function in the Vocabulary class.
They have seen some performance degradation due to this.
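To make the behaviour concrete, here is a minimal sketch in plain Python. It is not AllenNLP's Vocabulary code; the function names, the include_all_pretrained flag, and the toy token lists are all hypothetical and only illustrate the difference between keeping the intersection and keeping every pretrained token.

```python
# Illustrative sketch only: plain Python, not AllenNLP's implementation.
# All names here (build_vocab, lookup, include_all_pretrained) are hypothetical.

UNK = "@@UNKNOWN@@"  # placeholder out-of-vocabulary symbol


def build_vocab(dataset_tokens, pretrained_tokens, include_all_pretrained=False):
    """Return the set of tokens kept in the vocabulary."""
    dataset_tokens = set(dataset_tokens)
    pretrained_tokens = set(pretrained_tokens)
    if include_all_pretrained:
        # Requested behaviour: keep every token that has a pretrained vector.
        return pretrained_tokens
    # Behaviour described above: only tokens present in both sources survive.
    return dataset_tokens & pretrained_tokens


def lookup(token, vocab):
    """Map a token to itself if it is in the vocabulary, otherwise to UNK."""
    return token if token in vocab else UNK


dataset = ["the", "cat", "sat"]
pretrained = ["the", "cat", "dog"]

pruned_vocab = build_vocab(dataset, pretrained)
full_vocab = build_vocab(dataset, pretrained, include_all_pretrained=True)

# "dog" has a pretrained vector but never appears in the datasets.
pruned_result = lookup("dog", pruned_vocab)  # falls back to UNK
full_result = lookup("dog", full_vocab)      # keeps its own entry
```

With the pruned vocabulary, "dog" is mapped to the unknown symbol at demo time even though a pretrained vector exists for it; with all pretrained tokens included, it keeps its own entry.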
I think @OyvindTafjord already added support for this: #1822.
... and there's also an --extend-vocab option in the evaluate command if you need it (#2501).
Thanks @HarshTrivedi. I missed the "min_pretrained_embeddings" param. It works.