Allennlp: How does ELMo handle padding?

Created on 21 Sep 2018 · 2 comments · Source: allenai/allennlp

When I am embedding batches of sentences (of different length), I introduce <PAD> tokens, which I include in the vocabulary file during preprocessing. I'm wondering how ELMo handles special tokens like a <PAD> token, and how its placement before and after the </S> token affects performance.

All 2 comments

The allennlp code handles padding automatically when converting your text to ids, e.g. in allennlp.modules.elmo.batch_to_ids, so you don't have to worry about adding padding tokens. Internally, it does this by setting all of the character ids at padded positions to the masking id 0. If you add special tokens like <PAD> to your input, ELMo will treat them just like any other token, with an unknown but likely undesired impact.
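To make the masking idea concrete, here is a simplified pure-Python sketch of character-id padding. It is not the allennlp implementation: the function names (`word_to_char_ids`, `sketch_batch_to_ids`, `mask_from_char_ids`) are hypothetical, and the constants (50 characters per token, the begin/end-of-word and filler ids, the +1 shift) are assumptions modelled on how ELMo's character mapper is understood to work. The point it illustrates is only that real tokens never produce the id 0, so all-zero rows can safely stand for padding.

```python
from typing import List

# All constants below are assumptions for illustration, not values read from allennlp.
MAX_WORD_LENGTH = 50   # characters kept per token
BEGIN_WORD = 258       # marker placed before the token's bytes
END_WORD = 259         # marker placed after the token's bytes
WORD_FILLER = 260      # fills the unused tail of a short token
MASK_ID = 0            # reserved for padded (non-existent) tokens


def word_to_char_ids(word: str) -> List[int]:
    """Encode one token as MAX_WORD_LENGTH character ids, none of them 0."""
    encoded = word.encode("utf-8", "ignore")[: MAX_WORD_LENGTH - 2]
    ids = [WORD_FILLER] * MAX_WORD_LENGTH
    ids[0] = BEGIN_WORD
    for k, byte in enumerate(encoded, start=1):
        ids[k] = byte
    ids[len(encoded) + 1] = END_WORD
    # Shift everything by +1 so that 0 stays free for masking.
    return [i + 1 for i in ids]


def sketch_batch_to_ids(batch: List[List[str]]) -> List[List[List[int]]]:
    """Pad every sentence to the longest one using all-zero token rows."""
    max_len = max((len(sentence) for sentence in batch), default=0)
    padded = []
    for sentence in batch:
        rows = [word_to_char_ids(word) for word in sentence]
        rows += [[MASK_ID] * MAX_WORD_LENGTH for _ in range(max_len - len(sentence))]
        padded.append(rows)
    return padded


def mask_from_char_ids(char_ids: List[List[List[int]]]) -> List[List[bool]]:
    """A token is real if any of its character ids is non-zero."""
    return [[any(c != MASK_ID for c in token) for token in sentence]
            for sentence in char_ids]


sentences = [["The", "cat", "sat"], ["Hello"]]
char_ids = sketch_batch_to_ids(sentences)
mask = mask_from_char_ids(char_ids)
print(mask)  # [[True, True, True], [True, False, False]]
```

In this sketch the second sentence receives two all-zero rows, and the mask is recovered from the ids alone, without any <PAD> token in the input.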

@matt-peters Hi, as far as I know, the <s> token in vocab.txt has id 0.
But this id collides with the masking id 0.

I believe the id of the <s> token is 0 based on the bilm-tf source code and a real-world vocab.txt file:
1

Could you please help clear up my confusion?
I look forward to your prompt reply.
