When I embed batches of sentences of different lengths, I introduce <PAD> tokens, which I add to the vocabulary file during preprocessing. I'm wondering how ELMo handles special tokens like <PAD>, and how placing them before or after the </S> token affects performance.
The allennlp code automatically handles padding when converting your text to ids, e.g. in allennlp.modules.elmo.batch_to_ids, so you don't have to worry about adding padding tokens. Internally, it does this by setting all of the character ids for a padded position to the masking id 0. If you add special tokens like <PAD> to your input, ELMo will treat them just like any other token, with an unknown but likely undesired impact.
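To make the masking behaviour concrete, here is a minimal pure-Python sketch of the idea. It is not the allennlp implementation: the `toy_*` helpers, the 50-character width, and the byte-plus-one encoding are assumptions for illustration only, and the real mapper does more (e.g. word-boundary markers), which is omitted here.

```python
from typing import List

MAX_WORD_LENGTH = 50  # assumed fixed number of character slots per token
MASK_ID = 0           # the masking id described above


def toy_word_to_char_ids(word: str) -> List[int]:
    """Hypothetical, simplified mapper: UTF-8 bytes shifted by +1 so that
    id 0 is never produced by a real character."""
    encoded = word.encode("utf-8")[:MAX_WORD_LENGTH]
    ids = [byte + 1 for byte in encoded]
    # Fill unused slots; a non-empty token always keeps at least one nonzero id.
    return ids + [MASK_ID] * (MAX_WORD_LENGTH - len(ids))


def toy_batch_to_ids(batch: List[List[str]]) -> List[List[List[int]]]:
    """Pad every sentence to the longest one with all-zero character vectors."""
    longest = max(len(sentence) for sentence in batch)
    padded = []
    for sentence in batch:
        rows = [toy_word_to_char_ids(word) for word in sentence]
        rows += [[MASK_ID] * MAX_WORD_LENGTH for _ in range(longest - len(sentence))]
        padded.append(rows)
    return padded


def toy_mask(char_ids: List[List[List[int]]]) -> List[List[bool]]:
    """A position counts as a real token if any of its character ids is nonzero."""
    return [[any(c != MASK_ID for c in token) for token in sentence]
            for sentence in char_ids]


batch = [["First", "sentence", "."], ["Another", "."]]
print(toy_mask(toy_batch_to_ids(batch)))
# [[True, True, True], [True, True, False]]

# A literal "<PAD>" string has ordinary characters, so it is NOT masked.
print(toy_mask(toy_batch_to_ids([["Another", ".", "<PAD>"]])))
# [[True, True, True]]
```

In this sketch the automatically padded position is masked out, while a hand-written <PAD> token is encoded like any other word, which is why adding it yourself changes what the model sees.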
@matt-peters Hi, as far as I know, the <s> token in vocab.txt has id 0.
But this id collides with the masking id 0.
I believe the id of the <s> token is 0 based on the bilm-tf source code and the actual vocab.txt file:

Could you please clarify this for me?
I look forward to your reply.