Allennlp: Add start and end tokens to source sentence in seq2seq DatasetReader

Created on 14 Dec 2017 · 4 Comments · Source: allenai/allennlp

Two points here:
1) Is it intentional that you do not append the @@END@@ token to the source sequence?
2) These @@START@@ and @@END@@ tokens are currently added directly inside the seq2seq dataset reader. However, they can also be added via WordTokenizer params. The dataset reader could then fetch them from the WordTokenizer instance for later use in the decoder.
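
To make the second point concrete, here is a minimal sketch of the idea. `SimpleTokenizer` and `SimpleReader` are hypothetical stand-ins written for illustration, not AllenNLP's `WordTokenizer` or its seq2seq dataset reader, and the parameter names `start_tokens` and `end_tokens` are assumptions of this sketch.

```python
from typing import List, Optional


class SimpleTokenizer:
    """Hypothetical whitespace tokenizer that can add boundary markers."""

    def __init__(self,
                 start_tokens: Optional[List[str]] = None,
                 end_tokens: Optional[List[str]] = None) -> None:
        self.start_tokens = list(start_tokens or [])
        self.end_tokens = list(end_tokens or [])

    def tokenize(self, text: str) -> List[str]:
        # Markers are added here, so the reader does not hard-code them.
        return self.start_tokens + text.split() + self.end_tokens


class SimpleReader:
    """Hypothetical reader that takes its markers from the tokenizer."""

    def __init__(self, tokenizer: SimpleTokenizer) -> None:
        self.tokenizer = tokenizer
        # Fetched once from the tokenizer for later use, e.g. by a decoder.
        self.start_token = tokenizer.start_tokens[0] if tokenizer.start_tokens else None
        self.end_token = tokenizer.end_tokens[-1] if tokenizer.end_tokens else None

    def text_to_tokens(self, line: str) -> List[str]:
        return self.tokenizer.tokenize(line)


tokenizer = SimpleTokenizer(start_tokens=["@@START@@"], end_tokens=["@@END@@"])
reader = SimpleReader(tokenizer)
tokens = reader.text_to_tokens("hello world")
```

In this arrangement the markers live in the tokenizer configuration, and anything that needs them later has to go through the reader.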

P2

All 4 comments

[Update] Regarding the 1st point, I now see there are many ways people use these tokens:

  • Microsoft uses them (see the 2nd figure) just like you do
  • Amazon uses them (see the 1st figure) just like Microsoft, but also with @@END@@ at the end of a source sentence
  • Philipp Koehn uses them (see Figure 13.22, page 53) just like Amazon, but also with @@START@@ at the beginning of a source sentence
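
The conventions in the list above differ mainly in which markers wrap the source sentence. A minimal sketch of those variants follows. It assumes the target is always wrapped in both markers, which the thread does not state explicitly, and `wrap_pair` with its two flags is a hypothetical helper introduced only for illustration.

```python
from typing import List, Sequence, Tuple

START = "@@START@@"
END = "@@END@@"


def wrap_pair(source: Sequence[str],
              target: Sequence[str],
              source_start: bool = False,
              source_end: bool = False) -> Tuple[List[str], List[str]]:
    """Return (source, target) token lists with boundary markers added.

    The target always gets START and END (an assumption of this sketch);
    the source markers are optional and controlled by the two flags.
    """
    src = list(source)
    if source_start:
        src.insert(0, START)
    if source_end:
        src.append(END)
    tgt = [START, *target, END]
    return src, tgt


plain = wrap_pair(["a", "b"], ["x"])
with_end = wrap_pair(["a", "b"], ["x"], source_end=True)
with_both = wrap_pair(["a", "b"], ["x"], source_start=True, source_end=True)
```

Exposing the source markers as flags keeps every variant available without committing to one of them.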

So I think this is no longer an issue.

Oh, I hadn't thought about the implications of including the end token as part of the encoder... We were only thinking about the fact that it never gets input to the decoder. I'm really not sure if it makes a difference. If you discover that it does make a difference, by all means, submit a PR to fix it, or make it an option.

For your second point, these tokens have special meaning that the _model_ needs to know about. The model doesn't currently have access to the DatasetReader object, and it isn't really desirable from an API standpoint to give it. So instead we define shared constants that both the DatasetReader and the Model use. It doesn't really make sense to configure this in the tokenizer. We had some discussion of this here: https://github.com/allenai/allennlp/pull/366#discussion_r143308089.
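
A rough sketch of the shared-constants arrangement described above: one pair of module-level constants is used both by the code that prepares the data and by the code that decodes. The constants are defined locally so the sketch runs without AllenNLP, and `read_target`, `build_vocab`, and `greedy_decode` are hypothetical helpers, not the library's API.

```python
from typing import Callable, Dict, Iterable, List

# Shared constants: the reader side and the model side both refer to these.
START_SYMBOL = "@@START@@"
END_SYMBOL = "@@END@@"


def read_target(tokens: Iterable[str]) -> List[str]:
    """Reader side: wrap target tokens in the shared markers."""
    return [START_SYMBOL, *tokens, END_SYMBOL]


def build_vocab(sequences: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Assign each distinct token an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for sequence in sequences:
        for token in sequence:
            vocab.setdefault(token, len(vocab))
    return vocab


def greedy_decode(step_fn: Callable[[int], int],
                  vocab: Dict[str, int],
                  max_steps: int = 10) -> List[int]:
    """Model side: start from START_SYMBOL and stop when END_SYMBOL is produced."""
    end_index = vocab[END_SYMBOL]
    last = vocab[START_SYMBOL]
    output: List[int] = []
    for _ in range(max_steps):
        last = step_fn(last)
        if last == end_index:
            break
        output.append(last)
    return output


vocab = build_vocab([read_target(["a", "b"])])
# Toy step function standing in for a decoder: always predicts the next index.
decoded = greedy_decode(lambda index: index + 1, vocab)
```

Neither side needs a reference to the other here; both depend only on the two constants.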

Having EOS appended to the source gives the decoder something it can always attend to while generating the target EOS. Without a source EOS, it mostly learns to attend to the last source token, whatever it is (e.g. a dot, a question mark, or some word). This might require more effort from the model than just explicitly attending to the source EOS in all cases. E.g. my recent model (not exactly seq2seq, but anyway) does not generate EOS in its first epochs when I remove the dot from the end of the source.

Regarding the 2nd point, yes, it makes sense.

Yeah, that makes a lot of sense. I'm going to change the name of this issue to something like "Add an end token to the source tokenizer". If you want to submit a PR to fix it, that'd be great.
