After running the script prepare-iwslt14.sh, the resulting files contain many @ symbols. The file _valid.en_ begins like this:
it had two@@ -@@ to-@@ three-@@ to-@@ 400 times the toxic lo@@ ads ever allowed by the ep@@ a .
often what j@@ ams us up is se@@ wa@@ ge .
what do you do when you have this sort of disrup@@ ted flow ?
stephen pal@@ um@@ b@@ i : following the mer@@ cur@@ y tra@@ il
Is this really the correct behavior?
This is called BPE (byte-pair encoding) tokenization. It splits infrequent words into smaller, more frequent subword tokens. A trailing @@ marks a token that is continued by the next token, so you can tell when multiple tokens belong to the same word.
For example, it may split a single long word into several "subwords":
antidisestablishmentarianism -> anti@@ dis@@ est@@ ablishment@@ arianism
Note that "arianism" does not end in @@, because it is the last piece of the word. So yes, this is the expected output.
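If you want to read the text as plain words again, the markers can be stripped by deleting every "@@" that is followed by a space or by the end of the line. Here is a minimal sketch in Python; the function name `remove_bpe` is made up for this example and is not part of the preprocessing script:

```python
import re

def remove_bpe(line: str, marker: str = "@@") -> str:
    """Join BPE subword tokens back into whole words.

    Assumes the convention seen in the files above: a token that is
    continued by the next token ends with `marker`.
    """
    # Delete the marker together with the space after it (or at end of line).
    pattern = re.escape(marker) + r"( |$)"
    return re.sub(pattern, "", line)

if __name__ == "__main__":
    print(remove_bpe("often what j@@ ams us up is se@@ wa@@ ge ."))
```

Run on the second line of _valid.en_, this gives back "often what jams us up is sewage ." The model itself should still be trained on the BPE-split files; undoing the split is only for inspecting data or post-processing output.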