Fairseq: prepare-iwslt14.sh: chars are replaced with @. Why?

Created on 20 Mar 2020 · 1 Comment · Source: pytorch/fairseq

❓ Questions and Help

After running the script prepare-iwslt14.sh the resulting files contain many @ symbols. The file _valid.en_ begins like this:

it had two@@ -@@ to-@@ three-@@ to-@@ 400 times the toxic lo@@ ads ever allowed by the ep@@ a .
often what j@@ ams us up is se@@ wa@@ ge .
what do you do when you have this sort of disrup@@ ted flow ?
stephen pal@@ um@@ b@@ i : following the mer@@ cur@@ y tra@@ il

Is this really the correct behavior?

What's your environment?

  • fairseq: 0.9.0
  • PyTorch: 1.4.0
  • OS: Ubuntu 18.04
  • How you installed fairseq: pip

Most helpful comment

Yes, this is expected. It is called BPE (byte-pair encoding) tokenization. It splits infrequent words into more frequent subword tokens, and @@ marks a token that is continued by the next token in the same word.

For example, it may split a single long word into several "subwords":
antidisestablishmentarianism -> anti@@ dis@@ est@@ ablishment@@ arianism

Note that "arianism" does not have the @@ ending, because it is the end of the word.
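
To make the marker convention concrete, here is a minimal Python sketch. It is not fairseq's own code. The helper names `mark_pieces` and `remove_bpe` are made up for illustration, and the sketch assumes the separator is exactly `@@` followed by a single space, as in the file excerpt above. It shows why the last piece of a word carries no marker, and how the original words can be recovered.

```python
import re


def mark_pieces(pieces, marker="@@"):
    """Join the subword pieces of ONE word (hypothetical helper).

    Every piece except the last gets the marker appended, so the
    final piece of a word never carries "@@".
    """
    return (marker + " ").join(pieces)


def remove_bpe(line, marker="@@"):
    """Undo the segmentation by deleting each marker together with the
    space after it (or a marker left dangling at the end of the line)."""
    return re.sub(re.escape(marker) + r"( |$)", "", line)


segmented = mark_pieces(["anti", "dis", "est", "ablishment", "arianism"])
print(segmented)

restored = remove_bpe("often what j@@ ams us up is se@@ wa@@ ge .")
print(restored)
```

The model is trained on the segmented text, so the @@ symbols should stay in the prepared files. For readable output at generation time, fairseq-generate has a --remove-bpe option that performs this kind of clean-up.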
