Fairseq: prepare-iwslt14.sh: chars are replaced with @. Why?

Created on 20 Mar 2020 · 1 Comment · Source: pytorch/fairseq

❓ Questions and Help

After running the script prepare-iwslt14.sh the resulting files contain many @ symbols. The file _valid.en_ begins like this:

it had two@@ -@@ to-@@ three-@@ to-@@ 400 times the toxic lo@@ ads ever allowed by the ep@@ a .
often what j@@ ams us up is se@@ wa@@ ge .
what do you do when you have this sort of disrup@@ ted flow ?
stephen pal@@ um@@ b@@ i : following the mer@@ cur@@ y tra@@ il

Is this really the correct behavior?

What's your environment?

fairseq: 0.9.0
PyTorch: 1.4.0
OS: Ubuntu 18.04
How you installed fairseq: pip



leocaprio

Most helpful comment

This is byte-pair encoding (BPE) tokenization, so yes, the output is expected. BPE splits infrequent words into more frequent subword tokens. The @@ suffix marks a token that is continued by the next token, i.e. both belong to the same word.

For example, it may split a single long word into several "subwords":
antidisestablishmentarianism -> anti@@ dis@@ est@@ ablishment@@ arianism

Note that "arianism" does not have the @@ ending, because it is the end of the word.
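To make the @@ convention concrete, here is a minimal Python sketch of how the marker is attached and how it can be stripped again to recover the original words. The helper names `mark_subwords` and `remove_bpe` are hypothetical (they are not fairseq functions), and the subword pieces in the example are chosen for illustration; a trained BPE model may split the word differently.

```python
import re

BPE_MARKER = "@@"  # continuation marker seen in the prepared files


def mark_subwords(pieces):
    """Join the subword pieces of ONE word.

    Every piece except the last gets the '@@' suffix, which signals
    "the word continues in the next token".
    """
    return " ".join([p + BPE_MARKER for p in pieces[:-1]] + [pieces[-1]])


def remove_bpe(line):
    """Undo the marking: drop each '@@' together with the space after it
    (or a trailing '@@' at the end of the line), gluing pieces back together."""
    return re.sub(r"@@( |$)", "", line)


if __name__ == "__main__":
    # Hypothetical split of a long word into subword pieces.
    word = mark_subwords(["anti", "dis", "est", "ablishment", "arianism"])
    print(word)
    print(remove_bpe(word))

    # Second line of valid.en as quoted in the question.
    print(remove_bpe("often what j@@ ams us up is se@@ wa@@ ge ."))
```

For readable output at generation time, fairseq-generate also accepts a `--remove-bpe` flag, which should perform the same kind of merge on the hypotheses.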

Guitaricet on 21 Mar 2020

