To run the pretrained models here, would we need to use the same dictionary and BPE codes that were used to train them? Or does it not matter? If it matters, can you provide the dictionaries?
The German model has a prepare script, so that may generate the same dictionary, but others (such as the Chinese one) don't have one, which makes it hard to reproduce the same dictionary.
That paper used the standard preprocessed datasets provided by fairseq. You can follow the instructions to generate them: https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md
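To spell out the answer to the original question: yes, the pretrained models only work with the exact dictionary and BPE codes they were trained with, so evaluation data has to be segmented with those codes before binarization. Here is a minimal sketch of applying released codes with subword-nmt in Python; the file name `bpecodes` is an assumption about what ships in the model archive.

```python
# Minimal sketch: apply released BPE codes to tokenized input with subword-nmt.
# 'bpecodes' is an assumed file name; adjust to whatever the archive contains.
from subword_nmt.apply_bpe import BPE

with open('bpecodes', encoding='utf-8') as codes_file:
    bpe = BPE(codes_file)

# Input must already be tokenized the same way as the training data
# (e.g. with the Moses tokenizer for the WMT models).
tokenized = 'das ist ein Test .'
print(bpe.process_line(tokenized))  # e.g. 'das ist ein Te@@ st .'
```

The binarized dataset also has to be built against the released dictionary files (fairseq-preprocess accepts them via `--srcdict`/`--tgtdict`); otherwise the token indices won't line up with the model's embeddings.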
You're right that the Zh-En scripts are missing.
@myleott We need the (missing) BPE codes **together with** the Zh-En scripts to make use of the provided pre-trained model.
Yep, I can add them later today, thanks for pointing this out.
The BPE codes are now available in a new set of archives with a .tar.gz extension. I've also updated the README with a bunch of additional usage instructions via torch.hub: https://github.com/pytorch/fairseq/tree/master/examples/pay_less_attention_paper#example-usage-torchhub
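For anyone landing here later, here is a hedged sketch of what the torch.hub path looks like. The checkpoint name `lightconv.glu.wmt17.zh-en` is my assumption of one of the pay-less-attention models, so check `torch.hub.list('pytorch/fairseq')` for the exact names.

```python
# Sketch of loading one of the pretrained models via torch.hub.
# The checkpoint name below is an assumption; list the real names first.
import torch

print(torch.hub.list('pytorch/fairseq'))  # shows the available pretrained models

# torch.hub downloads the checkpoint together with its dictionary and BPE codes,
# so no manual preprocessing is needed on this path.
zh2en = torch.hub.load(
    'pytorch/fairseq',
    'lightconv.glu.wmt17.zh-en',   # hypothetical name, pick one from the list above
    tokenizer='moses',
    bpe='subword_nmt',
)
print(zh2en.translate('你好 世界'))
```

The `translate()` call applies tokenization and BPE with the bundled codes internally, which is why the torch.hub route sidesteps the dictionary question from earlier in the thread.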