Fairseq: NAT distillation dataset

Created on 4 Oct 2019 · 22Comments · Source: pytorch/fairseq

Hi,

After unzipping the distillation dataset (https://github.com/pytorch/fairseq/blob/master/examples/nonautoregressive_translation/README.md#download), there are 15 files.

I am confused about the "valid-repeat.en-de.de", "valid-repeat.en-de.en", "valid.en-de.de", "valid.en-de.en", and "valid.en-de.ori" files.

I have the following questions:

Are these datasets translated from a transformer-big model with beam search technique, if yes, how about the BLEU score on the test set, 29.3 or 28.4 or others?
Which two validation files should be selected to build the binarized corpus to train a NAT model, I guess they should be "valid.en-de.de" and "valid.en-de.en"?
What are the roles of "valid-repeat.en-de.de" and "valid-repeat.en-de.en" files?
Intuitively, should I use "train.en-de.en(de)", "valid.en-de.en(de)", and "test.en-de.en(de)" to build the binarized corpus?

Thank you!

Source

SunbowLiu

Most helpful comment

Hi @MultiPath .
In the provided distillation datasets, I found that valid.en-de was also distilled. Is it necessary that valid data need to be distilled? Thank you!

yokusama on 19 Dec 2019

👍3

All 22 comments

Hi,

These datasets are based on transformer-base with BLEU score around 27.2.
valid-repeat.en-de.{en,de} are simply repeating 13 times from the valid.en-de.{en,de}. They are only used to enlarge the validation set when calculating the validation loss. Yes, you can binarize train/valid/test.en-de.{en, de}.

MultiPath on 9 Oct 2019

Hi,

These datasets are based on transformer-base with BLEU score around 27.2.

valid-repeat.en-de.{en,de} are simply repeating 13 times from the valid.en-de.{en,de}. They are only used to enlarge the validation set when calculating the validation loss. Yes, you can binarize train/valid/test.en-de.{en, de}.

Hi Jiatao,

Thank you very much for your reply! I still curious about valid-repeat.en-de.{en,de} and wanna know more details about them.

The question is:

Why and when we need to use the enlarged validation set? After using the enlarged validation set, the validation loss will be increased or decreased, and why we need the loss to be changed?

Thank you!

SunbowLiu on 10 Oct 2019

I think it might be due to when the data is huge the validation loss change might be clearly visible and we can't simply multiply it by 13 because same sentence might give different losses for different iterations.

gvskalyan on 10 Oct 2019

👍2

Hi Sunbow,

Why and when we need to use the enlarged validation set? After using the enlarged validation set, the validation loss will be increased or decreased, and why we need the loss to be changed?

Because the training process for Levenshtein Transformer involves "random delete" operations.
So repeating 13 times on the validation set is trying to reduce the variance on the validation loss.

MultiPath on 11 Oct 2019

👍1

Hi @kahne using scripts.md present in non-autorgessive-translation , this error occurs for

insertion transformer - unrecognised arguments: --pred-length-offset --length-loss-factor 0.1

Non-autoregressive Transformer with Iterative Refinement - unrecognized arguments: --train-step 4 --dae-ratio 0.5 --stochastic-approx

Also can you please verify the command used to train levenshtein Transformer, I was getting segmentation error / SIGENV and it gets killed

I was able to train mask and predict nat model but while using generate this occurs decoder_out['output_scores'], decoder_out['attn'])
KeyError: 'attn'

gvskalyan on 19 Oct 2019

Yes, Insertion Transformer does not have arguments about "--pred-length-offset --length-loss-factor 0.1" since it is used to predict lengths. We don't predict length in InsertionTransformer. I will fix this.
The --arch was mistaken. It should be --arch iterative_nonautoregressive_transformer. We will update the examples, too.
What error did you meet in detail? Can you do screenshot? Did you install the libnat library by python setup.py build_ext --inplace?
We did not return attn for these models previously. It returns None instead. Can you show me which line you entered this error?

MultiPath on 19 Oct 2019

👍1

1,2. Thanks for the fixes.

Will try it. reason : No NVCC found and gcc compiler is lower than 4.9, which I have altered to a
newer version and the traceback:
| loaded 3961179 examples from: data-bin/train.en-de.en | loaded 3961179 examples from: data-bin/train.en-de.de | data-bin/ train en-de 3961179 examples | epoch 001: 0%| | 0/1021 [00:00<?, ?it/s]Traceback (most recent call last): File "<string>", line 1, in <module> File "/home/miniconda3/envs/nat/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main Traceback (most recent call last): File "/home/miniconda3/envs/nat/bin/fairseq-train", line 11, in <module> exitcode = _main(fd) File "/home/miniconda3/envs/nat/lib/python3.7/multiprocessing/spawn.py", line 115, in _main self = reduction.pickle.load(from_parent) _pickle.UnpicklingError: pickle data was truncated load_entry_point('fairseq', 'console_scripts', 'fairseq-train')() File "/home/kalyan/nat/fairseq/fairseq_cli/train.py", line 334, in cli_main nprocs=args.distributed_world_size, File "/home/miniconda3/envs/nat/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn while not spawn_context.join(): File "/home/miniconda3/envs/nat/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join (error_index, name) Exception: process 1 terminated with signal SIGSEGV
1. line 582, in _decode_one
  tokens, encoder_out=encoder_out, incremental_state=self.incremental_states[model],
  --> fairseq/fairseq/models/nonautoregressive_transformer.py", line 115, in forward_decoder
  step = decoder_out["step"]

Thank you.

gvskalyan on 22 Oct 2019

Hi,

These datasets are based on transformer-base with BLEU score around 27.2.

valid-repeat.en-de.{en,de} are simply repeating 13 times from the valid.en-de.{en,de}. They are only used to enlarge the validation set when calculating the validation loss. Yes, you can binarize train/valid/test.en-de.{en, de}.

hi @MultiPath so the transformer-base model is not the transformer baseline reported in Table 1 in the paper?

Tiiiger on 24 Oct 2019

Hi,

These datasets are based on transformer-base with BLEU score around 27.2.

valid-repeat.en-de.{en,de} are simply repeating 13 times from the valid.en-de.{en,de}. They are only used to enlarge the validation set when calculating the validation loss. Yes, you can binarize train/valid/test.en-de.{en, de}.

hi @MultiPath so the transformer-base model is not the transformer baseline reported in Table 1 in the paper?

Hi, I re-base all my previous experiments into fairseq and we can get a better baseline performance now. Previously all the experiments are done in my own framework. We will update the paper very soon.

MultiPath on 26 Oct 2019

These datasets are based on transformer-base with BLEU score around 27.2.

Hi @MultiPath, may I know whether the BLEU reported in paper and in the comment above are scores after using compound splitting (compound_split_bleu.sh)? Thank you.

raymondhs on 6 Nov 2019

@raymondhs No, we didn't use compound_split_bleu.sh.

kahne on 6 Nov 2019

Hi @kahne, thanks for your reply. May I know what is the expected (tokenized) BLEU score following the README? I trained a Levenshtein Transformer model on the released distilled data set and the tokenized BLEU I'm getting after calling fairseq-generate on the test subset is only 26.49.

Here is my training command:

fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints/levenshtein_transformer \
    --ddp-backend=no_c10d \
    --task translation_lev \
    --criterion nat_loss \
    --arch levenshtein_transformer \
    --noise random_delete \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --no-epoch-checkpoints \
    --max-update 300000 \
    --fp16 --update-freq 2

I only added --fp16 --update-freq 2 to make the overall batch size about 64k similar to paper (I was training on 4 GPUs).

raymondhs on 6 Nov 2019

Hi, I think you can try training a bit longer (400K updates instead of 300K), and perform checkpoint average over the last five checkpoints. We usually found that average the last checkpoints get better performance. The best model I can get from this argument is around 26.9~27.1 on test set.

MultiPath on 6 Nov 2019

Hi @MultiPath, thanks for your tips. I did get around 26.9 BLEU with 400K updates.
I also tried replicating the experiments on Ro-En and En-Ja. I trained a Transformer teacher model and a student Levenshtein model on the teacher output on the training set. From my experiments on Ro-En and En-Ja, it seems the Transformer Base outperforms Levenshtein by about 1 BLEU point. On Ro-En I'm getting 34.1 BLEU for Transformer and 33.2 BLEU for Levenshtein. I am wondering if this behaviour is expected since the Levenshtein is a student model? Any settings that I should be careful with for Ro-En and En-Ja?

raymondhs on 10 Nov 2019

👀2

Hi @MultiPath .
Some questions about the distillation dataset. After unzipping the distillation dataset, the folder name is "wmt17_en_de_distill_base_chuntinz". But in the script, the dataset is "data-bin/wmt14_en_de_distil" .

I want to confirm if the distillation dataset is based on wmt14 or wmt17?

chynphh on 15 Dec 2019

It is WMT14 dataset. I forgot why the name was wrong. We will fix the name @kahne

MultiPath on 16 Dec 2019

@chynphh We have updated the name with wmt14 bpe codes inside.
Sorry about the confusion.

MultiPath on 17 Dec 2019

@MultiPath Thanks.

chynphh on 17 Dec 2019

Hi @MultiPath .
In the provided distillation datasets, I found that valid.en-de was also distilled. Is it necessary that valid data need to be distilled? Thank you!

yokusama on 19 Dec 2019

👍3

Hi @MultiPath. Sorry if this is a dumb question.
After training the NAT in your paper on the distilled training set, do you evaluate on the distilled test set or the original test set? Are all evaluations done on the distilled test set?

dhecloud on 12 Sep 2020

Hi @dhecloud , thanks for checking. Evaluations were done on the original test sets. Distillation is for train sets only.

kahne on 15 Sep 2020

👍1

Hi,
I am hitting some roadblocks whether I try to binarize the distilled dataset or try to train a NAR model on a freshly downloaded and binarized wmt14_en_de, could you please make the (distilled) binarized dataset available in https://github.com/pytorch/fairseq/tree/master/examples/nonautoregressive_translation#download ?