Hi,
After unzipping the distillation dataset (https://github.com/pytorch/fairseq/blob/master/examples/nonautoregressive_translation/README.md#download), there are 15 files.
I am confused about the "valid-repeat.en-de.de", "valid-repeat.en-de.en", "valid.en-de.de", "valid.en-de.en", and "valid.en-de.ori" files.
I have the following questions:
- What are the valid-repeat.en-de.{en,de} files, and how do they relate to valid.en-de.{en,de}?
- Can I simply binarize train/valid/test.en-de.{en,de}?
Thank you!
Hi,
- These datasets are based on transformer-base with a BLEU score around 27.2.
- valid-repeat.en-de.{en,de} are simply the valid.en-de.{en,de} files repeated 13 times. They are only used to enlarge the validation set when calculating the validation loss. Yes, you can binarize train/valid/test.en-de.{en,de}.
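For concreteness, the repeated files can be reproduced from the originals by plain concatenation, e.g.:

for i in $(seq 13); do cat valid.en-de.en; done > valid-repeat.en-de.en
for i in $(seq 13); do cat valid.en-de.de; done > valid-repeat.en-de.de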
Hi Jiatao,
Thank you very much for your reply! I'm still curious about valid-repeat.en-de.{en,de} and would like to know more details about them.
The question is:
Why and when do we need to use the enlarged validation set? After using the enlarged validation set, will the validation loss increase or decrease, and why do we want the loss to change?
Thank you!
I think it might be because, when the validation set is large, changes in the validation loss are clearly visible, and we can't simply multiply the loss by 13, because the same sentence might give different losses on different iterations.
Hi Sunbow,
> Why and when do we need to use the enlarged validation set? After using the enlarged validation set, will the validation loss increase or decrease, and why do we want the loss to change?

Because training the Levenshtein Transformer involves random "delete" operations, the loss on any single validation sentence is itself stochastic. Repeating the validation set 13 times averages over 13 independent noise draws per sentence, which reduces the variance of the validation loss (roughly by a factor of 13 for independent draws).
Hi @kahne, using scripts.md in nonautoregressive_translation, these errors occur:
- Insertion Transformer: unrecognized arguments: --pred-length-offset --length-loss-factor 0.1
- Non-autoregressive Transformer with Iterative Refinement: unrecognized arguments: --train-step 4 --dae-ratio 0.5 --stochastic-approx

Also, can you please verify the command used to train the Levenshtein Transformer? I was getting a segmentation fault (SIGSEGV) and the process gets killed.
I was able to train the Mask-Predict NAT model, but during generation this occurs at decoder_out['output_scores'], decoder_out['attn']:
KeyError: 'attn'
1. --arch was mistaken. It should be --arch iterative_nonautoregressive_transformer. We will update the examples, too.
2. For the segmentation fault: did you run python setup.py build_ext --inplace?
3. We did not return attn for these models previously; it returns None instead. Can you show me at which line you hit this error?
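A sketch of the corrected invocation, assuming the same translation_lev setup as the Levenshtein command shared later in this thread (the data path and --noise full_mask are assumptions, not a verified recipe):

fairseq-train \
    data-bin/wmt14_en_de_distill \
    --task translation_lev \
    --criterion nat_loss \
    --arch iterative_nonautoregressive_transformer \
    --noise full_mask \
    --train-step 4 --dae-ratio 0.5 --stochastic-approx \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --apply-bert-init \
    --max-tokens 8000 --max-update 300000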
1, 2: Thanks for the fixes. Will try it.
3: The reason was that no NVCC was found and the gcc compiler was lower than 4.9, which I have now replaced with a newer version. Here is the traceback:
| loaded 3961179 examples from: data-bin/train.en-de.en
| loaded 3961179 examples from: data-bin/train.en-de.de
| data-bin/ train en-de 3961179 examples
| epoch 001:   0%| | 0/1021 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/miniconda3/envs/nat/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/home/miniconda3/envs/nat/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
  File "/home/miniconda3/envs/nat/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/home/kalyan/nat/fairseq/fairseq_cli/train.py", line 334, in cli_main
    nprocs=args.distributed_world_size,
  File "/home/miniconda3/envs/nat/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/miniconda3/envs/nat/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 1 terminated with signal SIGSEGV
Thank you.
> Hi,
> - These datasets are based on transformer-base with a BLEU score around 27.2.
> - valid-repeat.en-de.{en,de} are simply the valid.en-de.{en,de} files repeated 13 times. They are only used to enlarge the validation set when calculating the validation loss. Yes, you can binarize train/valid/test.en-de.{en,de}.
Hi @MultiPath, so the transformer-base model is not the Transformer baseline reported in Table 1 of the paper?
Hi, I re-based all my previous experiments in fairseq, and we can get a better baseline performance now. Previously, all the experiments were done in my own framework. We will update the paper very soon.
> - These datasets are based on transformer-base with a BLEU score around 27.2.
Hi @MultiPath, may I know whether the BLEU scores reported in the paper and in the comment above were computed after compound splitting (compound_split_bleu.sh)? Thank you.
@raymondhs No, we didn't use compound_split_bleu.sh.
Hi @kahne, thanks for your reply. May I know what the expected (tokenized) BLEU score is when following the README? I trained a Levenshtein Transformer model on the released distilled dataset, and the tokenized BLEU I'm getting after calling fairseq-generate on the test subset is only 26.49.
Here is my training command:
fairseq-train \
data-bin/wmt14_en_de_distill \
--save-dir checkpoints/levenshtein_transformer \
--ddp-backend=no_c10d \
--task translation_lev \
--criterion nat_loss \
--arch levenshtein_transformer \
--noise random_delete \
--share-all-embeddings \
--optimizer adam --adam-betas '(0.9,0.98)' \
--lr 0.0005 --lr-scheduler inverse_sqrt \
--min-lr '1e-09' --warmup-updates 10000 \
--warmup-init-lr '1e-07' --label-smoothing 0.1 \
--dropout 0.3 --weight-decay 0.01 \
--decoder-learned-pos \
--encoder-learned-pos \
--apply-bert-init \
--log-format 'simple' --log-interval 100 \
--fixed-validation-seed 7 \
--max-tokens 8000 \
--save-interval-updates 10000 \
--no-epoch-checkpoints \
--max-update 300000 \
--fp16 --update-freq 2
I only added --fp16 --update-freq 2 to make the overall batch size about 64k tokens, similar to the paper (I was training on 4 GPUs).
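For reference, my generation command follows the README's Levenshtein decoding settings (the checkpoint path below is just where my best checkpoint happens to live):

fairseq-generate \
    data-bin/wmt14_en_de_distill \
    --gen-subset test \
    --task translation_lev \
    --path checkpoints/levenshtein_transformer/checkpoint_best.pt \
    --iter-decode-max-iter 9 \
    --iter-decode-eos-penalty 0 \
    --beam 1 --remove-bpe \
    --print-step \
    --batch-size 400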
Hi, I think you can try training a bit longer (400K updates instead of 300K) and averaging the last five checkpoints. We usually find that averaging the last checkpoints gives better performance. The best model I can get with these arguments is around 26.9~27.1 on the test set.
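Checkpoint averaging can be done with fairseq's bundled script; a minimal sketch, assuming checkpoints were saved with --save-interval-updates as in the command above:

python scripts/average_checkpoints.py \
    --inputs checkpoints/levenshtein_transformer \
    --num-update-checkpoints 5 \
    --output checkpoints/levenshtein_transformer/checkpoint.avg5.pt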
Hi @MultiPath, thanks for your tips. I did get around 26.9 BLEU with 400K updates.
I also tried replicating the experiments on Ro-En and En-Ja. I trained a Transformer teacher model and a student Levenshtein model on the teacher output on the training set. From my experiments on Ro-En and En-Ja, it seems the Transformer Base outperforms Levenshtein by about 1 BLEU point. On Ro-En I'm getting 34.1 BLEU for Transformer and 33.2 BLEU for Levenshtein. I am wondering if this behaviour is expected since the Levenshtein is a student model? Any settings that I should be careful with for Ro-En and En-Ja?
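Concretely, I built the distilled Ro-En data with standard sequence-level distillation: decode the training set with the teacher and use its hypotheses as the new targets. A rough sketch (the paths and checkpoint names are placeholders):

# decode the training set with the teacher to produce distilled targets
fairseq-generate data-bin/ro_en \
    --gen-subset train \
    --path checkpoints/teacher_transformer/checkpoint_best.pt \
    --beam 4 > train.distill.out
# extract sources (S-*) and hypotheses (H-*) as the student's training pairs
grep '^S-' train.distill.out | cut -f2- > distill/train.ro
grep '^H-' train.distill.out | cut -f3- > distill/train.en

The hypotheses stay BPE-tokenized, so the distilled pairs can be re-binarized directly with fairseq-preprocess.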
Hi @MultiPath .
Some questions about the distillation dataset: after unzipping it, the folder is named "wmt17_en_de_distill_base_chuntinz", but in the script the dataset path is "data-bin/wmt14_en_de_distil".
I want to confirm whether the distillation dataset is based on WMT14 or WMT17?
It is the WMT14 dataset. I forgot why the name was wrong. We will fix the name @kahne
@chynphh We have updated the name with wmt14 bpe codes inside.
Sorry about the confusion.
@MultiPath Thanks.
Hi @MultiPath .
In the provided distillation datasets, I found that valid.en-de was also distilled. Is it necessary for the valid data to be distilled? Thank you!
Hi @MultiPath. Sorry if this is a dumb question.
After training the NAT in your paper on the distilled training set, do you evaluate on the distilled test set or the original test set? Are all evaluations done on the distilled test set?
Hi @dhecloud , thanks for checking. Evaluations were done on the original test sets. Distillation is for train sets only.
Hi,
I am hitting some roadblocks whether I try to binarize the distilled dataset or train an NAR model on a freshly downloaded and binarized wmt14_en_de. Could you please make the (distilled) binarized dataset available at https://github.com/pytorch/fairseq/tree/master/examples/nonautoregressive_translation#download?
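For reference, the binarization I'm attempting follows the standard translation recipe; the folder name here is an assumption based on the unzipped archive:

TEXT=wmt14_en_de_distill
fairseq-preprocess --source-lang en --target-lang de \
    --trainpref $TEXT/train.en-de --validpref $TEXT/valid.en-de --testpref $TEXT/test.en-de \
    --destdir data-bin/wmt14_en_de_distill \
    --joined-dictionary --workers 20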