Fairseq: Reproducing WMT 14 En-Fr (Transformer)

Created on 28 Jul 2018 · 12 comments · Source: pytorch/fairseq

Hi,

I'm trying to reproduce the WMT 14 En-Fr results from the "Scaling NMT" paper.
It worked out for WMT 14 En-De with the provided preprocessing script and hyper-parameters.
However, for WMT 14 En-Fr, the PPL is going up and down.
My command:

python3.6 train.py data-bin/wmt14_en_fr_joined_dict --arch transformer_vaswani_wmt_en_fr_big --share-all-embeddings --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.001 --min-lr 1e-09 --dropout 0.1 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 3584 --update-freq 16

Any suggestions for a better set of parameters?

Cheers,
Stephan

All 12 comments

Hmm, that command seems right. Can you share the training log?

Thanks for the reply. I just restarted the training and it seems things are back to normal:

| epoch 001 | valid on 'valid' subset | valid_loss 5.0954 | valid_nll_loss 3.45955 | valid_ppl 11.00 | num_updates 3142 
| epoch 002 | valid on 'valid' subset | valid_loss 4.69421 | valid_nll_loss 3.01449 | valid_ppl 8.08 | num_updates 6284 | best 4.69421 

Does the log look reasonable?
Note that I use newstest12+13 (as described in the paper) as the validation set (your original preprocessing script samples a validation set from the training data).

What is the expected 'valid_nll_loss' for this task?

Cheers,
Stephan

Here's my training log. A few comments:

1) The validation set is sampled from the training data, as per the preprocessing script; it's not newstest12+13. We'll update the paper to reflect this (good catch!)
2) I actually used 5k tokens/GPU for En-Fr, so you'll need to update your training command accordingly (sorry!)

Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, arch='transformer_vaswani_wmt_en_de_big', attention_dropout=0.0, clip_norm=0.0, criterion='label_smoothed_cross_entropy', data='/private/home/myleott/data/wmt14_en_fr_oss_joined/', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_ffn_embed_dim=4096, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, device_id=0, distributed_backend='nccl', distributed_init_method='tcp://learnfair0253:58342', distributed_port=58342, distributed_rank=0, distributed_world_size=128, dropout=0.1, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fp16=True, label_smoothing=0.1, log_format='json', log_interval=10, lr=[0.0007], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=5120, max_update=80000, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, optimizer='adam', relu_dropout=0.0, restore_file='checkpoint_last.pt', sample_without_replacement=0, save_dir='/checkpoint02/myleott/2018-05-19/wmt14_en_fr.fp16_allreduce.fp16.maxupd80000.transformer_vaswani_wmt_en_de_big.shareemb.adam.beta0.9,0.98.initlr1e-07.warmup4000.lr0.0007.clip0.0.drop0.1.wd0.0.ls0.1.maxtok5120.seed2.ngpu128', save_interval=1, seed=2, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, train_subset='train', update_freq=[1.0], valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
| [en] dictionary: 44512 types
| [fr] dictionary: 44512 types
| /private/home/myleott/data/wmt14_en_fr_oss_joined/ train 35760411 examples
| /private/home/myleott/data/wmt14_en_fr_oss_joined/ valid 26853 examples
| model transformer_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 221937664
| training on 128 GPUs
| max tokens per GPU = 5120 and max sentences per GPU = None
{"epoch": 1, "loss": "6.501", "nll_loss": "5.121", "ppl": "34.80", "wps": 1744340, "ups": "2.8", "wpb": 617158, "bsz": 16624, "num_updates": 2147, "lr": 0.00037577132500000006, "gnorm": "1.279", "clip": "100%", "oom": 0.025617140195621797, "loss_scale": "16.000", "wall": 940}
{"epoch": 1, "valid_loss": 4.0428634242555885, "valid_nll_loss": 2.244321019728282, "valid_ppl": "4.74"}
{"epoch": 2, "loss": "3.894", "nll_loss": "2.177", "ppl": "4.52", "wps": 1736584, "ups": "2.8", "wpb": 617152, "bsz": 16624, "num_updates": 4296, "lr": 0.0006754541888731568, "gnorm": "0.850", "clip": "100%", "oom": 0.022579143389199256, "loss_scale": "4.000", "wall": 1727}
{"epoch": 2, "valid_loss": 3.516267808286902, "valid_nll_loss": 1.7477440810976066, "valid_ppl": "3.36"}
{"epoch": 3, "loss": "3.452", "nll_loss": "1.705", "ppl": "3.26", "wps": 1614082, "ups": "2.6", "wpb": 617154, "bsz": 16625, "num_updates": 6447, "lr": 0.0005513777039573519, "gnorm": "0.626", "clip": "100%", "oom": 0.025593299208934387, "loss_scale": "8.000", "wall": 2572}
{"epoch": 3, "valid_loss": 3.3429785290945087, "valid_nll_loss": 1.5531231940250916, "valid_ppl": "2.93"}
{"epoch": 4, "loss": "3.341", "nll_loss": "1.588", "ppl": "3.01", "wps": 1720844, "ups": "2.8", "wpb": 617154, "bsz": 16625, "num_updates": 8598, "lr": 0.00047745155848055476, "gnorm": "0.511", "clip": "100%", "oom": 0.025936264247499417, "loss_scale": "16.000", "wall": 3368}
{"epoch": 4, "valid_loss": 3.273885347380664, "valid_nll_loss": 1.4926457897187224, "valid_ppl": "2.81"}
{"epoch": 5, "loss": "3.287", "nll_loss": "1.531", "ppl": "2.89", "wps": 1685457, "ups": "2.7", "wpb": 617154, "bsz": 16624, "num_updates": 10748, "lr": 0.00042703572303240997, "gnorm": "0.439", "clip": "100%", "oom": 0.027446966877558616, "loss_scale": "16.000", "wall": 4178}
{"epoch": 5, "valid_loss": 3.2428560363757453, "valid_nll_loss": 1.451394379894498, "valid_ppl": "2.73"}
{"epoch": 6, "loss": "3.252", "nll_loss": "1.494", "ppl": "2.82", "wps": 1636735, "ups": "2.7", "wpb": 617152, "bsz": 16625, "num_updates": 12897, "lr": 0.0003898375650579872, "gnorm": "0.391", "clip": "100%", "oom": 0.02721563154221912, "loss_scale": "16.000", "wall": 5011}
{"epoch": 6, "valid_loss": 3.215571361597506, "valid_nll_loss": 1.43414958858394, "valid_ppl": "2.70"}
{"epoch": 7, "loss": "3.227", "nll_loss": "1.467", "ppl": "2.76", "wps": 1706625, "ups": "2.8", "wpb": 617155, "bsz": 16626, "num_updates": 15047, "lr": 0.00036091345679217866, "gnorm": "0.356", "clip": "100%", "oom": 0.027314414833521632, "loss_scale": "8.000", "wall": 5812}
{"epoch": 7, "valid_loss": 3.1982939559913857, "valid_nll_loss": 1.410668059607591, "valid_ppl": "2.66"}
{"epoch": 8, "loss": "3.208", "nll_loss": "1.447", "ppl": "2.73", "wps": 1604210, "ups": "2.6", "wpb": 617155, "bsz": 16625, "num_updates": 17197, "lr": 0.00033759941861296304, "gnorm": "0.329", "clip": "100%", "oom": 0.026923300575681805, "loss_scale": "8.000", "wall": 6662}
{"epoch": 8, "valid_loss": 3.186600119078741, "valid_nll_loss": 1.3942007910383125, "valid_ppl": "2.63"}
{"epoch": 9, "loss": "3.192", "nll_loss": "1.430", "ppl": "2.69", "wps": 1695040, "ups": "2.7", "wpb": 617151, "bsz": 16625, "num_updates": 19346, "lr": 0.0003182969256936471, "gnorm": "0.308", "clip": "100%", "oom": 0.027137392742685826, "loss_scale": "4.000", "wall": 7467}
{"epoch": 9, "valid_loss": 3.170829971076651, "valid_nll_loss": 1.3833437350205129, "valid_ppl": "2.61"}
{"epoch": 10, "loss": "3.180", "nll_loss": "1.417", "ppl": "2.67", "wps": 1671182, "ups": "2.7", "wpb": 617154, "bsz": 16625, "num_updates": 21497, "lr": 0.00030195283218121974, "gnorm": "0.291", "clip": "100%", "oom": 0.02712006326464158, "loss_scale": "8.000", "wall": 8285}
{"epoch": 10, "valid_loss": 3.1633157649901, "valid_nll_loss": 1.3739047163839204, "valid_ppl": "2.59"}
{"epoch": 11, "loss": "3.169", "nll_loss": "1.405", "ppl": "2.65", "wps": 1622867, "ups": "2.6", "wpb": 617154, "bsz": 16625, "num_updates": 23647, "lr": 0.00028789890295524164, "gnorm": "0.276", "clip": "100%", "oom": 0.026768723305281853, "loss_scale": "8.000", "wall": 9125}
{"epoch": 11, "valid_loss": 3.1583966085524873, "valid_nll_loss": 1.368621294188633, "valid_ppl": "2.58"}
{"epoch": 12, "loss": "3.160", "nll_loss": "1.395", "ppl": "2.63", "wps": 1704381, "ups": "2.8", "wpb": 617156, "bsz": 16626, "num_updates": 25797, "lr": 0.0002756407569266462, "gnorm": "0.264", "clip": "100%", "oom": 0.026863588789394117, "loss_scale": "4.000", "wall": 9927}
{"epoch": 12, "valid_loss": 3.150058905959087, "valid_nll_loss": 1.3588230044471987, "valid_ppl": "2.56"}
{"epoch": 13, "loss": "3.152", "nll_loss": "1.386", "ppl": "2.61", "wps": 1623172, "ups": "2.6", "wpb": 617154, "bsz": 16625, "num_updates": 27947, "lr": 0.00026482588861213526, "gnorm": "0.254", "clip": "100%", "oom": 0.026943858016960677, "loss_scale": "4.000", "wall": 10767}
{"epoch": 13, "valid_loss": 3.1435484224229913, "valid_nll_loss": 1.3528672058293125, "valid_ppl": "2.55"}
{"epoch": 14, "loss": "3.144", "nll_loss": "1.379", "ppl": "2.60", "wps": 1663153, "ups": "2.7", "wpb": 617156, "bsz": 16625, "num_updates": 30097, "lr": 0.00025519163330968924, "gnorm": "0.246", "clip": "100%", "oom": 0.02661394823404326, "loss_scale": "4.000", "wall": 11588}
{"epoch": 14, "valid_loss": 3.143120926026271, "valid_nll_loss": 1.3581403909415002, "valid_ppl": "2.56"}
{"epoch": 15, "loss": "3.138", "nll_loss": "1.372", "ppl": "2.59", "wps": 1661423, "ups": "2.7", "wpb": 617154, "bsz": 16625, "num_updates": 32248, "lr": 0.00024653389835166157, "gnorm": "0.238", "clip": "100%", "oom": 0.02638923344083354, "loss_scale": "8.000", "wall": 12409}
{"epoch": 15, "valid_loss": 3.1385698975677156, "valid_nll_loss": 1.3452879640535018, "valid_ppl": "2.54"}
{"epoch": 16, "loss": "3.132", "nll_loss": "1.365", "ppl": "2.58", "wps": 1614021, "ups": "2.6", "wpb": 617155, "bsz": 16625, "num_updates": 34398, "lr": 0.0002387049580131443, "gnorm": "0.231", "clip": "100%", "oom": 0.026600383743240887, "loss_scale": "4.000", "wall": 13254}
{"epoch": 16, "valid_loss": 3.1301769816945133, "valid_nll_loss": 1.3420413326715648, "valid_ppl": "2.54"}
{"epoch": 17, "loss": "3.127", "nll_loss": "1.360", "ppl": "2.57", "wps": 1675050, "ups": "2.7", "wpb": 617154, "bsz": 16625, "num_updates": 36549, "lr": 0.0002315742606848088, "gnorm": "0.225", "clip": "100%", "oom": 0.026567074338559194, "loss_scale": "8.000", "wall": 14069}
{"epoch": 17, "valid_loss": 3.127218505337049, "valid_nll_loss": 1.336270351554289, "valid_ppl": "2.52"}
{"epoch": 18, "loss": "3.122", "nll_loss": "1.355", "ppl": "2.56", "wps": 1608994, "ups": "2.6", "wpb": 617153, "bsz": 16626, "num_updates": 38698, "lr": 0.00022505246573051944, "gnorm": "0.220", "clip": "100%", "oom": 0.026538839216496978, "loss_scale": "4.000", "wall": 14916}
{"epoch": 18, "valid_loss": 3.1267297271451215, "valid_nll_loss": 1.3352956947208277, "valid_ppl": "2.52"}
{"epoch": 19, "loss": "3.118", "nll_loss": "1.350", "ppl": "2.55", "wps": 1655361, "ups": "2.7", "wpb": 617152, "bsz": 16624, "num_updates": 40848, "lr": 0.00021904968699833772, "gnorm": "0.215", "clip": "100%", "oom": 0.02665981198589894, "loss_scale": "4.000", "wall": 15741}
{"epoch": 19, "valid_loss": 3.121336888706833, "valid_nll_loss": 1.3306140610829051, "valid_ppl": "2.52"}
{"epoch": 20, "loss": "3.114", "nll_loss": "1.345", "ppl": "2.54", "wps": 1654140, "ups": "2.7", "wpb": 617153, "bsz": 16625, "num_updates": 42998, "lr": 0.00021350296370858512, "gnorm": "0.211", "clip": "100%", "oom": 0.02674543002000093, "loss_scale": "4.000", "wall": 16566}
{"epoch": 20, "valid_loss": 3.118648823162552, "valid_nll_loss": 1.331059430090208, "valid_ppl": "2.52"}
{"epoch": 21, "loss": "3.110", "nll_loss": "1.341", "ppl": "2.53", "wps": 1619236, "ups": "2.6", "wpb": 617154, "bsz": 16625, "num_updates": 45149, "lr": 0.00020835501965432528, "gnorm": "0.207", "clip": "100%", "oom": 0.026755852842809364, "loss_scale": "8.000", "wall": 17408}
{"epoch": 21, "valid_loss": 3.1167511166677677, "valid_nll_loss": 1.3281390586463744, "valid_ppl": "2.51"}
{"epoch": 22, "loss": "3.107", "nll_loss": "1.338", "ppl": "2.53", "wps": 1666556, "ups": "2.7", "wpb": 617156, "bsz": 16625, "num_updates": 47299, "lr": 0.00020356450627185554, "gnorm": "0.203", "clip": "100%", "oom": 0.026427620034250195, "loss_scale": "4.000", "wall": 18227}
{"epoch": 22, "valid_loss": 3.114563441164484, "valid_nll_loss": 1.324312050783569, "valid_ppl": "2.50"}
{"epoch": 23, "loss": "3.103", "nll_loss": "1.334", "ppl": "2.52", "wps": 1622412, "ups": "2.6", "wpb": 617154, "bsz": 16625, "num_updates": 49449, "lr": 0.00019908992317177723, "gnorm": "0.200", "clip": "100%", "oom": 0.026491941191935126, "loss_scale": "4.000", "wall": 19068}
{"epoch": 23, "valid_loss": 3.1141280827908684, "valid_nll_loss": 1.32166640018651, "valid_ppl": "2.50"}
{"epoch": 24, "loss": "3.100", "nll_loss": "1.331", "ppl": "2.52", "wps": 1635212, "ups": "2.6", "wpb": 617151, "bsz": 16625, "num_updates": 51599, "lr": 0.0001948980047921047, "gnorm": "0.196", "clip": "100%", "oom": 0.02620205817942208, "loss_scale": "4.000", "wall": 19902}
{"epoch": 24, "valid_loss": 3.110669136914194, "valid_nll_loss": 1.3210852246007831, "valid_ppl": "2.50"}
{"epoch": 25, "loss": "3.097", "nll_loss": "1.328", "ppl": "2.51", "wps": 1667443, "ups": "2.7", "wpb": 617156, "bsz": 16626, "num_updates": 53749, "lr": 0.00019096019143386865, "gnorm": "0.194", "clip": "100%", "oom": 0.02612141621239465, "loss_scale": "4.000", "wall": 20721}
{"epoch": 25, "valid_loss": 3.1106161256777836, "valid_nll_loss": 1.319883019809227, "valid_ppl": "2.50"}
{"epoch": 26, "loss": "3.095", "nll_loss": "1.325", "ppl": "2.50", "wps": 1603069, "ups": "2.6", "wpb": 617154, "bsz": 16626, "num_updates": 55899, "lr": 0.0001872518065497763, "gnorm": "0.191", "clip": "100%", "oom": 0.026082756399935597, "loss_scale": "4.000", "wall": 21572}
{"epoch": 26, "valid_loss": 3.106729609782232, "valid_nll_loss": 1.318580078740263, "valid_ppl": "2.49"}
{"epoch": 27, "loss": "3.092", "nll_loss": "1.322", "ppl": "2.50", "wps": 1658188, "ups": "2.7", "wpb": 617153, "bsz": 16625, "num_updates": 58049, "lr": 0.0001837514032631448, "gnorm": "0.188", "clip": "100%", "oom": 0.026288135885200434, "loss_scale": "4.000", "wall": 22395}
{"epoch": 27, "valid_loss": 3.105788957031695, "valid_nll_loss": 1.3183252018612832, "valid_ppl": "2.49"}
{"epoch": 28, "loss": "3.090", "nll_loss": "1.319", "ppl": "2.50", "wps": 1629476, "ups": "2.6", "wpb": 617152, "bsz": 16625, "num_updates": 60199, "lr": 0.00018044024045858075, "gnorm": "0.186", "clip": "100%", "oom": 0.02627950630409143, "loss_scale": "4.000", "wall": 23232}
{"epoch": 28, "valid_loss": 3.104313131330914, "valid_nll_loss": 1.3142246088314344, "valid_ppl": "2.49"}
{"epoch": 29, "loss": "3.088", "nll_loss": "1.317", "ppl": "2.49", "wps": 1631170, "ups": "2.6", "wpb": 617152, "bsz": 16625, "num_updates": 62349, "lr": 0.0001773018591368861, "gnorm": "0.184", "clip": "100%", "oom": 0.026207316877576225, "loss_scale": "4.000", "wall": 24069}
{"epoch": 29, "valid_loss": 3.102629769670096, "valid_nll_loss": 1.316760606275866, "valid_ppl": "2.49"}
{"epoch": 30, "loss": "3.085", "nll_loss": "1.314", "ppl": "2.49", "wps": 1647668, "ups": "2.7", "wpb": 617153, "bsz": 16624, "num_updates": 64499, "lr": 0.00017432173711864654, "gnorm": "0.182", "clip": "100%", "oom": 0.026512038946340254, "loss_scale": "4.000", "wall": 24897}
{"epoch": 30, "valid_loss": 3.1036868049107493, "valid_nll_loss": 1.3137451967244027, "valid_ppl": "2.49"}
{"epoch": 31, "loss": "3.083", "nll_loss": "1.312", "ppl": "2.48", "wps": 1627880, "ups": "2.6", "wpb": 617156, "bsz": 16625, "num_updates": 66649, "lr": 0.00017148700552858888, "gnorm": "0.180", "clip": "100%", "oom": 0.026422001830485077, "loss_scale": "4.000", "wall": 25728}
{"epoch": 31, "valid_loss": 3.1011771528387655, "valid_nll_loss": 1.312515667050477, "valid_ppl": "2.48"}
{"epoch": 32, "loss": "3.081", "nll_loss": "1.310", "ppl": "2.48", "wps": 1652340, "ups": "2.7", "wpb": 617154, "bsz": 16625, "num_updates": 68798, "lr": 0.00016878744108330345, "gnorm": "0.178", "clip": "100%", "oom": 0.0262943690223553, "loss_scale": "2.000", "wall": 26554}
{"epoch": 32, "valid_loss": 3.1002546562239726, "valid_nll_loss": 1.3108558781345157, "valid_ppl": "2.48"}
{"epoch": 33, "loss": "3.079", "nll_loss": "1.308", "ppl": "2.48", "wps": 1628816, "ups": "2.6", "wpb": 617154, "bsz": 16625, "num_updates": 70949, "lr": 0.0001662091377019842, "gnorm": "0.177", "clip": "100%", "oom": 0.02642743379046921, "loss_scale": "8.000", "wall": 27392}
{"epoch": 33, "valid_loss": 3.0997653059796737, "valid_nll_loss": 1.3086442848088524, "valid_ppl": "2.48"}
{"epoch": 34, "loss": "3.078", "nll_loss": "1.306", "ppl": "2.47", "wps": 1637634, "ups": "2.7", "wpb": 617156, "bsz": 16625, "num_updates": 73098, "lr": 0.0001637477300784896, "gnorm": "0.175", "clip": "100%", "oom": 0.026553394073709265, "loss_scale": "4.000", "wall": 28224}
{"epoch": 34, "valid_loss": 3.100671220739572, "valid_nll_loss": 1.3063413562384645, "valid_ppl": "2.47"}
{"epoch": 35, "loss": "3.076", "nll_loss": "1.304", "ppl": "2.47", "wps": 1655171, "ups": "2.7", "wpb": 617151, "bsz": 16625, "num_updates": 75248, "lr": 0.0001613914617084694, "gnorm": "0.173", "clip": "100%", "oom": 0.026445885604933023, "loss_scale": "2.000", "wall": 29041}
{"epoch": 35, "valid_loss": 3.0965958770971826, "valid_nll_loss": 1.3074591464497258, "valid_ppl": "2.48"}
{"epoch": 36, "loss": "3.074", "nll_loss": "1.302", "ppl": "2.47", "wps": 1634650, "ups": "2.6", "wpb": 617154, "bsz": 16625, "num_updates": 77399, "lr": 0.00015913304053764884, "gnorm": "0.172", "clip": "100%", "oom": 0.026447370121060996, "loss_scale": "4.000", "wall": 29876}
{"epoch": 36, "valid_loss": 3.0979512390361466, "valid_nll_loss": 1.3069363734414163, "valid_ppl": "2.47"}
{"epoch": 37, "loss": "3.073", "nll_loss": "1.300", "ppl": "2.46", "wps": 1660677, "ups": "2.7", "wpb": 617152, "bsz": 16624, "num_updates": 79549, "lr": 0.00015696783686140273, "gnorm": "0.171", "clip": "100%", "oom": 0.026713095073476724, "loss_scale": "4.000", "wall": 30691}
{"epoch": 37, "valid_loss": 3.097330517648961, "valid_nll_loss": 1.3055174961764615, "valid_ppl": "2.47"}
{"epoch": 38, "loss": "3.065", "nll_loss": "1.292", "ppl": "2.45", "wps": 1618683, "ups": "2.6", "wpb": 617310, "bsz": 16599, "num_updates": 80000, "lr": 0.00015652475842498528, "gnorm": "0.170", "clip": "100%", "oom": 0.0267125, "loss_scale": "4.000", "wall": 30878}
{"epoch": 38, "valid_loss": 3.097408719719202, "valid_nll_loss": 1.3071560233976123, "valid_ppl": "2.47"}

Ok, thanks for the details!
I will kick off another training run!

One more thing: your learning rate is set to 0.0007.
Is that because you use 5k tokens/GPU?
And what is the best setup without distributed training?

Thanks a lot!

Ah, good call about the LR. I'll update the paper/README with more of these details for En-Fr. Generally bigger batch sizes support _larger_ learning rates, although it is dataset-dependent. In this case, 0.0007 is smaller than we used for En-De (0.001), even though I'm using a bigger batch size (617k tokens/batch for En-Fr vs. 400k tokens/batch for En-De).

For 8 GPUs, I'd recommend adding --update-freq 16 and training with a large learning rate (0.0007) 😄 It might be possible to use a smaller update-freq (10?) and still get good results, but I haven't tried it.
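
To make the equivalence concrete: the effective batch size in tokens is roughly max-tokens × num-GPUs × update-freq, so 8 GPUs with --update-freq 16 give the same effective batch as 128 GPUs with --update-freq 1. A quick sanity check (the ~617k wpb in the log above sits below the cap because batches rarely fill the max-tokens budget exactly):

    def effective_tokens_per_update(max_tokens, num_gpus, update_freq):
        # upper bound on tokens contributing to one optimizer update;
        # the logged wpb is a bit lower since batches don't fill max_tokens exactly
        return max_tokens * num_gpus * update_freq

    print(effective_tokens_per_update(5120, 128, 1))  # 655360 cap -> ~617k observed (128 GPUs)
    print(effective_tokens_per_update(5120, 8, 16))   # 655360 cap -> same effective batch on 8 GPUs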

Ok, thanks!
I will play around with the hyper-parameters, but I remember that I ran into some OOM issues with En-Fr and batch sizes larger than 4k.

Ok, I tried this command:

python3.6 train.py data-bin.en-fr/wmt14_en_fr_joined_dict --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0007 --min-lr 1e-09 --dropout 0.1 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 5120 --update-freq 16 --fp16 --seed 2

which should basically simulate your 128-GPU setup.
And this is the output:

| distributed init (rank 4): tcp://localhost:10316 
| distributed init (rank 1): tcp://localhost:10316 
| distributed init (rank 7): tcp://localhost:10316 
| distributed init (rank 3): tcp://localhost:10316 
| distributed init (rank 5): tcp://localhost:10316 
| distributed init (rank 0): tcp://localhost:10316 
| distributed init (rank 2): tcp://localhost:10316 
| distributed init (rank 6): tcp://localhost:10316 
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_softmax_cutoff=None, arch='transformer_vaswani_wmt_en_de_big', attention_dropout=0.0, clip_norm=0.0, criterion='label_smoothed_cross_entropy', data='data-bin.en-fr/wmt14_en_fr_joined_dict', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, device_id=0, distributed_backend='nccl', distributed_init_method='tcp://localhost:10316', distributed_port=-1, distributed_rank=0, distributed_world_size=8, dropout=0.1, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fp16=True, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0007], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=5120, max_update=0, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, optimizer='adam', raw_text=False, relu_dropout=0.0, restore_file='checkpoint_last.pt', save_dir='/turi_bolt/user_output/artifacts', save_interval=1, save_interval_updates=0, seed=2, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', train_subset='train', update_freq=[16], valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0) 
| [en] dictionary: 44512 types 
| [fr] dictionary: 44512 types 
| data-bin.en-fr/wmt14_en_fr_joined_dict train 35760411 examples 
| data-bin.en-fr/wmt14_en_fr_joined_dict valid 26853 examples 
| model transformer_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion 
| num. model params: 221937664 
| training on 8 GPUs 
| max tokens per GPU = 5120 and max sentences per GPU = None 
| WARNING: overflow detected, setting loss scale to: 64.0 
| WARNING: overflow detected, setting loss scale to: 32.0 
| WARNING: overflow detected, setting loss scale to: 16.0 
| WARNING: overflow detected, setting loss scale to: 8.0 
| WARNING: overflow detected, setting loss scale to: 4.0 
| epoch 001: 1000 / 34402 loss=14.098, nll_loss=13.857, ppl=14839.35, wps=8015, ups=0.0, wpb=617649, bsz=16600, num_updates=57, lr=1.00736e-05, gnorm=4.153, clip=100%, oom=0, loss_scale=4.000, wall=4393 
...
| epoch 001: 6000 / 34402 loss=10.976, nll_loss=10.299, ppl=1260.18, wps=21220, ups=0.0, wpb=617547, bsz=16690, num_updates=370, lr=6.48408e-05, gnorm=2.139, clip=100%, oom=0, loss_scale=4.000, wall=10768 
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory 
...
| epoch 001: 8000 / 34402 loss=10.340, nll_loss=9.559, ppl=754.47, wps=22937, ups=0.0, wpb=617539, bsz=16697, num_updates=495, lr=8.67126e-05, gnorm=2.044, clip=100%, oom=0.0040404, loss_scale=4.000, wall=13327 
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
...

Why do I get these OOM errors?
And what does "| WARNING: overflow detected, setting loss scale to: 64.0" mean?
I'm running my experiments on 8xP100s.

Cheers,
Stephan

Re: the overflow warning, those are because you are using FP16 training. Basically, since we can't detect numeric underflow (but can detect overflow), we dynamically scale the loss to keep it in the FP16 range of values (and scale down the gradients to compensate). We then adjust this scale every time we detect an overflow, which is the warning you're seeing. It's fine to ignore as long as the loss-scale value doesn't get very small (>= 0.1 is acceptable).
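
Schematically, the dynamic loss scaling loop looks like this (a simplified sketch in modern PyTorch terms, not fairseq's actual FP16 trainer):

    import torch

    loss_scale = 128.0  # adjusted dynamically during training

    def fp16_backward_and_step(model, optimizer, loss):
        global loss_scale
        # scale the loss up so small gradients don't underflow to zero in FP16
        (loss * loss_scale).backward()
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        if any(not torch.isfinite(g).all() for g in grads):
            # overflow detected: halve the scale and skip this update
            loss_scale /= 2.0
            optimizer.zero_grad()
            print(f"| WARNING: overflow detected, setting loss scale to: {loss_scale}")
            return
        for g in grads:
            g.div_(loss_scale)  # unscale the gradients to compensate before stepping
        optimizer.step()
        optimizer.zero_grad()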

Re: OOM, this batch size (5120 tokens per GPU) is right at the limit of what can fit on a 16GB card. Fairseq is robust to occasional OOMs: it catches them, skips the batch, and continues training. You may still see the warning messages, though.
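
Conceptually the OOM handling is just a try/except around the forward/backward (a rough sketch of the idea, with a hypothetical criterion(model, batch) signature, not fairseq's exact code):

    import torch

    def train_step(model, optimizer, criterion, batch):
        optimizer.zero_grad()
        try:
            loss = criterion(model, batch)
            loss.backward()
            optimizer.step()
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            # skip this batch instead of crashing, and release cached memory
            print("| WARNING: ran out of memory, skipping batch")
            optimizer.zero_grad()
            torch.cuda.empty_cache()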

Also, note that P100 doesn't support fast FP16 (--fp16), since only Volta (V100) has Tensor Cores to make FP16 fast. On Pascal (P100) it's generally faster to use FP32 than FP16. This is because FP16 requires additional bookkeeping that is only worth it if you can also benefit from the speedup from Tensor Cores. Unfortunately FP32 won't allow this larger batch size though.

If you do switch to FP32, please make sure you're on the latest master, as there was a bug introduced a few days ago that affects FP32 and --update-freq. I just landed the fix: 202e0bb.
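
For context, --update-freq N just accumulates gradients over N batches before each optimizer step. A minimal sketch of the idea (fairseq actually normalizes gradients by the total token count across the accumulated batches; dividing the loss by N is the simplified version, and criterion(model, batch) is again a hypothetical signature):

    def train_with_update_freq(model, optimizer, criterion, batches, update_freq=16):
        optimizer.zero_grad()
        for i, batch in enumerate(batches, start=1):
            # gradients from each backward() accumulate in p.grad
            (criterion(model, batch) / update_freq).backward()
            if i % update_freq == 0:
                optimizer.step()  # one parameter update per update_freq batches
                optimizer.zero_grad()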

Unfortunately FP32 won't allow this larger batch size though.

That's why I wanted to use FP16 on P100s. :)

If you do switch to FP32, please make sure you're on the latest master, as there was a bug introduced a few days ago that affects FP32 and --update-freq. I just landed the fix: 202e0bb.

Could that be related to the PPL issues I have seen recently?

It seems I'm still not able to reproduce your setup.
Here are more details about my current setup:

  • 32x4 V100s (128 GPUs total)
  • CUDA 9.1
  • NCCL version 2.2.12+cuda9.1
  • current pytorch and fairseq master
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_softmax_cutoff=None, arch='transformer_vaswani_wmt_en_de_big', attention_dropout=0.0, clip_norm=0.0, criterion='label_smoothed_cross_entropy', data='data-bin', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, device_id=0, distributed_backend='nccl', distributed_init_method='tcp://myhost.com:31999', distributed_port=-1, distributed_rank=0, distributed_world_size=128, dropout=0.1, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fp16=True, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0007], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=5120, max_update=0, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, optimizer='adam', raw_text=False, relu_dropout=0.0, restore_file='checkpoint_last.pt', save_dir='output', save_interval=1, save_interval_updates=0, seed=2, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', train_subset='train', update_freq=[1], valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0) 
| [en] dictionary: 44512 types 
| [fr] dictionary: 44512 types 
| data-bin train 35760411 examples 
| data-bin valid 26853 examples 
| model transformer_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion 
| num. model params: 221937664 
| training on 128 GPUs 
| max tokens per GPU = 5120 and max sentences per GPU = None
| epoch 001 | loss 6.523 | nll_loss 5.146 | ppl 35.40 | wps 602670 | ups 0.9 | wpb 617158 | bsz 16624 | num_updates 2147 | lr 0.000375771 | gnorm 1.305 | clip 100% | oom 0.105263 | loss_scale 16.000 | wall 2390 
| epoch 001 | valid on 'valid' subset | valid_loss 4.14312 | valid_nll_loss 2.40969 | valid_ppl 5.31 | num_updates 2147 
| epoch 002 | loss 4.145 | nll_loss 2.451 | ppl 5.47 | wps 614640 | ups 1.0 | wpb 617150 | bsz 16624 | num_updates 4296 | lr 0.000675454 | gnorm 0.979 | clip 100% | oom 0.0977654 | loss_scale 4.000 | wall 4562 
| epoch 002 | valid on 'valid' subset | valid_loss 3.59712 | valid_nll_loss 1.8516 | valid_ppl 3.61 | num_updates 4296 | best 3.59712
| epoch 003 | loss 3.469 | nll_loss 1.723 | ppl 3.30 | wps 614948 | ups 1.0 | wpb 617154 | bsz 16625 | num_updates 6447 | lr 0.000551378 | gnorm 0.711 | clip 100% | oom 0.101443 | loss_scale 8.000 | wall 6762 
| epoch 003 | valid on 'valid' subset | valid_loss 3.39631 | valid_nll_loss 1.65069 | valid_ppl 3.14 | num_updates 6447 | best 3.39631
| epoch 004 | loss 3.346 | nll_loss 1.593 | ppl 3.02 | wps 598255 | ups 1.0 | wpb 617154 | bsz 16625 | num_updates 8598 | lr 0.000477452 | gnorm 0.573 | clip 100% | oom 0.0967667 | loss_scale 16.000 | wall 9015 
| epoch 004 | valid on 'valid' subset | valid_loss 3.32489 | valid_nll_loss 1.5773 | valid_ppl 2.98 | num_updates 8598 | best 3.32489 

It seems my OOM rate is higher than yours (0.10 vs. 0.02).
Which CUDA version are you using?
Any other ideas?

Good news!
I was able to reproduce your results on WMT En-Fr after switching to PyTorch v0.4.1 (43.1 BLEU on newstest14).
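
For anyone else reproducing this: scoring looks something like the command below (assuming newstest14 was binarized as the test split; the checkpoint path is a placeholder, and --lenpen 0.6 is the usual choice for this benchmark):

    python3.6 generate.py data-bin/wmt14_en_fr_joined_dict --path checkpoints/checkpoint_best.pt --gen-subset test --beam 5 --lenpen 0.6 --remove-bpe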

However, the OOM rate is still around 0.10.
To avoid that, I tried a batch size of 4096 and got the same results.

Here is the log for batch size 5120:

| epoch 001 | valid on 'valid' subset | valid_loss 3.99279 | valid_nll_loss 2.19014 | valid_ppl 4.56 | num_updates 2147
| epoch 002 | valid on 'valid' subset | valid_loss 3.60887 | valid_nll_loss 1.81177 | valid_ppl 3.51 | num_updates 4295 | best 3.60887
| epoch 003 | valid on 'valid' subset | valid_loss 3.35868 | valid_nll_loss 1.5761 | valid_ppl 2.98 | num_updates 6446 | best 3.35868
| epoch 004 | valid on 'valid' subset | valid_loss 3.2832 | valid_nll_loss 1.49871 | valid_ppl 2.83 | num_updates 8597 | best 3.2832
| epoch 005 | valid on 'valid' subset | valid_loss 3.2473 | valid_nll_loss 1.45568 | valid_ppl 2.74 | num_updates 10747 | best 3.2473
| epoch 006 | valid on 'valid' subset | valid_loss 3.21693 | valid_nll_loss 1.43691 | valid_ppl 2.71 | num_updates 12897 | best 3.21693
| epoch 007 | valid on 'valid' subset | valid_loss 3.20082 | valid_nll_loss 1.40804 | valid_ppl 2.65 | num_updates 15048 | best 3.20082
| epoch 008 | valid on 'valid' subset | valid_loss 3.1874 | valid_nll_loss 1.39752 | valid_ppl 2.63 | num_updates 17198 | best 3.1874
| epoch 009 | valid on 'valid' subset | valid_loss 3.17369 | valid_nll_loss 1.38668 | valid_ppl 2.61 | num_updates 19348 | best 3.17369
| epoch 010 | valid on 'valid' subset | valid_loss 3.16504 | valid_nll_loss 1.37717 | valid_ppl 2.60 | num_updates 21498 | best 3.16504
| epoch 011 | valid on 'valid' subset | valid_loss 3.15988 | valid_nll_loss 1.36966 | valid_ppl 2.58 | num_updates 23648 | best 3.15988
| epoch 012 | valid on 'valid' subset | valid_loss 3.15066 | valid_nll_loss 1.36187 | valid_ppl 2.57 | num_updates 25798 | best 3.15066
| epoch 013 | valid on 'valid' subset | valid_loss 3.14477 | valid_nll_loss 1.35529 | valid_ppl 2.56 | num_updates 27949 | best 3.14477
| epoch 014 | valid on 'valid' subset | valid_loss 3.14406 | valid_nll_loss 1.35602 | valid_ppl 2.56 | num_updates 30099 | best 3.14406
| epoch 015 | valid on 'valid' subset | valid_loss 3.13829 | valid_nll_loss 1.3448 | valid_ppl 2.54 | num_updates 32249 | best 3.13829
| epoch 016 | valid on 'valid' subset | valid_loss 3.13028 | valid_nll_loss 1.34388 | valid_ppl 2.54 | num_updates 34399 | best 3.13028
| epoch 017 | valid on 'valid' subset | valid_loss 3.12723 | valid_nll_loss 1.33635 | valid_ppl 2.53 | num_updates 36549 | best 3.12723
| epoch 018 | valid on 'valid' subset | valid_loss 3.12526 | valid_nll_loss 1.3339 | valid_ppl 2.52 | num_updates 38699 | best 3.12526
| epoch 019 | valid on 'valid' subset | valid_loss 3.12183 | valid_nll_loss 1.33137 | valid_ppl 2.52 | num_updates 40848 | best 3.12183
| epoch 020 | valid on 'valid' subset | valid_loss 3.11855 | valid_nll_loss 1.33257 | valid_ppl 2.52 | num_updates 42998 | best 3.11855
| epoch 021 | valid on 'valid' subset | valid_loss 3.11717 | valid_nll_loss 1.32722 | valid_ppl 2.51 | num_updates 45149 | best 3.11717
| epoch 022 | valid on 'valid' subset | valid_loss 3.11449 | valid_nll_loss 1.3253 | valid_ppl 2.51 | num_updates 47299 | best 3.11449
| epoch 023 | valid on 'valid' subset | valid_loss 3.11412 | valid_nll_loss 1.32066 | valid_ppl 2.50 | num_updates 49449 | best 3.11412
| epoch 024 | valid on 'valid' subset | valid_loss 3.11065 | valid_nll_loss 1.32076 | valid_ppl 2.50 | num_updates 51599 | best 3.11065
| epoch 025 | valid on 'valid' subset | valid_loss 3.10901 | valid_nll_loss 1.32106 | valid_ppl 2.50 | num_updates 53749 | best 3.10901
| epoch 026 | valid on 'valid' subset | valid_loss 3.10663 | valid_nll_loss 1.3212 | valid_ppl 2.50 | num_updates 55899 | best 3.10663
| epoch 027 | valid on 'valid' subset | valid_loss 3.10602 | valid_nll_loss 1.31865 | valid_ppl 2.49 | num_updates 58049 | best 3.10602
| epoch 028 | valid on 'valid' subset | valid_loss 3.10591 | valid_nll_loss 1.31143 | valid_ppl 2.48 | num_updates 60199 | best 3.10591
| epoch 029 | valid on 'valid' subset | valid_loss 3.10283 | valid_nll_loss 1.3149 | valid_ppl 2.49 | num_updates 62350 | best 3.10283
| epoch 030 | valid on 'valid' subset | valid_loss 3.10323 | valid_nll_loss 1.31422 | valid_ppl 2.49 | num_updates 64499 | best 3.10283
| epoch 031 | valid on 'valid' subset | valid_loss 3.09931 | valid_nll_loss 1.31243 | valid_ppl 2.48 | num_updates 66649 | best 3.09931
| epoch 032 | valid on 'valid' subset | valid_loss 3.09965 | valid_nll_loss 1.31103 | valid_ppl 2.48 | num_updates 68799 | best 3.09931
| epoch 033 | valid on 'valid' subset | valid_loss 3.09943 | valid_nll_loss 1.30802 | valid_ppl 2.48 | num_updates 70949 | best 3.09931
| epoch 034 | valid on 'valid' subset | valid_loss 3.09991 | valid_nll_loss 1.30598 | valid_ppl 2.47 | num_updates 73100 | best 3.09931
| epoch 035 | valid on 'valid' subset | valid_loss 3.09615 | valid_nll_loss 1.30697 | valid_ppl 2.47 | num_updates 75250 | best 3.09615
| epoch 036 | valid on 'valid' subset | valid_loss 3.09759 | valid_nll_loss 1.30637 | valid_ppl 2.47 | num_updates 77400 | best 3.09615
| epoch 037 | valid on 'valid' subset | valid_loss 3.09649 | valid_nll_loss 1.30375 | valid_ppl 2.47 | num_updates 79550 | best 3.09615
| epoch 038 | valid on 'valid' subset | valid_loss 3.09297 | valid_nll_loss 1.30488 | valid_ppl 2.47 | num_updates 81700 | best 3.09297
