https://github.com/jungokasai/deep-shallow#download-trained-deep-shallow-models
Desired Signature:

```python
model = FSMT.from_pretrained('allen_nlp/en-de-12-1')
```

Weights can be downloaded with [gdown](https://pypi.org/project/gdown/):

```bash
pip install gdown
gdown https://drive.google.com/uc?id=1x_G2cjvM1nW5hjAB8-vWxRqtQTlmIaQU
```
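For reference, the intended end-user flow would look roughly like the sketch below. This is only an illustration: the class names are those added in #6940, while the model identifier is the desired one from above and may differ from whatever finally lands on the hub.

```python
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

mname = "allen_nlp/en-de-12-1"  # desired id from above; the published name may differ
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

inputs = tokenizer("Machine learning is great!", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```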
@stas00 if you are blocked in the late stages of #6940 and have extra cycles, you could give this a whirl. We could also wait for that to be finalized and then either of us can take this.
I can work on these
But first I have a question: who would be the user of these wmt16-based models when there are several wmt19 en-de models (marian + fairseq wmt) which are significantly better, with BLEU scores of ~41 and ~43 respectively, vs ~28 for this one?
wmt19 is a vastly bigger dataset, so it makes sense that it'd beat wmt16-based pre-trained models.
Helsinki-NLP/opus-mt-en-de was trained on way more data than the fairseq model, I think, but I'm not totally sure. Seems like embeddings will be shared in these.
Converted and did the initial eval with the same wmt19 val set, with beam=15
checkpoint_best.pt:
allen_nlp-wmt16-en-de-dist_12-1
{'bleu': 30.1995, 'n_obs': 1997, 'runtime': 233, 'seconds_per_sample': 0.1167}
allen_nlp-wmt16-en-de-dist_6-1
{'bleu': 29.3692, 'n_obs': 1997, 'runtime': 236, 'seconds_per_sample': 0.1182}
allen_nlp-wmt16-en-de-12-1
{'bleu': 24.3901, 'n_obs': 1997, 'runtime': 280, 'seconds_per_sample': 0.1402}
checkpoint_top5_average.pt:
allen_nlp-wmt16-en-de-dist_12-1
{'bleu': 30.1078, 'n_obs': 1997, 'runtime': 239, 'seconds_per_sample': 0.1197}
which more or less matches the BLEU scores reported on the model's page.
I will re-check tomorrow that I haven't made any mistakes, but this is far from the wmt19 models' scores, which is not surprising given the difference in the amount of training data.
Using wmt16 dataset was probably sufficient to support the ideas in the paper, but it doesn't appear to be very practical for the end user, unless it's re-trained with a much bigger dataset (and wmt20 is imminent and probably will be even bigger).
I guess I should also re-eval against the wmt16 eval set, so that we can also compare their BLEU scores to the ported model's, to ensure it works as well as the original. Will post results tomorrow.
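For context, the numbers above come from essentially the following kind of loop; this is a minimal sketch using sacrebleu directly rather than the actual eval script, with placeholder model/file names:

```python
import sacrebleu
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

mname = "allen_nlp-wmt16-en-de-dist_12-1"  # placeholder for the locally converted model
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

# val.source / val.target: the wmt19 en-de validation files, one sentence per line
sources = [line.strip() for line in open("val.source")]
references = [line.strip() for line in open("val.target")]

translations = []
for src in sources:
    inputs = tokenizer(src, return_tensors="pt")
    out = model.generate(**inputs, num_beams=15)
    translations.append(tokenizer.decode(out[0], skip_special_tokens=True))

print(sacrebleu.corpus_bleu(translations, [references]).score)
```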
That is shockingly low / suggests a bug somewhere. Too big of a discrepancy to explain with training data. I can't find your en-de BLEU from the en-de, but I remember 40.8 for Marian, which shouldn't be worse.
Can you check your work / look at the translations a bit, then upload the model to stas/ so that I can take a look from your PR branch?
Bug sources I've seen before: eos_token_id, decoder_start_token_id.

So as promised, here are the BLEU scores with the wmt16 test set (3K items), beam=50:
dist_12-1
{'bleu': 26.1883, 'n_obs': 2999, 'runtime': 1151, 'seconds_per_sample': 0.3838}
dist_6-1
{'bleu': 26.685, 'n_obs': 2999, 'runtime': 1054, 'seconds_per_sample': 0.3515}
12-1
{'bleu': 24.7299, 'n_obs': 2999, 'runtime': 1134, 'seconds_per_sample': 0.3781}
which is about 2 points behind their reported scores, if we assume they ran their eval on the wmt16 test data set.
edit: but I also don't get the advertised scores with their instructions: https://github.com/jungokasai/deep-shallow/issues/3
I will explore the model more today and see if I find anything, then yes, will upload the model.
Thank you for the pointers, @sshleifer
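A quick sanity check for the bug sources listed above is to print the special token ids the ported config ends up with; a sketch with a placeholder model id:

```python
from transformers import FSMTForConditionalGeneration

model = FSMTForConditionalGeneration.from_pretrained("allen_nlp-wmt16-en-de-dist_12-1")  # placeholder
cfg = model.config
# the usual suspects when generation quality silently degrades after a port
print(cfg.bos_token_id, cfg.eos_token_id, cfg.pad_token_id, cfg.decoder_start_token_id)
```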
> I can't find your en-de BLEU from the en-de, but I remember 40.8 for Marian, which shouldn't be worse.
github hides older comments, here is the one you're after:
https://github.com/huggingface/transformers/pull/6940#issuecomment-687709700
pair | fairseq | transformers
-------|----|----------
"en-ru"|36.4| 33.29
"ru-en"|41.3| 38.93
"de-en"|42.3| 41.18
"en-de"|43.1| 42.79
New model's config should be {'num_beams':5} according to https://github.com/jungokasai/deep-shallow#evaluation
> New model's config should be {'num_beams':5} according to https://github.com/jungokasai/deep-shallow#evaluation
I'm not 100% sure what to do with this.
fairseq uses num_beams=50 in their eval, but that would be overkill as the default for normal use. So may I propose we set a reasonable beam size for FSMT (currently 8) and leave it at that.
But when comparing BLEU scores we match what the researchers advertised.
Is this a reasonable approach?
FSMT defaulting to 8 is fine. 5 would also be fine. Use the num_beams used by the author, or a lower value that is close to as good (up to you).

So do you suggest we use num_beams=50 for fairseq? That'd be imposing a significant slowdown on a user, when 5-10 scores about the same.
The competitors try to win and thus try to squeeze all they can, at a compute/time cost, so I'm not sure the number reported in the paper is always a good one to use for that model.
But if you think the model's default should match the paper as a rule, then a rule is a rule.
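Whatever default ends up baked into the config, a user can always reproduce the paper's eval setting per call, so the default mostly matters for casual use. A sketch, with a placeholder model id:

```python
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

mname = "allen_nlp/en-de-12-1"  # placeholder
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)
inputs = tokenizer("Machine learning is great!", return_tensors="pt")

# the shipped default beam size lives in the model config ...
print(model.config.num_beams)
# ... but can be overridden at generation time, e.g. to match the paper's beam=50 eval setting
outputs = model.generate(**inputs, num_beams=50)
```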
My convo with the 1st author:
https://github.com/jungokasai/deep-shallow/issues/1#issuecomment-678549967
* Possible source of discrepancy for the allen-nlp models is the tokenizer. They are using whatever tokenizer `transformer.wmt16.en-de` uses [here](https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md#pre-trained-models)
You're very likely correct:
```python
# fairseq/models/transformer.py
'transformer.wmt14.en-fr': moses_subword('https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-fr.joined-dict.transformer.tar.bz2'),
'transformer.wmt16.en-de': 'https://dl.fbaipublicfiles.com/fairseq/models/wmt16.en-de.joined-dict.transformer.tar.bz2',
'transformer.wmt18.en-de': moses_subword('https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz'),
'transformer.wmt19.en-de': moses_fastbpe('https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-de.joined-dict.ensemble.tar.gz'),
'transformer.wmt19.en-ru': moses_fastbpe('https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-ru.ensemble.tar.gz'),
```
on it.
I checked your suggestion: they didn't do it the way transformer.wmt16.en-de does, since they have in the model args:

```python
'bpe': 'fastbpe',
'tokenizer': 'moses',
```

But it's a good point that I need to ensure inside the conversion script that this is so, otherwise there will be bugs when porting future models that may use a different tokenizer/bpe.
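The check itself is just reading the args stored in the fairseq checkpoint, roughly like this (the path is a placeholder for the downloaded allen-nlp checkpoint):

```python
import torch

# fairseq checkpoints of this vintage store the training args (an argparse Namespace) next to the weights
chkpt = torch.load("checkpoint_best.pt", map_location="cpu")
args = vars(chkpt["args"])
print(args.get("bpe"), args.get("tokenizer"))  # expecting: fastbpe moses
```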
transformer.wmt16.en-de uses a basic split-into-words tokenizer and no sub-words:

```python
def tokenize_line(line):
    line = SPACE_NORMALIZER.sub(" ", line)
    line = line.strip()
    return line.split()
```
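(SPACE_NORMALIZER in fairseq is just a whitespace-collapsing regex, so a self-contained version of the above amounts to:)

```python
import re

SPACE_NORMALIZER = re.compile(r"\s+")

def tokenize_line(line):
    # collapse whitespace runs, trim, then split on spaces (no moses, no BPE)
    line = SPACE_NORMALIZER.sub(" ", line)
    line = line.strip()
    return line.split()

print(tokenize_line("  Maschinelles   Lernen ist großartig . "))
# ['Maschinelles', 'Lernen', 'ist', 'großartig', '.']
```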
It was suggested that checkpoint_top5_average.pt should have a better score than checkpoint_best.pt, but I get a slightly better result with the latter on dist-12-1. Here is the full table at the moment, using the weights ported to FSMT.
num_beams=5 on wmt19 test set:
chkpt file| top5_average | best
----------|--------------|-----
dist-12-1 | 29.9134 | 30.2591
dist-6-1 | 29.9837 | 29.3349
12-1 | 26.4008 | 24.1803
I re-ran the eval after adding length_penalty=0.6 and got better scores:
chkpt file| top5_average
----------|--------------
dist-12-1 | 30.1637
dist-6-1 | 30.2214
12-1 | 26.9763
For en-de datasets, I think they used moses + joint fastbpe. The model just assumes the input data are already preprocessed with these tools, so that's why it just splits on spaces.
For wmt16 en/de, as you said, the fairseq transformer does only whitespace-splitting, and no moses/fastbpe.
But your model appears to do moses/fastbpe according to the args stored in the checkpoint, so our code copies your settings.
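In other words, the fairseq checkpoint expects input that has already been moses-tokenized and BPE-encoded externally. A rough sketch of the moses step, using sacremoses (the fastbpe step with the released joined-dict codes is omitted here):

```python
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")
# after this step, the model-side tokenizer only needs to split on whitespace
print(mt.tokenize("Machine learning is great.", return_str=True))  # "Machine learning is great ."
```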
OK, I searched the hparam space and came up with:
```python
# based on the results of a search on a range of `num_beams`, `length_penalty` and `early_stopping`
# values against wmt19 test data to obtain the best BLEU scores, we will use the following defaults:
#
# * `num_beams`: 5 (higher scores better, but requires more memory/is slower, can be adjusted by users)
# * `early_stopping`: `False` consistently scored better
# * `length_penalty` varied, so will assign the best one depending on the model
best_score_hparams = {
    # fairseq:
    "wmt19-ru-en": {"length_penalty": 1.1},
    "wmt19-en-ru": {"length_penalty": 1.15},
    "wmt19-en-de": {"length_penalty": 1.0},
    "wmt19-de-en": {"length_penalty": 1.1},
    # allen-nlp:
    "wmt16-en-de-dist-12-1": {"length_penalty": 0.6},
    "wmt16-en-de-dist-6-1": {"length_penalty": 0.6},
    "wmt16-en-de-12-1": {"length_penalty": 0.8},
    "wmt19-de-en-6-6-base": {"length_penalty": 0.6},
    "wmt19-de-en-6-6-big": {"length_penalty": 0.6},
}
```
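The search itself was just a brute-force grid, roughly along these lines; score_bleu is a hypothetical helper wrapping the eval loop shown earlier:

```python
from itertools import product

def score_bleu(model_name, num_beams, length_penalty, early_stopping):
    """Hypothetical helper: run the eval loop from the earlier sketch with these
    generate() settings and return the corpus BLEU. Stubbed out here."""
    return 0.0

results = []
for num_beams, length_penalty, early_stopping in product(
    [5, 10, 15], [0.6, 0.7, 0.8, 0.9, 1.0, 1.1], [False, True]
):
    bleu = score_bleu("allen_nlp-wmt16-en-de-dist_12-1", num_beams, length_penalty, early_stopping)
    results.append((bleu, num_beams, length_penalty, early_stopping))

# best settings first
for bleu, num_beams, length_penalty, early_stopping in sorted(results, reverse=True):
    print(f"{bleu:.2f} | {num_beams} | {length_penalty} | {early_stopping}")
```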
Here are the full results for allen-nlp:
dist-12-1:

bleu | num_beams | length_penalty
----- | --------- | --------------
30.36 | 15 | 0.6
30.35 | 15 | 0.7
30.29 | 10 | 0.6
30.27 | 15 | 0.8
30.23 | 10 | 0.7
30.21 | 15 | 0.9
30.16 | 5 | 0.6
30.16 | 10 | 0.8
30.11 | 10 | 0.9
30.11 | 15 | 1.0
30.10 | 5 | 0.7
30.03 | 5 | 0.8
30.03 | 5 | 0.9
30.02 | 10 | 1.0
29.99 | 15 | 1.1
29.94 | 10 | 1.1
29.91 | 5 | 1.0
29.88 | 5 | 1.1
dist-6-1:

bleu | num_beams | length_penalty
----- | --------- | --------------
30.22 | 5 | 0.6
30.17 | 10 | 0.7
30.17 | 15 | 0.7
30.16 | 5 | 0.7
30.11 | 15 | 0.8
30.10 | 10 | 0.6
30.07 | 10 | 0.8
30.05 | 5 | 0.8
30.05 | 15 | 0.9
30.04 | 5 | 0.9
30.03 | 15 | 0.6
30.00 | 10 | 0.9
29.98 | 5 | 1.0
29.95 | 15 | 1.0
29.92 | 5 | 1.1
29.91 | 10 | 1.0
29.82 | 15 | 1.1
29.80 | 10 | 1.1
12-1:

bleu | num_beams | length_penalty
----- | --------- | --------------
27.71 | 15 | 0.8
27.60 | 15 | 0.9
27.35 | 15 | 0.7
27.33 | 10 | 0.7
27.19 | 10 | 0.8
27.17 | 10 | 0.6
27.13 | 5 | 0.8
27.07 | 5 | 0.7
27.07 | 15 | 0.6
27.02 | 15 | 1.0
26.98 | 5 | 0.6
26.97 | 10 | 0.9
26.69 | 5 | 0.9
26.48 | 10 | 1.0
26.40 | 5 | 1.0
26.18 | 15 | 1.1
26.04 | 10 | 1.1
25.65 | 5 | 1.1
And the two wmt19-de-en 6-6 models:

bleu | num_beams | length_penalty
----- | --------- | --------------
38.37 | 5 | 0.6
38.31 | 5 | 0.7
38.29 | 15 | 0.7
38.25 | 10 | 0.7
38.25 | 15 | 0.6
38.24 | 10 | 0.6
38.23 | 15 | 0.8
38.17 | 5 | 0.8
38.11 | 10 | 0.8
38.11 | 15 | 0.9
38.03 | 5 | 0.9
38.02 | 5 | 1.0
38.02 | 10 | 0.9
38.02 | 15 | 1.0
38.00 | 10 | 1.0
37.86 | 5 | 1.1
37.77 | 10 | 1.1
37.74 | 15 | 1.1
bleu | num_beams | length_penalty
----- | --------- | --------------
40.12 | 15 | 0.6
40.01 | 10 | 0.6
39.96 | 15 | 0.7
39.90 | 5 | 0.6
39.90 | 10 | 0.7
39.76 | 5 | 0.7
39.74 | 10 | 0.8
39.74 | 15 | 0.8
39.65 | 5 | 0.8
39.56 | 10 | 0.9
39.48 | 5 | 0.9
39.46 | 15 | 0.9
39.42 | 10 | 1.0
39.32 | 5 | 1.0
39.29 | 15 | 1.0
39.21 | 10 | 1.1
39.16 | 5 | 1.1
38.83 | 15 | 1.1
https://github.com/huggingface/transformers/pull/7153 once merged should close this issue.