Transformers: Convert 12-1 and 6-1 en-de models from AllenNLP

Created on 10 Sep 2020 · 19 comments · Source: huggingface/transformers

https://github.com/jungokasai/deep-shallow#download-trained-deep-shallow-models

  • These should be FSMT models, so can be part of #6940 or done after.
  • They should be uploaded to the AllenNLP namespace. If stas takes this, they can start in stas/ and I will move them.
  • model card(s) should link to the original repo and paper.
  • I hope the same en-de tokenizer is already ported.
  • It would be interesting to compare BLEU to the initial models in that PR. There is no ensemble, so we should be able to match the reported scores pretty well.
  • Ideally this requires 0 lines of checked in python code, besides maybe an integration test.

Desired Signature:

model = FSMT.from_pretrained('allen_nlp/en-de-12-1')
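For illustration, a minimal sketch of how such a ported model might be used, assuming the FSMT classes from #6940 land as FSMTForConditionalGeneration/FSMTTokenizer, and using the proposed (not yet uploaded) model id:

from transformers import FSMTForConditionalGeneration, FSMTTokenizer

mname = "allen_nlp/en-de-12-1"  # proposed id from above, not yet on the hub
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

batch = tokenizer("Machine learning is great, isn't it?", return_tensors="pt")
generated = model.generate(**batch, num_beams=5)
print(tokenizer.decode(generated[0], skip_special_tokens=True))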

Weights can be downloaded with gdown (https://pypi.org/project/gdown/):

pip install gdown
gdown https://drive.google.com/uc?id=1x_G2cjvM1nW5hjAB8-vWxRqtQTlmIaQU

@stas00 if you are blocked in the late stages of #6940 and have extra cycles, you could give this a whirl. We could also wait for that to be finalized and then either of us can take this.

New model

All 19 comments

I can work on these

But first I have a question: who would use these wmt16-based models when there are several wmt19 en-de models (Marian + fairseq WMT) that are significantly better, scoring ~41 and 43 respectively vs. ~28 for this one?

wmt19 is a vastly bigger dataset, so it makes sense that it'd beat wmt16-based pre-trained models.

  • Since these use a different dataset and a different validation set, the 28 and 41 are not comparable BLEU scores.
  • These models should be significantly faster than the Marian models at similar performance levels.
  • We can finetune them on the new data if we think that will help.
  • FYI Helsinki-NLP/opus-mt-en-de trained on way more data than the fairseq model, I think, not totally sure.

Seems like embeddings will be shared in these.

Converted and did the initial eval with the same wmt19 val set, with beam=15

checkpoint_best.pt:

allen_nlp-wmt16-en-de-dist_12-1
{'bleu': 30.1995, 'n_obs': 1997, 'runtime': 233, 'seconds_per_sample': 0.1167}

allen_nlp-wmt16-en-de-dist_6-1
{'bleu': 29.3692, 'n_obs': 1997, 'runtime': 236, 'seconds_per_sample': 0.1182}

allen_nlp-wmt16-en-de-12-1
{'bleu': 24.3901, 'n_obs': 1997, 'runtime': 280, 'seconds_per_sample': 0.1402}

checkpoint_top5_average.pt:

allen_nlp-wmt16-en-de-dist_12-1
{'bleu': 30.1078, 'n_obs': 1997, 'runtime': 239, 'seconds_per_sample': 0.1197}

which is roughly the same as the BLEU scores reported on the model's page.

I will re-check tomorrow that I haven't made any mistakes, but this is far from the wmt19 models' scores, which is not surprising given the difference in the amount of training data.

Using the wmt16 dataset was probably sufficient to support the ideas in the paper, but it doesn't appear to be very practical for the end user unless it's re-trained with a much bigger dataset (and wmt20 is imminent and will probably be even bigger).

I guess I should also re-eval against the wmt16 eval set, so that we can compare their BLEU scores to the ported model's - to ensure it works as well as the original. Will post results tomorrow.
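For context, here is a minimal sketch of the kind of eval loop that produces metric dicts in the shape above - this is only an illustration (assuming sacrebleu and the FSMT classes from #6940), not the actual script used for these numbers:

import time
import sacrebleu
import torch
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

def eval_bleu(mname, src_lines, ref_lines, batch_size=8, **generate_kwargs):
    tokenizer = FSMTTokenizer.from_pretrained(mname)
    model = FSMTForConditionalGeneration.from_pretrained(mname).eval()
    hyps = []
    t0 = time.time()
    with torch.no_grad():
        for i in range(0, len(src_lines), batch_size):
            batch = tokenizer(src_lines[i:i + batch_size], return_tensors="pt", padding=True)
            generated = model.generate(**batch, **generate_kwargs)
            hyps += tokenizer.batch_decode(generated, skip_special_tokens=True)
    runtime = time.time() - t0
    bleu = sacrebleu.corpus_bleu(hyps, [ref_lines]).score
    return {"bleu": round(bleu, 4), "n_obs": len(src_lines),
            "runtime": int(runtime), "seconds_per_sample": round(runtime / len(src_lines), 4)}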

That is shockingly low / suggests a bug somewhere. Too big of a discrepancy to explain with training data. I can't find your en-de BLEU from the en-de PR, but I remember 40.8 for Marian, which shouldn't be worse.
Can you check your work / look at the translations a bit, then upload the model to stas/ so that I can take a look from your PR branch?

Bug sources I've seen before:

  • special tokens (eos_token_id, decoder_start_token_id)
  • assumption that tokenizer is identical
  • Redundant generations: suggests caching issue.
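A quick way to check the first two of these on a converted checkpoint (the model id is hypothetical - wherever the converted weights end up):

from transformers import FSMTForConditionalGeneration, FSMTTokenizer

mname = "stas/wmt16-en-de-dist-12-1"  # hypothetical upload location
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

# 1. special tokens: these should match what the fairseq checkpoint uses
print(model.config.eos_token_id, model.config.decoder_start_token_id, tokenizer.eos_token_id)

# 2. tokenizer assumptions: round-trip a sample and eyeball the pieces
sample = "Machine learning is great, isn't it?"
ids = tokenizer(sample).input_ids
print(tokenizer.convert_ids_to_tokens(ids))
print(tokenizer.decode(ids, skip_special_tokens=True))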

So as promised, here are the BLEU scores with the wmt16 test set (3K items), beam=50:

dist_12-1
{'bleu': 26.1883, 'n_obs': 2999, 'runtime': 1151, 'seconds_per_sample': 0.3838}
dist_6-1
{'bleu': 26.685, 'n_obs': 2999, 'runtime': 1054, 'seconds_per_sample': 0.3515}
12-1
{'bleu': 24.7299, 'n_obs': 2999, 'runtime': 1134, 'seconds_per_sample': 0.3781}

which are about 2 points behind, if we assume they ran their eval on the wmt16 test set.

edit: but I also don't get the advertised scores with their instructions: https://github.com/jungokasai/deep-shallow/issues/3

I will explore the model more today and see if I find anything, then yes, will upload the model.

Thank you for the pointers, @sshleifer

> I can't find your en-de BLEU from the en-de PR, but I remember 40.8 for Marian, which shouldn't be worse.

github hides older comments, here is the one you're after:
https://github.com/huggingface/transformers/pull/6940#issuecomment-687709700

pair | fairseq | transformers
-------|----|----------
"en-ru"|36.4| 33.29
"ru-en"|41.3| 38.93
"de-en"|42.3| 41.18
"en-de"|43.1| 42.79

New model's config should be {'num_beams':5} according to https://github.com/jungokasai/deep-shallow#evaluation

> New model's config should be {'num_beams':5} according to https://github.com/jungokasai/deep-shallow#evaluation

I'm not 100% sure what to do with this.

fairseq uses num_beams=50 in their eval, but that would be overkill as the default for normal use. So may I propose we set a reasonable beam size for FSMT (currently 8) and leave it at that?

But when comparing BLEU scores, we match what the researchers advertised.

Is this a reasonable approach?

  • FSMT defaulting to 8 is fine. 5 would also be fine.
  • Each model's config should either have the num_beams used by the author or a lower value that is close to as good (up to you); see the sketch below.
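A rough sketch of what that means in practice (the model id and the value shown are hypothetical; the per-model generation defaults ship in the uploaded config.json and users can always override them at generate time):

from transformers import FSMTConfig

# hypothetical model id - the per-model generation defaults live in its config.json
config = FSMTConfig.from_pretrained("stas/wmt16-en-de-dist-12-1")
print(config.num_beams)  # e.g. 5, the author's eval setting

# a user can still trade speed for quality per call:
#   model.generate(**batch, num_beams=50)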

So do you suggest we use num_beams=50 for fairseq? That would impose a significant slowdown on users, when 5-10 scores about the same.

The competitors try to win and thus squeeze out all they can, at a compute/time cost, so I'm not sure the number reported in the paper is always a good default for that model.

But if you think the model's default should match the paper as a rule, then a rule is a rule.

  • Let's do 5 for fairseq.
  • A possible source of discrepancy for the allen-nlp models is the tokenizer.
    They are using whatever tokenizer transformer.wmt16.en-de uses
    (https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md#pre-trained-models).

my convo with the 1st author:

https://github.com/jungokasai/deep-shallow/issues/1#issuecomment-678549967

> * Possible source of discrepancy for the allen-nlp models is the tokenizer.
>   They are using whatever tokenizer `transformer.wmt16.en-de` uses
>   [here](https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md#pre-trained-models)

You're very likely correct:

# excerpt from fairseq/models/transformer.py
# note: the wmt16.en-de entry is a bare URL with no moses_subword/moses_fastbpe wrapper
'transformer.wmt14.en-fr': moses_subword('https://dl.fbaipublicfiles.com/fairseq/models/wmt14.en-fr.joined-dict.transformer.tar.bz2'),
'transformer.wmt16.en-de': 'https://dl.fbaipublicfiles.com/fairseq/models/wmt16.en-de.joined-dict.transformer.tar.bz2',
'transformer.wmt18.en-de': moses_subword('https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz'),
'transformer.wmt19.en-de': moses_fastbpe('https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-de.joined-dict.ensemble.tar.gz'),
'transformer.wmt19.en-ru': moses_fastbpe('https://dl.fbaipublicfiles.com/fairseq/models/wmt19.en-ru.ensemble.tar.gz'),

on it.

I verified your suggestion, and they didn't do it the way transformer.wmt16.en-de does; their model args contain:

'bpe': 'fastbpe',
'tokenizer': 'moses',

but it's a good point that I need to ensure this inside the convert script; otherwise there will be bugs when porting future models that may use a different tokenizer/bpe.
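A hedged sketch of what such a guard in the conversion script could look like (args stands for the args namespace loaded from the fairseq checkpoint; the names are illustrative, not the actual script):

# illustrative guard, not the actual conversion code
SUPPORTED = {"bpe": {"fastbpe"}, "tokenizer": {"moses"}}

def check_preprocessing(args):
    bpe = getattr(args, "bpe", None)
    tok = getattr(args, "tokenizer", None)
    if bpe not in SUPPORTED["bpe"] or tok not in SUPPORTED["tokenizer"]:
        raise ValueError(
            f"unsupported preprocessing: bpe={bpe}, tokenizer={tok}; "
            "this converter currently assumes moses + fastbpe"
        )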

transformer.wmt16.en-de uses a basic split-into-words tokenizer and no sub-words:

import re

SPACE_NORMALIZER = re.compile(r"\s+")  # as defined in fairseq's tokenizer module

def tokenize_line(line):
    # collapse whitespace runs, strip, then split on spaces
    line = SPACE_NORMALIZER.sub(" ", line)
    line = line.strip()
    return line.split()
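e.g. tokenize_line("Hello   world ") returns ['Hello', 'world'] - pure whitespace splitting, no moses and no BPE.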

It was suggested that checkpoint_top5_average.pt should have a better score than checkpoint_best.pt, but I get a slightly better result with the latter on dist-12-1. Here is the full table at the moment, using the weights ported to FSMT.

num_beams=5 on wmt19 test set:

chkpt file| top5_average | best
----------|--------------|-----
dist-12-1 | 29.9134 | 30.2591
dist-6-1 | 29.9837 | 29.3349
12-1 | 26.4008 | 24.1803

I re-ran the eval after adding length_penalty=0.6 and got better scores:

chkpt file| top5_average
----------|--------------
dist-12-1 | 30.1637
dist-6-1 | 30.2214
12-1 | 26.9763

For the en-de datasets, I think they used moses + joint fastbpe. The model just assumes the input data are already preprocessed with these tools, so that's why they just split on whitespace.

For wmt16 en/de, as you said, the fairseq transformer does only whitespace-splitting, and no moses/fastbpe.

But your model appears to do moses/fastbpe according to the args stored in the checkpoint, so our code copies your settings.
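For illustration, the preprocessing described above (moses tokenization followed by joint fastbpe) could be reproduced roughly like this - a sketch assuming the sacremoses and fastBPE Python packages and a local bpecodes file from the checkpoint, not their exact pipeline:

from sacremoses import MosesTokenizer
import fastBPE

moses = MosesTokenizer(lang="en")
bpe = fastBPE.fastBPE("bpecodes")  # joint BPE codes shipped with the checkpoint

def preprocess(line):
    # 1. moses word tokenization, 2. apply the joint fastbpe subword codes
    tokenized = moses.tokenize(line, return_str=True)
    return bpe.apply([tokenized])[0]

print(preprocess("Machine learning is great, isn't it?"))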

OK, I did a search of the hparam space and came up with:

# based on the results of a search on a range of `num_beams`, `length_penalty` and `early_stopping`
# values against wmt19 test data to obtain the best BLEU scores, we will use the following defaults:
#
# * `num_beams`: 5 (higher scores better, but requires more memory/is slower, can be adjusted by users)
# * `early_stopping`: `False` consistently scored better
# * `length_penalty` varied, so will assign the best one depending on the model
best_score_hparams = {
    # fairseq:
    "wmt19-ru-en": {"length_penalty": 1.1},
    "wmt19-en-ru": {"length_penalty": 1.15},
    "wmt19-en-de": {"length_penalty": 1.0},
    "wmt19-de-en": {"length_penalty": 1.1},
    # allen-nlp:
    "wmt16-en-de-dist-12-1": {"length_penalty": 0.6},
    "wmt16-en-de-dist-6-1": {"length_penalty": 0.6},
    "wmt16-en-de-12-1": {"length_penalty": 0.8},
    "wmt19-de-en-6-6-base": {"length_penalty": 0.6},
    "wmt19-de-en-6-6-big": {"length_penalty": 0.6},
}
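The search itself can be a simple grid over the generation kwargs. A hedged sketch, reusing the hypothetical eval_bleu helper sketched earlier (not the actual search script):

import itertools

def search_hparams(mname, src_lines, ref_lines):
    best = None
    for num_beams, length_penalty, early_stopping in itertools.product(
        [5, 10, 15], [0.6, 0.7, 0.8, 0.9, 1.0, 1.1], [False, True]
    ):
        score = eval_bleu(mname, src_lines, ref_lines,
                          num_beams=num_beams, length_penalty=length_penalty,
                          early_stopping=early_stopping)["bleu"]
        if best is None or score > best[0]:
            best = (score, num_beams, length_penalty, early_stopping)
    return best  # (bleu, num_beams, length_penalty, early_stopping)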

Here are the full results for allen-nlp:

  • wmt16-en-de-dist-12-1

bleu | num_beams | length_penalty
----- | --------- | --------------
30.36 | 15 | 0.6
30.35 | 15 | 0.7
30.29 | 10 | 0.6
30.27 | 15 | 0.8
30.23 | 10 | 0.7
30.21 | 15 | 0.9
30.16 | 5 | 0.6
30.16 | 10 | 0.8
30.11 | 10 | 0.9
30.11 | 15 | 1.0
30.10 | 5 | 0.7
30.03 | 5 | 0.8
30.03 | 5 | 0.9
30.02 | 10 | 1.0
29.99 | 15 | 1.1
29.94 | 10 | 1.1
29.91 | 5 | 1.0
29.88 | 5 | 1.1

  • wmt16-en-de-dist-6-1

bleu | num_beams | length_penalty
----- | --------- | --------------
30.22 | 5 | 0.6
30.17 | 10 | 0.7
30.17 | 15 | 0.7
30.16 | 5 | 0.7
30.11 | 15 | 0.8
30.10 | 10 | 0.6
30.07 | 10 | 0.8
30.05 | 5 | 0.8
30.05 | 15 | 0.9
30.04 | 5 | 0.9
30.03 | 15 | 0.6
30.00 | 10 | 0.9
29.98 | 5 | 1.0
29.95 | 15 | 1.0
29.92 | 5 | 1.1
29.91 | 10 | 1.0
29.82 | 15 | 1.1
29.80 | 10 | 1.1

  • wmt16-en-de-12-1

bleu | num_beams | length_penalty
----- | --------- | --------------
27.71 | 15 | 0.8
27.60 | 15 | 0.9
27.35 | 15 | 0.7
27.33 | 10 | 0.7
27.19 | 10 | 0.8
27.17 | 10 | 0.6
27.13 | 5 | 0.8
27.07 | 5 | 0.7
27.07 | 15 | 0.6
27.02 | 15 | 1.0
26.98 | 5 | 0.6
26.97 | 10 | 0.9
26.69 | 5 | 0.9
26.48 | 10 | 1.0
26.40 | 5 | 1.0
26.18 | 15 | 1.1
26.04 | 10 | 1.1
25.65 | 5 | 1.1

  • wmt19-de-en-6-6-base

bleu | num_beams | length_penalty
----- | --------- | --------------
38.37 | 5 | 0.6
38.31 | 5 | 0.7
38.29 | 15 | 0.7
38.25 | 10 | 0.7
38.25 | 15 | 0.6
38.24 | 10 | 0.6
38.23 | 15 | 0.8
38.17 | 5 | 0.8
38.11 | 10 | 0.8
38.11 | 15 | 0.9
38.03 | 5 | 0.9
38.02 | 5 | 1.0
38.02 | 10 | 0.9
38.02 | 15 | 1.0
38.00 | 10 | 1.0
37.86 | 5 | 1.1
37.77 | 10 | 1.1
37.74 | 15 | 1.1

  • wmt19-de-en-6-6-big

bleu | num_beams | length_penalty
----- | --------- | --------------
40.12 | 15 | 0.6
40.01 | 10 | 0.6
39.96 | 15 | 0.7
39.90 | 5 | 0.6
39.90 | 10 | 0.7
39.76 | 5 | 0.7
39.74 | 10 | 0.8
39.74 | 15 | 0.8
39.65 | 5 | 0.8
39.56 | 10 | 0.9
39.48 | 5 | 0.9
39.46 | 15 | 0.9
39.42 | 10 | 1.0
39.32 | 5 | 1.0
39.29 | 15 | 1.0
39.21 | 10 | 1.1
39.16 | 5 | 1.1
38.83 | 15 | 1.1

https://github.com/huggingface/transformers/pull/7153 once merged should close this issue.
