Tensor2tensor: Does the transformer model support ensemble model?

Created on 7 Aug 2017 · 15 Comments · Source: tensorflow/tensor2tensor

Hi, we have trained several NMT models with the same configuration, and we want to ensemble them at decoding time (simply combining the per-word probabilities of the models at each decoding step). However, the current code seems to use the estimator to predict whole sentences. Does the decoding part of the Transformer model support ensembling multiple models? If not, where should I start if I want to add ensemble decoding?

All 15 comments

Proper ensembling (combining the probability of each word at each decoding step) would be nice to have; it should give the best results.
Meanwhile, you can use avg_checkpoints.py to combine the models you have. According to my experiments, the best results are achieved by averaging 20 checkpoints from each of the N independently* trained models (that is, averaging 20*N checkpoints into one model).

*) It seems that the models do not need to be trained independently from the very beginning, but only for the last few hundred steps (thus saving the total GPU training time). I am running experiments to analyze this in more detail.

@martinpopel thanks for your answer, we will try averaging checkpoints. It seems to be rather difficult to implement model ensembling in this Transformer code.

@martinpopel
Hi, we tried averaging 4 independent models. But when decoding with the averaged model, the output is terrible, as shown below:

INFO:tensorflow:Inference results OUTPUT: 主要 主要 主要 主要 主要 主要 主要 最 最 最 最 最 最 最 最 [... 最 repeated until the end of the output]
INFO:tensorflow:Inference results INPUT: Liu Tong , a director at a local TV channel in the northeast province of Liaoning , recalled that in his parent 's generation , the bride would be picked up by bicycle instead of the swanky sportscars favoured by today &apos;s generation . Looking at how the concept of tying the knot has evolved , Liu wrote on the social media site Weibo : ' In the 1950s it was about having a bed , in the 1960s it was just about a bag of sweets , in the 1970s it was the Little Red Book , in the 1980s it was about having a radio , in the 1990s there was the extravagance of top-class hotels , and in the 2000s the wedding reception is a display of individuality . '
INFO:tensorflow:Inference results OUTPUT: 主要 主要 主要 最 最 最 最 最 最 最 最 [... 最 repeated until the end of the output]
INFO:tensorflow:Inference results INPUT: Over the years , I have filled many spiral notebooks with quotations . On the subject of love , clichés are hard to avoid . I have heard , " They / we are like two pieces of a puzzle , " or " It was fate / bashert / kismet / meant to be " thousands of times , or that 's how it feels anyway . Occasionally , though , people say something original and sparkly about love , and as a reporter , I want to hug them . It 's like finding a diamond in the sand . As a sort of thank-you to all of those whom I have written about , and those whom I have written for , here are 10 of those jewels
INFO:tensorflow:Inference results OUTPUT: 主要 主要 主要 主要 主要 主要 主要 最 最 最 最 最 最 最 最 [... 最 repeated until the end of the output]

Have you encountered this problem? I look forward to your answer, and thanks.

I have seen such terrible output when I tried to resume training of an averaged model (strangely, the approx_bleu was quite OK, but the real BLEU was almost 0). So once you create an averaged model you should use it only for decoding, not for further training. You should also make sure that all the averaged models are of the same type (they have exactly the same graph, with identical variable names, dimensions etc). I think there is no sanity check in avg_checkpoints.py to prevent e.g. averaging transformer_base with transformer_big.
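The averaging and the missing sanity check mentioned above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the code of avg_checkpoints.py: each checkpoint is assumed to be already loaded into a plain dict mapping variable names to NumPy arrays, and `average_checkpoints` is a hypothetical helper name.

```python
import numpy as np


def average_checkpoints(checkpoints):
    """Average a list of {variable_name: np.ndarray} dicts element-wise.

    Hypothetical helper. Raises ValueError if the checkpoints do not share
    exactly the same variable names and shapes (for example when a base
    and a big model are mixed by accident).
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    reference = checkpoints[0]
    for i, ckpt in enumerate(checkpoints[1:], start=1):
        if ckpt.keys() != reference.keys():
            raise ValueError(f"checkpoint {i}: variable names differ")
        for name, value in ckpt.items():
            if value.shape != reference[name].shape:
                raise ValueError(f"checkpoint {i}: shape mismatch for {name!r}")
    # All checks passed: take the plain arithmetic mean of every variable.
    return {
        name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
        for name in reference
    }
```

A check like this only catches structural mismatches. It cannot tell whether the averaged weights are close enough to each other for the average to be meaningful.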

BTW: based on my recent experiments:

  • It is better to take 40 rather than 20 checkpoints (stored at 20-minute intervals, which is the default). BLEU grows for up to 200 checkpoints, but with smaller and smaller improvements.
  • Even better is to take 10 checkpoints stored at 200-minute intervals.
  • Unfortunately, it seems that the longer the model is trained (and the better its BLEU without averaging), the smaller the improvement that averaging can achieve (both checkpoint averaging alone and checkpoint averaging combined with averaging of independently trained models). This is in contrast with proper ensembling in Nematus, where both checkpoint ensembling and checkpoint & independent ensembling help significantly (e.g. over 2.3 BLEU points, while averaging in T2T gives me just +0.6 BLEU for the same language pair) even for long-trained models. Not to mention right-to-left reranking (yes, it would be nice to have forced decoding and n-best list pruning in T2T).

@martinpopel thanks for your answer. Here are the details of our experiment:

  1. Our models: 4 independent transformer_big_dr2 models.
  2. Checkpoint steps: we save a checkpoint every 5000 steps instead of at a fixed time interval (e.g. 20 minutes).
  3. We use the averaged model only for decoding, not for further training.

In addition, we observed:

  4. Averaging the checkpoints of a single model works and gives a small improvement over a single checkpoint.
  5. However, when we average 4 independent models at 100k steps each, the terrible output appears.
  6. We also tried averaging 4*5 checkpoints (4 independent models, 5 checkpoints from each), and the bad output appears as well.

So why does averaging independent models work for you but fail for me? Do you have any idea about this?
BTW, I'm a little confused: if the independent models converge to different local minima, can the averaged model still work?

@njuzrs: In my experiments the 4 models were not actually fully independent. They were cloned from the original experiment at 120k, 320k and 653k steps. I evaluated the averaging at 256k, 900k and 1900k steps and the multiple models always improved the BLEU a bit. See my notes for details.
I expected that fully independent models would be better or the same, but I have not tried that yet. It is possible that a different random initialization makes a fundamental difference.

Oh, I don't think you can average the weights of independently initialized models; I wouldn't expect that to work. You can average their predictions, but if you want to average weights, the models need to start from the same random initialization and share at least a few thousand initial steps, so that they stay reasonably close.
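One plausible reason, which the comment itself does not spell out, is the permutation symmetry of hidden units: two networks can compute exactly the same function with their hidden units in a different order, yet their averaged weights compute something else. The toy NumPy sketch below illustrates this with a tiny two-layer ReLU network; the network and all names are made up for the illustration and are unrelated to the Transformer code.

```python
import numpy as np


def forward(x, w1, w2):
    """Tiny two-layer ReLU network: x -> relu(x @ w1) @ w2."""
    return np.maximum(x @ w1, 0.0) @ w2


rng = np.random.default_rng(0)
w1 = rng.normal(size=(3, 4))   # input -> hidden
w2 = rng.normal(size=(4, 2))   # hidden -> output
x = rng.normal(size=(5, 3))    # a batch of 5 inputs

# Model B is model A with its hidden units reordered: same function.
perm = np.array([1, 2, 3, 0])
w1_b, w2_b = w1[:, perm], w2[perm, :]

out_a = forward(x, w1, w2)
out_b = forward(x, w1_b, w2_b)

# Averaging predictions keeps the shared function ...
out_pred_avg = (out_a + out_b) / 2
# ... while averaging weights mixes unrelated hidden units.
out_weight_avg = forward(x, (w1 + w1_b) / 2, (w2 + w2_b) / 2)

same_function = np.allclose(out_a, out_b)
prediction_error = np.abs(out_pred_avg - out_a).max()
weight_error = np.abs(out_weight_avg - out_a).max()
print(same_function, prediction_error, weight_error)
```

Models that share their initialization and early training steps keep their hidden units roughly aligned, which is consistent with averaging working in that setting.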

@lukaszkaiser: Thanks for clarification. All my "pseudo-independent" models had at least 120k first steps the same and then over 1M independent steps. Averaging those models worked, but the improvement was rather small: +0.2 BLEU point.

@martinpopel @lukaszkaiser thanks for your advice; it is consistent with my experiments.

@lukaszkaiser we now want to ensemble models for the Transformer. Does the code support model ensembling, or could you share any ideas on how to implement it? (We have read the code and found it to be highly integrated.)

In the new release 1.2.0, it's easy to call T2T models from a raw session, as in this test:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/trainer_utils_test.py#L96

You could use this to call a few models step-by-step, in each step feeding new targets and ensembling in python. It will be slow, but should be an easy way to try out the performance of an ensemble.
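The step-by-step procedure suggested above can be sketched independently of T2T. In this sketch each model is assumed to be a hypothetical Python callable `model(source, prefix)` that returns a probability distribution over the vocabulary for the next token; in practice each call would wrap a session run. Greedy search is used instead of beam search for brevity, and the toy models at the end exist only to make the example runnable.

```python
import numpy as np


def ensemble_greedy_decode(models, source, eos_id, max_len=50):
    """Greedy decoding that averages next-token probabilities per step.

    models: callables (source, prefix) -> 1-D array of probabilities.
    Returns the decoded token ids, without the final EOS.
    """
    prefix = []
    for _ in range(max_len):
        # Query every model with the same prefix and average the distributions.
        probs = np.mean([m(source, prefix) for m in models], axis=0)
        next_id = int(np.argmax(probs))
        if next_id == eos_id:
            break
        prefix.append(next_id)
    return prefix


def toy_model(first, later):
    """Hypothetical stand-in: one distribution for step 0, another afterwards."""
    first, later = np.array(first), np.array(later)
    return lambda source, prefix: first if not prefix else later


# Vocabulary of three tokens; id 0 is EOS.
model_a = toy_model([0.1, 0.6, 0.3], [0.5, 0.1, 0.4])
model_b = toy_model([0.1, 0.2, 0.7], [0.6, 0.3, 0.1])
decoded = ensemble_greedy_decode([model_a, model_b], source=None, eos_id=0)
print(decoded)
```

Every decoding step costs one call per model, which is why this approach is expected to be slow when each call is a separate session run.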

@njuzrs I think the models you combined are not in the same local optimum.

Hi @lukaszkaiser, I've implemented ensemble decoding following your proposal. Unfortunately, running a new session for each decoding step is indeed slow. I'm wondering whether there is a more appropriate way to implement ensemble decoding. Still trying ...

@cshanbo Could you share your code for ensemble decoding? I would also like to give it a try. Thanks for your help!

I've already forgotten where I put that code. Even if I still had it, the decoding speed is too slow to be usable (I followed Lukasz's idea). I hope someone can propose an efficient solution.

