Hi, we have trained several NMT models with the same configuration, and we want to ensemble them at decoding time (that is, combine the models' probabilities for each word at each decoding step). However, the current code seems to use the Estimator to predict whole sentences. Does the decoding part of the Transformer model support ensembling multiple models? If not, where should I start if I want to implement ensemble decoding?
Proper ensembling (combining the probability of each word at each decoding step) would be nice to have; it should give the best results.
Meanwhile, you can use avg_checkpoints.py to combine the models you have. In my experiments, the best results were achieved by averaging 20 checkpoints from each of the N independently* trained models (that is, averaging 20*N checkpoints into one model).
*) It seems that the models do not need to be trained independently from the very beginning, but only for the last few hundred steps (thus saving the total GPU training time). I am running experiments to analyze this in more detail.
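To make the averaging idea concrete, here is a minimal sketch of element-wise checkpoint averaging. It is not the code of avg_checkpoints.py; it assumes each checkpoint has already been loaded into a plain dict mapping variable names to NumPy arrays, and the function name `average_checkpoints` is hypothetical.

```python
import numpy as np


def average_checkpoints(checkpoints):
    """Average a list of {variable_name: ndarray} dicts element-wise.

    Hypothetical helper for illustration only; assumes every checkpoint
    holds the same variable names with the same shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        # Stack the same variable from every checkpoint and take the mean.
        averaged[name] = np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
    return averaged


# Toy usage: two "checkpoints" with one weight vector and one bias each.
ckpt_a = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
ckpt_b = {"w": np.array([3.0, 6.0]), "b": np.array([2.0])}
avg = average_checkpoints([ckpt_a, ckpt_b])
```

With 20 checkpoints from each of N runs, the input list would simply contain 20*N such dicts.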
@martinpopel Thanks for your answer. We will try checkpoint averaging. It seems a little difficult to implement model ensembling in this Transformer code.
@martinpopel Hi, we have tried averaging four independently trained models. But when decoding with the averaged model, the output is terrible, as follows:
INFO:tensorflow:Inference results OUTPUT: 主要 主要 主要 主要 主要 主要 主要 ζ ζ ζ ζ ζ ζ ζ ζ ...
INFO:tensorflow:Inference results INPUT: Liu Tong , a director at a local TV channel in the northeast province of Liaoning , recalled that in his parent 's generation , the bride would be picked up by bicycle instead of the swanky sportscars favoured by today & apos;s generation . Looking at how the concept of tying the knot has evolved , Liu wrote on the social media site Weibo : ' In the 1950s it was about having a bed , in the 1960s it was just about a bag of sweets , in the 1970s it was the Little Red Book , in the 1980s it was about having a radio , in the 1990s there was the extravagance of top-class hotels , and in the 2000s the wedding reception is a display of individuality . '
INFO:tensorflow:Inference results OUTPUT: 主要 主要 主要 ζ ζ ζ ζ ζ ζ ζ ζ ...
INFO:tensorflow:Inference results INPUT: Over the years , I have filled many spiral notebooks with quotations . On the subject of love , clichés are hard to avoid . I have heard , " They / we are like two pieces of a puzzle , " or " It was fate / bash ert / kismet / meant to be " thousands of times , or that 's how it feels anyway . Occasionally , though , people say something original and sparkly about love , and as a reporter , I want to hug them . It 's like finding a diamond in the sand . As a sort of thank-you to all of those whom I have written about , and those whom I have written for , here are 10 of those jewels
INFO:tensorflow:Inference results OUTPUT: 主要 主要 主要 主要 主要 主要 主要 ζ ζ ζ ζ ζ ζ ζ ζ ...
Have you encountered this problem? We look forward to your answer. Thanks.
I have seen such terrible output when I tried to resume training of an averaged model (strangely, the approx_bleu was quite OK, but the real BLEU was almost 0). So once you create an averaged model you should use it only for decoding, not for further training. You should also make sure that all the averaged models are of the same type (they have exactly the same graph, with identical variable names, dimensions etc). I think there is no sanity check in avg_checkpoints.py to prevent e.g. averaging transformer_base with transformer_big.
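A sanity check of the kind described above could look roughly like the following sketch. It is not part of avg_checkpoints.py; `check_compatible` is a hypothetical helper, the checkpoints are assumed to be plain dicts of NumPy arrays, and the variable name and toy dimensions are made up for illustration.

```python
import numpy as np


def check_compatible(checkpoints):
    """Raise ValueError unless all checkpoints share variable names and shapes.

    Hypothetical helper: each checkpoint is a {variable_name: ndarray} dict.
    """
    reference = {name: value.shape for name, value in checkpoints[0].items()}
    for index, ckpt in enumerate(checkpoints[1:], start=1):
        shapes = {name: value.shape for name, value in ckpt.items()}
        if shapes.keys() != reference.keys():
            differing = sorted(set(shapes) ^ set(reference))
            raise ValueError(
                f"checkpoint {index}: variable names differ: {differing}")
        for name, shape in shapes.items():
            if shape != reference[name]:
                raise ValueError(
                    f"checkpoint {index}: {name} has shape {shape}, "
                    f"expected {reference[name]}")


# Toy stand-ins for a smaller and a larger model (dimensions are made up).
small = {"embedding": np.zeros((10, 4))}
large = {"embedding": np.zeros((10, 8))}

check_compatible([small, small])  # identical layouts: passes silently

try:
    check_compatible([small, large])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Running such a check before averaging would catch mixed model types early instead of producing a silently broken averaged model.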
BTW: based on my recent experiments:
@martinpopel Thanks for your answer. Here are the details of our experiment:
@njuzrs: In my experiments the 4 models were not actually fully independent. They were cloned from the original experiment at 120k, 320k and 653k steps. I evaluated the averaging at 256k, 900k and 1900k steps and the multiple models always improved the BLEU a bit. See my notes for details.
I expected fully independent models to be at least as good, but I have not tried that yet. It is possible that the random initialization makes a fundamental difference.
Oh, I don't think you can average the weights of independently initialized models; I wouldn't expect that to work. You can average their predictions, but if you want to average weights, the models need to start from the same random initialization and share at least the first few thousand training steps, so that they are reasonably close.
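One commonly cited reason why weight averaging can fail for unrelated models is permutation symmetry of hidden units; this explanation is an assumption added here, not something stated in the thread. The toy sketch below builds a small two-layer network and a copy with its hidden units permuted. Both compute the same function, so averaging their predictions is harmless, but averaging their weights yields a different function.

```python
import numpy as np

rng = np.random.default_rng(0)


def mlp(x, w1, w2):
    """Tiny two-layer network: ReLU hidden layer, linear output."""
    return np.maximum(x @ w1, 0.0) @ w2


# Model A: random weights (4 inputs, 8 hidden units, 3 outputs).
w1 = rng.normal(size=(4, 8))
w2 = rng.normal(size=(8, 3))

# Model B: the same network with its hidden units cyclically shifted.
perm = np.roll(np.arange(8), 1)
w1_p = w1[:, perm]
w2_p = w2[perm, :]

x = rng.normal(size=(5, 4))
out_a = mlp(x, w1, w2)
out_b = mlp(x, w1_p, w2_p)

# The two models are functionally identical.
same_function = np.allclose(out_a, out_b)

# Averaging predictions keeps the function unchanged.
pred_avg_matches = np.allclose((out_a + out_b) / 2, out_a)

# Averaging weights mixes unrelated hidden units and changes the function.
weight_avg_out = mlp(x, (w1 + w1_p) / 2, (w2 + w2_p) / 2)
weight_avg_matches = np.allclose(weight_avg_out, out_a)
```

Models that share an initialization and the early part of training keep their hidden units aligned, which is consistent with the advice above.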
@lukaszkaiser: Thanks for the clarification. All my "pseudo-independent" models shared at least the first 120k steps and were then trained independently for over 1M steps. Averaging those models worked, but the improvement was rather small: +0.2 BLEU.
@martinpopel @lukaszkaiser Thanks for your advice; it is consistent with my experiments.
@lukaszkaiser We now want to ensemble models for a Transformer problem. Does the code support model ensembling, or could you share any ideas on how to implement it? (We have read the code and found it to be highly integrated.)
In the new release 1.2.0, it's easy to call T2T models from a raw session, as in this test:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/trainer_utils_test.py#L96
You could use this to call a few models step-by-step, in each step feeding new targets and ensembling in python. It will be slow, but should be an easy way to try out the performance of an ensemble.
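The step-by-step approach can be sketched as follows. This is a framework-free illustration, not T2T code: each "model" is assumed to be a Python callable that takes the source and the target prefix decoded so far and returns a probability distribution over the vocabulary for the next token. The names `ensemble_greedy_decode`, `make_toy_model` and `EOS_ID` are hypothetical, and greedy search stands in for beam search to keep the example short.

```python
import numpy as np

EOS_ID = 0  # hypothetical end-of-sentence token id


def ensemble_greedy_decode(models, source, max_len=10):
    """Greedy decoding that averages next-token probabilities across models."""
    prefix = []
    for _ in range(max_len):
        # One call per model per step; this is what makes the approach slow.
        probs = np.mean([model(source, prefix) for model in models], axis=0)
        next_id = int(np.argmax(probs))
        if next_id == EOS_ID:
            break
        prefix.append(next_id)
    return prefix


def make_toy_model(table):
    """Build a fake model that returns a fixed distribution per step."""
    def model(source, prefix):
        step = min(len(prefix), len(table) - 1)
        return np.array(table[step])
    return model


# Two toy models over a 4-token vocabulary that disagree on the first token.
model_a = make_toy_model([[0.1, 0.6, 0.3, 0.0],
                          [0.2, 0.1, 0.7, 0.0],
                          [0.9, 0.05, 0.05, 0.0]])
model_b = make_toy_model([[0.1, 0.2, 0.7, 0.0],
                          [0.1, 0.1, 0.8, 0.0],
                          [0.8, 0.1, 0.1, 0.0]])

single = ensemble_greedy_decode([model_a], source=None)
ensembled = ensemble_greedy_decode([model_a, model_b], source=None)
```

In a real setup each callable would wrap a session call for one trained model. Averaging log-probabilities instead of probabilities is a common alternative.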
@njuzrs I think the models you combined are not in the same local optimum.
Hi @lukaszkaiser, I've implemented ensemble decoding following your proposal. Unfortunately, running a new session for each decoding step is indeed slow. I'm wondering whether there is a more appropriate way to implement ensemble decoding. Still trying...
@cshanbo Could you share your code for ensemble decoding? I would also like to try it. Thanks for your help!
I've already forgotten where I put that code. Even if I found it, the decoding speed is too slow to be useful (it followed Lukasz's idea). I hope someone can propose an efficient solution.