I see that only the approx_bleu_score is sent to TensorBoard. How can I evaluate the real BLEU?
What's the difference between approx_bleu_score and the real BLEU?
How can I evaluate the real BLEU?
Use t2t-bleu.
What's the difference between approx_bleu_score and the real BLEU?
The main difference is that approx_bleu is computed on the internal subwords rather than on words, so it is not replicable (not comparable with other models) and is not suitable for reporting in publications.
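To make the subword point concrete, here is a minimal sketch of how the same hypothesis gets different n-gram precisions depending on the tokenization. It is not the tensor2tensor implementation: the `SUBWORDS` table, the `segment` helper and the example sentences are all made up for illustration, and only the clipped n-gram precision (one ingredient of BLEU) is computed.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hyp, ref, n):
    """Clipped n-gram precision of hyp against a single reference."""
    hyp_counts = ngrams(hyp, n)
    ref_counts = ngrams(ref, n)
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0

# Hypothetical subword segmentation (assumption, not a real T2T vocabulary).
SUBWORDS = {
    "unbelievable": ["un", "believ", "able"],
    "unbelievably": ["un", "believ", "ably"],
}

def segment(words):
    """Split each word into its subwords; unknown words stay whole."""
    out = []
    for w in words:
        out.extend(SUBWORDS.get(w, [w]))
    return out

ref_words = "that was unbelievable".split()
hyp_words = "that was unbelievably".split()

# Word level: the last word is simply wrong.
word_p1 = clipped_precision(hyp_words, ref_words, 1)  # 2/3
word_p2 = clipped_precision(hyp_words, ref_words, 2)  # 0.5

# Subword level: the wrong word still earns partial credit.
sub_p1 = clipped_precision(segment(hyp_words), segment(ref_words), 1)  # 0.8
sub_p2 = clipped_precision(segment(hyp_words), segment(ref_words), 2)  # 0.75
```

The subword scores are higher here, and they would change again with a different subword vocabulary, which is why a subword-level number cannot be compared across models.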
Another problem is that this evaluation feeds the gold previous tokens to the autoregressive decoder instead of the model's own predictions, which is a kind of cheating.
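A toy sketch of why gold previous tokens flatter the score, assuming a made-up "model" that is just a lookup on the last token (`TABLE`, `predict_next` and the sequences are hypothetical, not T2T code). With gold prefixes a single mistake stays a single mistake; when the model consumes its own output, the mistake propagates.

```python
GOLD = ["a", "b", "c", "d"]

# Hypothetical model: next token depends only on the previous one.
# It is wrong exactly once: after "a" it predicts "x" instead of "b".
TABLE = {"<s>": "a", "a": "x", "b": "c", "c": "d", "x": "y", "y": "z"}

def predict_next(prefix):
    """Predict the next token from the last token of the prefix."""
    return TABLE[prefix[-1]]

def teacher_forced(gold):
    """Predict every position from the gold prefix (as approx_bleu does)."""
    return [predict_next(["<s>"] + gold[:i]) for i in range(len(gold))]

def free_running(length):
    """Predict every position from the model's own previous outputs."""
    out = ["<s>"]
    for _ in range(length):
        out.append(predict_next(out))
    return out[1:]

def accuracy(preds, gold):
    """Fraction of positions where the prediction equals the gold token."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

forced = teacher_forced(GOLD)    # ['a', 'x', 'c', 'd']
free = free_running(len(GOLD))   # ['a', 'x', 'y', 'z']

forced_acc = accuracy(forced, GOLD)  # 0.75
free_acc = accuracy(free, GOLD)      # 0.25
```

Real decoding corresponds to the free-running case, so a score computed with gold prefixes tends to overestimate translation quality.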
See #407, #522 and #436 for more details.
@nadavb Are there still questions left? Otherwise I think we could close that issue :)
@martinpopel thanks.