As the name implies, can you provide a performance comparison between pre-norm and post-norm Transformers on a machine translation dataset?
What are pre-norm and post-norm?
You can refer to the ACL 2019 paper https://arxiv.org/abs/1906.01787, whose implementation is based on fairseq.
Have a look at https://github.com/wangqiangneu/dlcl to reproduce the results.
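On the earlier question of what the two terms mean: they differ only in where layer normalization sits relative to the residual connection. Here is a minimal dependency-free sketch, not the code from the paper or from fairseq. `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer, and the learned gain and bias of layer normalization are left out for brevity.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (roughly) unit variance.

    Simplification: no learned gain or bias.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    # Post-norm: add the residual first, then normalize the sum.
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def pre_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    # Pre-norm: normalize only the sublayer input; the residual path
    # carries x through unchanged.
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


def toy_sublayer(x: Vector) -> Vector:
    # Hypothetical stand-in for self-attention or a feed-forward network.
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_norm_block(x, toy_sublayer)
pre_out = pre_norm_block(x, toy_sublayer)
```

In the post-norm block the output is always normalized, while in the pre-norm block the input reaches the output through an identity path. As far as I know, fairseq's Transformer switches between the two with the `--encoder-normalize-before` and `--decoder-normalize-before` options, but check this against the version you are using.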