Could you share instructions to (roughly) reproduce the ensembled pretraining runs on GLUE?
[Mainly for Yinhan Liu, @myleott et al.]
Sure, so for all the GLUE tasks we ran 15 different seeds with the selected hyperparameters and ensembled the top 7 models based on dev-set metrics. For WNLI we ensembled 5 models.
For all the classification tasks, the ensemble averaged the models' probabilities. For STS-B it averaged the scores.
Let me know if you are looking for anything more specific.
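Since the original ensembling code isn't released, here is a minimal sketch of what the recipe described above might look like; the function names and NumPy-based implementation are my own assumptions, not fairseq code:

```python
import numpy as np

def select_top_k(dev_metrics, k=7):
    """Indices of the k models with the best dev-set metric (higher is better)."""
    order = sorted(range(len(dev_metrics)), key=lambda i: dev_metrics[i], reverse=True)
    return order[:k]

def ensemble_classification(prob_list):
    """Average per-model class probabilities, then take the argmax.

    prob_list: list of arrays of shape (num_examples, num_classes),
    one per model in the ensemble.
    """
    avg_probs = np.mean(np.stack(prob_list), axis=0)
    return avg_probs.argmax(axis=-1)

def ensemble_regression(score_list):
    """Average per-model scores (e.g. STS-B similarity predictions)."""
    return np.mean(np.stack(score_list), axis=0)
```

So for an MNLI-style task you would run 15 seeds, call `select_top_k` on their dev accuracies, and feed the chosen models' softmax outputs to `ensemble_classification`; for STS-B you would average the raw scores instead.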
Thanks! Is there code handy to do the ensembling? No problem if not; that sounds easy enough to reproduce.
On Tue, Aug 6, 2019 at 5:31 PM ngoyal2707 notifications@github.com wrote:
> Sure, so for all the GLUE tasks, we ran 15 different seeds for the selected hyperparameters and ensembled the top 7 models based on dev-set metrics. For WNLI we ensembled 5 models. For all the classification tasks, the ensemble averaged probabilities; for STS-B it averaged scores. Let me know if you are looking for anything more specific.
The ensembling code is pretty intertwined with our scheduling system, and we don't have plans to release it. But we are definitely happy to help debug any issues.
I will close this issue. Feel free to reopen if you have more questions.