You guys have done a great job. Can you provide detailed hyperparameters for the 10h fine-tune in wav2vec 2.0? I don't know how to adjust the hyperparameters for the 10min, 1h and 10h datasets. Thanks a lot.
there's a table in appendix B of the paper that shows the differences between the various splits. in general you would just adjust --max-update, and then adjust --warmup-steps, --hold-steps, and --decay-steps so that they use 0.1/0.4/0.5 of max-update respectively. you then need to update --mask-prob and --mask-channel-prob. each prob would be mask-length * x, where x is the number in the table and mask-length is what you use for --mask-length (10 in the example) or --mask-channel-length.
so for example, for 10h we see that timestep mask prob should be 0.065, so we set --mask-prob to 10 * 0.065 = 0.65. channel mask prob is 0.004, so we set --mask-channel-prob to 64 * 0.004 = 0.256. then we set --max-update to 20000 and change --warmup-steps to 20000 * 0.1 = 2000, --hold-steps to 20000 * 0.4 = 8000 and --decay-steps to 20000 * 0.5 = 10000.
you can adjust the example for other splits following the same procedure.
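The procedure above can be sketched as a small helper that derives the flag values from max-update and the two table numbers. This is only an illustration of the arithmetic described in this thread: the function name `finetune_flags` and its parameters are hypothetical and not part of fairseq, the default mask lengths (10 and 64) are the ones used in the 10h example above, and the values for other splits still have to be read from appendix B.

```python
def finetune_flags(max_update, table_mask_prob, table_channel_mask_prob,
                   mask_length=10, mask_channel_length=64):
    """Derive fine-tuning flag values from max-update and the appendix B numbers.

    Hypothetical helper for illustration only; not part of fairseq.
    """
    # Schedule: 0.1 / 0.4 / 0.5 of max-update for warmup / hold / decay.
    warmup_steps = int(round(max_update * 0.1))
    hold_steps = int(round(max_update * 0.4))
    # Assumption: decay takes the remaining steps so the three phases sum to
    # max_update; this equals 0.5 * max_update up to rounding.
    decay_steps = max_update - warmup_steps - hold_steps

    return {
        "--max-update": max_update,
        "--warmup-steps": warmup_steps,
        "--hold-steps": hold_steps,
        "--decay-steps": decay_steps,
        # Flag value = mask length * number from the table.
        "--mask-prob": round(mask_length * table_mask_prob, 6),
        "--mask-channel-prob": round(mask_channel_length * table_channel_mask_prob, 6),
    }


if __name__ == "__main__":
    # 10h split, using the numbers quoted in this thread.
    flags = finetune_flags(20000, 0.065, 0.004)
    print(" ".join(f"{name} {value}" for name, value in flags.items()))
```

For another split, pass that split's max-update and table values; the mask lengths only need to change if you change --mask-length or --mask-channel-length.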
do you think it would be valuable to add examples for every split even though it will make the readme much longer?
Thank you for the explanation. I was able to figure out the masking parameters by reading the code and appendix B, but not the training schedule. In the readme, I would suggest providing this explanation and just the relevant command line arguments for the 10h example as you have here, with a reference to appendix B as a guide for other dataset sizes.
Thank you @alexeib