Our training job sometimes slows down, and we suspect the cause is a problem in one of the workers. However, Horovod only reports a missing rank when a worker lags behind by more than 60 seconds. Is there a way to find the slowest rank at runtime?
We plan to implement this kind of automatic slow-worker detection, but it is not implemented yet.
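Until then, one possible manual workaround is to time each rank's local work yourself and compare the numbers. The sketch below only shows the comparison step. It assumes you have already collected one duration per rank into a list, for example by wrapping each rank's measured time in a one-element tensor and calling `hvd.allgather` on it. The function name `find_slow_ranks`, the `threshold` parameter, and the sample timings are hypothetical and not part of Horovod.

```python
from statistics import median


def find_slow_ranks(step_times, threshold=1.5):
    """Return (slowest_rank, stragglers) from per-rank step durations.

    step_times[i] is the time in seconds that rank i spent on its local
    work. Measure it before any collective op, because collectives make
    every rank wait for the slowest one and hide the difference.
    """
    if not step_times:
        raise ValueError("step_times must not be empty")
    # Rank with the largest local duration (first one wins on ties).
    slowest = max(range(len(step_times)), key=lambda r: step_times[r])
    # Flag every rank that is clearly slower than the typical rank.
    typical = median(step_times)
    stragglers = [r for r, t in enumerate(step_times) if t > threshold * typical]
    return slowest, stragglers


# Hypothetical timings gathered from 4 ranks (made-up values).
times = [0.51, 0.49, 1.32, 0.50]
slowest, stragglers = find_slow_ranks(times)
print(f"slowest rank: {slowest}, stragglers: {stragglers}")
```

Comparing against the median rather than the mean keeps a single very slow rank from shifting the baseline. Gathering on every step adds a small collective of its own, so it may be worth doing only every N steps.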