Horovod: is there a way to identify the slow rank without dumping the timeline?

Created on 28 Feb 2019 · 1 Comment · Source: horovod/horovod

Our training job sometimes slows down, and we suspect the cause is a problem on one of the workers. However, Horovod only reports a missing rank when one worker lags behind by more than 60s. Is there a way to find the slowest rank at runtime?

contribution welcome enhancement

All comments

We have a plan to implement such slow worker auto-detection, but it's not implemented yet.
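
Until something like that exists, one possible workaround is to time the local part of each step on every worker, gather the timings, and flag outliers yourself. This is only a rough sketch, not a Horovod feature: `find_slow_ranks`, `gather_step_times`, and the 1.5x-median threshold are hypothetical names and choices made for illustration. It assumes `horovod.torch` with `hvd.init()` already called. The time you measure should cover only work that involves no collective communication (for example data loading and the forward pass), because anything that waits on an allreduce looks equally slow on all ranks.

```python
from statistics import median


def find_slow_ranks(step_times, threshold=1.5):
    """Return (rank, seconds) pairs whose time exceeds threshold * median.

    step_times[i] is the local step time reported by rank i.
    The threshold is an arbitrary illustrative choice.
    """
    med = median(step_times)
    return [(rank, t) for rank, t in enumerate(step_times) if t > threshold * med]


def gather_step_times(local_seconds):
    """Collect every rank's local step time into one list, ordered by rank.

    Assumes horovod.torch is installed and hvd.init() has been called.
    The imports are kept local so the helper above works without Horovod.
    """
    import torch
    import horovod.torch as hvd

    local = torch.tensor([local_seconds], dtype=torch.float64)
    # allgather concatenates the per-rank tensors along the first dimension.
    return hvd.allgather(local).tolist()


# Sketch of use inside a training loop (commented out; needs a Horovod job):
#
#   start = time.time()
#   ...load batch, run forward pass...
#   local_seconds = time.time() - start
#   times = gather_step_times(local_seconds)
#   if hvd.rank() == 0:
#       for rank, seconds in find_slow_ranks(times):
#           print(f"rank {rank} looks slow: {seconds:.3f}s")
```

The extra allgather adds a small synchronization cost, so it probably makes sense to run the check only every N steps. On GPU, the measured time may also need a `torch.cuda.synchronize()` before reading the clock to be meaningful.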
