Environment:
https://github.com/horovod/horovod/blob/master/horovod/torch/__init__.py#L174
If you look at this documentation, when using gradient clipping there is duplicated synchronization: the gradients are synchronized once when `synchronize()` is called and again when `step()` is called.
After following this documentation for gradient clipping, my training is slower.
Any solutions?
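For context, the pattern from the linked documentation looks roughly like the sketch below (the model, data, and clipping norm are placeholders, not the exact code from my training script):

```python
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()

# Placeholder model/data; any model wrapped with DistributedOptimizer hits the same path.
model = torch.nn.Linear(10, 2)
data, target = torch.randn(32, 10), torch.randint(0, 2, (32,))

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

loss = F.nll_loss(F.log_softmax(model(data), dim=1), target)
loss.backward()

optimizer.synchronize()                                           # first gradient allreduce
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip synchronized gradients
optimizer.step()                                                  # currently synchronizes a second time
```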
@ildoonet, thanks for raising this! This regression has been introduced by #597.
While we're thinking about a proper solution, can you set `optimizer._requires_update = set()` after `DistributedOptimizer` wraps the original optimizer, like this?
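A minimal sketch of that workaround (the model and optimizer construction are placeholders):

```python
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Temporary workaround: clear this internal set so that step() does not
# trigger a second synchronization after synchronize() has already run.
optimizer._requires_update = set()
```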
@alsrgv Thanks for the tip. I will try it and wait for the proper solution.
@ildoonet, the fix has been merged into master. You can reinstall Horovod from master (or wait a bit for 0.16.3) and use `.step(synchronize=False)`, as the new documentation prescribes.
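With that fix, gradient clipping looks roughly like this (the model, data, and clipping norm below are placeholders):

```python
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(10, 2)  # placeholder model
data, target = torch.randn(32, 10), torch.randint(0, 2, (32,))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

optimizer.zero_grad()
loss = F.nll_loss(F.log_softmax(model(data), dim=1), target)
loss.backward()

optimizer.synchronize()                                           # single gradient allreduce
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip synchronized gradients
optimizer.step(synchronize=False)                                 # apply update without synchronizing again
```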