Horovod: Duplicated Synchronization with Gradient Clipping?

Created on 28 May 2019 · 3 comments · Source: horovod/horovod

Environment:

  1. Framework: pytorch
  2. Framework version: 1.0
  3. Horovod version: latest
  4. MPI version: -
  5. CUDA version: -
  6. NCCL version: -
  7. Python version: 3.6
  8. OS and version: Ubuntu 16.04

https://github.com/horovod/horovod/blob/master/horovod/torch/__init__.py#L174

According to this documentation, when using gradient clipping the gradients are synchronized twice: once when 'synchronize' is called, and again when 'step' is called.
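For reference, here is a minimal sketch of that documented pattern (the tiny model, optimizer, and data below are placeholders, not part of the original report):

```python
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()

# Illustrative stand-in model and data.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

data, target = torch.randn(32, 10), torch.randint(0, 2, (32,))

optimizer.zero_grad()
loss = F.cross_entropy(model(data), target)
loss.backward()

# The documented clipping recipe: synchronize() averages the gradients
# across workers, then clipping runs on the averaged gradients ...
optimizer.synchronize()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

# ... but step() triggers the allreduce machinery a second time, which
# is the duplicated synchronization this issue describes.
optimizer.step()
```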

After I followed this documentation with gradient clipping, training became slower.

Any solutions?

bug

All 3 comments

@ildoonet, thanks for raising this! This regression was introduced by #597.

While we're thinking about a proper solution, can you set optimizer._requires_update = set() after DistributedOptimizer wraps the original optimizer, like this?
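A minimal sketch of that workaround (model and base_optimizer stand in for your own objects; only the last line is the suggested change):

```python
import horovod.torch as hvd

optimizer = hvd.DistributedOptimizer(
    base_optimizer, named_parameters=model.named_parameters())

# Temporary workaround: clear the wrapper's internal bookkeeping so that
# step() no longer re-synchronizes gradients that were already
# synchronized explicitly via optimizer.synchronize().
optimizer._requires_update = set()
```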

@alsrgv Thanks for the tip. I will try it and wait for the proper fix.

@ildoonet, the fix was merged into master. You can reinstall Horovod from master (or wait a bit for 0.16.3) and use .step(synchronize=False), as the new documentation prescribes.
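For completeness, a sketch of the post-fix recipe, continuing the setup from the sketch above (the synchronize=False argument comes from this comment; the rest is illustrative):

```python
optimizer.zero_grad()
loss = F.cross_entropy(model(data), target)
loss.backward()

# Average gradients across workers once, clip the averaged gradients,
# then tell step() not to synchronize a second time.
optimizer.synchronize()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step(synchronize=False)
```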

