Horovod: PID KeyError in broadcast_optimizer_state

Created on 17 Feb 2020 · 3 comments · Source: horovod/horovod

Environment:

  1. Framework: PyTorch
  2. Framework version: torch==1.4.0, torchvision==0.5.0
  3. Horovod version: horovod==0.19.0
  4. MPI version: Open MPI 4.0.0
  5. CUDA version: 10.0
  6. NCCL version: 2.4.2-1+cuda10.0
  7. Python version: 3.7.6
  8. OS and version: Ubuntu 18.04
  9. GCC version: 7.4.0

Checklist:

  1. Did you search issues to find if somebody asked this question before?
  2. If your question is about hang, did you read this doc?
  3. If your question is about docker, did you read this doc?
  4. Did you check if your question is answered in the troubleshooting guide?

Bug report:
Please describe erroneous behavior you're observing and steps to reproduce it.

When calling broadcast_optimizer_state, we see KeyErrors on all processes as it tries to access the state_dict entry for an apparently nonexistent pid; an example stack trace is attached:

Traceback (most recent call last):
  ...
  File ************
    HVD.broadcast_optimizer_state(optimizer, root_rank=0)
  File "/usr/local/lib/python3.7/dist-packages/horovod/torch/__init__.py", line 572, in broadcast_optimizer_state
    param_state = state_dict['state'][pid]
KeyError: 140137585983888

This is running on a single multi-GPU instance, and all processes are failing the same way (although with different pids). Any pointers appreciated, thanks!
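
For context, a minimal standalone sketch (plain PyTorch, not Horovod code) of the mechanism behind this KeyError: optimizer state is keyed per parameter, and a parameter that never receives a gradient (for example a frozen one with requires_grad=False) never gets a state entry, so looking it up by its key fails the way the traceback shows. The variable names here are illustrative only:

import torch

# Hypothetical illustration, not Horovod code: mimic the lookup that fails
# inside broadcast_optimizer_state.
frozen = torch.nn.Parameter(torch.zeros(3), requires_grad=False)
trainable = torch.nn.Parameter(torch.zeros(3))
opt = torch.optim.SGD([trainable, frozen], lr=0.01, momentum=0.9)

trainable.grad = torch.ones(3)
opt.step()  # momentum state is created for `trainable` only

sd = opt.state_dict()
for pid in sd['param_groups'][0]['params']:
    param_state = sd['state'][pid]  # KeyError on the frozen parameter's key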

bug

All 3 comments

This script seems to consistently reproduce the error in our environment:

import torch
import torchvision

import horovod.torch as HVD

HVD.init()
torch.cuda.set_device(HVD.local_rank())  # pin each process to its own GPU
torch.cuda.manual_seed(20)

# keypointrcnn_resnet50_fpn freezes part of its ResNet backbone, so some
# parameters have requires_grad=False
MODEL = torchvision.models.detection.keypointrcnn_resnet50_fpn()

optimizer = torch.optim.SGD(
    MODEL.parameters(), lr=0.001, weight_decay=1e-6, momentum=0.9, nesterov=True
)

# Fails here with a KeyError on every rank
HVD.broadcast_optimizer_state(optimizer, root_rank=0)
optimizer = HVD.DistributedOptimizer(
    optimizer, named_parameters=MODEL.named_parameters(), backward_passes_per_step=1
)

Thanks for raising this issue @kangp3 and thanks for the repro @amnda-d. Looks like this bug was introduced by #1609, where we fixed this method to skip setting state on params that do not require grads, but didn't fix the attempt to broadcast those params.

#1726 should fix the issue, feel free to try it out and let me know how it goes.

Thanks!
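
For illustration only, a minimal hypothetical sketch of the kind of guard described above: skip parameters that never received optimizer state instead of assuming every pid has an entry. This is not the actual #1726 patch, and broadcast_safe_state_items is not a Horovod function:

def broadcast_safe_state_items(state_dict):
    # Hypothetical helper, not Horovod API: yield (pid, state) pairs while
    # skipping params that have no optimizer state (e.g. requires_grad=False),
    # so a broadcast loop never hits a missing key.
    for group in state_dict['param_groups']:
        for pid in group['params']:
            param_state = state_dict['state'].get(pid)
            if param_state is None:
                continue  # frozen param: nothing to broadcast
            yield pid, param_state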

What model are you training? I ran into this problem while training Faster R-CNN.
