Environment:
torch==1.4.0
torchvision==0.5.0
horovod==0.19.0
Bug report:
Please describe erroneous behavior you're observing and steps to reproduce it.
When calling broadcast_optimizer_state, we see KeyErrors on all processes when accessing the optimizer state_dict for apparently nonexistent parameter ids (pid); an example stack trace is attached:
Traceback (most recent call last):
...
File ************
HVD.broadcast_optimizer_state(optimizer, root_rank=0)
File "/usr/local/lib/python3.7/dist-packages/horovod/torch/__init__.py", line 572, in broadcast_optimizer_state
param_state = state_dict['state'][pid]
KeyError: 140137585983888
This is running on a single multi-GPU instance, and all processes are failing the same way (although with different pids). Any pointers appreciated, thanks!
This script seems to consistently reproduce the error in our environment:
import torch
import torchvision
import horovod.torch as HVD

HVD.init()
torch.cuda.set_device(HVD.local_rank())
torch.cuda.manual_seed(20)

MODEL = torchvision.models.detection.keypointrcnn_resnet50_fpn()
optimizer = torch.optim.SGD(
    MODEL.parameters(), lr=0.001, weight_decay=1e-6, momentum=0.9, nesterov=True
)
HVD.broadcast_optimizer_state(optimizer, root_rank=0)
optimizer = HVD.DistributedOptimizer(
    optimizer, named_parameters=MODEL.named_parameters(), backward_passes_per_step=1
)
Thanks for raising this issue @kangp3 and thanks for the repro @amnda-d. Looks like this bug was introduced by #1609, where we fixed this method to skip setting state on params that do not require grads, but didn't fix the attempt to broadcast those params.
Thanks!
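To make the failure mode concrete, here is a minimal plain-Python sketch of the mismatch described above. The class and function names are illustrative stand-ins, not Horovod's actual internals: state entries are built only for parameters that require grads, while the broadcast loop still visits every parameter.

```python
# Plain-Python stand-in for the mismatch; not Horovod's real code.
class Param:
    def __init__(self, requires_grad):
        self.requires_grad = requires_grad

params = [Param(True), Param(False)]  # second parameter is frozen

# State entries are created only for params that require grads
# (mirroring the #1609 change) ...
state = {id(p): {"momentum_buffer": None} for p in params if p.requires_grad}

# ... but the broadcast loop still visits *all* params:
def broadcast_all(params, state):
    for p in params:
        param_state = state[id(p)]  # raises KeyError for the frozen param
        # (the actual broadcast of param_state would happen here)

try:
    broadcast_all(params, state)
except KeyError as err:
    print("KeyError:", err)  # same shape of failure as the reported trace
```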
What model are you training? I encountered this problem when training Faster R-CNN.
#1726 should fix the issue; feel free to try it out and let me know how it goes.
Thanks!
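Until the fix is merged, one possible workaround (our assumption, not confirmed by the maintainers) is to construct the optimizer from only the parameters that require gradients, so every parameter the broadcast visits has a state entry. A plain-Python sketch of that filtering, where `FakeParam` and `trainable_parameters` are hypothetical names standing in for torch objects:

```python
# Workaround sketch (untested assumption): keep only grad-requiring
# parameters so optimizer state keys line up with what is broadcast.
def trainable_parameters(named_params):
    """Yield (name, param) pairs whose param requires gradients."""
    for name, p in named_params:
        if getattr(p, "requires_grad", False):
            yield name, p

# Plain-Python stand-ins so the filter can be shown without torch:
class FakeParam:
    def __init__(self, requires_grad):
        self.requires_grad = requires_grad

named = [("backbone.w", FakeParam(False)), ("head.w", FakeParam(True))]
kept = list(trainable_parameters(named))
print([n for n, _ in kept])  # prints ['head.w']
```

With real torch modules, the same idea amounts to passing a filtered parameter list to the optimizer constructor instead of `MODEL.parameters()`.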