Maskrcnn-benchmark: Why still use "from torch.distributed import deprecated as dist" in PyTorch v1.0?

Created on 29 Oct 2018  ·  9 Comments  ·  Source: facebookresearch/maskrcnn-benchmark

Are there any bugs in the new version of the torch.distributed package?

When I try to switch to the new torch.nn.parallel.DistributedDataParallel interface, I get this error:

    site-packages/torch/nn/parallel/distributed.py", line 298, in distributed_data_parallel_hook
        bucket[bucket_offset] = param.grad.data
    IndexError: list assignment index out of range
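
For context, here is a minimal sketch of the pattern that triggers it (illustrative, single-process gloo setup, not my actual training code): a model with a frozen parameter wrapped in the new DistributedDataParallel.

    import torch
    import torch.distributed as dist
    import torch.nn as nn

    # Single-process process group, only so that DDP can be constructed.
    dist.init_process_group(
        backend="gloo", init_method="tcp://127.0.0.1:29500",
        rank=0, world_size=1,
    )

    model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 1))
    model[0].weight.requires_grad = False  # frozen, e.g. a pretrained backbone

    ddp_model = nn.parallel.DistributedDataParallel(model)
    loss = ddp_model(torch.randn(4, 10)).sum()
    loss.backward()  # the backward pass raised the IndexError above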

Need help, thanks!

Labels: dependency, bug

All 9 comments

Hi,

This is a great question!

Indeed, there is currently a bug in the new torch.distributed package that prevents us from using it in our models. I've run into exactly the same error as you.

I've already notified @teng-li and @pietern (the creators of the new distributed library for PyTorch) about the problem, and for now we need to use the deprecated distributed package.
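
For reference, the fallback looks roughly like this (a sketch assuming PyTorch 1.0's deprecated module layout, not the exact repo code):

    # Fall back to the deprecated distributed modules (PyTorch 1.0 layout).
    from torch.distributed import deprecated as dist
    from torch.nn.parallel.deprecated import DistributedDataParallel

    dist.init_process_group(backend="nccl", init_method="env://")
    # model = DistributedDataParallel(model)  # instead of the new wrapper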

Thank you very much!

This indeed causes some inconvenience. After reading the source code of torch/nn/parallel/distributed.py, I tentatively solved it by changing line 195:

            for idx, param_tuple in enumerate(zip(*param_buckets_tuple)):
                if not param_tuple[0].requires_grad:
                    continue
                for p in param_tuple:
                    # line 195 / original:
                    # self.bucket_map[p] = (bucket_idx, idx)
                    # fixed: use the bucket's running size as the offset instead
                    self.bucket_map[p] = (bucket_idx, self.bucket_sizes[bucket_idx])
                self.bucket_sizes[bucket_idx] += 1

So you changed

    self.bucket_map[p] = (bucket_idx, idx)

to

    self.bucket_map[p] = (bucket_idx, self.bucket_sizes[bucket_idx])

?

This is interesting. I haven't had a closer look at the new distributed backend, but if this solves the problem for you, it might be worth sending a PR to PyTorch and seeing whether the tests pass.

My training program is still running (without this error), but I think the cause of this bug is that, in the original code, setting a parameter's requires_grad to False leads to a wrong offset being stored in bucket_map: when the loop hits the statement if not param_tuple[0].requires_grad: continue, the enumerate index idx keeps advancing past the skipped parameter, so the next parameter in that bucket is assigned offset idx instead of the correct idx - 1 (which is exactly what self.bucket_sizes[bucket_idx] tracks). The out-of-range offset then causes IndexError: list assignment index out of range at bucket[bucket_offset] = param.grad.data.
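
As a standalone illustration of the off-by-one (toy bookkeeping, not the actual PyTorch code):

    # One bucket slot per parameter that requires grad, but the buggy
    # offset uses enumerate's idx, which also counts the skipped params.
    params = [("p0", False), ("p1", True), ("p2", True)]  # (name, requires_grad)

    bucket_size = 0
    buggy, fixed = {}, {}
    for idx, (name, requires_grad) in enumerate(params):
        if not requires_grad:
            continue  # skipped, but idx keeps counting it
        buggy[name] = idx          # p1 -> 1, p2 -> 2
        fixed[name] = bucket_size  # p1 -> 0, p2 -> 1
        bucket_size += 1

    print(buggy)  # {'p1': 1, 'p2': 2} -- offset 2 overflows a 2-slot bucket
    print(fixed)  # {'p1': 0, 'p2': 1}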

I think this is a reasonable assumption.

It would be a great contribution to PyTorch if, once you validate that this is indeed the cause, you pushed your fix to PyTorch, with a test case.
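
Something along these lines could serve as the test (a sketch only, assuming the bucket_map / bucket_sizes attributes from the version of distributed.py quoted above, and not written in PyTorch's actual multi-process test harness):

    import torch
    import torch.distributed as dist
    import torch.nn as nn

    def test_bucket_offsets_with_frozen_params():
        # Before the fix, a frozen parameter made later offsets overflow
        # their bucket; every stored offset must stay within the bucket size.
        dist.init_process_group(
            backend="gloo", init_method="tcp://127.0.0.1:29501",
            rank=0, world_size=1,
        )
        model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
        for p in model[0].parameters():
            p.requires_grad = False  # freeze the first layer
        ddp = nn.parallel.DistributedDataParallel(model)
        for bucket_idx, offset in ddp.bucket_map.values():
            assert offset < ddp.bucket_sizes[bucket_idx]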

@unlimblue Thanks for the analysis and the explanation; it sounds plausible to me.

For those who hit this issue with an old PyTorch version:
I had PyTorch 1.0.0.dev20181022 and still encountered this issue at #364.
I updated (reinstalled) PyTorch to the latest version and it works now.

Yes, you need to have a recent version of PyTorch in order for the fix to be present.
