Maskrcnn-benchmark: Why still use "from torch.distributed import deprecated as dist" in PyTorch v1.0?

Created on 29 Oct 2018  ·  9 Comments  ·  Source: facebookresearch/maskrcnn-benchmark

Are there any bugs in the new version of the torch.distributed package?

When I try to switch to the new torch.nn.parallel.DistributedDataParallel interface, I get this error:

    site-packages/torch/nn/parallel/distributed.py", line 298, in distributed_data_parallel_hook
        bucket[bucket_offset] = param.grad.data
    IndexError: list assignment index out of range
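
For context, here is a minimal sketch of the pattern that triggers it (illustrative, single-process gloo setup, not my actual training code): a model with a frozen parameter wrapped in the new DistributedDataParallel.

    import torch
    import torch.distributed as dist
    import torch.nn as nn

    # Single-process process group, only so that DDP can be constructed.
    dist.init_process_group(
        backend="gloo", init_method="tcp://127.0.0.1:29500",
        rank=0, world_size=1,
    )

    model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 1))
    model[0].weight.requires_grad = False  # frozen, e.g. a pretrained backbone

    ddp_model = nn.parallel.DistributedDataParallel(model)
    loss = ddp_model(torch.randn(4, 10)).sum()
    loss.backward()  # the backward pass raised the IndexError above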

Need help, thanks!

Labels: dependency, bug

All 9 comments

Hi,

This is a great question!

Indeed, there is currently a bug in the new torch.distributed package that prevents us from using it in our models. I've run into exactly the same error as you.

I've already notified @teng-li and @pietern (the creators of the new distributed library for PyTorch) about the problem, and for now we need to use the deprecated distributed package.
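
For reference, the fallback looks roughly like this (a sketch assuming PyTorch 1.0's deprecated module layout, not the exact repo code):

    # Fall back to the deprecated distributed modules (PyTorch 1.0 layout).
    from torch.distributed import deprecated as dist
    from torch.nn.parallel.deprecated import DistributedDataParallel

    dist.init_process_group(backend="nccl", init_method="env://")
    # model = DistributedDataParallel(model)  # instead of the new wrapper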

Thank you very much!

This indeed causes some inconvenience. After reading the source code of torch/nn/parallel/distributed.py, I tentatively solved it by changing line 195:

            for idx, param_tuple in enumerate(zip(*param_buckets_tuple)):
                if not param_tuple[0].requires_grad:
                    continue
                for p in param_tuple:
                    # line 195 / original:
                    # self.bucket_map[p] = (bucket_idx, idx)
                    # fixed: use the bucket's running size as the offset instead
                    self.bucket_map[p] = (bucket_idx, self.bucket_sizes[bucket_idx])
                self.bucket_sizes[bucket_idx] += 1

So you changed

    self.bucket_map[p] = (bucket_idx, idx)

to

    self.bucket_map[p] = (bucket_idx, self.bucket_sizes[bucket_idx])

?

This is interesting. I haven't had a closer look at the new distributed backend, but if this solves the problem for you, it might be worth sending a PR to PyTorch and seeing whether the tests pass.

My training program is still running (without this error), but I think the cause of this bug is that, in the original code, setting a parameter's requires_grad to False leads to a wrong offset being stored in bucket_map: when the loop hits the statement if not param_tuple[0].requires_grad: continue, the enumerate index idx keeps advancing past the skipped parameter, so the next parameter in that bucket is assigned offset idx instead of the correct idx - 1 (which is exactly what self.bucket_sizes[bucket_idx] tracks). The out-of-range offset then causes IndexError: list assignment index out of range at bucket[bucket_offset] = param.grad.data.
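
As a standalone illustration of the off-by-one (toy bookkeeping, not the actual PyTorch code):

    # One bucket slot per parameter that requires grad, but the buggy
    # offset uses enumerate's idx, which also counts the skipped params.
    params = [("p0", False), ("p1", True), ("p2", True)]  # (name, requires_grad)

    bucket_size = 0
    buggy, fixed = {}, {}
    for idx, (name, requires_grad) in enumerate(params):
        if not requires_grad:
            continue  # skipped, but idx keeps counting it
        buggy[name] = idx          # p1 -> 1, p2 -> 2
        fixed[name] = bucket_size  # p1 -> 0, p2 -> 1
        bucket_size += 1

    print(buggy)  # {'p1': 1, 'p2': 2} -- offset 2 overflows a 2-slot bucket
    print(fixed)  # {'p1': 0, 'p2': 1}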

I think this is a reasonable assumption.

It would be a great contribution to PyTorch if, once you validate that this is indeed the cause, you pushed your fix to PyTorch, with a test case.
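
Something along these lines could serve as the test (a sketch only, assuming the bucket_map / bucket_sizes attributes from the version of distributed.py quoted above, and not written in PyTorch's actual multi-process test harness):

    import torch
    import torch.distributed as dist
    import torch.nn as nn

    def test_bucket_offsets_with_frozen_params():
        # Before the fix, a frozen parameter made later offsets overflow
        # their bucket; every stored offset must stay within the bucket size.
        dist.init_process_group(
            backend="gloo", init_method="tcp://127.0.0.1:29501",
            rank=0, world_size=1,
        )
        model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
        for p in model[0].parameters():
            p.requires_grad = False  # freeze the first layer
        ddp = nn.parallel.DistributedDataParallel(model)
        for bucket_idx, offset in ddp.bucket_map.values():
            assert offset < ddp.bucket_sizes[bucket_idx]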

@unlimblue Thanks for the analysis and the explanation; it sounds plausible to me.

For those who hit this issue with an old PyTorch version:
I had PyTorch 1.0.0.dev20181022 and still encountered this issue at #364.
I updated (reinstalled) PyTorch to the latest version and it works now.

Yes, you need to have a recent version of PyTorch in order for the fix to be present.
