It looks like the _amp_state.py module determines whether PyTorch is running in distributed mode at import time. The resulting distributed variable only seems to be used in maybe_print. See code snippet:
This causes a couple issues:
env:// initialization of torch distributed
Neither of these is an issue for most, since most people launch via torch.distributed.launch. However, it can be an issue if you define your own distributed launch function or use torch.multiprocessing.spawn. I can't see a good reason to do it this way anyway, as it appears this variable is only used in the maybe_print function. I'll submit a pull request to fix this. Let me know if I'm missing something though.
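To illustrate the concern with custom launchers, here is a minimal sketch of an import-time check next to a call-time check. It assumes the problem is that a value computed at import cannot reflect a process group that is set up later. `fake_dist` is a hypothetical stand-in for torch.distributed so the sketch runs without PyTorch, and `distributed_now` is an illustrative name, not an apex function.

```python
import types

# Hypothetical stand-in for torch.distributed (assumption: not the real module).
fake_dist = types.SimpleNamespace(initialized=False, world_size=1)
fake_dist.is_available = lambda: True
fake_dist.is_initialized = lambda: fake_dist.initialized
fake_dist.get_world_size = lambda: fake_dist.world_size

# Import-time style: evaluated once, before any process group exists.
DISTRIBUTED_AT_IMPORT = fake_dist.is_initialized() and fake_dist.get_world_size() > 1


def distributed_now(dist=fake_dist):
    """Call-time style: re-evaluated on every call, so late initialization is seen."""
    return dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1


# Simulate a custom launcher (or torch.multiprocessing.spawn) that initializes
# the process group only after the module above has already been imported.
fake_dist.initialized = True
fake_dist.world_size = 4

print(DISTRIBUTED_AT_IMPORT)  # still reflects the state at import time
print(distributed_now())      # reflects the current state
```

Moving the check inside the function costs a few cheap calls per print, which seems acceptable given the value is only needed by maybe_print.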
Hi @rmrao, @mcarilli. Thanks for the contribution and this amazing package!
But I ran into a problem when running amp.initialize(...). Here is the traceback:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-04044ff3ed13> in <module>
8 criterion = nn.NLLLoss()
9
---> 10 model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
11
12 lr_finder = LRFinder(model, optimizer, criterion, device='cuda')
c:\users\nale\anaconda3\envs\py37\lib\site-packages\apex\amp\frontend.py in initialize(models, optimizers, enabled, opt_level, cast_model_type, patch_torch_functions, keep_batchnorm_fp32, master_weights, loss_scale, cast_model_outputs, num_losses, verbosity, min_loss_scale, max_loss_scale)
326 else:
327 _amp_state.opt_properties = opt_levels[opt_level](_amp_state.opt_properties)
--> 328 maybe_print("Selected optimization level {}".format(opt_levels[opt_level].brief), True)
329 maybe_print("Defaults for this optimization level are:", True)
330 for k, v in _amp_state.opt_properties.options.items():
c:\users\nale\anaconda3\envs\py37\lib\site-packages\apex\amp\_amp_state.py in maybe_print(msg, rank0)
37
38 def maybe_print(msg, rank0=False):
---> 39 distributed = torch.distributed.is_initialized() and \
40 torch.distributed.get_world_size() > 1
41 if _amp_state.verbosity > 0:
AttributeError: module 'torch.distributed' has no attribute 'is_initialized'
(on Windows10, Python 3.7.3, torch 1.3.0, cuda 10.1, cudnn 7.6.5)
The likely cause is that torch._C has no _c10d_init attribute, so torch/distributed/distributed_c10d.py does not get imported.
That is why torch.distributed.is_initialized() is not available (see also torch/distributed/__init__.py#L12-L17).
To solve this issue, I would suggest adding a guard that checks whether torch.distributed is available before calling torch.distributed.is_initialized().
A possible patch might look like this:
# apex/apex/amp/_amp_state.py
def maybe_print(msg, rank0=False):
    distributed = torch.distributed.is_available() and \
        torch.distributed.is_initialized() and \
        torch.distributed.get_world_size() > 1
I've tried this patch and it worked fine on my machine. And I'd appreciate it if you can review this issue, thanks!
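For anyone wondering why the order of the checks matters, here is a small sketch that runs without PyTorch. The two `SimpleNamespace` objects are hypothetical stand-ins for torch.distributed on different builds, and `is_distributed` is an illustrative helper, not part of apex. Because `and` short-circuits, a build that reports is_available() as False never reaches the missing is_initialized attribute.

```python
import types


def is_distributed(dist):
    """Return True only for an available, initialized, multi-process group."""
    return bool(
        dist.is_available()
        and dist.is_initialized()
        and dist.get_world_size() > 1
    )


# Hypothetical stand-in for a build without distributed support:
# it exposes is_available() but has no is_initialized attribute at all.
no_dist_build = types.SimpleNamespace(is_available=lambda: False)

# Hypothetical stand-in for a build with an initialized 2-process group.
two_rank_build = types.SimpleNamespace(
    is_available=lambda: True,
    is_initialized=lambda: True,
    get_world_size=lambda: 2,
)

print(is_distributed(no_dist_build))   # is_initialized is never touched
print(is_distributed(two_rank_build))
```

Without the is_available() guard, the first call would fail with the same kind of AttributeError as in the traceback above.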
Thanks @NaleRaphael, this worked for me too.
Ok, should be fixed by https://github.com/NVIDIA/apex/commit/f37fdf07367a71521bd14fec66153e0996ad128c.
@mcarilli , thanks a lot!