Apex: _amp_state determines whether running in distributed at import

Created on 22 Nov 2019 · 4 Comments · Source: NVIDIA/apex

It looks like the _amp_state.py module determines whether PyTorch is running in distributed mode at import time. The resulting distributed flag seems to be used only in maybe_print. See code snippet:

https://github.com/NVIDIA/apex/blob/37cdaf4ad57ab4e7dd9ef13dbed7b29aa939d061/apex/amp/_amp_state.py#L38-L52

This causes a couple of issues:

1. It only supports the env:// initialization of torch distributed.
2. It fails if amp is imported before distributed training is launched.

Neither of these is an issue for most users, since most people launch via torch.distributed.launch. However, it can be a problem if you define your own distributed launch function or use torch.multiprocessing.spawn. I can't see a good reason to do it this way, since the variable appears to be used only in the maybe_print function. I'll submit a pull request to fix this. Let me know if I'm missing something.
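To illustrate the second problem, here is a minimal sketch that contrasts a flag computed at import time with one computed at call time. It assumes the import-time check reads the WORLD_SIZE environment variable, which is a guess consistent with the env:// remark above and not a copy of the linked snippet. The names world_size_from_env, DISTRIBUTED_AT_IMPORT, and is_distributed_now are hypothetical and do not come from apex.

```python
import os


def world_size_from_env():
    # Assumed stand-in for an env://-style check: read WORLD_SIZE, default 1.
    return int(os.environ.get("WORLD_SIZE", "1"))


# Start from a clean environment so the example is deterministic.
os.environ.pop("WORLD_SIZE", None)

# Import-time pattern: the flag is frozen at the moment this line runs.
DISTRIBUTED_AT_IMPORT = world_size_from_env() > 1


def is_distributed_now():
    # Call-time pattern: re-evaluated on every call, so a launcher that
    # configures the environment after import is still seen.
    return world_size_from_env() > 1


# Simulate a custom launcher that sets up the environment after the import.
os.environ["WORLD_SIZE"] = "4"

print(DISTRIBUTED_AT_IMPORT)  # False: the stale import-time value
print(is_distributed_now())   # True: the call-time check sees the update
```

Deferring the check to call time covers only the ordering problem. Supporting initialization methods other than env:// would additionally require asking torch.distributed itself rather than reading the environment.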

rmrao

Most helpful comment

Ok, should be fixed by https://github.com/NVIDIA/apex/commit/f37fdf07367a71521bd14fec66153e0996ad128c.

mcarilli on 3 Dec 2019

All 4 comments

Hi @rmrao, @mcarilli. Thanks for the contribution and this amazing package!

However, I ran into a problem when running amp.initialize(...). Here is the traceback:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-04044ff3ed13> in <module>
      8 criterion = nn.NLLLoss()
      9 
---> 10 model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
     11 
     12 lr_finder = LRFinder(model, optimizer, criterion, device='cuda')

c:\users\nale\anaconda3\envs\py37\lib\site-packages\apex\amp\frontend.py in initialize(models, optimizers, enabled, opt_level, cast_model_type, patch_torch_functions, keep_batchnorm_fp32, master_weights, loss_scale, cast_model_outputs, num_losses, verbosity, min_loss_scale, max_loss_scale)
    326     else:
    327         _amp_state.opt_properties = opt_levels[opt_level](_amp_state.opt_properties)
--> 328         maybe_print("Selected optimization level {}".format(opt_levels[opt_level].brief), True)
    329         maybe_print("Defaults for this optimization level are:", True)
    330         for k, v in _amp_state.opt_properties.options.items():

c:\users\nale\anaconda3\envs\py37\lib\site-packages\apex\amp\_amp_state.py in maybe_print(msg, rank0)
     37 
     38 def maybe_print(msg, rank0=False):
---> 39     distributed = torch.distributed.is_initialized() and \
     40         torch.distributed.get_world_size() > 1
     41     if _amp_state.verbosity > 0:

AttributeError: module 'torch.distributed' has no attribute 'is_initialized'

(Windows 10, Python 3.7.3, torch 1.3.0, CUDA 10.1, cuDNN 7.6.5)

The likely cause is that torch._C has no _c10d_init attribute in this build, so torch/distributed/distributed_c10d.py is not imported. That is why torch.distributed.is_initialized() is not available (see also torch/distributed/__init__.py#L12-L17).

To solve this, I suggest adding a guard that checks whether torch.distributed is available before calling torch.distributed.is_initialized().

A possible patch:

# apex/apex/amp/_amp_state.py
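def maybe_print(msg, rank0=False):
    # Check is_available() first: when it is False, the short-circuiting
    # `and` skips is_initialized(), which does not exist in such builds.
    distributed = torch.distributed.is_available() and \
        torch.distributed.is_initialized() and \
        torch.distributed.get_world_size() > 1
    # ... rest of maybe_print unchanged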

I've tried this patch and it works on my machine. I'd appreciate it if you could review this issue. Thanks!
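As a rough way to see why the guard order matters, the sketch below exercises the same three-step check against stand-in objects, so it runs without PyTorch. The helper is_distributed and both SimpleNamespace objects are hypothetical illustrations, not part of apex or torch. The first stand-in mimics a build where is_available() returns False and is_initialized is missing, as in the traceback above.

```python
from types import SimpleNamespace


def is_distributed(dist):
    """Return True only if `dist` is available, initialized, and has >1 process.

    `dist` is expected to look like torch.distributed. Because `and`
    short-circuits, is_initialized() is never looked up when
    is_available() returns False.
    """
    return (dist.is_available()
            and dist.is_initialized()
            and dist.get_world_size() > 1)


# Hypothetical stand-in for a build without distributed support:
# is_available() exists, but is_initialized() does not.
no_dist_build = SimpleNamespace(is_available=lambda: False)

# Hypothetical stand-in for an initialized 4-process job.
four_procs = SimpleNamespace(
    is_available=lambda: True,
    is_initialized=lambda: True,
    get_world_size=lambda: 4,
)

print(is_distributed(no_dist_build))  # False, and no AttributeError
print(is_distributed(four_procs))     # True
```

With PyTorch installed, the same helper could presumably be called as is_distributed(torch.distributed).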