Apex: AttributeError: module 'torch.distributed' has no attribute 'deprecated'

Created on 12 Aug 2019 · 42 Comments · Source: NVIDIA/apex

Hi!

I get this error on Windows 10 with torch 1.2.0 as soon as I run import apex.

I think my system just does not support this, but failing at import time is not good behavior.

metya

All 42 comments

Hi @metya,

could you post the complete stack trace, so that we can see where this deprecated attribute is coming from?

ptrblck on 12 Aug 2019

Sure!
Sorry, I didn't think to include it.

>>> import apex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\metya\Anaconda3\lib\site-packages\apex\__init__.py", line 4, in <module>
    from . import parallel
  File "D:\metya\Anaconda3\lib\site-packages\apex\parallel\__init__.py", line 8, in <module>
    ReduceOp = torch.distributed.deprecated.reduce_op
AttributeError: module 'torch.distributed' has no attribute 'deprecated'

metya on 12 Aug 2019

Thanks for letting us know!
We'll look into it.

ptrblck on 12 Aug 2019

I found this thread after getting the same error when upgrading to torch 1.2.0. It seems that for some (weird) reason torch.distributed does not have the attribute ReduceOp (even though the documentation lists it). The error goes away after downgrading to torch 1.1.0.

asbe on 15 Aug 2019
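
One way to see exactly what a given build exposes is to probe the module for the names apex tries to use. The following is only an illustrative sketch: `probe_attributes` is a hypothetical helper, and `fake_dist` is a stand-in object so the example runs without PyTorch installed. On a real system you would pass `torch.distributed` instead.

```python
import types


def probe_attributes(module, names):
    """Return a dict mapping each name to whether `module` exposes it."""
    return {name: hasattr(module, name) for name in names}


# Names that apex/parallel/__init__.py looks up, per the stack traces in this thread.
CANDIDATES = ("ReduceOp", "reduce_op", "deprecated")

# Stand-in for a build without distributed support (hypothetical, for illustration only).
fake_dist = types.SimpleNamespace(is_available=lambda: False)

report = probe_attributes(fake_dist, CANDIDATES)
print(report)
```

With PyTorch installed, the equivalent call would be `probe_attributes(torch.distributed, CANDIDATES)` after `import torch.distributed`.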

@asbe Do you get this error on a Linux or Windows machine?

ptrblck on 15 Aug 2019

Win10, Cuda 10. Torch installed using the pip wheel suggested by the selector on pytorch.org

asbe on 15 Aug 2019

Same here.

win10, cuda10.0, vs2017, pytorch1.2.0

elepherai on 16 Aug 2019

Same here.
pytorch 1.3.0 built from source, windows 10 pro, cuda 10.1, vs2019.

helson73 on 19 Aug 2019

@metya @asbe @helson73
Could you try to build apex from this branch and see if this error disappears?
https://github.com/ptrblck/apex/tree/apex_no_distributed

ptrblck on 19 Aug 2019

Same here: CUDA 10.0.1, VS2017, PyTorch 1.2.0, Win10.

tuboxin on 20 Aug 2019

D:\MachineLearning\Maskrcnn_benchMark\lib\site-packages\apex\__init__.py in <module>
      3 import warnings
      4 
----> 5 from . import parallel
      6 from . import amp
      7 from . import fp16_utils

D:\MachineLearning\Maskrcnn_benchMark\lib\site-packages\apex\parallel\__init__.py in <module>
      6     ReduceOp = torch.distributed.reduce_op
      7 else:
----> 8     ReduceOp = torch.distributed.deprecated.reduce_op
      9 
     10 from .distributed import DistributedDataParallel, Reducer

AttributeError: module 'torch.distributed' has no attribute 'deprecated'
The problem is still there.

tuboxin on 20 Aug 2019

@tuboxin did you try to build from my branch or the current master branch?

ptrblck on 20 Aug 2019

Same error here while trying the dcgan example.

python 3.7.3, pyTorch master (torch-1.3.0a0+6ce6939), centos7, cuda10.1, cudnn 7.6.1

mfuntowicz on 20 Aug 2019

@mfuntowicz Could you try to build apex using this branch: https://github.com/NVIDIA/apex/issues/429#issuecomment-522604591 ?

ptrblck on 20 Aug 2019

@ptrblck Yes, I built from your apex branch!

tuboxin on 21 Aug 2019

@ptrblck
Hi, I have tried the branch but got the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\__init__.py", line 7, in <module>
    from . import amp
  File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\amp\__init__.py", line 1, in <module>
    from .amp import init, half_function, float_function, promote_function,\
  File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\amp\amp.py", line 2, in <module>
    from .handle import AmpHandle, NoOpHandle
  File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\amp\handle.py", line 11, in <module>
    from ..parallel.LARC import LARC
  File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\parallel\__init__.py", line 8, in <module>
    ReduceOp = torch.distributed.deprecated.reduce_op
AttributeError: module 'torch.distributed' has no attribute 'deprecated'

helson73 on 21 Aug 2019

@ptrblck Something went wrong with the PyTorch build I compiled, which led to distributed not being included.

It's fixed and it works as expected. Sorry for the false positive.

mfuntowicz on 21 Aug 2019

@ptrblck I think the error I showed above is not caused by apex.
For some reason, the PyTorch I installed on my Windows machine does not contain reduce_op at all (both 1.3 compiled from source and 1.2 installed from conda).
I guess that for multi-GPU parallel usage PyTorch needs NCCL, but NCCL does not support Windows, so how could PyTorch on Windows run in multi-GPU mode in the first place? It's so weird.

helson73 on 21 Aug 2019

@helson73, @tuboxin, @asbe, @metya
I've updated my branch with some more fixes. Could you try it again please?
https://github.com/ptrblck/apex/tree/apex_no_distributed

ptrblck on 27 Aug 2019

❤1

Hey @ptrblck, same setup as the others, except for python3.6
I think

ReduceOp = torch.distributed.deprecated.reduce_op

will always fail in PyTorch 1.2.0, since they've removed torch.distributed.deprecated. That being said, it's unclear why torch.distributed does not have a ReduceOp attribute when imported.

Here's a pretty minimal check on the command line for the things you try to grab in __init__.py for apex.

Looks like @helson73 may be right about it being a PyTorch problem.

jacob-mink on 28 Aug 2019
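
To illustrate the lookup order discussed above, here is a hedged sketch of a resolver that returns None instead of raising when none of the names exist. It is not the code from apex or from the branch: `resolve_reduce_op` is a hypothetical helper, and the three stand-in objects only imitate different builds so the example runs without PyTorch.

```python
import types


def resolve_reduce_op(dist):
    """Return the first reduce-op container `dist` exposes, or None if there is none."""
    for name in ("ReduceOp", "reduce_op"):
        if hasattr(dist, name):
            return getattr(dist, name)
    # Fall back to the legacy submodule; per this thread it is absent in PyTorch 1.2.0.
    legacy = getattr(dist, "deprecated", None)
    return getattr(legacy, "reduce_op", None)


# Stand-ins imitating three kinds of builds (hypothetical values, for illustration only).
modern = types.SimpleNamespace(ReduceOp="ReduceOp-enum")
legacy = types.SimpleNamespace(
    deprecated=types.SimpleNamespace(reduce_op="legacy-reduce_op")
)
stripped = types.SimpleNamespace()  # e.g. a build without distributed support

for label, dist in (("modern", modern), ("legacy", legacy), ("stripped", stripped)):
    print(label, "->", resolve_reduce_op(dist))
```

A caller can then treat a None result as "distributed reductions are unavailable" rather than crashing on import.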

This OP is missing in the PyTorch binaries for Windows, since (if I'm not mistaken) Windows does not support (some) distributed setups.
Therefore I've built PyTorch from source, manually disabling the distributed option, so that I can run into the same errors.
My current branch should guard these imports with if torch.distributed.is_available(), so that this operator shouldn't be visible at all.

ptrblck on 28 Aug 2019
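
A rough sketch of the kind of guard described in the comment above, assuming the only signal used is `torch.distributed.is_available()`. This is not the code in the branch: both helper names are hypothetical, the list of feature groups is a simplification, and the stand-in objects replace `torch` so the example runs without PyTorch.

```python
import types


def distributed_is_available(torch_module):
    """True only if `torch_module.distributed.is_available()` exists and returns True."""
    dist = getattr(torch_module, "distributed", None)
    check = getattr(dist, "is_available", None)
    return bool(check()) if callable(check) else False


def optional_exports(torch_module):
    """List which feature groups a package could expose for this build (simplified)."""
    # Assumption for this sketch: these two groups do not need torch.distributed.
    exports = ["amp", "fp16_utils"]
    if distributed_is_available(torch_module):
        exports.append("parallel")  # only touched when distributed support is present
    return exports


# Stand-ins for a build without and with distributed support (illustration only).
no_dist = types.SimpleNamespace(
    distributed=types.SimpleNamespace(is_available=lambda: False)
)
with_dist = types.SimpleNamespace(
    distributed=types.SimpleNamespace(is_available=lambda: True)
)

print(optional_exports(no_dist))
print(optional_exports(with_dist))
```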

@jacob-mink This would of course also mean that you won't be able to use the distributed package on your machine, but should be able to use other utilities like mixed precision training etc.
Let me know, if my current branch still tries to import this operator and I'll have another look.

ptrblck on 28 Aug 2019

Right, I understand. I think your current branch is still trying to import that operator. Specifically, in apex\parallel\__init__.py I'm seeing an attempt to create a ReduceOp variable that throws the same error.

jacob-mink on 28 Aug 2019

Thanks for the information!
Could you post the whole stack trace, so that I can fix it?

ptrblck on 28 Aug 2019

@ptrblck here you go!

Traceback (most recent call last):
  File "train.py", line 23, in <module>
    from apex.parallel import DistributedDataParallel
  File "C:\Users\user\Anaconda3\envs\ml-pt-apex-test\lib\site-packages\apex\parallel\__init__.py", line 8, in <module>
    ReduceOp = torch.distributed.deprecated.reduce_op
AttributeError: module 'torch.distributed' has no attribute 'deprecated'

jacob-mink on 28 Aug 2019

👍1

OK, I see.
Unfortunately, if your system does not support distributed training, you won't be able to use DDP.
Could you try to remove this import and all occurrences of DDP in your script?
Is the apex and amp import working so far?

ptrblck on 28 Aug 2019
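
Instead of deleting every DDP occurrence by hand, a script could wrap the model only when a process group is actually usable. The sketch below assumes that approach. `maybe_wrap` is a hypothetical helper, `wrapper` stands for whatever DDP class the script would otherwise use, and the stand-in objects replace `torch.distributed` so the example runs without PyTorch.

```python
import types


def maybe_wrap(model, dist, wrapper):
    """Wrap `model` with `wrapper` only when distributed is available and initialized."""
    is_available = getattr(dist, "is_available", None)
    if not (callable(is_available) and is_available()):
        return model  # build without distributed support: leave the model untouched
    is_initialized = getattr(dist, "is_initialized", None)
    if not (callable(is_initialized) and is_initialized()):
        return model  # no process group was set up: run as a plain single process
    return wrapper(model)


# Stand-ins for illustration only; a real script would pass torch.distributed.
windows_like = types.SimpleNamespace(is_available=lambda: False)
cluster_like = types.SimpleNamespace(
    is_available=lambda: True, is_initialized=lambda: True
)


def fake_ddp(model):
    """Placeholder for a DDP wrapper class."""
    return ("wrapped", model)


print(maybe_wrap("model", windows_like, fake_ddp))
print(maybe_wrap("model", cluster_like, fake_ddp))
```

In a real script the DDP import itself would also need to sit inside the distributed branch, since the stack trace above shows the import alone failing on affected builds.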

👍1

Ah, got it. So, at that level, it's a different issue.
Now when I use your branch w/ PyTorch 1.2.0, removing all DDP, it works!

jacob-mink on 28 Aug 2019

👍2

@ptrblck I installed apex from your repository. It works correctly now. Thanks.
Windows 10, Python 3.6, PyTorch 1.2, cuda10.0

youweiliang on 1 Sep 2019

👍1

@ptrblck ~~Tried your branch, not work:~~ works!

Python 3.5.2 (default, Nov 12 2018, 13:43:14) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import apex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/apex/__init__.py", line 5, in <module>
    from . import parallel
  File "/usr/local/lib/python3.5/dist-packages/apex/parallel/__init__.py", line 8, in <module>
    ReduceOp = torch.distributed.deprecated.reduce_op
AttributeError: module 'torch.distributed' has no attribute 'deprecated'
>>>

I am using PyTorch 1.3 built from the latest source.

jinfagang on 10 Sep 2019

@jinfagang Line 5 in my repository points to a check using torch.distributed.is_available(), not the import which raised the error in your stack trace.
Could you make sure to use my branch and run the code again?

ptrblck on 10 Sep 2019
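
When a stack trace does not match the branch you think you installed, it can help to check which copy of the package Python actually imports. A small sketch using only the standard library follows. `installed_location` is a hypothetical helper name, and `json` is used purely so the example runs on any machine.

```python
import importlib.util


def installed_location(package_name):
    """Return the file the import system would load for `package_name`, or None."""
    spec = importlib.util.find_spec(package_name)
    return None if spec is None else spec.origin


# A standard-library package, so this line works in any environment.
print(installed_location("json"))

# Prints None when apex is not installed in the current environment.
print(installed_location("apex"))
```

If the reported path points at an old site-packages copy, uninstalling and reinstalling from the intended branch is the next step.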

Sorry, it works!

jinfagang on 10 Sep 2019

👍1

If you've already installed apex, remove it:

pip uninstall apex
rm -rf apex

Reinstall from the apex_no_distributed branch of @ptrblck's fork of apex:

git clone https://github.com/ptrblck/apex.git
cd apex
git checkout apex_no_distributed
pip install -v --no-cache-dir ./

DannyDannyDanny on 27 Sep 2019

👍2

I installed apex from the apex_no_distributed branch, and it worked for pytorch 1.2. Now that I have updated to pytorch 1.3, I get the following error: AttributeError: module 'torch.nn' has no attribute 'backends'
traceback:

AttributeError                            Traceback (most recent call last)
<ipython-input-11-70eeb9444499> in <module>
      3             validate=True, # Evaluate the model after each epoch
      4             schedule_type="warmup_cosine",
----> 5             optimizer_type="adamw")

~\Anaconda3\envs\d-learn\lib\site-packages\fast_bert\learner_cls.py in fit(self, epochs, lr, validate, schedule_type, optimizer_type)
    182             except ImportError:
    183                 raise ImportError('Please install apex to use fp16 training')
--> 184             self.model, optimizer = amp.initialize(self.model, optimizer, opt_level=self.fp16_opt_level)
    185 
    186         # Get scheduler

~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\frontend.py in initialize(models, optimizers, enabled, opt_level, cast_model_type, patch_torch_functions, keep_batchnorm_fp32, master_weights, loss_scale, cast_model_outputs, num_losses, verbosity, min_loss_scale, max_loss_scale)
    355         maybe_print("{:22} : {}".format(k, v), True)
    356 
--> 357     return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
    358 
    359 

~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\_initialize.py in _initialize(models, optimizers, properties, num_losses, cast_model_outputs)
    239     if properties.patch_torch_functions:
    240         # handle is unused here. It's accessible later through a global value anyway.
--> 241         handle = amp_init(loss_scale=properties.loss_scale, verbose=(_amp_state.verbosity == 2))
    242         for optimizer in optimizers:
    243             # Disable Amp casting for the optimizer step, because it should only be

~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\amp.py in init(enabled, loss_scale, enable_caching, verbose, allow_banned)
     99             try_caching = (cast_fn == utils.maybe_half)
    100             wrap.cached_cast(module.MODULE, fn, cast_fn, handle,
--> 101                              try_caching, verbose)
    102 
    103     # 1.5) Pre-0.4, put the blacklist methods on HalfTensor and whitelist

~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\wrap.py in cached_cast(mod, fn, cast_fn, handle, try_caching, verbose)
     31 def cached_cast(mod, fn, cast_fn, handle,
     32                 try_caching=False, verbose=False):
---> 33     if not utils.has_func(mod, fn):
     34         return
     35 

~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\utils.py in has_func(mod, fn)
    130 
    131 def has_func(mod, fn):
--> 132     if isinstance(mod, torch.nn.backends.backend.FunctionBackend):
    133         return fn in mod.function_classes
    134     elif isinstance(mod, dict):

AttributeError: module 'torch.nn' has no attribute 'backends'

DanyalAndriano on 16 Oct 2019
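
The last frame of the trace fails because `torch.nn.backends` is looked up unconditionally. As a hedged illustration (not the fix that went into apex), a tolerant variant could resolve the class defensively. `resolve_dotted` is a hypothetical helper, the extra `torch_module` parameter is added only so the example runs without PyTorch, and the two fallback branches are assumptions, because the trace cuts off after the `dict` check.

```python
import types


def resolve_dotted(root, dotted_path):
    """Follow `dotted_path` from `root`, returning None if any attribute is missing."""
    obj = root
    for part in dotted_path.split("."):
        obj = getattr(obj, part, None)
        if obj is None:
            return None
    return obj


def has_func(mod, fn, torch_module):
    """Variant of the check in the trace that tolerates a missing torch.nn.backends."""
    backend_cls = resolve_dotted(torch_module, "nn.backends.backend.FunctionBackend")
    if backend_cls is not None and isinstance(mod, backend_cls):
        return fn in mod.function_classes
    if isinstance(mod, dict):
        return fn in mod          # assumed behavior for dict-like modules
    return hasattr(mod, fn)       # assumed fallback for ordinary modules


# Stand-in for a torch build where torch.nn has no `backends` attribute.
newer_torch = types.SimpleNamespace(nn=types.SimpleNamespace())

print(has_func({"relu": object()}, "relu", newer_torch))
print(has_func(types.SimpleNamespace(relu=object()), "tanh", newer_torch))
```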

@DanyalAndriano Try reinstalling apex from the master branch. I had the same issue and solved it by reinstalling apex.

YuryBolkonsky on 17 Oct 2019

@YuryBolkonsky thanks. I did eventually do this and it works with pytorch 1.2, but with 1.3 I get an error when using fp16_opt_level "O1". I get the same error as reported here https://github.com/kaushaltrivedi/fast-bert/issues/90.

DanyalAndriano on 17 Oct 2019

@ptrblck I installed apex from your repository. But it is still not working.
Windows 10, Python 3.7.3, PyTorch 1.3.1, cuda10.0

va26 on 19 Nov 2019

@DanyalAndriano Are you using a high-level wrapper on top of PyTorch+apex?
If so, could you post a code snippet showing your use case?

@va26 @BramVanroy was kind enough to create a PR based on my branch, which was merged in https://github.com/NVIDIA/apex/pull/531. Could you reinstall apex from master and retry, please?

ptrblck on 13 Jan 2020

❤1

@ptrblck I am. I'm using the fast-bert library, a fast.ai-inspired library for Hugging Face's transformers. Here is the code that runs on 1.2, but not on 1.3:

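import logging

import torch
from box import Box
from fast_bert.data_cls import BertDataBunch
from fast_bert.learner_cls import BertLearner
from fast_bert.metrics import accuracy

# DATA_PATH, LABEL_PATH and OUTPUT_DIR are not shown in this snippet.

args = Box({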
    "run_text": "multiclass sentiment v2",
    "task_name": 'sentiment_classification_corrected_data',
    "model_type": 'bert',
    "model_name": 'bert-base-uncased',
    "do_lower_case": True,
    "max_grad_norm": 1.0,
    "train_batch_size": 8,
    "eval_batch_size": 16,
    "max_seq_length": 256,
    "learning_rate": 4e-5,
    "warmup_proportion": 0.002,
    "gradient_accumulation_steps": 8,
    "fp16": True,
    "fp16_opt_level": "O1",
    "eval_all_checkpoints": True,
    "overwrite_output_dir": True,
    "warmup_steps": 500,
    "logging_steps": 50,
    "overwrite_cache": True, # make sure to over write previous datasets
    "seed": 45
})

databunch = BertDataBunch(DATA_PATH, 
                          LABEL_PATH, 
                          tokenizer=args.model_name, 
                          train_file='train.csv', 
                          val_file='val.csv',
                          text_col="text", 
                          label_file='labels.csv',
                          label_col='label',
                          multi_label=False,
                          batch_size_per_gpu=args['train_batch_size'], 
                          max_seq_length=args['max_seq_length'], 
                          multi_gpu=True)

logger = logging.getLogger()
device_cuda = torch.device("cuda")

metrics_list = []
metrics_list.append({'name': 'accuracy', 'function': accuracy})

# Create learner object
learner = BertLearner.from_pretrained_model(databunch, 
                                            pretrained_path=args.model_name,
                                            metrics=metrics_list, 
                                            device=device_cuda, 
                                            logger=logger, 
                                            fp16_opt_level=args.fp16_opt_level,
                                            output_dir=OUTPUT_DIR, 
                                            finetuned_wgts_path=None, 
                                            warmup_steps=args.warmup_steps,
                                            grad_accumulation_steps=args.gradient_accumulation_steps,
                                            multi_gpu=True, 
                                            is_fp16=True, 
                                            multi_label=False,  
                                            logging_steps=200)   

learner.fit(epochs=1,
            lr=4e-5,
            validate=True, # Evaluate the model after each epoch
            schedule_type="warmup_cosine",
            optimizer_type="adamw")

# save model
learner.save_model()

DanyalAndriano on 13 Jan 2020

The error reported over at https://github.com/kaushaltrivedi/fast-bert/issues/90 seems to be completely different than OP, though? The linked issue is about the scheduler/optimizer step. The current issue that we are in now is about torch.distributed specifically.

BramVanroy on 13 Jan 2020

@BramVanroy It may be different... I've linked it because the error on this report (above) only came up with setting fp16_opt_level to O1, so I thought it may be related.

DanyalAndriano on 13 Jan 2020

Can you post the full error trace again?

BramVanroy on 13 Jan 2020

I have the same problem. Can I downgrade torch to work around it?