Apex: AttributeError: module 'torch.distributed' has no attribute 'deprecated'

Created on 12 Aug 2019 · 42 Comments · Source: NVIDIA/apex

Hi!

I get this error on Windows 10 with torch=1.2.0 when I simply import apex.

I think my system just doesn't support this, but failing on import is not good behavior.

Most helpful comment

If you've already installed apex, remove it:

pip uninstall apex
rm -rf apex

Reinstall from the apex_no_distributed branch of @ptrblck's fork of apex:

git clone https://github.com/ptrblck/apex.git
cd apex
git checkout apex_no_distributed
pip install -v --no-cache-dir ./

All 42 comments

Hi @metya,

could you post the complete stack trace so that we can see where this deprecated attribute is coming from?

Sure!
Sorry, I didn't think of it.

>>> import apex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\metya\Anaconda3\lib\site-packages\apex\__init__.py", line 4, in <module>
    from . import parallel
  File "D:\metya\Anaconda3\lib\site-packages\apex\parallel\__init__.py", line 8, in <module>
    ReduceOp = torch.distributed.deprecated.reduce_op
AttributeError: module 'torch.distributed' has no attribute 'deprecated'

Thanks for letting us know!
We'll look into it.

I found this thread after getting the same error when upgrading to torch=1.2.0. It seems that for some (weird) reason torch.distributed does not have the attribute ReduceOp (even though the documentation says it should). The error goes away after downgrading to torch=1.1.0.

@asbe Do you get this error on a Linux or Windows machine?

Win10, Cuda 10. Torch installed using the pip wheel suggested by the selector on pytorch.org

Same here.

win10, cuda10.0, vs2017, pytorch1.2.0

Same here.
pytorch 1.3.0 built from source, windows 10 pro, cuda 10.1, vs2019.

@metya @asbe @helson73
Could you try to build apex from this branch and see, if this error disappears?
https://github.com/ptrblck/apex/tree/apex_no_distributed

Same here: cuda10.0.1, vs2017, pytorch1.2.0, win10.

D:\MachineLearning\Maskrcnn_benchMark\lib\site-packages\apex\__init__.py in <module>
      3 import warnings
      4 
----> 5 from . import parallel
      6 from . import amp
      7 from . import fp16_utils

D:\MachineLearning\Maskrcnn_benchMark\lib\site-packages\apex\parallel\__init__.py in <module>
      6     ReduceOp = torch.distributed.reduce_op
      7 else:
----> 8     ReduceOp = torch.distributed.deprecated.reduce_op
      9 
     10 from .distributed import DistributedDataParallel, Reducer

AttributeError: module 'torch.distributed' has no attribute 'deprecated'
The problem is still there.

@tuboxin did you try to build from my branch or the current master branch?

Same error here while trying the dcgan example.

python 3.7.3, pyTorch master (torch-1.3.0a0+6ce6939), centos7, cuda10.1, cudnn 7.6.1

@mfuntowicz Could you try to build apex using this branch: https://github.com/NVIDIA/apex/issues/429#issuecomment-522604591 ?

@ptrblck Yes, I used your apex branch!

@ptrblck
Hi, I have tried the branch but got the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\__init__.py", line 7, in <module>
    from . import amp
  File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\amp\__init__.py", line 1, in <module>
    from .amp import init, half_function, float_function, promote_function,\
  File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\amp\amp.py", line 2, in <module>
    from .handle import AmpHandle, NoOpHandle
  File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\amp\handle.py", line 11, in <module>
    from ..parallel.LARC import LARC
  File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\parallel\__init__.py", line 8, in <module>
    ReduceOp = torch.distributed.deprecated.reduce_op
AttributeError: module 'torch.distributed' has no attribute 'deprecated'

@ptrblck Something went wrong with the PyTorch build I compiled, which led to distributed not being included.

It's fixed and it works as expected. Sorry for the false positive.

@ptrblck I think the error I showed above is not caused by apex.
For some reason, the PyTorch I installed on my Windows machine does not contain reduce_op at all (both 1.3 compiled from source and 1.2 installed from conda).
I guess PyTorch needs NCCL for multi-GPU parallelism, but NCCL does not support Windows, so how could PyTorch on Windows run in multi-GPU mode in the first place? It's so weird.
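
Whether a given PyTorch build ships NCCL, or any distributed backend at all, can be checked at runtime. The sketch below is only an illustration and was not posted in the thread: available_backends and BACKEND_PROBES are hypothetical names, and each probe function is looked up with getattr on the assumption that a build without distributed support may expose little more than is_available().

```python
import importlib

# Probe functions that torch.distributed offers in builds with distributed
# support. Each one is looked up defensively because it may be absent.
BACKEND_PROBES = {
    "nccl": "is_nccl_available",
    "gloo": "is_gloo_available",
    "mpi": "is_mpi_available",
}


def available_backends(dist):
    """Return the sorted names of backends that `dist` reports as available."""
    is_available = getattr(dist, "is_available", None)
    if is_available is None or not is_available():
        return []  # no distributed support in this build (or no torch at all)
    found = []
    for backend, probe_name in BACKEND_PROBES.items():
        probe = getattr(dist, probe_name, None)
        if probe is not None and probe():
            found.append(backend)
    return sorted(found)


try:
    dist = importlib.import_module("torch.distributed")
except ImportError:
    dist = None  # torch is not installed in this environment

print("distributed backends:", available_backends(dist))
```

An empty list means the build cannot do distributed training at all, which matches the Windows reports in this thread.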

@helson73, @tuboxin, @asbe, @metya
I've updated my branch with some more fixes. Could you try it again please?
https://github.com/ptrblck/apex/tree/apex_no_distributed

Hey @ptrblck, same setup as the others, except for python3.6
I think

ReduceOp = torch.distributed.deprecated.reduce_op

will always fail in PyTorch 1.2.0, since they've removed torch.distributed.deprecated. That being said, it's unclear why torch.distributed does not have a ReduceOp attribute when imported.

Here's a pretty minimal check on the command line for the things you try to grab in __init__.py for apex.
[image]
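
The screenshot itself is not preserved here. A check of this kind might look like the sketch below, which is an assumption about its content rather than a copy of it. report_attributes and NAMES_TO_CHECK are hypothetical names; the attribute names are the ones mentioned in this thread.

```python
import importlib

# Attribute names of torch.distributed that come up in this thread.
NAMES_TO_CHECK = ("is_available", "ReduceOp", "reduce_op", "deprecated")


def report_attributes(module, names=NAMES_TO_CHECK):
    """Map each name to True/False depending on whether `module` has it."""
    return {name: hasattr(module, name) for name in names}


try:
    dist = importlib.import_module("torch.distributed")
except ImportError:
    dist = None  # torch is not installed in this environment

if dist is None:
    print("torch is not installed")
else:
    for name, present in report_attributes(dist).items():
        print(f"torch.distributed.{name}: {'present' if present else 'missing'}")
```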

Looks like @helson73 may be right about it being a PyTorch problem.

This op is missing in the PyTorch binaries for Windows, since (if I'm not mistaken) Windows does not support (some) distributed setups.
Therefore I've built PyTorch from source with the distributed option manually disabled, so that I can reproduce the same errors.
My current branch should guard these imports with if torch.distributed.is_available(), so that this operator is not referenced at all when distributed is unavailable.

@jacob-mink This would of course also mean that you won't be able to use the distributed package on your machine, but should be able to use other utilities like mixed precision training etc.
Let me know, if my current branch still tries to import this operator and I'll have another look.
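
The guard described above can be sketched as follows. This is not the code from the branch, only a minimal illustration of the idea under stated assumptions: resolve_reduce_op is a hypothetical helper, and the three attribute paths it tries are the ones that appear in this thread.

```python
def resolve_reduce_op(dist):
    """Return the reduce-op enum from a torch.distributed-like module, or None."""
    is_available = getattr(dist, "is_available", None)
    if is_available is None or not is_available():
        return None  # distributed support is missing, so skip the lookup
    # Try the attribute paths mentioned in this thread, newest name first.
    for path in ("ReduceOp", "reduce_op", "deprecated.reduce_op"):
        obj = dist
        for part in path.split("."):
            obj = getattr(obj, part, None)
            if obj is None:
                break
        if obj is not None:
            return obj
    return None


try:
    import torch.distributed as dist
except ImportError:
    dist = None  # torch is not installed in this environment

ReduceOp = resolve_reduce_op(dist)
print("ReduceOp:", ReduceOp)
```

With this pattern, importing the package no longer fails on builds without distributed support. Code that actually needs ReduceOp still has to handle the None case.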

Right, I understand. I think your current branch is still trying to import that operator. Specifically, in apex\parallel\__init__.py I'm seeing an attempt to create a ReduceOp variable that throws the same error.

Thanks for the information!
Could you post the whole stack trace, so that I can fix it?

@ptrblck here you go!

Traceback (most recent call last):
  File "train.py", line 23, in <module>
    from apex.parallel import DistributedDataParallel
  File "C:\Users\user\Anaconda3\envs\ml-pt-apex-test\lib\site-packages\apex\parallel\__init__.py", line 8, in <module>
    ReduceOp = torch.distributed.deprecated.reduce_op
AttributeError: module 'torch.distributed' has no attribute 'deprecated'

OK, I see.
Unfortunately, if your system does not support distributed training, you won't be able to use DDP.
Could you try to remove this import and all occurrences of DDP in your script?
Is the apex and amp import working so far?
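
Instead of deleting every DDP occurrence by hand, a training script could make the import optional. The sketch below assumes a single-process fallback is acceptable. load_apex_ddp and maybe_wrap are hypothetical helpers, and a real distributed run would also need the usual process-group setup, which is omitted here.

```python
import importlib


def load_apex_ddp():
    """Return apex.parallel.DistributedDataParallel, or None if unavailable."""
    try:
        parallel = importlib.import_module("apex.parallel")
    except (ImportError, AttributeError):
        # ImportError: apex is not installed.
        # AttributeError: the import-time failure reported in this thread.
        return None
    return getattr(parallel, "DistributedDataParallel", None)


def maybe_wrap(model, ddp_cls):
    """Wrap `model` with `ddp_cls` when one is available, else return it unchanged."""
    return model if ddp_cls is None else ddp_cls(model)


DDP = load_apex_ddp()
print("apex DDP available:", DDP is not None)
```

The script would then call maybe_wrap(model, DDP) wherever it previously constructed DistributedDataParallel(model) directly.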

Ah, got it. So, at that level, it's a different issue.
Now when I use your branch w/ PyTorch 1.2.0, removing all DDP, it works!

@ptrblck I installed apex from your repository. It works correctly now. Thanks.
Windows 10, Python 3.6, PyTorch 1.2, cuda10.0

@ptrblck I tried your branch. It did not work at first (edit: it works!):

Python 3.5.2 (default, Nov 12 2018, 13:43:14) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import apex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/apex/__init__.py", line 5, in <module>
    from . import parallel
  File "/usr/local/lib/python3.5/dist-packages/apex/parallel/__init__.py", line 8, in <module>
    ReduceOp = torch.distributed.deprecated.reduce_op
AttributeError: module 'torch.distributed' has no attribute 'deprecated'
>>> 

I am using PyTorch 1.3 built from the latest source.

@jinfagang Line 5 in my repository points to a check using torch.distributed.is_available(), not the import that raised the error in your stack trace.
Could you make sure to use my branch and run the code again?

Sorry, it works!

I installed apex from the apex_no_distributed branch, and it worked for pytorch 1.2. Now that I have updated to pytorch 1.3, I get the following error: AttributeError: module 'torch.nn' has no attribute 'backends'
traceback:

AttributeError                            Traceback (most recent call last)
<ipython-input-11-70eeb9444499> in <module>
      3             validate=True, # Evaluate the model after each epoch
      4             schedule_type="warmup_cosine",
----> 5             optimizer_type="adamw")

~\Anaconda3\envs\d-learn\lib\site-packages\fast_bert\learner_cls.py in fit(self, epochs, lr, validate, schedule_type, optimizer_type)
    182             except ImportError:
    183                 raise ImportError('Please install apex to use fp16 training')
--> 184             self.model, optimizer = amp.initialize(self.model, optimizer, opt_level=self.fp16_opt_level)
    185 
    186         # Get scheduler

~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\frontend.py in initialize(models, optimizers, enabled, opt_level, cast_model_type, patch_torch_functions, keep_batchnorm_fp32, master_weights, loss_scale, cast_model_outputs, num_losses, verbosity, min_loss_scale, max_loss_scale)
    355         maybe_print("{:22} : {}".format(k, v), True)
    356 
--> 357     return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
    358 
    359 

~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\_initialize.py in _initialize(models, optimizers, properties, num_losses, cast_model_outputs)
    239     if properties.patch_torch_functions:
    240         # handle is unused here. It's accessible later through a global value anyway.
--> 241         handle = amp_init(loss_scale=properties.loss_scale, verbose=(_amp_state.verbosity == 2))
    242         for optimizer in optimizers:
    243             # Disable Amp casting for the optimizer step, because it should only be

~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\amp.py in init(enabled, loss_scale, enable_caching, verbose, allow_banned)
     99             try_caching = (cast_fn == utils.maybe_half)
    100             wrap.cached_cast(module.MODULE, fn, cast_fn, handle,
--> 101                              try_caching, verbose)
    102 
    103     # 1.5) Pre-0.4, put the blacklist methods on HalfTensor and whitelist

~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\wrap.py in cached_cast(mod, fn, cast_fn, handle, try_caching, verbose)
     31 def cached_cast(mod, fn, cast_fn, handle,
     32                 try_caching=False, verbose=False):
---> 33     if not utils.has_func(mod, fn):
     34         return
     35 

~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\utils.py in has_func(mod, fn)
    130 
    131 def has_func(mod, fn):
--> 132     if isinstance(mod, torch.nn.backends.backend.FunctionBackend):
    133         return fn in mod.function_classes
    134     elif isinstance(mod, dict):

AttributeError: module 'torch.nn' has no attribute 'backends'
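
The last frame fails because torch.nn.backends is looked up unconditionally. A tolerant variant of that check might look like the sketch below. It is not apex's actual fix: lookup and has_func_tolerant are hypothetical names, the nn_module parameter is added only so the sketch runs without torch, and the final hasattr fallback is an assumption, since the traceback shows only the first two branches.

```python
def lookup(root, dotted_name):
    """Follow a dotted attribute path, returning None if any part is missing."""
    obj = root
    for part in dotted_name.split("."):
        obj = getattr(obj, part, None)
        if obj is None:
            return None
    return obj


def has_func_tolerant(mod, fn, nn_module=None):
    """Like the has_func shown in the traceback, minus the hard dependency
    on torch.nn.backends existing."""
    backend_cls = lookup(nn_module, "backends.backend.FunctionBackend")
    if backend_cls is not None and isinstance(mod, backend_cls):
        return fn in mod.function_classes
    if isinstance(mod, dict):
        return fn in mod
    # Assumed fallback: treat `mod` as a plain module or object.
    return hasattr(mod, fn)


print(has_func_tolerant({"relu": None}, "relu"))
print(has_func_tolerant(str, "upper"))
```

In real use the caller would pass torch.nn as nn_module. On a PyTorch version without torch.nn.backends the first branch is simply skipped.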

@DanyalAndriano Try reinstalling apex from the master branch. I had the same issue and solved it by reinstalling apex.

@YuryBolkonsky Thanks. I did eventually do this and it works with PyTorch 1.2, but with 1.3 I get an error when using fp16_opt_level "O1". It is the same error as reported here: https://github.com/kaushaltrivedi/fast-bert/issues/90.

@ptrblck I installed apex from your repository. But it is still not working.
Windows 10, Python 3.7.3, PyTorch 1.3.1, cuda10.0

@DanyalAndriano Are you using a high-level wrapper on top of PyTorch+apex?
If so, could you post a code snippet showing your use case?

@va26 @BramVanroy was kind enough to create a PR based on my branch, which was merged in https://github.com/NVIDIA/apex/pull/531. Could you reinstall apex from master and retry, please?

@ptrblck I am. I'm using the fast-bert library, a fast.ai-inspired library for Hugging Face's transformers. Here is the code that runs on 1.2 but not on 1.3:

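# Imports are not in the original post; they are assumed from the fast-bert
# README and python-box. DATA_PATH, LABEL_PATH, and OUTPUT_DIR must be
# defined elsewhere.
import logging
import torch
from box import Box
from fast_bert.data_cls import BertDataBunch
from fast_bert.learner_cls import BertLearner
from fast_bert.metrics import accuracy

args = Box({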
    "run_text": "multiclass sentiment v2",
    "task_name": 'sentiment_classification_corrected_data',
    "model_type": 'bert',
    "model_name": 'bert-base-uncased',
    "do_lower_case": True,
    "max_grad_norm": 1.0,
    "train_batch_size": 8,
    "eval_batch_size": 16,
    "max_seq_length": 256,
    "learning_rate": 4e-5,
    "warmup_proportion": 0.002,
    "gradient_accumulation_steps": 8,
    "fp16": True,
    "fp16_opt_level": "O1",
    "eval_all_checkpoints": True,
    "overwrite_output_dir": True,
    "warmup_steps": 500,
    "logging_steps": 50,
    "overwrite_cache": True, # make sure to over write previous datasets
    "seed": 45
})

databunch = BertDataBunch(DATA_PATH, 
                          LABEL_PATH, 
                          tokenizer=args.model_name, 
                          train_file='train.csv', 
                          val_file='val.csv',
                          text_col="text", 
                          label_file='labels.csv',
                          label_col='label',
                          multi_label=False,
                          batch_size_per_gpu=args['train_batch_size'], 
                          max_seq_length=args['max_seq_length'], 
                          multi_gpu=True)

logger = logging.getLogger()
device_cuda = torch.device("cuda")

metrics_list = []
metrics_list.append({'name': 'accuracy', 'function': accuracy})

# Create learner object
learner = BertLearner.from_pretrained_model(databunch, 
                                            pretrained_path=args.model_name,
                                            metrics=metrics_list, 
                                            device=device_cuda, 
                                            logger=logger, 
                                            fp16_opt_level=args.fp16_opt_level,
                                            output_dir=OUTPUT_DIR, 
                                            finetuned_wgts_path=None, 
                                            warmup_steps=args.warmup_steps,
                                            grad_accumulation_steps=args.gradient_accumulation_steps,
                                            multi_gpu=True, 
                                            is_fp16=True, 
                                            multi_label=False,  
                                            logging_steps=200)   

learner.fit(epochs=1,
            lr=4e-5,
            validate=True, # Evaluate the model after each epoch
            schedule_type="warmup_cosine",
            optimizer_type="adamw")

# save model
learner.save_model()

The error reported over at https://github.com/kaushaltrivedi/fast-bert/issues/90 seems to be completely different than OP, though? The linked issue is about the scheduler/optimizer step. The current issue that we are in now is about torch.distributed specifically.

@BramVanroy It may be different... I've linked it because the error on this report (above) only came up with setting fp16_opt_level to O1, so I thought it may be related.

Can you post the full error trace again?

I have the same problem. Can I downgrade torch to work around it?
