Hi!
I get this error on Windows 10 with torch=1.2.0 when I just import apex.
I think my system simply does not support this, but failing on import is not good behavior.
Hi @metya,
could you post the complete stack trace, so that we can see where this deprecated attribute is coming from?
sure!
Sorry that I didn't think of it.
>>> import apex
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\metya\Anaconda3\lib\site-packages\apex\__init__.py", line 4, in <module>
from . import parallel
File "D:\metya\Anaconda3\lib\site-packages\apex\parallel\__init__.py", line 8, in <module>
ReduceOp = torch.distributed.deprecated.reduce_op
AttributeError: module 'torch.distributed' has no attribute 'deprecated'
Thanks for letting us know!
We'll look into it.
I found this thread after getting the same error when upgrading to torch=1.2.0. It seems that, for some (weird) reason, torch.distributed does not have the attribute ReduceOp (even though the documentation lists it). The error goes away after downgrading to torch=1.1.0.
@asbe Do you get this error on a Linux or Windows machine?
Win10, Cuda 10. Torch installed using the pip wheel suggested by the selector on pytorch.org
Same here.
win10, cuda10.0, vs2017, pytorch1.2.0
Same here.
pytorch 1.3.0 built from source, windows 10 pro, cuda 10.1, vs2019.
@metya @asbe @helson73
Could you try to build apex from this branch and see if this error disappears?
https://github.com/ptrblck/apex/tree/apex_no_distributed
Same here: cuda10.0.1, vs2017, pytorch1.2.0, win10.
D:\MachineLearning\Maskrcnn_benchMark\lib\site-packages\apex\__init__.py in <module>
3 import warnings
4
----> 5 from . import parallel
6 from . import amp
7 from . import fp16_utils
D:\MachineLearning\Maskrcnn_benchMark\lib\site-packages\apex\parallel\__init__.py in <module>
6 ReduceOp = torch.distributed.reduce_op
7 else:
----> 8 ReduceOp = torch.distributed.deprecated.reduce_op
9
10 from .distributed import DistributedDataParallel, Reducer
AttributeError: module 'torch.distributed' has no attribute 'deprecated'
The problem is still there.
@tuboxin did you try to build from my branch or the current master branch?
Same error here while trying the dcgan example.
python 3.7.3, pyTorch master (torch-1.3.0a0+6ce6939), centos7, cuda10.1, cudnn 7.6.1
@mfuntowicz Could you try to build apex using this branch: https://github.com/NVIDIA/apex/issues/429#issuecomment-522604591 ?
@ptrblck Yes, I followed your apex branch!
@ptrblck
Hi, I have tried the branch but got the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\__init__.py", line 7, in <module>
from . import amp
File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\amp\__init__.py", line 1, in <module>
from .amp import init, half_function, float_function, promote_function,\
File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\amp\amp.py", line 2, in <module>
from .handle import AmpHandle, NoOpHandle
File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\amp\handle.py", line 11, in <module>
from ..parallel.LARC import LARC
File "C:\Users\user\Anaconda3\envs\tc1src\lib\site-packages\apex\parallel\__init__.py", line 8, in <module>
ReduceOp = torch.distributed.deprecated.reduce_op
AttributeError: module 'torch.distributed' has no attribute 'deprecated'
@ptrblck Something went wrong with the PyTorch build I compiled, which led to distributed not being included.
It's fixed and it works as expected. Sorry for the false positive.
@ptrblck I think the error I showed above is not caused by apex.
For some reason, the PyTorch I installed on my Windows machine does not contain reduce_op at all (both 1.3 compiled from source and 1.2 installed from conda).
I guess that multi-GPU parallelism in PyTorch needs NCCL, but NCCL does not support Windows, so how could PyTorch on Windows run in multi-GPU mode in the first place? It's so weird.
@helson73, @tuboxin, @asbe, @metya
I've updated my branch with some more fixes. Could you try it again please?
https://github.com/ptrblck/apex/tree/apex_no_distributed
Hey @ptrblck, same setup as the others, except for python3.6
I think
ReduceOp = torch.distributed.deprecated.reduce_op
will always fail in PyTorch 1.2.0, since they've removed torch.distributed.deprecated. That being said, it's unclear why torch.distributed does not have a ReduceOp attribute when imported.
Here's a pretty minimal check on the command line for the things you try to grab in __init__.py for apex.
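A check of this kind might look like the following sketch. It is an illustration, not the snippet originally posted: the helper name probe_distributed is hypothetical, and the attribute names are the ones that appear in the stack traces in this thread.

```python
import importlib


def probe_distributed():
    """Report which torch.distributed attributes are present.

    Hypothetical helper; it only inspects attributes and changes nothing.
    """
    try:
        dist = importlib.import_module("torch.distributed")
    except ImportError:
        # torch itself (or its distributed package) cannot be imported.
        return {"torch.distributed importable": False}

    report = {"torch.distributed importable": True}
    is_available = getattr(dist, "is_available", None)
    report["is_available()"] = bool(is_available()) if callable(is_available) else None
    # Names taken from the tracebacks in this thread.
    for name in ("ReduceOp", "reduce_op", "deprecated"):
        report[name] = hasattr(dist, name)
    return report


if __name__ == "__main__":
    for key, value in probe_distributed().items():
        print(f"{key}: {value}")
```

If is_available() reports False and none of the three names is present, the missing attribute is most likely a property of the PyTorch build rather than of apex.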

Looks like @helson73 may be right about it being a PyTorch problem.
This op is missing in the PyTorch binaries for Windows, since (if I'm not mistaken) Windows does not support (some) distributed setups.
Therefore I've built PyTorch from source with the distributed option manually disabled, so that I can reproduce the same errors.
My current branch should guard these imports with a check of torch.distributed.is_available(), so that this operator is never looked up on builds without distributed support.
@jacob-mink This would of course also mean that you won't be able to use the distributed package on your machine, but should be able to use other utilities like mixed precision training etc.
Let me know, if my current branch still tries to import this operator and I'll have another look.
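The guard described above could be sketched roughly as follows. This is an assumption about the general shape of such a guard, not the code in the branch: resolve_reduce_op and DISTRIBUTED_AVAILABLE are hypothetical names, and only torch.distributed.is_available() and the ReduceOp / reduce_op attributes come from the thread.

```python
import importlib


def resolve_reduce_op():
    """Return the reduce-op enum if distributed support exists, else None.

    Hypothetical helper for illustration only.
    """
    try:
        dist = importlib.import_module("torch.distributed")
    except ImportError:
        return None  # torch is not importable at all
    if not dist.is_available():
        return None  # build without distributed support
    # Prefer the current name, fall back to the older spelling.
    for name in ("ReduceOp", "reduce_op"):
        if hasattr(dist, name):
            return getattr(dist, name)
    return None


ReduceOp = resolve_reduce_op()
DISTRIBUTED_AVAILABLE = ReduceOp is not None
```

With a guard like this, importing the package no longer depends on torch.distributed.deprecated, and code that needs distributed features can test DISTRIBUTED_AVAILABLE before using them.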
Right, I understand. I think your current branch is still trying to import that operator. Specifically, in apex\parallel\__init__.py I'm seeing an attempt to create a ReduceOp variable that throws the same error.
Thanks for the information!
Could you post the whole stack trace, so that I can fix it?
@ptrblck here you go!
Traceback (most recent call last):
File "train.py", line 23, in <module>
from apex.parallel import DistributedDataParallel
File "C:\Users\user\Anaconda3\envs\ml-pt-apex-test\lib\site-packages\apex\parallel\__init__.py", line 8, in <module>
ReduceOp = torch.distributed.deprecated.reduce_op
AttributeError: module 'torch.distributed' has no attribute 'deprecated'
OK, I see.
Unfortunately, if your system does not support distributed training, you won't be able to use DDP.
Could you try to remove this import and all occurrences of DDP in your script?
Are the apex and amp imports working so far?
Ah, got it. So, at that level, it's a different issue.
Now when I use your branch w/ PyTorch 1.2.0, removing all DDP, it works!
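Instead of deleting every DDP occurrence by hand, a training script could make the wrapper optional. The sketch below is one possible way to do that, under stated assumptions: maybe_wrap_ddp is a hypothetical helper, and it uses PyTorch's own torch.nn.parallel.DistributedDataParallel rather than the apex.parallel class imported in the trace above.

```python
def maybe_wrap_ddp(model):
    """Wrap `model` in DistributedDataParallel only when that is possible.

    Hypothetical helper for illustration; not part of apex or PyTorch.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        return model  # torch (or its distributed package) is not importable
    # is_initialized() may be absent on builds without distributed support,
    # so it is only evaluated after is_available() returns True.
    if dist.is_available() and dist.is_initialized():
        from torch.nn.parallel import DistributedDataParallel
        return DistributedDataParallel(model)
    return model  # single-process fallback: use the model unchanged
```

On a Windows build without distributed support this returns the model untouched, so the rest of the script (including amp) can run as before.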
@ptrblck I installed apex from your repository. It works correctly now. Thanks.
Windows 10, Python 3.6, PyTorch 1.2, cuda10.0
@ptrblck Tried your branch. It did not work at first (see the trace below), but now it works!
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import apex
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/apex/__init__.py", line 5, in <module>
from . import parallel
File "/usr/local/lib/python3.5/dist-packages/apex/parallel/__init__.py", line 8, in <module>
ReduceOp = torch.distributed.deprecated.reduce_op
AttributeError: module 'torch.distributed' has no attribute 'deprecated'
>>>
I am using PyTorch 1.3 built from the latest source.
@jinfagang Line 5 in my repository points to a check using torch.distributed.is_available(), not the import that raised the error in your stack trace.
Could you make sure to use my branch and run the code again?
Sorry, it works!
If you've already installed apex, remove it:
pip uninstall apex
rm -rf apex
Reinstall from the apex_no_distributed branch of @ptrblck's fork of apex:
git clone https://github.com/ptrblck/apex.git
cd apex
git checkout apex_no_distributed
pip install -v --no-cache-dir ./
I installed apex from the apex_no_distributed branch, and it worked for pytorch 1.2. Now that I have updated to pytorch 1.3, I get the following error: AttributeError: module 'torch.nn' has no attribute 'backends'
traceback:
AttributeError Traceback (most recent call last)
<ipython-input-11-70eeb9444499> in <module>
3 validate=True, # Evaluate the model after each epoch
4 schedule_type="warmup_cosine",
----> 5 optimizer_type="adamw")
~\Anaconda3\envs\d-learn\lib\site-packages\fast_bert\learner_cls.py in fit(self, epochs, lr, validate, schedule_type, optimizer_type)
182 except ImportError:
183 raise ImportError('Please install apex to use fp16 training')
--> 184 self.model, optimizer = amp.initialize(self.model, optimizer, opt_level=self.fp16_opt_level)
185
186 # Get scheduler
~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\frontend.py in initialize(models, optimizers, enabled, opt_level, cast_model_type, patch_torch_functions, keep_batchnorm_fp32, master_weights, loss_scale, cast_model_outputs, num_losses, verbosity, min_loss_scale, max_loss_scale)
355 maybe_print("{:22} : {}".format(k, v), True)
356
--> 357 return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
358
359
~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\_initialize.py in _initialize(models, optimizers, properties, num_losses, cast_model_outputs)
239 if properties.patch_torch_functions:
240 # handle is unused here. It's accessible later through a global value anyway.
--> 241 handle = amp_init(loss_scale=properties.loss_scale, verbose=(_amp_state.verbosity == 2))
242 for optimizer in optimizers:
243 # Disable Amp casting for the optimizer step, because it should only be
~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\amp.py in init(enabled, loss_scale, enable_caching, verbose, allow_banned)
99 try_caching = (cast_fn == utils.maybe_half)
100 wrap.cached_cast(module.MODULE, fn, cast_fn, handle,
--> 101 try_caching, verbose)
102
103 # 1.5) Pre-0.4, put the blacklist methods on HalfTensor and whitelist
~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\wrap.py in cached_cast(mod, fn, cast_fn, handle, try_caching, verbose)
31 def cached_cast(mod, fn, cast_fn, handle,
32 try_caching=False, verbose=False):
---> 33 if not utils.has_func(mod, fn):
34 return
35
~\Anaconda3\envs\d-learn\lib\site-packages\apex\amp\utils.py in has_func(mod, fn)
130
131 def has_func(mod, fn):
--> 132 if isinstance(mod, torch.nn.backends.backend.FunctionBackend):
133 return fn in mod.function_classes
134 elif isinstance(mod, dict):
AttributeError: module 'torch.nn' has no attribute 'backends'
@DanyalAndriano Try reinstalling apex from the master branch. I had the same issue and solved it by reinstalling apex.
@YuryBolkonsky thanks. I did eventually do this and it works with pytorch 1.2, but with 1.3 I get an error when using fp16_opt_level "O1". I get the same error as reported here https://github.com/kaushaltrivedi/fast-bert/issues/90.
@ptrblck I installed apex from your repository. But it is still not working.
Windows 10, Python 3.7.3, PyTorch 1.3.1, cuda10.0
@DanyalAndriano Are you using a high-level wrapper on top of PyTorch+apex?
If so, could you post a code snippet showing your use case?
@va26 @BramVanroy was kind enough to create a PR based on my branch, which was merged in https://github.com/NVIDIA/apex/pull/531. Could you reinstall apex from master and retry, please?
@ptrblck I am. I'm using the fast-bert library, a fast.ai-inspired library built on Hugging Face's transformers. Here is the code that runs on 1.2, but not on 1.3:
args = Box({
    "run_text": "multiclass sentiment v2",
    "task_name": 'sentiment_classification_corrected_data',
    "model_type": 'bert',
    "model_name": 'bert-base-uncased',
    "do_lower_case": True,
    "max_grad_norm": 1.0,
    "train_batch_size": 8,
    "eval_batch_size": 16,
    "max_seq_length": 256,
    "learning_rate": 4e-5,
    "warmup_proportion": 0.002,
    "gradient_accumulation_steps": 8,
    "fp16": True,
    "fp16_opt_level": "O1",
    "eval_all_checkpoints": True,
    "overwrite_output_dir": True,
    "warmup_steps": 500,
    "logging_steps": 50,
    "overwrite_cache": True,  # make sure to over write previous datasets
    "seed": 45
})
databunch = BertDataBunch(DATA_PATH,
                          LABEL_PATH,
                          tokenizer=args.model_name,
                          train_file='train.csv',
                          val_file='val.csv',
                          text_col="text",
                          label_file='labels.csv',
                          label_col='label',
                          multi_label=False,
                          batch_size_per_gpu=args['train_batch_size'],
                          max_seq_length=args['max_seq_length'],
                          multi_gpu=True)
logger = logging.getLogger()
device_cuda = torch.device("cuda")
metrics_list = []
metrics_list.append({'name': 'accuracy', 'function': accuracy})
# Create learner object
learner = BertLearner.from_pretrained_model(databunch,
                                            pretrained_path=args.model_name,
                                            metrics=metrics_list,
                                            device=device_cuda,
                                            logger=logger,
                                            fp16_opt_level=args.fp16_opt_level,
                                            output_dir=OUTPUT_DIR,
                                            finetuned_wgts_path=None,
                                            warmup_steps=args.warmup_steps,
                                            grad_accumulation_steps=args.gradient_accumulation_steps,
                                            multi_gpu=True,
                                            is_fp16=True,
                                            multi_label=False,
                                            logging_steps=200)
learner.fit(epochs=1,
            lr=4e-5,
            validate=True,  # Evaluate the model after each epoch
            schedule_type="warmup_cosine",
            optimizer_type="adamw")
# save model
learner.save_model()
The error reported over at https://github.com/kaushaltrivedi/fast-bert/issues/90 seems to be completely different from the OP's, though? The linked issue is about the scheduler/optimizer step. The current issue is about torch.distributed specifically.
@BramVanroy It may be different... I've linked it because the error on this report (above) only came up with setting fp16_opt_level to O1, so I thought it may be related.
Can you post the full error trace again?
I have the same problem. Can I downgrade torch to work around it?