Pytorch-lightning: 0.8.2 calls backward on '_GeneratorContextManager'

Created on 29 Jun 2020 · 15 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

0.8.2 calls backward on '_GeneratorContextManager' and crashes training.
0.8.1 works correctly. My training_step returns {'loss': loss, 'log': {'learn_rate': self.lr}}
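
For context, a minimal training_step with that return format looks roughly like the sketch below; the model, data shapes, and the logged learn_rate are illustrative, not the reporter's actual code.

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self, lr=1e-3):
        super().__init__()
        self.lr = lr
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.layer(x), y)
        # 0.8.x-style dict return: the loss plus a 'log' dict for the logger
        return {'loss': loss, 'log': {'learn_rate': self.lr}}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)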

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 538, in ddp_train
    self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1100, in run_pretrain_routine
    self.train()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in train
    self.run_training_epoch()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 630, in run_training_batch
    self.hiddens
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 804, in optimizer_closure
    model_ref.backward(self, closure_loss, optimizer, opt_idx)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/core/hooks.py", line 189, in backward
    loss.backward()
AttributeError: '_GeneratorContextManager' object has no attribute 'backward'

Expected behavior

backward is called on the loss and training runs correctly

Labels: bug / fix, help wanted

Most helpful comment

@williamFalcon yes, the master version works for me now. Thanks!

All 15 comments

Did you override optimizer_step?
Could you try master? We just pushed a fix for a typo we had.

Can confirm this happens on 0.8.3

ok. Can you post a colab example that replicates this?

@Anjum48 @s-rog
colab please

@williamFalcon my optimizer_step was untouched. I can't run more testing at the moment, but I'll get to it as soon as I can.

@williamFalcon Hi, I also encountered this with a normal Adam optimizer. I don't have a colab to replicate it at the moment, but from what I saw earlier, it can be reproduced with any setting as long as the Trainer is set to precision=16 while using Apex. Under this condition, the following lines from training_loop.py and hooks.py will run:

if self.precision == 16 and not self.on_tpu:
    closure_loss = model_ref.amp_scale_loss(closure_loss, optimizer, opt_idx)

scaled_loss = amp.scale_loss(unscaled_loss, optimizer)

These cause closure_loss to end up as a _GeneratorContextManager object, which has no backward() method.

It seems that under the current design, pytorch-lightning's scale_loss function can only be used as a context manager?
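
For illustration, apex's scale_loss is meant to be entered as a context manager. The sketch below (assuming apex is installed, a CUDA device is available, and a standard O1 initialization) contrasts the documented usage with what the 0.8.2/0.8.3 code path effectively produced; the tiny model and dummy input are placeholders.

import torch
from apex import amp

model = torch.nn.Linear(8, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

unscaled_loss = model(torch.randn(4, 8, device='cuda')).mean()

# Documented apex usage: scale_loss is a context manager that yields the
# scaled loss; unscaling and inf/nan checks happen on exit.
with amp.scale_loss(unscaled_loss, optimizer) as scaled_loss:
    scaled_loss.backward()

# What the 0.8.2/0.8.3 path effectively did: the context-manager object itself
# was returned as the loss, so the later loss.backward() call raised
# AttributeError: '_GeneratorContextManager' object has no attribute 'backward'
broken = amp.scale_loss(unscaled_loss, optimizer)  # a _GeneratorContextManager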

@williamFalcon Here's a colab example (my first time using colab so let me know if you have issues seeing it) https://colab.research.google.com/drive/1G08jVDpx-T-5HE2c89RLJdq4u67mM2-o?usp=sharing

I suspect the issue lies with Apex AMP as suggested above by @aeryen

ummm. I think this is an apex issue. I can't replicate it with 16-bit native.

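For anyone reproducing this, the split between the two 16-bit paths matters: as noted below, the crash only occurs when NATIVE_AMP_AVALAIBLE is false, i.e. on the apex path. A minimal Trainer setup, under that assumption about 0.8.x's backend selection:

import pytorch_lightning as pl

# precision=16 alone does not pin the backend in 0.8.x: with PyTorch >= 1.6
# Lightning takes the native torch.cuda.amp path (which works here), while an
# older PyTorch plus an apex install takes the apex path that crashes.
trainer = pl.Trainer(gpus=1, precision=16, max_epochs=1)
# trainer.fit(LitClassifier())  # LitClassifier as sketched above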

@aeryen mind sharing a minimal example to reproduce?

Hi, sorry for the delay: https://colab.research.google.com/drive/1rjaRRwgBTm4CKPfe9po_WSxnKqY4jDRv?usp=sharing
I agree this is an apex issue, i.e. it only occurs when NATIVE_AMP_AVALAIBLE is false in hooks.py

@aeryen, @Anjum48, @s-rog this is fixed on master. Give it a try?

@williamFalcon yes, the master version works for me now. Thanks!

@williamFalcon Can confirm as well! Sorry I couldn't be more helpful earlier.

Hi @williamFalcon thanks for the quick fix. I just upgraded but am now seeing a different error:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1]
Using APEX 16bit precision.
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Loaded pretrained weights for efficientnet-b0
/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py:140: DtypeWarning: Columns (5) have mixed types.Specify dtype option on import or set low_memory=False.
  train_single_fold(args)
Using APEX 16bit precision.
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=ddp
All DDP processes registered. Starting ddp with 2 processes
----------------------------------------------------------------------------------------------------
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic

  | Name      | Type             | Params
-----------------------------------------------
0 | critereon | CrossEntropyLoss | 0     
1 | net       | EfficientNet     | 4 M   
Validation sanity check:  50%|███████████████████████▌                       | 1/2 [00:00<00:00,  1.01it/s]Traceback (most recent call last):
  File "/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py", line 140, in <module>
    train_single_fold(args)
  File "/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py", line 64, in train_single_fold
    trainer.fit(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 957, in fit
    self.ddp_train(task, model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 538, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1141, in run_pretrain_routine
    eval_results = self._evaluate(model,
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 346, in _evaluate
    self.reduce_eval_ddp(eval_results)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 363, in reduce_eval_ddp
    self.reduce_eval_ddp(v)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 365, in reduce_eval_ddp
    dist.all_reduce(v, op=dist.reduce_op.SUM)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 898, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
Traceback (most recent call last):
  File "train.py", line 140, in <module>
    train_single_fold(args)
  File "train.py", line 64, in train_single_fold
    trainer.fit(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 973, in fit
    self.spawn_ddp_children(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 449, in spawn_ddp_children
    self.ddp_train(local_rank, model, is_master=True)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 538, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1141, in run_pretrain_routine
    eval_results = self._evaluate(model,
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 346, in _evaluate
    self.reduce_eval_ddp(eval_results)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 363, in reduce_eval_ddp
    self.reduce_eval_ddp(v)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 365, in reduce_eval_ddp
    dist.all_reduce(v, op=dist.reduce_op.SUM)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 898, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

I'm not manually assigning tensors to a device (i.e. PL should be assigning all tensors as CUDA tensors) and I am not using sparse tensors (at least not that I am aware of).

EDIT: I found the issue. I guess metrics need to be CUDA tensors now. Thanks again :)
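
For reference, a sketch of the pattern that resolves that second error, with an illustrative model and a hypothetical CPU-side score: under ddp, the values returned from the eval loop get all_reduced, so any metric built on the CPU (e.g. from a numpy/sklearn computation) has to be moved onto the GPU before it is returned.

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {'val_loss': F.cross_entropy(self.layer(x), y)}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
        # Hypothetical score computed on the CPU (e.g. via sklearn); put it on
        # the same CUDA device as the losses, otherwise ddp's all_reduce fails
        # with "RuntimeError: Tensors must be CUDA and dense".
        cpu_score = 0.87
        score = torch.tensor(cpu_score, device=avg_loss.device)
        return {'val_loss': avg_loss,
                'log': {'val_loss': avg_loss, 'score': score}}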

@Anjum48 mind opening a new issue?
