Pytorch-lightning: 0.8.2 calls backward on '_GeneratorContextManager'

Created on 29 Jun 2020 · 15 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

0.8.2 calls backward on '_GeneratorContextManager' and crashes training.
0.8.1 works correctly. My training_step returns {'loss': loss, 'log': {'learn_rate': self.lr}}
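
For context, a minimal training_step with that return format looks roughly like the sketch below; the model, data shapes, and the logged learn_rate are illustrative, not the reporter's actual code.

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self, lr=1e-3):
        super().__init__()
        self.lr = lr
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.layer(x), y)
        # 0.8.x-style dict return: the loss plus a 'log' dict for the logger
        return {'loss': loss, 'log': {'learn_rate': self.lr}}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)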

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 538, in ddp_train
    self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1100, in run_pretrain_routine
    self.train()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in train
    self.run_training_epoch()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 630, in run_training_batch
    self.hiddens
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 804, in optimizer_closure
    model_ref.backward(self, closure_loss, optimizer, opt_idx)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/core/hooks.py", line 189, in backward
    loss.backward()
AttributeError: '_GeneratorContextManager' object has no attribute 'backward'

Expected behavior

backward is called on the loss and training runs correctly

Labels: bug / fix, help wanted

Most helpful comment

@williamFalcon yes, the master version works for me now. Thanks!

All 15 comments

Did you override optimizer_step?
Could you try master? We just pushed a fix for a typo we had.

Can confirm this happens on 0.8.3

ok. Can you post a colab example that replicates this?

@Anjum48 @s-rog
colab please

@williamFalcon my optimizer_step was untouched. I can't run more testing at the moment, but I'll get to it as soon as I can.

@williamFalcon Hi, I also encountered this with a normal Adam optimizer. I don't have a colab to replicate it at the moment, but from what I saw earlier, it can be reproduced with any setting as long as the Trainer is set to precision=16 while using Apex. Under this condition, the following lines from training_loop.py and hooks.py will run:

if self.precision == 16 and not self.on_tpu:
    closure_loss = model_ref.amp_scale_loss(closure_loss, optimizer, opt_idx)

scaled_loss = amp.scale_loss(unscaled_loss, optimizer)

These cause closure_loss to end up as a _GeneratorContextManager object, which has no backward() method.

It seems that under the current design, pytorch-lightning's scale_loss function can only be used as a context manager?
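
For illustration, apex's scale_loss is meant to be entered as a context manager. The sketch below (assuming apex is installed, a CUDA device is available, and a standard O1 initialization) contrasts the documented usage with what the 0.8.2/0.8.3 code path effectively produced; the tiny model and dummy input are placeholders.

import torch
from apex import amp

model = torch.nn.Linear(8, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

unscaled_loss = model(torch.randn(4, 8, device='cuda')).mean()

# Documented apex usage: scale_loss is a context manager that yields the
# scaled loss; unscaling and inf/nan checks happen on exit.
with amp.scale_loss(unscaled_loss, optimizer) as scaled_loss:
    scaled_loss.backward()

# What the 0.8.2/0.8.3 path effectively did: the context-manager object itself
# was returned as the loss, so the later loss.backward() call raised
# AttributeError: '_GeneratorContextManager' object has no attribute 'backward'
broken = amp.scale_loss(unscaled_loss, optimizer)  # a _GeneratorContextManager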

@williamFalcon Here's a colab example (my first time using colab so let me know if you have issues seeing it) https://colab.research.google.com/drive/1G08jVDpx-T-5HE2c89RLJdq4u67mM2-o?usp=sharing

I suspect the issue lies with Apex AMP as suggested above by @aeryen

ummm. I think this is an apex issue. I can't replicate it with 16-bit native.

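For anyone reproducing this, the split between the two 16-bit paths matters: as noted below, the crash only occurs when NATIVE_AMP_AVALAIBLE is false, i.e. on the apex path. A minimal Trainer setup, under that assumption about 0.8.x's backend selection:

import pytorch_lightning as pl

# precision=16 alone does not pin the backend in 0.8.x: with PyTorch >= 1.6
# Lightning takes the native torch.cuda.amp path (which works here), while an
# older PyTorch plus an apex install takes the apex path that crashes.
trainer = pl.Trainer(gpus=1, precision=16, max_epochs=1)
# trainer.fit(LitClassifier())  # LitClassifier as sketched above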

@aeryen mind sharing a minimal example to reproduce?

Hi, sorry for the delay: https://colab.research.google.com/drive/1rjaRRwgBTm4CKPfe9po_WSxnKqY4jDRv?usp=sharing
I agree this is an apex issue, i.e. it only occurs when NATIVE_AMP_AVALAIBLE is false in hooks.py

@aeryen, @Anjum48, @s-rog this is fixed on master. Give it a try?

@williamFalcon yes, the master version works for me now. Thanks!

@williamFalcon Can confirm as well! Sorry I couldn't be more helpful earlier.

Hi @williamFalcon thanks for the quick fix. I just upgraded but am now seeing a different error:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1]
Using APEX 16bit precision.
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Loaded pretrained weights for efficientnet-b0
/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py:140: DtypeWarning: Columns (5) have mixed types.Specify dtype option on import or set low_memory=False.
  train_single_fold(args)
Using APEX 16bit precision.
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=ddp
All DDP processes registered. Starting ddp with 2 processes
----------------------------------------------------------------------------------------------------
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic

  | Name      | Type             | Params
-----------------------------------------------
0 | critereon | CrossEntropyLoss | 0     
1 | net       | EfficientNet     | 4 M   
Validation sanity check:  50%|███████████████████████▌                       | 1/2 [00:00<00:00,  1.01it/s]Traceback (most recent call last):
  File "/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py", line 140, in <module>
    train_single_fold(args)
  File "/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py", line 64, in train_single_fold
    trainer.fit(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 957, in fit
    self.ddp_train(task, model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 538, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1141, in run_pretrain_routine
    eval_results = self._evaluate(model,
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 346, in _evaluate
    self.reduce_eval_ddp(eval_results)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 363, in reduce_eval_ddp
    self.reduce_eval_ddp(v)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 365, in reduce_eval_ddp
    dist.all_reduce(v, op=dist.reduce_op.SUM)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 898, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
Traceback (most recent call last):
  File "train.py", line 140, in <module>
    train_single_fold(args)
  File "train.py", line 64, in train_single_fold
    trainer.fit(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 973, in fit
    self.spawn_ddp_children(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 449, in spawn_ddp_children
    self.ddp_train(local_rank, model, is_master=True)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 538, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1141, in run_pretrain_routine
    eval_results = self._evaluate(model,
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 346, in _evaluate
    self.reduce_eval_ddp(eval_results)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 363, in reduce_eval_ddp
    self.reduce_eval_ddp(v)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 365, in reduce_eval_ddp
    dist.all_reduce(v, op=dist.reduce_op.SUM)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 898, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

I'm not manually assigning tensors to a device (i.e. PL should be assigning all tensors as CUDA tensors) and I am not using sparse tensors (at least not that I am aware of).

EDIT: I found the issue. I guess metrics need to be CUDA tensors now. Thanks again :)
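
For reference, a sketch of the pattern that resolves that second error, with an illustrative model and a hypothetical CPU-side score: under ddp, the values returned from the eval loop get all_reduced, so any metric built on the CPU (e.g. from a numpy/sklearn computation) has to be moved onto the GPU before it is returned.

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {'val_loss': F.cross_entropy(self.layer(x), y)}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
        # Hypothetical score computed on the CPU (e.g. via sklearn); put it on
        # the same CUDA device as the losses, otherwise ddp's all_reduce fails
        # with "RuntimeError: Tensors must be CUDA and dense".
        cpu_score = 0.87
        score = torch.tensor(cpu_score, device=avg_loss.device)
        return {'val_loss': avg_loss,
                'log': {'val_loss': avg_loss, 'score': score}}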

@Anjum48 mind opening a new issue?
