Pytorch-lightning: auto_lr_finder crashes when using 16-bit precision with pytorch-nightly and torchvision-nightly

Created on 21 Jul 2020 · 5 comments · Source: PyTorchLightning/pytorch-lightning

I heard that the nightly version of pytorch has native support for 16-bit training and wanted to give it a try, since I'm trying to train some recent models on a GTX 1080. FYI, I'm using pytorch-lightning==0.8.5.

I've installed nightly versions of both libraries; see the environment details below.

I've also setup the Trainer as follows:

    trainer = Trainer(
        gpus=1,
        max_epochs=hparams.epochs,
        auto_lr_find=True,
        progress_bar_refresh_rate=0,
        accumulate_grad_batches=10,
        # overfit_batches=5,
        amp_level="O2",
        precision=16,
        logger=logger,
        checkpoint_callback=checkpoint_callback,
    )
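For reference, with auto_lr_find=True the fit call runs the learning-rate finder internally before training starts (the traceback below enters it through _run_lr_finder_internally). Roughly, it is equivalent to this manual sketch against the 0.8.x API, where model stands in for the LightningModule being trained:

    # Sketch: what auto_lr_find=True does implicitly inside trainer.fit
    # (0.8.x-era API; `model` is a placeholder for the LightningModule).
    lr_finder = trainer.lr_find(model)         # this is the step that crashes
    model.hparams.lr = lr_finder.suggestion()  # apply the suggested LR
    trainer.fit(model)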

I'm training a resnext101_32x8d_wsl model using the weights provided by Facebook via PyTorch Hub.

Running command:
        python pipe/train_cnn.py
/home/gianluca/git/kaggle/siim-isic-melanoma-classification/.venv/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:25: UserWarning: Checkpoint directory /home/gianluca/git/kaggle/siim-isic-melanoma-classification/models exists and is not empty with save_top_k != 0. All files in this directory will be deleted when a checkpoint is saved!
  warnings.warn(*args, **kwargs)
Using cache found in /home/gianluca/.cache/torch/hub/facebookresearch_WSL-Images_master
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.
Traceback (most recent call last):
  File "pipe/train_cnn.py", line 237, in <module>
    main(create_submission=True)
  File "pipe/train_cnn.py", line 48, in main
    preds, weight_fpath = train(fold_number=fold_number, folds=folds)
  File "pipe/train_cnn.py", line 120, in train
    trainer.fit(model)
  File "/home/gianluca/git/kaggle/siim-isic-melanoma-classification/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 956, in fit
    self._run_lr_finder_internally(model)
  File "/home/gianluca/git/kaggle/siim-isic-melanoma-classification/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/lr_finder.py", line 58, in _run_lr_finder_internally
    lr_finder = self.lr_find(model)
  File "/home/gianluca/git/kaggle/siim-isic-melanoma-classification/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/lr_finder.py", line 180, in lr_find
    self.save_checkpoint(str(save_path))
  File "/home/gianluca/git/kaggle/siim-isic-melanoma-classification/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 268, in save_checkpoint
    checkpoint = self.dump_checkpoint(weights_only)
  File "/home/gianluca/git/kaggle/siim-isic-melanoma-classification/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 362, in dump_checkpoint
    checkpoint['native_amp_scaling_state'] = self.scaler.state_dict()
AttributeError: 'NoneType' object has no attribute 'state_dict'
ERROR: failed to reproduce 'train_cnn.dvc': stage: 'train_cnn.dvc' cmd 'python pipe/train_cnn.py' failed

Environment

  • PyTorch Version (e.g., 1.0): torch-1.7.0.dev20200720
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): poetry
  • Build command you used (if compiling from source):
  • Python version: 3.7.0
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: 1 x GTX 1080
  • Any other relevant information:

Additional context

Since torch^1.6.0 has native support for 16-bit training, I did not install NVIDIA APEX. The whole reason for using a nightly version of pytorch was to avoid installing APEX, since I wasn't able to figure out how to install it with poetry.
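For context, the native support referred to here is torch.cuda.amp, available since torch 1.6: a GradScaler plus autocast, with no APEX dependency. A minimal self-contained sketch (toy model and data, not the code from this report):

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(10, 2).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

    for _ in range(3):  # a few dummy training steps
        inputs = torch.randn(8, 10, device=device)
        targets = torch.randint(0, 2, (8,), device=device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=(device == "cuda")):
            loss = criterion(model(inputs), targets)
        scaler.scale(loss).backward()  # backward on the scaled loss
        scaler.step(optimizer)         # unscales grads, then steps
        scaler.update()                # adjust the scale for the next step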

Labels: Priority P0 · bug / fix · help wanted

All 5 comments

Hi! Thanks for your contribution, great first issue!

After a few quick experiments, the issue seems to be related to using the auto_lr_finder; in fact, disabling it avoids the crash.
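In Trainer terms the workaround is just the same setup with auto_lr_find left out (a sketch; `model` is a placeholder LightningModule):

# Sketch of the observed workaround: identical setup minus auto_lr_find.
trainer = Trainer(
    gpus=1,
    precision=16,
    # auto_lr_find=True,  # enabling this triggers the crash
)
trainer.fit(model)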

Ran into the same issue; the error is clearer when you call lr_find directly:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-003731b0ec57> in <module>
     55 # trainer.scaler = torch.cuda.amp.GradScaler()
     56 
---> 57 lrf = trainer.lr_find(model=net, train_dataloader=trn_dl, early_stop_threshold=10.)
     58 

~/anaconda3/envs/dl/lib/python3.7/site-packages/pytorch_lightning/trainer/lr_finder.py in lr_find(self, model, train_dataloader, val_dataloaders, min_lr, max_lr, num_training, mode, early_stop_threshold, num_accumulation_steps)
    178 
    179         # Dump model checkpoint
--> 180         self.save_checkpoint(str(save_path))
    181 
    182         # Configure optimizer and scheduler

~/anaconda3/envs/dl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py in save_checkpoint(self, filepath, weights_only)
    266 
    267     def save_checkpoint(self, filepath, weights_only: bool = False):
--> 268         checkpoint = self.dump_checkpoint(weights_only)
    269 
    270         if self.is_global_zero:

~/anaconda3/envs/dl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py in dump_checkpoint(self, weights_only)
    360             # save native amp scaling
    361             if self.use_amp and NATIVE_AMP_AVALAIBLE and not self.use_tpu:
--> 362                 checkpoint['native_amp_scaling_state'] = self.scaler.state_dict()
    363 
    364         # add the module_arguments and state_dict from the model

AttributeError: 'NoneType' object has no attribute 'state_dict'

trainer.scaler is initialized to None and is only set to torch.cuda.amp.GradScaler() later, during fit. Meanwhile, lr_find tries to checkpoint the scaler's state before that happens.

Quick fix: set trainer.scaler yourself after Trainer init and before lr_find. This doesn't work if you want to use the auto_lr_find option, though.

import pytorch_lightning as pl
import torch

trainer = pl.Trainer(gpus=1,
                     max_epochs=20,
                     precision=16)
# Pre-create the scaler that fit() would normally set up, so that
# lr_find's checkpoint dump finds a non-None scaler.
trainer.scaler = torch.cuda.amp.GradScaler()
lrf = trainer.lr_find(model=net, train_dataloader=trn_dl)

Real fix: ensure that, given the Trainer args, the scaler is initialized to something non-None before it is needed elsewhere; contributors need to weigh in on how.
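For illustration, one possible shape of that guard in dump_checkpoint — a sketch against the 0.8.x code shown in the traceback above, not necessarily the fix that was merged:

# Sketch: only serialize the scaler once it actually exists.
# (NATIVE_AMP_AVALAIBLE is the constant's real spelling in 0.8.x.)
if (self.use_amp and NATIVE_AMP_AVALAIBLE and not self.use_tpu
        and self.scaler is not None):
    checkpoint['native_amp_scaling_state'] = self.scaler.state_dict()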

Seems to be a duplicate of #1827.
