I heard that the nightly version of PyTorch has native support for 16-bit training and wanted to give it a try, since I'm trying to train some recent models on a GTX 1080. FYI, I'm using pytorch-lightning==0.8.5.
I've installed the following versions of the two libraries:
I've also setup the Trainer as follows:
trainer = Trainer(
    gpus=1,
    max_epochs=hparams.epochs,
    auto_lr_find=True,            # run the learning-rate finder before training
    progress_bar_refresh_rate=0,
    accumulate_grad_batches=10,
    # overfit_batches=5,
    amp_level="O2",               # Apex-specific setting; unused with native AMP
    precision=16,                 # 16-bit (mixed) precision
    logger=logger,
    checkpoint_callback=checkpoint_callback,
)
I'm training a resnext101_32x8d_wsl model using the weights provided by Facebook on PyTorch Hub.
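For reference, the model is loaded roughly like this (a minimal sketch; the actual training script may differ):

import torch

# Fetch the WSL ResNeXt weights from Facebook's PyTorch Hub repo
model = torch.hub.load("facebookresearch/WSL-Images", "resnext101_32x8d_wsl")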
Running command:
python pipe/train_cnn.py
/home/gianluca/git/kaggle/siim-isic-melanoma-classification/.venv/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:25: UserWarning: Checkpoint directory /home/gianluca/git/kaggle/siim-isic-melanoma-classification/models exists and is not empty with save_top_k != 0. All files in this directory will be deleted when a checkpoint is saved!
  warnings.warn(*args, **kwargs)
Using cache found in /home/gianluca/.cache/torch/hub/facebookresearch_WSL-Images_master
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.
Traceback (most recent call last):
File "pipe/train_cnn.py", line 237, in <module>
main(create_submission=True)
File "pipe/train_cnn.py", line 48, in main
preds, weight_fpath = train(fold_number=fold_number, folds=folds)
File "pipe/train_cnn.py", line 120, in train
trainer.fit(model)
File "/home/gianluca/git/kaggle/siim-isic-melanoma-classification/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 956, in fit
self._run_lr_finder_internally(model)
File "/home/gianluca/git/kaggle/siim-isic-melanoma-classification/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/lr_finder.py", line 58, in _run_lr_finder_internally
lr_finder = self.lr_find(model)
File "/home/gianluca/git/kaggle/siim-isic-melanoma-classification/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/lr_finder.py", line 180, in lr_find
self.save_checkpoint(str(save_path))
File "/home/gianluca/git/kaggle/siim-isic-melanoma-classification/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 268, in save_checkpoint
checkpoint = self.dump_checkpoint(weights_only)
File "/home/gianluca/git/kaggle/siim-isic-melanoma-classification/.venv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 362, in dump_checkpoint
checkpoint['native_amp_scaling_state'] = self.scaler.state_dict()
AttributeError: 'NoneType' object has no attribute 'state_dict'
ERROR: failed to reproduce 'train_cnn.dvc': stage: 'train_cnn.dvc' cmd 'python pipe/train_cnn.py' failed
How you installed PyTorch (conda, pip, source): poetry

Since torch^1.6.0 has native support for 16-bit training, I did not install NVIDIA Apex. The whole reason for using a nightly version of PyTorch was to avoid installing Apex, since I wasn't able to figure out how to install it with Poetry.
Hi! Thanks for your contribution! Great first issue!
After a few quick experiments, the issue seems to be related to using auto_lr_find: disabling it fixes the problem.
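In other words, a configuration along these lines trains fine (a minimal sketch of the workaround; the other arguments are unchanged from the original Trainer above):

trainer = Trainer(
    gpus=1,
    max_epochs=hparams.epochs,
    auto_lr_find=False,   # disabling the LR finder avoids the crash
    precision=16,
    logger=logger,
    checkpoint_callback=checkpoint_callback,
)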
Ran into the same issue; the error is clearer when you call lr_find directly:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-9-003731b0ec57> in <module>
55 # trainer.scaler = torch.cuda.amp.GradScaler()
56
---> 57 lrf = trainer.lr_find(model=net, train_dataloader=trn_dl, early_stop_threshold=10.)
58
~/anaconda3/envs/dl/lib/python3.7/site-packages/pytorch_lightning/trainer/lr_finder.py in lr_find(self, model, train_dataloader, val_dataloaders, min_lr, max_lr, num_training, mode, early_stop_threshold, num_accumulation_steps)
178
179 # Dump model checkpoint
--> 180 self.save_checkpoint(str(save_path))
181
182 # Configure optimizer and scheduler
~/anaconda3/envs/dl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py in save_checkpoint(self, filepath, weights_only)
266
267 def save_checkpoint(self, filepath, weights_only: bool = False):
--> 268 checkpoint = self.dump_checkpoint(weights_only)
269
270 if self.is_global_zero:
~/anaconda3/envs/dl/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py in dump_checkpoint(self, weights_only)
360 # save native amp scaling
361 if self.use_amp and NATIVE_AMP_AVALAIBLE and not self.use_tpu:
--> 362 checkpoint['native_amp_scaling_state'] = self.scaler.state_dict()
363
364 # add the module_arguments and state_dict from the model
AttributeError: 'NoneType' object has no attribute 'state_dict'
trainer.scaler is initialized to None and only set to torch.cuda.amp.GradScaler() later, during training setup. Meanwhile, lr_find wants to checkpoint the state of the scaler before that happens.
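For context, the checkpointing line only works once the scaler actually exists; a real GradScaler serializes fine (a tiny illustration, independent of Lightning):

import torch

scaler = torch.cuda.amp.GradScaler()
state = scaler.state_dict()   # dict holding the current scale and growth state

# None, of course, has no state_dict() -- which is exactly the
# AttributeError raised in dump_checkpoint above.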
Quick fix: set trainer.scaler yourself after the Trainer is constructed and before calling lr_find. (This doesn't help if you want to use the auto_lr_find option.)
import torch
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=1,
    max_epochs=20,
    precision=16,
)
# Work around the None scaler: create it manually before lr_find
trainer.scaler = torch.cuda.amp.GradScaler()
lrf = trainer.lr_find(model=net, train_dataloader=trn_dl)
Real fix: ensure that, given the Trainer args, the scaler is initialized to a non-None value before it's needed elsewhere. Contributors need to weigh in on how.
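One possible shape for that fix, sketched only as a suggestion (both snippets reference Trainer internals visible in the traceback above; neither is the actual patch):

# Option A: create the scaler eagerly during Trainer setup,
# so it exists before lr_find tries to checkpoint it
if self.precision == 16 and NATIVE_AMP_AVALAIBLE and not self.use_tpu:
    self.scaler = torch.cuda.amp.GradScaler()

# Option B: make dump_checkpoint tolerate a not-yet-created scaler
if self.use_amp and NATIVE_AMP_AVALAIBLE and not self.use_tpu \
        and self.scaler is not None:
    checkpoint['native_amp_scaling_state'] = self.scaler.state_dict()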
Seems to be a duplicate of #1827.