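If model.hparams.batch_size is too large and the call to trainer.fit() runs out of memory on the first iteration of binary search scaling, the batch size scaler crashes with an UnboundLocalError instead of searching for a smaller size.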
Steps to reproduce the behavior:
Set model.hparams.batch_size to a large number, such that it does not fit on your GPU.
File "/home/nick/anaconda3/envs/ml/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 951, in fit
self.scale_batch_size(model, mode=self.auto_scale_batch_size)
File "/home/nick/anaconda3/envs/ml/lib/python3.7/site-packages/pytorch_lightning/trainer/training_tricks.py", line 164, in scale_batch_size
new_size = _run_binsearch_scaling(self, model, new_size, batch_arg_name, max_trials)
File "/home/nick/anaconda3/envs/ml/lib/python3.7/site-packages/pytorch_lightning/trainer/training_tricks.py", line 312, in _run_binsearch_scaling
midval = (high + low) // 2
UnboundLocalError: local variable 'low' referenced before assignment
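The traceback is consistent with `low` being assigned only after a successful trial, so an out-of-memory failure on the very first trial reaches the midpoint computation with `low` still unbound. Below is a hypothetical, simplified sketch of that pattern. It is not the actual Lightning source, and `search_once` and `fits` are illustrative names I made up.

```python
def search_once(batch_size, fits):
    """Hypothetical reduction of one search step; not the Lightning implementation.

    `fits` is an assumed callback returning True if a trial run with the
    given batch size completes without running out of memory.
    """
    high = None
    try:
        if not fits(batch_size):
            # Stand-in for the CUDA out-of-memory error raised during fit().
            raise RuntimeError("out of memory")
        low = batch_size  # only bound when the trial succeeds
        return low
    except RuntimeError:
        high = batch_size
        # If the first trial failed, `low` was never bound, so this line
        # raises UnboundLocalError.
        return (high + low) // 2
```

Calling `search_once(1024, lambda n: False)` raises the same kind of UnboundLocalError as in the traceback above.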
It should handle this initial case properly and perform binary search with low=0 and high=initial batch size.
If no batch size fits in the given memory, it should warn the user.
I'm trying to train a large NLP model, so even very small batch sizes can fail initially.
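A minimal sketch of the behavior I have in mind, assuming that whenever a batch size fits, every smaller one fits too. The names `find_max_batch_size` and `fits` are hypothetical and not part of Lightning. The sketch only covers the case where the initial size is too large and leaves out the upward doubling phase.

```python
import warnings


def find_max_batch_size(initial_size, fits):
    """Largest batch size in [1, initial_size] for which fits(n) is True, else 0.

    `fits` is an assumed callback that runs a short trial with the given
    batch size and returns False on an out-of-memory error.
    """
    if fits(initial_size):
        return initial_size

    # Invariant: `low` fits (or is 0, meaning nothing has fit yet),
    # and `high` is known not to fit.
    low, high = 0, initial_size
    while high - low > 1:
        mid = (high + low) // 2
        if fits(mid):
            low = mid
        else:
            high = mid

    if low == 0:
        warnings.warn("No batch size fits in the available memory.")
    return low
```

For example, `find_max_batch_size(1024, lambda n: n <= 37)` narrows the range down to 37, and a model for which nothing fits yields 0 together with a warning.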
Hi! Thanks for your contribution! Great first issue!
@nick-fung good finding, mind sending a PR with an adjustment? 🐰
cc: @SkafteNicki @justusschock
@awaelchli
This is strange, since the initial batch size should always be set to init_val, which has a default value of 2:
https://github.com/PyTorchLightning/pytorch-lightning/blob/3910ad033074367f6abfe0001562db725a75cb73/pytorch_lightning/trainer/training_tricks.py#L191
@nick-fung do you have a colab notebook where I can reproduce the error?
On master it works (I used gpu_template.py, which uses hparams as in your case). I first confirmed that I get an OOM error without running the auto scaler. With auto_scale_batch_size=True (on master we now also need to call trainer.tune()), it searches through all sizes starting from 2.
From your issue description I see you use 0.8.5. We have fixed many things since then, so if you upgrade to 0.9.1rc1 I'm sure it will work.
Closing this for now.