Pytorch-lightning: Auto-scaling batch size fails if the initial batch size is too large

Created on 22 Aug 2020 · 5 Comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

If model.hparams.batch_size is too large, the call to fit() crashes with an UnboundLocalError when the very first trial of the binary-search batch-size scaling fails (e.g. with an out-of-memory error), instead of continuing the search with a smaller size.

To Reproduce

Steps to reproduce the behavior:

  1. Create a large model.
  2. Set model.hparams.batch_size to a number so large that a batch does not fit on your GPU (a minimal repro sketch follows the traceback below).
  3. See the following error:
 File "/home/nick/anaconda3/envs/ml/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 951, in fit
    self.scale_batch_size(model, mode=self.auto_scale_batch_size)
  File "/home/nick/anaconda3/envs/ml/lib/python3.7/site-packages/pytorch_lightning/trainer/training_tricks.py", line 164, in scale_batch_size
    new_size = _run_binsearch_scaling(self, model, new_size, batch_arg_name, max_trials)
  File "/home/nick/anaconda3/envs/ml/lib/python3.7/site-packages/pytorch_lightning/trainer/training_tricks.py", line 312, in _run_binsearch_scaling
    midval = (high + low) // 2
UnboundLocalError: local variable 'low' referenced before assignment
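
To make steps 1–3 concrete, here is a minimal, schematic repro (not my actual training script): a LightningModule exposing hparams.batch_size, run with auto_scale_batch_size='binsearch' on PL 0.8.5. The BigModel name and the layer/dataset sizes are placeholders; scale them up until the very first trial OOMs on your GPU to hit the error above.

```python
from argparse import Namespace

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class BigModel(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams                      # the batch-size finder reads/writes hparams.batch_size
        self.net = torch.nn.Linear(4096, 4096)      # stand-in for a large model

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {"loss": F.mse_loss(self(x), y)}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        data = TensorDataset(torch.randn(50_000, 4096), torch.randn(50_000, 4096))
        return DataLoader(data, batch_size=self.hparams.batch_size)


model = BigModel(Namespace(batch_size=50_000))      # deliberately too large for the GPU
trainer = pl.Trainer(gpus=1, auto_scale_batch_size="binsearch")
trainer.fit(model)                                  # first trial OOMs -> UnboundLocalError
```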

Expected behavior

It should handle this initial failure properly and perform the binary search with low=0 and high=the initial batch size.
If no batch size fits in the available memory, it should warn the user.
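
For illustration, a hedged sketch of that expected behavior: a binary search that keeps `low` defined even when the very first trial OOMs, and warns when no batch size fits at all. The helper `run_one_trial` is a placeholder, not Lightning's internal API, and the real routine also grows the size on success (omitted here for brevity).

```python
import warnings


def binsearch_batch_size(run_one_trial, initial_size, max_trials=25):
    """Return the largest batch size that fits, or None (with a warning) if none does."""
    low, high = 0, initial_size          # low is always defined, even if trial 1 OOMs
    new_size = initial_size
    best = None
    for _ in range(max_trials):
        try:
            run_one_trial(new_size)      # e.g. a short fit() run with this batch size
            best, low = new_size, new_size
            if high - low <= 1:          # converged
                break
            new_size = (low + high) // 2
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise
            high = new_size              # shrink the search window on OOM
            if high - low <= 1:
                break
            new_size = (low + high) // 2
    if best is None:
        warnings.warn("No batch size fits in memory, not even batch_size=1.")
    return best
```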

Environment

  • CUDA:
    - GPU:
      - GeForce GTX 1080 Ti
    - available: True
    - version: 10.0.130
  • Packages:
    - numpy: 1.18.1
    - pyTorch_debug: False
    - pyTorch_version: 1.3.1
    - pytorch-lightning: 0.8.5
    - tensorboard: 2.3.0
    - tqdm: 4.47.0
  • System:
    - OS: Linux
    - architecture:
      - 64bit
      - ELF
    - processor: x86_64
    - python: 3.7.6
    - version: #58-Ubuntu SMP Fri Jul 10 19:33:51 UTC 2020

Additional context

I'm trying to train a large NLP model, so even very small batch sizes can fail on the first trial.

Labels: bug / fix, help wanted

All 5 comments

Hi! Thanks for your contribution! Great first issue!

@nick-fung good finding, mind sending a PR with an adjustment? 🐰
cc: @SkafteNicki @justusschock

@awaelchli

this is strange, since the initial batch size should always be set to init_val, which has a default value of 2:
https://github.com/PyTorchLightning/pytorch-lightning/blob/3910ad033074367f6abfe0001562db725a75cb73/pytorch_lightning/trainer/training_tricks.py#L191
@nick-fung do you have a colab notebook where I can reproduce the error?

On master it works (gpu_template.py, which uses hparams as in your case). I first verified that I get an OOM without running the auto scaling; with auto_scale_batch_size=True (and on master we now need to call trainer.tune()), it searches through all sizes starting from 2.
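
In code, the master-branch workflow described above looks roughly like this (a hedged sketch, reusing the hypothetical BigModel from the repro sketch earlier in the issue):

```python
from argparse import Namespace
import pytorch_lightning as pl

model = BigModel(Namespace(batch_size=50_000))   # any LightningModule exposing hparams.batch_size
trainer = pl.Trainer(gpus=1, auto_scale_batch_size=True)
trainer.tune(model)   # on master, the batch-size finder runs here and updates hparams.batch_size
trainer.fit(model)    # training then starts with the tuned batch size
```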

From your PR description I see you use 0.8.5; we have fixed many things since then, so if you upgrade to 0.9.1rc1 I'm sure it will work.
Closing this for now.
