Pytorch-lightning: Auto-scaling batch size fails if the initial batch size is too large

Created on 22 Aug 2020 · 5 Comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

If model.hparams.batch_size is too large, the call to fit() crashes with an UnboundLocalError when the very first trial of the binary-search batch-size scaling fails (e.g. with an out-of-memory error), instead of continuing the search with a smaller size.

To Reproduce

Steps to reproduce the behavior:

  1. Create a large model.
  2. Set model.hparams.batch_size to a number so large that a batch does not fit on your GPU (a minimal repro sketch follows the traceback below).
  3. See the following error:
 File "/home/nick/anaconda3/envs/ml/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 951, in fit
    self.scale_batch_size(model, mode=self.auto_scale_batch_size)
  File "/home/nick/anaconda3/envs/ml/lib/python3.7/site-packages/pytorch_lightning/trainer/training_tricks.py", line 164, in scale_batch_size
    new_size = _run_binsearch_scaling(self, model, new_size, batch_arg_name, max_trials)
  File "/home/nick/anaconda3/envs/ml/lib/python3.7/site-packages/pytorch_lightning/trainer/training_tricks.py", line 312, in _run_binsearch_scaling
    midval = (high + low) // 2
UnboundLocalError: local variable 'low' referenced before assignment
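
To make steps 1–3 concrete, here is a minimal, schematic repro (not my actual training script): a LightningModule exposing hparams.batch_size, run with auto_scale_batch_size='binsearch' on PL 0.8.5. The BigModel name and the layer/dataset sizes are placeholders; scale them up until the very first trial OOMs on your GPU to hit the error above.

```python
from argparse import Namespace

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class BigModel(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams                      # the batch-size finder reads/writes hparams.batch_size
        self.net = torch.nn.Linear(4096, 4096)      # stand-in for a large model

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {"loss": F.mse_loss(self(x), y)}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        data = TensorDataset(torch.randn(50_000, 4096), torch.randn(50_000, 4096))
        return DataLoader(data, batch_size=self.hparams.batch_size)


model = BigModel(Namespace(batch_size=50_000))      # deliberately too large for the GPU
trainer = pl.Trainer(gpus=1, auto_scale_batch_size="binsearch")
trainer.fit(model)                                  # first trial OOMs -> UnboundLocalError
```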

Expected behavior

It should handle this initial failure properly and perform the binary search with low=0 and high=the initial batch size.
If no batch size fits in the available memory, it should warn the user.
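
For illustration, a hedged sketch of that expected behavior: a binary search that keeps `low` defined even when the very first trial OOMs, and warns when no batch size fits at all. The helper `run_one_trial` is a placeholder, not Lightning's internal API, and the real routine also grows the size on success (omitted here for brevity).

```python
import warnings


def binsearch_batch_size(run_one_trial, initial_size, max_trials=25):
    """Return the largest batch size that fits, or None (with a warning) if none does."""
    low, high = 0, initial_size          # low is always defined, even if trial 1 OOMs
    new_size = initial_size
    best = None
    for _ in range(max_trials):
        try:
            run_one_trial(new_size)      # e.g. a short fit() run with this batch size
            best, low = new_size, new_size
            if high - low <= 1:          # converged
                break
            new_size = (low + high) // 2
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise
            high = new_size              # shrink the search window on OOM
            if high - low <= 1:
                break
            new_size = (low + high) // 2
    if best is None:
        warnings.warn("No batch size fits in memory, not even batch_size=1.")
    return best
```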

Environment

  • CUDA:
    - GPU:
      - GeForce GTX 1080 Ti
    - available: True
    - version: 10.0.130
  • Packages:
    - numpy: 1.18.1
    - pyTorch_debug: False
    - pyTorch_version: 1.3.1
    - pytorch-lightning: 0.8.5
    - tensorboard: 2.3.0
    - tqdm: 4.47.0
  • System:
    - OS: Linux
    - architecture:
      - 64bit
      - ELF
    - processor: x86_64
    - python: 3.7.6
    - version: #58-Ubuntu SMP Fri Jul 10 19:33:51 UTC 2020

Additional context

I'm trying to train a large NLP model, so even very small batch sizes can fail on the first trial.

Labels: bug / fix, help wanted

All 5 comments

Hi! Thanks for your contribution! Great first issue!

@nick-fung good finding, mind sending a PR with an adjustment? 🐰
cc: @SkafteNicki @justusschock

@awaelchli

this is strange, since the initial batch size should always be set to init_val, which has a default value of 2:
https://github.com/PyTorchLightning/pytorch-lightning/blob/3910ad033074367f6abfe0001562db725a75cb73/pytorch_lightning/trainer/training_tricks.py#L191
@nick-fung do you have a colab notebook where I can reproduce the error?

On master it works (gpu_template.py, which uses hparams as in your case). I first verified that I get an OOM without running the auto scaling; with auto_scale_batch_size=True (and on master we now need to call trainer.tune()), it searches through all sizes starting from 2.
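
In code, the master-branch workflow described above looks roughly like this (a hedged sketch, reusing the hypothetical BigModel from the repro sketch earlier in the issue):

```python
from argparse import Namespace
import pytorch_lightning as pl

model = BigModel(Namespace(batch_size=50_000))   # any LightningModule exposing hparams.batch_size
trainer = pl.Trainer(gpus=1, auto_scale_batch_size=True)
trainer.tune(model)   # on master, the batch-size finder runs here and updates hparams.batch_size
trainer.fit(model)    # training then starts with the tuned batch size
```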

From your PR description I see you use 0.8.5; we have fixed many things since then, so if you upgrade to 0.9.1rc1 I'm sure it will work.
Closing this for now.
