Pytorch-lightning: Trainer.on_gpu incorrectly set to False when specifying `gpus=0`

Created on 5 Aug 2020 · 15 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

When creating a trainer with the arg gpus=0, the field on_gpu is always set False, even on machines with CUDA available.

The existing logic for on_gpu is:

self.on_gpu = True if (gpus and torch.cuda.is_available()) else False

This is buggy because 0 is "falsy", so gpus=0 makes the check short-circuit to False. It should probably be:

self.on_gpu = gpus is not None and torch.cuda.is_available()
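
To see the difference, here is a minimal standalone sketch (plain Python, not Lightning internals) comparing the two checks for gpus=None, 0, and 1:

import torch

for gpus in (None, 0, 1):
    # current check: 0 is falsy, so `gpus and ...` short-circuits to False
    current = True if (gpus and torch.cuda.is_available()) else False
    # proposed check: only None (or a missing CUDA runtime) yields False
    proposed = gpus is not None and torch.cuda.is_available()
    print(f"gpus={gpus!r}: current={current}, proposed={proposed}")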

To Reproduce

trainer = trainer.Trainer(gpus=0, ...)
Labels: documentation, help wanted

All 15 comments

Hi! thanks for your contribution!, great first issue!

An int value specifies the number of GPUs to train on, not a device index. If you want device 0 specifically, you can set gpus='0'.
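
For reference, a quick sketch of the accepted forms (following the comment above and the configuration table linked below; behavior as of the versions discussed in this thread):

from pytorch_lightning import Trainer

Trainer(gpus=None)   # CPU (the default)
Trainer(gpus=0)      # 0 GPUs, i.e. CPU
Trainer(gpus=2)      # train on 2 GPUs
Trainer(gpus=[0])    # train on the GPU with index 0
Trainer(gpus='0')    # string form: also selects index 0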

@rohitgr7 Sure! I don't mean to turn this into an overly-academic conversation, but the type signature for the gpus arg is Optional[Union[List[int], str, int]], which is somewhat misleading given that gpus=0 doesn't work as expected.

yeah, I guess docs and docstrings can be improved here.
https://pytorch-lightning.readthedocs.io/en/latest/trainer.html#gpus

We also have a complete table of all possible configurations and their meaning here:
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#select-gpu-devices

@rohitgr7 @pgeez how about this clarification in the docs in the section that @rohitgr7 linked:

  • Number of GPUs to train on (int)
  • or Which GPUs to train on (list)
  • can handle strings
# default used by the Trainer (i.e. train on CPU)
trainer = Trainer(gpus=None)
# equivalent
trainer = Trainer(gpus=0)

Would that avoid the misunderstanding you had?

I don't think this is a documentation error; I think this is a bug. In other words, specifying gpus=0 should be perfectly valid and supported, because device 0 is a valid device.
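
(For context on why one might expect that, a minimal plain-PyTorch sketch: 0 is a valid CUDA device index there.)

import torch

device = torch.device('cuda', 0)     # in plain PyTorch, index 0 means the first GPU
x = torch.zeros(1, device=device)    # allocates on that GPU (requires CUDA)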

No, it's not a bug; this is expected behavior, and it is documented that way. gpus=int should mean the number of devices, not the index. It is like this by design.
See #563 for an old discussion of these choices and why they are the way they are (in particular @williamFalcon's comments).

We also have a complete table of all possible configurations and their meaning here:
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#select-gpu-devices

In that table, would it be more readable for users if we sorted by Meaning?
Right now the rows are sorted by Type.

Ahh, ok! Thanks for the clarification all.

I'm not questioning the implementation, but I was confused because in trainer.py, the docstring for the gpus arg is:

gpus: Which GPUs to train on.

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py#L231

@awaelchli To your point above about documenting the equivalence of gpus=None and gpus=0, I like that idea.

@ydcjeff I think keeping them sorted by Type is reasonable.

When running the trainer script without any flags, this log message wrongly appears:

"GPU available: True, used: True"

even though training actually executes on the CPU.

This is because self.on_gpu is True while self.single_gpu is False.

(Windows 2004, Python 3.6.9 Anaconda, PyTorch 1.3.1)

@eladar Just checked, looks like it's fixed on master. I get
GPU available: True, used: False
with
Trainer(gpus=0)

I can confirm that when initializing the Trainer with gpus=0 explicitly, I get the message above. However, when using the default value for gpus, it states
"GPU available: True, used: True"
while still processing inputs on the CPU.

Basically, I've played with the template code and changed it to fit my net, and I executed trainer.py without passing any arguments. The relevant part of trainer.py is as follows:

if __name__ == '__main__':
    parser = ArgumentParser(description='dual patch classification', add_help=False)

    # add args from trainer
    parser = Trainer.add_argparse_args(parser)

    # give the module a chance to add own params
    # good practice to define LightningModule-specific params in the module
    parser = LitModel.add_model_specific_args(parser)

    # parse params
    args = parser.parse_args()

    main(args)

Then I initialized the trainer in main(args):

trainer = Trainer.from_argparse_args(args)
trainer.fit(model)

Doing so, I get
"GPU available: True, used: True"

but when debugging the code, it seems that data is never passed to the GPU. The Trainer property single_gpu is False.

pytorch_lightning.__version__
Out[2]: '0.8.5'
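
As an aside, with the add_argparse_args flow above, the GPU count can also be pinned explicitly on the command line (assuming the script is saved as trainer.py):

python trainer.py --gpus 0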

  1. Use 0.9.0rc13.
  2. Look at the GPU usage with nvidia-smi, for example:
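
nvidia-smi -l 1   # refreshes the GPU usage readout every second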

On 0.9.0rc13 it is fixed:

GPU available: True, used: False
