Pytorch-lightning: Trainer.on_gpu incorrectly set to False when specifying `gpus=0`

Created on 5 Aug 2020 · 15 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

When creating a trainer with the arg gpus=0, the field on_gpu is always set False, even on machines with CUDA available.

The existing logic for on_gpu is:

self.on_gpu = True if (gpus and torch.cuda.is_available()) else False

This is buggy because 0 is "falsy", so gpus=0 makes the check short-circuit to False. It should probably be:

self.on_gpu = gpus is not None and torch.cuda.is_available()
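
To see the difference, here is a minimal standalone sketch (plain Python, not Lightning internals) comparing the two checks for gpus=None, 0, and 1:

import torch

for gpus in (None, 0, 1):
    # current check: 0 is falsy, so `gpus and ...` short-circuits to False
    current = True if (gpus and torch.cuda.is_available()) else False
    # proposed check: only None (or a missing CUDA runtime) yields False
    proposed = gpus is not None and torch.cuda.is_available()
    print(f"gpus={gpus!r}: current={current}, proposed={proposed}")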

To Reproduce

trainer = trainer.Trainer(gpus=0, ...)
Labels: documentation, help wanted

All 15 comments

Hi! thanks for your contribution!, great first issue!

An int value specifies the number of GPUs to train on, not a device index. If you want device 0 specifically, you can set gpus='0'.
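
For reference, a quick sketch of the accepted forms (following the comment above and the configuration table linked below; behavior as of the versions discussed in this thread):

from pytorch_lightning import Trainer

Trainer(gpus=None)   # CPU (the default)
Trainer(gpus=0)      # 0 GPUs, i.e. CPU
Trainer(gpus=2)      # train on 2 GPUs
Trainer(gpus=[0])    # train on the GPU with index 0
Trainer(gpus='0')    # string form: also selects index 0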

@rohitgr7 Sure! I don't mean to turn this into an overly-academic conversation, but the type signature for the gpus arg is Optional[Union[List[int], str, int]], which is somewhat misleading given that gpus=0 doesn't work as expected.

yeah, I guess docs and docstrings can be improved here.
https://pytorch-lightning.readthedocs.io/en/latest/trainer.html#gpus

We also have a complete table of all possible configurations and their meaning here:
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#select-gpu-devices

@rohitgr7 @pgeez how about this clarification in the docs in the section that @rohitgr7 linked:

  • Number of GPUs to train on (int)
  • or Which GPUs to train on (list)
  • can handle strings
# default used by the Trainer (i.e. train on CPU)
trainer = Trainer(gpus=None)
# equivalent
trainer = Trainer(gpus=0)

Would that avoid the misunderstanding you had?

I don't think this is a documentation error; I think this is a bug. In other words, specifying gpus=0 should be perfectly valid and supported, because device 0 is a valid device.
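
(For context on why one might expect that, a minimal plain-PyTorch sketch: 0 is a valid CUDA device index there.)

import torch

device = torch.device('cuda', 0)     # in plain PyTorch, index 0 means the first GPU
x = torch.zeros(1, device=device)    # allocates on that GPU (requires CUDA)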

No, it's not a bug; this is expected behavior, and it is documented that way. gpus=int should mean the number of devices, not the index. It is like this by design.
See #563 for an old discussion of these choices and why they are the way they are (in particular @williamFalcon's comments).

We also have a complete table of all possible configurations and their meaning here:
https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#select-gpu-devices

In that table, would it be more readable for users if we sorted by Meaning?
Right now the rows are sorted by Type.

Ahh, ok! Thanks for the clarification all.

I'm not questioning the implementation, but I was confused because in trainer.py, the docstring for the gpus arg is:

gpus: Which GPUs to train on.

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py#L231

@awaelchli To your point above about documenting the equivalence of gpus=None and gpus=0, I like that idea.

@ydcjeff I think keeping them sorted by Type is reasonable.

When running the trainer script without any flags, this log message wrongly appears:

"GPU available: True, used: True"

even though training actually executes on the CPU.

This is because self.on_gpu is True while self.single_gpu is False.

(Windows 2004, Python 3.6.9 Anaconda, PyTorch 1.3.1)

@eladar Just checked, looks like it's fixed on master. I get
GPU available: True, used: False
with
Trainer(gpus=0)

I can confirm that when initializing the Trainer with gpus=0 explicitly, I get the message above. However, when using the default value for gpus, it states
"GPU available: True, used: True"
while still processing inputs on the CPU.

Basically, I've played with the template code and changed it to fit my net, and I executed trainer.py without passing any arguments. The relevant part of trainer.py is as follows:

if __name__ == '__main__':
    parser = ArgumentParser(description='dual patch classification', add_help=False)

    # add args from trainer
    parser = Trainer.add_argparse_args(parser)

    # give the module a chance to add own params
    # good practice to define LightningModule-specific params in the module
    parser = LitModel.add_model_specific_args(parser)

    # parse params
    args = parser.parse_args()

    main(args)

Then I initialized the trainer in main(args):

trainer = Trainer.from_argparse_args(args)
trainer.fit(model)

Doing so, I get
"GPU available: True, used: True"

but when debugging the code, it seems that data is never passed to the GPU. The Trainer property single_gpu is False.

pytorch_lightning.__version__
Out[2]: '0.8.5'
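
As an aside, with the add_argparse_args flow above, the GPU count can also be pinned explicitly on the command line (assuming the script is saved as trainer.py):

python trainer.py --gpus 0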

  1. Use 0.9.0rc13.
  2. Look at the GPU usage with nvidia-smi, for example:
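
nvidia-smi -l 1   # refreshes the GPU usage readout every second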

On 0.9.0rc13 it is fixed:

GPU available: True, used: False
