Pytorch-lightning: Update docs to be clear on --gpus behaviour.

Created on 29 Nov 2019 · 17Comments · Source: PyTorchLightning/pytorch-lightning

Final resolution:

The resolution then should be alternative 1, since we agree that don't want to get rid of the 'number of gpus' functionality (which was the original proposed aggressive solution).

If we detect --gpus 0 with int, a warning should suffice alongside updated docs.

Is your feature request related to a problem? Please describe.
Trainer.gpus can currently be used to specify a number of GPUs or specific GPUs to run on. This makes values like

0 (run on CPU), "0" (Run on GPU 0), [0] (run on GPU 0)

confusing for newcomers.

Describe the solution you'd like
As an aggressive solution to this issue, we move to have gpus always specify specific GPUs as that is the more encompassing case. Going forward, we can put a deprecation notice up when a single int is passed in:

"In the future, gpus to specify specific GPUs the model will run on. If you would like to run on CPU, pass in None or an empty list."

Then, in the next breaking version, we can simplify the behaviour.

Describe alternatives you've considered
1) Keep as is: This is a viable solution. We could just document more carefully. However, anecdotally, this is confusing for our team and most likely other users.
2) Have gpus mean number of GPUs: There are many cases where researchers need to run multiple experiments on the same time on a multi-gpu machine. Being able to specify which GPU easily would be useful. As an argument for this, one could use 'CUDA_VISIBLE_DEVICES' to do this.
3) Create a new num_gpus argument: This could make it self-documenting and allow for both workflows. However, it will be an additional argument to maintain.

Additional context

enhancement help wanted

Source

jeffling

Most helpful comment

@Borda yep, today or tomorrow :)

awaelchli on 4 Dec 2019

👍2

All 17 comments

My two cents : I would be against 2 and 3.
Why not 2? If the GPU ids are passed with CUDA_VISIBLE_DEVICES, then what's the point of specifying the number of GPUs
CUDA_VISIBLE_DEVICES=0,1 python train.py --gpus 2 asks for double work of the user and would CUDA_VISIBLE_DEVICES=0,1 python train.py --gpus 1 make sense?
Why not 3? The arguments are not orthogonal and there are already too many arguments in Trainer
Solution 1 would be ok IMO, it's nice to have different possibilities.
I'm also not against the solution you suggest

It will be clearer for sure.
It should additionally support something like gpus=-1 or gpus=[-1] to use all available GPUs
The bad point I see is for this solution is the interface with argparse from bash, it's not straightforward to parse a list of integer from the command line.

BTW, this line makes that when gpus is an integer, only the gpus first GPUs will be used. This makes the usage of integer a little less useful.

mpariente on 30 Nov 2019

@jeffling Let's keep the first option. As mentioned by @mpariente here are the reasons:

Case 1:
User uses an integer to indicate how many GPUs to use.
This is likely the most common case. User also doesn't want to deal with choosing the GPUs.

gpus=2

Case 2:
User cares about choosing the GPUs. In this case this already implies 2 gpus, so it'd be redundant to have another argument.

gpus=[0, 3]

Case 3:
These arguments likely come from command line.
In this case, we need to support strings for these users.

python main.py --gpus "0,1,2,3"

I do agree that we need to support -1 for all (i thought we already did).

So, the resolution seems to be updated docs, maybe a table with examples?

williamFalcon on 30 Nov 2019

I do agree that we need to support -1 for all (i thought we already did).

I told that in the case the suggestion was adopted. -1 is supported as an integer and a string but not as a list. But it seems normal with the current behavior.

So, the resolution seems to be updated docs, maybe a table with examples?

Can the docs be updated on master directly?

mpariente on 30 Nov 2019

Case 3:
These arguments likely come from command line.
In this case, we need to support strings for these users.
python main.py --gpus "0,1,2,3"

I would not use a string for multiple options, argparse supports multiple values directly:

parser.add_argument('--gpus', type=int, nargs='*', help='list of GPUs')

to be used as

python main.py --gpus 0 1 2 3

Borda on 30 Nov 2019

@Borda yeah, but this assumes users know argparse super well.... again, we need to keep in mind non-expert users.

I have however thought about providing a default parser with args for each trainer flag so users don't have to remember them (maybe a new issue)?

williamFalcon on 30 Nov 2019

@Borda yeah, but this assumes users know argparse super well.... again, we need to keep in mind non-expert users.

well, you should know it well as developer, but for the users you have the help message to guide them... then this default parameters can be written in docs or readme :]

Borda on 30 Nov 2019

i'm talking about scientists, physicists etc... deep learning is not just being done by engineers or developers.

That level of usability is critical and at the core of lightning

williamFalcon on 30 Nov 2019

I meant that the people writing argparser should do it easier like with this listed parameters, I personally would be very confused passing it as a string with a separator... but I am open discussion :]

Borda on 30 Nov 2019

From commandline, an issue happens with using nargs syntax --gpus 0 1 2 vs --gpus 0,1,2.

Let's say I'm doing two runs on 3 GPUs, one that uses 2 GPUs and one that uses 1.

First run: python train.py --gpus 0 1. cool.
Second run: python train.py --gpus 2. Whoops, how do we parse this?

If our ideal for the framework user is to be able to do straight pass-throughs to lightning, the string is a better choice. Otherwise everyone will need to deal with the 'list vs int' logic their own way.

The original issue: if we're using string, a user can easily screw up python train.py --gpus "0" vs python train.py --gpus 0. But we can mitigate through other means

Final resolution:

The resolution then should be alternative 1, since we agree that don't want to get rid of the 'number of gpus' functionality (which was the original proposed aggressive solution).

If we detect --gpus 0 with int, a warning should suffice alongside updated docs.

I still have a few PRs I want to do before this, so if anybody would like an easy one feel free to take it :)

jeffling on 1 Dec 2019

From commandline, an issue happens with using nargs syntax --gpus 0 1 2 vs --gpus 0,1,2.
Let's say I'm doing two runs on 3 GPUs, one that uses 2 GPUs and one that uses 1.
First run: python train.py --gpus 0 1. cool.
Second run: python train.py --gpus 2. Whoops, how do we parse this?

This is s simple fix, and we shall correct it in lightning size
after parsing arg parser check if it is a list, else make it as a list... I did the same couple times in past

Borda on 1 Dec 2019

yeah, either way we need to keep support for passing in a string because this is exactly the friction we need to avoid with users
:)

williamFalcon on 1 Dec 2019

👍1

How about we make a list in the docs with all accepted inputs and how they are parsed/interpreted, with the edge cases discussed here? Then we write a test for each case to make sure that the parsing works properly. The parsing is already quite complicated and it takes a bit of time to see the edge cases.

awaelchli on 3 Dec 2019

@awaelchli yes, mind submitting a PR?

williamFalcon on 4 Dec 2019

sure. I will try to deliver it before next release.

awaelchli on 4 Dec 2019

next release on 6th of Dec? :)

Borda on 4 Dec 2019

@Borda yep, today or tomorrow :)

awaelchli on 4 Dec 2019

👍2

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.