Final resolution:
The resolution then should be alternative 1, since we agree that don't want to get rid of the 'number of gpus' functionality (which was the original proposed aggressive solution).
If we detect --gpus 0 with int, a warning should suffice alongside updated docs.
Is your feature request related to a problem? Please describe.
Trainer.gpus can currently be used to specify a number of GPUs or specific GPUs to run on. This makes values like
0 (run on CPU), "0" (Run on GPU 0), [0] (run on GPU 0)
confusing for newcomers.
Describe the solution you'd like
As an aggressive solution to this issue, we move to have gpus always specify specific GPUs as that is the more encompassing case. Going forward, we can put a deprecation notice up when a single int is passed in:
"In the future,
gpusto specify specific GPUs the model will run on. If you would like to run on CPU, pass in None or an empty list."
Then, in the next breaking version, we can simplify the behaviour.
Describe alternatives you've considered
1) Keep as is: This is a viable solution. We could just document more carefully. However, anecdotally, this is confusing for our team and most likely other users.
2) Have gpus mean number of GPUs: There are many cases where researchers need to run multiple experiments on the same time on a multi-gpu machine. Being able to specify which GPU easily would be useful. As an argument for this, one could use 'CUDA_VISIBLE_DEVICES' to do this.
3) Create a new num_gpus argument: This could make it self-documenting and allow for both workflows. However, it will be an additional argument to maintain.
Additional context
My two cents : I would be against 2 and 3.
Why not 2? If the GPU ids are passed with CUDA_VISIBLE_DEVICES, then what's the point of specifying the number of GPUs
CUDA_VISIBLE_DEVICES=0,1 python train.py --gpus 2 asks for double work of the user and would CUDA_VISIBLE_DEVICES=0,1 python train.py --gpus 1 make sense?
Why not 3? The arguments are not orthogonal and there are already too many arguments in Trainer
Solution 1 would be ok IMO, it's nice to have different possibilities.
I'm also not against the solution you suggest
gpus=-1 or gpus=[-1] to use all available GPUsargparse from bash, it's not straightforward to parse a list of integer from the command line. BTW, this line makes that when gpus is an integer, only the gpus first GPUs will be used. This makes the usage of integer a little less useful.
@jeffling Let's keep the first option. As mentioned by @mpariente here are the reasons:
Case 1:
User uses an integer to indicate how many GPUs to use.
This is likely the most common case. User also doesn't want to deal with choosing the GPUs.
gpus=2
Case 2:
User cares about choosing the GPUs. In this case this already implies 2 gpus, so it'd be redundant to have another argument.
gpus=[0, 3]
Case 3:
These arguments likely come from command line.
In this case, we need to support strings for these users.
python main.py --gpus "0,1,2,3"
I do agree that we need to support -1 for all (i thought we already did).
So, the resolution seems to be updated docs, maybe a table with examples?
I do agree that we need to support -1 for all (i thought we already did).
I told that in the case the suggestion was adopted. -1 is supported as an integer and a string but not as a list. But it seems normal with the current behavior.
So, the resolution seems to be updated docs, maybe a table with examples?
Can the docs be updated on master directly?
Case 3:
These arguments likely come from command line.
In this case, we need to support strings for these users.python main.py --gpus "0,1,2,3"
I would not use a string for multiple options, argparse supports multiple values directly:
parser.add_argument('--gpus', type=int, nargs='*', help='list of GPUs')
to be used as
python main.py --gpus 0 1 2 3
@Borda yeah, but this assumes users know argparse super well.... again, we need to keep in mind non-expert users.
I have however thought about providing a default parser with args for each trainer flag so users don't have to remember them (maybe a new issue)?
@Borda yeah, but this assumes users know argparse super well.... again, we need to keep in mind non-expert users.
well, you should know it well as developer, but for the users you have the help message to guide them... then this default parameters can be written in docs or readme :]
i'm talking about scientists, physicists etc... deep learning is not just being done by engineers or developers.
That level of usability is critical and at the core of lightning
I meant that the people writing argparser should do it easier like with this listed parameters, I personally would be very confused passing it as a string with a separator... but I am open discussion :]
From commandline, an issue happens with using nargs syntax --gpus 0 1 2 vs --gpus 0,1,2.
Let's say I'm doing two runs on 3 GPUs, one that uses 2 GPUs and one that uses 1.
First run: python train.py --gpus 0 1. cool.
Second run: python train.py --gpus 2. Whoops, how do we parse this?
If our ideal for the framework user is to be able to do straight pass-throughs to lightning, the string is a better choice. Otherwise everyone will need to deal with the 'list vs int' logic their own way.
The original issue: if we're using string, a user can easily screw up python train.py --gpus "0" vs python train.py --gpus 0. But we can mitigate through other means
Final resolution:
The resolution then should be alternative 1, since we agree that don't want to get rid of the 'number of gpus' functionality (which was the original proposed aggressive solution).
If we detect --gpus 0 with int, a warning should suffice alongside updated docs.
I still have a few PRs I want to do before this, so if anybody would like an easy one feel free to take it :)
From commandline, an issue happens with using
nargssyntax--gpus 0 1 2vs--gpus 0,1,2.
Let's say I'm doing two runs on 3 GPUs, one that uses 2 GPUs and one that uses 1.
First run:python train.py --gpus 0 1. cool.
Second run:python train.py --gpus 2. Whoops, how do we parse this?
This is s simple fix, and we shall correct it in lightning size
after parsing arg parser check if it is a list, else make it as a list... I did the same couple times in past
yeah, either way we need to keep support for passing in a string because this is exactly the friction we need to avoid with users
:)
How about we make a list in the docs with all accepted inputs and how they are parsed/interpreted, with the edge cases discussed here? Then we write a test for each case to make sure that the parsing works properly. The parsing is already quite complicated and it takes a bit of time to see the edge cases.
@awaelchli yes, mind submitting a PR?
sure. I will try to deliver it before next release.
next release on 6th of Dec? :)
@Borda yep, today or tomorrow :)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Most helpful comment
@Borda yep, today or tomorrow :)