I have 4 GPUs (0,1,2,3), but I want to use only GPUs 2 and 3 to train my models. What should I do to train my model successfully?
By the way, I tried both CUDA_VISIBLE_DIVICES=2,3 xxxxx --gpus 2 and CUDA_VISIBLE_DIVICES=0,1,2,3 xxxx --gpus 2 (note the misspelled variable name; it should be CUDA_VISIBLE_DEVICES), and neither works.
It works for me in dist_train.sh...
#!/usr/bin/env bash
PYTHON=${PYTHON:-"python"}
CUDA_VISIBLE_DEVICES=4,5,6,7 $PYTHON -m torch.distributed.launch --nproc_per_node=$2 $(dirname "$0")/train.py $1 --launcher pytorch ${@:3}
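In that script, $1 is the config file, $2 is the number of processes, and ${@:3} forwards any remaining flags to train.py. A minimal sketch of that argument handling (the demo function name is just for illustration, not part of the repo):

```shell
#!/usr/bin/env bash
# Sketch of how dist_train.sh consumes its positional arguments.
demo() {
    echo "config: $1"        # first argument: the config file
    echo "nproc: $2"         # second argument: number of GPUs/processes
    echo "extra: ${@:3}"     # everything else is passed through to train.py
}
demo configs/faster_rcnn_r101_fpn_1x.py 2 --seed 0
```

So an invocation like ./tools/dist_train.sh CONFIG 2 --seed 0 launches 2 processes on the given config and hands --seed 0 through unchanged.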
Thank you very much. Do you put these 3 commands in a shell script?
Yes
OK, I will try. Thank you!
No thank you!
Hi @AresGao, in your commands I don't see how to specify the config file. In the README the author says to launch distributed training with a command like ./tools/dist_train.sh
Oh, I referenced https://github.com/facebookresearch/maskrcnn-benchmark and used the following commands to train successfully.
export NGPUS=2
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=$NGPUS tools/train.py configs/faster_rcnn_r101_fpn_1x.py --gpus 2
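This works because CUDA_VISIBLE_DEVICES=2,3 renumbers the GPUs: inside the processes, physical GPUs 2 and 3 appear as cuda:0 and cuda:1, so --gpus 2 simply means "use the two visible devices". A small sketch of that remapping (visible_to_physical is a hypothetical helper for illustration, not part of PyTorch or mmdetection):

```python
import os

def visible_to_physical(local_rank, env=None):
    """Map a process's local device index (what torch sees as cuda:N)
    back to the physical GPU id, given CUDA_VISIBLE_DEVICES."""
    env = os.environ if env is None else env
    devices = env.get("CUDA_VISIBLE_DEVICES", "")
    if not devices:
        return local_rank  # all GPUs visible, ids unchanged
    physical = [int(d) for d in devices.split(",")]
    return physical[local_rank]

# With CUDA_VISIBLE_DEVICES=2,3 the two launched processes use
# cuda:0 and cuda:1, which are physical GPUs 2 and 3.
print(visible_to_physical(0, {"CUDA_VISIBLE_DEVICES": "2,3"}))  # 2
print(visible_to_physical(1, {"CUDA_VISIBLE_DEVICES": "2,3"}))  # 3
```

This is also why passing --gpus 2 without the environment variable is not enough: the training script would pick cuda:0 and cuda:1, i.e. physical GPUs 0 and 1.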