Using the latest install from the PyTorch-recommended conda command, along with the following required libraries:
cython
pycocotools
matplotlib
I hit an error when running the Faster R-CNN training command given in https://github.com/pytorch/vision/blob/master/references/detection/README.md
I also wonder whether I could improve the docs by mentioning that, in order to run that example, you must pip install cython, pycocotools, and matplotlib?
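A minimal sketch of a pre-flight check for those extra packages, assuming they are importable under the names `Cython`, `pycocotools`, and `matplotlib`. The helper name `missing_modules` is hypothetical and not part of the reference code.

```python
import importlib.util

# Import names of the extra packages mentioned above. The cython
# package is imported as "Cython" (capital C).
REQUIRED_MODULES = ["Cython", "pycocotools", "matplotlib"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages, install them with pip:", ", ".join(missing))
    else:
        print("All extra packages are importable.")
```

Running this before train.py would surface a missing dependency up front rather than partway through the script.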
Steps to reproduce the behavior:
1. Copy the references/detection/ folder somewhere
2. Run the command provided in README.md
(clone_reference_torchvision) emcp@2600k:~/Dev/git/clone_reference_torchvision$ python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --dataset coco --model fasterrcnn_resnet50_fpn --epochs 26 --lr-steps 16 22 --aspect-ratio-group-factor 3
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
| distributed init (rank 0): env://
Traceback (most recent call last):
File "train.py", line 201, in <module>
main(args)
File "train.py", line 60, in main
Traceback (most recent call last):
File "train.py", line 201, in <module>
main(args)
File "train.py", line 60, in main
utils.init_distributed_mode(args)
File "/home/emcp/Dev/git/clone_reference_torchvision/utils.py", line 317, in init_distributed_mode
utils.init_distributed_mode(args)
torch.cuda.set_device(args.gpu)
File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
File "/home/emcp/Dev/git/clone_reference_torchvision/utils.py", line 317, in init_distributed_mode
torch._C._cuda_setDevice(device)
torch.cuda.set_device(args.gpu)
File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
RuntimeError torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp:59
: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp:59
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Traceback (most recent call last):
File "train.py", line 201, in <module>
main(args)
File "train.py", line 60, in main
utils.init_distributed_mode(args)
File "/home/emcp/Dev/git/clone_reference_torchvision/utils.py", line 317, in init_distributed_mode
torch.cuda.set_device(args.gpu)
File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp:59
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Traceback (most recent call last):
File "train.py", line 201, in <module>
main(args)
File "train.py", line 60, in main
utils.init_distributed_mode(args)
File "/home/emcp/Dev/git/clone_reference_torchvision/utils.py", line 317, in init_distributed_mode
torch.cuda.set_device(args.gpu)
File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp:59
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Traceback (most recent call last):
File "train.py", line 201, in <module>
main(args)
File "train.py", line 60, in main
utils.init_distributed_mode(args)
File "/home/emcp/Dev/git/clone_reference_torchvision/utils.py", line 317, in init_distributed_mode
torch.cuda.set_device(args.gpu)
File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp:59
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Traceback (most recent call last):
File "train.py", line 201, in <module>
main(args)
File "train.py", line 60, in main
utils.init_distributed_mode(args)
File "/home/emcp/Dev/git/clone_reference_torchvision/utils.py", line 317, in init_distributed_mode
torch.cuda.set_device(args.gpu)
File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp:59
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Traceback (most recent call last):
File "train.py", line 201, in <module>
main(args)
File "train.py", line 60, in main
utils.init_distributed_mode(args)
File "/home/emcp/Dev/git/clone_reference_torchvision/utils.py", line 317, in init_distributed_mode
torch.cuda.set_device(args.gpu)
File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp:59
it should execute training
/home/emcp/anaconda3/envs/clone_reference_torchvision/bin/python /home/emcp/Dev/git/clone_reference_torchvision/collect_env.py
Collecting environment information...
PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: 10.2
OS: Ubuntu 20.04 LTS
GCC version: (Ubuntu 9.3.0-10ubuntu2) 9.3.0
CMake version: Could not collect
Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce GTX 1060 6GB
Nvidia driver version: 440.64
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.5.0
[pip] torchvision==0.6.0a0+82fd1c8
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] mkl 2020.1 217
[conda] mkl-service 2.3.0 py38he904b0f_0
[conda] mkl_fft 1.0.15 py38ha843d7b_0
[conda] mkl_random 1.1.0 py38h962f231_0
[conda] numpy 1.18.1 py38h4f9e942_0
[conda] numpy-base 1.18.1 py38hde5b4d6_1
[conda] pytorch 1.5.0 py3.8_cuda10.2.89_cudnn7.6.5_0 pytorch
[conda] torchvision 0.6.0 py38_cu102 pytorch
Process finished with exit code 0
Hi,
The error is most probably because your machine does not have 8 GPUs, but you launched the script asking for 8 (--nproc_per_node=8).
It would be great if you could improve the docs of references/detection/README.md with the extra required libraries. Can you send a PR?
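For the invalid device ordinal error, a rough sketch of how one might choose a safe value for --nproc_per_node. The helper names `pick_nproc_per_node` and `detect_gpu_count` are hypothetical; the only PyTorch call used is `torch.cuda.device_count()`.

```python
def pick_nproc_per_node(requested, available_gpus):
    """Clamp the requested process count to the GPUs actually present."""
    if available_gpus < 1:
        raise RuntimeError("no CUDA device is visible")
    return min(requested, available_gpus)


def detect_gpu_count():
    """Return torch.cuda.device_count(), or 0 if torch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


if __name__ == "__main__":
    count = detect_gpu_count()
    print("visible GPUs:", count)
    if count:
        # 8 is the value used in the README command.
        print("use --nproc_per_node=%d" % pick_nproc_per_node(8, count))
```

On the machine from the report (a single GeForce GTX 1060), this would suggest --nproc_per_node=1.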
Perfect, I had never used these distributed commands before. Shall I make a note of that in the .md as well? I think so.
Secondly, is it implied that we set up the COCO dataset as well? When I turn the number of GPUs down to 1, it gives me this:
python -m torch.distributed.launch --nproc_per_node=1 --use_env train.py --dataset coco --model maskrcnn_resnet50_fpn --epochs 26 --lr-steps 16 22 --aspect-ratio-group-factor 3
| distributed init (rank 0): env://
Namespace(aspect_ratio_group_factor=3, batch_size=2, data_path='/datasets01/COCO/022719/', dataset='coco', device='cuda', dist_backend='nccl', dist_url='env://', distributed=True, epochs=26, gpu=0, lr=0.02, lr_gamma=0.1, lr_step_size=8, lr_steps=[16, 22], model='maskrcnn_resnet50_fpn', momentum=0.9, output_dir='.', pretrained=False, print_freq=20, rank=0, resume='', start_epoch=0, test_only=False, weight_decay=0.0001, workers=4, world_size=1)
Loading data
loading annotations into memory...
Traceback (most recent call last):
File "train.py", line 201, in <module>
main(args)
File "train.py", line 68, in main
dataset, num_classes = get_dataset(args.dataset, "train", get_transform(train=True), args.data_path)
File "train.py", line 47, in get_dataset
ds = ds_fn(p, image_set=image_set, transforms=transform)
File "/home/emcp/Dev/git/clone_reference_torchvision/coco_utils.py", line 241, in get_coco
dataset = CocoDetection(img_folder, ann_file, transforms=transforms)
File "/home/emcp/Dev/git/clone_reference_torchvision/coco_utils.py", line 211, in __init__
super(CocoDetection, self).__init__(img_folder, ann_file)
File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torchvision/datasets/coco.py", line 98, in __init__
self.coco = COCO(annFile)
File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/pycocotools/coco.py", line 84, in __init__
dataset = json.load(open(annotation_file, 'r'))
FileNotFoundError: [Errno 2] No such file or directory: '/datasets01/COCO/022719/annotations/instances_train2017.json'
Traceback (most recent call last):
File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/runpy.py", line 193, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/distributed/launch.py", line 258, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/emcp/anaconda3/envs/clone_reference_torchvision/bin/python', '-u', 'train.py', '--dataset', 'coco', '--model', 'maskrcnn_resnet50_fpn', '--epochs', '26', '--lr-steps', '16', '22', '--aspect-ratio-group-factor', '3']' returned non-zero exit status 1.
You need to pass the --data-path flag to the script with the location of your COCO dataset, see https://github.com/pytorch/vision/blob/e6b4078ec73c2cf4fd4432e19c782db58719fb99/references/detection/train.py#L152
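A small sketch of a check that a directory passed as --data-path has the layout the script expects. Only the train annotation path is taken from the traceback above; the other three entries are assumptions based on the standard COCO 2017 layout, and `missing_coco_paths` is a hypothetical helper.

```python
from pathlib import Path

# Paths looked up relative to --data-path. The first entry appears in
# the FileNotFoundError above; the remaining three are assumptions
# based on the standard COCO 2017 directory layout.
EXPECTED = [
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
    "train2017",
    "val2017",
]


def missing_coco_paths(data_path, expected=EXPECTED):
    """Return the expected relative paths that do not exist under data_path."""
    root = Path(data_path)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    import sys

    data_path = sys.argv[1] if len(sys.argv) > 1 else "."
    for rel in missing_coco_paths(data_path):
        print("missing:", Path(data_path) / rel)
```

If the check prints nothing, the same directory can be passed to train.py with --data-path.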
Yes, you could also add a note about the distributed launch command to the README.
https://github.com/pytorch/vision/pull/2241
I will continue to run through the code in references/detection until I can replicate the expected result. Once I have it working with COCO 2017 (or whichever dataset the coco option needs), I will try to graft in my custom dataset, tagged in COCO style, along with my dataloader, and debug from there.
https://github.com/JRGEMCP/bootstrap-pytorch-torchvision-fasterrcnn/tree/master
It seems implied in the tutorial that you will run this against the default COCO dataset, and I think many people want to see how to do it on their own data.
Or am I wrong, and you should be able to load any COCO-spec dataset from the folder given by --data-path?
Thanks for the PR, I merged it.
You can load any COCO-style dataset with the reference code without modifications. If you want to create your custom dataset, I would recommend checking https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html (and the Colab notebook that is linked at the top of the tutorial).
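As a rough illustration of what "COCO-style" means for detection, here is a minimal annotation dictionary and a sanity check for it. The file name, ids, and the helper `check_coco_style` are made up for the example; the check covers only a few basic fields, not the full COCO specification.

```python
REQUIRED_TOP_LEVEL = ("images", "annotations", "categories")


def check_coco_style(dataset):
    """Return a list of problems found in a COCO-style detection dict."""
    problems = [
        "missing top-level key: " + key
        for key in REQUIRED_TOP_LEVEL
        if key not in dataset
    ]
    if problems:
        return problems
    image_ids = {img["id"] for img in dataset["images"]}
    category_ids = {cat["id"] for cat in dataset["categories"]}
    for ann in dataset["annotations"]:
        if ann["image_id"] not in image_ids:
            problems.append("annotation %s: unknown image_id" % ann["id"])
        if ann["category_id"] not in category_ids:
            problems.append("annotation %s: unknown category_id" % ann["id"])
        if len(ann["bbox"]) != 4:  # COCO boxes are [x, y, width, height]
            problems.append("annotation %s: bbox must have 4 values" % ann["id"])
    return problems


# Hypothetical single-image dataset with one labelled box.
example = {
    "images": [{"id": 1, "file_name": "img_0001.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "widget"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [10.0, 20.0, 100.0, 50.0], "area": 5000.0, "iscrowd": 0}
    ],
}

if __name__ == "__main__":
    print(check_coco_style(example))
```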
FYI, if you change the number of GPUs you train on, you also need to adjust the learning rate in order to match performance. For example, if you decrease the number of GPUs by 2x (from 8 to 4), you'll also need to decrease the learning rate by 2x. This follows the standard scaling rules for multi-GPU training, see https://github.com/facebookresearch/maskrcnn-benchmark#single-gpu-training and the Detectron schedules (although we use epochs here instead of iterations, so you don't need to change the number of epochs).
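The scaling rule can be sketched as a small helper. It assumes that the default lr=0.02 shown in the Namespace output above corresponds to the 8-GPU command from the README and that the per-GPU batch size is left unchanged; `scaled_lr` is a hypothetical name.

```python
BASE_LR = 0.02      # default lr, as shown in the Namespace output above
BASE_NUM_GPUS = 8   # GPU count in the README command (assumed baseline)


def scaled_lr(num_gpus, base_lr=BASE_LR, base_num_gpus=BASE_NUM_GPUS):
    """Scale the learning rate linearly with the number of GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return base_lr * num_gpus / base_num_gpus


if __name__ == "__main__":
    for gpus in (8, 4, 2, 1):
        print("%d GPU(s): lr = %g" % (gpus, scaled_lr(gpus)))
```

Under these assumptions, a single-GPU run would use a learning rate of 0.0025, which could be passed to train.py through its lr argument.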
I believe I've answered your questions, and as such I'm closing the issue, but let us know if you hit problems.