Vision: fresh installation of PyTorch 1.5 and torchvision 0.6 yields error when following the docs

Created on 19 May 2020 · 5 Comments · Source: pytorch/vision

🐛 Bug

Using the latest installation from the PyTorch-recommended conda command, along with the following required libraries:

cython
pycocotools
matplotlib

I hit an error when running the command given in https://github.com/pytorch/vision/blob/master/references/detection/README.md
for training Faster R-CNN.

Could I also improve the docs by mentioning that, in order to run that example, you must pip install cython, pycocotools, and matplotlib?
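As a quick way to confirm the extra libraries are present before launching training, here is a minimal sketch. The helper name missing_modules is hypothetical, and the import names (Cython for the cython package) are my assumption about how those packages are imported; none of this is part of the reference scripts.

```python
import importlib.util

# Import names assumed for the three packages mentioned above.
REQUIRED = ("Cython", "pycocotools", "matplotlib")


def missing_modules(names=REQUIRED):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules, install them with pip:", ", ".join(missing))
else:
    print("All extra dependencies for references/detection are importable.")
```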

To Reproduce

Steps to reproduce the behavior:

Copy the references/detection/ folder somewhere
Create a conda environment and install the latest stable pytorch and torchvision
Attempt to run the command provided in README.md

(clone_reference_torchvision) emcp@2600k:~/Dev/git/clone_reference_torchvision$ python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --dataset coco --model fasterrcnn_resnet50_fpn --epochs 26 --lr-steps 16 22 --aspect-ratio-group-factor 3
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
| distributed init (rank 0): env://
Traceback (most recent call last):
  File "train.py", line 201, in <module>
    main(args)
  File "train.py", line 60, in main
Traceback (most recent call last):
  File "train.py", line 201, in <module>
        main(args)
  File "train.py", line 60, in main
    utils.init_distributed_mode(args)
  File "/home/emcp/Dev/git/clone_reference_torchvision/utils.py", line 317, in init_distributed_mode
utils.init_distributed_mode(args)
    torch.cuda.set_device(args.gpu)
  File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
  File "/home/emcp/Dev/git/clone_reference_torchvision/utils.py", line 317, in init_distributed_mode
    torch._C._cuda_setDevice(device)
    torch.cuda.set_device(args.gpu)
  File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
RuntimeError    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp:59
: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp:59
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 201, in <module>
    main(args)
  File "train.py", line 60, in main
    utils.init_distributed_mode(args)
  File "/home/emcp/Dev/git/clone_reference_torchvision/utils.py", line 317, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp:59
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 201, in <module>
    main(args)
  File "train.py", line 60, in main
    utils.init_distributed_mode(args)
  File "/home/emcp/Dev/git/clone_reference_torchvision/utils.py", line 317, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp:59
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 201, in <module>
    main(args)
  File "train.py", line 60, in main
    utils.init_distributed_mode(args)
  File "/home/emcp/Dev/git/clone_reference_torchvision/utils.py", line 317, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp:59
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 201, in <module>
    main(args)
  File "train.py", line 60, in main
    utils.init_distributed_mode(args)
  File "/home/emcp/Dev/git/clone_reference_torchvision/utils.py", line 317, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp:59
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 201, in <module>
    main(args)
  File "train.py", line 60, in main
    utils.init_distributed_mode(args)
  File "/home/emcp/Dev/git/clone_reference_torchvision/utils.py", line 317, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/cuda/__init__.py", line 245, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1587428207430/work/torch/csrc/cuda/Module.cpp:59

Expected behavior

It should start training.

Environment

/home/emcp/anaconda3/envs/clone_reference_torchvision/bin/python /home/emcp/Dev/git/clone_reference_torchvision/collect_env.py
Collecting environment information...
PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Ubuntu 20.04 LTS
GCC version: (Ubuntu 9.3.0-10ubuntu2) 9.3.0
CMake version: Could not collect

Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce GTX 1060 6GB
Nvidia driver version: 440.64
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.5.0
[pip] torchvision==0.6.0a0+82fd1c8
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.2.89              hfd86e86_1  
[conda] mkl                       2020.1                      217  
[conda] mkl-service               2.3.0            py38he904b0f_0  
[conda] mkl_fft                   1.0.15           py38ha843d7b_0  
[conda] mkl_random                1.1.0            py38h962f231_0  
[conda] numpy                     1.18.1           py38h4f9e942_0  
[conda] numpy-base                1.18.1           py38hde5b4d6_1  
[conda] pytorch                   1.5.0           py3.8_cuda10.2.89_cudnn7.6.5_0    pytorch
[conda] torchvision               0.6.0                py38_cu102    pytorch

Process finished with exit code 0

Additional context

reference scripts, question, object detection

Source

EMCP

All 5 comments

Hi,

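The error is most probably because your machine doesn't have 8 GPUs, but you launched the script asking for 8 of them (via --nproc_per_node=8). Each launched process tries to select its own GPU, and any process whose index is beyond the available devices fails with "invalid device ordinal".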
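To pick a safe value for --nproc_per_node, one option is to ask PyTorch how many GPUs it can see and clamp the requested process count to that. The sketch below is only an illustration: clamp_nproc_per_node and visible_gpu_count are hypothetical helper names, and the only real API used is torch.cuda.device_count().

```python
def clamp_nproc_per_node(requested: int, available_gpus: int) -> int:
    """Limit the number of launched processes to the number of visible GPUs."""
    if available_gpus < 1:
        raise RuntimeError("no CUDA device is visible to this process")
    return min(requested, available_gpus)


def visible_gpu_count() -> int:
    """Number of CUDA devices PyTorch can see (0 if torch is not installed)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


gpus = visible_gpu_count()
if gpus:
    print("Use --nproc_per_node=%d" % clamp_nproc_per_node(8, gpus))
else:
    print("No CUDA device visible; the distributed command will not work as is.")
```

On a machine with a single GPU, such as the one in the environment report above, this would suggest --nproc_per_node=1.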

I would also wonder if I can improve the docs by mentioning the fact that, in order to run that example you must pip install cython, pycocotools, and matplotlib ?

It would be great if you could improve the docs of references/detection/README.md with the extra required libraries. Can you send a PR?

fmassa on 19 May 2020

Perfect, I had never used these distributed commands before. Shall I make a note of that in the .md as well? I think so.

Secondly, is it implied that we set up the COCO dataset as well? When I turn the number of GPUs down to 1, it gives me this:

 python -m torch.distributed.launch --nproc_per_node=1 --use_env train.py --dataset coco --model maskrcnn_resnet50_fpn --epochs 26 --lr-steps 16 22 --aspect-ratio-group-factor 3
| distributed init (rank 0): env://
Namespace(aspect_ratio_group_factor=3, batch_size=2, data_path='/datasets01/COCO/022719/', dataset='coco', device='cuda', dist_backend='nccl', dist_url='env://', distributed=True, epochs=26, gpu=0, lr=0.02, lr_gamma=0.1, lr_step_size=8, lr_steps=[16, 22], model='maskrcnn_resnet50_fpn', momentum=0.9, output_dir='.', pretrained=False, print_freq=20, rank=0, resume='', start_epoch=0, test_only=False, weight_decay=0.0001, workers=4, world_size=1)
Loading data
loading annotations into memory...
Traceback (most recent call last):
  File "train.py", line 201, in <module>
    main(args)
  File "train.py", line 68, in main
    dataset, num_classes = get_dataset(args.dataset, "train", get_transform(train=True), args.data_path)
  File "train.py", line 47, in get_dataset
    ds = ds_fn(p, image_set=image_set, transforms=transform)
  File "/home/emcp/Dev/git/clone_reference_torchvision/coco_utils.py", line 241, in get_coco
    dataset = CocoDetection(img_folder, ann_file, transforms=transforms)
  File "/home/emcp/Dev/git/clone_reference_torchvision/coco_utils.py", line 211, in __init__
    super(CocoDetection, self).__init__(img_folder, ann_file)
  File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torchvision/datasets/coco.py", line 98, in __init__
    self.coco = COCO(annFile)
  File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/pycocotools/coco.py", line 84, in __init__
    dataset = json.load(open(annotation_file, 'r'))
FileNotFoundError: [Errno 2] No such file or directory: '/datasets01/COCO/022719/annotations/instances_train2017.json'
Traceback (most recent call last):
  File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/emcp/anaconda3/envs/clone_reference_torchvision/lib/python3.8/site-packages/torch/distributed/launch.py", line 258, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/emcp/anaconda3/envs/clone_reference_torchvision/bin/python', '-u', 'train.py', '--dataset', 'coco', '--model', 'maskrcnn_resnet50_fpn', '--epochs', '26', '--lr-steps', '16', '22', '--aspect-ratio-group-factor', '3']' returned non-zero exit status 1.

EMCP on 19 May 2020

You need to pass the --data-path flag to the script with the location of your COCO dataset, see https://github.com/pytorch/vision/blob/e6b4078ec73c2cf4fd4432e19c782db58719fb99/references/detection/train.py#L152
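Before passing a folder to --data-path, it can help to check that it has the layout the script looks for. The sketch below is an assumption-laden illustration: the annotation path is taken from the FileNotFoundError above, the train2017 image folder name is my assumption, and missing_coco_files is a hypothetical helper, not part of the reference code.

```python
from pathlib import Path


def missing_coco_files(data_path, image_set="train"):
    """Return the expected COCO paths that do not exist under data_path."""
    root = Path(data_path)
    expected = [
        # Path seen in the traceback: <data_path>/annotations/instances_train2017.json
        root / "annotations" / ("instances_%s2017.json" % image_set),
        # Assumed image folder name; adjust if your copy of the dataset differs.
        root / ("%s2017" % image_set),
    ]
    return [str(path) for path in expected if not path.exists()]


# Example: check the default path from the Namespace printed in the log above.
for path in missing_coco_files("/datasets01/COCO/022719/"):
    print("Missing:", path)
```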

shall I make a note of that as well in the .md ? I think so...

Yes, you could mention that in the README as well.

fmassa on 19 May 2020

https://github.com/pytorch/vision/pull/2241

I will continue to run through the code in references/detection until I can replicate the expected result. Once I have it working with COCO 2017, or whichever dataset the coco option needs, I will try to graft in my custom dataset annotated in COCO style, along with my dataloader, and debug from there.

https://github.com/JRGEMCP/bootstrap-pytorch-torchvision-fasterrcnn/tree/master

The tutorial seems to imply that you will run this against the default COCO dataset, and I think many people want to see how to do it on their own data.

Or am I wrong, and you should be able to load any COCO-spec dataset from the folder passed via --data-path?

EMCP on 19 May 2020

Thanks for the PR, I merged it.

You can load any COCO-style dataset with the reference code without modifications. If you want to create your custom dataset, I would recommend checking https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html (and the Colab notebook that is linked on the top of the tutorial).

FYI, if you change the number of GPUs you train on, you also need to modify the learning rate in order to match performance. For example, if you decrease the number of GPUs by 2x (from 8 to 4), you'll also need to decrease the learning rate by 2x. This follows the standard scaling rules for multi-GPU training; see https://github.com/facebookresearch/maskrcnn-benchmark#single-gpu-training and the Detectron schedules (although we use epochs here instead of iterations, so you don't need to change the number of epochs).
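The scaling rule above can be sketched as a one-line calculation. This assumes the per-GPU batch size stays the same, takes the default lr=0.02 from the Namespace printed earlier as the 8-GPU baseline, and uses a hypothetical helper name, scale_lr.

```python
def scale_lr(base_lr: float, base_gpus: int, num_gpus: int) -> float:
    """Scale the learning rate linearly with the number of GPUs."""
    if base_gpus < 1 or num_gpus < 1:
        raise ValueError("GPU counts must be positive")
    return base_lr * num_gpus / base_gpus


# Baseline assumed from this thread: lr=0.02 for the 8-GPU command.
for gpus in (8, 4, 2, 1):
    print("%d GPU(s): learning rate %g" % (gpus, scale_lr(0.02, 8, gpus)))
```

The resulting value would then be passed to train.py through its learning-rate argument (shown as lr in the Namespace above).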

I believe I've answered your questions, and as such I'm closing the issue, but let us know if you hit problems.

fmassa on 20 May 2020
