Apex: amp torchvision detection

Created on 18 Aug 2019 · 4 Comments · Source: NVIDIA/apex

I am trying to use apex and fp16 with torchvision detection models.

The model is initialised as:

import torchvision.models.detection as models
model = models.__dict__['fasterrcnn_resnet50_fpn'](pretrained=False)
model.to(device)

I followed the guide using the lines:

if args.fp16:
    model, optimizer = amp.initialize(model, optimizer)

and

if fp16:
    with amp.scale_loss(losses, optimizer) as scaled_loss:
        scaled_loss.backward()

in order to use fp16, but I get:

Traceback (most recent call last):
  File "train_detection_basic.py", line 221, in <module>
    main(args)
  File "train_detection_basic.py", line 187, in main
    batch_loop(model, optimizer, train_loader, device, epoch, args.fp16)
  File "train_detection_basic.py", line 63, in batch_loop
    loss_dict = model(images_l, target_l)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torchvision-0.4.0a0+d31eafa-py3.7-linux-x86_64.egg/torchvision/models/detection/generalized_rcnn.py", line 52, in forward
    detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torchvision-0.4.0a0+d31eafa-py3.7-linux-x86_64.egg/torchvision/models/detection/roi_heads.py", line 540, in forward
    box_features = self.box_roi_pool(features, proposals, image_shapes)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torchvision-0.4.0a0+d31eafa-py3.7-linux-x86_64.egg/torchvision/ops/poolers.py", line 163, in forward
    spatial_scale=scale, sampling_ratio=self.sampling_ratio
  File "/opt/conda/lib/python3.7/site-packages/torchvision-0.4.0a0+d31eafa-py3.7-linux-x86_64.egg/torchvision/ops/roi_align.py", line 69, in roi_align
    return _RoIAlignFunction.apply(input, rois, output_size, spatial_scale, sampling_ratio)
  File "/opt/conda/lib/python3.7/site-packages/torchvision-0.4.0a0+d31eafa-py3.7-linux-x86_64.egg/torchvision/ops/roi_align.py", line 24, in forward
    output_size[0], output_size[1], sampling_ratio)
RuntimeError: Expected tensor for argument #1 'input' to have the same type as tensor for argument #2 'rois'; but type Variable[CUDAHalfType] does not equal Variable[CUDAFloatType] (while checking arguments for ROIAlign_forward_cuda) (checkSameType at /opt/pytorch/aten/src/ATen/TensorUtils.cpp:140)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6c (0x7f3517aababc in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::checkSameType(char const*, at::TensorArg const&, at::TensorArg const&) + 0x458 (0x7f34d9580608 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: at::checkAllSameType(char const*, c10::ArrayRef<at::TensorArg>) + 0x3e (0x7f34d95807de in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: ROIAlign_forward_cuda(at::Tensor const&, at::Tensor const&, float, int, int, int) + 0x44d (0x7f350d85eae6 in /home/jovyan/.cache/Python-Eggs/torchvision-0.4.0a0+d31eafa-py3.7-linux-x86_64.egg-tmp/torchvision/_C.cpython-37m-x86_64-linux-gnu.so)
frame #4: ROIAlign_forward(at::Tensor const&, at::Tensor const&, float, int, int, int) + 0x75 (0x7f350d8127a2 in /home/jovyan/.cache/Python-Eggs/torchvision-0.4.0a0+d31eafa-py3.7-linux-x86_64.egg-tmp/torchvision/_C.cpython-37m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x63641 (0x7f350d837641 in /home/jovyan/.cache/Python-Eggs/torchvision-0.4.0a0+d31eafa-py3.7-linux-x86_64.egg-tmp/torchvision/_C.cpython-37m-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x603c7 (0x7f350d8343c7 in /home/jovyan/.cache/Python-Eggs/torchvision-0.4.0a0+d31eafa-py3.7-linux-x86_64.egg-tmp/torchvision/_C.cpython-37m-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x5b832 (0x7f350d82f832 in /home/jovyan/.cache/Python-Eggs/torchvision-0.4.0a0+d31eafa-py3.7-linux-x86_64.egg-tmp/torchvision/_C.cpython-37m-x86_64-linux-gnu.so)
frame #8: <unknown function> + 0x5baea (0x7f350d82faea in /home/jovyan/.cache/Python-Eggs/torchvision-0.4.0a0+d31eafa-py3.7-linux-x86_64.egg-tmp/torchvision/_C.cpython-37m-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x4b093 (0x7f350d81f093 in /home/jovyan/.cache/Python-Eggs/torchvision-0.4.0a0+d31eafa-py3.7-linux-x86_64.egg-tmp/torchvision/_C.cpython-37m-x86_64-linux-gnu.so)
<omitting python frames>
frame #14: THPFunction_apply(_object*, _object*) + 0xa2a (0x7f351a344ada in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

Any ideas?

Data-drone

All 4 comments

Hi @Data-drone,

thanks for raising this issue!
We are currently working on implementing amp for the detection models.
However, we are seeing this issue:

RuntimeError: expected device cuda:0 and dtype Float but got device cuda:0 and dtype Half
The above operation failed in interpreter, with the following stack trace:
at /home/ptrblck/anaconda3/envs/apex/lib/python3.7/site-packages/torchvision-0.5.0a0+19315e3-py3.7-linux-x86_64.egg/torchvision/ops/misc.py:158:15
    def forward(self, x):
        # move reshapes to the beginning
        # to make it fuser-friendly
        w = self.weight.reshape(1, -1, 1, 1)
        b = self.bias.reshape(1, -1, 1, 1)
        rv = self.running_var.reshape(1, -1, 1, 1)
        rm = self.running_mean.reshape(1, -1, 1, 1)
        scale = w * rv.rsqrt()
        bias = b - rm * scale
        return x * scale + bias
               ~~~~~~~~~ <--- HERE

So your issue might be unrelated to this one.
Could you post a reproducible code snippet?
I'm currently running this code:

data = data.to(device, non_blocking=True)
batch_size = data.size(0)
optimizer.zero_grad()
target = {'boxes': torch.randn(1, 4),
          'labels': torch.randint(0, 91, (1,)),
          'image_id': torch.tensor(0),
          'area': torch.randn(1),
          'iscrowd': torch.tensor(0, dtype=torch.uint8)
}
targets = [target] * batch_size
loss = model(data, targets)

ptrblck on 19 Aug 2019

My full code repo is here: https://github.com/Data-drone/cv_experiments. I am running ./run_detect_train.sh, and all the code runs in a Docker container: https://cloud.docker.com/u/datadrone/repository/docker/datadrone/deeplearn_pytorch. I did have to rebuild the torchvision library from source over the one installed in the container, as it didn't compile with CUDA properly.

I think I encountered the runtime error that you got as well. If you look at lines 46 onwards in train_detection_basic.py, you can see the workaround I found.

Data-drone on 19 Aug 2019

I got the same problem. A quick fix could be:

model.roi_heads.box_roi_pool.forward = \
    amp.half_function(model.roi_heads.box_roi_pool.forward)
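
For context, here is a minimal sketch of the idea behind this fix, written in plain PyTorch rather than with apex: wrap a function so that every floating-point tensor argument is cast to half precision before the call, so that `input` and `rois` arrive with the same dtype. The names `to_half`, `cast_args_to_half`, and `strict_op` are hypothetical and only for illustration; this is not apex's implementation, and the toy op merely imitates the dtype check that ROIAlign performs.

```python
import functools

import torch


def to_half(obj):
    """Recursively cast floating-point tensors inside obj to float16."""
    if torch.is_tensor(obj):
        # Integer tensors (e.g. labels) are left untouched.
        return obj.half() if obj.is_floating_point() else obj
    if isinstance(obj, dict):
        return {k: to_half(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_half(v) for v in obj)
    return obj


def cast_args_to_half(fn):
    """Hypothetical stand-in for the idea behind amp.half_function."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*to_half(args), **to_half(kwargs))
    return wrapper


def strict_op(features, rois):
    """Toy op that, like the ROIAlign check in the traceback, rejects mixed dtypes."""
    if features.dtype != rois.dtype:
        raise RuntimeError(f"dtype mismatch: {features.dtype} vs {rois.dtype}")
    return features.dtype


features = torch.zeros(1, 4, 8, 8).half()          # fp16 activations
rois = torch.tensor([[0.0, 0.0, 0.0, 4.0, 4.0]])   # fp32 boxes

wrapped_op = cast_args_to_half(strict_op)
result_dtype = wrapped_op(features, rois)  # both arguments arrive as float16
```

Calling `strict_op(features, rois)` directly raises the mismatch error, while the wrapped version does not, which appears to be the same effect the one-line patch above has on `box_roi_pool.forward`.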

DeanChan on 29 Aug 2019

👍4

Thanks, that worked for me for now.

Data-drone on 7 Sep 2019
