The FasterRCNN model (and, more generally, the GeneralizedRCNN class) expects its input images as a list of float PyTorch tensors. However, if you pass it a list of tensors with dtype torch.uint8, the normalization step produces NaN values, which then propagate to the loss computation.
Steps to reproduce the behavior:
1. Create an image tensor with dtype torch.uint8, along with its corresponding target dictionary.
2. Instantiate FasterRCNN and pass that image to the model.
3. The returned losses contain NaN values.

I would have expected the model to throw an exception or at least a warning. In particular, since the GeneralizedRCNN class takes care of transformations such as normalization and resizing, in my opinion it should also check the dtype of the input images, in order to avoid such errors.
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 10.15.7 (x86_64)
GCC version: Could not collect
Clang version: 12.0.0 (clang-1200.0.32.28)
CMake version: version 3.18.4
Python version: 3.8 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] torch==1.7.1
[pip3] torchvision==0.8.2
[conda] Could not collect
I realized that the error I was facing is caused by the normalize function of the GeneralizedRCNNTransform class, which casts the mean and standard deviation lists to the image's dtype when converting them to tensors. With a uint8 image, the default ImageNet mean/std values (all below 1) are truncated to zero, so the subsequent division produces NaN/inf values.
def normalize(self, image):
    dtype, device = image.dtype, image.device
    mean = torch.as_tensor(self.image_mean, dtype=dtype, device=device)
    std = torch.as_tensor(self.image_std, dtype=dtype, device=device)
    return (image - mean[:, None, None]) / std[:, None, None]
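The failure mode is easy to see in isolation: casting the default ImageNet statistics (all values below 1) to torch.uint8 truncates them to zero, and the division then produces inf/NaN. A minimal sketch:

```python
import torch

# Default ImageNet statistics used by GeneralizedRCNNTransform
image_mean = [0.485, 0.456, 0.406]
image_std = [0.229, 0.224, 0.225]

image = torch.full((3, 2, 2), 255, dtype=torch.uint8)  # uint8 input image

# Casting values < 1 to uint8 truncates them to 0
mean = torch.as_tensor(image_mean, dtype=image.dtype)
std = torch.as_tensor(image_std, dtype=image.dtype)
print(mean)  # tensor([0, 0, 0], dtype=torch.uint8)

# (image - 0) / 0 is promoted to float and yields inf everywhere
out = (image - mean[:, None, None]) / std[:, None, None]
print(out.isinf().all())  # tensor(True)
```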
To avoid this problem, a simple image.float() would suffice.
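As a sketch, that cast could look like the following (written as a standalone function with hardcoded ImageNet statistics for illustration; note that this alone does not address mean/std values given on the 0-255 scale):

```python
import torch

# ImageNet statistics, hardcoded here for illustration
IMAGE_MEAN = [0.485, 0.456, 0.406]
IMAGE_STD = [0.229, 0.224, 0.225]

def normalize(image):
    # Cast integer images to float first, so the mean/std values
    # are not truncated to zero by an integer-dtype conversion.
    if not image.is_floating_point():
        image = image.float()
    dtype, device = image.dtype, image.device
    mean = torch.as_tensor(IMAGE_MEAN, dtype=dtype, device=device)
    std = torch.as_tensor(IMAGE_STD, dtype=dtype, device=device)
    return (image - mean[:, None, None]) / std[:, None, None]

out = normalize(torch.full((3, 2, 2), 128, dtype=torch.uint8))
print(out.isfinite().all())  # tensor(True)
```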
@Wadaboa Thanks for reporting.
I think the current implementation expects that you have already converted the image to 0-1 scale, using one of the other transforms:
https://github.com/pytorch/vision/blob/3d60f498e71ba63b428edb184c9ac38fa3737fa6/references/detection/train.py#L51-L53
Concerning the code on normalize, I think the idea of casting the mean/std to the dtype of the image is not correct:
https://github.com/pytorch/vision/blob/3d60f498e71ba63b428edb184c9ac38fa3737fa6/torchvision/models/detection/transform.py#L120-L124
The above works only if the image is of floating type; if it's not, the whole thing fails, as you pointed out. I think this needs to be fixed one way or another:
Given that the GeneralizedRCNNTransform receives as arguments mean/std which can be in 0-255 scale, I think option 2 is best. Do you mind sending us a PR that fixes the issue? Please add a unit-test that shows that the problem is resolved.
cc @fmassa
Great, thank you for the clarification! Tomorrow I'll work on the PR and send it right away!
Solved in PR #3266.