PyTorch3D: nan in Rasterizer gradients

Created on 12 Mar 2020 · 8 Comments · Source: facebookresearch/pytorch3d

Description

I encountered a case where two vertices of a triangle had the same screen-space XY coordinates, and this led to 'nan' values in the gradient.

Some interesting things I noted:

Using the CPU instead of the GPU (a Quadro GP100 in this case) causes no issues.
Using the CPU vs. the GPU also gives a different pix2face in the forward pass of the rasterizer: the CPU run ignores the offending triangle's contribution, while the GPU run does not. I am not sure why this happens, but I suspect this is what needs to be fixed.
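
The geometric condition described above can be checked without a GPU. The sketch below is plain Python, not PyTorch3D code. It assumes the orthographic camera applies only an affine map to X and Y, so the rasterizer's own value may differ in sign and scale; `signed_area_2d` is a hypothetical helper. It shows that, in exact arithmetic, the reported triangle projects to zero screen-space area because v0 and v2 share the same XY.

```python
def signed_area_2d(a, b, c):
    """Twice the signed area of triangle (a, b, c), using only X and Y."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

# Vertices from the report; assumption: the orthographic view keeps X, Y and drops Z.
verts = [
    (0.7922, -0.1992, 6.8850),
    (0.8408, -0.1622, 6.8568),
    (0.7922, -0.1992, 6.8900),
]
xy = [(x, y) for x, y, _ in verts]

area = signed_area_2d(*xy)
print(area)  # 0.0: v0 and v2 coincide in XY, so the projection is a line segment
```

In float32, after the camera transform, the computed area may come out tiny but nonzero instead of exactly zero; this sketch does not model that step.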

Instructions To Reproduce the Issue:

import torch

import pytorch3d
import pytorch3d.renderer
import pytorch3d.structures  # used below for Meshes
import numpy as np
import matplotlib.pyplot as plt

device = torch.device("cuda:0")

# using cpu instead of gpu leads to no nan values!
# device = torch.device("cpu")

# Note that v0 and v2 of the triangle have the same x,y but different z, so it's not a degenerate case in 3D (just a triangle that is parallel to the camera's viewing direction)
# These are actual vertex values that occurred during a run
# However, my attempts to manually create a simpler vertex location that reproduced this were unsuccessful
vs = torch.Tensor([[0.7922, -0.1992,  6.8850],[0.8408, -0.1622,  6.8568],[0.7922, -0.1992,  6.89]]).to(device)
fs = torch.Tensor([[0,1,2]]).to(device)

vs.requires_grad = True

meshes = pytorch3d.structures.Meshes([vs],[fs])
cameras = pytorch3d.renderer.OpenGLOrthographicCameras(znear=0, zfar=1, device=device)

blend_params = pytorch3d.renderer.BlendParams(sigma=1e-4, gamma=1e-4)
mask_raster_settings = pytorch3d.renderer.RasterizationSettings(
    image_size=256, 
    blur_radius=np.log(1. / 1e-4 - 1.) * blend_params.sigma, 
    faces_per_pixel=20,
    bin_size=0
)
mask_rasterizer = pytorch3d.renderer.MeshRasterizer(
    cameras=cameras, 
    raster_settings=mask_raster_settings
)
mask_shader = pytorch3d.renderer.SoftSilhouetteShader(blend_params=blend_params)
mask_renderer = pytorch3d.renderer.MeshRenderer(mask_rasterizer, mask_shader)

img_mask = mask_renderer(meshes)

img_mask[0,:,:,3].mean().backward()
print(vs.grad)

plt.imshow(img_mask[0,:,:,3].detach().cpu().numpy())
plt.show()

tensor([[nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan]], device='cuda:0')

[Image: matplotlib plot of the rendered silhouette mask]

## debugging via checking barycentric coords
pix2face, _, barycentric_coords, _ = mask_rasterizer(meshes)
(pix2face == 0).nonzero()[0]

tensor([  0, 147,  17,   0], device='cuda:0')

print(barycentric_coords[0, 147, 17, 0])

tensor([-3.5279e+26,  0.0000e+00,  3.5279e+26], device='cuda:0',
       grad_fn=<SelectBackward>)
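
The pattern in these values (first and third entries equal and opposite, middle entry zero) is what one would expect when barycentric coordinates are formed by dividing edge functions by a near-zero face area. The following is a simplified model in plain Python, not the library's kernel. The `edge` helper, the pixel location `p`, and the tiny constant `k_epsilon` added to the denominator are all assumptions made for illustration.

```python
def edge(p, a, b):
    """2D cross product of (p - a) and (b - a); zero when p is on line a-b."""
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])

# Screen-space XY of the reported triangle (Z dropped); v0 and v2 coincide.
v0, v1, v2 = (0.7922, -0.1992), (0.8408, -0.1622), (0.7922, -0.1992)
p = (0.80, -0.10)    # hypothetical pixel center, chosen off the v0-v1 line
k_epsilon = 1e-30    # assumed tiny constant guarding the division

area = edge(v2, v0, v1)          # exactly 0.0 because v2 == v0 in XY
denom = area + k_epsilon

w0 = edge(p, v1, v2) / denom     # huge magnitude
w1 = edge(p, v2, v0) / denom     # zero: the edge v2-v0 has zero length
w2 = edge(p, v0, v1) / denom     # same magnitude as w0, opposite sign
print(w0, w1, w2)
```

With a denominator this small, the weights reach magnitudes around 1e27, and gradients that flow through them can overflow to inf or nan.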

bug

shubhtuls

👍2

Most helpful comment

I think this is related to precision issues when computing the face areas. Adding a statement to print 'face_area' in the CUDA rasterizer implementation shows that it is on the order of 1e-9, which is greater than the kEpsilon=1e-30 used to check for zero area, so the face is not treated as having zero area.

I unblocked on my end by additionally defining a 'kEpsilonFace=1e-7' in these lines and using that for the zero area check, but I'm not sure if this is the ideal solution.
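
As a rough illustration of why the larger threshold helps, here is a plain-Python sketch of a zero-area test. The function name `is_zero_area` and the symmetric comparison are assumptions, not the code in the CUDA kernel, and the area 1e-9 is a stand-in for the "order of 1e-9" value mentioned above.

```python
def is_zero_area(face_area, eps):
    """Treat a face as degenerate when its signed area lies within [-eps, eps]."""
    return -eps <= face_area <= eps

face_area = 1e-9  # stand-in for the value printed on the GPU

print(is_zero_area(face_area, 1e-30))  # False: the face passes the check and is kept
print(is_zero_area(face_area, 1e-7))   # True: the face is treated as zero-area
```

The trade-off is that genuinely tiny but valid faces below the threshold are skipped as well.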

shubhtuls on 12 Mar 2020

👍4

All 8 comments

Update: actually, even simpler vertex locations can reproduce the error, e.g.

vs = torch.Tensor([[0., 0.,  1.0],[0.2, 0.2,  2.0],[0., 0.,  3.0]]).to(device)

shubhtuls on 12 Mar 2020

@shubhtuls thanks for the detailed explanation of the issue. I will try to reproduce the error as described and get back to you!

nikhilaravi on 12 Mar 2020

Small face areas are such a headache! I guess both of the example faces above have almost zero area. Did the nans disappear with 1e-7?

gkioxari on 13 Mar 2020

👍1

@gkioxari - so far, yes!

shubhtuls on 13 Mar 2020

👍1

> @gkioxari - so far, yes!

Thanks for your solution. Am I right in understanding that it aims to avoid the gradient problem by applying a stricter threshold to each face's area?

tomguluson92 on 8 Apr 2020

@tomguluson92 In a follow-up diff we are setting the kEpsilon value to 1e-8. Yes, the issue is small face areas; whether a face counts as zero-area is determined by that value.

gkioxari on 8 Apr 2020

This has been fixed by https://github.com/facebookresearch/pytorch3d/commit/487d4d6607a60a8be7135b334137985f40953a92.

nikhilaravi on 24 Apr 2020
