I encountered a case where two vertices of a triangle had the same screen-space XY coordinates, and this led to 'nan' values in the gradient.
Some interesting things I noted:
import torch
import pytorch3d
import pytorch3d.renderer
import numpy as np
import matplotlib.pyplot as plt
device = torch.device("cuda:0")
# using cpu instead of gpu leads to no nan values!
# device = torch.device("cpu")
# Note that v0 and v2 of the triangle have the same x,y but different z, so it's not degenerate in 3D (just a triangle seen edge-on by the camera)
# These are actual vertex values that occurred during a run
# However, my attempts to manually create a simpler vertex location that reproduced this were unsuccessful
vs = torch.Tensor([[0.7922, -0.1992, 6.8850],[0.8408, -0.1622, 6.8568],[0.7922, -0.1992, 6.89]]).to(device)
fs = torch.Tensor([[0,1,2]]).to(device)
vs.requires_grad = True
meshes = pytorch3d.structures.Meshes([vs],[fs])
cameras = pytorch3d.renderer.OpenGLOrthographicCameras(znear=0, zfar=1, device=device)
blend_params = pytorch3d.renderer.BlendParams(sigma=1e-4, gamma=1e-4)
mask_raster_settings = pytorch3d.renderer.RasterizationSettings(
    image_size=256,
    blur_radius=np.log(1. / 1e-4 - 1.) * blend_params.sigma,
    faces_per_pixel=20,
    bin_size=0
)
mask_rasterizer = pytorch3d.renderer.MeshRasterizer(
    cameras=cameras,
    raster_settings=mask_raster_settings
)
mask_shader = pytorch3d.renderer.SoftSilhouetteShader(blend_params=blend_params)
mask_renderer = pytorch3d.renderer.MeshRenderer(mask_rasterizer, mask_shader)
img_mask = mask_renderer(meshes)
img_mask[0,:,:,3].mean().backward()
print(vs.grad)
plt.imshow(img_mask[0,:,:,3].detach().cpu().numpy())
plt.show()
tensor([[nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan]], device='cuda:0')

## debugging via checking barycentric coords
pix2face, _, barycentric_coords, _ = mask_rasterizer(meshes)
(pix2face == 0).nonzero()[0]
tensor([ 0, 147, 17, 0], device='cuda:0')
print(barycentric_coords[0, 147, 17, 0])
tensor([-3.5279e+26, 0.0000e+00, 3.5279e+26], device='cuda:0',
       grad_fn=<SelectBackward>)
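The pattern above (two huge weights of opposite sign and one exact zero) is what you would expect if barycentric weights are computed as 2D edge functions divided by the face area plus a tiny epsilon. That formula is an assumption about the implementation, not something confirmed in this thread. Below is a minimal pure-Python sketch under that assumption; `edge_function`, `barycentric`, and the pixel location `p` are hypothetical names and values chosen for illustration.

```python
def edge_function(p, v0, v1):
    """Twice the signed area of the 2D triangle (v0, v1, p)."""
    return (p[0] - v0[0]) * (v1[1] - v0[1]) - (p[1] - v0[1]) * (v1[0] - v0[0])


def barycentric(p, v0, v1, v2, eps=1e-30):
    """Barycentric weights of p, assuming the area is only padded by eps."""
    area = edge_function(v2, v0, v1) + eps  # exactly eps when v0 == v2 in XY
    w0 = edge_function(p, v1, v2) / area
    w1 = edge_function(p, v2, v0) / area
    w2 = edge_function(p, v0, v1) / area
    return w0, w1, w2


# Screen-space XY of the simpler triangle from this thread: v0 and v2 coincide.
v0 = (0.0, 0.0)
v1 = (0.2, 0.2)
v2 = (0.0, 0.0)
p = (0.1, 0.1005)  # hypothetical pixel centre just off the projected segment

w0, w1, w2 = barycentric(p, v0, v1, v2)
print(w0, w1, w2)  # w0 and w2 are enormous and opposite in sign, w1 is zero
```

With coincident XY coordinates the edge function for `w1` is exactly zero, while the other two are equal in magnitude and opposite in sign, so dividing by an area of about `eps` yields values like the ones printed above.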
Update: actually, even a simpler set of vertex locations reproduces the error, e.g.:
vs = torch.Tensor([[0., 0., 1.0],[0.2, 0.2, 2.0],[0., 0., 3.0]]).to(device)
@shubhtuls thanks for the detailed explanation of the issue. I will try to reproduce the error as described and get back to you!
I think this is related to some precision issues when computing the face areas. Adding a statement to print the 'face_area' in the CUDA rasterizer implementation here shows that it is of the order of 1e-9, which is greater than the kEpsilon=1e-30 used to check for zero area, so the face is not treated as having zero area.
I unblocked on my end by additionally defining a 'kEpsilonFace=1e-7' in these lines and using that for the zero-area check, but I'm not sure if this is the ideal solution.
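To illustrate why the threshold matters, here is a minimal pure-Python sketch of a zero-area test. It assumes the check has the form `abs(area) <= eps`, and it uses a made-up XY offset of 3e-8 on `v2` to stand in for whatever rounding produced the ~1e-9 area reported above. The names `edge_function` and `is_zero_area` are hypothetical and are not taken from the rasterizer source.

```python
K_EPSILON = 1e-30      # original threshold quoted in this thread
K_EPSILON_FACE = 1e-7  # workaround threshold quoted in this thread


def edge_function(p, v0, v1):
    """Twice the signed area of the 2D triangle (v0, v1, p)."""
    return (p[0] - v0[0]) * (v1[1] - v0[1]) - (p[1] - v0[1]) * (v1[0] - v0[0])


def is_zero_area(v0, v1, v2, eps):
    """Treat the face as degenerate when its screen-space area is within eps of 0."""
    return abs(edge_function(v2, v0, v1)) <= eps


# XY of the triangle from the report; the 3e-8 offset on v2 is an assumed
# stand-in for floating-point error, not a measured value.
v0 = (0.7922, -0.1992)
v1 = (0.8408, -0.1622)
v2 = (0.7922 + 3e-8, -0.1992)

area = edge_function(v2, v0, v1)
print(area)                                      # on the order of 1e-9
print(is_zero_area(v0, v1, v2, K_EPSILON))       # False: face is kept
print(is_zero_area(v0, v1, v2, K_EPSILON_FACE))  # True: face is skipped
```

Under these assumptions, an area of about 1e-9 passes the 1e-30 check and goes on to be used as a divisor, whereas the looser 1e-7 threshold rejects the face before any division happens.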
Small face areas are such a headache! I guess both example faces above have almost zero face area. Did the nans disappear with 1e-7?
@gkioxari - so far, yes!
Thanks for your solution. Do I understand correctly that it aims to avoid the vanishing-gradient problem by applying a stricter threshold to each face's area?
@tomguluson92 In a follow-up diff we are setting the kEpsilon value to 1e-8. Yes, the issue is the small face areas, which are detected based on that value.