Hi,
Thanks for taking the time to answer questions. My model is a simple fully connected point cloud generator followed by a PointsRenderer. The generated point clouds are rendered with fixed parameters, and the loss is then computed against a reference image using MSE or L1 loss. After some iterations, PointsRenderer returns NaN values during the backward pass even though the loss itself is not NaN.
Iteration 20
Loss : 0.06443136185407639
Mean grad of the last layer : 2.3488732949772384e-06
Iteration 21
Loss : 0.06735570728778839
Mean grad of the last layer : 1.2177332564533572e-06
Iteration 22
Loss : 0.06658419221639633
Mean grad of the last layer : nan
Iteration 23
Loss : 0.16023781895637512
Mean grad of the last layer : nan
When I use anomaly detection, it produces the following output:
RuntimeError: Function '_CompositeAlphaPointsBackward' returned nan values in its 1th output
NormWeightedCompositor returns 0 instead of NaN. What are the possible reasons that PointsRenderer returns NaN values during backward pass?
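For readers unfamiliar with this failure mode, here is a minimal sketch in plain PyTorch (no PyTorch3D) of how a finite loss can still produce NaN gradients, and how anomaly detection names the backward function responsible. The sqrt-times-zero expression and both function names are stand-ins chosen for illustration; this is not what the renderer computes.

```python
import torch


def nan_in_backward_demo():
    """Finite forward value, NaN gradient."""
    # At x = 0 the derivative of sqrt is infinite. Multiplying by 0 keeps the
    # forward value finite, but the backward pass evaluates 0 * inf = NaN.
    x = torch.zeros(1, requires_grad=True)
    loss = (torch.sqrt(x) * 0.0).sum()
    loss.backward()
    return loss.item(), x.grad


def find_nan_source():
    """Return the anomaly-detection message naming the offending backward op."""
    x = torch.zeros(1, requires_grad=True)
    with torch.autograd.detect_anomaly():
        loss = (torch.sqrt(x) * 0.0).sum()
        try:
            loss.backward()
        except RuntimeError as err:
            return str(err)
    return None


if __name__ == "__main__":
    print(nan_in_backward_demo())
    print(find_nan_source())
```

Anomaly detection slows training noticeably, so it is usually enabled only while hunting for the source of the NaN.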
Can you check if any of the output points from the pointcloud generator are NaN? What value are you using for the radius in the raster_settings for the PointsRasterizer?
Can you provide a minimal script showing the setup and the settings you are using?
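A check along these lines could look like the sketch below. The helper names `assert_finite` and `non_finite_grads` are hypothetical and not part of PyTorch or PyTorch3D; they only wrap `torch.isfinite`.

```python
import torch
from torch import nn


def assert_finite(name, tensor):
    """Raise ValueError if the tensor contains NaN or +/-inf."""
    finite = torch.isfinite(tensor)
    if not finite.all():
        bad = (~finite).sum().item()
        raise ValueError(f"{name}: {bad} non-finite value(s)")


def non_finite_grads(model):
    """Names of parameters whose gradients contain NaN or inf."""
    return [
        name
        for name, param in model.named_parameters()
        if param.grad is not None and not torch.isfinite(param.grad).all()
    ]


if __name__ == "__main__":
    layer = nn.Linear(2, 1)
    layer(torch.ones(1, 2)).sum().backward()
    assert_finite("input", torch.ones(1, 2))
    print(non_finite_grads(layer))  # empty list when all gradients are finite
```

Calling `assert_finite` on the generated points before rendering, and `non_finite_grads` after `loss.backward()`, narrows down whether the NaN originates in the generator or in the renderer's backward pass.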
I checked: there are no NaN values anywhere before the backward pass. This is the relevant code:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from pytorch3d.renderer import (
    AlphaCompositor,
    OpenGLPerspectiveCameras,
    PointsRasterizationSettings,
    PointsRasterizer,
    PointsRenderer,
    look_at_view_transform,
)
from pytorch3d.structures import Pointclouds

class PCGenerator(nn.Module):
    def __init__(self, latent_size, device):
        super(PCGenerator, self).__init__()
        self.latent_size = latent_size
        self.device = device
        self.dec1 = nn.Linear(self.latent_size, 256)
        self.dec2 = nn.Linear(256, 256)
        self.dec3 = nn.Linear(256, 1024 * 3)
        self.raster_settings = PointsRasterizationSettings(image_size=64, radius=0.06, points_per_pixel=8)
        self.cameras = OpenGLPerspectiveCameras(device=self.device)
        self.compositor = AlphaCompositor()
        self.raster = PointsRasterizer(self.cameras, self.raster_settings)
        self.renderer = PointsRenderer(self.raster, self.compositor)
        # distance = 1.75, elevation = 30.0, azimuth = 45.0
        self.at = torch.from_numpy(np.array([0.5, 0.45, 0.5], dtype=np.float32)).unsqueeze(0).to(self.device)
        self.R, self.T = look_at_view_transform(1.75, 30.0, 45.0, at=self.at, device=self.device)

    def render(self, PC):
        self.PCs = Pointclouds(points=PC, features=torch.ones_like(PC, requires_grad=True, device=self.device))
        self.rendered = self.renderer(self.PCs, R=self.R, T=self.T, device=self.device)
        return self.rendered.squeeze()

    def generatePC(self, x):
        x = F.relu(self.dec1(x))
        x = F.relu(self.dec2(x))
        x = self.dec3(x)
        x = torch.sigmoid(x)  # keep coordinates in (0, 1)
        return x.view(-1, 1024, 3)

    def forward(self, x):
        return self.render(self.generatePC(x))

def train_epoch():
    for i, (ims, pcs) in enumerate(train_loader):
        PCfeatures = pcs.to(device)  # Point cloud features to generate point clouds
        ref_ims = ims.to(device)  # Reference images
        optimizer.zero_grad()
        renderedPC = PCGenerator(PCfeatures)
        loss = loss_function(ref_ims, renderedPC)  # both shapes [N,64,64,3]
        print("Iteration ", i)
        print("Loss : ", loss.item())
        loss.backward()
        print("Mean grad of the last layer : ", torch.mean(PCGenerator.dec3.weight.grad).item())
        optimizer.step()
We have found the culprit.
You are seeing NaNs when alpha_tvalue is 1.0, which causes a division by 0.0. A diff will be submitted to fix this issue!
I see. Thanks for the response. I was also wondering why NormWeightedCompositor returns 0 instead of NaN. Is it because of the following line?
https://github.com/facebookresearch/pytorch3d/blob/b4fd9d1d34b96eebbf8954abc1d5a4e3b8a91f11/pytorch3d/csrc/compositing/norm_weighted_sum.cu#L143
The norm-weighted implementation is different because it performs a different kind of compositing, so an alpha value of 1 does not lead to NaNs there. The issue with alpha compositing was that an alpha value of 1 leads to a division by 0, even though the numerator is already 0. Norm-weighted compositing handles small alpha values in a different way, as you point out.
Here is the fix: https://github.com/facebookresearch/pytorch3d/commit/d689baac5ede7be237645518d1b0575f93ac1ceb.
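To make the 0/0 concrete, here is a simplified pure-Python sketch of front-to-back alpha compositing for a single pixel and channel, with the gradient with respect to each alpha written using a division by (1 - alpha_t). It is an illustration under stated assumptions, not the library's kernel: the function names, the `ieee_div` helper (Python raises on float division by zero, so the helper mimics IEEE 754 behavior), and the `eps` guard are all assumptions, and the `eps` value is not taken from the linked commit.

```python
import math


def ieee_div(num, den):
    """Divide like IEEE 754 hardware: 0/0 gives NaN, x/0 gives +/-inf."""
    if den == 0.0:
        return math.nan if num == 0.0 else math.copysign(math.inf, num)
    return num / den


def alpha_composite(alphas, feats):
    """out = sum_k feats[k] * alphas[k] * prod_{j<k} (1 - alphas[j])."""
    out, cum = 0.0, 1.0
    for a, f in zip(alphas, feats):
        out += cum * a * f
        cum *= 1.0 - a
    return out


def grad_alphas(alphas, feats, eps=0.0):
    """d(out)/d(alphas[t]) for every t, using a division by (1 - alphas[t])."""
    grads = [0.0] * len(alphas)
    cum = 1.0  # product of (1 - alpha_j) over the points in front of point k
    for k, (a_k, f_k) in enumerate(zip(alphas, feats)):
        grads[k] += cum * f_k  # effect of point k's own alpha
        w_k = cum * a_k        # compositing weight of point k
        for t in range(k):
            # Every point t in front of k attenuates k by (1 - alpha_t).
            grads[t] -= ieee_div(w_k * f_k, 1.0 - alphas[t] + eps)
        cum *= 1.0 - a_k
    return grads


if __name__ == "__main__":
    alphas, feats = [0.5, 1.0, 0.3], [1.0, 1.0, 1.0]
    print(alpha_composite(alphas, feats))    # forward value is finite
    print(grad_alphas(alphas, feats))        # NaN for the point with alpha == 1
    print(grad_alphas(alphas, feats, 1e-4))  # finite once the denominator is guarded
```

With the second alpha equal to 1, the weight of the third point is already 0, so the term for the second point becomes 0 / 0. Guarding the denominator keeps every gradient finite.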
I am closing this issue, but feel free to re-open it if you encounter more issues.