PyTorch3D: CUDA runtime error during mesh optimization

Created on 15 Feb 2020  ·  10 Comments  ·  Source: facebookresearch/pytorch3d

Hi,

First of all, fantastic work, I love it :)
I'm modifying the mesh deformation code to learn a mesh from 2D views. The optimization starts, but after about 100 optimization steps the code crashes with the following error (the log was produced with CUDA_LAUNCH_BLOCKING=1).

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-17-d220a60da988> in <module>
     29     # We sample 5k points from the surface of each mesh
     30     sample_trg = sample_points_from_meshes(trg_mesh, 5000)
---> 31     sample_src = sample_points_from_meshes(new_src_mesh, 5000)
     32 

~/coding/python/pytorch3d/env/lib/python3.7/site-packages/pytorch3d/ops/sample_points_from_meshes.py in sample_points_from_meshes(meshes, num_samples, return_normals)
     64             num_samples, replacement=True
     65         )  # (N, num_samples)
---> 66         sample_face_idxs += mesh_to_face[meshes.valid].view(num_valid_meshes, 1)
     67 
     68     # Get the vertex coordinates of the sampled faces.

RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/THCGeneral.cpp:313

The error could be related to the other issue here, but I'm not sure it has the same root cause. It seems to happen at different points in the optimization process depending on parameters such as the loss weights and the optimizer used.

Python : 3.7.5
PyTorch : 1.4.0
CUDA : 10.2
Ubuntu : 19.10

Thanks!

bug question

All 10 comments

This is unrelated to the other issue. A device-side assert is triggered, for example, when invalid indices are used to index into a tensor.

Before sample_src = sample_points_from_meshes(new_src_mesh, 5000) can you add a print statement to check new_src_mesh.valid and new_src_mesh.mesh_to_faces_packed_first_idx()?
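
As a rough illustration of how an invalid index can be caught on the CPU with a readable message instead of a device-side assert, here is a minimal sketch. It assumes the vertices and faces are available as plain packed tensors (for example from verts_packed() and faces_packed()); check_face_indices is a hypothetical helper, not part of PyTorch3D.

```python
import torch


def check_face_indices(verts: torch.Tensor, faces: torch.Tensor) -> None:
    """Raise IndexError if `faces` refers to a vertex that does not exist.

    verts: (V, 3) float tensor, faces: (F, 3) integer tensor.
    Hypothetical helper for debugging; not part of PyTorch3D.
    """
    faces_cpu = faces.detach().cpu()  # do the comparison on the CPU
    num_verts = verts.shape[0]
    if faces_cpu.numel() == 0:
        return
    lo, hi = int(faces_cpu.min()), int(faces_cpu.max())
    if lo < 0 or hi >= num_verts:
        raise IndexError(
            f"face indices span [{lo}, {hi}] but there are only {num_verts} vertices"
        )


# Stand-in data: four vertices, two triangles.
verts = torch.zeros(4, 3)
good_faces = torch.tensor([[0, 1, 2], [0, 2, 3]])
bad_faces = torch.tensor([[0, 1, 2], [0, 2, 4]])  # index 4 is out of range

check_face_indices(verts, good_faces)  # passes silently

try:
    check_face_indices(verts, bad_faces)
    bad_detected = False
except IndexError:
    bad_detected = True
```

Calling such a check right before the failing line would show whether the indices themselves are the problem or whether something else triggers the assert.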

Thanks for your quick answer.
The output remains the same throughout optimization, even just before the sampling that fails:

tensor([True]) # new_src_mesh.valid
tensor([0]) # new_src_mesh.mesh_to_faces_packed_first_idx()

Something strange: if I print new_src_mesh.valid just before the sampling call, it works fine, but if I add the prints in the source code just before the line that triggers the device-side assert, it fails:

~/coding/python/pytorch3d/env/lib/python3.7/site-packages/pytorch3d/ops/sample_points_from_meshes.py in sample_points_from_meshes(meshes, num_samples, return_normals)
     64             num_samples, replacement=True
     65         )  # (N, num_samples)
---> 66         print(mesh_to_face.cpu())
     67         print(meshes.valid.cpu())
     68         sample_face_idxs += mesh_to_face[meshes.valid].view(num_valid_meshes, 1)

RuntimeError: CUDA error: device-side assert triggered

OK, it might be something else in the sample_points_from_meshes function that is causing the error.
Can you try adding the same print statements at the beginning of the function? Can you share the Jupyter notebook of your modified code or provide a complete loop that we can run to debug further?

@vilhub were you able to resolve this issue? Alternatively as suggested in #82 can you save meshes.verts_padded() and meshes.faces_padded() to a file before the call to sample_points_from_meshes and share them here? (you can save these values at each iteration but we only need the values from the iteration when the error occurs)

We can then try to load those inputs, reproduce the error, and debug from there.
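
One way to save the inputs is sketched below. It assumes the padded tensors have already been obtained from meshes.verts_padded() and meshes.faces_padded(); random stand-in tensors are used here so the snippet runs on its own, and dump_mesh_tensors and the file name are hypothetical.

```python
import os
import tempfile

import torch


def dump_mesh_tensors(verts_padded, faces_padded, path):
    """Save detached CPU copies so the file can be loaded without a GPU.

    Hypothetical helper for debugging; not part of PyTorch3D.
    """
    torch.save(
        {
            "verts_padded": verts_padded.detach().cpu(),
            "faces_padded": faces_padded.detach().cpu(),
        },
        path,
    )


# Stand-ins for meshes.verts_padded() and meshes.faces_padded().
verts_padded = torch.rand(1, 4, 3)
faces_padded = torch.tensor([[[0, 1, 2], [0, 2, 3]]])

# The file name is arbitrary; including the iteration number helps later.
path = os.path.join(tempfile.mkdtemp(), "mesh_iter_0100.pt")
dump_mesh_tensors(verts_padded, faces_padded, path)

# Reload to confirm the file round-trips.
loaded = torch.load(path)
```

Overwriting the same file at every iteration is enough if only the values from the failing iteration are needed.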

Hi, sorry for the late answer, I haven't resolved the issue since I haven't had time yet. Tomorrow I will investigate and meanwhile I uploaded the notebook here https://github.com/vilhub/pytorch3d/blob/master/docs/tutorials/build_mesh_from_angles.ipynb .
Thanks a lot for your help!

Can you check whether your predictions contain NaNs? This might be the cause of the error. Note that this can happen if you set the weights of some losses too high.
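
A minimal sketch of such a check, assuming the quantities of interest are ordinary tensors; first_non_finite is a hypothetical helper and the tensor names are only examples.

```python
import torch


def first_non_finite(named_tensors):
    """Return the name of the first tensor containing NaN or Inf, else None."""
    for name, tensor in named_tensors.items():
        if not torch.isfinite(tensor).all():
            return name
    return None


# Stand-in values for one optimization step.
deform_verts = torch.zeros(5, 3)
loss = torch.tensor(0.25)
ok = first_non_finite({"deform_verts": deform_verts, "loss": loss})

# Simulate a deformation vector going bad.
deform_verts[2, 0] = float("nan")
bad = first_non_finite({"deform_verts": deform_verts, "loss": loss})
```

Running this once per iteration, before the sampling call, turns a late device-side assert into an early and readable failure. torch.autograd.set_detect_anomaly(True) can also help locate the backward operation that first produced a NaN, at some cost in speed.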

Good point, the deform_verts tensor containing the deformation of the vertices does indeed contain NaNs before the crash. So the issue does not come from the point sampling or the mesh functions. The maximum of the deformation vector norms grows slowly, stays below 0.3, and then suddenly two deformation vectors turn to NaNs.
The points don't go to infinity, but I'm guessing they get pushed into a weird configuration where perhaps an edge or a face is singular. I tried playing around with the weights of the Laplacian loss, edge loss and normal loss, but even setting the weights to zero does not change this.

One possible explanation could be the blur parameter of the renderer. If I leave it at the default, it gives this:
[image]
The loss is low for vertices in the yellow regions, and since there are some small yellow cavities like in the eye of the dolphin, it could be that singularities arise if points get pushed to the same small region.
If I set the blur parameter of the renderer to a lower value, the cavities disappear but the loss function becomes discontinuous, so optimization gets extremely slow. I'll try to see if I can find a decent balance for that parameter.
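
The suspicion that an edge or a face becomes singular could be tested by monitoring the triangle areas during optimization. The following is only a sketch under assumptions: vertices and faces are plain packed tensors, face_areas is a hypothetical helper, and the threshold eps is an arbitrary choice.

```python
import torch


def face_areas(verts, faces):
    """Area of each triangle; verts is (V, 3), faces is (F, 3) vertex indices."""
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    # Half the norm of the cross product of two edge vectors.
    return 0.5 * torch.cross(v1 - v0, v2 - v0, dim=1).norm(dim=1)


verts = torch.tensor(
    [
        [0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [2.0, 0.0, 0.0],  # collinear with vertices 0 and 1
    ]
)
faces = torch.tensor([[0, 1, 2], [0, 1, 3]])

areas = face_areas(verts, faces)
eps = 1e-8  # assumed threshold for "collapsed"
# Flag faces that have collapsed or whose area is NaN/Inf.
degenerate = (areas < eps) | ~torch.isfinite(areas)
```

Since sample_points_from_meshes picks faces in proportion to their area, NaN vertex positions would presumably yield NaN areas and therefore invalid sampling probabilities, which would be consistent with the assert appearing inside the sampling function.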

@vilhub can I close this issue?

Closing this issue. If you are still experiencing issues feel free to reopen it.
