Pytorch3d: CUDA error: device-side assert triggered in sample_points_from_meshes function

Created on 20 Mar 2020 · 7 Comments · Source: facebookresearch/pytorch3d

Hi. First of all, thanks for developing this long-desired tool. Now, coming to the bug.

I just started working with PyTorch3D and was trying the tutorial from here: https://github.com/facebookresearch/pytorch3d/blob/master/docs/tutorials/deform_source_mesh_to_target_mesh.ipynb

I set up my own Jupyter notebook to reproduce the code. However, when I tried to visualize the meshes by calling the plot_pointcloud() function from the tutorial, I got the following error:
plot_pointcloud(trg_mesh, "Target mesh")

RuntimeError                              Traceback (most recent call last)
<ipython-input-39-1e1d27f1793b> in <module>
      3 # print(trg_mesh._N)
      4 # trg_mesh.valid
----> 5 plot_pointcloud(trg_mesh, "Target mesh")
      6 # plot_pointcloud(src_mesh, "Source mesh")

<ipython-input-30-fa31b9ded440> in plot_pointcloud(mesh, title)
      2     # Sample points uniformly from the surface of the mesh
      3     print(mesh)
----> 4     points = sample_points_from_meshes(mesh, 5000)
      5     x, y, z = points.clone().detach().cpu().squeeze().unbind(1)
      6     fig = plt.figure(figsize=(5, 5))

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/pytorch3d/ops/sample_points_from_meshes.py in sample_points_from_meshes(meshes, num_samples, return_normals)
     39           be filled with 0.
     40     """
---> 41     if meshes.isempty():
     42         raise ValueError("Meshes are empty.")
     43 

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/pytorch3d/structures/meshes.py in isempty(self)
    430             bool indicating whether there is any data.
    431         """
--> 432         return self._N == 0 or self.valid.eq(False).all()
    433 
    434     def verts_list(self):

RuntimeError: CUDA error: device-side assert triggered

I noticed that the error was coming from the member mesh.valid. When I accessed that member directly in the notebook, I got a similar error.
trg_mesh.valid

RuntimeError                              Traceback (most recent call last)
~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    392                         if cls is not object \
    393                                 and callable(cls.__dict__.get('__repr__')):
--> 394                             return _repr_pprint(obj, self, cycle)
    395 
    396             return _default_pprint(obj, self, cycle)

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    682     """A pprint that just redirects to the normal repr function."""
    683     # Find newlines and replace them with p.break_()
--> 684     output = repr(obj)
    685     lines = output.splitlines()
    686     with p.group():

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/torch/tensor.py in __repr__(self)
    157         # characters to replace unicode characters with.
    158         if sys.version_info > (3,):
--> 159             return torch._tensor_str._str(self)
    160         else:
    161             if hasattr(sys.stdout, 'encoding'):

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/torch/_tensor_str.py in _str(self)
    309                 tensor_str = _tensor_str(self.to_dense(), indent)
    310             else:
--> 311                 tensor_str = _tensor_str(self, indent)
    312 
    313     if self.layout != torch.strided:

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/torch/_tensor_str.py in _tensor_str(self, indent)
    207     if self.dtype is torch.float16 or self.dtype is torch.bfloat16:
    208         self = self.float()
--> 209     formatter = _Formatter(get_summarized_data(self) if summarize else self)
    210     return _tensor_str_with_formatter(self, indent, formatter, summarize)
    211 

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/torch/_tensor_str.py in __init__(self, tensor)
     81         if not self.floating_dtype:
     82             for value in tensor_view:
---> 83                 value_str = '{}'.format(value)
     84                 self.max_width = max(self.max_width, len(value_str))
     85 

~/miniconda3/envs/pytorch3d/lib/python3.6/site-packages/torch/tensor.py in __format__(self, format_spec)
    407     def __format__(self, format_spec):
    408         if self.dim() == 0:
--> 409             return self.item().__format__(format_spec)
    410         return object.__format__(self, format_spec)
    411 

RuntimeError: CUDA error: device-side assert triggered
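
A debugging note on tracebacks like the two above: CUDA kernels run asynchronously, so a device-side assert is often reported at a later, unrelated call (here `isempty()` and the tensor `repr`) rather than at the operation that actually failed. A minimal sketch of one way to narrow this down, assuming the standard `CUDA_LAUNCH_BLOCKING` environment variable is honored by your CUDA setup; the helper name `enable_blocking_cuda_launches` is hypothetical:

```python
import os


def enable_blocking_cuda_launches():
    """Ask CUDA to run kernels synchronously so errors surface at the failing call."""
    # Assumption: this must be set before the first CUDA call in the process.
    # Setting it before importing torch (e.g. in the first notebook cell,
    # after a kernel restart) is the safest option.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_cuda_launches()
# Import torch and build the meshes only after this point.
```

With synchronous launches the notebook runs slower, so this is only meant for tracking down the failing call.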

My configuration is:
Ubuntu: 18.04
Python: 3.6.10
PyTorch: 1.4.0
PyTorch3D: 0.1.1
CUDA: 10.1

Thanks!

bug

All 7 comments

Hi @rahuldey91! Thank you for your kind words.

This issue has been reported before (see https://github.com/facebookresearch/pytorch3d/issues/82 and https://github.com/facebookresearch/pytorch3d/issues/63) and is likely due to NaNs in your meshes. Could you print the mesh data or check it for NaNs before you run the sampling?
In the meantime, I will add a check at the beginning of mesh sampling that raises a clearer error message!

I added a check that raises an error if non-finite values are passed (see https://github.com/facebookresearch/pytorch3d/commit/6c48ff6ad9005cfc03704c77531a4a25d1c8d843).
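
For readers on a version without that check, the NaN test suggested above can be done by hand before sampling. This is a rough sketch, not the code from the linked commit; the helper name `assert_finite` is hypothetical, and it only relies on `torch.isfinite`:

```python
import torch


def assert_finite(tensors, name="verts"):
    """Raise ValueError if any tensor in the list contains NaN or Inf."""
    for i, t in enumerate(tensors):
        if not torch.isfinite(t).all():
            raise ValueError(
                "{}[{}] contains non-finite values (NaN or Inf)".format(name, i)
            )


# Standalone demonstration with plain tensors:
assert_finite([torch.zeros(4, 3)])  # passes silently
```

With a PyTorch3D mesh, the call would look like `assert_finite(trg_mesh.verts_list())`, run before `sample_points_from_meshes`.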

Hi @gkioxari! Thanks for your quick response and for pointing out the related issues. I tried to check for NaNs in the mesh, but I got the same error even when calling trg_mesh.verts_list(). Then I noticed that my mesh was on device "cuda:7". I reran the code after changing the device to "cuda:0" and got the desired output without any errors. Could you help me understand why having the data on a device other than cuda:0 would produce an error?

This shouldn't create a problem. Note that we use these ops to train on multiple GPUs, e.g. when training Mesh R-CNN models with distributed training on 8 GPUs. Is it possible that your data was living on different devices, or that your GPU is faulty in some way? I can't think of other reasons why it would fail.
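
One quick way to rule out the "data on different devices" possibility is to collect the device of every tensor that makes up the mesh. A minimal sketch, assuming the tensors can be gathered into a list; the helper name `devices_of` is hypothetical:

```python
import torch


def devices_of(tensors):
    """Return the set of device names (e.g. 'cpu', 'cuda:7') the tensors live on."""
    return {str(t.device) for t in tensors}


# Standalone demonstration with CPU tensors:
found = devices_of([torch.zeros(2, 3), torch.ones(5, dtype=torch.int64)])
if len(found) > 1:
    print("Tensors are spread over several devices:", sorted(found))
```

For a PyTorch3D mesh, the input could be `trg_mesh.verts_list() + trg_mesh.faces_list()`; more than one entry in the result would mean the mesh data is split across devices.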

Here is my ipynb file to reproduce the error. If you change the device to device = torch.device("cuda:0"), it runs without errors. For any other GPU, it raises this error.
sphere_to_dolphin.zip

@rahuldey91 are you using one GPU or multiple GPUs? If you are using a GPU other than the default (cuda:0), you may need to set it explicitly as the current device:

import torch

device = torch.device("cuda:7")
torch.cuda.set_device(device)  # make cuda:7 the current CUDA device
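
A slightly more defensive variant of the same idea, as a sketch only: it makes the requested GPU current when it exists and otherwise falls back to the CPU, so the notebook still runs on machines with fewer GPUs. The helper name `select_device` is hypothetical; the calls it uses (`torch.cuda.is_available`, `torch.cuda.device_count`, `torch.cuda.set_device`) are standard PyTorch.

```python
import torch


def select_device(index):
    """Return cuda:<index> and make it the current device, or fall back to CPU."""
    if torch.cuda.is_available() and 0 <= index < torch.cuda.device_count():
        device = torch.device("cuda:{}".format(index))
        # Make this GPU the current device for subsequent CUDA calls.
        torch.cuda.set_device(device)
        return device
    return torch.device("cpu")


device = select_device(7)
print("Using device:", device)
```

Meshes and other tensors would then be created on or moved to `device`, as in the tutorial.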

Oh I see. That resolves the issue. You can go ahead and close it. Thanks.
