Pytorch3d: CUDA error: an illegal memory access was encountered

Created on 13 Jul 2020 · 16 Comments · Source: facebookresearch/pytorch3d

🐛 Bugs / Unexpected behaviors

When I use nn.DataParallel, I get the error below. Strangely, it only appears after several batches have run. Without nn.DataParallel, everything works.

Caught RuntimeError in replica 0 on device 0. RuntimeError: CUDA error: an illegal memory access was encountered

More details:

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "", line 81, in forward
img_render = renderer(mesh)
File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/anaconda3/lib/python3.7/site-packages/pytorch3d/renderer/mesh/renderer.py", line 67, in forward
images = self.shader(fragments, meshes_world, **kwargs)
File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/anaconda3/lib/python3.7/site-packages/pytorch3d/renderer/mesh/shader.py", line 148, in forward
materials=materials,
File "/home/anaconda3/lib/python3.7/site-packages/pytorch3d/renderer/mesh/shading.py", line 123, in gouraud_shading
verts, vertex_normals, lights, cameras, materials
File "/home/anaconda3/lib/python3.7/site-packages/pytorch3d/renderer/mesh/shading.py", line 31, in _apply_lighting
camera_position=cameras.get_camera_center(),
File "/home/anaconda3/lib/python3.7/site-packages/pytorch3d/renderer/cameras.py", line 122, in get_camera_center
w2v_trans = self.get_world_to_view_transform(**kwargs)
File "/home/anaconda3/lib/python3.7/site-packages/pytorch3d/renderer/cameras.py", line 149, in get_world_to_view_transform
world_to_view_transform = get_world_to_view_transform(R=self.R, T=self.T)
File "/home/anaconda3/lib/python3.7/site-packages/pytorch3d/renderer/cameras.py", line 839, in get_world_to_view_transform
R = Rotate(R, device=R.device)
File "/home/anaconda3/lib/python3.7/site-packages/pytorch3d/transforms/transform3d.py", line 510, in __init__
_check_valid_rotation_matrix(R, tol=orthogonal_tol)
File "/home/anaconda3/lib/python3.7/site-packages/pytorch3d/transforms/transform3d.py", line 702, in _check_valid_rotation_matrix
det_R = torch.det(R)
RuntimeError: CUDA error: an illegal memory access was encountered
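
A debugging note on the traceback above: CUDA kernels are launched asynchronously, so the Python frame that reports an illegal memory access (here torch.det) is not necessarily the call that caused it. The following is a minimal sketch, not a fix, of forcing synchronous kernel launches so that the reported frame is closer to the faulting one. The helper name enable_blocking_cuda_launches is hypothetical; CUDA_LAUNCH_BLOCKING is the standard CUDA environment variable, and it has to be set before CUDA is initialized.

```python
import os
from typing import MutableMapping


def enable_blocking_cuda_launches(env: MutableMapping[str, str] = os.environ) -> None:
    """Request synchronous CUDA kernel launches (debugging aid; slows training)."""
    # The variable is read when CUDA is initialized, so set it before the
    # first CUDA call. Setting it before importing torch is the safest order.
    env["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_cuda_launches()
# import torch  # import torch (and build the model) only after this point
```

With this set, the run will be slower, so it is only worth enabling while hunting for the failing call.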

Instructions To Reproduce the Issue:

Here is my minimal code to reproduce it (sorry, it is not a fully runnable example):

class mytest(nn.Module):

    def __init__(self):
        super(mytest, self).__init__()

        self.register_buffer('face_buf', torch.tensor(Just_for_test['tri']).contiguous())

    def forward(self, face_shape, face_texture):

        batch_size = face_shape.size(0)

        R, T = look_at_view_transform(eye = ((0.0, 0.0, 1.0),), at = ((0.0, 0.0, 1.0),), up = ((0.0, 1.0, 0.0),))

        raster_settings = RasterizationSettings(
            image_size = 224,
            blur_radius = 0.0,
            faces_per_pixel = 1,
        )

        tri = (self.face_buf - 1).int()

        face_color_testures = Textures(verts_rgb = face_texture)
        mesh = Meshes(face_shape, tri.unsqueeze(0).expand((batch_size, -1, -1)), face_color_testures)

        cameras = OpenGLPerspectiveCameras(
            device = face_shape.device,
            znear = 0.01,
            zfar = 50.,
            aspect_ratio = 1.,
            fov = 12.5936,
            R = R, T = T
        )

        lights = PointLights(
            device = face_shape.device,
            ambient_color = ((1.0, 1.0, 1.0),),
            location = ((0.0, 0.0, 1e5),)
        )

        renderer = MeshRenderer(
            rasterizer = MeshRasterizer(
                cameras = cameras,
                raster_settings = raster_settings
            ),
            shader = HardGouraudShader(
                device = face_shape.device,
                cameras = cameras,
                lights = lights
            )
        )

        img_render = renderer(mesh)

        return img_render

face_model = mytest()
face_model = nn.DataParallel(face_model, device_ids =[0, 1, 2, 3])
face_model = face_model.to(device)

All 16 comments

Can you post a complete working program please? I've hit other errors trying to make this work.

@bottler Here is the code. At iteration 1366 it reported the error above:

!mkdir -p data/cow_mesh
!wget -P data/cow_mesh https://dl.fbaipublicfiles.com/pytorch3d/data/cow_mesh/cow.obj
!wget -P data/cow_mesh https://dl.fbaipublicfiles.com/pytorch3d/data/cow_mesh/cow.mtl
!wget -P data/cow_mesh https://dl.fbaipublicfiles.com/pytorch3d/data/cow_mesh/cow_texture.png

import torch
import matplotlib.pyplot as plt
from skimage.io import imread

import torch
from torchvision import transforms
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import torch.backends.cudnn as cudnn


from pytorch3d.io import load_objs_as_meshes


from pytorch3d.structures import Meshes, Textures
from pytorch3d.renderer import (
    look_at_view_transform,
    OpenGLPerspectiveCameras, 
    PointLights, 
    DirectionalLights, 
    Materials, 
    RasterizationSettings, 
    MeshRenderer, 
    MeshRasterizer,  
    TexturedSoftPhongShader,
    HardGouraudShader
)


import sys
import os
sys.path.append(os.path.abspath(''))

gpu_count = 4
gpu_list = [i for i in range(gpu_count)]
gpu_str = ','.join(list(map(lambda x: str(x), gpu_list)))

os.environ["CUDA_VISIBLE_DEVICES"] = gpu_str
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

DATA_DIR = "./data"
obj_filename = os.path.join(DATA_DIR, "cow_mesh/cow.obj")


mesh_temp = load_objs_as_meshes([obj_filename])
verts = mesh_temp.verts_list()[0].unsqueeze(0).repeat((64, 1, 1)).cuda()
texs = torch.ones((64, 2930, 3)).cuda()

class Face3D(nn.Module):
    def __init__(self):
        super(Face3D, self).__init__()
        DATA_DIR = "./data"
        obj_filename = os.path.join(DATA_DIR, "cow_mesh/cow.obj")

        # Load obj file
        mesh_temp = load_objs_as_meshes([obj_filename])

        self.register_buffer('faces', mesh_temp.faces_list()[0])

    def forward(self, verts, texs):

        batch_size = verts.size(0)
#         print('b', batch_size, texs.size(0))
        R, T = look_at_view_transform(eye = ((0.0, 0.0, 10.0),), at = ((0.0, 0.0, 0.0),), up = ((0.0, 1.0, 0.0),))
        cameras = OpenGLPerspectiveCameras(device = verts.device, R = R, T = T)


        raster_settings = RasterizationSettings(
            image_size = 224, 
            blur_radius = 0.0, 
            faces_per_pixel = 1, 
        )

        lights = PointLights(
            device = verts.device,
            ambient_color = ((1.0, 1.0, 1.0),),
            diffuse_color = ((0, 0.0, 0),),
            specular_color = ((0.0, 0, 0),),
            location = ((0.0, 0.0, 1e5),)
        )


        renderer = MeshRenderer(
            rasterizer = MeshRasterizer(
                cameras = cameras, 
                raster_settings = raster_settings
            ),
            shader = HardGouraudShader(
                device = verts.device, 
                cameras = cameras,
                lights = lights
            )
        )
        face_color_testures = Textures(verts_rgb = texs)

        tri = self.faces
        mesh =  Meshes(verts, tri.unsqueeze(0).expand((batch_size, -1, -1)), face_color_testures)

        img_render = renderer(mesh)

        return img_render[:, :, :, :3]

model = Face3D()
model = nn.DataParallel(model, device_ids = gpu_list)
model = model.to(device)
for i in range(10000):

    print(i)
    _ = model(verts, texs)
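
For reference when reading forward above: nn.DataParallel splits each input along dimension 0, so batch_size inside forward is the per-replica share rather than the full 64. Below is a minimal plain-Python sketch of that split, assuming chunk-style splitting as in torch.chunk. The helper name replica_batch_sizes is hypothetical and is not part of PyTorch.

```python
import math
from typing import List


def replica_batch_sizes(batch_size: int, num_devices: int) -> List[int]:
    """Return the dim-0 sizes of the chunks a batch is split into, one per replica."""
    if batch_size <= 0 or num_devices <= 0:
        raise ValueError("batch_size and num_devices must be positive")
    # Chunk-style split: every chunk has ceil(batch_size / num_devices) items,
    # except possibly the last, and fewer chunks than devices may be produced.
    chunk = math.ceil(batch_size / num_devices)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(replica_batch_sizes(64, 4))
```

Under that assumption, the 64-mesh batch used here gives each of the 4 replicas 16 meshes, and a batch that does not divide evenly leaves the last replica with a smaller share.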

My environment is

python 3.7.6
pytorch 1.5
cuda 10.1
gpu: GeForce RTX 2080 Ti x 4 or Tesla T4 x 8 (Both have this problem)

I've just tried this with a recent nightly build on a 2-GPU machine and it runs fine.

Thanks! It resolves my problem.

Sorry, when I ran the code with the nightly build, the same problem came back. I also found that the failure is not deterministic. @bottler

I have two machines with the same conda environment. The code above runs on one of them (RTX 2080 Ti x 4) but fails on the other (Tesla T4 x 8). Another script I want to run, almost identical to the code I posted, fails on both machines: after several batches it reports the error above.

(Updated code) Here is the code. At iteration 1870 it reported the error above:

!mkdir -p data/cow_mesh
!wget -P data/cow_mesh https://dl.fbaipublicfiles.com/pytorch3d/data/cow_mesh/cow.obj
!wget -P data/cow_mesh https://dl.fbaipublicfiles.com/pytorch3d/data/cow_mesh/cow.mtl
!wget -P data/cow_mesh https://dl.fbaipublicfiles.com/pytorch3d/data/cow_mesh/cow_texture.png


import os
import torch
import matplotlib.pyplot as plt
from skimage.io import imread

import torch
from torchvision import transforms
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import torch.backends.cudnn as cudnn


# Util function for loading meshes
from pytorch3d.io import load_objs_as_meshes

# Data structures and functions for rendering
from pytorch3d.structures import Meshes, Textures
from pytorch3d.renderer import (
    look_at_view_transform,
    OpenGLPerspectiveCameras, 
    PointLights, 
    DirectionalLights, 
    Materials, 
    RasterizationSettings, 
    MeshRenderer, 
    MeshRasterizer,  
    TexturedSoftPhongShader,
    HardGouraudShader
)

# add path for demo utils functions 
import sys
import os
sys.path.append(os.path.abspath(''))

gpu_count = 4
gpu_list = [i for i in range(gpu_count)]
gpu_str = ','.join(list(map(lambda x: str(x), gpu_list)))


os.environ["CUDA_VISIBLE_DEVICES"] = gpu_str
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

torch.manual_seed(1)
cudnn.benchmark = False
cudnn.deterministic = True

# Set paths
DATA_DIR = "./data"
obj_filename = os.path.join(DATA_DIR, "cow_mesh/cow.obj")

# Load obj file
mesh_temp = load_objs_as_meshes([obj_filename])
# texture_image = mesh.textures.maps_padded()
verts = mesh_temp.verts_list()[0].unsqueeze(0).repeat((128, 1, 1)).cuda()
texs = torch.ones((128, 2930, 3)).cuda()

class Face3D(nn.Module):
    def __init__(self):
        super(Face3D, self).__init__()
        DATA_DIR = "./data"
        obj_filename = os.path.join(DATA_DIR, "cow_mesh/cow.obj")

        # Load obj file
        mesh_temp = load_objs_as_meshes([obj_filename])

        self.register_buffer('faces', mesh_temp.faces_list()[0].long().contiguous())

    def forward(self, verts, texs):

        batch_size = verts.size(0)

        R, T = look_at_view_transform(eye = ((0.0, 0.0, 10.0),), at = ((0.0, 0.0, 0.0),), up = ((0.0, 1.0, 0.0),))
        cameras = OpenGLPerspectiveCameras(device = verts.device, R = R, T = T)

        raster_settings = RasterizationSettings(
            image_size = 224,
            blur_radius = 0.0,
            faces_per_pixel = 1,
        )

        lights = PointLights(
            device = verts.device,
            ambient_color = ((1.0, 1.0, 1.0),),
            diffuse_color = ((0, 0.0, 0),),
            specular_color = ((0.0, 0, 0),),
            location = ((0.0, 0.0, 1e5),)
        )

        renderer = MeshRenderer(
            rasterizer = MeshRasterizer(
                cameras = cameras,
                raster_settings = raster_settings
            ),
            shader = HardGouraudShader(
                device = verts.device,
                cameras = cameras,
                lights = lights
            )
        )
        face_color_testures = Textures(verts_rgb = texs)

        tri = self.faces.int()
        mesh =  Meshes(verts, tri.unsqueeze(0).expand((batch_size, -1, -1)), face_color_testures)

        img_render = renderer(mesh)

        return img_render[:, :, :, :3]

model = Face3D()
model = nn.DataParallel(model, device_ids = gpu_list)
model = model.to(device)
for i in range(1000000):

     print(i)
     _ = model(verts, texs)

It is easier to use nn.parallel.DistributedDataParallel, which works perfectly for me.

In reply to "It is easier to use nn.parallel.DistributedDataParallel, which works perfectly for me.":

Yes, that is a solution. However, could you provide an nn.DataParallel tutorial for multi-GPU settings? I found that the MeshRenderer module is not friendly to multi-GPU use. The code above sometimes causes the error. @bottler @nikhilaravi

Finally, I resolved the problem by commenting out the call to _check_valid_rotation_matrix in transform3d.py. However, I could not find any bug in _check_valid_rotation_matrix itself. Fortunately, because R is fixed in my case, I don't need to validate it on every call, so I removed the call. Maybe your team can analyze the problem. In any case, thank you for your open source tools! @bottler @nikhilaravi
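
For context on what is lost by removing that call: a rotation-matrix validity check typically confirms that R is orthogonal and has determinant +1 (the traceback in the original report shows the failure surfacing at torch.det inside the check). Below is a minimal pure-Python sketch of such a check for a single 3x3 matrix. The names det3 and is_valid_rotation and the tolerance are illustrative assumptions; this is not PyTorch3D's implementation.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def det3(m: Matrix) -> float:
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def is_valid_rotation(m: Matrix, tol: float = 1e-6) -> bool:
    """Return True if m is orthogonal and has determinant +1, within tol."""
    # Orthogonality: each entry of M @ M^T must match the identity matrix.
    for i in range(3):
        for j in range(3):
            dot = sum(m[i][k] * m[j][k] for k in range(3))
            expected = 1.0 if i == j else 0.0
            if abs(dot - expected) > tol:
                return False
    # Determinant +1 rules out reflections (scaling already fails above).
    return abs(det3(m) - 1.0) <= tol
```

If R really is fixed, as it is here, one option is to run a check like this once on the CPU copy of R before training starts, so that skipping the per-call check does not silently accept a bad matrix.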

@nikhilaravi Request to reopen this issue: looks like I am facing the same issue with pytorch 0.2.5:
Running with nn.DataParallel randomly throws an error. Any ideas?

Likewise. I am seeing exactly the same error as described in this issue. It also occurs stochastically for me, i.e. the same code crashes with this error at seemingly random points. I'm using pytorch 1.7 and pytorch3d v0.3.0. Please consider reopening this issue.

I ran into the same problem. Do you have any ideas?

@jiaxiangshang @dnnspark Please open new issues with the exact code and details of what you are experiencing. Lots of different things could conceivably give this error.

FYI, using nn.parallel.DistributedDataParallel, rather than nn.DataParallel, can fix this bug!!! Thanks @Gaozhongpai

(With nn.DataParallel, my code works well on one GPU, but on multiple GPUs it fails with: CUDA runtime error: an illegal memory access was encountered (700) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda111_1605822518874/work/interface_cuda/interface.cpp:944.)
