I am trying to project 3D points to screen space using PerspectiveCameras. For comparison, I use a random 3D point and the camera intrinsic and extrinsic matrices from the KITTI dataset. The image is rectangular, not square.
The output of multiplying a 3D point by the camera matrix in KITTI/NumPy differs from the output of cameras.transform_points_screen, even after accounting for the mirroring of axes between KITTI and PyTorch3D. The KITTI convention is X-right, Y-down and Z-inside, while the PyTorch3D convention is X-left, Y-up and Z-inside. Their screen-space conventions are identical.
"""
Sample Run:
python test/test_projection.py
"""
import os, sys
sys.path.append(os.getcwd())
import numpy as np
import torch
from pytorch3d.renderer import (
PerspectiveCameras
)
import copy
np.set_printoptions(precision= 4)
torch.set_printoptions(precision= 4, sci_mode= False)
h = 374
w = 1238
z_eps = 1e-3
# Set the cuda device
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    torch.cuda.set_device(device)
else:
    device = torch.device("cpu")
# Camera Matrix Parameters
# This KITTI matrix is in X- right, Y- down and Z-inside
# The screen space is X-right (in pixels), Y- down (in pixels)
# # K = [
# [fx, 0, px, 0],
# [0, fy, py, 0],
# [0, 0, 1, 0],
# [0, 0, 0, 1],
# ]
intrinsics = np.array([[718.3351, 0, 600.3891, 0], [0, 718.3351, 181.5122, 0], [0, 0, 1, 0], [0, 0, 0, 1]])
R = np.eye(3)
T = np.array([0.05976739, -0.00148956, 0.00261632])
#=========================================================
# Using NumPy
#=========================================================
extrinsics = np.eye(4)
extrinsics[:3, :3] = R
extrinsics[:3, 3] = T
camera_full = np.matmul(intrinsics, extrinsics)
# array([[ 7.183351e+02, 0.000000e+00, 6.003891e+02, 4.450382e+01],
# [ 0.000000e+00, 7.183351e+02, 1.815122e+02, -5.951107e-01],
# [ 0.000000e+00, 0.000000e+00, 1.000000e+00, 2.616315e-03],
# [ 0.000000e+00, 0.000000e+00, 0.000000e+00, 1.000000e+00]])
# Point input
point_kitti = np.array([[4.0, 0.5, 15]])
num_points = point_kitti.shape[0]
# Append ones
points_4d = point_kitti.transpose()
points_4d = np.vstack((points_4d, np.ones((1, num_points))))
# Matrix Multiplication
pixels_numpy = np.matmul(camera_full, points_4d)
pixels_numpy[:2] /= pixels_numpy[2]
print(camera_full)
print("3D Point and Pixel in KITTI/NumPy")
print(point_kitti)
print(pixels_numpy.transpose())
#=========================================================
# Using PyTorch3D
#=========================================================
num_cameras = 1
focal_length = np.zeros((num_cameras, 2))
focal_length[:, 0] = intrinsics[0, 0]
focal_length[:, 1] = intrinsics[1, 1]
principal_point = np.zeros((num_cameras, 2))
principal_point[:, 0] = intrinsics[0, 2]
principal_point[:, 1] = intrinsics[1, 2]
focal_length = torch.from_numpy(focal_length).float()
principal_point = torch.from_numpy(principal_point).float()
# Convert focal_length and principal_point to NDC
# Reference
# https://pytorch3d.readthedocs.io/en/latest/_modules/pytorch3d/renderer/cameras.html#PerspectiveCameras
half_imwidth = w/2.0
half_imheight = h/2.0
focal_length[:,0] /= half_imwidth
focal_length[:,1] /= half_imheight
principal_point[:, 0] = - principal_point[:, 0]/half_imwidth + 1
principal_point[:, 1] = - principal_point[:, 1]/half_imheight + 1
R_camera = torch.from_numpy(R).unsqueeze(0).to(device)
T_camera = torch.from_numpy(T).unsqueeze(0).to(device)
R_camera = R_camera.repeat(num_cameras, 1, 1)
T_camera = T_camera.repeat(num_cameras, 1)
# Pytorch 3D is X-left, Y-up and Z-inside
# This KITTI matrix is in X- right, Y- down and Z-inside
# Hence, compared to KITTI, it has X and Y mirrored
mirror = torch.eye(4).unsqueeze(0).to(device).repeat(num_cameras, 1, 1)
mirror[:, 0, 0] = -1.0
mirror[:, 1, 1] = -1.0
# Update extrinsics to reflect mirroring of axes, Intrinsics would not change
R_T_joined = torch.zeros((num_cameras, 4, 4)).to(device)
R_T_joined[:, :3, :3] = R_camera
R_T_joined[:, :3, 3] = T_camera
new_R_T = torch.bmm(R_T_joined, mirror)
R_camera = new_R_T[:, :3, :3]
T_camera = new_R_T[:, :3, 3]
# print(R_camera.shape)
# print(T_camera.shape)
# print(focal_length.shape)
# print(principal_point.shape)
cameras = PerspectiveCameras(device=device, R= R_camera, T= T_camera, focal_length= focal_length, principal_point= principal_point)
point_torch = copy.copy(point_kitti)
point_torch = torch.from_numpy(point_torch).float().unsqueeze(0).repeat(num_cameras, 1, 1).to(device)
# Update points to reflect mirroring of axes
point_torch = torch.bmm(mirror[:, :3, :3], point_torch.transpose(1, 2)).transpose(1, 2) # N, V, 3
size_torch = torch.Tensor([[w, h]]).long().to(device)
pixels_torch = cameras.transform_points_screen(points= point_torch, image_size= size_torch)
print("3D Point and Pixel in PyTorch3D")
print(point_torch[0])
print(pixels_torch[0])
The output of running the above code is
3D Points and Pixels in KITTI/NumPy
[[ 4. 0.5 15. ]]
[[794.7734 205.3812 15.0026 1. ]]
3D Points and Pixels in PyTorch3D
tensor([[-4.0000, -0.5000, 15.0000]], device='cuda:0')
tensor([[ 405.6768, 157.2217, 0.0667]], device='cuda:0')
The x and y coordinates of the 3D point are negated because of the mirroring of the axes. I also expect the pixels' z-coordinates to differ, 15.0026 versus 0.0667 (one is in pixel space, and the other is in NDC space).
However, the x and y coordinates of the pixels should exactly match. Could you help me figure out - why are x and y coordinates of pixels in NumPy and PyTorch3D different?
@abhi1kumar thanks for providing a clear code example of your issue.
One possible source of error could be from not transposing the extrinsic matrix - is your matrix in this format?: https://github.com/facebookresearch/pytorch3d/blob/340662e98e97c5e105cf6570765d7bae3e6228bf/pytorch3d/transforms/transform3d.py#L121-L123.
You might need to do R.transpose(1,2) before you pass it into the camera.
Note that when initializing PerspectiveCameras you can pass focal_length and principal_point in screen coordinates if you also pass image_size during initialization. Also, for the future: when num_cameras > 1 you don't need R_camera = R_camera.repeat(num_cameras, 1, 1). If the cameras are all the same, you only need to pass R_camera with shape (1, 3, 3), and the camera parameters will be broadcast automatically across the batch dimension.
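To make the transposition suggestion concrete, here is a small NumPy sketch (not PyTorch3D code) of the two conventions. It assumes the row-vector form `X_cam = X_world @ R + T` that the linked `transform3d.py` lines describe; the rotation angle and the variable names are made up for illustration.

```python
import numpy as np

# Column-vector convention (typical for KITTI/OpenCV): X_cam = R_col @ X + T
theta = np.deg2rad(30.0)  # arbitrary illustrative rotation about the Z axis
R_col = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
T = np.array([0.05976739, -0.00148956, 0.00261632])
X = np.array([4.0, 0.5, 15.0])

cam_col = R_col @ X + T

# Row-vector convention: X_cam = X @ R_row + T, which needs R_row = R_col.T
R_row = R_col.T
cam_row = X @ R_row + T

# Forgetting the transpose silently applies the inverse rotation instead
cam_wrong = X @ R_col + T

print(np.allclose(cam_col, cam_row))
print(np.allclose(cam_col, cam_wrong))
```

For the identity rotation used in this issue, the transpose is a no-op, so this step alone cannot change the projected pixels.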
@nikhilaravi Thank you for your response. I have now passed in the transpose of the rotation matrix and image_size as the input to the PerspectiveCameras. The updated code is as follows:
"""
Sample Run:
python test/test_projection.py
"""
import os, sys
sys.path.append(os.getcwd())
import numpy as np
import torch
from pytorch3d.renderer import (
PerspectiveCameras
)
import copy
np.set_printoptions(precision= 4)
torch.set_printoptions(precision= 4, sci_mode= False)
h = 374
w = 1238
z_eps = 1e-3
# Set the cuda device
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    torch.cuda.set_device(device)
else:
    device = torch.device("cpu")
# Camera Matrix Parameters
# This KITTI matrix is in X- right, Y- down and Z-inside
# The screen space is X-right (in pixels), Y- down (in pixels)
# # K = [
# [fx, 0, px, 0],
# [0, fy, py, 0],
# [0, 0, 1, 0],
# [0, 0, 0, 1],
# ]
intrinsics = np.array([[718.3351, 0, 600.3891, 0], [0, 718.3351, 181.5122, 0], [0, 0, 1, 0], [0, 0, 0, 1]])
R = np.eye(3)
T = np.array([0.05976739, -0.00148956, 0.00261632])
# Point input
point_kitti = np.array([[4.0, 0.5, 15]])
#=========================================================
# Using NumPy
#=========================================================
extrinsics = np.eye(4)
extrinsics[:3, :3] = R
extrinsics[:3, 3] = T
camera_full = np.matmul(intrinsics, extrinsics)
# array([[ 7.183351e+02, 0.000000e+00, 6.003891e+02, 4.450382e+01],
# [ 0.000000e+00, 7.183351e+02, 1.815122e+02, -5.951107e-01],
# [ 0.000000e+00, 0.000000e+00, 1.000000e+00, 2.616315e-03],
# [ 0.000000e+00, 0.000000e+00, 0.000000e+00, 1.000000e+00]])
num_points = point_kitti.shape[0]
# Append ones
points_4d = point_kitti.transpose()
points_4d = np.vstack((points_4d, np.ones((1, num_points))))
# Matrix Multiplication
pixels_numpy = np.matmul(camera_full, points_4d)
pixels_numpy[:2] /= pixels_numpy[2]
# print(camera_full)
print("3D Point and Pixel in KITTI/NumPy")
print(point_kitti)
print(pixels_numpy.transpose())
#=========================================================
# Using PyTorch3D
#=========================================================
num_cameras = 1
focal_length = np.zeros((num_cameras, 2))
focal_length[:, 0] = intrinsics[0, 0]
focal_length[:, 1] = intrinsics[1, 1]
principal_point = np.zeros((num_cameras, 2))
principal_point[:, 0] = intrinsics[0, 2]
principal_point[:, 1] = intrinsics[1, 2]
focal_length = torch.from_numpy(focal_length).float().to(device)
principal_point = torch.from_numpy(principal_point).float().to(device)
image_size = torch.Tensor([[w, h]]).float().to(device)
image_size = image_size.repeat(num_cameras, 1)
R_camera = torch.from_numpy(R).unsqueeze(0).to(device)
T_camera = torch.from_numpy(T).unsqueeze(0).to(device)
R_camera = R_camera.repeat(num_cameras, 1, 1)
T_camera = T_camera.repeat(num_cameras, 1)
# Pytorch 3D is X-left, Y-up and Z-inside
# This KITTI matrix is in X- right, Y- down and Z-inside
# Hence, compared to KITTI, it has X and Y mirrored
mirror = torch.eye(4).unsqueeze(0).to(device).repeat(num_cameras, 1, 1)
mirror[:, 0, 0] = -1.0
mirror[:, 1, 1] = -1.0
# Update extrinsics to reflect mirroring of axes, Intrinsics would not change
R_T_joined = torch.zeros((num_cameras, 4, 4)).to(device)
R_T_joined[:, :3, :3] = R_camera
R_T_joined[:, :3, 3] = T_camera
new_R_T = torch.bmm(R_T_joined, mirror)
R_camera = new_R_T[:, :3, :3]
T_camera = new_R_T[:, :3, 3]
# print(R_camera.shape)
# print(T_camera.shape)
# print(focal_length.shape)
# print(principal_point.shape)
# print(image_size.shape)
# The rotation to cameras should be transposed.
# M = [
# [Rxx, Ryx, Rzx, 0],
# [Rxy, Ryy, Rzy, 0],
# [Rxz, Ryz, Rzz, 0],
# [Tx, Ty, Tz, 1],
# ]
# Reference: Nikhila Ravi
# https://github.com/facebookresearch/pytorch3d/blob/340662e98e97c5e105cf6570765d7bae3e6228bf/pytorch3d/transforms/transform3d.py#L121-L123
cameras = PerspectiveCameras(device=device, R= R_camera.transpose(1,2), T= T_camera, focal_length= focal_length, principal_point= principal_point, image_size= image_size)
point_torch = copy.copy(point_kitti)
point_torch = torch.from_numpy(point_torch).float().unsqueeze(0).repeat(num_cameras, 1, 1).to(device)
# Update points to reflect mirroring of axes
point_torch = torch.bmm(mirror[:, :3, :3], point_torch.transpose(1, 2)).transpose(1, 2) # N, V, 3
size_torch = torch.Tensor([[w, h]]).long().to(device)
pixels_torch = cameras.transform_points_screen(points= point_torch, image_size= size_torch)
print("3D Point and Pixel in PyTorch3D")
print(point_torch[0])
print(pixels_torch[0])
However, this does not make the output as expected. The output x and y coordinates of the pixels are still different.
I found the solution. The built-in transform_points_screen function for PerspectiveCameras is buggy. This function does not take image_size into account while projecting from NDC to screen coordinates for PerspectiveCameras.
The internal calibration matrix for PerspectiveCameras is
K = [
[fx, 0, px, 0],
[0, fy, py, 0],
[0, 0, 0, 1],
[0, 0, 1, 0],
]
So, projecting a point [X, Y, Z, 1] means multiplication by K which gives [Xfx + pxZ, Yfy + pyZ, 1, Z].
This is normalized to produce the pixels [x, y, z, 1]
[(X/Z) fx + px
(Y/Z) fy + py
1/Z
1]
In NDC coordinates, the pixels [x', y', z', 1] are
[ (X/Z) fx/ (0.5*W) + (1 - px/(0.5*W))
(Y/Z) fy/ (0.5*H) + (1 - py/(0.5*H))
1/Z
1]
The equations to convert NDC pixels [x', y', z', 1] to our screen pixels [x, y, z, 1] when using usual camera parameters are
x = (x'-1)*0.5*W + 2*px
y = (y'-1)*0.5*H + 2*py
The equations to convert NDC pixels [x', y', z', 1] to our screen pixels [x, y, z, 1] when using NDC space camera parameters are
x = (x'+1 - 2*px)*0.5*W
y = (y'+1 - 2*py)*0.5*H
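As a check on the algebra only (it says nothing about what the library does internally), substituting the NDC expression for x' into the first conversion, the one that uses the usual camera parameters, recovers the standard pinhole projection. The symbols are the ones defined above; the y case is identical with H, fy and py.

```latex
\begin{aligned}
x &= (x' - 1)\,\tfrac{W}{2} + 2 p_x \\
  &= \left(\frac{X}{Z}\,\frac{f_x}{W/2} + 1 - \frac{p_x}{W/2} - 1\right)\frac{W}{2} + 2 p_x \\
  &= \frac{X}{Z}\, f_x - p_x + 2 p_x \\
  &= \frac{X}{Z}\, f_x + p_x
\end{aligned}
```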
Here is the small code which demonstrates this.
"""
Sample Run:
python test/test_projection.py
"""
import os, sys
sys.path.append(os.getcwd())
import numpy as np
import torch
from pytorch3d.renderer import (
PerspectiveCameras
)
import copy
np.set_printoptions(precision= 4, suppress= True)
torch.set_printoptions(precision= 4, sci_mode= False)
h = 374
w = 1238
z_eps = 1e-3
# Set the cuda device
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    torch.cuda.set_device(device)
else:
    device = torch.device("cpu")
# Camera Matrix Parameters
# This KITTI matrix is in X- right, Y- down and Z-inside
# The screen space is X-right (in pixels), Y- down (in pixels)
# # K = [
# [fx, 0, px, 0],
# [0, fy, py, 0],
# [0, 0, 1, 0],
# [0, 0, 0, 1],
# ]
intrinsics = np.array([[718.3351, 0, 600.3891, 0], [0, 718.3351, 181.5122, 0], [0, 0, 1, 0], [0, 0, 0, 1]])
R = np.eye(3)
T = np.array([0.05976739, -0.00148956, 0.00261632])
# Point input
point_kitti = np.array([[4.0, 0.5, 15]])
#=========================================================
# Using NumPy
#=========================================================
print("\n=> Using NumPy...")
extrinsics = np.eye(4)
extrinsics[:3, :3] = R
extrinsics[:3, 3] = T
camera_full = np.matmul(intrinsics, extrinsics)
# array([[ 7.183351e+02, 0.000000e+00, 6.003891e+02, 4.450382e+01],
# [ 0.000000e+00, 7.183351e+02, 1.815122e+02, -5.951107e-01],
# [ 0.000000e+00, 0.000000e+00, 1.000000e+00, 2.616315e-03],
# [ 0.000000e+00, 0.000000e+00, 0.000000e+00, 1.000000e+00]])
num_points = point_kitti.shape[0]
# Append ones
points_4d = point_kitti.transpose()
points_4d = np.vstack((points_4d, np.ones((1, num_points))))
# Matrix Multiplication
pixels_numpy = np.matmul(camera_full, points_4d)
pixels_numpy[:2] /= pixels_numpy[2]
pixels_numpy = pixels_numpy.transpose()
focal_length = np.zeros((2, ))
focal_length[0] = intrinsics[0, 0]
focal_length[1] = intrinsics[1, 1]
principal_point = np.zeros((2, ))
principal_point[0] = intrinsics[0, 2]
principal_point[1] = intrinsics[1, 2]
half_imwidth = w/2.0
half_imheight = h/2.0
focal_length[0] /= half_imwidth
focal_length[1] /= half_imheight
principal_point[0] = - principal_point[0]/half_imwidth + 1
principal_point[1] = - principal_point[1]/half_imheight + 1
ndc_matrix_np = np.eye((4))
ndc_matrix_np[0, 0] = focal_length[0]
ndc_matrix_np[1, 1] = focal_length[1]
ndc_matrix_np[0, 2] = principal_point[0]
ndc_matrix_np[1, 2] = principal_point[1]
camera_ndc = np.matmul(ndc_matrix_np, extrinsics)
# print(camera_ndc)
pixels_numpy_ndc = np.matmul(camera_ndc, points_4d)
pixels_numpy_ndc /= pixels_numpy_ndc[2]
pixels_numpy_ndc = pixels_numpy_ndc.transpose()
# Convert from NDC to screen coordinates
pixels_numpy_recalc = np.zeros(pixels_numpy_ndc.shape)
pixels_numpy_recalc[:, 0] = half_imwidth * (pixels_numpy_ndc[:, 0] + 1 - 2 * principal_point[0])
pixels_numpy_recalc[:, 1] = half_imheight * (pixels_numpy_ndc[:, 1] + 1 - 2 * principal_point[1])
# print(camera_full)
print("3D Point and Pixel in KITTI/NumPy")
print("3D Point =", point_kitti)
print("Px Inbuilt=", pixels_numpy)
print("Px Ours =", pixels_numpy_recalc[:, :2])
#=========================================================
# Using PyTorch3D
#=========================================================
print("\n=> Using Pytorch3D...")
num_cameras = 1
focal_length = np.zeros((num_cameras, 2))
focal_length[:, 0] = intrinsics[0, 0]
focal_length[:, 1] = intrinsics[1, 1]
principal_point = np.zeros((num_cameras, 2))
principal_point[:, 0] = intrinsics[0, 2]
principal_point[:, 1] = intrinsics[1, 2]
focal_length = torch.from_numpy(focal_length).float().to(device)
principal_point = torch.from_numpy(principal_point).float().to(device)
image_size = torch.Tensor([[w, h]]).float().to(device)
image_size = image_size.repeat(num_cameras, 1)
R_camera = torch.from_numpy(R).unsqueeze(0).to(device)
T_camera = torch.from_numpy(T).unsqueeze(0).to(device)
R_camera = R_camera.repeat(num_cameras, 1, 1)
T_camera = T_camera.repeat(num_cameras, 1)
# Pytorch 3D is X-left, Y-up and Z-inside
# This KITTI matrix is in X- right, Y- down and Z-inside
# Hence, compared to KITTI, it has X and Y mirrored
mirror = torch.eye(4).unsqueeze(0).to(device).repeat(num_cameras, 1, 1)
mirror[:, 0, 0] = -1.0
mirror[:, 1, 1] = -1.0
# Update extrinsics to reflect mirroring of axes, Intrinsics would not change
R_T_joined = torch.zeros((num_cameras, 4, 4)).to(device)
R_T_joined[:, :3, :3] = R_camera
R_T_joined[:, :3, 3] = T_camera
new_R_T = torch.bmm(R_T_joined, mirror)
R_camera = new_R_T[:, :3, :3]
T_camera = new_R_T[:, :3, 3]
# print(R_camera.shape)
# print(T_camera.shape)
# print(focal_length.shape)
# print(principal_point.shape)
# print(image_size.shape)
# The rotation to cameras should be transposed.
# M = [
# [Rxx, Ryx, Rzx, 0],
# [Rxy, Ryy, Rzy, 0],
# [Rxz, Ryz, Rzz, 0],
# [Tx, Ty, Tz, 1],
# ]
# Reference: Nikhila Ravi
# https://github.com/facebookresearch/pytorch3d/blob/340662e98e97c5e105cf6570765d7bae3e6228bf/pytorch3d/transforms/transform3d.py#L121-L123
cameras = PerspectiveCameras(device=device, R= R_camera.transpose(1,2), T= T_camera, focal_length= focal_length, principal_point= principal_point, image_size= image_size)
point_torch = copy.copy(point_kitti)
point_torch = torch.from_numpy(point_torch).float().unsqueeze(0).repeat(num_cameras, 1, 1).to(device)
# Update points to reflect mirroring of axes
point_torch = torch.bmm(mirror[:, :3, :3], point_torch.transpose(1, 2)).transpose(1, 2) # N, V, 3
size_torch = torch.Tensor([[w, h]]).long().to(device)
pixels_torch = cameras.transform_points_screen(points= point_torch, image_size= size_torch)
pixels_torch_ndc = cameras.transform_points(points= point_torch)
# Convert from NDC to screen coordinates
pixels_torch_recalc = pixels_torch_ndc.clone()
num_points = pixels_torch_ndc.shape[1]
principal_point = cameras.principal_point.unsqueeze(1).repeat(1, num_points, 1)
half_size_torch = size_torch.unsqueeze(1).repeat(1, num_points, 1)/2.0
if torch.any(cameras.image_size < 0):
    # image_size not given, so the principal points are normalized
    pixels_torch_recalc[:, :, 0] = half_size_torch[:, :, 0] * (pixels_torch_ndc[:, :, 0] + 1 - 2 * principal_point[:, :, 0])
    pixels_torch_recalc[:, :, 1] = half_size_torch[:, :, 1] * (pixels_torch_ndc[:, :, 1] + 1 - 2 * principal_point[:, :, 1])
else:
    # image_size given, so use the principal points in absolute (pixel) scale
    pixels_torch_recalc[:, :, 0] = half_size_torch[:, :, 0] * (pixels_torch_ndc[:, :, 0] - 1) + 2 * principal_point[:, :, 0]
    pixels_torch_recalc[:, :, 1] = half_size_torch[:, :, 1] * (pixels_torch_ndc[:, :, 1] - 1) + 2 * principal_point[:, :, 1]
print("3D Point and Pixel in PyTorch3D")
print("3D Point =", point_torch[0])
print("Px Inbuilt=", pixels_torch[0])
print("Px Ours =", pixels_torch_recalc[0])
We have four variables - pixels_numpy, pixels_numpy_recalc, pixels_torch, pixels_torch_recalc. The recalc variables are recalculated from the ndc variables. Ideally, we should have
pixels_numpy = pixels_numpy_recalc = pixels_torch = pixels_torch_recalc
However, when we use transform_points_screen function, the results are different.
The output is
=> Using NumPy...
3D Point and Pixel in KITTI/NumPy
3D Point = [[ 4. 0.5 15. ]]
Px Inbuilt= [[794.7734 205.3812 15.0026 1. ]]
Px Ours = [[794.7734 205.3812]]
=> Using Pytorch3D...
3D Point and Pixel in PyTorch3D
3D Point = tensor([[-4.0000, -0.5000, 15.0000]], device='cuda:0')
Px Inbuilt= tensor([[ 405.6768, 157.2217, 0.0667]], device='cuda:0')
Px Ours = tensor([[ 794.7734, 205.3812, 0.0667]], device='cuda:0')
Let me know if you want me to raise the PR for this bug.
@abhi1kumar
You are providing a huge code snippet that is not compact or concise. If you want us to help you, you need to provide a more compact code snippet that reproduces the issue. (See instructions below in my post)
Now to cameras. Our cameras are by default in NDC space. This note should be helpful.
You seem to want to define cameras in screen space, not in NDC space. In that case, you need to provide the image_size.
For cameras defined in screen space, our camera class converts it to NDC using the image size.
In turn, the transform_points_screen method transforms points to NDC space (with the NDC-converted camera) and then maps them back to screen space, again using the image size.
The NDC to screen conversion there only affects x and y, converting (x, y) points from NDC space to screen coordinates. z is not converted.
Also, please note our coordinate system assumptions in PyTorch3D. For NDC and world coordinates in PyTorch3D, we assume that +X points left and +Y points up. Screen (image) coordinates have +X pointing right and +Y pointing down.
If you still believe that there is a bug with cameras conversion, could you
(a) make sure that this is not because of the (R, T) transform. This means that you should disambiguate the camera transformations from world -> view (via RT) and view -> project (via K). Your issue might be because of the former and the coordinate system confusion.
(b) If you are sure that the issue is with the latter transform, please provide the K matrix you desire to set and a 3D point for test case (after it is transformed with RT in your example), along with the image size and a very compact code snippet.
@gkioxari Thank you for helping me out with such detailed comments. My apologies for not providing you concise code earlier.
Here is a small code snippet which reproduces the behaviour. For this, I have followed your suggestions and also the note: the snippet passes image_size to PerspectiveCameras and uses only the K matrix. Please let me know if this snippet works for you; otherwise I will provide a different snippet.
import torch
from pytorch3d.renderer import PerspectiveCameras
torch.set_printoptions(precision= 2, sci_mode= False)
if torch.cuda.is_available():
    device = torch.device("cuda:0")
else:
    device = torch.device("cpu")
# Camera Intrinsics Parameters
# K = [
# [fx, 0, px, 0],
# [0, fy, py, 0],
# [0, 0, 1, 0],
# [0, 0, 0, 1],
# ]
K_camera_screen = torch.Tensor([[718.3351, 0, 600.3891, 0], [0, 718.3351, 181.5122, 0], \
[0, 0, 1, 0], [0, 0, 0, 1]]).unsqueeze(0).float().to(device) #N, 4, 4
points_screen = torch.Tensor([[4.0, 0.5, 15]]).unsqueeze(0).float().to(device) # N, V, 3
h = 374
w = 1238
num_cameras = 1
focal_length_screen = K_camera_screen[:, [0,1], [0,1]]
principal_point_screen = K_camera_screen[:, [0,1], 2]
image_size = torch.Tensor([[w, h]]).float().to(device)
cameras = PerspectiveCameras(device=device, \
focal_length= focal_length_screen, \
principal_point= principal_point_screen, \
image_size= image_size)
pixels_torch = cameras.transform_points_screen(points= points_screen, image_size= image_size)
print("Pixel X,Y coordinates using PyTorch3D Inbuilt Function")
print(pixels_torch[:, :, :2])
# Append ones to the points
points_4d = torch.ones(points_screen.shape[0], points_screen.shape[1], 4).float().to(device)
points_4d[:, :, :3] = points_screen
# Multiply by intrinsics matrix and divide by z to get the pixels
pixels_by_matmul = torch.bmm(K_camera_screen, points_4d.transpose(1,2)).transpose(1,2) # N, V, 4
pixels_by_matmul[:, :, :2] /= pixels_by_matmul[:, :, 2].unsqueeze(2)
print("\nPixel X,Y coordinates using Matrix Multiplication")
print(pixels_by_matmul[:, :, :2])
The output after running this code is
Pixel X,Y coordinates using PyTorch3D Inbuilt Function
tensor([[[408.50, 157.15]]], device='cuda:0')
Pixel X,Y coordinates using Matrix Multiplication
tensor([[[791.95, 205.46]]], device='cuda:0')
Clearly, the PyTorch3D built-in function and the matrix multiplication produce different outputs.
You need to adjust for the coordinate system assumptions in PyTorch3D (NDC vs screen coordinate systems) if you want (x_screen, y_screen) and point_screen (see below) to be consistent. My code runs fine and produces a consistent result. Note that the 3D point that goes into the cameras is (-x, -y, z) to account for the coordinate system assumptions between PyTorch3D's NDC and the image space.
Here is my code
import torch
from pytorch3d.renderer import PerspectiveCameras
device = torch.device("cuda:0")
height = 374
width = 1238
image_size = torch.tensor([width, height], device=device).view(1, 2)
# coordinate system: +X right, +Y down
x, y, z = 4.0, 0.5, 15.0
# intrinsics
fx = 718.3351
fy = 718.3351
px = 600.3891
py = 181.5122
x_screen = fx * x / z + px
y_screen = fy * y / z + py
cameras = PerspectiveCameras(focal_length=((fx, fy),), principal_point=((px, py),), image_size=((width, height),), device=device)
point_py3d = torch.tensor([-x, -y, z], device=device, dtype=torch.float32).view(1, 1, 3)
point_screen = cameras.transform_points_screen(point_py3d, image_size=image_size)
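The following pure-Python sketch illustrates why negating x and y works. It is not the library's implementation: it assumes the intrinsics are converted to NDC as in the earlier posts (`f / (size/2)` and `1 - p / (size/2)`) and that NDC maps to pixels by `(size/2) * (1 - ndc)`, that is, a flip of both axes. The function names are hypothetical.

```python
def to_ndc_intrinsics(f, p, size):
    """Screen-space focal length / principal point -> NDC (assumed convention)."""
    half = size / 2.0
    return f / half, 1.0 - p / half


def project_screen(coord, z, f, p, size):
    """Project one axis in NDC, then map NDC (+left/+up) to pixels (+right/+down)."""
    f_ndc, p_ndc = to_ndc_intrinsics(f, p, size)
    ndc = f_ndc * coord / z + p_ndc
    return (size / 2.0) * (1.0 - ndc)  # assumed flip-and-scale mapping


fx = fy = 718.3351
px, py = 600.3891, 181.5122
width, height = 1238, 374
x, y, z = 4.0, 0.5, 15.0  # KITTI-style point: +X right, +Y down

# Reference pinhole projection in the KITTI convention
x_ref = fx * x / z + px
y_ref = fy * y / z + py

# Feeding the point unchanged mirrors the pixel about the principal point
x_raw = project_screen(x, z, fx, px, width)
y_raw = project_screen(y, z, fy, py, height)

# Negating x and y first reproduces the reference pixel
x_flip = project_screen(-x, z, fx, px, width)
y_flip = project_screen(-y, z, fy, py, height)

print(x_ref, y_ref)
print(x_raw, y_raw)
print(x_flip, y_flip)
```

Under these assumptions the unflipped point lands at `px - fx * x / z` instead of `px + fx * x / z`, which is the kind of mismatch reported earlier in the thread.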
@gkioxari Passing the point (x, y, z) as (-x, -y, z) to PerspectiveCameras.transform_points_screen() fixes the issue. I did not know that we have to adjust manually between NDC (x-left, y-up, z-inside) and screen coordinates (x-right, y-down, z-inside). I somehow assumed that PyTorch3D would take care of this automatically. It would be nice to note this explicitly. Thank you for helping me understand the error I was making.
A related question: Do we have to adjust the inputs in other modules such as Shaders and Fragments?
Thank you once again.
When using any renderer/graphics library, you need to understand what its world coordinate system is. Each library has its own, and there is no right or wrong one. For PyTorch3D, we assume a world coordinate system in which +X points left, +Y points up and +Z points from the camera to the scene.
We state this explicitly in this note and with a figure:
https://github.com/facebookresearch/pytorch3d/blob/master/docs/notes/cameras.md
In your case, your 3D points came from a world coordinate system in which +X points right, +Y points down and +Z points from the camera to the scene. (Based on the comments in your code, this is the KITTI dataset convention.)
So if you want to pass 3D points from the KITTI dataset (or any other dataset for that matter) to PyTorch3D you need to adjust the coordinates accordingly so that the input to PyTorch3D is in accordance to its assumptions. This is something that the user needs to be aware of and handle.
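One way to package that adjustment is to fold the axis flip into the extrinsics instead of flipping every point. The sketch below is plain NumPy with a hypothetical helper name. It assumes the KITTI-style extrinsics act on column vectors (`X_cam = R @ X + T`) and that the target convention uses row vectors (`X_cam = X @ R + T`) with +X left and +Y up in camera space, as described in this thread.

```python
import numpy as np


def kitti_to_pytorch3d_extrinsics(R, T):
    """Hypothetical helper: fold diag(-1, -1, 1) into column-vector extrinsics
    and return them in row-vector form."""
    flip = np.diag([-1.0, -1.0, 1.0])
    R_row = (flip @ R).T  # row-vector rotation with the X and Y outputs negated
    T_row = flip @ T
    return R_row, T_row


# Arbitrary illustrative rotation about Y plus the translation from this thread
theta = np.deg2rad(10.0)
R = np.array([
    [np.cos(theta),  0.0, np.sin(theta)],
    [0.0,            1.0, 0.0],
    [-np.sin(theta), 0.0, np.cos(theta)],
])
T = np.array([0.05976739, -0.00148956, 0.00261632])
X_world = np.array([4.0, 0.5, 15.0])

R_row, T_row = kitti_to_pytorch3d_extrinsics(R, T)

cam_kitti = R @ X_world + T            # camera space with +X right, +Y down
cam_flipped = X_world @ R_row + T_row  # camera space with +X left, +Y up

print(cam_kitti)
print(cam_flipped)
```

With this variant the world points keep their original coordinates and only the camera frame is flipped; whether that or flipping the points is more convenient depends on the rest of the pipeline.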
I was facing a similar issue before reading this. After flipping several axes, the results from PyTorch3D get close to the NumPy version, but I noticed a difference of around 0.5 pixels between the two.
For example, the outputs are:
791.9451266666666 205.45670333333334
tensor([[[791.3054, 204.9074]]], device='cuda:0')
I can't figure out why the difference is so big.
@rrrrrguo I agree, there is a shift of about 0.5 pixels
The reason for this is the multiplication by (img_width-1)/2.0 and (img_height-1)/2.0 instead of img_width/2.0 and img_height/2.0 in the transform_points_screen() function.
To get exactly the same results, pass w, h to the camera declaration but pass w+1, h+1 to transform_points_screen to account for the minus one:
cameras = PerspectiveCameras(focal_length=((fx, fy),), principal_point=((px, py),),\
image_size=((width, height),), device=device)
# Note plus one in image_size of transform_points_screen
point_screen = cameras.transform_points_screen(point_py3d, image_size=image_size+1)
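The size of the offset can be checked by hand. This sketch assumes, as stated above, that the version in use scales NDC by `(size - 1) / 2`, and it reuses the `1 - p / (size/2)` intrinsics conversion from the earlier posts; `ndc_to_screen` is a hypothetical name, not a library function.

```python
def ndc_to_screen(ndc, size, minus_one):
    """Map an NDC coordinate (+left/+up) to pixels; `minus_one` toggles the
    (size - 1) / 2 scaling that the thread attributes to transform_points_screen."""
    scale = (size - 1) / 2.0 if minus_one else size / 2.0
    return scale * (1.0 - ndc)


fx, px, width = 718.3351, 600.3891, 1238
x, z = 4.0, 15.0

# NDC x-coordinate of the flipped point (-x, z), intrinsics normalized by width / 2
half = width / 2.0
x_ndc = (fx / half) * (-x / z) + (1.0 - px / half)

exact = fx * x / z + px                             # plain pinhole result
with_minus_one = ndc_to_screen(x_ndc, width, True)  # about 0.64 px lower here
workaround = ndc_to_screen(x_ndc, width + 1, True)  # the +1 cancels the -1

print(exact, with_minus_one, workaround)
```

Under these assumptions the offset is `(1 - ndc) / 2` pixels, so it is not a constant 0.5: it varies between 0 and 1 pixel across the image, and here it reproduces the 791.9451 versus 791.3054 gap quoted above.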