If it is a bug report, please include the following information (e.g., the output of `python -c 'import cupy; cupy.show_config()'`):

```python
import time
import numpy as np
import cupy as cp

a = np.random.randn(3000, 3000)
a2 = cp.asarray(a, dtype=np.float64)

start = time.time()
a2 = cp.linalg.pinv(a2)
end = time.time()
print('gpu time', end - start)

start = time.time()
b = np.linalg.pinv(a)
end = time.time()
print('cpu time', end - start)
```
Problem:
I ran into a performance problem with `cupy.linalg.pinv`: this function is surprisingly slow. The CPU (NumPy) takes about 10 s to get the result, while the GPU takes about 13 s, and the larger the matrix, the worse the gap. For other functions such as `cupy.linalg.inv`, the GPU is much faster than the CPU. Could you tell me the reason?
Thanks for reporting! I can reproduce your result as follows. We do not have an immediate answer to why CuPy's counterpart takes so much time; please let us investigate.
NumPy | CuPy
--|--
2184.667 ms | 3203.060 ms
Click to see script
```python
import numpy as np
import cupy as cp
import cupyx

def get_time(perf):
    cpu_time = perf.cpu_times.mean()
    gpu_time = perf.gpu_times.mean()
    return max(cpu_time, gpu_time) * 1000  # ms

a = np.random.randn(3000, 3000)
a2 = cp.asarray(a, dtype=np.float64)

ret = []
perf = cupyx.time.repeat(np.linalg.pinv, (a,), n_repeat=10, n_warmup=3)
ret.append(get_time(perf))
perf = cupyx.time.repeat(cp.linalg.pinv, (a2,), n_repeat=10, n_warmup=3)
ret.append(get_time(perf))

print('NumPy | CuPy')
print('--|--')
print('{:.3f} ms | {:.3f} ms'.format(ret[0], ret[1]))
```
`cupy.linalg.pinv` uses `cupy.linalg.svd`, whose implementation relies on cuSOLVER's `gesvd`, and we found some reports discussing its performance. Replacing it with `gesvdj` may resolve the issue.
https://github.com/tensorflow/tensorflow/issues/13603#issuecomment-418153277
https://github.com/pytorch/pytorch/pull/48436
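For context on why SVD performance dominates here: `pinv` is essentially an SVD plus a reciprocal of the singular values above a cutoff. A minimal NumPy sketch (simplified cutoff handling, not CuPy's exact implementation) illustrates this:

```python
import numpy as np

def pinv_via_svd(a, rcond=1e-15):
    # Moore-Penrose pseudoinverse via SVD. The SVD call is the dominant
    # cost, which is why pinv performance tracks gesvd performance.
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    cutoff = rcond * s.max()
    # Invert only singular values above the cutoff; zero out the rest.
    s_inv = np.where(s > cutoff, 1.0 / s, 0.0)
    return vt.T @ (s_inv[:, None] * u.T)

a = np.random.randn(50, 30)
assert np.allclose(pinv_via_svd(a), np.linalg.pinv(a))
```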
`gesvd` seems to be slow, unfortunately, so `gesvdj` might be worth a shot for a single matrix.
However, if you have a stack of matrices to solve (which `pinv` supports, see the ongoing PR #4686), `gesvdj_batched` is limited to small matrices (m, n <= 32), so we'll still hit this performance bottleneck. Currently the batched SVD (#4628) uses `gesvdj_batched` for small matrices and falls back to a loop over `gesvd` for arbitrary sizes, and the batched `pinv` (#4686) is based on this strategy.
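The fall-back strategy for arbitrary-size batches can be sketched in NumPy (a simplification of the loop-over-`gesvd` path; `batched_pinv_loop` is a hypothetical helper name, not CuPy API):

```python
import numpy as np

def batched_pinv_loop(stack, rcond=1e-15):
    # Fallback for batches of arbitrary-size matrices: loop over a
    # per-matrix SVD-based pinv, since the batched Jacobi SVD
    # (gesvdj_batched) only handles small matrices.
    return np.stack([np.linalg.pinv(m, rcond=rcond) for m in stack])

stack = np.random.randn(4, 20, 10)
# np.linalg.pinv natively broadcasts over the leading batch dimension.
assert np.allclose(batched_pinv_loop(stack), np.linalg.pinv(stack))
```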