If it is a bug report, please include the following information (e.g., the output of `python -c 'import cupy; cupy.show_config()'`):

```python
import time
import numpy as np
import cupy as cp

a = np.random.randn(3000, 3000)
a2 = cp.asarray(a, dtype=np.float64)

start = time.time()
a2 = cp.linalg.pinv(a2)
end = time.time()
print('gpu time', end - start)

start = time.time()
b = np.linalg.pinv(a)
end = time.time()
print('cpu time', end - start)
```
Problem:
I ran into a performance problem with `cupy.linalg.pinv`: this function is surprisingly slow. The CPU (NumPy) takes about 10 s to get the result, while the GPU takes about 13 s, and the larger the matrix, the worse the gap. For other functions such as `cupy.linalg.inv`, the GPU is much faster than the CPU. Could you tell me the reason?
Thanks for reporting! I can reproduce your result as follows. We do not have an immediate answer to why CuPy's counterpart takes so much time; please let us investigate.
NumPy | CuPy
--|--
2184.667 ms | 3203.060 ms
Click to see script
```python
import numpy as np
import cupy as cp
import cupyx

def get_time(perf):
    cpu_time = perf.cpu_times.mean()
    gpu_time = perf.gpu_times.mean()
    return max(cpu_time, gpu_time) * 1000  # ms

a = np.random.randn(3000, 3000)
a2 = cp.asarray(a, dtype=np.float64)

ret = []
perf = cupyx.time.repeat(np.linalg.pinv, (a,), n_repeat=10, n_warmup=3)
ret.append(get_time(perf))
perf = cupyx.time.repeat(cp.linalg.pinv, (a2,), n_repeat=10, n_warmup=3)
ret.append(get_time(perf))

print('NumPy | CuPy')
print('--|--')
print('{:.3f} ms | {:.3f} ms'.format(ret[0], ret[1]))
```
`cupy.linalg.pinv` uses `cupy.linalg.svd`, whose implementation relies on cuSOLVER's `gesvd`, and we found some reports discussing its performance. Replacing it with `gesvdj` may resolve the issue.
https://github.com/tensorflow/tensorflow/issues/13603#issuecomment-418153277
https://github.com/pytorch/pytorch/pull/48436
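For context on why SVD performance dominates here: `pinv` is essentially an SVD plus a reciprocal of the singular values above a cutoff. A minimal NumPy sketch (simplified cutoff handling, not CuPy's exact implementation) illustrates this:

```python
import numpy as np

def pinv_via_svd(a, rcond=1e-15):
    # Moore-Penrose pseudoinverse via SVD. The SVD call is the dominant
    # cost, which is why pinv performance tracks gesvd performance.
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    cutoff = rcond * s.max()
    # Invert only singular values above the cutoff; zero out the rest.
    s_inv = np.where(s > cutoff, 1.0 / s, 0.0)
    return vt.T @ (s_inv[:, None] * u.T)

a = np.random.randn(50, 30)
assert np.allclose(pinv_via_svd(a), np.linalg.pinv(a))
```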
`gesvd` seems to be slow, unfortunately, so `gesvdj` might be worth a shot for a single matrix.
However, if you have a stack of matrices to solve (which `pinv` supports, see the ongoing PR #4686), `gesvdj_batched` is limited to small matrices (m, n <= 32), so we'll still hit this performance bottleneck. Currently the batched SVD (#4628) uses `gesvdj_batched` for small matrices and falls back to a loop over `gesvd` for arbitrary sizes, and the batched `pinv` (#4686) is based on this strategy.
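The fall-back strategy for arbitrary-size batches can be sketched in NumPy (a simplification of the loop-over-`gesvd` path; `batched_pinv_loop` is a hypothetical helper name, not CuPy API):

```python
import numpy as np

def batched_pinv_loop(stack, rcond=1e-15):
    # Fallback for batches of arbitrary-size matrices: loop over a
    # per-matrix SVD-based pinv, since the batched Jacobi SVD
    # (gesvdj_batched) only handles small matrices.
    return np.stack([np.linalg.pinv(m, rcond=rcond) for m in stack])

stack = np.random.randn(4, 20, 10)
# np.linalg.pinv natively broadcasts over the leading batch dimension.
assert np.allclose(batched_pinv_loop(stack), np.linalg.pinv(stack))
```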