CuPy: cupy.linalg.pinv very low performance

Created on 13 Dec 2020 · 3 comments · Source: cupy/cupy

If it is a bug report, please include the following information:

  • Conditions (you can just paste the output of python -c 'import cupy; cupy.show_config()')

    • CuPy 8.2.0

    • Ubuntu16.04 & Win10

    • CUDA version: 10.1

    • cuDNN/NCCL version: 7 series

  • Code to reproduce

import time
import numpy as np
import cupy as cp

a = np.random.randn(3000, 3000)
a2 = cp.asarray(a, dtype=np.float64)

start = time.time()
a2 = cp.linalg.pinv(a2)
cp.cuda.Device().synchronize()  # GPU work is asynchronous; wait for it to finish before stopping the clock
end = time.time()
print('gpu time', end - start)

start = time.time()
b = np.linalg.pinv(a)
end = time.time()
print('cpu time', end - start)


Problem:
I ran into a problem when using cupy.linalg.pinv: this function is very slow. The CPU takes about 10 s to get the result, while the GPU takes 13 s, and the larger the matrix, the larger the gap on the GPU. For functions like cupy.linalg.inv, however, the GPU is much faster than the CPU. Could you tell me the reason?

Labels: performance, medium


All 3 comments

Thanks for reporting! I can reproduce your result as follows, and we do not have an immediate answer as to why CuPy's counterpart takes so much time. Please let us investigate it.

NumPy | CuPy
--|--
2184.667 ms | 3203.060 ms


Click to see script

import numpy as np
import cupy as cp
import cupyx

def get_time(perf):
    cpu_time = perf.cpu_times.mean()
    gpu_time = perf.gpu_times.mean()
    return max(cpu_time, gpu_time) * 1000  # ms

a = np.random.randn(3000, 3000)
a2 = cp.asarray(a, dtype=np.float64)

ret = []
perf = cupyx.time.repeat(np.linalg.pinv, (a,), n_repeat=10, n_warmup=3)
ret.append(get_time(perf))

perf = cupyx.time.repeat(cp.linalg.pinv, (a2,), n_repeat=10, n_warmup=3)
ret.append(get_time(perf))

print('NumPy | CuPy')
print('--|--')
print('{:.3f} ms | {:.3f} ms'.format(ret[0], ret[1]))

cupy.linalg.pinv uses cupy.linalg.svd, whose implementation relies on cuSOLVER's gesvd, and as we found, several reports have discussed gesvd's performance. Replacing it with gesvdj may resolve the issue.

https://github.com/tensorflow/tensorflow/issues/13603#issuecomment-418153277
https://github.com/pytorch/pytorch/pull/48436
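To see why pinv's runtime is dominated by the SVD, here is an illustrative sketch (in NumPy, so it runs anywhere) of the standard SVD-based pseudoinverse that pinv computes; the cutoff below is a simplified stand-in for the library's rcond handling, not CuPy's exact code:

```python
import numpy as np

# A = U @ diag(s) @ Vh  =>  pinv(A) = Vh.T @ diag(1/s) @ U.T
rng = np.random.default_rng(0)
a = rng.standard_normal((50, 30))

u, s, vh = np.linalg.svd(a, full_matrices=False)

# Invert only singular values above a small cutoff, zeroing the rest,
# as a pseudoinverse must do for (near-)rank-deficient inputs.
cutoff = max(a.shape) * np.finfo(a.dtype).eps * s.max()
s_inv = np.where(s > cutoff, 1.0 / s, 0.0)

# vh.T * s_inv scales the columns of Vh.T, i.e. Vh.T @ diag(s_inv).
a_pinv = (vh.T * s_inv) @ u.T

print(np.allclose(a_pinv, np.linalg.pinv(a)))
```

The decomposition is the expensive step, so swapping the backend routine (gesvd vs. gesvdj) changes pinv's performance directly.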

gesvd seems to be slow, unfortunately, so gesvdj might be worth a shot for a single matrix.

However, if you have a stack of matrices to solve (which pinv supports; see the ongoing PR #4686), gesvdj_batched is limited to small matrices (m, n <= 32), so we will still hit this performance bottleneck. Currently the batched SVD (#4628) uses gesvdj_batched for small matrices and falls back to a loop over gesvd for arbitrary sizes, and the batched pinv (#4686) follows this strategy.
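The fallback strategy described above can be sketched as follows, with NumPy standing in for CuPy and `batched_pinv` as a hypothetical helper name: when no fast batched kernel applies, loop a per-matrix pinv over the stack.

```python
import numpy as np

def batched_pinv(stack):
    # Fallback path: one SVD-based pinv per matrix in the stack.
    # (Mirrors the loop-over-gesvd strategy mentioned above; a real
    # implementation would dispatch to a batched kernel when eligible.)
    return np.stack([np.linalg.pinv(m) for m in stack])

rng = np.random.default_rng(1)
stack = rng.standard_normal((4, 20, 10))

out = batched_pinv(stack)
print(out.shape)  # each (20, 10) matrix yields a (10, 20) pseudoinverse
```

NumPy's own pinv already accepts stacked matrices, so the loop result can be checked against `np.linalg.pinv(stack)` directly.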

