Why: https://github.com/eclipse/deeplearning4j/issues/8734#issuecomment-601565911
We should also check for any direct BLAS calls in DL4J/ND4J and switch them over to the op.
Needs: https://github.com/eclipse/deeplearning4j/issues/8797
Note we have Nd4j.gemm calls in BaseLayer, ConvolutionLayer, LSTM, etc.
Any/all of these can cause problems in heavily multi-threaded environments on CUDA.
There's a bit more than GEMM to BLAS/LAPACK/MKL...
Yep, we're aware :)
GEMM threading issues are a problem right now though.
The plan is to have proper op coverage for most (ideally all) of BLAS/LAPACK in the namespaces, using external libraries in libnd4j for the actual implementation where applicable. That way it can be used in both ND4J and SameDiff (usually with a nicer API), and it will be properly documented and findable in the autogen docs, etc.
And data types... We support f16/bf16, but for BLAS it's not that simple: regular BLAS doesn't support anything besides f32/f64, cuBLAS supports f16, MKLDNN supports int/bf16, etc. So we'll be centralizing all of this in one place. Java will get valid results no matter which backend is used.
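To make the data type point concrete, here is a rough sketch of what "centralizing in one place" could look like as a decision table. Everything in it is hypothetical: `GemmDispatchSketch`, its enums, and the `plan` method are made-up names, not ND4J or libnd4j API, and the support table only mirrors the examples listed above rather than the real capability matrix of any library.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: none of these names exist in ND4J/libnd4j.
final class GemmDispatchSketch {

    enum DType { FLOAT, DOUBLE, HALF, BFLOAT16, INT32 }

    enum Backend { GENERIC_BLAS, CUBLAS, MKLDNN }

    // EXTERNAL_LIBRARY: hand the call to the backend's library.
    // BUILTIN_FALLBACK: use an in-house kernel so the caller still gets a valid result.
    enum Plan { EXTERNAL_LIBRARY, BUILTIN_FALLBACK }

    // Assumed support table, taken from the examples in the comment above.
    // It is deliberately incomplete and not an authoritative capability list.
    private static final Map<Backend, Set<DType>> NATIVE_TYPES = new EnumMap<>(Backend.class);

    static {
        NATIVE_TYPES.put(Backend.GENERIC_BLAS, EnumSet.of(DType.FLOAT, DType.DOUBLE));
        NATIVE_TYPES.put(Backend.CUBLAS, EnumSet.of(DType.FLOAT, DType.DOUBLE, DType.HALF));
        NATIVE_TYPES.put(Backend.MKLDNN, EnumSet.of(DType.FLOAT, DType.BFLOAT16, DType.INT32));
    }

    // Single decision point: use the external library when it handles this
    // data type natively, otherwise fall back to a built-in implementation.
    static Plan plan(Backend backend, DType dtype) {
        return NATIVE_TYPES.get(backend).contains(dtype)
                ? Plan.EXTERNAL_LIBRARY
                : Plan.BUILTIN_FALLBACK;
    }
}
```

With this table, `plan(Backend.GENERIC_BLAS, DType.HALF)` picks the fallback while `plan(Backend.CUBLAS, DType.HALF)` picks the external library, so the caller on the Java side never has to know which combination it hit.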