First reported here.
julia> using BenchmarkTools, SparseArrays, LinearAlgebra
julia> A = sprand(100,100,0.01);
julia> B = rand(100,100);
julia> C = A*B;
julia> @btime $C = $A*$B;
18.550 μs (2 allocations: 78.20 KiB)
julia> @btime $C = $B*$A; # other way around
20.097 μs (2 allocations: 78.20 KiB)
julia> @btime mul!($C,$A,$B);
14.531 μs (0 allocations: 0 bytes)
julia> @btime mul!($C,$B,$A);
825.506 μs (6 allocations: 336 bytes)
This asymmetry in performance, which already existed on 0.6, is not (just) due to the CSC format of sparse matrices, but because mul! falls back to the most generic method in LinearAlgebra.
julia> @which mul!(C, A, B)
[...] SparseArrays\src\linalg.jl:110
julia> @which mul!(C, B, A)
[...] LinearAlgebra\src\matmul.jl:172
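For reference, here is a minimal sketch of the CSC layout that the fix below exploits (using direct field access, as the code in this thread does; nzrange, rowvals, and nonzeros are the accessor equivalents):
using SparseArrays
# A SparseMatrixCSC stores its nonzeros column by column:
#   A.colptr[j]:(A.colptr[j+1]-1) indexes the stored entries of column j,
#   A.rowval[k] is the row of the k-th stored entry, and A.nzval[k] its value.
A = sparse([1, 3, 2], [1, 1, 2], [10.0, 20.0, 30.0], 3, 2)
for j in 1:A.n, k in A.colptr[j]:(A.colptr[j+1]-1)
    println("A[", A.rowval[k], ", ", j, "] = ", A.nzval[k])
end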
It can be fixed (most simply) by copying the Base.:* method and adjusting it to be a mul! version:
import LinearAlgebra.mul!

# In-place C = X * A for dense X and sparse A, iterating A's nonzeros column by column.
function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    size(C) == (mX, A.n) || throw(DimensionMismatch())  # guard the @inbounds below
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for multivec_row = 1:mX, col = 1:A.n, k = A.colptr[col]:(A.colptr[col+1]-1)
        C[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    C
end
which gives
julia> @btime mul!($C,$A,$B);
16.077 μs (0 allocations: 0 bytes)
julia> @btime mul!($C,$B,$A);
18.241 μs (0 allocations: 0 bytes)
Note that PR https://github.com/JuliaLang/julia/pull/24045/ might fix this (I haven't looked into it in detail). However, since that PR has been lying around for over a year now, maybe we should add an intermediate fix?
Actually, even more speedup can be achieved for dense*sparse by arranging the loops properly.
Using the same setup:
julia> using BenchmarkTools, SparseArrays, LinearAlgebra
julia> A = sprand(100,100,0.01);
julia> B = rand(100,100);
julia> C = A*B;
Proposed function:
import LinearAlgebra.mul!

# In-place C = X * A for dense X and sparse A, iterating A's nonzeros column by column.
function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    size(C) == (mX, A.n) || throw(DimensionMismatch())  # guard the @inbounds below
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for multivec_row = 1:mX, col = 1:A.n, k = A.colptr[col]:(A.colptr[col+1]-1)
        C[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    C
end
gives:
julia> @btime mul!($C,$B,$A);
21.778 μs (0 allocations: 0 bytes)
while:
# Same computation, but with the loop over rows of X innermost.
function mul_alt!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    size(C) == (mX, A.n) || throw(DimensionMismatch())  # guard the @inbounds below
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for col = 1:A.n, k = A.colptr[col]:(A.colptr[col+1]-1)
        ki = rowval[k]  # hoist the row index and value of the
        kv = nzval[k]   # current nonzero out of the inner loop
        for multivec_row = 1:mX
            C[multivec_row, col] += X[multivec_row, ki] * kv
        end
    end
    C
end
gives:
@btime mul_alt!($C,$B,$A);
4.624 μs (0 allocations: 0 bytes)
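For context on why the reordering helps: Julia arrays are column-major, so the innermost loop over multivec_row walks down contiguous columns of C and X, and the nonzero's row index and value are hoisted out of that hot loop. A toy demonstration of the access-pattern effect (timings are machine-dependent):
using BenchmarkTools
X = rand(1000, 1000)
# Inner loop down a column: contiguous memory access.
function sum_colmajor(X)
    s = 0.0
    @inbounds for j in axes(X, 2), i in axes(X, 1)
        s += X[i, j]
    end
    s
end
# Inner loop across a row: strided, cache-unfriendly access.
function sum_rowmajor(X)
    s = 0.0
    @inbounds for i in axes(X, 1), j in axes(X, 2)
        s += X[i, j]
    end
    s
end
@btime sum_colmajor($X);  # expected to be noticeably faster
@btime sum_rowmajor($X);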
@Sacha0 Do you think (in the light of your PR https://github.com/JuliaLang/julia/pull/24045) that it is worth creating a PR for this?
Hi Carsten! Great meeting you this past summer at JuliaCon! :)
Do you think (in the light of your PR #24045) that it is worth creating a PR for this?
Apologies, I'm not sure I understand the question --- are you implicitly asking about the timescale on which #24045 might rise from the grave? :)
Hi Carsten! Great meeting you this past summer at JuliaCon! :)
Ditto!
[...] are you implicitly asking about the timescale on which #24045 might rise from the grave? :)
Basically, yes :) As you can see in my first post, the current performance is pretty bad, and the suggested fix is simple and effective (~200x speedup). If your PR is going to supersede this any time soon, though, it might not be worth the effort; that's why I'm asking.
It'd be great to see a revival of #24045, of course :)
@crstnbr There is a discussion #23919 about adding an API for "combined multiply-add" _C = αAB + βC_, and I wrote a PR #29634 for this. If you are going to write a new method (which BTW sounds like a great addition!), it'd be nice if it used this API. SparseArrays already has this API via 5-argument mul!, so I think you don't need to wait for #29634 to implement it.
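For illustration, a sketch of that 5-argument API as it applies here (mul!(C, A, B, α, β) computes C = α*A*B + β*C in place; the plain in-place product is the α = true, β = false case):
using LinearAlgebra, SparseArrays
A = sprand(100, 100, 0.01);
B = rand(100, 100);
C = zeros(100, 100);
mul!(C, A, B, true, false)  # C = A*B, same as mul!(C, A, B)
mul!(C, A, B, 2.0, 1.0)     # C = 2*A*B + C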
Basically, yes :) [...] It'd be great to see a revival of #24045
(A little update: I ended up working through the holiday, so likely it'll be a few more weeks. Best! S)
Hey @Sacha0. Any progress on this? If #24045 takes much longer, maybe we should fix the issue here temporarily as suggested above? It's kind of sad that this performance asymmetry still exists in 1.2/1.3.
Hey @crstnbr! :) Regrettably, reality has sunk in: I will not have any bandwidth to work on things Julia for the foreseeable future. That being the case, it would be fantastic to see someone else push #24045 over the line. Formerly, all that was necessary was transforming method signatures to use Adjoint/Transpose. Perhaps @KristofferC would still be up for the task? :) Best!
Has the original issue been solved?
Redoing the benchmarks (on the current master) gives me:
julia> @btime $C = $A*$B;
22.556 μs (2 allocations: 78.20 KiB)
julia> @btime $C = $B*$A;
29.979 μs (2 allocations: 78.20 KiB)
julia> @btime mul!($C,$A,$B);
19.144 μs (0 allocations: 0 bytes)
julia> @btime mul!($C,$B,$A);
21.595 μs (0 allocations: 0 bytes)
Has the original issue been solved?
Yes, multiplication no longer falls back to a generic matmul method; it now dispatches to https://github.com/JuliaLang/julia/blob/d2daa81b5f3227210bbd0827d2c2f172c83d9ab1/stdlib/SparseArrays/src/linalg.jl#L131
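One way to verify the dispatch locally (paths and line numbers will differ across Julia versions):
using LinearAlgebra, SparseArrays
A = sprand(100, 100, 0.01);
B = rand(100, 100);
C = B * A;
@which mul!(C, B, A)  # should point into SparseArrays/src/linalg.jl,
                      # not LinearAlgebra/src/matmul.jl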
Most helpful comment
#24045 has seen enough buzz recently that I might prioritize resurrecting it over other things in the next few weeks. That said, I can't make any promises, what with other obligations :). Best!