First reported here.
julia> using BenchmarkTools, SparseArrays, LinearAlgebra
julia> A = sprand(100,100,0.01);
julia> B = rand(100,100);
julia> C = A*B;
julia> @btime $C = $A*$B;
18.550 μs (2 allocations: 78.20 KiB)
julia> @btime $C = $B*$A; # other way around
20.097 μs (2 allocations: 78.20 KiB)
julia> @btime mul!($C,$A,$B);
14.531 μs (0 allocations: 0 bytes)
julia> @btime mul!($C,$B,$A);
825.506 μs (6 allocations: 336 bytes)
This asymmetry in performance, which already existed on 0.6, is not (just) due to the CSC format of sparse matrices, but because mul! falls back to the most generic method in LinearAlgebra.
julia> @which mul!(C, A, B)
[...] SparseArrays\src\linalg.jl:110
julia> @which mul!(C, B, A)
[...] LinearAlgebra\src\matmul.jl:172
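For reference, here is a minimal sketch of the CSC layout that the fix below exploits (using direct field access, as the code in this thread does; nzrange, rowvals, and nonzeros are the accessor equivalents):
using SparseArrays
# A SparseMatrixCSC stores its nonzeros column by column:
#   A.colptr[j]:(A.colptr[j+1]-1) indexes the stored entries of column j,
#   A.rowval[k] is the row of the k-th stored entry, and A.nzval[k] its value.
A = sparse([1, 3, 2], [1, 1, 2], [10.0, 20.0, 30.0], 3, 2)
for j in 1:A.n, k in A.colptr[j]:(A.colptr[j+1]-1)
    println("A[", A.rowval[k], ", ", j, "] = ", A.nzval[k])
end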
It can be fixed (most simply) by copying the Base.:* method and adjusting it to be a mul! version:
import LinearAlgebra.mul!

# In-place C = X * A for dense X and sparse A, iterating A's nonzeros column by column.
function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    size(C) == (mX, A.n) || throw(DimensionMismatch())  # guard the @inbounds below
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for multivec_row = 1:mX, col = 1:A.n, k = A.colptr[col]:(A.colptr[col+1]-1)
        C[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    C
end
which gives
julia> @btime mul!($C,$A,$B);
16.077 μs (0 allocations: 0 bytes)
julia> @btime mul!($C,$B,$A);
18.241 μs (0 allocations: 0 bytes)
Note that PR https://github.com/JuliaLang/julia/pull/24045/ might fix this (I haven't looked into it in detail). However, since that PR has been lying around for over a year now, maybe we should add an intermediate fix?
Actually, even more speedup can be achieved for dense*sparse by arranging the loops properly.
Using the same setup:
julia> using BenchmarkTools, SparseArrays, LinearAlgebra
julia> A = sprand(100,100,0.01);
julia> B = rand(100,100);
julia> C = A*B;
Proposed function:
import LinearAlgebra.mul!

# In-place C = X * A for dense X and sparse A, iterating A's nonzeros column by column.
function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    size(C) == (mX, A.n) || throw(DimensionMismatch())  # guard the @inbounds below
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for multivec_row = 1:mX, col = 1:A.n, k = A.colptr[col]:(A.colptr[col+1]-1)
        C[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    C
end
gives:
julia> @btime mul!($C,$B,$A);
21.778 μs (0 allocations: 0 bytes)
while:
# Same computation, but with the loop over rows of X innermost.
function mul_alt!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    size(C) == (mX, A.n) || throw(DimensionMismatch())  # guard the @inbounds below
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for col = 1:A.n, k = A.colptr[col]:(A.colptr[col+1]-1)
        ki = rowval[k]  # hoist the row index and value of the
        kv = nzval[k]   # current nonzero out of the inner loop
        for multivec_row = 1:mX
            C[multivec_row, col] += X[multivec_row, ki] * kv
        end
    end
    C
end
gives:
@btime mul_alt!($C,$B,$A);
4.624 μs (0 allocations: 0 bytes)
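For context on why the reordering helps: Julia arrays are column-major, so the innermost loop over multivec_row walks down contiguous columns of C and X, and the nonzero's row index and value are hoisted out of that hot loop. A toy demonstration of the access-pattern effect (timings are machine-dependent):
using BenchmarkTools
X = rand(1000, 1000)
# Inner loop down a column: contiguous memory access.
function sum_colmajor(X)
    s = 0.0
    @inbounds for j in axes(X, 2), i in axes(X, 1)
        s += X[i, j]
    end
    s
end
# Inner loop across a row: strided, cache-unfriendly access.
function sum_rowmajor(X)
    s = 0.0
    @inbounds for i in axes(X, 1), j in axes(X, 2)
        s += X[i, j]
    end
    s
end
@btime sum_colmajor($X);  # expected to be noticeably faster
@btime sum_rowmajor($X);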
@Sacha0 Do you think (in the light of your PR https://github.com/JuliaLang/julia/pull/24045) that it is worth creating a PR for this?
Hi Carsten! Great meeting you this past summer at JuliaCon! :)
Do you think (in the light of your PR #24045) that it is worth creating a PR for this?
Apologies, I'm not sure I understand the question --- are you implicitly asking about the timescale on which #24045 might rise from the grave? :)
Hi Carsten! Great meeting you this past summer at JuliaCon! :)
Ditto!
[...] are you implicitly asking about the timescale on which #24045 might rise from the grave? :)
Basically, yes :) As you can see in my first post, the current performance is pretty bad, and the suggested fix is simple and effective (~200x speedup). If your PR is going to supersede this any time soon, though, it might not be worth the effort; that's why I'm asking.
It'd be great to see a revival of #24045, of course :)
@crstnbr There is a discussion #23919 about adding an API for "combined multiply-add" _C = αAB + βC_, and I wrote a PR #29634 for this. If you are going to write a new method (which BTW sounds like a great addition!), it'd be nice if it used this API. SparseArrays already has this API via 5-argument mul!, so I think you don't need to wait for #29634 to implement it.
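For illustration, a sketch of that 5-argument API as it applies here (mul!(C, A, B, α, β) computes C = α*A*B + β*C in place; the plain in-place product is the α = true, β = false case):
using LinearAlgebra, SparseArrays
A = sprand(100, 100, 0.01);
B = rand(100, 100);
C = zeros(100, 100);
mul!(C, A, B, true, false)  # C = A*B, same as mul!(C, A, B)
mul!(C, A, B, 2.0, 1.0)     # C = 2*A*B + C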
Basically, yes :) [...] It'd be great to see a revival of #24045
(A little update: I ended up working through the holiday, so likely it'll be a few more weeks. Best! S)
Hey @Sacha0. Any progress on this? If #24045 takes much longer, maybe we should fix the issue here temporarily as suggested above? It's kind of sad that this performance asymmetry still exists in 1.2/1.3.
Hey @crstnbr! :) Regrettably, reality has sunk in: I will not have any bandwidth to work on things Julia for the foreseeable future. That being the case, it would be fantastic to see someone else push #24045 over the line. Formerly, all that was necessary was transforming method signatures to use Adjoint/Transpose. Perhaps @KristofferC would still be up for the task? :) Best!
Has the original issue been solved?
Redoing the benchmarks (on the current master) gives me:
julia> @btime $C = $A*$B;
22.556 μs (2 allocations: 78.20 KiB)
julia> @btime $C = $B*$A;
29.979 μs (2 allocations: 78.20 KiB)
julia> @btime mul!($C,$A,$B);
19.144 μs (0 allocations: 0 bytes)
julia> @btime mul!($C,$B,$A);
21.595 μs (0 allocations: 0 bytes)
Has the original issue been solved?
Yes, multiplication no longer falls back to a generic matmul method; it now dispatches to https://github.com/JuliaLang/julia/blob/d2daa81b5f3227210bbd0827d2c2f172c83d9ab1/stdlib/SparseArrays/src/linalg.jl#L131
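One way to verify the dispatch locally (paths and line numbers will differ across Julia versions):
using LinearAlgebra, SparseArrays
A = sprand(100, 100, 0.01);
B = rand(100, 100);
C = B * A;
@which mul!(C, B, A)  # should point into SparseArrays/src/linalg.jl,
                      # not LinearAlgebra/src/matmul.jl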
Most helpful comment
#24045 has seen enough buzz recently that I might prioritize resurrecting it over other things in the next few weeks. That said, I can't make any promises, what with other obligations :). Best!