Julia: Asymmetric speed of in-place `sparse*dense` matrix product

Created on 8 Nov 2018  路  11Comments  路  Source: JuliaLang/julia

First reported here.

julia> using BenchmarkTools, SparseArrays, LinearAlgebra

julia> A = sprand(100,100,0.01);

julia> B = rand(100,100);

julia> C = A*B;

julia> @btime $C = $A*$B;
  18.550 渭s (2 allocations: 78.20 KiB)

julia> @btime $C = $B*$A; # other way around
  20.097 渭s (2 allocations: 78.20 KiB)

julia> @btime mul!($C,$A,$B);
  14.531 渭s (0 allocations: 0 bytes)

julia> @btime mul!($C,$B,$A);
  825.506 渭s (6 allocations: 336 bytes)

This asymmetry of performance, which already existed on 0.6, is not (just) due to CSC format of sparse matrices but because mul! falls back to the most generic method in LinearAlgebra.

julia> @which mul!(C, A, B)
[...] SparseArrays\src\linalg.jl:110

julia> @which mul!(C, B, A)
[...] LinearAlgebra\src\matmul.jl:172

It can be fixed (most simply) by copying the Base.:* method and adjusting it to be a mul! version:

import LinearAlgebra.mul!
function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for multivec_row=1:mX, col = 1:A.n, k=A.colptr[col]:(A.colptr[col+1]-1)
        C[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    C
end

which gives

julia> @btime mul!($C,$A,$B);
  16.077 渭s (0 allocations: 0 bytes)

julia> @btime mul!($C,$B,$A);
  18.241 渭s (0 allocations: 0 bytes)

Note that PR https://github.com/JuliaLang/julia/pull/24045/ might fix this (haven't looked into it in detail). However, since this PR is sort of lying around for over a year now, maybe we should add an intermediate fix?

performance sparse

Most helpful comment

Basically, yes :) [...] It'd be great to see a revival of #24045

24045 has seen enough buzz recently that I might prioritize resurrecting it over other things in the next few weeks. That said, I can't make any promises what with other obligations :). Best!

All 11 comments

Actually even more speedup can be achieved for dense*sparse by proper arrangement of loops.

Using the same setup:

julia> using BenchmarkTools, SparseArrays, LinearAlgebra

julia> A = sprand(100,100,0.01);

julia> B = rand(100,100);

julia> C = A*B;

proposed function:

import LinearAlgebra.mul!
function mul!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for multivec_row=1:mX, col = 1:A.n, k=A.colptr[col]:(A.colptr[col+1]-1)
        C[multivec_row, col] += X[multivec_row, rowval[k]] * nzval[k]
    end
    C
end

gives:

julia> @btime mul!($C,$B,$A);
  21.778 渭s (0 allocations: 0 bytes)

while:

function mul_alt!(C::StridedMatrix, X::StridedMatrix, A::SparseMatrixCSC)
    mX, nX = size(X)
    nX == A.m || throw(DimensionMismatch())
    fill!(C, zero(eltype(C)))
    rowval = A.rowval
    nzval = A.nzval
    @inbounds for  col = 1:A.n, k=A.colptr[col]:(A.colptr[col+1]-1)
        ki=rowval[k]
        kv=nzval[k]
        for multivec_row=1:mX
            C[multivec_row, col] += X[multivec_row, ki] * kv
        end
    end
    C
end

gives:

@btime mul_alt!($C,$B,$A);
  4.624 渭s (0 allocations: 0 bytes)

@Sacha0 Do you think (in the light of your PR https://github.com/JuliaLang/julia/pull/24045) that it is worth creating a PR for this?

@Sacha0

Hi Carsten! Great meeting you this past summer at JuliaCon! :)

Do you think (in the light of your PR #24045) that it is worth creating a PR for this?

Apologies, I'm not sure I understand the question --- are you implicitly asking about the timescale on which #24045 might rise from the grave? :)

Hi Carsten! Great meeting you this past summer at JuliaCon! :)

Ditto!

[...] are you implicitly asking about the timescale on which #24045 might rise from the grave? :)

Basically, yes :) As you can see in my first post the current performance is pretty bad and the suggested fix is simple and effective (~200x speedup). If your PR is overriding this any time soon though it might not be worth the effort, that's why I'm asking.

It'd be great to see a revival of #24045, of course :)

@crstnbr There is a discussion #23919 for adding an API for "combined multiply-add" _C = 伪AB + 尾C_ and I wrote a PR #29634 for this. If you are going to write a new method (which BTW sounds like a great addition!), it'd be nice if it uses this API. SparseArrays already has this API using 5-argument mul! so I think you don't need to wait for #29634 to implement it.

Basically, yes :) [...] It'd be great to see a revival of #24045

24045 has seen enough buzz recently that I might prioritize resurrecting it over other things in the next few weeks. That said, I can't make any promises what with other obligations :). Best!

(A little update: I ended up working through the holiday, so likely it'll be a few more weeks. Best! S)

Hey @Sacha0. Any progress on this? If #24045 takes much longer, maybe we should fix the issue here temporarily as suggested above? It's kind of sad that this performance asymmetry still exists in 1.2/1.3.

Hey @crstnbr! :) Regrettably reality has sunk in: I will not have any bandwidth to work on things Julia for the foreseeable future. That being the case, it would be fantastic to see someone else push #24045 over the line. Formerly all that was necessary was transforming method signatures to use Adjoint/Transpose. Perhaps @KristofferC would still be up for the task? :) Best!

Has the original issue been solved?
Redoing the benchmarks (on the current master) gives me:

julia> @btime $C = $A*$B;
  22.556 渭s (2 allocations: 78.20 KiB)

julia> @btime $C = $B*$A;
  29.979 渭s (2 allocations: 78.20 KiB)

julia> @btime mul!($C,$A,$B);
  19.144 渭s (0 allocations: 0 bytes)

julia> @btime mul!($C,$B,$A);
  21.595 渭s (0 allocations: 0 bytes)

Has the original issue been solved?

Yes, multiplication is not falling back to any generic matmul method, but to https://github.com/JuliaLang/julia/blob/d2daa81b5f3227210bbd0827d2c2f172c83d9ab1/stdlib/SparseArrays/src/linalg.jl#L131

Was this page helpful?
0 / 5 - 0 ratings

Related issues

wilburtownsend picture wilburtownsend  路  3Comments

StefanKarpinski picture StefanKarpinski  路  3Comments

omus picture omus  路  3Comments

ararslan picture ararslan  路  3Comments

Keno picture Keno  路  3Comments