Julia: mul! performance regression on master

Created on 3 Dec 2019  路  19Comments  路  Source: JuliaLang/julia

On the current master branch, mul! of small matrices can be 10x slower than 1.3 (and also allocates)

julia> versioninfo()
Julia Version 1.4.0-DEV.556
Commit 4800158ef5* (2019-12-03 21:18 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.0.0)
  CPU: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

test code:

using LinearAlgebra
using BenchmarkTools

ndim = 3

m1 = rand(ComplexF64,ndim,ndim);
m2 = rand(ComplexF64,ndim,ndim);
ou = rand(ComplexF64,ndim,ndim);

@btime mul!($ou, $m1, $m2);

With the release-1.3 version,

33.471 ns (0 allocations: 0 bytes)

on master:

394.428 ns (1 allocation: 16 bytes)

While it's nano-seconds, when one has mul! in the inner-most/hot loop, this can easily translate to big performance degradation when most of the calculation is with such matrix productions. Larger matrices (e.g. ndim = 30) appears unaffected (dispatched to BLAS?) 2.732 渭s (0 allocations: 0 bytes) on 1.3 and 2.774 渭s (0 allocations: 0 bytes) on master.

linear algebra performance regression

Most helpful comment

So the issue seems to come from when the constructor MulAddMul is called, see https://github.com/JuliaLang/julia/pull/33743#discussion_r341834604. In the release version, calling mul!(C, A, B) calls mul!(C, A, B, true, false), which does

alpha, beta = promote(伪, 尾, zero(T))
    if alpha isa T && beta isa T
        return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul(alpha, beta))
    else
        return generic_matmatmul!(C, 'N', 'N', A, B, MulAddMul(伪, 尾))
    end

In #33743, we construct the MulAddMul one layer below, i.e., inside the gemm_wrapper!, as suggested by @tkf. Unfortunately, the compiler does not propagate the constants, so the construction of the MulAddMul(true, false) does not happen at compile time, and then badly hits performance, as was already noticed by @Jutho in https://github.com/JuliaLang/julia/pull/29634#issuecomment-547931473. Still, #33743 was addressing other performance issues with adjoints/transposes and number types different from Bool and the matrix eltype. Is there any way to help constant propagation here?

All 19 comments

Likely happened some time between mul!/gemm_wrapper! and matmul3x3!:

@btime LinearAlgebra.gemm_wrapper!($ou, $t, $t, $m1, $m2, 1.2, .5im)

388.228 ns (3 allocations: 80 bytes)

ad = LinearAlgebra.MulAddMul(1.2, 0.5im)
t = 'N'
@btime LinearAlgebra.matmul3x3!($ou, $t, $t, $m1, $m2, $ad)

48.857 ns (0 allocations: 0 bytes)

But there's not much between:
https://github.com/JuliaLang/julia/blob/1a6baefb51170027f7ae502ae510dc5023029ed7/stdlib/LinearAlgebra/src/matmul.jl#L552-L578


t = 'N'
al = 1.2
bt = .5im
@btime LinearAlgebra.matmul3x3!($ou, $t, $t, $m1, $m2, LinearAlgebra.MulAddMul($al, $bt))
@btime LinearAlgebra.matmul3x3!($ou, $t, $t, $m1, $m2, LinearAlgebra.MulAddMul(1.2, .5im))

returns:

359.817 ns (3 allocations: 80 bytes)
48.784 ns (0 allocations: 0 bytes)

So the issue seems to come from when the constructor MulAddMul is called, see https://github.com/JuliaLang/julia/pull/33743#discussion_r341834604. In the release version, calling mul!(C, A, B) calls mul!(C, A, B, true, false), which does

alpha, beta = promote(伪, 尾, zero(T))
    if alpha isa T && beta isa T
        return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul(alpha, beta))
    else
        return generic_matmatmul!(C, 'N', 'N', A, B, MulAddMul(伪, 尾))
    end

In #33743, we construct the MulAddMul one layer below, i.e., inside the gemm_wrapper!, as suggested by @tkf. Unfortunately, the compiler does not propagate the constants, so the construction of the MulAddMul(true, false) does not happen at compile time, and then badly hits performance, as was already noticed by @Jutho in https://github.com/JuliaLang/julia/pull/29634#issuecomment-547931473. Still, #33743 was addressing other performance issues with adjoints/transposes and number types different from Bool and the matrix eltype. Is there any way to help constant propagation here?

@dkarrasch Thanks for the comment, looks like julia thinks LinearAlgebra.MulAddMul(伪, 尾) might not be type stable? Test with

using LinearAlgebra

ndim = 3

a = rand(ComplexF64,ndim,ndim);
b = rand(ComplexF64,ndim,ndim);
c = rand(ComplexF64,ndim,ndim);

t = 'N'
al = 1.2
bt = .5im

@code_warntype LinearAlgebra.gemm_wrapper!(c, t, t, a, b, al, bt)

Now that you mention it, it's not surprising: the exact type (including its parameters) of MulAddMul is constructed based on the values of and , like iszero and isone. The return values of those two functions are AFAIK never determined by type. So maybe the issue is that by shifting the construction by one layer, we are potentially missing a function barrier that we have in the v1.3-release?

No, actually, I think we do have function barriers, matmul2x2! and matmul3x3!, so it really seems to be the constant propagation issue, i.e., whether or not the MulAddMul object can be constructed at compile time.

@daviehh Have you considered using StaticArrays.jl for the small matrix case?

using LinearAlgebra
using BenchmarkTools
using StaticArrays

ndim = 3

m1 = SMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))
m2 = SMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))
ou = MMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))

@btime mul!($ou, $m1, $m2)

That can be one weird way to get around this issue... performance seems to be the reason why julia has special code for 2x2 and 3x3 matrices multiplication in the first place due to the overhead of sending matrices to BLAS.gemm!(). Maybe rethink when to construct MulAddMul / deal with const. propagation or function barriers for now?

I tried StaticArrays before, for one of my particular project, the code need to tackle both large and small matrices, and it can be messy to have separate code and keep them synced & maintained to support both standard and static arrays.

Bump, this needs to be fixed / worked-around with some priority.

It looks like a single @inline fixes it (by helping const prop, I guess). See #34384.

Thanks for the fix! Since the 2x2 and 3x3 matrix multiplication is dealt with differently, can this be put in the nanosildier's benchmark script? possibly append 2 or 3 to the SIZES at

https://github.com/JuliaCI/BaseBenchmarks.jl/blob/406d5b1d11b498884e58e111d5745b29c9471d40/src/linalg/LinAlgBenchmarks.jl#L13

Putting triage since we need to decide between https://github.com/JuliaLang/julia/pull/34384, https://github.com/JuliaLang/julia/pull/34394 for 1.4. What do people think?

As I said in https://github.com/JuliaLang/julia/pull/34394#issuecomment-575031153, #34394 does not quite make sense to me since MulAddMul has the information about 0/1 in the type-domain to hoist the if-branch out of the loop. (OK, to be fair, there is _modify! for supporting BigFloat so this is not really true anymore.) If we go with #34394, I think it's better to just remove MulAddMul entirely (which of course can happen after #34394 to not block the release).

As a longer term solution, I think it's better to defer bringing 0'ness to the type domain as late as possible (which is the opposite of what we are doing ATM). I think something like this might work:

struct Zero <: Number end
*(::Zero, x) = zero(x)
*(x, ::Zero) = zero(x)

function mul_impl!(C, A, B, alpha, beta)
    iszero(beta) && !(beta isa Zero) && return mul_impl!(C, A, B, alpha, Zero())
    ...
end

It seems that #34384 together with https://github.com/JuliaLang/julia/pull/29634#discussion_r314010599 is the better solution. This should make everybody happy for the moment (i.e., for v1.4), since it makes 3- and 5-arg mul! for small matrices and multiplication of structured matrices performant. From there, we can think about further improvements. I think @daviehh suggested to add the 2x2 and 3x3 multiplication tests into the benchmark suite. That would help the future development.

Yes, #34384 together with https://github.com/JuliaLang/julia/pull/29634#discussion_r314010599 is the better solution, thanks a lot! Also added the test for 3 and 5 arg mul! for 2x2 and 3x3 matrices in PR https://github.com/JuliaCI/BaseBenchmarks.jl/pull/255

34601 is yet another PR to fix for this. It doesn't add the @inline.

@KristofferC Thanks! Sorry for pinging/nagging but can the corresponding nano-soldier script be updated to include the benchmark test for the 2x2 and 3x3 matrix productions? I've submitted a PR at https://github.com/JuliaCI/BaseBenchmarks.jl/pull/255/

Merged, but the Nanosoldier bot also need to update to use the latest BaseBenchmarks for it to take effect.

Was this page helpful?
0 / 5 - 0 ratings