On the current master branch, mul!
of small matrices can be 10x slower than 1.3 (and also allocates)
julia> versioninfo()
Julia Version 1.4.0-DEV.556
Commit 4800158ef5* (2019-12-03 21:18 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin19.0.0)
CPU: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
test code:
using LinearAlgebra
using BenchmarkTools
ndim = 3
m1 = rand(ComplexF64,ndim,ndim);
m2 = rand(ComplexF64,ndim,ndim);
ou = rand(ComplexF64,ndim,ndim);
@btime mul!($ou, $m1, $m2);
With the release-1.3 version,
33.471 ns (0 allocations: 0 bytes)
on master:
394.428 ns (1 allocation: 16 bytes)
While it's nano-seconds, when one has mul!
in the inner-most/hot loop, this can easily translate to big performance degradation when most of the calculation is with such matrix productions. Larger matrices (e.g. ndim = 30
) appears unaffected (dispatched to BLAS?) 2.732 渭s (0 allocations: 0 bytes)
on 1.3 and 2.774 渭s (0 allocations: 0 bytes)
on master.
Likely happened some time between mul!
/gemm_wrapper!
and matmul3x3!
:
@btime LinearAlgebra.gemm_wrapper!($ou, $t, $t, $m1, $m2, 1.2, .5im)
388.228 ns (3 allocations: 80 bytes)
ad = LinearAlgebra.MulAddMul(1.2, 0.5im)
t = 'N'
@btime LinearAlgebra.matmul3x3!($ou, $t, $t, $m1, $m2, $ad)
48.857 ns (0 allocations: 0 bytes)
But there's not much between:
https://github.com/JuliaLang/julia/blob/1a6baefb51170027f7ae502ae510dc5023029ed7/stdlib/LinearAlgebra/src/matmul.jl#L552-L578
t = 'N'
al = 1.2
bt = .5im
@btime LinearAlgebra.matmul3x3!($ou, $t, $t, $m1, $m2, LinearAlgebra.MulAddMul($al, $bt))
@btime LinearAlgebra.matmul3x3!($ou, $t, $t, $m1, $m2, LinearAlgebra.MulAddMul(1.2, .5im))
returns:
359.817 ns (3 allocations: 80 bytes)
48.784 ns (0 allocations: 0 bytes)
So the issue seems to come from when the constructor MulAddMul
is called, see https://github.com/JuliaLang/julia/pull/33743#discussion_r341834604. In the release version, calling mul!(C, A, B)
calls mul!(C, A, B, true, false)
, which does
alpha, beta = promote(伪, 尾, zero(T))
if alpha isa T && beta isa T
return gemm_wrapper!(C, 'N', 'N', A, B, MulAddMul(alpha, beta))
else
return generic_matmatmul!(C, 'N', 'N', A, B, MulAddMul(伪, 尾))
end
In #33743, we construct the MulAddMul
one layer below, i.e., inside the gemm_wrapper!
, as suggested by @tkf. Unfortunately, the compiler does not propagate the constants, so the construction of the MulAddMul(true, false)
does not happen at compile time, and then badly hits performance, as was already noticed by @Jutho in https://github.com/JuliaLang/julia/pull/29634#issuecomment-547931473. Still, #33743 was addressing other performance issues with adjoints/transposes and number types different from Bool
and the matrix eltype. Is there any way to help constant propagation here?
@dkarrasch Thanks for the comment, looks like julia thinks LinearAlgebra.MulAddMul(伪, 尾)
might not be type stable? Test with
using LinearAlgebra
ndim = 3
a = rand(ComplexF64,ndim,ndim);
b = rand(ComplexF64,ndim,ndim);
c = rand(ComplexF64,ndim,ndim);
t = 'N'
al = 1.2
bt = .5im
@code_warntype LinearAlgebra.gemm_wrapper!(c, t, t, a, b, al, bt)
Now that you mention it, it's not surprising: the exact type (including its parameters) of MulAddMul
is constructed based on the values of 伪
and 尾
, like iszero
and isone
. The return values of those two functions are AFAIK never determined by type. So maybe the issue is that by shifting the construction by one layer, we are potentially missing a function barrier that we have in the v1.3-release?
No, actually, I think we do have function barriers, matmul2x2!
and matmul3x3!
, so it really seems to be the constant propagation issue, i.e., whether or not the MulAddMul
object can be constructed at compile time.
@daviehh Have you considered using StaticArrays.jl
for the small matrix case?
using LinearAlgebra
using BenchmarkTools
using StaticArrays
ndim = 3
m1 = SMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))
m2 = SMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))
ou = MMatrix{ndim,ndim}(rand(ComplexF64,ndim,ndim))
@btime mul!($ou, $m1, $m2)
That can be one weird way to get around this issue... performance seems to be the reason why julia has special code for 2x2 and 3x3 matrices multiplication in the first place due to the overhead of sending matrices to BLAS.gemm!()
. Maybe rethink when to construct MulAddMul
/ deal with const. propagation or function barriers for now?
I tried StaticArrays before, for one of my particular project, the code need to tackle both large and small matrices, and it can be messy to have separate code and keep them synced & maintained to support both standard and static arrays.
Bump, this needs to be fixed / worked-around with some priority.
It looks like a single @inline
fixes it (by helping const prop, I guess). See #34384.
Thanks for the fix! Since the 2x2 and 3x3 matrix multiplication is dealt with differently, can this be put in the nanosildier's benchmark script? possibly append 2 or 3 to the SIZES at
Putting triage since we need to decide between https://github.com/JuliaLang/julia/pull/34384, https://github.com/JuliaLang/julia/pull/34394 for 1.4. What do people think?
As I said in https://github.com/JuliaLang/julia/pull/34394#issuecomment-575031153, #34394 does not quite make sense to me since MulAddMul
has the information about 0/1 in the type-domain to hoist the if-branch out of the loop. (OK, to be fair, there is _modify!
for supporting BigFloat
so this is not really true anymore.) If we go with #34394, I think it's better to just remove MulAddMul
entirely (which of course can happen after #34394 to not block the release).
As a longer term solution, I think it's better to defer bringing 0'ness to the type domain as late as possible (which is the opposite of what we are doing ATM). I think something like this might work:
struct Zero <: Number end
*(::Zero, x) = zero(x)
*(x, ::Zero) = zero(x)
function mul_impl!(C, A, B, alpha, beta)
iszero(beta) && !(beta isa Zero) && return mul_impl!(C, A, B, alpha, Zero())
...
end
It seems that #34384 together with https://github.com/JuliaLang/julia/pull/29634#discussion_r314010599 is the better solution. This should make everybody happy for the moment (i.e., for v1.4), since it makes 3- and 5-arg mul! for small matrices and multiplication of structured matrices performant. From there, we can think about further improvements. I think @daviehh suggested to add the 2x2 and 3x3 multiplication tests into the benchmark suite. That would help the future development.
I included https://github.com/JuliaLang/julia/pull/29634#discussion_r314010599 in #34384
Yes, #34384 together with https://github.com/JuliaLang/julia/pull/29634#discussion_r314010599 is the better solution, thanks a lot! Also added the test for 3 and 5 arg mul!
for 2x2 and 3x3 matrices in PR https://github.com/JuliaCI/BaseBenchmarks.jl/pull/255
@inline
.@KristofferC Thanks! Sorry for pinging/nagging but can the corresponding nano-soldier script be updated to include the benchmark test for the 2x2 and 3x3 matrix productions? I've submitted a PR at https://github.com/JuliaCI/BaseBenchmarks.jl/pull/255/
Merged, but the Nanosoldier bot also need to update to use the latest BaseBenchmarks for it to take effect.
Most helpful comment
So the issue seems to come from when the constructor
MulAddMul
is called, see https://github.com/JuliaLang/julia/pull/33743#discussion_r341834604. In the release version, callingmul!(C, A, B)
callsmul!(C, A, B, true, false)
, which doesIn #33743, we construct the
MulAddMul
one layer below, i.e., inside thegemm_wrapper!
, as suggested by @tkf. Unfortunately, the compiler does not propagate the constants, so the construction of theMulAddMul(true, false)
does not happen at compile time, and then badly hits performance, as was already noticed by @Jutho in https://github.com/JuliaLang/julia/pull/29634#issuecomment-547931473. Still, #33743 was addressing other performance issues with adjoints/transposes and number types different fromBool
and the matrix eltype. Is there any way to help constant propagation here?