Julia: Matrix Multiplication API

Created on 28 Sep 2017  ·  208 Comments  ·  Source: JuliaLang/julia

Currently, there is the following line in the sparse matmult code:

https://github.com/JuliaLang/julia/blob/056b374919e11977d5a8d57b446ad1c72f3e6b1d/base/sparse/linalg.jl#L94-L95

I am assuming that this means we want to have the more general A_mul_B*!(α,A,B,β,C) = αAB + βC methods which overwrite C (the BLAS gemm API) for dense arrays. Is that still the case? (It also seems possible to keep both APIs, i.e. keep the A_mul_B*!(C,A,B) methods, which would simply be defined as A_mul_B*!(C,A,B) = A_mul_B*!(1,A,B,0,C).)

I would personally like to have the gemm API defined for all array types (this has been expressed elsewhere). Implementing these methods for dense arrays seems fairly simple, since they would just directly call gemm et al. The sparse case is already implemented. The only real modification would be to the pure julia generic matmult, which does not accept α and β arguments.

This would lead to generic code that works with any type of array/number. I currently have a simple implementation of expm (after having done the modification to _generic_matmatmult!) which works with bigfloats and sparse arrays.
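
For concreteness, a minimal sketch of what such a generic multiply-add kernel could look like (the name generic_muladd! is hypothetical, and this omits the transpose handling and tiling of the real _generic_matmatmul!):

function generic_muladd!(C, A, B, α, β)
    size(C, 1) == size(A, 1) && size(C, 2) == size(B, 2) && size(A, 2) == size(B, 1) ||
        throw(DimensionMismatch())
    @inbounds for j in axes(C, 2), i in axes(C, 1)
        acc = zero(eltype(C))   # generic accumulator, works for BigFloat etc.
        for k in axes(A, 2)
            acc += A[i, k] * B[k, j]
        end
        C[i, j] = α * acc + β * C[i, j]
    end
    return C
end

With α=1 and β=0 this reduces to the current three-argument behavior (up to the handling of β==0, which comes up later in the thread).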

linear algebra

Most helpful comment

I suggest a single round of approval voting with 10 days deadline from now. Approval voting means: Everyone votes for all options that they deem preferable to continuing the discussion. People who would rather have their least preferred name now than a continued discussion should vote for all three. If no option receives widespread approval, or the voting scheme itself fails to meet widespread approval, then we must continue the discussion. In case of almost-ties between approved options, @tkf gets to decide (PR author's privilege).

+1: I agree with this voting scheme and have cast my approval votes.
-1: I disagree with this voting scheme. If too many or too important people select this option, then the vote is moot.

Heart: mul! is preferable to continued discussion.
Rocket: muladd! is preferable to continued discussion.
Hooray: addmul! is preferable to continued discussion.

I tentatively suggest that 75% approval and 5 should definitely make a quorum (i.e. 75% of people who have voted at all, including disagreement with the entire voting procedure, and at least 5 people have approved of the winning option; if participation is low, then 5/6 or 6/8 make a quorum but unanimous 4/4 could be considered as failure).

All 208 comments

Ref. #9930, #20053, #23552. Best!

Thanks for the references. I suppose this issue has more to do with adding the gemm-style methods than an API revamp, but it can be closed if we think it still is too similar to #9930.

As a starting point, would there be support for having _generic_matmatmul! have the gemm API? It's a fairly simple change, and purely additive/nonbreaking, since the current method would simply be implemented by having α=1 and β=0. I can make the PR. I'd probably do it similarly to this version (in that version I cut all the transpose stuff because I didn't need it; I wouldn't do that here).

Yes. That would be a good start. We need to consider the ordering of the arguments though. Originally, I thought it was more natural to follow the BLAS ordering but we are fairly consistent about having output arguments first, which is also the case for the current three-argument A_mul_B!. Furthermore, and as you have already pointed out, the three-argument version would correspond to the five-argument version with α=1 and β=0, and default value arguments are last. Of course, we don't necessarily have to use the default value syntax for this but it would make sense to use it here.

Why not just introduce a generic gemm function?

Yes. That would be a good start. We need to consider the ordering of the arguments though. Originally, I thought it was more natural to follow the BLAS ordering but we are fairly consistent about having output arguments first, which is also the case for the current three-argument A_mul_B!. Furthermore, and as you have already pointed out, the three-argument version would correspond to the five-argument version with α=1 and β=0, and default value arguments are last. Of course, we don't necessarily have to use the default value syntax for this but it would make sense to use it here.

Sounds good. We can continue the discussion about the actual argument ordering and method renaming in #9930. This is more about just having the five-argument version available, so I'll keep the current Ax_mul_Bx!(α,A,B,β,C) interface.

Why not just introduce a generic gemm function?

Are you suggesting renaming _generic_matmatmul! to gemm! in addition to the changes above?

To be clearer, I do think we should end up having a single method mul(C,A,B,α=1,β=0), along with lazy transpose/adjoint types to dispatch on, but that will be another PR.

Why not just introduce a generic gemm function?

I think gemm is a misnomer in Julia. In BLAS, the ge part indicates that the matrices are general, i.e. have no special structure, the first m is multiply and the last m is matrix. In Julia, the ge (general) part is encoded in the signature as is the last m (matrix) so I think we should just call it mul!.

Is the notation mul!(α, A, B, β, C) from SparseArrays.jl

https://github.com/JuliaLang/julia/blob/160a46704fd1b349b5425f104a4ac8b323ea85af/stdlib/SparseArrays/src/linalg.jl#L32

"finalised" as the official syntax? And this would be in addition to mul!(C, A, B), lmul!(A, B), and rmul!(A,B)?

I'm not too big a fan of having A, B, and C in different orders.

Is the notation mul!(α, A, B, β, C) from SparseArrays.jl "finalised" as the official syntax?

I'd say no. Originally, I liked the idea of following BLAS (and the order also matched how this is usually written mathematically) but now I think it makes sense to just add the scaling arguments as optional fourth and fifth arguments.

So just to clarify, you'd like optional arguments in the sense

function mul!(C, A, B, α=1, β=0)
 ...
end

The other option would be optional keyword arguments

function mul!(C, A, B; α=1, β=0)
...
end

but I'm not sure people are too happy with unicode.

I'm very happy with unicode but it is true that we try to always have an ASCII option, and it would be possible here. The names α and β are also not super intuitive unless you know BLAS, so I think using positional arguments is the better solution here.

In my opinion, a more logical nomenclature would be to let muladd!(A, B, C, α=1, β=1) map to the various BLAS routines that do multiplication and addition. (gemm as above, but also e.g. axpy when A is a scalar.)

The mul! function could then have an interface like mul!(Y, A, B, ...) taking an arbitrary number of arguments (just like * does) as long as all intermediate results can be stored in Y. (An optional kwarg could specify the order of multiplication, with a reasonable default)

mul!(Y, A::AbstractVecOrMat, B::AbstractVecOrMat, α::Number) would have the default implementation muladd!(A, B, Y, α=α, β=0), as would the other permutations of two matrix/vectors and a scalar.

Another vote to have mul!(C, A, B, α, β) defined for dense matrices. This would allow writing generic code for dense and sparse matrices. I wanted to define such a function in my nonlinear least-squares package but I guess this is type piracy.

I also have been tempted to write mul!(C, A, B, α, β) methods for the MixedModels package and engage in a bit of type piracy, but it would be much better if such methods were in the LinearAlgebra package. Having methods for a muladd! generic for this operation would be fine with me too.

I'm in favour, though I think it should probably have a different name than just mul!. muladd! seems reasonable, but am certainly open to suggestions.

Maybe mulinc! for multiply-increment?

Maybe we can have something like addmul!(C, A, B, α=1, β=1)?

Isn’t this a form of muladd!? Or is the idea behind calling it addmul! that it mutates the add argument rather than the multiply argument? Would one ever mutate the multiply argument?

Note that we do mutate non-first elements in some cases, e.g. lmul! and ldiv!, so we could do them in the usual "muladd" order (i.e. muladd!(A,B,C)). The question is in what order α and β should go. One option would be to make them keyword arguments.

Wouldn't it be nice to leave implementers the option to dispatch on the types of the scalars α and β? It's easy to add sugar for end-users.

I thought we already settled on mul!(C, A, B, α, β) with default values for α, β. We are using this version in https://github.com/JuliaLang/julia/blob/b8ca1a499ff4044b9cb1ba3881d8c6fbb1f3c03b/stdlib/SparseArrays/src/linalg.jl#L32-L50. I think some packages are also using this form but I don't remember which off the top of my head.

Thanks! It would be nice if that's documented.

I thought we already settled on mul!(C, A, B, α, β) with default values for α, β.

SparseArrays uses it, but I don't recall it being discussed anywhere.

In some ways the muladd! name is more natural because it is a multiplication followed by an addition. However, the default values of α and β, muladd!(C, A, B, α=1, β=0) (note that the default for β is zero, not one), turn it back into mul!(C, A, B).

It seems to be a case of pedantry vs consistency whether to call this mul! or muladd! and I think the case of the existing method in SparseArrays would argue for mul!. I find myself in the curious case of arguing for consistency despite my favorite Oscar Wilde quote, "Consistency is the last refuge of the unimaginative."

In some ways the muladd! name is more natural because it is a multiplication followed by an addition. However, the default values of α and β, muladd!(C, A, B, α=1, β=0) (note that the default for β is zero, not one), turn it back into mul!(C, A, B).

There is an interesting exception to this when C contains Infs or NaNs: theoretically, if β==0, the result should still be NaNs. This doesn't happen in practice because BLAS and our sparse matrix code explicitly check for β==0 and then replace C with zeros.

You could consider having defaults of α=true, β=false, since true and false are "strong" 1 and 0 respectively, in the sense that true*x is always x and false*x is always zero(x).

lmul! should also have that exceptional behaviour: https://github.com/JuliaLang/julia/issues/28972

true and false are "strong" 1 and 0 respectively, in the sense that true*x is always x and false*x is always zero(x).

I didn't know that!:

julia> false*NaN
0.0

FWIW, I'm pretty happy with the readability of LazyArrays.jl syntax for this operation:

y .= α .* Mul(A,x) .+ β .* y

Behind the scenes it lowers to mul!(y, A, x, α, β), for BLAS-compatible arrays (banded and strided).

I didn't know that!

It's part of what makes im = Complex(false, true) work.

SparseArrays uses it, but I don't recall it being discussed anywhere.

It was discussed above in https://github.com/JuliaLang/julia/issues/23919#issuecomment-365463941 and implemented in https://github.com/JuliaLang/julia/pull/26117 without any objections. We don't have the α,β versions in the dense case so the only place in this repo where a decision would have an immediate effect would be SparseArrays.

What about LinearAlgebra.BLAS.gemm!? Shouldn't it be wrapped as 5-ary mul!, too?

It should but nobody has done it yet. There are many methods in matmul.jl.

It was discussed above in #23919 (comment) and implemented in #26117 without any objections.

Well, consider this my objection. I would prefer a different name.

Why would it be a different name? Both in the dense and sparse case, the basic algorithm does both the multiplication and addition.

If we give those functions different names, we'll have mul!(C,A,B) = dgemm(C,A,B,1,0) and muladd!(C,A,B,α, β) = dgemm(C,A,B,α, β).

The only advantage I see is if we actually split the methods and save an if β==0 check in the C = A*B case.

FYI, I started working on it in #29634 to add the interface to matmul.jl. I'm hoping to finish it by the time the name and signature are decided :)

An advantage of muladd! would be that we can have ternary muladd!(A, B, C) (or muladd!(C, A, B)?) with the default α = β = true (as mentioned in the original suggestion https://github.com/JuliaLang/julia/issues/23919#issuecomment-402953987). The method muladd!(A, B, C) is similar to muladd for Numbers so I guess it's more natural name especially if you already know muladd.

@andreasnoack It seems your earlier discussion is about method signature and preferring positional arguments over keyword arguments, not about method name. Do you have objection for the name muladd!? (The existence of 5-ary mul! in SparseArrays might be one, but defining the backward-compatible wrapper is not hard.)

Having both mul! and muladd! seems redundant when the former is just the latter with default values for α and β. Furthermore, the add part has been canonicalized by BLAS. If we could come up with a credible generic linear algebra application for muladd!, I'd like to hear about it but otherwise I'd prefer to avoid the redundancy.

Also, I'd strongly prefer that we keep the strong-zero property of false separate from the discussion of mul!. IMO any zero value of β should be strong as it is in BLAS and as it is in the current five-argument mul! methods. I.e. this behavior should be a consequence of mul! and not the type of β. The alternative would be difficult to work with. E.g. mul!(Matrix{Float64}, Matrix{Float64}, Matrix{Float64}, 1.0, 0.0) couldn't use BLAS.

We can't change what BLAS does, but _requiring_ strong zero behaviour for floats does mean that every implementation will need a branch to check for zero.
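
Concretely, the branch could look like this (a sketch only; the helper name scale_output! is made up). Since 0.0 * NaN is NaN, an unconditional C .*= β would propagate garbage from an uninitialized C:

using LinearAlgebra

# Emulate BLAS's "strong zero" for β: never read C's old contents when β == 0.
function scale_output!(C, β)
    if iszero(β)
        fill!(C, zero(eltype(C)))   # ignore old C, even if it holds NaN/Inf
    elseif !isone(β)
        rmul!(C, β)                 # honest in-place scaling otherwise
    end
    return C
end

A five-argument kernel would call this once up front and then accumulate αAB into C.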

If we could come up with a credible generic linear algebra application for muladd!

@andreasnoack By this, I suppose you mean "application for _three-argument_ muladd!" since otherwise you wouldn't agree to include five-argument mul!?

But I can still come up with an example where muladd!(A, B, C) is useful. For example, if you want to construct a "small-world" network, it's useful to have a "lazy" summation of a banded matrix and a sparse matrix. You can then write something like:

A :: SparseMatrixCSC
B :: BandedMatrix
x :: Vector  # input
y :: Vector  # output

# Compute `y .= (A .+ B) * x` efficiently:
fill!(y, 0)
muladd!(x, A, y)  # y .+= A * x
muladd!(x, B, y)  # y .+= B * x

But I don't mind manually writing trues there since I can simply wrap it for my usage. Having the five-argument function as a stable documented API is the most important goal here.

Going back to the point:

Having both mul! and muladd! seems redundant when the former is just the latter with default values for α and β.

But we have some * implemented in terms of mul! with the "default value" of the output array initialized appropriately. I think there are many such "shortcut" examples in Base and standard libraries? I think it makes sense to have both mul! and muladd! even though mul! is just a shortcut of muladd!.

I'd strongly prefer that we keep the strong-zero property of false separate from the discussion of mul!

I agree that it would be constructive to focus discussing the name of five-argument version of the multiply-add first (mul! vs muladd!).

I didn't do a good job when I asked for a generic use case where you needed muladd to work generically across matrices and numbers. The number version would be muladd without the exclamation mark so what I asked didn't really make sense.

Your example could just be written as

mul!(y, A, x, 1, 1)
mul!(y, B, x, 1, 1)

so I still don't see the need for muladd!. Is it just that you think this case is so common that writing 1, 1 is too verbose?

But we have some * implemented in terms of mul! with the "default value" of the output array initialized appropriately. I think there are many such "shortcut" examples in Base and standard libraries?

I don't get this one. Could you try to elaborate? What are the shortcuts you are talking about here?

so I still don't see the need for muladd!. Is it just that you think this case is so common that writing 1, 1 is too verbose?

I think muladd! is also more descriptive as to what it actually does (though perhaps it should be addmul!).

I don't have a problem with the name muladd!. Primarily, I just don't think we should have two functions for this, and secondarily I don't think deprecating mul! in favor of muladd!/addmul! is worth it.

Is it just that you think this case is so common that writing 1, 1 is too verbose?

No. I'm totally fine with calling the five-argument function as long as it's a public API. I just tried to give an example where I only need the three-argument version (as I thought that was your request).

What are the shortcuts you are talking about here?

https://github.com/JuliaLang/julia/blob/f068f21d6099632bd5543ad065d5de96943c9181/stdlib/LinearAlgebra/src/matmul.jl#L140-L143

I think * defined here can be considered a shortcut of mul!. It's "just" mul! with a default value. So, why not let mul! be a muladd!/addmul! with default values?
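
Schematically, the layering being argued for is (paraphrased; allocate_output stands in for the similar(...) call in the linked matmul.jl lines):

*(A, B)       = mul!(allocate_output(A, B), A, B)  # * is a shortcut of mul!
mul!(C, A, B) = muladd!(C, A, B, true, false)      # mul! would be a shortcut of muladd!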

There are rmul! and lmul! defined as similar "shortcuts" as well:

https://github.com/JuliaLang/julia/blob/f068f21d6099632bd5543ad065d5de96943c9181/stdlib/LinearAlgebra/src/triangular.jl#L478-L479

deprecating mul!

I thought the discussion was about adding a new interface or not. If we need to deprecate mul! to add a new API, I don't think it's worth it.

The main arguments I can think of are:

  • conceptually the 5-argument form does more than just "multiply", and conveys this more clearly.
  • you can then write addmul!(C, A, B) instead of mul!(C,A,B,1,1) or mul!(C,A,B,true,true).

I think * defined here can be considered a shortcut of mul!. It's "just" mul! with a default value. So, why not let mul! be a muladd!/addmul! with default values?

Because * is the default way of multiplying matrices and how most users would do it. In comparison, muladd! wouldn't be anywhere close to * in usage. Furthermore, it's even an existing operator whereas muladd!/addmul! would be a new function.

I don't think rmul! and lmul! fit this pattern because they generally aren't default value versions of out-of-place mul! methods.

Simon summarizes the benefits nicely in the post right above. The question is whether the benefits are large enough to justify either an extra function or a renaming (which means deprecation of mul!). It's where we disagree. I don't think it's worth it.

When you say that renaming isn't worth it, did you take into account that the API is not completely public? By that, I mean it's not in Julia's documentation.

I know LazyArrays.jl (and other packages?) already uses it so blindly following semver wouldn't be good. But still, it's not as public as other functions.

mul! is exported from LinearAlgebra and broadly used so we'd definitely have to deprecate it at this point. It's a shame we didn't have this discussion when A_mul_B! became mul! or at least before 0.7 because it would have been a much better time to rename the function.

How about using mul! for now and update the name for LinearAlgebra v2.0 when we can update stdlibs separately?

LazyArrays.jl doesn’t use mul! since it’s not flexible for many types of matrices (and triggers a compiler slowness bug when you override with StridedArrays). It gives an alternative construction of the form

y .= Mul(A, x)

which I find is more descriptive. The 5 argument analogue is

y .= a .* Mul(A, x) .+ b .* y

I would argue in favour of deprecating mul! and moving over to LazyArrays.jl approach in LinearAlgebra.jl, but that's going to be a hard case to make.

LowRankApprox.jl does use mul!, but I might change it to use LazyArrays.jl approach and thereby avoid the compiler bug.

OK. I thought there were only two proposals. But apparently there are roughly three proposals?:

  1. three- and five-argument mul!
  2. three- and five-argument muladd!
  3. three-argument mul! and five-argument muladd!

(muladd! may be called addmul!)

I was thinking we are comparing 1 and 3. My understanding now is that @andreasnoack has been comparing 1 and 2.

I'd say 2 is not an option at all as three-argument mul! is a public API and widely used. What I meant by "the API is not completely public" was that five-argument mul! is not documented.

Yes, my plan was to keep mul! (as a 3 arg and possibly 4 arg form). I think this is worthwhile since 3 arg mul! and addmul! would have different behaviour, i.e. given addmul!(C, A, B, α, β), we would have:

mul!(C, A, B) = addmul!(C, A, B, 1, 0)
mul!(C, A, B, α) = addmul!(C, A, B, α, 0)
addmul!(C, A, B) = addmul!(C, A, B, 1, 1)
addmul!(C, A, B, α) = addmul!(C, A, B, α, 1)

However you may not want to actually implement them this way in practice, e.g. it may be simpler to just define the 4-arg mul! and addmul! separately, and define the 5-arg addmul! as:

addmul!(C, A, B, α, β) = addmul!(C .= β .* C, A, B, α)

Bump!

However you may not want to actually implement them this way in practice, e.g. it may be simpler to just define the 4-arg mul! and addmul! separately, and define the 5-arg addmul! as:
addmul!(C, A, B, α, β) = addmul!(C .= β .* C, A, B, α)

Why not do it optimally right away? The point of not doing it like that is that a fused implementation only needs to visit the elements of C once, which is definitely more efficient for large matrices. In addition, I can hardly believe that the code would be longer by defining just the 5-arg addmul! versus both the 4-arg mul! and addmul! separately.

FYI I've modified LinearAlgebra's implementation of _generic_matmatmul! to take 5-arguments in LazyArrays: https://github.com/JuliaArrays/LazyArrays.jl/blob/8a50250fc6cf3f2402758088227769cf2de2e053/src/linalg/blasmul.jl#L70

Here it's called via:

materialize!(MulAdd(α, A, b, β, c)) 

but the actual code (in tiled_blasmul!) would be easy to translate back to LinearAlgebra.

What can be done to try to speed up this process? Things I'm working on would really benefit from a unified matrix multiplication API with in-place mul+add

The latest version of Strided.jl now also supports 5 argument mul!(C,A,B,α,β), dispatching to BLAS when possible, and using its own (multithreaded) implementation otherwise.

@Jutho great package! Is there a roadmap for what's next? Might the plan be to eventually merge with LinearAlgebra?

This was never my intention but I am not opposed to it if that is at some point requested. However, I think my liberal use of @generated functions (although there is just one) in the general mapreduce functionality might not be well suited for Base.

My personal roadmap: this is mostly a low-level package to be used by higher-level packages, i.e. the new version of TensorOperations, and some other package I am working on. However, some more support for basic linear algebra would be good (e.g. applying norm to a StridedView currently falls back to a rather slow implementation of norm in Julia Base). And if I have time and learn to work with GPUs, try to implement an equally general mapreducekernel for GPUArrays.

I think the consensus so far is:

  1. We should keep mul!(C, A, B)
  2. We need _some_ 5-argument function for inplace multiply-add C = αAB + βC

I suggest to first focus on what the name of the 5-argument function should be and discuss additional API later (like 3- and 4-argument addmul!). But this is the "feature" we get from _not_ using mul!, so it's hard not to mix the two discussions.

@andreasnoack Is your concern about deprecation/renaming resolved by @simonbyrne's comment above https://github.com/JuliaLang/julia/issues/23919#issuecomment-431046516 ? I think there is no need for deprecation.

FYI, I just finished the implementation in #29634. I'd appreciate it if someone familiar with LinearAlgebra could review it.

I think it's simpler and better to name everything mul!. It also avoids the deprecation. If we really really want a different name, muladd is better.

Something else to maybe take into account when discussing the mul! API:

When scale! went away and got absorbed in the 0.6 -> 0.7 transition, I was a bit sad because to me, scalar multiplication (a property of vector spaces) was very different from multiplying objects themselves (a property of algebras). Nonetheless I have fully embraced the mul! approach, and very much appreciate the ability to rmul!(vector,scalar) and lmul!(scalar,vector) when scalar multiplication is not commutative. But now I am every day more bothered by the un-Julian name of two other in-place vector space operations: axpy! and its generalization axpby!. Could these also be absorbed into mul!/muladd!/addmul!? Although it is a bit strange, if one of the two factors in A*B is already a scalar, there is no need for an additional scalar factor α.
But maybe then, in analogy to

mul!(C, A, B, α, β)

there could also be a

add!(Y, X, α, β)

to replace axpby!.
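
A generic fallback for that could be as small as the sketch below (neither the name add! nor the argument order is settled; the signature mirrors axpby!'s Y = Xα + Yβ):

function add!(Y, X, α=true, β=true)
    axes(Y) == axes(X) || throw(DimensionMismatch())
    @inbounds for i in eachindex(Y, X)
        Y[i] = X[i] * α + Y[i] * β
    end
    return Y
end

Here add!(Y, X) is axpy! with α=1, and add!(Y, X, α, β) covers axpby!.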

@andreasnoack Is your concern about deprecation/renaming resolved by @simonbyrne's comment above #23919 (comment) ? I think there is no need for deprecation.

See the last paragraph of https://github.com/JuliaLang/julia/issues/23919#issuecomment-430952179. I still think introducing a new function isn't worth it. If we do it anyway, I think we should deprecate the current 5-argument mul!.

@Jutho I think renaming axp(b)y! to add! would be a good idea.

See the last paragraph of #23919 (comment). I still think introducing a new function isn't worth it.

Yes, I've read it and replied that five-argument mul! was not documented and it was not a part of the public API. So, _technically_ there is no need for deprecation. See the last paragraph of https://github.com/JuliaLang/julia/issues/23919#issuecomment-430975159 (Of course, it would be kind to have a deprecation anyway so I already implemented it in #29634.)

Here, I'm assuming that documenting one signature (e.g., mul!(C, A, B)) declares only that signature public, and does not make other signatures (e.g., mul!(C, A, B, α, β)) public as well. If that is _not_ the case, I think Julia and its stdlib are exposing too many internals. For example, here is the documented signature of Pkg.add

https://github.com/JuliaLang/julia/blob/0d713926f85dfa3e4e0962215b909b8e47e94f48/stdlib/Pkg/src/Pkg.jl#L76-L79

whereas the actual definition is

https://github.com/JuliaLang/julia/blob/0d713926f85dfa3e4e0962215b909b8e47e94f48/stdlib/Pkg/src/API.jl#L69-L70

https://github.com/JuliaLang/julia/blob/0d713926f85dfa3e4e0962215b909b8e47e94f48/stdlib/Pkg/src/API.jl#L27-L33

If the existence of the documentation of at least one signature of Pkg.add implies that other signatures are public API, Pkg.jl can't remove behavior due to implementation details without bumping the major version, e.g.: Pkg.add(...; mode = :develop) runs Pkg.develop(...); all the keyword arguments to Context! are supported (which may actually be intended).

But this is just my impression anyway. Do you consider mul!(C, A, B, α, β) has been as public as mul!(C, A, B)?

I think we are talking past each other. What I'm saying is that I (still) don't think introducing another function is worth it. Hence my reference to my previous comment. This is separate from the discussion of the deprecation of five-argument mul!.

However, if we decide to add another function then I think it would be best to deprecate five argument mul! instead of just breaking it. Of course, it's not as commonly used as three-argument mul!, but why not deprecate it instead of just breaking it?

This is separate from the discussion of the deprecation of five-argument mul!.

My interpretation of the last paragraph of your comment https://github.com/JuliaLang/julia/issues/23919#issuecomment-430952179 was that you acknowledged the benefits @simonbyrne listed https://github.com/JuliaLang/julia/issues/23919#issuecomment-430809383 for a new five-argument function but considered they were less valuable compared to keeping the _public API_ (as you mentioned "renaming" and "deprecation"). That's why I thought considering if five-argument mul! has been public or not was important.

But you also mentioned justification of having "an extra function" which, I suppose, is what you are referring now. Are you arguing that the computations _C = AB_ and _C = αAB + βC_ are similar enough so that the same name can describe both? I actually disagree since there can be other ways to generalize three-argument mul!: e.g., why not mul!(y, A₁, A₂, ..., Aₙ, x) for _y = A₁ A₂ ⋯ Aₙ x_ https://github.com/JuliaLang/julia/issues/23919#issuecomment-402953987?

why not deprecate it instead of just breaking it?

As I said in the previous comments, I do agree deprecating five-argument mul! is the right thing to do _if_ we were to introduce another function. This code already exists in my PR #29634.

Are you arguing that the computations C = AB and C = αAB + βC are similar enough so that the same name can describe both?

Yes, since the former is just the latter with β=0. It's fair to argue that muladd!/addmul! is a more precise name for C = αAB + βC but getting there would either require introducing another matrix multiplication function (muladd!/addmul!) or renaming mul! and I don't think it's worth it now. If this had come up in the spring then it would have been easier to consider a change.

I actually disagree since there can be other ways to generalize three-argument mul!:

Julia happened to define the in-place matrix multiply methods without the α and β arguments but the matrix multiply tradition is really based on BLAS-3 and there the general matrix multiplication function is C = αAB + βC.

renaming mul!

Do you mean renaming it in stdlib or in downstream user module/code? If you mean the former, it's already done (for LinearAlgebra and SparseArrays) in #29634 so I don't think you need to worry about it. If you mean the latter, I think it again boils down to public-or-not discussion.

the matrix multiply tradition is really based on BLAS-3

But Julia already departed from the naming convention of BLAS. So wouldn't it be nice to have a more descriptive name?

Do you mean renaming it in stdlib or in downstream user module/code?

#29634 doesn't rename the mul! function. It adds the new function addmul!.

But Julia already departed from the naming convention of BLAS.

I'm not talking about the naming. At least not exactly since Fortran 77 has some limitations that we don't have both in terms of function names and dispatch. I'm talking about what is being computed. The general matrix multiplication function in BLAS-3 computes C = αAB + βC and in Julia, it's been mul! (fka A_mul_B!).

So wouldn't it be nice to have a more descriptive name?

It would, and I have said that several times. The problem is that it's not so much nicer that we should have two matrix multiplication functions that basically do the same thing.

#29634 doesn't rename the mul! function. It adds the new function addmul!.

What I meant to say was that five-argument mul! was renamed to addmul!.

The problem is that it's not so much nicer that we should have two matrix multiplication functions that basically do the same thing.

I feel that whether they are basically the same or not is somewhat subjective. I think both _C = αAB + βC_ and _Y = A₁ A₂ ⋯ Aₙ X_ are mathematically valid generalizations of _C = AB_. Unless _C = αAB + βC_ is the unique generalization, I don't think the argument is strong enough. It also depends on whether you know the BLAS API, and I'm not sure that's basic knowledge for typical Julia users.

Also, _C = AB_ and _C = αAB + βC_ are computationally very different in that the content of C is used or not. It is an output-only parameter for the former and an input-output parameter for the latter. I think this difference deserves a visual cue. If I see mul!(some_func(...), ...) and mul! has the five-argument form, I have to count the number of arguments (which is hard when they are the results of a function call, since you have to match parentheses) to see if some_func does some computation or just an allocation. If we have addmul! then I can immediately expect that some_func in mul!(some_func(...), ...) only does allocation.

I feel that whether they are basically the same or not is somewhat subjective. I think both C = αAB + βC and Y = A₁ A₂ ⋯ Aₙ X are mathematically valid generalizations of C = AB. Unless C = αAB + βC is the unique generalization, I don't think the argument is strong enough.

It might not be the unique generalization, however it is a generalization that can be computed at roughly identical cost, and that forms a useful primitive for building other linear algebra algorithms. On many occasions I have wanted to have a non-zero beta when implementing various linear algebra related algorithms, and always had to fall back to BLAS.gemm!. Other generalizations like the one you mention cannot, in any case, be calculated in one shot without intermediate temporaries, so an in-place version is much less useful. Furthermore, they are not as generally useful as a primitive operation.

It also depends on if you know BLAS API and I'm not sure if that's the basic knowledge for typical Julia users.

As long as the default arguments α=1 and β=0 are still in place, the three-argument mul! will do what any Julia user would reasonably expect without a BLAS background. For the more advanced options, one has to consult the manual, as one would have to do in any language with any function. Furthermore, this single mul! call supersedes not only gemm but also gemv and trmv (which strangely enough does not have α and β parameters in the BLAS API) and probably many others.
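
For illustration, the dispatch might look like this (schematic only; mul5! is a placeholder name, and the real methods in matmul.jl also handle transpose wrappers and promote α and β):

using LinearAlgebra
import LinearAlgebra: BLAS

# One generic entry point; the BLAS kernel is chosen purely by dispatch.
mul5!(y::StridedVector{Float64}, A::StridedMatrix{Float64}, x::StridedVector{Float64},
      α::Real, β::Real) = BLAS.gemv!('N', Float64(α), A, x, Float64(β), y)
mul5!(C::StridedMatrix{Float64}, A::StridedMatrix{Float64}, B::StridedMatrix{Float64},
      α::Real, β::Real) = BLAS.gemm!('N', 'N', Float64(α), A, B, Float64(β), C)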

I do agree BLAS-3 is the right generalization in terms of computation and it composes very well. I'm bringing up another possible generalization just because I think it is not "unique enough" to justify using the same name. See also output-only vs input-output argument in the last paragraph of https://github.com/JuliaLang/julia/issues/23919#issuecomment-441267056. I think different name makes reading/reviewing code easier.

Furthermore, this single mul! call supersedes not only gemm but also gemv and trmv (which strangely enough does not have α and β parameters in the BLAS API) and probably many others.

Yep, already implemented in #29634 and it's ready to go once the name is decided (and get reviewed)!

I'm having a hard time following this conversation (it's a bit long and sprawling... sorry!), is the leading proposal something like mul!(C, A, B; α=true, β=false)?

I don't think keyword argument for α and β is on the table. For example, @andreasnoack dismissed keyword arguments in https://github.com/JuliaLang/julia/issues/23919#issuecomment-365762889. @simonbyrne mentioned keyword arguments in https://github.com/JuliaLang/julia/issues/23919#issuecomment-426881998 but his latest suggestion https://github.com/JuliaLang/julia/issues/23919#issuecomment-431046516 is positional arguments.

We haven't decided the name yet (i.e., mul! vs addmul! vs muladd!) and I think that's the central topic (or at least that's my wish).

How do you resolve this kind of controversy usually? Voting? Triage?

is the leading proposal something like mul!(C, A, B; α=true, β=false)?

I like this, but without the kwargs.

Didn't see the keyword thing. I am also hesitant to add unicode keywords. I think positional arguments with these default values are fine. As for the upcoming vote, mine is on plain mul!. I think this is a generalization of mul! that is sufficiently specific and useful not to need a new name.

Just to collect data (at least at the moment), let's do the voting:

What is your favorite function name for _C = αAB + βC_?

  • :+1: mul!
  • :-1: addmul!
  • :smile: muladd!
  • :tada: something else

To me addmul! seems to describe (A+B)C instead of AB + C.

At first I voted for mul!, then I looked at the operation and thought "it's doing a multiply and then an add, obviously we should call it muladd!". Now I can't think of calling it anything else. The fact that it's in-place is clearly indicated by the ! and the scaling part seems fitting for keyword args.

it's doing a multiply and then an add, obviously we should call it muladd!

Only if you use the default value β=true, but then for any other value it's just something more general again. So what is the point of not calling it mul!, where any other value than the default value β=false also just gives you something more general? And how would you order the arguments, compared to muladd(x,y,z) = x*y + z? Will be somewhat confusing, no?

I think muladd! has the drawback of sounding descriptive when it’s not: a descriptive name would be something like scalemuladd! to mention the scaling part.

So I prefer mul! as it’s nondescript enough to not lead to expectations.

That said, I call the lazy version in LazyArrays.jl MulAdd.

I prefer muladd! to mul! because it is nice to keep distinct a function that never uses the value of C (mul!) from a function that uses it (muladd!).

  • Technically it's a matrix multiplication: [A C] * [Bα; Iβ] or, see comment below, [αA βC] * [B; I]
  • ~We already have a 5-arg mul! for sparse matrices, the same for dense linalg would be consistent~ (not a new argument)

So I'd be in favour of calling it mul!.
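
The block identity is easy to check numerically (a throwaway sketch; commutative eltype, arbitrary sizes):

using LinearAlgebra

A, B, C = rand(3, 4), rand(4, 5), rand(3, 5)
α, β = 2.0, 3.0
[A C] * [B .* α; Matrix(β * I, 5, 5)] ≈ α * A * B + β * C   # true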

  • Technically it's a matrix multiplication: [A C] * [Bα; Iβ]

...if eltype has commutative multiplication

...if eltype is commutative.

IIRC from a discussion with @andreasnoack Julia just defines gemm / gemv as y <- A * x * α + y * β because that makes most sense.

@haampie That's good to know, as I implemented it the other way around in #29634.

This is of limited help, but

       C = α*A*B + β*C

is the best way I can come up with to express the operation and hence maybe a macro @call C = α*A*B + β*C or @call_specialized ... or something along these lines would be a natural interface - also for similar situations. Then the underlying function can be called whatever.

@mschauer LazyArrays.jl by @dlfivefifty has a great syntax for calling 5-argument mul! like your syntax (and more!).

I think we need to establish the function-based API first so that package authors can start overloading it for their specialized matrices. Then the Julia community can start experimenting with sugars for calling it.

Only if you use the default value β=true, but then for any other value it's just something more general again. So what is the point of not calling it mul!, where any other value than the default value β=false also just gives you something more general? And how would you order the arguments, compared to muladd(x,y,z) = x*y + z? Will be somewhat confusing, no?

Sure there's some scaling in there but the "bones" of the operation are clearly multiply and add. I would also be fine with muladd!(A, B, C, α=true, β=false) to match the signature of muladd. Would have to be documented, of course, but that goes without saying. It does kind of make me wish that muladd took the additive part first, but the ship has sailed on that one.

And how would you order the arguments, compared to muladd(x,y,z) = x*y + z? Will be somewhat confusing, no?

This is the reason why I prefer addmul! over muladd!. We can make sure the order of arguments has nothing to do with the scalar muladd. (Though I prefer muladd! over mul!)

FWIW here is a summary of the arguments so far. (I tried to be neutral but I'm pro-muladd!/addmul! so keep that in mind...)

The main disagreement is that whether or not _C = AB_ and _C = αAB + βC_ are different enough to give a new name for the latter.

They are similar enough because...

  1. It's BLAS-3 and nicely composable. So, _C = αAB + βC_ is an obvious generalization of _C = AB_ (https://github.com/JuliaLang/julia/issues/23919#issuecomment-441246606, https://github.com/JuliaLang/julia/issues/23919#issuecomment-441312375, etc.)

  2. _"muladd! has the drawback of sounding descriptive when it’s not: a descriptive named would be something like scalemuladd! to mention the scaling part."_ --- https://github.com/JuliaLang/julia/issues/23919#issuecomment-441819470

  3. _"Technically it's a matrix multiplication: [A C] * [Bα; Iβ]"_ --- https://github.com/JuliaLang/julia/issues/23919#issuecomment-441825009

They are different enough because...

  1. _C = αAB + βC_ is more than multiply (https://github.com/JuliaLang/julia/issues/23919#issuecomment-430809383, https://github.com/JuliaLang/julia/issues/23919#issuecomment-427075792, https://github.com/JuliaLang/julia/issues/23919#issuecomment-441813176, etc.).

  2. There could be other generalizations of mul! like Y = A₁ A₂ ⋯ Aₙ X (https://github.com/JuliaLang/julia/issues/23919#issuecomment-402953987 etc.)

  3. Input-only vs Input-output parameter: It's confusing to have a function that uses the data in C based on the number of arguments (https://github.com/JuliaLang/julia/issues/23919#issuecomment-441267056, https://github.com/JuliaLang/julia/issues/23919#issuecomment-441824982)

Another reason why mul! is better:

  1. Sparse matrix already has it. So it's good for backward compatibility. Counter argument: five-argument mul! is not documented so we don't need to consider it as a public API.

and why muladd!/addmul! is better:

  1. We can have different three- or four-argument "handy functions" for mul! and muladd!/addmul! separately (https://github.com/JuliaLang/julia/issues/23919#issuecomment-402953987, https://github.com/JuliaLang/julia/issues/23919#issuecomment-431046516, etc.). Counter argument: writing mul!(y, A, x, 1, 1) is not much more verbose than mul!(y, A, x) (https://github.com/JuliaLang/julia/issues/23919#issuecomment-430674934, etc.)

Thanks for the objective summary @tkf

I would also be fine with muladd!(A, B, C, α=true, β=false) to match the signature of muladd.

I hope that for a function called muladd! the default would be β=true. Still, I think this order of arguments, which is dictated by muladd, will be very confusing in relation to mul!(C,A,B)

Maybe I am mistaken, but I would think that most people/applications/high-level code (that are not already satisfied with just the multiplication operator *) need mul!. The ability to also mix in βC, with β=1 (true) or otherwise, will be used in lower-level code, by people who know that the BLAS API for matrix multiplication enables this. I would guess that these people would look for this functionality under mul!, which is the established Julia interface to gemm, gemv, ... Adding a new name (with confusing opposite argument order) does not seem worth it; I fail to see the gain?

I would guess that these people would look for this functionality under mul!, which is the established Julia interface to gemm, gemv, ... Adding a new name (with confusing opposite argument order) does not seem worth it; I fail to see the gain?

I think discoverability is not a big problem since we can simply mention muladd! in mul! docstring. Those who are skilled enough to know BLAS would know where to look for an API, right?

Regarding positional vs keyword arguments: It's not discussed here yet but I think C = αAB + βC with α being a diagonal matrix can be implemented as efficiently and easily as scalar α. Such extension requires that we can dispatch on the type of α which is impossible with the keyword argument.

Also, for non-commutative eltype, you may want to efficiently compute C = ABα + Cβ by calling muladd!(α', B', A', β', C') (hypothetical argument order). It may require that you are able to dispatch on lazy wrappers Adjoint(α) and Adjoint(β). (I don't use non-commutative numbers in Julia personally so this probably is very hypothetical.)

I agree with @Jutho's point that this multiply-add function is a low-level API for skilled programmers like library implementers. I think extensibility has a high priority for this API and positional arguments are the way to go.
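
As an illustration of the kind of method this would open up (entirely hypothetical; neither the name muladd! nor this signature is settled), a Diagonal α can be folded into the accumulation loop at essentially the cost of a scalar α:

using LinearAlgebra

function muladd!(C, A, B, α::Diagonal, β::Number)
    size(A, 2) == size(B, 1) || throw(DimensionMismatch())
    @inbounds for j in axes(C, 2), i in axes(C, 1)
        acc = zero(eltype(C))
        for k in axes(A, 2)
            acc += A[i, k] * B[k, j]
        end
        C[i, j] = α.diag[i] * acc + C[i, j] * β   # row i scaled by α[i,i]
    end
    return C
end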

Another argument for avoiding keyword argument is what @andreasnoack said before https://github.com/JuliaLang/julia/issues/23919#issuecomment-365762889:

The names α and β are also not super intuitive unless you know BLAS

@tkf, sure, my argument is rather: the number of actual uses of β != 0 will be smaller than those of β == 0, and those who need it will not be surprised to find this slightly more general behaviour under mul!. Therefore, I don't see the gain of separating out this under a new name, especially since the argument order is messed up (with muladd! at least). If it has to be a new method, I also sympathize with your argument for addmul!.

those who need it will not be surprised to find this slightly more general behaviour under mul!.

I agree with this point.

Therefore, I don't see the gain of separating out this under a new name,

Unless you see some harm, I think it's a net gain since there are other people seeing benefits.

especially since the argument order is messed up (with muladd! at least)

I guess you'd count this as a harm and I get the point. It's just that I think the other benefits of muladd!/addmul! are more important.

I guess you'd count this as a harm and I get the point. It's just that I think the other benefits of muladd!/addmul! are more important.

That's indeed the harm, together with the fact that mul! has always been the single entry point into several BLAS operations related to multiplication, albeit restricted by not giving full access to α and β. And now with muladd!, there would be two different entry points, depending on just a slight difference in the requested operation which can easily be captured by an argument (and indeed, which is captured by an argument in the BLAS API). I think it was a mistake in Julia in the first place not to offer full access to the BLAS API (so thanks for fixing that @tkf). Despite its old horrible Fortran naming convention, I guess these guys knew why they were doing things this way. But likewise, I think this family of operations (i.e. the 2-parameter family of operations parameterized by α and β) belongs together under a single entry point, as they are in BLAS.

The most valid counter argument in my view is the difference between whether or not the original data in C will be accessed. But given the fact that Julia has embraced multiplying with false as a way to guarantee a zero result, even when the other factor is NaN, I think this is also taken care of. But maybe this fact needs to be communicated/documented better (it's been a while since I've read the documentation), and I also only recently learned about it. (That's why in KrylovKit.jl, I require the existence of a fill! method, to initialize an arbitrary vector-like user type with zeros. But now I know I can just do rmul!(x,false) instead, so I don't need to impose that fill! is implemented.)
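
For example (REPL sketch):

julia> using LinearAlgebra

julia> x = [NaN, 1.0];

julia> rmul!(x, false)   # false is a "strong zero": even NaN becomes 0.0
2-element Vector{Float64}:
 0.0
 0.0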

It's just that I think the other benefits of muladd!/addmul! are more important.

So let me reverse the question: what are these other benefits of having a new method? I've read your summary again, but I only see the point of accessing C, which I've just commented on.

I mentioned to my wife this morning that there has been a two-month long conversation in the Julia community regarding naming an operation. She suggested that it be called "fred!" - no acronym, no deep meaning, just a good name. Just putting this out there on her behalf.

Good that she included an exclamation mark! 😄

First of all, just in case, let me clarify that my concern is almost entirely about readability of the code, not writability or discoverability. You write code once but read it many times.

what are these other benefits of having a new method?

As you commented, I think the output vs input-output parameter argument is the most important one. But this is just one reason why I think _C = αAB + βC_ is different from _C = AB_. I also think the simple fact that they are different, in the sense that the former expression is a strict "superset" of the latter, needs a clear visual indication in the code. A different name helps an intermediate programmer (or a little unfocused advanced programmer) skimming the code notice that it is using something weirder than mul!.

I just checked the poll (you need to click "Load more" above) again and it looks like some votes moved from mul! to muladd!? Last time I saw it, mul! was winning. Let's record it here before they moved :laughing:

  • mul!: 6
  • addmul!: 2
  • muladd!: 8
  • something else: 1

Slightly more seriously, I still think this data does not show mul! or muladd! is clearer than the other. (Although it shows that addmul! is a minority :sob:)

It feels like we are stuck. How do we move on?

Just call it gemm! instead?

Just call it gemm! instead?

I hope this is a joke... unless you are proposing gemm!(α, A::Matrix, x::Vector, β, y::Vector) = gemv!(α, A, x, β, y) for matrix * vector.

Could we leave the mul! interface that already exists (with sparse matrices) for now so we can merge the PR and have the improvements, and worry about whether we want to add muladd! in another PR?

Maybe it's already clear to everybody here but I just wanted to emphasize that the vote is not

  • mul! vs muladd!

but

  • mul! vs (mul! and muladd!)

i.e. having two mutating multiplication functions instead of a single one.

I decided not to post anymore since any time I posted in favor of mul!, votes seemed to move from mul! to (mul! and muladd!).

However, I do have a question. If we go with the current majority vote, and we simultaneously have mul!(C,A,B) and muladd!(A,B,C,α=true,β=true), and I want to prepare a PR that replaces axpy! and axpby! with a more Julian name add!, should that be add!(y, x, α=true, β=true) or add!(x, y, α=true, β=true) (where, for clarity, y is mutated)? Or something else?

In case it isn't obvious, muladd!(A,B,C) would violate the convention that mutated arguments go first.

Could we leave the mul! interface that already exists

@jebej I think this "backward compatibility" argument is discussed extensively. Yet, it doesn't convince anyone (by looking at the poll, it's not just me).

worry about whether we want to add muladd! in another PR?

It's bad to break public API. So if we say mul! then it's mul! forever (though in theory LinearAlgebra can bump its major version to break the API).

I want to prepare a PR that replaces axpy! and axpby! with a more julian name add!, should that be add!(y, x, α=true, β=true) or add!(x, y, α=true, β=true)

@Jutho Thanks, that would be great! I think choosing the order of argument would be straightforward once we decided the call signature of multiply-add API.

muladd!(A,B,C) would violate the convention that mutated arguments go first.

@simonbyrne But (as you already mentioned in https://github.com/JuliaLang/julia/issues/23919#issuecomment-426881998), lmul! and ldiv! mutate non-first argument. So I think we don't need to exclude muladd!(A,B,C,α,β) from the choice but rather count it as a negative point for this signature.

(But I'd say to go with muladd!(α, A, B, β, C) if we are going to have "textual order" API.)

By the way, one thing I don't understand by the result of the vote is the asymmetry of muladd! and addmul!. If you write C = βC + αAB, I think it looks like addmul! is more natural.

@tkf It's about what operation you do first. To me addmul! suggests that you do an addition first, and then multiply, as in (A+B)C. Of course it's subjective. But good names should appeal to intuition.

Ah, I get that point.

As this is still stuck, my proposal would have the usage pattern consisting of function definitions with (going with @callexpr for the second)

@callexpr(C .= β*C + α*A*B) = implementation(C, β, α, A, B)
@callexpr(C .= β*C + A*B) = implementation(C, β, true, A, B)

and maybe a more dispatch friendly form (going with @callname for the second)

function @callname(β*C + A*B)(C::Number, β::Number, A::Number, B::Number)
     β*C + A*B
end

and calls

@callexpr(A .= 2*C + A*B)
@callexpr(2*3 + 3*2)

and nobody has to worry (or know) how callexpr mangles algebraic operations into a unique function name (which does not depend on the argument symbols, only on the operations and their order).
I thought a bit about the implementation and it should be quite doable.

@mschauer I think that's an interesting direction. Can you open a new issue? The API you are proposing can solve many other issues. I think it needs to go through a careful design process than solving a single instance of the issue it can solve.

So I've heard the rumor that the feature freeze of 1.1 is next week. Although the next minor release is "only" four months away, it would be really nice if we can have it in 1.1...

Anyway, we also need to decide the call signature (order of arguments and keyword or not) before merging the PR.

So let's do the vote again (as I've found that it's a nice stimulant).

_If_ we use muladd! for _C = ABα + Cβ_, what is your favorite call signature?

  • :+1: muladd!(C, A, B, α, β)
  • :-1: muladd!(A, B, C, α, β)
  • :smile: muladd!(C, A, B; α, β) (like :+1:, but with keyword arguments)
  • :tada: muladd!(A, B, C; α, β) (like :-1:, but with keyword arguments)
  • :confused: muladd!(A, B, α, C, β)
  • :heart: something else

If you have other keyword argument names in mind, vote for the ones using α and β and then comment on what names are better.

As we haven't decided what the name should be, we need to do it for mul! as well:

_If_ we use mul! for _C = ABα + Cβ_, what is your favorite call signature?

  • :+1: mul!(C, A, B, α, β)
  • :-1: mul!(A, B, C, α, β)
  • :smile: mul!(C, A, B; α, β) (like :+1:, but with keyword arguments)
  • :tada: mul!(A, B, C; α, β) (this is impossible)
  • :confused: mul!(A, B, α, C, β)
  • :heart: something else

NOTE: We are not changing the existing API mul!(C, A, B)

NOTE: We are not changing the existing API mul!(C, A, B)

I had not paid enough attention to this fact—we already have mul! and this is what it means:

mul!(Y, A, B) -> Y

Calculates the matrix-matrix or matrix-vector product A*B and stores the result in Y, overwriting the existing value of Y. Note that Y must not be aliased with either A or B.

Given that, it seems very natural to just expand that like this:

mul!(Y, A, B) -> Y
mul!(Y, A, B, α) -> Y
mul!(Y, A, B, α, β) -> Y

Calculates the matrix-matrix or matrix-vector product A*B and stores the result in Y, overwriting the existing value of Y. Note that Y must not be aliased with either A or B. If a scalar value α is supplied, then α*A*B is computed instead of A*B. If a scalar value β is supplied, then α*A*B + β*Y is computed instead. The same aliasing restriction applies to these variants.

However, I think there's a major concern with this: it seems at least as natural for mul!(Y, A, B, C, D) to compute A*B*C*D in place into Y—and that generic notion clashes very badly with mul!(Y, A, B, α, β) computing α*A*B + β*Y. Moreover, it seems to me that computing A*B*C*D into Y is a thing that it would be useful and possible to do efficiently, avoiding intermediate allocations, so I really wouldn't want to block that meaning.

With that other natural generalization of mul! in mind, here's another thought:

mul!(Y, α, A, B) # Y .= α*A*B

This fits into the general model of mul!(out, args...) where you compute and write into out by multiplying args together. It relies on dispatch to deal with α being scalar instead of making it a special case—it's just another thing you're multiplying. When α is a scalar and A, B and Y are matrices, we can dispatch to BLAS to do this super efficiently. Otherwise, we can have a generic implementation.

Moreover, if you're in a non-commutative field (e.g. quaternions), then you can control which side the scaling by α happens on: mul!(Y, A, B, α) scales by α on the right instead of the left:

mul!(Y, A, B, α) # Y .= A*B*α

Yes, we can't call BLAS for quaternions, but it's generic and we can probably still do it reasonably efficiently (maybe even transforming it into some BLAS calls somehow).

Assuming that approach for Y .= α*A*B, the next question becomes: what about scaling and incrementing Y? I started thinking about keywords for that, but then the non-commutative case came to mind and that felt too awkward and limited. So, I started thinking about this API instead—which seems a little strange at first, but bear with me:

mul!((β, Y), α, A, B) # Y .= β*Y .+ α*A*B

A little odd, but it works. And in a non-commutative field, you can ask to multiply Y by β on the right like this:

mul!((Y, β), α, A, B) # Y .= Y*β .+ α*A*B

In full generality in a non-commutative field, you could scale on both left and right like this:

mul!((β₁, Y, β₂), α₁, A, B, α₂) # Y .= β₁*Y*β₂ + α₁*A*B*α₂

Now, of course, this is a little bit weird and there's no BLAS operation for this, but it's a generalization of GEMM that lets us express a _lot_ of things and which we can dispatch to BLAS operations trivially without even doing any nasty if/else branches.

I really like @StefanKarpinski 's suggestion as an underlying API call, but I am also thinking about whether this is how we actually want to expose things to users. IMO, at the end it should look simple, like an associated macro:

@affine! Y = β₁*Y*β₂ + α₁*A*B*α₂

The underlying function would then be something like @StefanKarpinski proposes.

But we should go further here. I really think that, if you make an API for it and a generic function, someone will make a Julia library that does it efficiently, so I agree that we shouldn't just stick to BLAS here. Things like MatrixChainMultiply.jl are already building DSLs for multiple matrix computations and DiffEq is doing its own thing with affine operator expressions. If we just have one representation for an affine expression in Base, we can define all of our work to be on the same thing.

@dlfivefifty looked into lazy linear algebra before, I think that should really be revived here. Building lazy representations of broadcast was crucial to get element-wise operations working on abstract arrays and on alternative computational hardware. We need the same for linear algebra. A representation of linear algebraic expressions would allow us to define new BLAS kernels on the fly from a Julia BLAS or transfer the equations over to a GPU/TPU.

Essentially all of the computations in scientific computing boil down to element-wise and linear algebraic operations, so having a high level description of both seems instrumental for building tooling to metaprogram on and explore new designs.

I'd have to think more about this proposal but, for now, I'll just comment that I don't think you'd like to compute A*B*C without a temporary. It seems to me that you'd have to pay with a lot of arithmetic operations to avoid the temporary.

I don't think you'd like to compute A*B*C without a temporary.

For mul! you already have an output array, however. I'm not sure if that helps or not. In any case, it seems like an implementation detail. The API mul!(Y, A, B, C...) expresses what you want to compute and lets the implementation pick the best way to do it, which was the general goal here.

I really like @StefanKarpinski 's suggestion as an underlying API call, but I am also thinking about whether this is how we actually want to expose things to users.

@ChrisRackauckas: I think the things you're getting into can and should be explored in external packages—laziness, writing the computation you want and letting some kind of optimization pass pick out pieces that match certain algebraic patterns it knows how to optimize, etc. Using mul! like this seems like just the kind of generic but easy-to-understand operation that we want at this level.

Note that there's no real debate to be had about mul!(Y, α, A, B)—it pretty much has to mean Y .= α*A*B since what else would it mean? So to me the only open question here is if using a tuple with a matrix and left and/or right scalars is a reasonable way to express that we want to increment and scale the output array. The general cases would be:

  1. mul!(Y::Matrix, args...): Y .= *(args...)
  2. mul!((β, Y)::Tuple{Number, Matrix}, args...): Y .= β*Y + *(args...)
  3. mul!((Y, β)::Tuple{Matrix, Number}, args...): Y .= Y*β + *(args...)
  4. mul!((β₁, Y, β₂)::Tuple{Number, Matrix, Number}, args...): Y .= β₁*Y*β₂ + *(args...)

Nothing else would be allowed for the first argument. This could be adopted as a more general convention for other operations where it makes sense to either overwrite or accumulate into the output array, optionally combined with scaling.

It didn't occur to me to "merge" mul!(out, args...) and the GEMM-like interface! I like the extensibility of it (but then I started writing the reply below and now I'm not sure...)

But my worry is whether it is easy to use as an overloading interface. We need to rely on the type system to work well for nested tuples. Do nested tuples work as well as flat tuples in Julia's type system? I'm wondering if something like "Tuple{Tuple{A1,B1},C1,D1} is more specific than Tuple{Tuple{A2,B2},C2,D2} iff Tuple{A1,B1,C1,D1} is more specific than Tuple{A2,B2,C2,D2}" holds. Otherwise it would be tricky to use as the overloading API.

Note that we do need to dispatch on scalar types to use reinterpret hack for the complex matrices (this is from PR #29634 so don't pay attention to the function name):

https://github.com/JuliaLang/julia/blob/fae1a7a3ae646c7ea1c08982976b57096fb0ae8d/stdlib/LinearAlgebra/src/matmul.jl#L157-L169

Another worry is that this is a somewhat limited interface for a computation graph executor. I think the main purpose of the multiply-add interface is to provide an overloading API to let library implementers define a small reusable computation kernel that can be implemented efficiently. It means that we can only implement _C = ABα_ and not, e.g., _αAB_ (see https://github.com/JuliaLang/julia/pull/29634#issuecomment-443103667). Supporting _α₁ABα₂_ for non-commutative eltype requires either a temporary array or increasing the number of arithmetic operations. It is not clear which one the user wants, and ideally this should be configurable. At this point we need a computation graph representation separated from the execution mechanism. I think this is better explored in external packages (e.g., LazyArrays.jl, MappedArrays.jl). However, if we can find an implementation strategy that covers most of the use cases at some point, using mul! as the main entry point would make sense. I think this is actually yet another reason to favor muladd!: it leaves the mul! name free for the future calling API.

I'd have to think more about this proposal but, for now, I'll just comment that I don't think you'd like to compute ABC without a temporary. It seems to me that you'd have to pay with a lot of arithmetic operations to avoid the temporary.

You can indeed prove that, for a contraction of an arbitrary number of tensors, the most efficient way to evaluate the whole thing is always via pairwise contractions (e.g., for n×n matrices, evaluating A*B*C as two pairwise products costs O(n³), whereas a single fused triple-sum loop costs O(n⁴)). So multiplying several matrices is just a special case of that: you should multiply them pairwise (the best order is of course a non-trivial problem). That's why I think mul!(Y,X1,X2,X3...) is not such a useful primitive. And in the end, that's what I think mul! is: a primitive operation that developers can overload for their specific types. Any more complicated operation can then be written using a higher-level construction, e.g. using macros, which would for example build a computation graph that is in the end evaluated by calling primitive operations like mul!. Of course, that primitive could be sufficiently general to include cases like the non-commutative one that @StefanKarpinski mentions.

As long as no matrix multiplication/tensor contraction is involved, it is true that thinking in terms of primitive operations is not so useful and it can be beneficial to fuse everything together like broadcasting does.

In general, I agree that it would be good to have a default lazy representation / computation graph type in Base, but I don't think mul! is the way to construct it.

@tkf:

But my worry is whether it is easy to use as an overloading interface. We need to rely on the type system to work well for nested tuples. Do nested tuples work as well as flat tuples in Julia's type system?

Yes, we're all good on that front. I'm not sure where the nesting comes in but passing some things in a tuple is just as efficient as passing them all as immediate arguments—it's implemented exactly the same way.

It means that we can only implement _C = ABα_ and not, e.g., _αAB_

I'm confused... you can write mul!(C, A, B, α) and mul!(C, α, A, B). You could even write mul!(C, α₁, A, α₂, B, α₃). This seems like by far the most flexible generic matrix multiplication API that's been proposed so far.

Another worry is that this is a somewhat limited interface for a computation graph executor. I think the main purpose of the multiply-add interface is to provide an overloading API to let library implementers define a small reusable computation kernel that can be implemented efficiently.

At this point we need a computation graph representation separated from the execution mechanism.

That may be the case, but this is not the place for it—that can and should be developed in external packages. All that we need to solve this specific issue is a matrix multiplication API that generalizes what can be dispatched to BLAS operations—which is pretty much exactly what this does.

@Jutho

So multiplying several matrices is just a special case of that, you should multiply them pairwise (the best order is of course a non-trivial problem). That's why I think mul!(Y,X1,X2,X3...) is not such a useful primitive.

The mul! operation would allow the implementation to choose the order of multiplication, which is a useful property. Indeed, the ability to potentially do that was why we made the * operation parse as n-ary in the first place and the same reasoning applies even more to mul! since if you're using it, you presumably care enough about performance.

In general, I can't tell if you're arguing for or against my proposal for mul!.

I'm not sure where the nesting comes in but passing some things in a tuple is just as efficient as passing them all as immediate arguments

I wasn't worried about the efficiency but rather about dispatching and method ambiguities, since even the current LinearAlgebra is somewhat fragile (which may be due to my lack of understanding of the type system; sometimes it surprises me). I was mentioning nested tuples since I thought method resolution is done by intersecting the tuple type of all positional arguments. That gives you a flat tuple. If you use a tuple in the first argument, you have a nested tuple.

It means that we can only implement _C = ABα_ and not, e.g., _αAB_

I'm confused... you can write mul!(C, A, B, α) and mul!(C, α, A, B).

I meant to say "we can only implement _C = ABα_ as efficiently as _C = AB_, and other candidates like _αAB_ cannot be implemented efficiently in all the combinations of matrix types." (By efficiency I mean the big-O time complexity.) I'm not super certain that this is really the case, but at least for sparse matrices the other two options are out.

At this point we need a computation graph representation separated from the execution mechanism.

That may be the case, but this is not the place for it—that can and should be developed in external packages.

That exactly is my point. I suggest to see this API as a minimal building block for such usage (of course, that's not the whole purpose). Implementation and design of vararg mul! can be done after people explored the design space in external packages.

Dispatching on types for even the current mul! is already “broken”: there is a combinatorial growth in ambiguity overrides needed to work with composable array types like SubArray and Adjoint.

The solution is to use traits, and LazyArrays.jl has a proof of concept version of mul! with traits.

But this is more a discussion of implementation than of API. Using tuples to group terms feels wrong, though: isn't this what the type system is there for? In which case you arrive at the LazyArrays.jl solution.

The mul! operation would allow the implementation to choose the order of multiplication, which is a useful property. Indeed, the ability to potentially do that was why we made the * operation parse as n-ary in the first place and the same reasoning applies even more to mul! since if you're using it, you presumably care enough about performance.

That * parses as n-ary is extremely useful. I use it in TensorOperations.jl to implement the @tensoropt macro, which indeed optimizes contraction order. That I find an n-ary version of mul! less useful is because there is little point, from an efficiency perspective, in providing a pre-allocated place to put the result if all the intermediate arrays still have to be allocated inside the function and then gc'ed. In fact, in TensorOperations.jl, several people noticed that the allocation of large temporaries is one of the places where Julia's gc performs really badly (quite often leading to gc times of 50%).

Therefore, I would restrict mul! to what is truly a primitive operation, as also advocated by @tkf if I understand correctly: multiplying two matrices into a third one, with possibly scalar coefficients. Yes, we can think of the most general way to do this for non-commutative algebras, but for now I think the immediate need is convenient access to the functionality provided by BLAS (gemm, gemv, ...) that Julia's mul! wrapper lacks. I don't mind contemplating about the best API to make sure it is future proof to support indeed this primitive operation for more general number types.

I don't dislike your proposal with tuples, but I could foresee potential confusion: restricting from case 4 to case 2 or 3 seems to imply default values β₁ = 1 and β₂ = 1 (or actually true). But then, if neither is specified, it suddenly means β₁ = β₂ = 0 (false). Sure, the syntax is slightly different, since you write mul!(Y, args...), not mul!((Y,), args...). In the end, it's a matter of documentation, so I just wanted to point this out.

So in summary, no I am not really opposed to this syntax, though it is a new type of paradigm that is being introduced, and should then probably also be followed in other places. What I am opposed to is immediately wanting to generalize this to the multiplication of an arbitrary number of matrices, which, as argued above, I don't see the benefit of.

@dlfivefifty: But this is more a discussion on implementation than API. But using tuples to group terms feels wrong: isn’t this what the type system is there for? In which case you arrive at LazyArrays.jl solution.

But we're not going to go full lazy arrays here—there's already LazyArrays for that. Meanwhile, we need some way to express scaling Y. Using tuples seems like a simple, lightweight approach to express that small amount of structure. Does anyone have any other suggestions? We could have lscale and/or rscale keywords for β₁ and β₂, but that doesn't feel any more elegant and we would lose the ability to dispatch on that, which isn't crucial but is nice to have.

@Jutho: Therefore, I would restrict mul! to what is truly a primitive operation, as also advocated by @tkf if I understand correctly: multiplying two matrices into a third one, with possibly scalar coefficients. Yes, we can think of the most general way to do this for non-commutative algebras, but for now I think the immediate need is convenient access to the functionality provided by BLAS (gemm, gemv, ...) that Julia's mul! wrapper lacks.

I'm fine with only defining a small subset of operations for mul!, perhaps even just the ones that correspond structurally to valid BLAS calls. That would be:

# gemm: alpha = 1.0, beta = 0.0
mul!(Y::Matrix, A::Matrix, B::Matrix) # Y .= A*B

# gemm: alpha = α, beta = 0.0 (these all do the same thing for BLAS types)
mul!(Y::Matrix, α::Number, A::Matrix, B::Matrix)
mul!(Y::Matrix, A::Matrix, α::Number, B::Matrix)
mul!(Y::Matrix, A::Matrix, B::Matrix, α::Number)

# gemm: alpha = α, beta = β (these all do the same thing for BLAS types)
mul!((β::Number, Y::Matrix), α::Number, A::Matrix, B::Matrix)
mul!((β::Number, Y::Matrix), A::Matrix, α::Number, B::Matrix)
mul!((β::Number, Y::Matrix), A::Matrix, B::Matrix, α::Number)
mul!((Y::Matrix, β::Number), α::Number, A::Matrix, B::Matrix)
mul!((Y::Matrix, β::Number), A::Matrix, α::Number, B::Matrix)
mul!((Y::Matrix, β::Number), A::Matrix, B::Matrix, α::Number)

# gemm: alpha = α, beta = β₁*β₂ (these all do the same thing for BLAS types)
mul!((β₁::Number, Y::Matrix, β₂::Number), α::Number, A::Matrix, B::Matrix)
mul!((β₁::Number, Y::Matrix, β₂::Number), A::Matrix, α::Number, B::Matrix)
mul!((β₁::Number, Y::Matrix, β₂::Number), A::Matrix, B::Matrix, α::Number)

To what end? Why allow so much variation in how to express BLAS operations?

  1. Because it allows people to express their intent—if the intent is to multiply on the left or right or both, why not allow people to express that and choose the right implementation?

  2. We can have generic fallbacks that do the right thing even for non-commutative element types.

The whole point of this issue is to have a generalization of in-place matrix multiplication that subsumes gemm! while being more generic than gemm!. Otherwise, why not just continue to write gemm!?

But we're not going to go full lazy arrays here

I'm not saying "full lazy arrays", I'm suggesting lazy in the same sense as Broadcasted, which is ultimately removed at compile time. Essentially I would add an Applied to represent lazy application of a function and instead of using tuples (which contain no context) you would have something like

materialize!(applied(+, applied(*, α, A, B), applied(*, β, C)))

This could be sugar coated like broadcast's .* notation to make it more readable, but I think this is immediately clear what is wanted, unlike the tuple-based proposal.

@StefanKarpinski , as said before, I certainly agree that we should contemplate about an interface that is future proof and generalizes correctly to other number types. The only problem, I think, is that your list is not complete. In principle you could have:

mul!((β₁::Number, Y::Matrix, β₂::Number), α₁::Number, A::Matrix, α₂::Number, B::Matrix, α₃::Number)

and all reduced versions thereof, i.e. if all of the 5 scalar arguments can be absent, that's 2^5 = 32 different possibilities. And this combined with all possibilities of different matrices or vectors.

I agree with @dlfivefifty that a broadcast-like approach is more feasible.

Yes, I realized I left out some of the options, but 32 methods doesn't seem all that crazy to me, after all we don't have to write them by hand. Adding a "broadcast-like system" or a lazy evaluation system that lets us write materialize!(applied(+, applied(*, α, A, B), applied(*, β, C))) seems like a far bigger addition and way out of scope for this issue. All we want is some way of spelling general matrix multiplication that's both generic and lets us dispatch to BLAS. If we can't all agree on that then I'm inclined to let people continue calling gemm! directly.

Yes, that's probably true; I assumed that it would be easier with the scalar arguments going at the back, to easily provide defaults. But if with some @eval metaprogramming we can easily generate all 32 definitions, that's equally good. (Note, as you surely know, mul! is not only gemm! but also gemv! and trmm! and ...).
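For instance, a few of the reduced forms could be generated along these lines (a purely hypothetical sketch—signatures and defaults included—with the remaining combinations produced the same way):

for (short, full) in (
        (:(mul!(Y::AbstractMatrix, A, B)),
         :(mul!((false, Y, false), true, A, true, B, true))),
        (:(mul!(Y::AbstractMatrix, α::Number, A, B)),
         :(mul!((false, Y, false), α, A, true, B, true))),
        (:(mul!((β, Y)::Tuple{Number,AbstractMatrix}, α::Number, A, B)),
         :(mul!((β, Y, true), α, A, true, B, true))),
    )
    @eval $short = $full  # each pair defines one forwarding method
end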

Let me add that it's not just a BLAS wrapper. There are other pure-Julia specialized methods in stdlib. Also, it's important to have this as an overloading API: package authors can define mul! for their special matrix types.

I guess this is my stance:

  1. We might as well support mul!(C, A, B, a, b) now since it already exists in SparseArrays.jl
  2. We shouldn’t do anything else because dispatching on matrix types does not scale well. (As maintainer of BandedMatrices.jl, BlockArrays.jl, LowRankApprox.jl, etc. I can state that from experience.)
  3. Trait based designed does scale well, but it would be best to go all in and do a broadcast like Applied, since the design pattern is already established. This will have to wait til Julia 2.0, with a prototype continuing to be developed in LazyArrays.jl that suits my needs.

@dlfivefifty Do you think the difficulty of disambiguating the mul!((Y, β), α, A, B) API is comparable to that of mul!(Y, A, B, α, β)? Considering that matrix wrappers such as Transpose introduce the difficulty, including 2- and 3-tuples sounds like increasing the difficulty even more (although I know tuples are a special case in Julia's type system).

  1. We might as well support mul!(C, A, B, a, b) now since it already exists in SparseArrays.jl

The fact that someone decided that mul!(C, A, B, a, b) should mean C .= b*C + a*A*B without thinking it through all the way is emphatically not a good reason to double down on this. If mul! is the in-place version of * then I don't see how mul!(out, args...) can mean anything besides out .= *(args...). Frankly, this is how you end up with a system that's a mess of poorly thought out, inconsistent APIs that only exist by historical accident. The mul! function is not exported from SparseArrays _and_ that particular method is not documented, so this is really the flimsiest possible reason to entrench an ill-conceived method that was probably only added because the function wasn't public! I propose that we undo that mistake and delete/rename that method of mul! instead.

From the rest of this discussion it sounds like we shouldn't do anything else since all the stakeholders want to do something fancier with traits and/or laziness outside of the standard library. I'm fine with that since deleting things is always nice.

From the rest of this discussion it sounds like we shouldn't do anything else since all the stakeholders want to do something fancier with traits and/or laziness outside of the standard library. I'm fine with that since deleting things is always nice.

It seems you are getting a bit fed up, which is understandable. However, I don't think that conclusion is true. If you are confident that the current suggestion can be implemented in a way that is scalable and still makes it convenient for package developers to overload that definition for their own matrix and vector types (as mentioned by @tkf), that would be a great way forward.

In particular, I would think that package developers only need to implement:

mul!((β₁, Y::MyVecOrMat, β₂), α₁, A::MyMat, α₂, B::MyVecOrMat, α₃)

and maybe, e.g.,

mul!((β₁, Y::MyVecOrMat, β₂), α₁, A::Adjoint{<:MyMat}, α₂, B::MyVecOrMat, α₃)
...

while Julia Base (or rather, LinearAlgebra standard library) takes care of handling all the defaults values etc.

I would think that package developers only need to implement:

mul!((β₁, Y::MyVecOrMat, β₂), α₁, A::MyMat, α₂, B::MyVecOrMat, α₃)

I'd suggest to document

mul!((Y, β), A, B, α)

as the signature to be overloaded. This is because other locations for α change the big-O time complexity. See: https://github.com/JuliaLang/julia/pull/29634#issuecomment-443103667. This gives non-commutative numbers a non-first-class treatment. But AFAICT nobody here is actually using non-commutative numbers and I think we should wait until there is an actual need.
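Under that convention, LinearAlgebra could provide the reduced forms once, so that package authors overload only the documented signature (a hedged sketch; these forwarding methods are hypothetical):

mul!(Y::AbstractVecOrMat, A, B)            = mul!((Y, false), A, B, true)
mul!(Y::AbstractVecOrMat, A, B, α::Number) = mul!((Y, false), A, B, α)
mul!((Y, β)::Tuple{AbstractVecOrMat,Number}, A, B) = mul!((Y, β), A, B, true)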

One thing I like about @StefanKarpinski's approach is that we can implement specialized method for mul!((Y, β), α::Diagonal, A, B) for _some_ matrix type of A (e.g., Adjoint{_,<:SparseMatrixCSC}) without changing the time complexity. (This is important for my application.) Of course, going this way would require further discussion in the API, especially for how to query for the existence of specialized method. Still, having the opportunity for extending the API is great.

If somebody clarifies my worry about method ambiguities I'll be all for the group-by-tuple approach.

This is because other locations for α changes the big O time complexity.

Is this something specific to sparse matrices? I don't quite get the argument, especially not for dense matrices. In the implementation you link to, you show a case where α is between A and B?

I would think that package developers only need to implement...

This is very oversimplified. Let's suppose we have a matrix that behaves like a strided matrix, such as PseudoBlockMatrix from BlockArrays.jl. To fully support gemm! we need to override every permutation of PseudoBlockMatrix with (1) itself, (2) StridedMatrix, (3) Adjoints of itself, (4) Transposes of itself, (5) Adjoints of StridedMatrix, (6) Transposes of StridedMatrix, and possibly others. This is already 6^3 = 216 different combinations. Then you want to support trmm! and you have to do the same with UpperTriangular, UnitUpperTriangular, their adjoints, their transposes, and so on. Then symm! with Symmetric and Hermitian.

But in many applications we don't just want to work with the matrices, but also their subviews, particularly for block matrices where we want to work with blocks. Now we need to add every permutation of views of our matrix along with the 6 combinations above.

Now we have thousands of overrides, involving StridedMatrix, which is a very complicated union type. This is too much for the compiler, making using times run to minutes instead of seconds.

At this point one realises that the current mul!, and by extension the proposed extensions of mul!, are flawed by design, and so package developers should just not bother with it. Fortunately LazyArrays.jl provides a temporary workaround using traits.

So I sort of agree with @StefanKarpinski in just leaving things as they are until a more substantial redesign is pursued, as spending effort on something that is flawed by design is not a good use of anyones time.

@dlfivefifty , I was referring only to how the scalar arguments are handled. All the complications with different matrix types that currently already exist for mul!(C,A,B) will of course remain.

This is because other locations for α changes the big O time complexity.

Is this something specific to sparse matrices? I don't quite get the argument, especially not for dense matrices. In the implementation you link to, you show a case where α is between A and B?

@Jutho I think in general you can't put α in the innermost loop position. For example, in this case you can support α₁*A*B*α₃ but not A*α₂*B

https://github.com/JuliaLang/julia/blob/11c5680d5620b0b64420055e8474a2b8cf757010/stdlib/LinearAlgebra/src/matmul.jl#L661-L670

I think at least α₁ or α₂ in α₁*A*α₂*B*α₃ has to be 1 to avoid increasing asymptotic time complexity.
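For reference, here is the shape of such a CSC kernel for C .+= A*B*α, simplified from the stdlib loop linked above (the helper name spaddmul! is made up):

using SparseArrays

function spaddmul!(C, A::SparseMatrixCSC, B, α)
    rv, nzv = rowvals(A), nonzeros(A)
    for k in 1:size(C, 2), col in 1:size(A, 2)
        αBck = B[col, k] * α              # hoisted: α costs one multiply per entry of B
        for j in nzrange(A, col)
            C[rv[j], k] += nzv[j] * αBck  # α never enters the innermost loop
        end
    end
    return C
end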

@dlfivefifty But even LazyArrays.jl needs some primitive functions to dispatch to, right? My understanding is that it solves the "dispatch hell" but doesn't decrease the number of computation "kernels" people have to implement.

No, there’s no “primitive” in the same way Broadcasted doesn’t have a “primitive”. But yes at the moment it doesn’t solve the “kernel” question. I think the next step is to redesign it to use a lazy Applied type with an ApplyStyle. Then there could be MulAddStyle to recognise BLAS-like operations, in a way that order doesn’t matter.

I would call materialize! or copyto! a primitive. At least it's the building block for the broadcasting mechanism. Likewise, I suppose LazyArrays.jl has to lower its lazy representation to functions with loops or ccalls to external libraries at some point, right? Would it be bad if the name of such function is mul!?

This is very oversimplified. Let's suppose we have a matrix that behaves like a strided matrix, such as PseudoBlockMatrix from BlockArrays.jl. To fully support gemm! we need to override every permutation of PseudoBlockMatrix with (1) itself, (2) StridedMatrix, (3) Adjoints of itself, (4) Transposes of itself, (5) Adjoints of StridedMatrix, (6) Transposes of StridedMatrix, and possibly others. This is already 6^3 = 216 different combinations. Then you want to support trmm! and you have to do the same with UpperTriangular, UnitUpperTriangular, their adjoints, their transposes, and so on. Then symm! with Symmetric and Hermitian.
But in many applications we don't just want to work with the matrices, but also their subviews, particularly for block matrices where we want to work with blocks. Now we need to add every permutation of views of our matrix along with the 6 combinations above.
Now we have thousands of overrides, involving StridedMatrix, which is a very complicated union type. This is too much for the compiler, making using times run to minutes instead of seconds.

I certainly agree that the current StridedArray type union is a major design flaw. I supported your attempt to fix this at some point.

Over at Strided.jl, I only implement mul! when all matrices involved are of my own custom type (Abstract)StridedView, whenever there is some mixedness in the types of A,B and C, I let Julia Base/LinearAlgebra handle this. Of course, this is to be used in a @strided macro environment, which tries to convert all the possible Base types into the StridedView type. Here, StridedView can represent subviews, transposes and adjoints, and certain reshapes, all with the same (parametric) type. Overall, the full multiplication code is about 100 lines:
https://github.com/Jutho/Strided.jl/blob/master/src/abstractstridedview.jl#L46-L147
The native Julia fallback in case BLAS does not apply is implemented using the more general mapreducedim! functionality provided by that package, and is no less efficient than the one in LinearAlgebra; but it is also multithreaded.

I think at least α₁ or α₂ in α₁*A*α₂*B*α₃ has to be 1 to avoid increasing asymptotic time complexity.

@tkf, I would assume that if these scalar coefficients take a default value one(T), or better yet, true, constant propagation and compiler optimizations will automatically eliminate that multiplication in the innermost loop when it is a no-op. So it would still be convenient to just have to define the most general form.

I'm not sure if we can rely on constant propagation to eliminate all multiplications by 1 (true). For example, the eltype may be Matrix. In that case, I think true * x (where x::Matrix) has to create a heap-allocated copy of x. Can Julia do some magic to eliminate that?

@Jutho I think this benchmark shows that Julia can't eliminate the intermediate multiplications in some cases:

function simplemul!((β₁, Y, β₂), α₁, A, α₂, B, α₃)
    @assert size(Y, 1) == size(A, 1)
    @assert size(Y, 2) == size(B, 2)
    @assert size(A, 2) == size(B, 1)
    @inbounds for i in 1:size(A, 1), j = 1:size(B, 2)
        acc = zero(α₁ * A[i, 1] * α₂ * B[1, j] * α₃ +
                   α₁ * A[i, 1] * α₂ * B[1, j] * α₃)
        for k = 1:size(A, 2)
            acc += A[i, k] * α₂ * B[k, j]
        end
        Y[i, j] = α₁ * acc * α₃ + β₁ * Y[i, j] * β₂
    end
    return Y
end

function simplemul!((Y, β), A, B, α)
    @assert size(Y, 1) == size(A, 1)
    @assert size(Y, 2) == size(B, 2)
    @assert size(A, 2) == size(B, 1)
    @inbounds for i in 1:size(A, 1), j = 1:size(B, 2)
        acc = zero(A[i, 1] * B[1, j] * α +
                   A[i, 1] * B[1, j] * α)
        for k = 1:size(A, 2)
            acc += A[i, k] * B[k, j]
        end
        Y[i, j] = acc * α + Y[i, j] * β
    end
    return Y
end

fullmul!(Y, A, B) = simplemul!((false, Y, false), true, A, true, B, true)
minmul!(Y, A, B) = simplemul!((Y, false), A, B, true)

using LinearAlgebra
k = 50
n = 50
A = [randn(k, k) for _ in 1:n, _ in 1:n]
B = [randn(k, k) for _ in 1:n]
Y = [zeros(k, k) for _ in 1:n]
@assert mul!(copy(Y), A, B) == fullmul!(copy(Y), A, B) == minmul!(copy(Y), A, B)

using BenchmarkTools
@btime mul!($Y, $A, $B)     # 63.845 ms (10400 allocations: 99.74 MiB)
@btime fullmul!($Y, $A, $B) # 80.963 ms (16501 allocations: 158.24 MiB)
@btime minmul!($Y, $A, $B)  # 64.017 ms (10901 allocations: 104.53 MiB)

Nice benchmark. I had also already noticed, from some similar experiments, that it will indeed not eliminate these allocations. For such cases, it might be useful to define a special-purpose One singleton type which just defines *(::One, x::Any) = x and *(x::Any, ::One) = x, and that no user type ever needs to know about. Then the default value, at the very least for α₂, could be One().
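A minimal sketch of such a singleton (the name One is hypothetical; the third method just resolves the (One, One) ambiguity):

struct One end
Base.:*(::One, x) = x
Base.:*(x, ::One) = x
Base.:*(::One, ::One) = One()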

Ah, yes, that's clever! At first I thought I was now OK with supporting α₁ * A * α₂ * B * α₃, but then I think I found another problem: it is mathematically ambiguous what we should do when (say) A is a matrix-of-matrices and α₁ is a matrix. It won't be a problem if we _never_ support non-scalar arguments in α positions. However, it makes it impossible to present Y .= β₁*Y*β₂ + *(args...) as the mental model of mul!((β₁, Y, β₂), args...). Furthermore, it would be really nice if diagonal matrices could be passed as α₁ or α₂, since that can be computed almost "for free" sometimes (and is important in applications). I think there are two routes:

(1) Go with mul!((β₁, Y, β₂), α₁, A, α₂, B, α₃) but when overloading the method, the arguments α and β must accept Diagonal. It is easy to define a call path so that end-user code can still invoke it via scalar values. But for this to work efficiently, the "O(1) version" of Diagonal(fill(λ, n)) https://github.com/JuliaLang/julia/pull/30298#discussion_r239845163 has to be implemented in LinearAlgebra. Note that implementations for scalar and diagonal α are not very different; often it's just swapping α and α.diag[i]. So I don't think it's much of a burden to the package authors.

This solves the ambiguity I mentioned above because now you can call mul!(Y, α * I, A, B) when A is a matrix-of-matrices and α is a matrix that should be treated as the eltype of A.
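As a sketch of that call path, shown with the simple four-argument form and with Fill from FillArrays.jl standing in for the "O(1) version" (the forwarding method itself is an assumption):

using LinearAlgebra, FillArrays

# scalar α forwarded as an O(1) uniform-scaling diagonal
mul!(Y, α::Number, A::AbstractMatrix, B) = mul!(Y, Diagonal(Fill(α, size(A, 1))), A, B)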

(2) Maybe the above route (1) is still too complex? If so, go with muladd! instead for now. If we ever want to support mul!((β₁, Y, β₂), α₁, A, α₂, B, α₃), migrating to it while keeping backward compatibility is not difficult.

At this point I have to wonder if we shouldn't just have a very restricted and specialized matmul!(C, A, B, α, β) which is only defined to work for this general signature:

matmul!(
    C :: VecOrMatT,
    A :: Matrix{T},
    B :: VecOrMatT,
    α :: Union{Bool,T} = true,
    β :: Union{Bool,T} = false,
) where {
    T <: Number,
    VecOrMatT <: VecOrMat{T},
}

Also, it is really amazing that this signature can be written and dispatched on.

That's essentially my suggestion (2), right? (I'm guessing A :: Matrix{T} does not literally mean Core.Array{T,2}; otherwise it is more or less just gemm!)

I would be happy with that as a temporary solution, and can partially support it in the packages I maintain (“partial” because of the outstanding traits issue), though it adds another name to the mix: mul!, muladd! and now matmul!.

...Isn't it about time for someone to post "triage says ..." and call it?

Also, it is really amazing that this signature can be written and dispatched on.

Isn't the fact that you can cleanly dispatch on exactly this signature an argument for just making this a method of mul!? It can then also cleanly be deprecated in case some more general solution comes about.

making this a method of mul!

If we go with mul!(C, A, B, α, β) there is no way to generalize it to mul!((β₁, Y, β₂), α₁, A, α₂, B, α₃) and the like without breaking compatibility. (Maybe it's a "feature" since we are free from this discussion forever :smile:).

Also, let me note that bounding the element type to Number _and_ keeping compatibility with the current 3-arg mul! (which already supports non-Number element types) will introduce a lot of duplication.

I have no idea what you’d expect mul!((β₁, Y, β₂), α₁, A, α₂, B, α₃) to do... so I think it’s a feature.

Bump. It makes me sad to have to use BLAS functions everywhere to get better in-place performance.

@StefanKarpinski Could you bring it up in triage?

We can discuss although I'm not sure what result that's going to have without some linalg people on the call, which there typically aren't these days. To be frank, I spent a fair amount of time and effort on trying to get this discussion to some kind of resolution and every idea seems to have been steadfastly rejected one way or another, so I have pretty much just checked out of this issue at this point. If someone could make a summary of the issue and why the various proposals are inadequate, that would be helpful for having a productive triage discussion. Otherwise I don't think we'll be able to do much.

I think the lack of consensus means it’s not the right time to incorporate this into StdLib.

Why not just make a package MatMul.jl that implements one of the suggestions that downstream users can use? I don't see why being in StdLib is that important in practice. I'll happily support this in the packages I maintain.

I'm thinking just a nice Julian version of gemm! and gemv! matching what we have in SparseArrays already. Per @andreasnoack above:

I thought we already settled on mul!(C, A, B, α, β) with default values for α, β. We are using this version in

julia/stdlib/SparseArrays/src/linalg.jl, lines 32 to 50 at b8ca1a4
I think some packages are also using this form but I don't remember which off the top of my head.

That suggestion had 7 thumbs up and no thumbs down. Why don't we just implement that function for dense vectors/matrices? That would be a simple solution that covers the most common use case, right?

OK. So I guess there isn't even consensus on whether there is a consensus :sweat_smile:

_I_ thought that pretty much everyone [*] wanted this API and it was just a matter of the function name and signature. Compared to _not_ having this API, I thought everyone would be happy with any of the options (say mul!((β₁, Y, β₂), α₁, A, α₂, B, α₃), muladd!(C, A, B, α, β), and mul!(C, A, B, α, β)). Unless someone can make a convincing argument that a certain API is much worse than _not_ having it, I'd be happy with whatever triage decides.

@StefanKarpinski But please feel free to remove the triage tag if you think the discussion is not consolidated enough yet.

[*] OK, @dlfivefifty, I think you are doubtful about even the current 3-arg mul!. But that would require changing the 3-arg mul! interface from scratch, so I thought that was way beyond the scope of this discussion (which I've been interpreting as being about _adding_ some form of 5-arg variant). I think we need something that works well enough until LazyArrays.jl matures.

Why not just make a package MatMul.jl that implements one of the suggestions that downstream users can use?

@dlfivefifty I think having it in LinearAlgebra.jl is important because it's an interface function (overloadable API). Also, since mul!(C::AbstractMatrix, A::AbstractVecOrMat, B::AbstractVecOrMat) is implemented in LinearAlgebra.jl, we won't be able to define mul! in terms of MatMul.muladd!. There are of course some workarounds, but it's much nicer to have a straightforward implementation, especially considering that it "only" requires deciding the name and signature.

Why don't we just implement that function for dense vectors/matrices?

@chriscoey Unfortunately this is not the only favorite for everyone: https://github.com/JuliaLang/julia/issues/23919#issuecomment-441717678. Here is my summary for pros and cons for this and other options https://github.com/JuliaLang/julia/issues/23919#issuecomment-441865841. (See other people's comments, too)

From triage: There are long term plans to have generic tensor contraction APIs including compiler support and selection out to BLAS, but for the medium term, just pick whatever API addresses your immediate need. If it matches BLAS, choosing BLAS names seems reasonable.

@Keno , any information you can share about the generic tensor contraction API and compiler support? I might have some interesting information to share as well, though not in public (yet).

No API design has been done on any of that, just a generic sense that we should have it. I'm aware that you've been working on some of these things, so would be good to have a design session at the appropriate juncture, but I don't think we're there yet.

If it matches BLAS, choosing BLAS names seems reasonable.

This is completely contrary to what we have been doing so far for the generic linear algebra function names.

What's the plan for strong/weak β == 0 in the proposed general version of BLAS.gemm!(α, A, B, β, C)?

If we lower to BLAS calls it will behave like a strong zero, even though this is now inconsistent with lmul!. I can't think of a solution to this apart from falling back to a generic_muladd! if β == 0.
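One way to spell that fallback (both helper names are hypothetical):

function mul!(C, A, B, α, β)
    if iszero(β) && β !== false
        return generic_muladd!(C, A, B, α, β)  # weak zero: must respect NaN/Inf in C
    end
    return blas_muladd!(C, A, B, α, β)         # strong zero (β === false) or β ≠ 0: BLAS is safe
end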

What's the plan for strong/weak β == 0

It was only briefly discussed around my comment in https://github.com/JuliaLang/julia/issues/23919#issuecomment-430139849 so triage probably didn't address that question.

@Keno Although there hasn't been any API design yet, do you envision that "APIs including compiler support and selection out to BLAS" would be defined as mutating or would it be immutable, like the XLA linear algebra, in order to help the compiler? I.e. do you think mul! and/or muladd! would be part of such APIs?

ping @Keno for the @andreasnoack's question in https://github.com/JuliaLang/julia/issues/23919#issuecomment-475534454

Sorry, I meant to come back to this issue (especially since I requested triage) but I didn't know what the next action would be given the decision from triage.

If it matches BLAS, choosing BLAS names seems reasonable.

As @andreasnoack pointed out, we can't use (say) gemm! because we want to support matrix-vector multiplication etc. But I guess we can just ignore this triage decision (which only says "If it matches BLAS"; it doesn't).

just pick whatever API addresses your immediate need.

So, I guess we can follow this direction. I think it means forgetting about the tuple-based API suggested by @StefanKarpinski and "just" picking one of mul!/muladd!/addmul!.

We kind of go back to the original discussion. But I think it's nice that we have a constraint to not discuss about API anymore.

Any idea how to pick a name from mul!/muladd!/addmul!?


@chriscoey I think it's better to discuss future API elsewhere. This issue is already super long and we won't be able to make any progress unless we focus on mid-term solution. How about opening a new issue (or discourse thread)?

I suggest a single round of approval voting with 10 days deadline from now. Approval voting means: Everyone votes for all options that they deem preferable to continuing the discussion. People who would rather have their least preferred name now than a continued discussion should vote for all three. If no option receives widespread approval, or the voting scheme itself fails to meet widespread approval, then we must continue the discussion. In case of almost-ties between approved options, @tkf gets to decide (PR author's privilege).

+1: I agree with this voting scheme and have cast my approval votes.
-1: I disagree with this voting scheme. If too many or too important people select this option, then the vote is moot.

Heart: mul! is preferable to continued discussion.
Rocket: muladd! is preferable to continued discussion.
Hooray: addmul! is preferable to continued discussion.

I tentatively suggest that 75% approval and 5 should definitely make a quorum (i.e. 75% of people who have voted at all, including disagreement with the entire voting procedure, and at least 5 people have approved of the winning option; if participation is low, then 5/6 or 6/8 make a quorum but unanimous 4/4 could be considered as failure).

Feature freeze for 1.3 is around August 15: https://discourse.julialang.org/t/release-1-3-branch-date-approaching-aug-15/27233?u=chriscoey. I hope we can have it merged by then 😃. Thanks to all who have already voted!

We still also need to decide the behavior of β == 0 https://github.com/JuliaLang/julia/issues/23919#issuecomment-475420149 which is orthogonal to deciding the name. Also, the merge conflict in my PR has to be resolved and the PR needs some review on the implementation details (e.g., the approach I used there to handle undefs in the destination array). We may discover other issues during the review. So, I'm not sure if it can get into 1.3....

Re: β == 0, I think @andreasnoack's comment https://github.com/JuliaLang/julia/issues/23919#issuecomment-430139849 (my summary: β == 0-handling should be BLAS-compatible to leverage BLAS as much as possible) makes sense. It's hard to find opposing opinions other than @simonbyrne's argument just below ("every implementation will need a branch to check for zero"). Are there any other arguments against BLAS-like β == 0 handling?

@simonbyrne Regarding your comment https://github.com/JuliaLang/julia/issues/23919#issuecomment-430375349, I don't think explicit branching is a big issue because it's mostly just a one-liner β != 0 ? rmul!(C, β) : fill!(C, zero(eltype(C))). Also, for a very generic implementation where you need to handle, e.g., C = Matrix{Any}(undef, 2, 2), the implementation requires explicit "strong zero" handling anyway (see the helper function _modify! in my PR https://github.com/JuliaLang/julia/pull/29634/files#diff-e5541a621163d78812e05b4ec9c33ef4R37). So, I think BLAS-like handling is the best choice here. What do you think?

Is it possible to have high performance weak zeros? In the sense that we would want:

julia> A = [NaN 0;
            1   0];

julia> b = [0.0,0];

julia> 0.0*A*b
2-element Array{Float64,1}:
 NaN  
   0.0

julia> false*A*b
2-element Array{Float64,1}:
 0.0
 0.0

That is, we would need to manually determine which rows are meant to be NaN if we lower to BLAS (which uses strong 0).

@dlfivefifty I think BLAS can handle NaN in A and B, but not in C?

I guess there is no way to do NaN-aware _C = α * A * B + 0 * C_ efficiently using BLAS.gemm! etc. [1] and that's where @andreasnoack's argument is coming from.

[1] you need to save isnan.(C) somewhere and "pollute" C afterward
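Concretely, the "mask dance" would look something like this (a sketch only; it ignores Inf entries in C, which 0*Inf would also turn into NaN):

using LinearAlgebra

function weakzero_gemm!(C::Matrix{T}, A::Matrix{T}, B::Matrix{T}, α::T) where {T<:LinearAlgebra.BlasFloat}
    nanmask = isnan.(C)                        # remember the poisoned entries
    BLAS.gemm!('N', 'N', α, A, B, zero(T), C)  # strong-zero BLAS call overwrites C
    C[nanmask] .= T(NaN)                       # re-apply the 0*NaN == NaN semantics
    return C
end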

@chethega it has been more than 10 days since the vote

@chriscoey You're right, voting is closed.

I am too bad at github to get a full list of people who voted (that would be required to compute how many people cast any votes at all). However, when I eyeball the numbers, it appears quite clear that mul! has overwhelming support (probably managing a 75% approval quorum), and the second contender muladd! is far below 50%.

There was not a single objection to the voting scheme. I call the vote, mul! has won, and the name is decided. @tkf can go on making it fly :)

@chethega Thanks, it was a nice scheme for deciding this!

BTW, I can't do the rebase right away (but maybe within a few weeks) so if someone wants to do the rebase or reimplementation please don't wait for me.

Unfortunately, we did not have a vote on NaN-semantics. Feature freeze is next week, and we don't have enough time for a meaningful vote.

I propose we have a non-binding referendum to collect a snapshot of the mood in the thread.

I see the following options:

  1. Continue discussing, hope that consensus is somehow reached before the deadline, or the core people grant us an extension, or something. :tada: (edit for clarification: This is the default option, i.e. what happens if we cannot reach consensus on a different option. The most likely outcome is that 5-arg mul! gets delayed until 1.4.)
  2. Merge the new feature, with undefined NaN-behavior as backstop. As soon as we reach consensus, we update code and/or docs to get the decided NaN behavior (strong vs weak zeros). The backstop of undefined implementation-defined NaN-behavior could end up indefinite, but that's not up for voting today. (implement whatever is fastest; users who care need to call different methods). :thumbsup:
  3. Gemm means Gemm! Merge for 1.3 with documented commitment to strong zeros. :heart:
  4. NaN means NaN! Merge for 1.3 with documented commitment to weak zeros. :eyes:
  5. Do something, just anything, before hitting the 1.3 cliff. :rocket:
  6. Reject the poll. :thumbsdown:
  7. Write in
  8. Proposed downthread: Alternative backstop arrangement. Try to merge quickly, and make weak/strong zero divergence an error for 1.3-alpha, until we can reach a more considered decision later. That is, we merge with !(alpha === false) && iszero(alpha) && !all(isfinite, C) && throw(ArgumentError()), and document that this error check is likely to be retired in favor of something else. :confused:

Feel free to select multiple, possibly contradicting options, and @tkf / triage feel free to potentially disregard the poll.

Edit: Currently only :tada: (patience) and :rocket: (impatience) are contradictory, but both are compatible with all others. As clarified downthread, triage will hopefully count the poll outcome at some unspecified date between Wed 14th and the Thu 15th, and take it into consideration in some unspecified manner. This is again meant as "approval voting", i.e. select all options that you like, not just your favorite one; it is understood that :rocket: does not disapprove of :thumbsup:, :heart:, :eyes: and :confused:. I apologize that this poll is more rushed than the last one.

I'd vote for strong zero (:heart:) if it were "Merge for 1.x" instead of "Merge for 1.3". Wouldn't the options make sense if all "Merge for 1.3" are "Merge 1.x"?

Thanks @chethega. @tkf I am in the position of really needing the new mul! ASAP without caring too much about the NaN decision (as long as performance is not compromised).

Did you check out LazyArrays.jl? FYI, it has really nice fused matrix multiplication support. You can also use BLAS.gemm! etc. safely since they are public methods https://docs.julialang.org/en/latest/stdlib/LinearAlgebra/#LinearAlgebra.BLAS.gemm!

I actually truly need the generic mul! since we use a wide variety of structured and sparse and plain old dense matrices, for representing different optimization problems most efficiently. I'm here for the genericity and speed.

I see. And I just remembered that we discussed things in LazyArrays.jl so of course you already knew it...

Regarding "ASAP", Julia's four months release cycle is, at least as a design, to avoid "merge rush" just before feature freeze. I know it's not fair for me to say this because I've tried the same thing before... But I think someone needs to mention this as a reminder. The bright side is Julia is super easy to build. You can start using it right after it is merged until next release.

Edit: release -> merge

Thanks. I find deadlines to be helpful motivators, and I would like to avoid letting this go stale again. So I would propose that we try to use the deadline as a goal.

It's great that you are actively injecting energy to this thread!

I actually truly need the generic mul! since we use a wide variety of structured and sparse and plain old dense matrices, for representing different optimization problems most efficiently. I'm here for the genericity and speed.

A 5-argument mul! won't work well when you have many different types: you'll need combinatorially many overrides to avoid ambiguity. This is one of the motivations behind LazyArrays.jl's MemoryLayout system. It's used for "structured and sparse" matrices precisely for this reason in BandedMatrices.jl and BlockBandedMatrices.jl. (Here even subviews of banded matrices dispatch to banded BLAS routines.)

Thanks, I will try LazyArrays again.

Since the 5-arg mul! seems to be generally considered a temporary stopgap (until some solution like LazyArrays can be used in 2.0), I think we can aim to get it merged without it necessarily being the ideal or perfect solution for the long run.

@chethega when do you think we should count the tallies for the new nonbinding vote?

@tkf Sure, you're right that strong/weak/undef zeros make sense for 1.x as well.

However, I think there are quite a few people who would rather have 1.3 mul! now than wait until 1.4 to get 5-arg mul!. If there were no deadline, then I'd wait and take some more time to think about how to conduct a proper poll (at least 10 days for voting). Most importantly, we cannot have a meaningful vote without first presenting and benchmarking competing implementations for the speed and elegance of weak/strong zeros. I personally suspect that weak zeros could be made almost as fast as strong zeros, by first checking iszero(alpha), then scanning the matrix for !isfinite values, and only then using a slow path with extra allocation; but I prefer strong zero semantics anyway.

@chethega when do you think we should count the tallies for the new nonbinding vote?

Triage must make a decision (delay / strong / weak / backstop) this week for the 1.3 alpha. I think Thursday 15th or Wednesday 14th are sensible options for triage to make a count, and take it into consideration. I probably won't be able to join on Thursday, so someone else will have to count.

Realistically, it is OK to be conservative here, and miss the deadline, and continue the discussion and wait for 1.4.

On the other hand, we may already have reached consensus without noticing it: @andreasnoack made some powerful arguments that a zero coefficient should be a strong zero. It may well be that he managed to convince all the weak zero proponents. It may well be that there is a large majority that really wants 5-arg mul!, preferably last year, and does not really care about this little detail. If that was the case, then it would be a pity to further delay the feature, just because no one wants to shut up the discussion.

Why not just throw an error for now:

β == 0.0 && any(isnan,C) && throw(ArgumentError("use β = false"))

Why not just throw an error for now

I added that option to the poll. Great compromise idea!

Just to set expectations: the feature freeze for 1.3 is in three days so there’s basically no way this can make it in time. We’re being pretty strict about feature freezes and branching since it’s the only part of the release cycle that we can really control the timing of.

The work is already done in https://github.com/JuliaLang/julia/pull/29634 though. Just needs to be adjusted and rebased.

@tkf for #29634 could you list the work that remains to be done (including renaming and handling zeros according to the vote)? I know you're busy so perhaps we can find a way to split any remaining todos so the burden won't fall on you again.

The TODOs I can think of ATM are:

My PR implements BLAS semantics for the β = 0 handling. So any other handling, like throwing an error, would also have to be implemented.

My PR implements BLAS semantics of the β = 0 handling.

Sorry, my memory was stale; my implementation was not consistent and propagates NaN _sometimes_. So an additional TODO is to make the behavior of β = 0.0 consistent.

The MulAddMul type would just be for internal use, right?

Yes, it's completely an internal detail. My worries were (1) there may be too many specializations (beta=0 etc. is encoded in the type parameter) and (2) it decreases the readability of the source code.

Those are valid concerns. We already produce a ton of specializations in the linear algebra code so it's good to think through whether we really need to specialize here. My thinking has generally been that we shouldn't optimize for small matrices since it's not free (as you said, it complicates the source code and could increase compile times) and people are better off using StaticArrays for small matrix multiplications. Hence, I lean towards just checking the values at runtime, but we can always adjust this later if we change our mind, so we shouldn't let this cause any delays.

FYI soft zeros do have simple implementations:

if iszero(β) && β !== false && !iszero(α)
   lmul!(zero(T), y)                         # y .= 0*y handles soft zeros correctly (keeps NaN)
   BLAS.gemv!('N', α, A, x, one(T), y)       # y .+= α*A*x preserves the soft zeros
elseif iszero(α) && iszero(β)
   BLAS.gemv!('N', one(T), A, x, one(T), y)  # puts NaNs in the correct place
   lmul!(zero(T), y)                         # everything not NaN should be zero
elseif iszero(α) && !iszero(β)
   BLAS.gemv!('N', one(T), A, x, β, y)       # puts NaNs in the correct place
   BLAS.gemv!('N', -one(T), A, x, one(T), y) # subtracts out the non-NaN changes
end

@andreasnoack Sorry, I forgot that we actually needed the specialization to optimize the inner-most loop for some structured matrices like mul!(C, A::BiTriSym, B, α, β) https://github.com/JuliaLang/julia/pull/29634#issuecomment-440510551. Some specializations can be removed but that's actually more work (so delays).

Hence, I lean towards just checking the values at runtime but we can always adjust this later if we change our mind so we shouldn't let this cause any delays.

Great!

@andreasnoack Thanks a lot for reviewing and merging this in time!

Now that it is merged for 1.3, I'm starting to get very nervous about the implementation :smile:. I'd appreciate it if people here could test their linear algebra code more thoroughly when the 1.3-rc comes out!

No need to worry, there will be plenty of time for 1.3 RCs + PkgEvals to shake out bugs.
