Two related questions:

```julia
dot(x, M, y) = dot(x, M * y)
```

which I don't want to hard-code; I'd prefer the ability to overload it. Is this functionality sufficiently general (I think it is) that it should be in Base by default for the standard arrays?

EDIT: the second question is deleted; it needs more thinking.
I think this would be useful to have in Base and I've thought about adding it several times. One of the benefits is that you can completely avoid a temporary `M*y`.
Yes - that was exactly the origin of this - I should have said so, sorry.
Would be nice to get some benefit from BLAS here, though it's only a BLAS2 operation. @xianyi, do you have a suggestion on the most efficient way to use BLAS for `x'*M*y`?
One option would be to break `x'*M` into chunks of, say, 16 elements (big enough to exceed a cache line but small enough that the memory allocation of a length-16 temporary vector is not an issue), compute `x'*M` on that chunk via BLAS2 using a subset of the columns of `M` (since Julia is column-major, this doesn't require any copying), then accumulate that chunk of the dot product with `y`. However, this is suboptimal because it doesn't fully exploit multi-threading: each of the chunk dot products could ideally be computed in a parallel thread.
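To make that concrete, here is a rough sketch of the chunked approach, assuming real `Float64` data; `dot3_chunked` and the chunk width of 16 are illustrative, not an existing API:

```julia
using LinearAlgebra

# Sketch: compute x'*M sixteen columns at a time with a transposed gemv
# (BLAS2) into a small reusable temporary, then dot each chunk against the
# matching slice of y. Views avoid copying any columns of M.
function dot3_chunked(x::Vector{Float64}, M::Matrix{Float64}, y::Vector{Float64})
    m, n = size(M)
    (length(x) == m && length(y) == n) || throw(DimensionMismatch())
    tmp = Vector{Float64}(undef, 16)   # small fixed temporary, reused per chunk
    s = 0.0
    j = 1
    while j <= n
        w = min(16, n - j + 1)
        # tmp[1:w] = M[:, j:j+w-1]' * x
        BLAS.gemv!('T', 1.0, view(M, :, j:j+w-1), x, 0.0, view(tmp, 1:w))
        s += dot(view(tmp, 1:w), view(y, j:j+w-1))
        j += w
    end
    return s
end
```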
Ideally you'd also want a specialized method for the common case where `M` is a diagonal matrix, especially a diagonal real matrix. And tridiagonal "mass" matrices are also common, as well as general sparse matrices `M`, in problems arising from discretized PDEs.
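The diagonal case in particular reduces to a single O(n) loop with no temporary at all; a minimal sketch (`dot3` is a hypothetical name, not an existing method):

```julia
using LinearAlgebra

# Hypothetical specialization for Diagonal M: x'*M*y in one pass, no temporary.
function dot3(x::AbstractVector, D::Diagonal, y::AbstractVector)
    length(x) == length(y) == size(D, 1) || throw(DimensionMismatch())
    s = zero(promote_type(eltype(x), eltype(D), eltype(y)))
    for i in eachindex(D.diag)
        s += conj(x[i]) * D.diag[i] * y[i]
    end
    return s
end
```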
I was mostly thinking of sparse cases for my own use. These should be relatively easy to implement in Julia?
For the matvec, it seems that multithreading is almost the only speedup you'll get from BLAS. For a 500x500 problem and a Julia double loop with `@inbounds` and `@simd` I get
```julia
julia> blas_set_num_threads(1);

julia> @time for i = 1:10000; Ac_mul_B!(y, A, x); end  # OpenBLAS, 1 thread
  0.484342 seconds

julia> blas_set_num_threads(4);

julia> @time for i = 1:10000; Ac_mul_B!(y, A, x); end  # OpenBLAS, 4 threads
  0.184681 seconds

julia> @time for i = 1:10000; mymul(y, A, x); end      # pure Julia
  0.509110 seconds
```
so if/when the `@threads` overhead is gone, a pure Julia version should probably be okay.
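Presumably `mymul` is something like the following sketch, assuming it computes `y = A'x` (the same operation as `Ac_mul_B!`):

```julia
# Presumed shape of the benchmarked kernel: y = A'x with a column-major
# inner loop so @simd can vectorize the reduction.
function mymul(y::Vector{Float64}, A::Matrix{Float64}, x::Vector{Float64})
    m, n = size(A)
    @inbounds for j = 1:n
        s = 0.0
        @simd for i = 1:m
            s += A[i,j] * x[i]
        end
        y[j] = s
    end
    return y
end
```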
I'm surprised there isn't some SIMD speedup. Probably also worthwhile to special-case `x === y` and `issym(A)`.
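That symmetric, `x === y` case roughly halves the memory traffic by visiting only one triangle; a sketch for real matrices (`quadform_sym` is a hypothetical name):

```julia
# Hypothetical x'*A*x for real symmetric A: visit only the upper triangle,
# doubling the off-diagonal contributions.
function quadform_sym(A::Matrix{Float64}, x::Vector{Float64})
    n = size(A, 1)
    length(x) == n || throw(DimensionMismatch())
    s = 0.0
    @inbounds for j = 1:n
        t = 0.0
        @simd for i = 1:j-1
            t += A[i,j] * x[i]
        end
        s += (2t + A[j,j] * x[j]) * x[j]
    end
    return s
end
```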
@stevengj I don't understand. Both OpenBLAS and Julia use SIMD, so where would you have expected to see a speedup?
In OpenBLAS the SIMD is hard-coded (I thought), whereas in Julia we rely on the compiler to do its magic, which doesn't always work as well. Maybe this is one of the cases where the compiler does a good job of exploiting SIMD.
In my experiments with mat-vec and mat-mat, Julia has done a good job exploiting SIMD. For mat-vec my implementation was asymptotically as fast as OpenBLAS, although they use a better algorithm for mat-mat, so theirs was faster in that case (about 3x IIRC).
To follow up on this: when everything is sparse we can also do much better by interleaving the computations. E.g., in a recent project I did:
```julia
function mul_ut_X_v(u::SparseVector, X::SparseMatrixCSC, v::SparseVector)
    u.n == X.m && X.n == v.n || throw(DimensionMismatch())
    z = zero(promote_type(eltype(u), eltype(X), eltype(v)))
    # For each stored entry of v, take a sparse dot product of u with the
    # corresponding column of X, scaled by that entry's value.
    for (vi, vv) in zip(v.nzind, v.nzval)
        X_ptr_lo = X.colptr[vi]
        X_ptr_hi = X.colptr[vi+1] - 1
        if X_ptr_lo <= X_ptr_hi
            z += Base.SparseArrays._spdot(dot, 1, length(u.nzind), u.nzind, u.nzval,
                                          X_ptr_lo, X_ptr_hi, X.rowval, X.nzval) * vv
        end
    end
    z
end
```
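A hypothetical call, just to show the shapes involved (sizes and densities are arbitrary, assuming the same Julia version as the snippet above):

```julia
u = sprand(1000, 0.01)           # SparseVector
X = sprand(1000, 1000, 0.001)    # SparseMatrixCSC
v = sprand(1000, 0.01)           # SparseVector
mul_ut_X_v(u, X, v)              # the scalar u'*X*v, without ever forming X*v
```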
Once we have lazy transposes, we could overload the ternary multiplication operator `u' * X * v` for this. Until then, I would be fine with overloading `dot`, or adding a `bilinear` function.
As pointed out by @andyferris, we can now overload ternary `*` for this, e.g.
```julia
function Base.:*(u::RowVector{T}, X::Matrix{T}, v::Vector{T}) where {T}
    m, n = size(X)
    length(u) == m && length(v) == n || throw(DimensionMismatch())
    z = zero(T)
    for j = 1:n
        zz = zero(T)
        @simd for i = 1:m
            @inbounds zz += u[i] * X[i,j]
        end
        z += zz * v[j]
    end
    z
end
```
This gives a 2x speedup on my machine.
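For context, with that method defined, an expression like the following should dispatch straight to it (assuming the `RowVector` era, where `x'` produces a `RowVector`):

```julia
x = rand(500); A = rand(500, 500); y = rand(500)
x' * A * y   # parses as *(x', A, y), so no A*y temporary is formed
```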
Credit to @ViralBShah also, I think? (Note we need to make the mechanics of `Ac_mul_B` a bit different (nuke it from orbit) to make this work.)
I still think it would be nice to have `dot(x, A, y)`, even if ternary `*` calls this.
@andyferris Why would we need to change `Ac_mul_B`?
I just mean the lowering. You want it to come out as `*(adjoint(x), A, y)` rather than `Ac_mul_B(x, A) * y`.
Overloading the ternary `*` seems to do that already.
Yeah, as long as you use an explicit `*`. It's fiddly, though:
```julia
julia> expand(Main, :(x'*A*y))
:(adjoint(x) * A * y)

julia> expand(Main, :(x'A*y))
:(Ac_mul_B(x, A) * y)

julia> expand(Main, :(x'*A))
:(Ac_mul_B(x, A))

julia> expand(Main, :(x'A))
:(Ac_mul_B(x, A))
```
This would be fixed if we settle on a definition of (and implement) a matrix transpose, because then all the `Ax_mul_Bx`s can go away.
Have there been any regrets about implementing a vector transpose?
I think that people are generally happy with the vector transpose. There have been a few comments where people have been surprised/confused/annoyed but I think that is to be expected given the length of #4774.
Yes, I have to say the experience with RowVectors, which behave as row matrices except for linear-algebraic operations, has been lovely. They generally just do what you want.
The generalized inner product would be nice to have without creating a temporary.
What is the current thinking on bringing this to Julia? This type of quadratic form `x'Qy` (often with `x = y` and `Q` symmetric) is ubiquitous in mathematical optimization (I'm surprised there is no BLAS function for it). I see there is some implementation in PDMats.jl for the case where `Q` is PSD (which is restrictive). Brief related discussion: https://discourse.julialang.org/t/matrix-multiplication-a-b-a
@chriscoey I suspect this might be an issue waiting for someone to contribute a pull request to `LinearAlgebra`? The discussion above seems pretty receptive to the idea.
So, we do have a ternary `*` for `Diagonal`s, and I quickly wrote up one for `SparseMatrixCSC` and `AbstractVector`s:
```julia
function *(x::Adjoint{<:Any,<:AbstractVector}, A::SparseMatrixCSC, y::AbstractVector)
    length(x) == A.m || throw(DimensionMismatch())
    A.n == length(y) || throw(DimensionMismatch())
    T = promote_type(eltype(x), eltype(A), eltype(y))
    out = T(0)
    rvals = rowvals(A)
    nzvals = nonzeros(A)
    # Accumulate column by column: ((x'*A)[col]) * y[col]
    @inbounds for col = 1:A.n
        temp = T(0)
        for k in nzrange(A, col)
            temp += x[rvals[k]] * nzvals[k]
        end
        out += temp * y[col]
    end
    return out
end
```
Would we insert this into `SparseArrays` as-is (perhaps together with @simonbyrne's all-sparse version above), or should we introduce a generic `dot(x, A, y)` first, make it throw in general, but have more specific implementations and make the ternary `*`s call those?
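For concreteness, that layering might look like the following sketch; `three_arg_dot` is a stand-in name and the method placements are hypothetical, not an existing API:

```julia
using LinearAlgebra

# Generic fallback: correct for anything, but allocates the temporary A*y.
# Specialized methods (Diagonal, SparseMatrixCSC, ...) would override this.
three_arg_dot(x::AbstractVector, A::AbstractMatrix, y::AbstractVector) =
    dot(x, A * y)

# Ternary * forwards to the dot-style kernel; dot conjugates its first
# argument, which matches what x' * A * y means for complex element types.
Base.:*(x::Adjoint{<:Any,<:AbstractVector}, A::AbstractMatrix, y::AbstractVector) =
    three_arg_dot(parent(x), A, y)
```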
I would have the `dot(x, A, y)` first.