Consider the following two tests. The first one uses Julia's A_mul_B (OpenBLAS's dgemv):
function test1(n::Int, count::Int)
    A = rand(n, n)
    x = rand(n)
    y = zeros(n)
    tic()
    for i = 1 : count
        A_mul_B(y, A, x)
    end
    toq()
end
The second one implements matrix-vector multiplication itself:
function naive_A_mul_B(y::Vector{Float64}, A::Matrix{Float64}, x::Vector{Float64})
    m, n = size(A)
    for i = 1 : m
        y[i] = 0.0
        for j = 1 : n
            y[i] += A[i, j] * x[j]
        end
    end
end
function test2(n::Int, count::Int)
    A = rand(n, n)
    x = rand(n)
    y = zeros(n)
    tic()
    for i = 1 : count
        naive_A_mul_B(y, A, x)
    end
    toq()
end
The results are (count = 100,000):
N OpenBLAS Naive
4 0.035218 0.005754
5 0.641212 0.008684
8 0.650700 0.019424
9 0.871438 0.024323
16 1.079199 0.077812
17 1.249294 0.085496
32 2.009289 0.322948
33 2.051556 0.339220
64 2.051730 1.342379
65 2.078451 1.416710
This is really sad.
P.S. I ran the same test using the dgemv that comes with MKL, and the results are even sadder:
N OpenBLAS Naive MKL
4 0.026248 0.005759 0.011873
5 0.682821 0.008710 0.012759
8 0.703205 0.022573 0.013530
9 0.918482 0.024350 0.015054
16 1.148282 0.075867 0.020143
17 1.307566 0.085483 0.023182
32 2.035381 0.318547 0.040268
33 2.050683 0.340855 0.047492
64 2.114853 1.342950 0.142106
65 2.101495 1.418431 0.160844
See also #2489. We could handle small matvecs on our own, but this should probably also be filed as an OpenBLAS bug.
I'm quite surprised by how large an N this extends to. I can partially confirm this, but on my machine OpenBLAS does much better:
N OpenBLAS Naive
4 0.029808847 0.008809609
5 0.123966276 0.012788681
8 0.118868987 0.028228581
9 0.121300932 0.035596733
16 0.12656238 0.111826848
17 0.125384132 0.126661398
32 0.175371494 0.452407687
33 0.189356229 0.52830851
64 0.316179272 1.889909086
65 0.274108522 2.000223288
OpenBLAS is very architecture-specific.
On my machine, I get performance more like Tim's:
N OpenBLAS Naive
4 0.021467 0.004708
5 0.065906 0.006636
8 0.064675 0.015061
9 0.064371 0.022904
16 0.071695 0.058209
17 0.066378 0.066214
32 0.079359 0.229914
33 0.079927 0.252575
64 0.148048 1.032274
65 0.116915 1.048045
Is this with OPENBLAS_NUM_THREADS=1?
Perhaps we should have some general infrastructure to deal with such cases, to handle the changeover from native Julia to library implementations.
Kind of like multiple dispatch, but based on problem size.
Yes! We can call it an "if statement".
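For illustration, a minimal sketch of such a size-based changeover in current Julia syntax (the cutoff of 60 and the function name are made up for the example, not tuned constants):

```julia
using LinearAlgebra

# Hypothetical changeover: naive loop for small sizes, BLAS gemv otherwise.
function matvec!(y::Vector{Float64}, A::Matrix{Float64}, x::Vector{Float64})
    m, n = size(A)
    if max(m, n) < 60                      # illustrative cutoff, would need tuning
        fill!(y, 0.0)
        @inbounds for j = 1:n, i = 1:m     # column-major friendly order
            y[i] += A[i, j] * x[j]
        end
        return y
    else
        return mul!(y, A, x)               # hand off to BLAS for larger problems
    end
end
```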
Setting OPENBLAS_NUM_THREADS=1 produces a much more likeable result:
N OpenBLAS Naive MKL
4 0.029357 0.005750 0.011478
5 0.029628 0.008686 0.012789
8 0.030499 0.019367 0.013001
9 0.022988 0.024703 0.015096
16 0.029898 0.075955 0.019873
17 0.032563 0.085504 0.022977
32 0.059854 0.318140 0.041384
33 0.073469 0.340058 0.047291
64 0.180811 1.343092 0.145430
65 0.190066 1.418690 0.161348
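For reference, the BLAS thread count can also be set from within a running session rather than via the environment variable; in current Julia syntax:

```julia
using LinearAlgebra
BLAS.set_num_threads(1)   # same effect as starting Julia with OPENBLAS_NUM_THREADS=1
```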
@ViralBShah We could, but this is really OpenBLAS's problem to fix. MKL also uses threading, but it handles small matrices gracefully.
I suggest we analyze the performance of OpenBLAS on Windows a little further, and if such behavior is a pattern we can disable threading altogether (at least on Windows). Can anybody check whether this problem exists on Linux/macOS?
On my (oldish, non-hyperthreading) machines, running fresh 64-bit Windows and Linux Julia builds, OpenBLAS gemv does not look too good either. Below is a slightly modified timing table, where I have replaced naive_A_mul_B with a cache-friendly version that iterates for j = 1:n, i = 1:m; y[i] += A[i,j] * x[j]; end. Also, I have normalized the timings by dividing by n^2. I have also added a gemm-based MV multiplier. The number of OpenBLAS threads is in brackets.
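For context, a sketch of that cache-friendly loop order (not the exact code from the file linked below, just the idea: the inner loop walks down a column of A, which is contiguous in Julia's column-major layout):

```julia
function cache_friendly_A_mul_B(y::Vector{Float64}, A::Matrix{Float64}, x::Vector{Float64})
    m, n = size(A)
    fill!(y, 0.0)
    for j = 1:n, i = 1:m
        y[i] += A[i, j] * x[j]   # inner index i runs down a column of A
    end
    return y
end
```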
The code I used is here: https://github.com/bsxfan/ad4julia/blob/master/timegemv.jl. (Compared to @VHaravy's test code, I think my normalized timing for n=4 suffers a bit from the overhead of deciding which of my MV functions to call, but for larger n, the division by n^2 makes this constant overhead insignificant.)
For my machines at least, I would conclude that you don't want to use any OpenBLAS-based matrix-vector multiplication for n < 60 or so; for larger n, multithreading doesn't hurt, but it doesn't help very much either. (For big matrix-matrix multiplications, however, the multithreading does help.)
Windows (2 cores):
n loop gemv(1) gemm(1) gemv(2) gemm(2)
4 16.85375 21.42875 49.59875 22.39188 45.50625
5 10.63240 23.42240 37.59880 456.27040 38.06080
8 7.52422 7.34344 15.10844 203.75234 13.36281
9 5.23160 6.08765 12.31802 170.88185 12.41309
16 3.95770 2.25723 4.51445 45.70129 4.30375
17 3.82567 2.13277 4.19889 23.11398 3.82567
32 2.88925 0.89160 1.76063 12.90381 1.91864
33 2.84769 0.74994 1.82534 11.23155 1.76522
64 2.86197 0.60945 1.65530 3.25981 1.60263
65 4.16052 0.63734 1.59108 3.24143 1.55005
128 2.86832 0.53115 1.42252 1.18128 1.44509
129 2.93028 0.52665 1.43019 1.06581 1.56538
512 2.84102 0.44971 1.37967 0.32553 1.37985
513 2.84649 0.46067 1.37898 0.33463 1.36106
1024 3.16857 1.47960 1.97020 1.49662 2.03573
1025 3.14996 1.49318 2.18105 1.44015 2.12832
2048 3.13348 1.62384 2.15700 1.54804 2.13653
2049 3.13667 1.61306 2.25749 1.58398 2.30797
Linux (8 cores):
n loop gemv(1) gemm(1) gemv(8) gemm(8)
4 18.57250 21.28250 50.72438 23.55313 45.28875
5 10.79400 12.19760 32.91320 136.93360 30.87360
8 5.85875 4.84844 12.38969 51.08484 12.40016
9 5.24605 4.10593 10.21679 48.62457 10.34642
16 3.50402 1.60164 3.79465 17.43223 4.03980
17 3.42273 1.54747 3.62574 15.74512 3.74419
32 2.91198 0.77287 1.77106 6.30886 1.72822
33 2.92479 0.74156 1.70529 5.43544 1.69363
64 2.83866 0.53758 1.60033 1.63550 1.58147
65 3.00864 0.57960 1.52095 1.56001 1.53420
128 2.88911 0.53451 1.42644 0.53462 1.39886
129 2.88510 0.53506 1.45560 0.50755 1.46066
512 2.89669 0.45824 1.42252 0.17840 1.40695
513 2.88091 0.47314 1.40404 0.17443 1.39493
1024 3.32143 1.88073 2.53437 1.81940 2.49573
1025 3.37001 1.90204 2.68743 0.13025 2.66368
2048 3.48342 2.06975 2.83885 1.61386 2.84011
2049 3.40828 2.08051 2.88406 1.62771 2.88558
Cc: @xianyi
For a moment I thought my timings here were in contrast with this post (https://groups.google.com/forum/?fromgroups=#!topic/julia-dev/s3_azLP6oyY), where I compared Julia's sum(A,2) with GEMV. There, for an n x n matrix A with n = 1000, I found that GEMV with 8 threads gave a 20-fold performance gain. On closer inspection, multithreaded GEMV seems to have this sweet spot for medium-sized n, on my machine roughly 200 < n < 1500; after about 1500, the advantage deteriorates to just a factor of 2 (similar to single-threaded GEMV).
Curiously, exactly at n=1024 there is a bad spot. Using my test code referred to above, I compare multithreaded gemv time relative to loopMV time (smaller than one means blas is better):
julia> for n=1020:1:1026;println(n,": ",test(gemvMV,n,100,8)/test(loopMV,n,100,1)) end
1020: 0.04129974380460607
1021: 0.03499813589895558
1022: 0.040023128676005285
1023: 0.03524801930325523
1024: 0.48713279239389196
1025: 0.03455542414148307
1026: 0.037994575168908744
At 1024 it takes more than 10 times longer than at 1023 or 1025.
(From the table in my previous post, n=5 is also a particularly bad spot for multithreaded GEMV.)
The performance hit at size 1024 doesn't seem to occur with MKL.
Blaze requires a separate BLAS to be installed.
As @JeffBezanson notes, Blaze ends up using BLAS at a low level, as does Armadillo. Other C++ templated linear algebra systems like Eigen create linear algebra classes that do not use the BLAS. In all these cases, however, the objective is to organize vector types, matrix types and various factorizations in some kind of an object-oriented framework so the programmer can write expressions in a natural way and still have them evaluated efficiently.
We are steadily building such a framework in Julia, but with a big difference: Julia provides multiple dispatch, whereas systems based on C++ must build methods into distinct classes. Many linear algebra calculations, like products and solving linear systems, are poster children for multiple dispatch. The decision of how to evaluate an expression should depend on both the left and the right operands, and in Julia this is possible.
And the core developers have built marvelous parsing rules to provide efficient evaluation of many expressions like A'B. Accomplishing this using delayed-evaluation techniques in languages like C++ ends up producing some truly horrible-looking code.
Frankly speaking, C++ template libraries have their advantages over Julia (at least at this moment).
For example, C++ template libraries can achieve lazy evaluation with zero runtime overhead even for very complex expressions. A correctly implemented C++ template library can evaluate a complex vectorized expression like r = sum(x.^2 + y.^2 + sin(z).^2) without creating any intermediate arrays, and map the computation to tight SIMD loops.
The way they achieve this is to use template-based generic programming (OOP is not actually used a lot in such libraries), so they can achieve _multiple dispatch_ in a static sense. Such libraries can be tricky to implement, and C++ meta-programming code can be quite difficult to follow. However, the client code that _uses_ the library is usually concise and friendly.
Having said that, Julia does have many other advantages.
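For comparison, the fused computation such libraries aim to emit is essentially the single loop one would write by hand in Julia (a sketch; the function name is made up):

```julia
# One pass over the data, no intermediate arrays for x.^2, y.^2, or sin(z).^2:
function fused_sum(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    r = 0.0
    @inbounds for i in eachindex(x, y, z)
        r += x[i]^2 + y[i]^2 + sin(z[i])^2
    end
    return r
end
```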
I would point out that all of that can also be done in Julia just as easily as in C++, but we've chosen not to follow that route with the core Julia types because of the complexity of implementation and because in my experience it ends up being brittle and hard to understand, in the sense that everything is cool until it isn't – and then you're kind of screwed. We need to figure out how to deal with this better in Julia without giving up the simplicity and transparency that we've maintained so far.
@lindahua I was banking on your Devectorize package to deal with complex vectorized expressions. This is becoming a bottleneck in some of the stuff I want to do as well. I really wish we were able to do such things in base Julia, and some of that stuff, such as SIMD, is on the compiler optimization roadmap (#3440).
This is similar to xianyi/OpenBLAS#103.
As a concrete example of what Stefan was talking about (it just came up in my own coding), lazy evaluation works great for simple cases but runs into trouble quickly. Compare something like r = A*x + b vs. r = A*(x + b) using a preallocated output r. The first is pretty easy, but for the second you're best off introducing a temporary to store the result of x + b, unless you're willing to destroy x. (Vector operations are non-aliasing, but matrix multiplication aliases, so in the second case you can't use r as a temporary.)
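To make the aliasing point concrete, in current Julia syntax the two cases look roughly like this (tmp is a hypothetical preallocated buffer):

```julia
using LinearAlgebra

A = rand(4, 4); x = rand(4); b = rand(4)
r = similar(x); tmp = similar(x)     # preallocated output and temporary

# r = A*x + b: one gemv into r, then an in-place add.
mul!(r, A, x)
r .+= b

# r = A*(x + b): the sum needs to live somewhere before the multiply, and r
# itself can't hold it, because mul!(r, A, r) would alias gemv's input and output.
tmp .= x .+ b        # or overwrite x, if destroying it is acceptable
mul!(r, A, tmp)
```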
@timholy You are right that special care needs to be taken in such cases. A sophisticated matrix library typically comes with a large set of specialized template functions that dispatch the computation based on the type of the expression. If expressions are constructed recursively, A * x + b and A * (x + b) will be of different types (e.g. BinaryExpr<Add, BinaryExpr<Times, Matrix, Vector>, Vector> and BinaryExpr<Times, Matrix, BinaryExpr<Add, Vector, Vector>>).
The expressions are not always lazily evaluated; one can write specialized rules to evaluate parts of the expression earlier than others.
Having said that, I agree that things can get quite complicated. Boost has a C++ library, Proto, which allows you to express complex mappings between expressions and computations in a relatively concise and systematic way.
The C++ library NT2 (much more comprehensive than Eigen, Armadillo, or Blaze) uses Proto to try to map every kind of expression one might ever conceive of to the most efficient computation routines. The cost is that the library becomes very complex, with several hundred thousand lines of code (probably 3x - 4x larger than the Julia codebase), and the compilation cost is high -- the compiler needs to work through all the rules (expressed in template meta-programming) to actually emit the code.
I am not suggesting to take such approach in Julia. I agree with @StefanKarpinski that this will become brittle and overly complex.
Instead I think the compilation optimization stuff that Jeff is working on may achieve similar goals in a cleaner way. I am looking forward to that.
My hypothesis is that most of the performance overhead can be eliminated without lazy evaluation via:
I'm not positive about this, but I suspect that these three improvements together with other compiler work will get us to where we need to be without losing the nice clean eager, dynamic semantics that we currently have.
My opinion is that you are all forgetting what this issue is about. We need to make sure that basic BLAS and LAPACK functions work as fast as possible (or at least not as slowly as some of them do now).
Julia does not need a very complicated linear algebra framework like the above-mentioned Blaze, Proto, NT2, or Armadillo. Small optimizations like those mentioned by @StefanKarpinski are what will make Julia's linear algebra fast and flexible.
Sorry about the earlier mess; I responded in email rather than writing here.
@StefanKarpinski suggested "Better syntax and support for in-place operations in the cases where you really can't afford to use a temporary." and my response was:
I was going to suggest this in a more general issue regarding further enhancements of linear algebra methods but, since you brought it up ...
I suggest adding methods for solve! to the factorization types and special matrix types. A solve! method is an in-place solve; that is, it overwrites the second argument (the right-hand side of the system) with the solution.
If this seems reasonable I will begin implementation.
@VHaravy I agree that the point in this issue is lost in discussion. As @xianyi mentioned, the openblas issue with threading is known, and the cut-offs for using multiple threads need to be improved in openblas. If threads are disabled, you will get worse performance on larger problems.
For the time being, just like we have a Julia implementation for small matrix multiplication, we could extend that to small matvecs as well.
@dmbates, or you could write the three-argument version solve!(output, A, b).
@timholy For the most common cases, where A is a factorization of a square matrix, the three-argument version would be equivalent to solve!(A, copy!(output, b)). I guess it is a matter of taste which version is preferable.
This issue is related to an issue or thread by @johnmyleswhite regarding the names of mutating function variants like the three argument version of Ac_mul_B, but I can't find the issue now. I also had the impression that @lindahua mentioned something like this in his announcement of the NumericFunctors package, but I can't find that either. (The google is not strong in you today, my young apprentice.)
For functions in base Julia, at least, there should be consistent naming conventions for the versions that mutate an input argument and those that overwrite an output argument in place. So for the case of solve:
solve(A, b) returns a freshly-allocated array
solve!(A, b) overwrites b with the solution, returning the mutated version
solve!(output, A, b) overwrites output with the solution
I think that could be applied consistently to many of the other numerical linear algebra functions.
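A minimal sketch of those three forms in terms of today's LinearAlgebra primitives (hypothetical definitions for illustration, not the API that was eventually adopted):

```julia
using LinearAlgebra

solve(A, b) = A \ b                                          # freshly-allocated result
solve!(A::Factorization, b::AbstractVector) = ldiv!(A, b)    # overwrites b
solve!(output::AbstractVector, A::Factorization, b::AbstractVector) =
    ldiv!(A, copyto!(output, b))                             # overwrites output, b untouched

F = lu(rand(4, 4)); b = rand(4)
x = solve!(similar(b), F, b)   # b is preserved
```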
cc: @ViralBShah, @andreasnoackjensen
One small observation: when chatting with @stevengj yesterday, we discovered that OpenBLAS appears to use OpenMP's static scheduler in some core driver loop. Simply changing the scheduler directive to schedule(auto) and recompiling resulted in a 35% speedup for a very simple test problem:
julia> y=x=randn(10,10); @time begin for i=1:10^6; y=x*x; end; end; y[1] #With schedule(static) as in v0.2.8
elapsed time: 3.593052362 seconds (928000000 bytes allocated)
julia> y=x=randn(10,10); @time begin for i=1:10^6; y=x*x; end; end; y[1] #With schedule(auto) / v0.2.8
elapsed time: 2.638416476 seconds (928000000 bytes allocated)
julia> versioninfo()
Julia Version 0.3.0-prerelease+2145
Commit a9949c6* (2014-03-22 04:06 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin13.1.0)
CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
WORD_SIZE: 64
BLAS: libopenblas (NO_AFFINITY)
LAPACK: libopenblas
LIBM: libopenlibm
Recently I found that complex matrix multiplication gives wrong results, even with a 4x4 matrix, on a machine with Windows 7 Pro running on an AMD FX(tm)-8320 Eight-Core Processor. The error appears both in Julia 0.2.1 and 0.3-dev.
Can you give an example? I'm really disturbed by what seem to be a lot of cases of OpenBLAS giving incorrect answers lately.
Crazy idea. Anyone want to try incorporating https://github.com/flame/blis and https://github.com/flame/libflame as alternate BLAS/LAPACK backends?
That is a lot of work. Is it BLAS-compatible? How does the performance compare?
Read their FAQs. The major issue seems to be Windows support. It does have a BLAS interface.
Would not expect it to be a quick job, no. More like an up-for-grabs experiment if someone's feeling adventurous. According to their publications (http://www.cs.utexas.edu/users/flame/pubs/BLISTOMS.pdf), serial performance is comparable but it's not clear to me whether they've implemented multithreading yet.
Edit: if the FAQ entry about not supporting building as a shared library still applies, then it may be a non-starter for now.
In the thread I mentioned (https://github.com/JuliaLang/julia/issues/7031) there is some test code. I reproduce it below, with a sample output. More information and examples are in that thread.
I should stress that I have two machines with Windows 7 64-bit, and this only happens on the one with an AMD 8320 8-core (the other one, with an AMD A8-3850 APU with 4 cores, runs the code OK). Errors happen both with Julia 0.2.1 and 0.3-dev (details in the thread).
Also, this happens only with complex matrices, not with real matrices.
Another clue is that I have run the code on the AMD 8320, in Ubuntu 12.10 over VirtualBox, using Ubuntu's Julia 0.2.1, and there the code runs OK. This seems to be related to Windows 7 + the 8-core AMD, a chip with a not-so-common number of processors (I don't know if, or how, the '*' routine in OpenBLAS uses multi-threading and/or the multiple available cores...).
If someone has a similar setup (Windows + AMD 8-core), it would be interesting to know the results of the test...
Below is the code and a sample result (I generate random matrices; the error is always present). Note that the multiplication seems to work with real matrices but fails with complex ones. I also use lu() to get a couple of matrices, but the error is in the multiplication, not in the factoring; the permutation vector p from lu() also doesn't matter. Finally, I compare the native matrix multiplication '*' with a simple triple loop doing the same job. The error always appears with the random matrices, but I spotted the misbehavior when developing a small program for simulating AC circuits (an example output is shown in the thread).
N=4
M=rand(Float64,N,N)
L,U,p=lu(M)
LU=L*U
println(p)
LoopLU=zeros(Float64,N,N)
for i in 1:N
    for j in 1:N
        s = 0.0 + 0.0im
        for k in 1:N s += L[i,k]*U[k,j] end
        LoopLU[i,j] = s
    end
end
println(LU-LoopLU) # IT IS ZERO
MC=zeros(Complex{Float64},N,N)
MC = rand(Float64,N,N)+(0.0+1.0im)*rand(Float64,N,N) # complex random matrix
L,U,p=lu(MC)
LU=L*U
println(p)
LoopLU=zeros(Complex{Float64},N,N)
for i in 1:N
    for j in 1:N
        s = 0.0 + 0.0im
        for k in 1:N s += L[i,k]*U[k,j] end
        LoopLU[i,j] = s
    end
end
println(LU-LoopLU) # GIVES != ZERO!!!
[3,4,2,1]
[0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0]
[4,2,3,1]
Complex{Float64}[0.0 + 0.0im 0.0 + 0.0im 0.010724319095534562 -
0.00436327616604229im 0.05626670997775374 - 0.04487609382080637im
0.0 + 0.0im 0.0 + 1.1102230246251565e-16im
0.00031766699802464327 - 0.0007146125364691867im 0.028237510396673327
This issue is something I noticed while developing _StaticArrays_. For _very_ small sizes, there is a lot of overhead coming from Julia, let alone BLAS. For instance, there is specialized 2x2 and 3x3 matrix multiplication code, but after a _very_ quick look through the code, it passes through several functions repeatedly checking size matching, checking the character (!!) that describes whether it's transposed or not, etc., before it gets to the evaluation.
A lot of this run-time overhead could be eliminated with e.g. Val{'N'} instead of 'N' for the (lack of) transposition, and by taking care to check the dimensions only once (and possibly even elide that where possible via @inbounds, or something). Out of curiosity, is some of this on the roadmap? (e.g. lazy transposition, taking vector transposes slowly seriously, or whatever might force us to re-write the A_mul_Bc interface?)
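As an illustration of the Val{'N'} idea (a made-up kernel, not the actual A_mul_Bc code path), the transposition flag can be lifted into the type domain so the branch is resolved during method specialization rather than checked at run time:

```julia
using LinearAlgebra

apply_mul!(y, ::Val{'N'}, A, x) = mul!(y, A, x)              # no transpose
apply_mul!(y, ::Val{'T'}, A, x) = mul!(y, transpose(A), x)   # transpose

A = rand(3, 3); x = rand(3); y = similar(x)
apply_mul!(y, Val('T'), A, x)   # which kernel runs is decided at compile time
```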
@StefanKarpinski wrote:
2. Automatic free insertion in loops: if we can know when a temporary created in a loop can be immediately freed at the end of the loop body, then the next iteration can reuse that memory.
That sounds pretty cool, too!
Sorry to resurrect an old thread, but I am also seeing performance issues for small Mat*Vec operations. Is there a fix on the horizon for this? I just want to know what's the best way to solve the problem. I am happy to write my own optimized matvec function, but then I wonder which other native functions I should re-write. Opens up a can of worms in my head :-)
Btw: we are talking about 3x3 matrix and 3x1 vector operations, common when working with points in Cartesian space.
I am on the Julia 0.6 release using OpenBLAS.
julia> versioninfo()
Julia Version 0.6.1-pre.0
Commit dcf39a1 (2017-06-19 13:06 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)
Please ask questions at discourse.julialang.org.
@andreasnoack Understood. Generally I would, but it was related to this thread so I asked it here.
Use StaticArrays.jl (they also have mutable versions).
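A minimal example for the 3x3 case mentioned above:

```julia
using StaticArrays

A = @SMatrix rand(3, 3)
x = @SVector rand(3)
y = A * x                 # fully unrolled, no BLAS call, no heap allocation

m = MMatrix{3,3}(A)       # mutable variant with the same fixed-size benefits
m[1, 1] = 0.0
```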
StaticArrays.jl is the only real solution for small matvecs, as @KristofferC points out above. I am closing this one since it is unlikely OpenBLAS is going to be able to do very much about it, and even if it does, StaticArrays.jl will still be much better.
I will note that I don't quite agree with that, Viral - there are problems here we can feasibly fix in LinearAlgebra, even if the result will be slightly slower than with StaticArrays. E.g. last time I looked, the dispatch pattern for matrix multiplication is quite deep and results in multiple re-checking of sizes and so on. This would have been written differently if we had thought of a native multiplication algorithm for small matrices with an override to call BLAS only for large cases (the code appeared to me to have the opposite philosophy and/or history - we aim to dispatch to BLAS and have a native/generic algorithm implemented later on as an afterthought).
My main reason to close this issue was that things won't change in OpenBLAS. Perhaps we should rename this issue to be about the LinearAlgebra fix, then, rather than about the upstream issue. I agree that it can at least be made better.
BTW, while matrix-matrix multiply has a deep dispatch pattern, I believe matrix-vector is not so deep.