Just ran `make test-perf` and observed this. Most results are the same, and we've improved on about 6 tests, but unfortunately there are some significant regressions:
| test | 0.4 | 4f65737874 | 4f65737874 -O3 | b029f5078f |
| --- | --- | --- | --- | --- |
| parse_int | 0.190 | 0.315 | | 0.245 |
| cons | 80.837 | 91.86 | | 90.14 |
| gk | 52.035 | 62.92 | | 63.84 |
| sparsemul | 32.5 | | | 37.14 (threads) |
| sparsemul2 | 62.836 | 65.73 | 63.23 | 68.5 |
| simplex | 39.209 | 55.02 | 39.90 | 43.83 |
| splitline | 49.86 | 56.87 | 60.86 | 67.44 |
| add1 | 33.63 | 44.09 | 37.82 | 36.6 |
| add1_logical | 41.797 | 105.3 | 60.12 | 45.4 |
| binary_trees | 25.5 | | | 29.64 |
| k_nucleotide | 61 | | | 75.34 (threads) |
There is also a significant regression in the spell benchmark.
0.4:
spell 10090.812 10533.558 10285.806 183.614
0.5:
spell 11888.845 12585.925 12231.564 297.824
I tried changing some of the containers to use `String` instead of `AbstractString`, but it didn't seem to help much. Using `SubString{String}` as the key type of the Dict in `train` gives

spell 11395.126 11790.546 11550.910 169.565

Slightly better, but still slower than 0.4 and still 2.5x slower than Python.
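For context, here is a minimal sketch of the kind of change tried, assuming a Norvig-style trainer (the actual `train` lives in the spell benchmark and may differ):

```julia
# Keying the Dict on SubString{String} lets the words returned by
# matchall (which are SubString views) be used directly as keys,
# avoiding a String allocation per word.
function train(text)
    model = Dict{SubString{String},Int}()   # was Dict{AbstractString,Int}
    for word in matchall(r"[a-z]+", lowercase(text))
        model[word] = get(model, word, 0) + 1
    end
    return model
end
```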
Some of the planned changes may improve this – in particular, the new `Char` type.
We should probably try to track down what led to the slowdown though - then any internal representation changes would be speedups, not just countering regressions elsewhere.
A retrospective daily (?) history of the perf tests might help; that way we could see whether a single commit, or just a few, were to blame. cc @jrevels
Looks like #16236 improved things a bit, but not enough.
Spell test after #16236:
spell 11070.436 11496.856 11296.590 204.591
It probably makes sense to run benchmarks on the version right before the String change to figure out how much of this is due to various other changes and how much is due to that change.
See here. Hopefully I got the commit range correct - I meant to do the most recent commit related to #16058 vs. the commit right before #16058 in the history.
I wonder how much of this can be traced back to #15997. I was trying to track down a recent performance regression in https://github.com/JuliaLang/julia/issues/15541 (thinking I had broken typemap), and found that @tkelman's commit 53665868270c5202a46e30bae7412faf13a9eabe was responsible for the ~15-20% performance regression in that test.
Excellent point. I'll retry without threading.
I expect threading to have a larger performance impact on the C runtime (e.g. #15541) than on generated code (especially runtime-JITed code), since the optimization we use for TLS access doesn't really work for C code...
After #16439 is merged, it would be helpful to see an updated table with JULIA_THREADS=0. I think most of these have been addressed or are due to Box (such as the json test), which is being tracked separately.
Measured again, still with threads enabled. Not many changes, but rand_mat_stat is much worse. k_nucleotide is so much faster that it's suspicious. Needs investigating.
Tried with `JULIA_THREADS = 0` and the differences don't seem significant. Would be good to try on other systems though.
Does this change much with different LLVM versions?
The `rand_mat_stat` slowdown seems to be because of slowdowns in `hcat` and `hvcat`. Those don't show up in a profile running the code on 0.4.
julia> Profile.print()
498 ./REPL.jl:92; macro expansion
498 ./REPL.jl:62; eval_user_input(::Any, ::Base.REP...
498 ./boot.jl:225; eval(::Module, ::Any)
498 ./profile.jl:16; macro expansion;
3 ./REPL[1]:6; randmatstat(::Int64)
1 ./random.jl:83; mt_pop!
1 ./random.jl:1105; randn
1 ./random.jl:1107; randn
1 ./random.jl:1119; randn_unlikely(::MersenneTwist...
3 ./REPL[1]:7; randmatstat(::Int64)
1 ./random.jl:1105; randn
2 ./random.jl:1107; randn
1 ./random.jl:1119; randn_unlikely(::MersenneTwist...
119 ./REPL[1]:10; randmatstat(::Int64)
10 ./abstractarray.jl:653; typed_hcat(::Type{T}, ::Array{...
1 ./abstractarray.jl:658; typed_hcat(::Type{T}, ::Array{...
101 ./abstractarray.jl:666; typed_hcat(::Type{T}, ::Array{...
5 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64,...
1 ./abstractarray.jl:670; typed_hcat(::Type{T}, ::Array{...
5 ./abstractarray.jl:672; typed_hcat(::Type{T}, ::Array{...
1 ./array.jl:59; copy!(::Array{Float64,2}, ::Int6...
283 ./REPL[1]:11; randmatstat(::Int64)
11 ./abstractarray.jl:910; typed_hvcat(::Type{T}, ::Tuple...
177 ./abstractarray.jl:914; typed_hvcat(::Type{T}, ::Tuple...
134 ./array.jl:719; hcat(::Array{Float64,2}, ::Arr...
129 ./abstractarray.jl:666; typed_hcat(::Type{T}, ::Array...
4 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64...
5 ./abstractarray.jl:672; typed_hcat(::Type{T}, ::Array...
1 ./array.jl:58; copy!(::Array{Float64,2}, ::In...
25 ./tuple.jl:10; getindex(::Tuple{Array{Float64...
94 ./abstractarray.jl:917; typed_hvcat(::Type{T}, ::Tuple...
5 ./abstractarray.jl:690; typed_vcat(::Type{T}, ::Array{...
2 ./abstractarray.jl:691; typed_vcat(::Type{T}, ::Array{...
2 ./reduce.jl:64; mapfoldl(::Base.##30#32, ::Func...
71 ./abstractarray.jl:698; typed_vcat(::Type{T}, ::Array{...
2 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64...
7 ./abstractarray.jl:703; typed_vcat(::Type{T}, ::Array{...
1 ./multidimensional.jl:339; _setindex!(::Base.LinearFast, ...
4 ./multidimensional.jl:340; _setindex!(::Base.LinearFast, ...
2 ./multidimensional.jl:380; macro expansion
1 ./operators.jl:420; setindex_shape_check(::Array{...
1 ./operators.jl:424; setindex_shape_check(::Array{...
1 ./multidimensional.jl:383; macro expansion
28 ./REPL[1]:12; randmatstat(::Int64)
1 ./linalg/dense.jl:148; trace(::Array{Float64,2})
20 ./linalg/matmul.jl:148; At_mul_B!
20 ./linalg/matmul.jl:256; syrk_wrapper!(::Array{Float64,...
16 ./linalg/blas.jl:1145; syrk!(::Char, ::Char, ::Float...
1 ./linalg/blas.jl:1152; syrk!(::Char, ::Char, ::Float...
1 ./linalg/matmul.jl:201; copytri!(::Array{Float64,2}, ...
2 ./linalg/matmul.jl:203; copytri!(::Array{Float64,2}, ...
1 ./linalg/matmul.jl:0; At_mul_B(::Array{Float64,2}, ::...
4 ./linalg/matmul.jl:146; At_mul_B(::Array{Float64,2}, ::...
27 ./REPL[1]:13; randmatstat(::Int64)
12 ./linalg/dense.jl:172; ^(::Array{Float64,2}, ::Int64)
8 ./intfuncs.jl:90; power_by_squaring(::Array{Float...
8 ./linalg/matmul.jl:330; gemm_wrapper!(::Array{Float64,...
8 ./linalg/blas.jl:963; gemm!(::Char, ::Char, ::Float6...
4 ./linalg/matmul.jl:129; *
6 ./linalg/matmul.jl:148; At_mul_B!
1 ./linalg/matmul.jl:255; syrk_wrapper!(::Array{Float64,2...
1 ./abstractarray.jl:42; stride(::Array{Float64,2}, ::Int64)
5 ./linalg/matmul.jl:256; syrk_wrapper!(::Array{Float64,2...
2 ./linalg/blas.jl:1145; syrk!(::Char, ::Char, ::Float6...
3 ./linalg/matmul.jl:203; copytri!(::Array{Float64,2}, :...
5 ./linalg/matmul.jl:146; At_mul_B(::Array{Float64,2}, ::...
32 ./linalg/dense.jl:172; ^
22 ./intfuncs.jl:90; power_by_squaring(::Array{Float...
1 ./linalg/matmul.jl:304; gemm_wrapper!(::Array{Float64,...
1 ./linalg/matmul.jl:311; gemm_wrapper!(::Array{Float64,...
2 ./linalg/matmul.jl:329; gemm_wrapper!(::Array{Float64,...
2 ./abstractarray.jl:0; stride(::Array{Float64,2}, ::Int64)
18 ./linalg/matmul.jl:330; gemm_wrapper!(::Array{Float64,...
18 ./linalg/blas.jl:963; gemm!(::Char, ::Char, ::Float6...
10 ./linalg/matmul.jl:129; *
3 ./random.jl:1135; randn
In the following, "mine" refers to commit e280a27.

One commit before mine:
julia,fib,0.042998,0.128659,0.046870,0.002392
julia,parse_int,0.242335,2.028559,0.268087,0.078824
julia,mandel,0.136463,0.198907,0.138009,0.003110
julia,quicksort,0.316253,0.580323,0.334022,0.013232
julia,pi_sum,40.930918,41.945148,41.060571,0.152101
julia,rand_mat_stat,17.376123,20.571832,18.228378,0.572528
julia,rand_mat_mul,44.568483,56.915150,47.338284,2.601669
julia,printfd,20.593159,21.428679,20.693425,0.141257
julia,micro.mem,270.566406,270.566406,270.566406,0.000000
mine:
julia,fib,0.042982,0.124890,0.046680,0.002268
julia,parse_int,0.251515,1.852985,0.279358,0.079412
julia,mandel,0.136527,0.205742,0.138098,0.002956
julia,quicksort,0.313997,0.465756,0.332772,0.007824
julia,pi_sum,40.936211,41.703891,41.049687,0.113519
julia,rand_mat_stat,39.680477,46.019009,41.446778,1.222534
julia,rand_mat_mul,44.922529,55.156609,47.373832,2.114620
julia,printfd,20.579925,21.573189,20.672500,0.166164
julia,micro.mem,271.417969,271.417969,271.417969,0.000000
almost latest master (from yesterday, I think):
julia,fib,0.044243,0.119606,0.046837,0.003259
julia,parse_int,0.289132,7.768136,0.322119,0.152336
julia,mandel,0.136763,0.250258,0.138992,0.006331
julia,quicksort,0.318064,0.534072,0.334189,0.016374
julia,pi_sum,40.511055,55.959972,46.687331,4.624115
julia,rand_mat_stat,50.049147,76.699907,59.645703,10.819121
julia,rand_mat_mul,71.592445,157.663542,89.907365,20.233438
julia,printfd,23.629916,42.501379,30.292729,5.762164
julia,micro.mem,268.761719,268.761719,268.761719,0.000000
It does seem that there's a doubling (ouch) from my commit. However, since then it seems to have gotten even worse (especially rand_mat_mul). This is just one run of `test/perf/micro/perf.jl` on each commit.
Splatting penalty, probably? Some manually written-out cases for small numbers of inputs may help.
@pkofod Thanks for looking into it.
We need to be able to take advantage of the definition at https://github.com/JuliaLang/julia/blob/7f95e1b303c43ba8b39bd1e3c2a609cdcf8f386c/base/abstractarray.jl#L799, which is particularly fast.
Another way to look at it is that `typed_hvcat` is really slow, since it calls `hcat` on all the arguments followed by `vcat`. That could certainly be improved.
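For illustration, the structure being described is roughly this (a simplified sketch, not Base's exact `typed_hvcat`):

```julia
# hvcat-style fallback: hcat each block row, then vcat the rows, so
# every call allocates one intermediate matrix per block row plus the
# final result.
function hvcat_sketch(rows::Tuple{Vararg{Int}}, blocks...)
    out = Vector{Any}(length(rows))
    a = 1
    for i = 1:length(rows)
        out[i] = hcat(blocks[a:a+rows[i]-1]...)   # intermediate per row
        a += rows[i]
    end
    return vcat(out...)                           # then a second full copy
end
```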
See #16741
After #16741 there is a significant improvement in rand_mat_stat, but it is still much slower than 0.4.
k_nucleotide mystery solved: we are downloading a data file for this, and the file has changed and gotten much smaller.
...and the results aren't pretty. Much slower than 0.4. If I change `AbstractString` to `String` it helps a lot, but it is still slower than 0.4.
> After #16741 there is a significant improvement in rand_mat_stat, but it is still much slower than 0.4.
If #16741 fixed the perf regression I introduced, maybe it is the increase after commit e280a27 that is still lurking around. My numbers suggested that since e280a27, rand_mat_stat run time had increased by 50% and rand_mat_mul by 100%.
k_nucleotide is largely string operations, and a big issue seems to be `String` being slower than `ASCIIString`. Fortunately at least a couple of the issues are easily fixable. Here are the culprits:

- `String` constructor copying (already about to be fixed)
- `endof` is very slow (compared to just returning the length of an array)
- `String` is using the abstract fallback for `nextind`, which is very slow. It needs a custom implementation (a sketch follows this list).
- `==` for `String` is slow, I think only because of `endof`.
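As a sketch of the `nextind` item above (an illustration, not Base's actual implementation): for a UTF-8 backed string, advancing an index only requires skipping continuation bytes, whereas the `AbstractString` fallback re-decodes characters.

```julia
# `bytes` stands in for the string's underlying byte buffer (an
# assumption; Base's actual field layout may differ). UTF-8
# continuation bytes match the bit pattern 0b10xxxxxx.
function nextind_sketch(bytes::Vector{UInt8}, i::Int)
    n = length(bytes)
    j = i + 1
    @inbounds while j <= n && (bytes[j] & 0xc0) == 0x80
        j += 1      # still inside a multi-byte character; keep skipping
    end
    return j
end
```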
k_nucleotide takes 71ms on 0.4. With threads off, String instead of AbstractString, and the String changes I listed above, it takes 61ms on 0.5, which is great. If I change String back to AbstractString, the time jumps to 94ms. This seems to point to dynamic dispatch as the remaining issue.
Dynamic dispatch is probably also the issue in binary_trees, since that's pretty much all it does.
> binary_trees

Yes, profiling shows the expensive line is line 32, `check(t::Node) = t.info + check(t.left) - check(t.right)`, which has no type information for any of those operations (the most expensive of which ends up being the `+`, since it has so many cache entries).
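To illustrate (a hypothetical reduction, not the benchmark's actual definitions): with untyped fields, every call in `check` is a dynamic dispatch.

```julia
type Node            # 0.5 syntax
    info::Int
    left             # ::Any — check(t.left) must dispatch at runtime
    right            # ::Any — likewise
end
check(t::Node) = t.info + check(t.left) - check(t.right)
check(::Void) = 0    # leaf case, assumed for the sketch
```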
> json

This test is almost entirely measuring the cost of mutating state via closures.
> add1_logical

This should be essentially https://github.com/JuliaLang/julia/issues/16243.
The regression in `rand_mat_stat` boils down to a regression in the `Array` constructor under certain circumstances. Take for example the following functions:
function foo(t)
    n = 5
    v = zeros(t)
    for i = 1:t
        a = randn(n,n)
        b = randn(n,n)
        c = randn(n,n)
        d = randn(n,n)
        A = (a,b,c,d)
        T = Base.promote_eltype(A...)
        nrows = size(A[1],1)
        ncols = 0
        for a in A; ncols += size(a,2); end
        B = similar(full(A[1]), T, nrows, ncols)
        v[i] = B[end]
    end
    return v
end

function baz(A::AbstractVecOrMat...)
    T = Base.promote_eltype(A...)
    nargs = length(A)
    nrows = size(A[1],1)
    ncols = 0
    for j = 1:nargs; ncols += size(A[j],2); end
    B = similar(full(A[1]), T, nrows, ncols)
    return B
end

function bar(t)
    n = 5
    v = zeros(t)
    for i = 1:t
        a = randn(n,n)
        b = randn(n,n)
        c = randn(n,n)
        d = randn(n,n)
        B = baz(a,b,c,d)
        v[i] = B[end]
    end
    return v
end
`foo` performs the same in 0.4.5 and master. But `bar` is slower on master, and profiling shows that the line `B = similar(full(A[1]), T, nrows, ncols)` is to blame, which in turn means that the `Array` constructor is struggling in this case.
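For reference, a quick harness along these lines (an assumption, not from the original comment) reproduces the gap:

```julia
foo(1); bar(1)        # force compilation first
@time foo(10^5)       # roughly the same on 0.4.5 and master
@time bar(10^5)       # slower on master; time goes to baz's `similar` line
```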
This is as far as I can get for now (I'll have to learn more about the internals to be of more help)
A couple of ideas:

1) cache `apply_type` results for types with missing trailing parameters (like `Array{T}`)
2) restore the optimization of `f(A...) = g(A...)` in codegen (see the sketch below)
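For concreteness, idea (2) refers to forwarding definitions of this shape (the names here are placeholders):

```julia
g(args...) = length(args)   # hypothetical callee
f(A...) = g(A...)           # ideally lowers to a direct call of g,
                            # without rebuilding the argument tuple
```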
I'm guessing https://github.com/JuliaLang/julia/pull/16498 might be causing this. So 1) would likely be a solution.
(2) is in #16836.
TODO: try benchmarking `julia -O3` and seeing if it improves benchmarks; also see how bad the compilation times are with `-O3` these days.
Table updated. I also tried -O3 (just for the benchmarks; didn't rebuild the sysimg with it). -O3 times are given only where significantly different from the normal times. Indeed, it helps some of the worst regressions a lot, in the case of `sparsemul` taking it from slower than 0.4 to faster.
(the sysimage should be defaulting to building with -O3)
JULIA_THREADS=0 or =1 for the new columns?
Ah, right. THREADS=1 in the new columns.
The problem with `add1` and `add1_logical` is with `zip`. Changing
function _unsafe_getindex(::LinearIndexing, src::AbstractArray, I::AbstractArray{Bool})
    shape = index_shape(src, I)
    dest = similar(src, shape)
    size(dest) == shape || throw_checksize_error(dest, shape)
    D = eachindex(dest)
    Ds = start(D)
    for (i, s) in zip(eachindex(I), eachindex(src))
        @inbounds Ii = I[i]
        if Ii
            d, Ds = next(D, Ds)
            @inbounds dest[d] = src[s]
        end
    end
    dest
end
to something similar to what it was before
function _unsafe_getindex(::LinearIndexing, src::AbstractArray, I::AbstractArray{Bool})
    shape = index_shape(src, I)
    dest = similar(src, shape)
    size(dest) == shape || throw_checksize_error(dest, shape)
    D = eachindex(dest)
    Ds = start(D)
    s = 0
    for i in eachindex(I)
        s += 1
        @inbounds if I[i]
            d, Ds = next(D, Ds)
            dest[d] = src[s]
        end
    end
    dest
end
makes them faster than in 0.4. This was introduced in #15434 and friends.
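For anyone wanting to reproduce this, a minimal repro (assumed, not from the thread) that exercises `_unsafe_getindex` through logical indexing:

```julia
x = rand(10^7)
mask = x .> 0.5       # array of Bool
x[mask]               # warm up / compile
@time x[mask]         # dominated by _unsafe_getindex
```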
Good find @pabloferz!
cc @timholy, I thought the point of replacing indexing loops like that in so many places in base with zip and eachindex was that it had gotten to performance parity and wouldn't slow the common case down? Did that regress and become no longer the case?
Reading #15356 I noticed @nalimilan's comment, which might explain some of the findings above when building with -O3.
Dunno; it's also possible that wasn't run on nanosoldier, or that we lack(ed) the test.
The problem with changing it to the old version is that it's going to be deadly slow if `src` is `LinearSlow`. What is it in this test? We might need a version of `_unsafe_getindex` specialized for `LinearFast`.
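A minimal sketch of that suggestion, reusing the internal helpers from the snippets above: keep the zip-based method as the generic fallback, and add the counter-based body only for `Base.LinearFast` sources, where the manual linear index `s` is cheap.

```julia
function _unsafe_getindex(::Base.LinearFast, src::AbstractArray, I::AbstractArray{Bool})
    shape = index_shape(src, I)
    dest = similar(src, shape)
    size(dest) == shape || throw_checksize_error(dest, shape)
    D = eachindex(dest)
    Ds = start(D)
    s = 0                     # linear index into src — cheap for LinearFast
    for i in eachindex(I)
        s += 1
        @inbounds if I[i]
            d, Ds = next(D, Ds)
            dest[d] = src[s]
        end
    end
    dest
end
```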
I wonder if #9080 has gotten worse?
It looks like there's a trivial performance bug in closure lowering (which for some reason was already marked with TODO in julia-syntax.scm), which seems to be responsible for about half of the json difference. Most of the rest appears to be due to JULIA_THREADS (possibly https://github.com/JuliaLang/julia/issues/16804).
@time sumcart_manual(A)
@time sumcart_iter(A)
yields
0.108923 seconds (5 allocations: 176 bytes)
0.589222 seconds (5 allocations: 176 bytes)
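Presumably the two variants being timed above are something like this (a hypothetical reconstruction): manual nested loops vs. `CartesianRange` iteration.

```julia
function sumcart_manual(A::AbstractMatrix)
    s = zero(eltype(A))
    @inbounds for j = 1:size(A, 2), i = 1:size(A, 1)
        s += A[i, j]          # hand-written column-major loop
    end
    return s
end

function sumcart_iter(A)
    s = zero(eltype(A))
    @inbounds for I in CartesianRange(size(A))
        s += A[I]             # same traversal via CartesianIndex
    end
    return s
end
```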
I can probably get rid of all generated functions in CartesianIndex/CartesianRange operations. Is that useful in any event?
Yes, getting rid of generated functions is always nice, if possible.
Looking into the `add1_logical` regression, it looks like LLVM 3.7 (SROA?) is no longer able to optimize `countnz(logical_y)` correctly. As written (and if we make for-loops have implicit `@inbounds`), it should be vectorizable now (as LLVM 3.3 can do on current master), and thus much faster than 0.4 (but LLVM 3.7 fails to do this).
Addendum: for some reason, putting `@inbounds` in front of the for-loop elides bounds checking inside `next`, but I think that's an inliner bug for the effect-free computation of `getindex`.

Edit to addendum: LLVM 3.7 gets much worse at generating code if the `@inbounds` trick is used (unlike LLVM 3.3, which vectorizes it).
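For reference, roughly the kind of reduction under discussion (a simplified stand-in for what `countnz` does over a logical array, not Base's exact code):

```julia
function countnz_sketch(x::Vector{Bool})
    n = 0
    @inbounds for i in eachindex(x)
        n += Int(x[i])   # branch-free accumulate; should vectorize
    end
    return n
end
```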
> Addendum: for some reason, putting @inbounds in front of the for-loop elides bounds checking inside next, but I think that's an inliner bug for the effect-free computation of getindex
If you added the addendum after the merger of #16260, note that `next` now has a `@_propagate_inbounds_meta`:
Progress! We're starting to do pretty well; several of the regressions are fixed, and most of the ones that remain are minor, around ~10%. Regressions marked `(threads)` are only a problem with threads enabled. Most of the numbers in the table are affected by enabling threads, but for these, threading makes the difference between being a regression vs. not.
Shall we move this to 0.5.x then? (And make sure that this time we actually work through the .x issues.)
Has anyone done this yet, and does it work? @nanosoldier runbenchmarks(ALL, vs=":release-0.4")`
Missed a backtick, try again: @nanosoldier runbenchmarks(ALL, vs=":release-0.4")
Will that work on an issue? Try on the latest commit to master maybe?
It doesn't work in issues, since there's no explicit commit that can be assumed by context (though master would be a reasonable default). You can comment on commits though, as Tony mentioned. I triggered the job here.
I did a comparison between the two reports and this is a filtered (not completely exhaustive) summary of what is consistently slower. The number to the right is the slowdown.
["string","join"]
1.19
["string","replace"]
2.54
["sparse","index",("spvec","integer",10000)]
1.81
["sparse","index",("spmat","array",1000)]
3.00
["sparse","index",("spmat","col","array",100)]
3.15
["sort","mergesort",("sortperm! reverse","random")]
1.23
and all other mergesort ones are consistently slower
["sort","issorted",("reverse","random")]
2.71
["sort","issorted",("reverse","ascending")]
2.85
["simd",("two_reductions","Float64",4096)]
1.29
["simd",("local_arrays","Int32",4096)]
27.83
["simd",("local_arrays","Float64",4096)]
18.57
["simd",("conditional_loop!","Float32",4095)]
2.95
["shootout","revcomp"]
23.50
["shootout","fasta"]
1.24
["scalar","fastmath",("add","Complex{Float32}")]
1.50
["scalar","fastmath",("add","Complex{Float64}")]
2.00
Basically all scalar arithmetic on BigInt and BigFloat are slower
["scalar","arithmetic",("div","Complex{BigFloat}","UInt64")]
["problem","stockcorr"]
1.23
["problem","ziggurat"]
1.22
["problem","simplex"]
1.36
["problem","laplacian","laplace_sparse_matvec"]
3.98
["parallel","remotecall",("identity",1024)]
1.20
and the other remotecalls
Basically everything with tridiagonal and co, some examples:
["linalg","arithmetic",("-","Tridiagonal","Tridiagonal",256)]
1.45
["linalg","arithmetic",("-","Vector","Vector",256)]
1.44
["linalg","arithmetic",("-","Bidiagonal","Bidiagonal",1024)]
["array","index",("sumrange","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Float32,2}")]
1.38
["array","index",("sumlogical","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Float32,2}")]
1.20
and as well for other number types
["array","index",("sumlinear","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Int32,2}")]
1.82
["array","index",("sumelt_boundscheck","linspace(1.0,2.0,10000000)")]
2.06
["array","index",("sumcartesian","1:100000000")]
3.99
["array","growth",("push_single!",2048)]
1.78
["array","bool","bitarray_true_load!"]
1.17
["array","bool","boolarray_bool_load!"]
1.28
> If you added the addendum after the merger of #16260, note that next now has a @_propagate_inbounds_meta:
It's good to know that's now there, but I checked my source tree before posting to confirm this was a known inliner bug, as it preceded that PR.
@JeffBezanson Are there concrete actions still required to move this issue to v0.5.x, or what's left that seems unexplained / uncorrected?
The regressions in splitline and gk are a bit worrying, but hopefully we can work on them in the RC period.
I'm more worried about the 27x regression on the `@simd` benchmarks. Anyone know why this loop:
function perf_local_arrays(V)
    # SIMD loop on local arrays declared without type annotations
    T, n = eltype(V), length(V)
    X = rand(T, n)
    Y = rand(T, n)
    Z = rand(T, n)
    @simd for i in eachindex(X)
        @inbounds X[i] = Y[i] * Z[i]
    end
    return X
end
doesn't vectorize? (Input: `rand(Float64, 4096)`)
@ArchRobison ^
It vectorizes on LLVM 3.8
And on 3.7.1 it vectorizes too with `-O3`. When it doesn't, it's because of https://github.com/JuliaLang/julia/issues/15369. All the TBAAs are correctly attached in codegen and maintained by LLVM (at least with https://github.com/JuliaLang/julia/pull/16897).
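One way to check whether the loop vectorized (an assumed workflow, not from the thread) is to inspect the generated IR for `<2 x double>` / `<4 x double>` vector operations:

```julia
V = rand(Float64, 4096)
perf_local_arrays(V)            # compile first
@code_llvm perf_local_arrays(V) # look for vector-width operations
```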
for those who may not have seen it, here's 0.5.0-rc2 vs 0.4.6: https://github.com/JuliaCI/BaseBenchmarkReports/blob/6d82a5518a25740eef4abde8359ea3cdbc630375/0350e57_vs_2e358ce/report.md
With the exception of the horrifying regressions on "simd", fairly satisfying overall. I'm especially pleased with the improvements in so many array benchmarks. Worried about the linalg regressions, but an optimist can always hope that my barrage of PRs this morning may help.
It _feels_ like the array arithmetic is just some inlining problem or something like that.
Closing this, since at this point the experiment will need to be repeated for 0.6.