Julia: misc. benchmark regressions since 0.4

Created on 29 Apr 2016 · 68 Comments · Source: JuliaLang/julia

Just ran make test-perf and observed this. Most results are the same, and we've improved on about 6 tests, but unfortunately there are some significant regressions:

| test | 0.4 | 4f65737874 | 4f65737874 -O3 | b029f5078f |
| --- | --- | --- | --- | --- |
| parse_int | 0.190 | 0.315 | | 0.245 |
| cons | 80.837 | 91.86 | | 90.14 |
| gk | 52.035 | 62.92 | | 63.84 |
| sparsemul | 32.5 | | | 37.14 (threads) |
| sparsemul2 | 62.836 | 65.73 | 63.23 | 68.5 |
| simplex | 39.209 | 55.02 | 39.90 | 43.83 |
| splitline | 49.86 | 56.87 | 60.86 | 67.44 |
| add1 | 33.63 | 44.09 | 37.82 | 36.6 |
| add1_logical | 41.797 | 105.3 | 60.12 | 45.4 |
| binary_trees | 25.5 | | | 29.64 |
| k_nucleotide | 61 | | | 75.34 (threads) |

Labels: help wanted, performance, regression

All 68 comments

There is also a significant regression in the spell benchmark.

0.4:

spell              10090.812 10533.558 10285.806  183.614

0.5:

spell              11888.845 12585.925 12231.564  297.824

I tried changing some of the containers to use String instead of AbstractString but it didn't seem to help much. Using SubString{String} as the key type of the Dict in train gives

spell              11395.126 11790.546 11550.910  169.565

Slightly better, but still slower than 0.4 and still 2.5x slower than Python.
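For reference, the kind of change tried above looks roughly like this (a hedged sketch; the spell benchmark's actual train function may differ):

# Hypothetical sketch: key the word-count Dict on the concrete type
# SubString{String} instead of AbstractString so lookups and insertions
# avoid dynamic dispatch on the key type.
function train(features::Vector{SubString{String}})
    model = Dict{SubString{String},Int}()
    for f in features
        model[f] = get(model, f, 0) + 1
    end
    return model
end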

Some of the planned changes may improve this, in particular the new Char type.

We should probably try to track down what led to the slowdown though - then any internal representation changes would be speedups, not just countering regressions elsewhere.

A retrospective daily (?) history of the perf tests might help; that way we could see whether a single commit (or a few) was to blame. cc @jrevels

Looks like #16236 improved things a bit, but not enough.

Spell test after #16236:

spell              11070.436 11496.856 11296.590  204.591

It probably makes sense to run benchmarks on the version right before the String change to figure out how much of this is due to various other changes and how much is due to that change.

> It probably makes sense to run benchmarks on the version right before the String change to figure out how much of this is due to various other changes and how much is due to that change.

See here. Hopefully I got the commit range correct - I meant to do the most recent commit related to #16058 vs. the commit right before #16058 in the history.

I wonder how much of this can be traced back to #15997. I was trying to track down a recent performance regression in https://github.com/JuliaLang/julia/issues/15541 (thinking I had broken typemap), and found that @tkelman's commit 53665868270c5202a46e30bae7412faf13a9eabe was responsible for the ~15-20% performance regression in that test.

Excellent point. I'll retry without threading.

I expect threading to have a larger performance impact on the C runtime (e.g. #15541) than on generated code (especially runtime-JITed code), since the optimization we use for TLS access doesn't really work for C code.

After #16439 is merged, it would be helpful to see an updated table with JULIA_THREADS=0. I think most of these have been addressed or are due to Box (such as the json test), which is being tracked separately.

Measured again, still with threads enabled. Not many changes, but rand_mat_stat is much worse. k_nucleotide is so much faster that it's suspicious. Needs investigating.

Tried with JULIA_THREADS = 0 and the differences don't seem significant. Would be good to try on other systems though.

Does this change much with different LLVM versions?

The rand_mat_stat slowdown seems to be because of slowdowns in hcat and hvcat. Those don't show up in a profile running the code on 0.4.

julia> Profile.print()
498 ./REPL.jl:92; macro expansion
 498 ./REPL.jl:62; eval_user_input(::Any, ::Base.REP...
  498 ./boot.jl:225; eval(::Module, ::Any)
   498 ./profile.jl:16; macro expansion;
    3   ./REPL[1]:6; randmatstat(::Int64)
     1 ./random.jl:83; mt_pop!
     1 ./random.jl:1105; randn
     1 ./random.jl:1107; randn
      1 ./random.jl:1119; randn_unlikely(::MersenneTwist...
    3   ./REPL[1]:7; randmatstat(::Int64)
     1 ./random.jl:1105; randn
     2 ./random.jl:1107; randn
      1 ./random.jl:1119; randn_unlikely(::MersenneTwist...
    119 ./REPL[1]:10; randmatstat(::Int64)
     10  ./abstractarray.jl:653; typed_hcat(::Type{T}, ::Array{...
     1   ./abstractarray.jl:658; typed_hcat(::Type{T}, ::Array{...
     101 ./abstractarray.jl:666; typed_hcat(::Type{T}, ::Array{...
      5 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64,...
     1   ./abstractarray.jl:670; typed_hcat(::Type{T}, ::Array{...
     5   ./abstractarray.jl:672; typed_hcat(::Type{T}, ::Array{...
      1 ./array.jl:59; copy!(::Array{Float64,2}, ::Int6...
    283 ./REPL[1]:11; randmatstat(::Int64)
     11  ./abstractarray.jl:910; typed_hvcat(::Type{T}, ::Tuple...
     177 ./abstractarray.jl:914; typed_hvcat(::Type{T}, ::Tuple...
      134 ./array.jl:719; hcat(::Array{Float64,2}, ::Arr...
       129 ./abstractarray.jl:666; typed_hcat(::Type{T}, ::Array...
        4 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64...
       5   ./abstractarray.jl:672; typed_hcat(::Type{T}, ::Array...
        1 ./array.jl:58; copy!(::Array{Float64,2}, ::In...
      25  ./tuple.jl:10; getindex(::Tuple{Array{Float64...
     94  ./abstractarray.jl:917; typed_hvcat(::Type{T}, ::Tuple...
      5  ./abstractarray.jl:690; typed_vcat(::Type{T}, ::Array{...
      2  ./abstractarray.jl:691; typed_vcat(::Type{T}, ::Array{...
       2 ./reduce.jl:64; mapfoldl(::Base.##30#32, ::Func...
      71 ./abstractarray.jl:698; typed_vcat(::Type{T}, ::Array{...
       2 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64...
      7  ./abstractarray.jl:703; typed_vcat(::Type{T}, ::Array{...
       1 ./multidimensional.jl:339; _setindex!(::Base.LinearFast, ...
       4 ./multidimensional.jl:340; _setindex!(::Base.LinearFast, ...
        2 ./multidimensional.jl:380; macro expansion
         1 ./operators.jl:420; setindex_shape_check(::Array{...
         1 ./operators.jl:424; setindex_shape_check(::Array{...
        1 ./multidimensional.jl:383; macro expansion
    28  ./REPL[1]:12; randmatstat(::Int64)
     1  ./linalg/dense.jl:148; trace(::Array{Float64,2})
     20 ./linalg/matmul.jl:148; At_mul_B!
      20 ./linalg/matmul.jl:256; syrk_wrapper!(::Array{Float64,...
       16 ./linalg/blas.jl:1145; syrk!(::Char, ::Char, ::Float...
       1  ./linalg/blas.jl:1152; syrk!(::Char, ::Char, ::Float...
       1  ./linalg/matmul.jl:201; copytri!(::Array{Float64,2}, ...
       2  ./linalg/matmul.jl:203; copytri!(::Array{Float64,2}, ...
     1  ./linalg/matmul.jl:0; At_mul_B(::Array{Float64,2}, ::...
     4  ./linalg/matmul.jl:146; At_mul_B(::Array{Float64,2}, ::...
    27  ./REPL[1]:13; randmatstat(::Int64)
     12 ./linalg/dense.jl:172; ^(::Array{Float64,2}, ::Int64)
      8 ./intfuncs.jl:90; power_by_squaring(::Array{Float...
       8 ./linalg/matmul.jl:330; gemm_wrapper!(::Array{Float64,...
        8 ./linalg/blas.jl:963; gemm!(::Char, ::Char, ::Float6...
      4 ./linalg/matmul.jl:129; *
     6  ./linalg/matmul.jl:148; At_mul_B!
      1 ./linalg/matmul.jl:255; syrk_wrapper!(::Array{Float64,2...
       1 ./abstractarray.jl:42; stride(::Array{Float64,2}, ::Int64)
      5 ./linalg/matmul.jl:256; syrk_wrapper!(::Array{Float64,2...
       2 ./linalg/blas.jl:1145; syrk!(::Char, ::Char, ::Float6...
       3 ./linalg/matmul.jl:203; copytri!(::Array{Float64,2}, :...
     5  ./linalg/matmul.jl:146; At_mul_B(::Array{Float64,2}, ::...
    32  ./linalg/dense.jl:172; ^
     22 ./intfuncs.jl:90; power_by_squaring(::Array{Float...
      1  ./linalg/matmul.jl:304; gemm_wrapper!(::Array{Float64,...
      1  ./linalg/matmul.jl:311; gemm_wrapper!(::Array{Float64,...
      2  ./linalg/matmul.jl:329; gemm_wrapper!(::Array{Float64,...
       2 ./abstractarray.jl:0; stride(::Array{Float64,2}, ::Int64)
      18 ./linalg/matmul.jl:330; gemm_wrapper!(::Array{Float64,...
       18 ./linalg/blas.jl:963; gemm!(::Char, ::Char, ::Float6...
     10 ./linalg/matmul.jl:129; *
    3   ./random.jl:1135; randn
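A quick way to isolate the hcat/hvcat cost seen in the profile above (a rough sketch; timings are machine-dependent):

# Time the concatenation paths in isolation, outside randmatstat.
a = randn(5, 5); b = randn(5, 5)
@time for _ in 1:10^5; hcat(a, b); end   # exercises typed_hcat
@time for _ in 1:10^5; [a b; b a]; end   # lowers to hvcat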

In the following, "mine" refers to commit e280a27.

One commit before mine

julia,fib,0.042998,0.128659,0.046870,0.002392
julia,parse_int,0.242335,2.028559,0.268087,0.078824
julia,mandel,0.136463,0.198907,0.138009,0.003110
julia,quicksort,0.316253,0.580323,0.334022,0.013232
julia,pi_sum,40.930918,41.945148,41.060571,0.152101
julia,rand_mat_stat,17.376123,20.571832,18.228378,0.572528
julia,rand_mat_mul,44.568483,56.915150,47.338284,2.601669
julia,printfd,20.593159,21.428679,20.693425,0.141257
julia,micro.mem,270.566406,270.566406,270.566406,0.000000

mine

julia,fib,0.042982,0.124890,0.046680,0.002268
julia,parse_int,0.251515,1.852985,0.279358,0.079412
julia,mandel,0.136527,0.205742,0.138098,0.002956
julia,quicksort,0.313997,0.465756,0.332772,0.007824
julia,pi_sum,40.936211,41.703891,41.049687,0.113519
julia,rand_mat_stat,39.680477,46.019009,41.446778,1.222534
julia,rand_mat_mul,44.922529,55.156609,47.373832,2.114620
julia,printfd,20.579925,21.573189,20.672500,0.166164
julia,micro.mem,271.417969,271.417969,271.417969,0.000000

almost latest master (from yesterday I think)

julia,fib,0.044243,0.119606,0.046837,0.003259
julia,parse_int,0.289132,7.768136,0.322119,0.152336
julia,mandel,0.136763,0.250258,0.138992,0.006331
julia,quicksort,0.318064,0.534072,0.334189,0.016374
julia,pi_sum,40.511055,55.959972,46.687331,4.624115
julia,rand_mat_stat,50.049147,76.699907,59.645703,10.819121
julia,rand_mat_mul,71.592445,157.663542,89.907365,20.233438
julia,printfd,23.629916,42.501379,30.292729,5.762164
julia,micro.mem,268.761719,268.761719,268.761719,0.000000

It does seem that there's a doubling (ouch) from my commit. However, since then it seems to have gotten even worse (especially rand_mat_mul). This is just one run of test/perf/micro/perf.jl on each commit.

Splatting penalty, probably? Some manually written-out cases for small numbers of inputs may help?

@pkofod Thanks for looking into it.

We need to be able to take advantage of the definition at https://github.com/JuliaLang/julia/blob/7f95e1b303c43ba8b39bd1e3c2a609cdcf8f386c/base/abstractarray.jl#L799, which is particularly fast.

Another way to look at it is that typed_hvcat is really slow, since it calls hcat on all the arguments followed by vcat. That could certainly be improved.
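The fallback is essentially the following shape (a paraphrase, not the exact Base source): each block row is materialized with hcat, then the rows are vcat'ed, so every call allocates one temporary per block row plus a final copy.

# Paraphrased shape of the generic fallback; `rows` gives the number of
# blocks in each block row, e.g. hvcat_sketch((2,2), a, b, b, a)
# mirrors [a b; b a].
function hvcat_sketch(rows::Tuple{Vararg{Int}}, as...)
    rs = Array{Any}(length(rows))
    a = 1
    for i = 1:length(rows)
        rs[i] = hcat(as[a:a-1+rows[i]]...)   # one temporary per block row
        a += rows[i]
    end
    return vcat(rs...)                       # another full copy at the end
end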

See #16741

After #16741 there is a significant improvement in rand_mat_stat, but it is still much slower than 0.4.

k_nucleotide mystery solved: we are downloading a data file for this, and the file has changed and gotten much smaller.

...and the results aren't pretty. Much slower than 0.4. If I change AbstractString to String it helps a lot, but is still slower than 0.4.

After #16741 there is a significant improvement in rand_mat_stat, but it is still much slower than 0.4.

If #16741 fixed the perf regression I introduced, maybe the increase after commit e280a27 is still lurking around. My numbers suggested that since e280a27, rand_mat_stat run time had increased by 50% and rand_mat_mul by 100%.

k_nucleotide is largely string operations, and a big issue seems to be String being slower than ASCIIString. Fortunately at least a couple of the issues are easily fixable. Here are the culprits:

  • String constructor copying (already about to be fixed)
  • endof is very slow (compared to just returning the length of an array)
  • String is using the abstract fallback for nextind, which is very slow. Needs a custom implementation (a sketch follows this list).
  • == for String is slow, I think only because of endof.
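
A minimal sketch of what a String-specific nextind could do (hedged; it assumes only that the string's bytes are reachable as a Vector{UInt8}, and is not the actual patch):

# Skip ahead one character by scanning past UTF-8 continuation bytes
# (0b10xxxxxx) in the underlying byte vector, instead of going through
# the generic AbstractString fallback.
function nextind_sketch(d::Vector{UInt8}, i::Int)
    i >= length(d) && return i + 1
    j = i + 1
    while j <= length(d) && d[j] & 0xc0 == 0x80
        j += 1
    end
    return j
end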

> nextind

#16648

k_nucleotide takes 71ms on 0.4. With threads off, String instead of AbstractString, and the String changes I listed above, it takes 61ms on 0.5, which is great. If I change String back to AbstractString, the time jumps to 94ms. This seems to point to dynamic dispatch as the remaining issue.

Dynamic dispatch is probably also the issue in binary_trees, since that's pretty much all it does.

> binary_trees

Yes, profiling shows the expensive line is `check(t::Node) = t.info + check(t.left) - check(t.right)`, which has no type information for any of those operations (the most expensive of which ends up being the `+`, since it has so many cache entries).
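For illustration only (a hypothetical layout, not the benchmark's exact definition), the problem is fields with no declared types:

# With untyped fields, every operation in `check` is a dynamic dispatch;
# annotating info::Int and giving left/right concrete types would let the
# compiler resolve `+` and the recursive calls statically.
type Node
    info
    left
    right
end
check(t::Node) = t.info + check(t.left) - check(t.right)
check(::Void) = 0   # hypothetical leaf representation: `nothing` children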

> json

this test is almost entirely measuring the cost of mutating state via closures
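A minimal illustration of that cost (not from the benchmark itself): a variable that is both captured by a closure and reassigned gets heap-allocated in a Core.Box, so every access is an untyped load.

# `n` is captured by `bump` and mutated, so it gets boxed; each increment
# goes through the box with no type information.
function count_tokens()
    n = 0
    bump() = (n += 1)
    for _ in 1:1000
        bump()
    end
    return n
end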

> add1_logical

this should be essentially https://github.com/JuliaLang/julia/issues/16243

The regression in rand_mat_stat boils down to a regression in the Array constructor under certain circumstances. Take for example the following functions:

# Same computation with the tuple of matrices built locally (no splatting):
function foo(t)
    n = 5
    v = zeros(t)
    for i = 1:t
        a = randn(n,n)
        b = randn(n,n)
        c = randn(n,n)
        d = randn(n,n)
        A = (a,b,c,d)
        T = Base.promote_eltype(A...)
        nrows = size(A[1],1)
        ncols = 0
        for a in A; ncols += size(a,2); end
        B = similar(full(A[1]), T, nrows, ncols)
        v[i] = B[end]
    end
    return v
end

# The same computation behind a varargs call; the splat is the difference:
function baz(A::AbstractVecOrMat...)
    T = Base.promote_eltype(A...)
    nargs = length(A)
    nrows = size(A[1],1)
    ncols = 0
    for j = 1:nargs; ncols += size(A[j],2); end
    B = similar(full(A[1]), T, nrows, ncols)
    return B
end

function bar(t)
    n = 5
    v = zeros(t)
    for i = 1:t
        a = randn(n,n)
        b = randn(n,n)
        c = randn(n,n)
        d = randn(n,n)
        B = baz(a,b,c,d)
        v[i] = B[end]
    end
    return v
end

foo performs the same on 0.4.5 and master. But bar is slower on master, and profiling shows that the line `B = similar(full(A[1]), T, nrows, ncols)` is to blame, which in turn means that the Array constructor is struggling in this case.

This is as far as I can get for now (I'll have to learn more about the internals to be of more help).
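One way to poke at the constructor in isolation (a hedged micro-benchmark, not a diagnosis; numbers are machine-dependent):

# When the element type arrives as a runtime value (inference can't see
# through the splat), Array{T} goes through apply_type on every call;
# compare with a statically known element type.
dyn(A...) = similar(A[1], Base.promote_eltype(A...), 5, 20)
conc(a)   = similar(a, Float64, 5, 20)
a = randn(5, 5)
@time for _ in 1:10^5; dyn(a, a); end
@time for _ in 1:10^5; conc(a); end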

A couple of ideas:
1) cache apply_type results for types with missing trailing parameters (like Array{T})
2) restore optimization of f(A...) = g(A...) in codegen

I'm guessing https://github.com/JuliaLang/julia/pull/16498 might be causing this. So 1) would likely be a solution.

(2) is in #16836.

TODO: try benchmarking with julia -O3 and see whether it improves the benchmarks; also check how bad compilation times are with -O3 these days.

Table updated. I also tried -O3 (just for the benchmarks; didn't rebuild the sysimg with it). -O3 times are given only where significantly different from the normal times. Indeed, it helps some of the worst regressions a lot, in the case of sparsemul taking it from slower than 0.4 to faster.

(the sysimage should be defaulting to building with -O3)

JULIA_THREADS=0 or =1 for the new columns?

Ah, right. THREADS=1 in the new columns.

The problem with add1 and add1_logical is with zip. Changing

function _unsafe_getindex(::LinearIndexing, src::AbstractArray, I::AbstractArray{Bool})
    shape = index_shape(src, I)
    dest = similar(src, shape)
    size(dest) == shape || throw_checksize_error(dest, shape)

    D = eachindex(dest)
    Ds = start(D)
    for (i, s) in zip(eachindex(I), eachindex(src))
        @inbounds Ii = I[i]
        if Ii
            d, Ds = next(D, Ds)
            @inbounds dest[d] = src[s]
        end
    end
    dest
end

back to something similar to what it was before

function _unsafe_getindex(::LinearIndexing, src::AbstractArray, I::AbstractArray{Bool})
    shape = index_shape(src, I)
    dest = similar(src, shape)
    size(dest) == shape || throw_checksize_error(dest, shape)

    D = eachindex(dest)
    Ds = start(D)
    s = 0
    for i in eachindex(I)
        s += 1
        @inbounds if I[i]
            d, Ds = next(D, Ds)
            dest[d] = src[s]
        end
    end
    dest
end

makes them faster than on 0.4. The zip version was introduced in #15434 and friends.

Good find @pabloferz!

cc @timholy, I thought the point of replacing indexing loops like that in so many places in base with zip and eachindex was that it had gotten to performance parity and wouldn't slow the common case down? Did that regress and become no longer the case?

Reading #15356, I noticed @nalimilan's comment, which might explain some of the findings above when building with -O3.

Dunno; it's also possible that wasn't run on nanosoldier, or that we lack(ed) the test.

The problem with changing it to the old version is that it's going to be deadly-slow if src is LinearSlow. What is it in this test? We might need a version of _unsafe_getindex specialized for LinearFast.
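A hedged sketch of that idea: keep the zip version as the generic fallback, but add a LinearFast method that uses a plain integer counter, since linear indexing into src is cheap there (the names match the surrounding Base code quoted above; this is not the actual fix).

function _unsafe_getindex(::LinearFast, src::AbstractArray, I::AbstractArray{Bool})
    shape = index_shape(src, I)
    dest = similar(src, shape)
    size(dest) == shape || throw_checksize_error(dest, shape)

    D = eachindex(dest)
    Ds = start(D)
    s = 0                       # plain counter: cheap for LinearFast arrays
    for i in eachindex(I)
        s += 1
        @inbounds if I[i]
            d, Ds = next(D, Ds)
            dest[d] = src[s]
        end
    end
    dest
end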

I wonder if #9080 has gotten worse?

It looks like there's a trivial performance bug in closure lowering (which for some reason was already marked with TODO in julia-syntax.scm), which seems to be responsible for about half of the json difference. Most of the rest appears to be due to JULIA_THREADS (possibly https://github.com/JuliaLang/julia/issues/16804).

#9080 has gotten insanely worse on 0.5:

@time sumcart_manual(A)
@time sumcart_iter(A)

yields

  0.108923 seconds (5 allocations: 176 bytes)
  0.589222 seconds (5 allocations: 176 bytes)
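
For context, the two functions from #9080 are roughly these (a reconstruction; the exact code is in that issue), summing a matrix by manual loops vs. CartesianRange iteration:

# Manual nested loops: the fast baseline.
function sumcart_manual(A::AbstractMatrix)
    s = 0.0
    @inbounds for j = 1:size(A, 2), i = 1:size(A, 1)
        s += A[i, j]
    end
    return s
end

# Same sum via CartesianRange iteration: the slow path on 0.5.
function sumcart_iter(A)
    s = 0.0
    @inbounds for I in CartesianRange(size(A))
        s += A[I]
    end
    return s
end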

I can probably get rid of all generated functions in CartesianIndex/CartesianRange operations. Is that useful in any event?

Yes, getting rid of generated functions is always nice, if possible.

Looking into the add1_logical regression, it looks like LLVM 3.7 (SROA?) is no longer able to optimize countnz(logical_y) correctly. As written (and if for-loops get implicit @inbounds), the loop should be vectorizable, and thus much faster than on 0.4, as LLVM 3.3 demonstrates on current master; LLVM 3.7 fails to do this.

Addendum: for some reason, putting @inbounds in front of the for-loop elides bounds checking inside next, but I think that's an inliner bug for the effect-free computation of getindex.

Edit to addendum: LLVM 3.7 gets much worse at generating code if the @inbounds trick is used (unlike LLVM 3.3, which vectorizes it).
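For reference, countnz reduces to a simple counting loop over the array, roughly like this (a paraphrase, not the exact Base code):

# Count nonzero elements; with @inbounds this is the kind of loop the
# vectorizer should turn into SIMD compares and reductions.
function countnz_sketch(a::AbstractArray)
    n = 0
    @inbounds for i in eachindex(a)
        n += (a[i] != 0)
    end
    return n
end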

> Addendum: for some reason, putting @inbounds in front of the for-loop elides bounds checking inside next, but I think that's an inliner bug for the effect-free computation of getindex.

If you added the addendum after the merger of #16260, note that next now has a @_propagate_inbounds_meta.

Progress! We're starting to do pretty well; several of the regressions are fixed, and most of the ones that remain are minor, around ~10%. Regressions marked (threads) are only a problem with threads enabled. Most of the numbers in the table are affected by enabling threads, but for these threading makes the difference between being a regression vs. not.

Shall we move this to 0.5.x then? (And make sure that this time we actually work through the .x issues.)

Has anyone done this yet, and does it work? @nanosoldier runbenchmarks(ALL, vs=":release-0.4")`

Missed a backtick, try again: `@nanosoldier runbenchmarks(ALL, vs=":release-0.4")`

Will that work on an issue? Try on the latest commit to master maybe?

It doesn't work in issues, since there's no explicit commit that can be assumed by context (though master would be a reasonable default). You can comment on commits though, as Tony mentioned. I triggered the job here.

I did a comparison between the two reports, and this is a filtered (not completely exhaustive) summary of what is consistently slower. The number to the right is the slowdown factor.

String

["string","join"] 1.19
["string","replace"] 2.54

Sparse

["sparse","index",("spvec","integer",10000)] 1.81
["sparse","index",("spmat","array",1000)] 3.00
["sparse","index",("spmat","col","array",100)] 3.15

Sort

["sort","mergesort",("sortperm! reverse","random")] 1.23
and all other mergesort ones are consistently slower

["sort","issorted",("reverse","random")] 2.71
["sort","issorted",("reverse","ascending")] 2.85

Simd

["simd",("two_reductions","Float64",4096)] 1.29
["simd",("local_arrays","Int32",4096)] 27.83
["simd",("local_arrays","Float64",4096)] 18.57
["simd",("conditional_loop!","Float32",4095)] 2.95

Shootout

["shootout","revcomp"] 23.50
["shootout","fasta"] 1.24

Scalar

["scalar","fastmath",("add","Complex{Float32}")] 1.50
["scalar","fastmath",("add","Complex{Float64}")] 2.00

Basically all scalar arithmetic on BigInt and BigFloat is slower, e.g.
["scalar","arithmetic",("div","Complex{BigFloat}","UInt64")]

Problem

["problem","stockcorr"] 1.23
["problem","ziggurat"] 1.22
["problem","simplex"] 1.36
["problem","laplacian","laplace_sparse_matvec"] 3.98

Parallel

["parallel","remotecall",("identity",1024)] 1.20
and the other remotecalls

LinAlg

Basically everything with Tridiagonal and related types; some examples:

["linalg","arithmetic",("-","Tridiagonal","Tridiagonal",256)] 1.45
["linalg","arithmetic",("-","Vector","Vector",256)] 1.44

["linalg","arithmetic",("-","Bidiagonal","Bidiagonal",1024)]

Array

["array","index",("sumrange","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Float32,2}")] 1.38
["array","index",("sumlogical","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Float32,2}")] 1.20
and likewise for other number types

["array","index",("sumlinear","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Int32,2}")] 1.82
["array","index",("sumelt_boundscheck","linspace(1.0,2.0,10000000)")] 2.06

["array","index",("sumcartesian","1:100000000")] 3.99
["array","growth",("push_single!",2048)] 1.78
["array","bool","bitarray_true_load!"] 1.17
["array","bool","boolarray_bool_load!"] 1.28

> If you added the addendum after the merger of #16260, note that next now has a @_propagate_inbounds_meta.

It's good to know that's there now, but I checked my source tree before posting to confirm this was a known inliner bug; it preceded that PR.

@JeffBezanson Are there concrete actions still required to move this issue to v0.5.x, or what's left that seems unexplained/uncorrected?

The regressions in splitline and gk are a bit worrying, but hopefully we can work on them in the RC period.

I'm more worried about the 27x regression on the @simd benchmarks. Anyone know why this loop:

function perf_local_arrays(V)
    # SIMD loop on local arrays declared without type annotations
    T, n = eltype(V), length(V)
    X = rand(T, n)
    Y = rand(T, n)
    Z = rand(T, n)
    @simd for i in eachindex(X)
        @inbounds X[i] = Y[i] * Z[i]
    end
    return X
end

doesn't vectorize? (Input: rand(Float64, 4096))
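
A standard way to check whether the loop vectorized (not from the thread): inspect the emitted IR and look for <4 x double>-style vector instructions in the loop body.

V = rand(Float64, 4096)
@code_llvm perf_local_arrays(V)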

@ArchRobison ^

It vectorizes on LLVM 3.8

And on 3.7.1 it vectorizes too with -O3. When it doesn't, it's because of https://github.com/JuliaLang/julia/issues/15369. All the TBAA annotations are correctly attached in codegen and maintained by LLVM (at least with https://github.com/JuliaLang/julia/pull/16897).

With the exception of the horrifying regressions on "simd", fairly satisfying overall. I'm especially pleased with the improvements in so many array benchmarks. Worried about the linalg regressions, but an optimist can always hope that my barrage of PRs this morning may help.

It _feels_ like the array arithmetic is just some inlining problem or something like that.

Closing this, since at this point the experiment will need to be repeated for 0.6.
