Just ran `make test-perf` and observed this. Most results are the same, and we've improved on about 6 tests, but unfortunately there are some significant regressions:
| test | 0.4 | 4f65737874 | 4f65737874 -O3 | b029f5078f |
| --- | --- | --- | --- | --- |
| parse_int | 0.190 | 0.315 | | 0.245 |
| cons | 80.837 | 91.86 | | 90.14 |
| gk | 52.035 | 62.92 | | 63.84 |
| sparsemul | 32.5 | | | 37.14 (threads) |
| sparsemul2 | 62.836 | 65.73 | 63.23 | 68.5 |
| simplex | 39.209 | 55.02 | 39.90 | 43.83 |
| splitline | 49.86 | 56.87 | 60.86 | 67.44 |
| add1 | 33.63 | 44.09 | 37.82 | 36.6 |
| add1_logical | 41.797 | 105.3 | 60.12 | 45.4 |
| binary_trees | 25.5 | | | 29.64 |
| k_nucleotide | 61 | | | 75.34 (threads) |
There is also a significant regression in the spell benchmark.
0.4:
spell 10090.812 10533.558 10285.806 183.614
0.5:
spell 11888.845 12585.925 12231.564 297.824
I tried changing some of the containers to use `String` instead of `AbstractString`, but it didn't seem to help much. Using `SubString{String}` as the key type of the Dict in `train` gives

spell 11395.126 11790.546 11550.910 169.565

Slightly better, but still slower than 0.4 and still 2.5x slower than Python.
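For context, here is a minimal sketch of the kind of change tried, assuming a Norvig-style trainer (the actual `train` lives in the spell benchmark and may differ):

```julia
# Keying the Dict on SubString{String} lets the words returned by
# matchall (which are SubString views) be used directly as keys,
# avoiding a String allocation per word.
function train(text)
    model = Dict{SubString{String},Int}()   # was Dict{AbstractString,Int}
    for word in matchall(r"[a-z]+", lowercase(text))
        model[word] = get(model, word, 0) + 1
    end
    return model
end
```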
Some of the planned changes may improve this – in particular, the new `Char` type.
We should probably try to track down what led to the slowdown though - then any internal representation changes would be speedups, not just countering regressions elsewhere.
A retrospective daily (?) history of the perf tests might help; that way we could see whether a single commit, or just a few, were to blame. cc @jrevels
Looks like #16236 improved things a bit, but not enough.
Spell test after #16236:
spell 11070.436 11496.856 11296.590 204.591
It probably makes sense to run benchmarks on the version right before the String change to figure out how much of this is due to various other changes and how much is due to that change.
See here. Hopefully I got the commit range correct - I meant to do the most recent commit related to #16058 vs. the commit right before #16058 in the history.
I wonder how much of this can be traced back to #15997. I was trying to track down a recent performance regression in https://github.com/JuliaLang/julia/issues/15541 (thinking I had broken typemap), and found that @tkelman's commit 53665868270c5202a46e30bae7412faf13a9eabe was responsible for the ~15-20% performance regression in that test.
Excellent point. I'll retry without threading.
I expect threading to have a larger performance impact on the C runtime (e.g. #15541) than on generated code (especially runtime-JITed code), since the optimization we use for TLS access doesn't really work for C code...
After #16439 is merged, it would be helpful to see an updated table with JULIA_THREADS=0. I think most of these have been addressed or are due to Box (such as the json test), which is being tracked separately.
Measured again, still with threads enabled. Not many changes, but rand_mat_stat is much worse. k_nucleotide is so much faster that it's suspicious. Needs investigating.
Tried with `JULIA_THREADS = 0` and the differences don't seem significant. Would be good to try on other systems though.
Does this change much with different LLVM versions?
The `rand_mat_stat` slowdown seems to be because of slowdowns in `hcat` and `hvcat`. Those don't show up in a profile running the code on 0.4.
julia> Profile.print()
498 ./REPL.jl:92; macro expansion
498 ./REPL.jl:62; eval_user_input(::Any, ::Base.REP...
498 ./boot.jl:225; eval(::Module, ::Any)
498 ./profile.jl:16; macro expansion;
3 ./REPL[1]:6; randmatstat(::Int64)
1 ./random.jl:83; mt_pop!
1 ./random.jl:1105; randn
1 ./random.jl:1107; randn
1 ./random.jl:1119; randn_unlikely(::MersenneTwist...
3 ./REPL[1]:7; randmatstat(::Int64)
1 ./random.jl:1105; randn
2 ./random.jl:1107; randn
1 ./random.jl:1119; randn_unlikely(::MersenneTwist...
119 ./REPL[1]:10; randmatstat(::Int64)
10 ./abstractarray.jl:653; typed_hcat(::Type{T}, ::Array{...
1 ./abstractarray.jl:658; typed_hcat(::Type{T}, ::Array{...
101 ./abstractarray.jl:666; typed_hcat(::Type{T}, ::Array{...
5 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64,...
1 ./abstractarray.jl:670; typed_hcat(::Type{T}, ::Array{...
5 ./abstractarray.jl:672; typed_hcat(::Type{T}, ::Array{...
1 ./array.jl:59; copy!(::Array{Float64,2}, ::Int6...
283 ./REPL[1]:11; randmatstat(::Int64)
11 ./abstractarray.jl:910; typed_hvcat(::Type{T}, ::Tuple...
177 ./abstractarray.jl:914; typed_hvcat(::Type{T}, ::Tuple...
134 ./array.jl:719; hcat(::Array{Float64,2}, ::Arr...
129 ./abstractarray.jl:666; typed_hcat(::Type{T}, ::Array...
4 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64...
5 ./abstractarray.jl:672; typed_hcat(::Type{T}, ::Array...
1 ./array.jl:58; copy!(::Array{Float64,2}, ::In...
25 ./tuple.jl:10; getindex(::Tuple{Array{Float64...
94 ./abstractarray.jl:917; typed_hvcat(::Type{T}, ::Tuple...
5 ./abstractarray.jl:690; typed_vcat(::Type{T}, ::Array{...
2 ./abstractarray.jl:691; typed_vcat(::Type{T}, ::Array{...
2 ./reduce.jl:64; mapfoldl(::Base.##30#32, ::Func...
71 ./abstractarray.jl:698; typed_vcat(::Type{T}, ::Array{...
2 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64...
7 ./abstractarray.jl:703; typed_vcat(::Type{T}, ::Array{...
1 ./multidimensional.jl:339; _setindex!(::Base.LinearFast, ...
4 ./multidimensional.jl:340; _setindex!(::Base.LinearFast, ...
2 ./multidimensional.jl:380; macro expansion
1 ./operators.jl:420; setindex_shape_check(::Array{...
1 ./operators.jl:424; setindex_shape_check(::Array{...
1 ./multidimensional.jl:383; macro expansion
28 ./REPL[1]:12; randmatstat(::Int64)
1 ./linalg/dense.jl:148; trace(::Array{Float64,2})
20 ./linalg/matmul.jl:148; At_mul_B!
20 ./linalg/matmul.jl:256; syrk_wrapper!(::Array{Float64,...
16 ./linalg/blas.jl:1145; syrk!(::Char, ::Char, ::Float...
1 ./linalg/blas.jl:1152; syrk!(::Char, ::Char, ::Float...
1 ./linalg/matmul.jl:201; copytri!(::Array{Float64,2}, ...
2 ./linalg/matmul.jl:203; copytri!(::Array{Float64,2}, ...
1 ./linalg/matmul.jl:0; At_mul_B(::Array{Float64,2}, ::...
4 ./linalg/matmul.jl:146; At_mul_B(::Array{Float64,2}, ::...
27 ./REPL[1]:13; randmatstat(::Int64)
12 ./linalg/dense.jl:172; ^(::Array{Float64,2}, ::Int64)
8 ./intfuncs.jl:90; power_by_squaring(::Array{Float...
8 ./linalg/matmul.jl:330; gemm_wrapper!(::Array{Float64,...
8 ./linalg/blas.jl:963; gemm!(::Char, ::Char, ::Float6...
4 ./linalg/matmul.jl:129; *
6 ./linalg/matmul.jl:148; At_mul_B!
1 ./linalg/matmul.jl:255; syrk_wrapper!(::Array{Float64,2...
1 ./abstractarray.jl:42; stride(::Array{Float64,2}, ::Int64)
5 ./linalg/matmul.jl:256; syrk_wrapper!(::Array{Float64,2...
2 ./linalg/blas.jl:1145; syrk!(::Char, ::Char, ::Float6...
3 ./linalg/matmul.jl:203; copytri!(::Array{Float64,2}, :...
5 ./linalg/matmul.jl:146; At_mul_B(::Array{Float64,2}, ::...
32 ./linalg/dense.jl:172; ^
22 ./intfuncs.jl:90; power_by_squaring(::Array{Float...
1 ./linalg/matmul.jl:304; gemm_wrapper!(::Array{Float64,...
1 ./linalg/matmul.jl:311; gemm_wrapper!(::Array{Float64,...
2 ./linalg/matmul.jl:329; gemm_wrapper!(::Array{Float64,...
2 ./abstractarray.jl:0; stride(::Array{Float64,2}, ::Int64)
18 ./linalg/matmul.jl:330; gemm_wrapper!(::Array{Float64,...
18 ./linalg/blas.jl:963; gemm!(::Char, ::Char, ::Float6...
10 ./linalg/matmul.jl:129; *
3 ./random.jl:1135; randn
In the following, "mine" refers to commit e280a27.

One commit before mine:
julia,fib,0.042998,0.128659,0.046870,0.002392
julia,parse_int,0.242335,2.028559,0.268087,0.078824
julia,mandel,0.136463,0.198907,0.138009,0.003110
julia,quicksort,0.316253,0.580323,0.334022,0.013232
julia,pi_sum,40.930918,41.945148,41.060571,0.152101
julia,rand_mat_stat,17.376123,20.571832,18.228378,0.572528
julia,rand_mat_mul,44.568483,56.915150,47.338284,2.601669
julia,printfd,20.593159,21.428679,20.693425,0.141257
julia,micro.mem,270.566406,270.566406,270.566406,0.000000
mine:
julia,fib,0.042982,0.124890,0.046680,0.002268
julia,parse_int,0.251515,1.852985,0.279358,0.079412
julia,mandel,0.136527,0.205742,0.138098,0.002956
julia,quicksort,0.313997,0.465756,0.332772,0.007824
julia,pi_sum,40.936211,41.703891,41.049687,0.113519
julia,rand_mat_stat,39.680477,46.019009,41.446778,1.222534
julia,rand_mat_mul,44.922529,55.156609,47.373832,2.114620
julia,printfd,20.579925,21.573189,20.672500,0.166164
julia,micro.mem,271.417969,271.417969,271.417969,0.000000
almost latest master (from yesterday, I think):
julia,fib,0.044243,0.119606,0.046837,0.003259
julia,parse_int,0.289132,7.768136,0.322119,0.152336
julia,mandel,0.136763,0.250258,0.138992,0.006331
julia,quicksort,0.318064,0.534072,0.334189,0.016374
julia,pi_sum,40.511055,55.959972,46.687331,4.624115
julia,rand_mat_stat,50.049147,76.699907,59.645703,10.819121
julia,rand_mat_mul,71.592445,157.663542,89.907365,20.233438
julia,printfd,23.629916,42.501379,30.292729,5.762164
julia,micro.mem,268.761719,268.761719,268.761719,0.000000
It does seem that there's a doubling (ouch) from my commit. However, since then it seems to have gotten even worse (especially rand_mat_mul). This is just one run of `test/perf/micro/perf.jl` on each commit.
Splatting penalty, probably? Some manually written-out cases for small numbers of inputs may help.
@pkofod Thanks for looking into it.
We need to be able to take advantage of the definition at https://github.com/JuliaLang/julia/blob/7f95e1b303c43ba8b39bd1e3c2a609cdcf8f386c/base/abstractarray.jl#L799, which is particularly fast.
Another way to look at it is that `typed_hvcat` is really slow, since it calls `hcat` on all the arguments followed by `vcat`. That could certainly be improved.
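For illustration, the structure being described is roughly this (a simplified sketch, not Base's exact `typed_hvcat`):

```julia
# hvcat-style fallback: hcat each block row, then vcat the rows, so
# every call allocates one intermediate matrix per block row plus the
# final result.
function hvcat_sketch(rows::Tuple{Vararg{Int}}, blocks...)
    out = Vector{Any}(length(rows))
    a = 1
    for i = 1:length(rows)
        out[i] = hcat(blocks[a:a+rows[i]-1]...)   # intermediate per row
        a += rows[i]
    end
    return vcat(out...)                           # then a second full copy
end
```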
See #16741
After #16741 there is a significant improvement in rand_mat_stat, but it is still much slower than 0.4.
k_nucleotide mystery solved: we are downloading a data file for this, and the file has changed and gotten much smaller.
...and the results aren't pretty. Much slower than 0.4. If I change `AbstractString` to `String` it helps a lot, but it is still slower than 0.4.
> After #16741 there is a significant improvement in rand_mat_stat, but it is still much slower than 0.4.
If #16741 fixed the perf regression I introduced, maybe it is the increase after commit e280a27 that is still lurking around. My numbers suggested that since e280a27, rand_mat_stat run time had increased by 50% and rand_mat_mul by 100%.
k_nucleotide is largely string operations, and a big issue seems to be `String` being slower than `ASCIIString`. Fortunately at least a couple of the issues are easily fixable. Here are the culprits:

- `String` constructor copying (already about to be fixed)
- `endof` is very slow (compared to just returning the length of an array)
- `String` is using the abstract fallback for `nextind`, which is very slow. It needs a custom implementation (a sketch follows this list).
- `==` for `String` is slow, I think only because of `endof`.
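As a sketch of the `nextind` item above (an illustration, not Base's actual implementation): for a UTF-8 backed string, advancing an index only requires skipping continuation bytes, whereas the `AbstractString` fallback re-decodes characters.

```julia
# `bytes` stands in for the string's underlying byte buffer (an
# assumption; Base's actual field layout may differ). UTF-8
# continuation bytes match the bit pattern 0b10xxxxxx.
function nextind_sketch(bytes::Vector{UInt8}, i::Int)
    n = length(bytes)
    j = i + 1
    @inbounds while j <= n && (bytes[j] & 0xc0) == 0x80
        j += 1      # still inside a multi-byte character; keep skipping
    end
    return j
end
```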
k_nucleotide takes 71ms on 0.4. With threads off, String instead of AbstractString, and the String changes I listed above, it takes 61ms on 0.5, which is great. If I change String back to AbstractString, the time jumps to 94ms. This seems to point to dynamic dispatch as the remaining issue.
Dynamic dispatch is probably also the issue in binary_trees, since that's pretty much all it does.
> binary_trees

Yes, profiling shows the expensive line is line 32, `check(t::Node) = t.info + check(t.left) - check(t.right)`, which has no type information for any of those operations (the most expensive of which ends up being the `+`, since it has so many cache entries).
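To illustrate (a hypothetical reduction, not the benchmark's actual definitions): with untyped fields, every call in `check` is a dynamic dispatch.

```julia
type Node            # 0.5 syntax
    info::Int
    left             # ::Any — check(t.left) must dispatch at runtime
    right            # ::Any — likewise
end
check(t::Node) = t.info + check(t.left) - check(t.right)
check(::Void) = 0    # leaf case, assumed for the sketch
```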
> json

This test is almost entirely measuring the cost of mutating state via closures.
> add1_logical

This should be essentially https://github.com/JuliaLang/julia/issues/16243.
The regression in `rand_mat_stat` boils down to a regression in the `Array` constructor under certain circumstances. Take for example the following functions:
function foo(t)
    n = 5
    v = zeros(t)
    for i = 1:t
        a = randn(n,n)
        b = randn(n,n)
        c = randn(n,n)
        d = randn(n,n)
        A = (a,b,c,d)
        T = Base.promote_eltype(A...)
        nrows = size(A[1],1)
        ncols = 0
        for a in A; ncols += size(a,2); end
        B = similar(full(A[1]), T, nrows, ncols)
        v[i] = B[end]
    end
    return v
end

function baz(A::AbstractVecOrMat...)
    T = Base.promote_eltype(A...)
    nargs = length(A)
    nrows = size(A[1],1)
    ncols = 0
    for j = 1:nargs; ncols += size(A[j],2); end
    B = similar(full(A[1]), T, nrows, ncols)
    return B
end

function bar(t)
    n = 5
    v = zeros(t)
    for i = 1:t
        a = randn(n,n)
        b = randn(n,n)
        c = randn(n,n)
        d = randn(n,n)
        B = baz(a,b,c,d)
        v[i] = B[end]
    end
    return v
end
`foo` performs the same in 0.4.5 and master. But `bar` is slower on master, and profiling shows that the line `B = similar(full(A[1]), T, nrows, ncols)` is to blame, which in turn means that the `Array` constructor is struggling in this case.
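For reference, a quick harness along these lines (an assumption, not from the original comment) reproduces the gap:

```julia
foo(1); bar(1)        # force compilation first
@time foo(10^5)       # roughly the same on 0.4.5 and master
@time bar(10^5)       # slower on master; time goes to baz's `similar` line
```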
This is as far as I can get for now (I'll have to learn more about the internals to be of more help)
A couple of ideas:

1) cache `apply_type` results for types with missing trailing parameters (like `Array{T}`)
2) restore the optimization of `f(A...) = g(A...)` in codegen (see the sketch below)
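For concreteness, idea (2) refers to forwarding definitions of this shape (the names here are placeholders):

```julia
g(args...) = length(args)   # hypothetical callee
f(A...) = g(A...)           # ideally lowers to a direct call of g,
                            # without rebuilding the argument tuple
```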
I'm guessing https://github.com/JuliaLang/julia/pull/16498 might be causing this. So 1) would likely be a solution.
(2) is in #16836.
TODO: try benchmarking `julia -O3` and seeing if it improves benchmarks; also see how bad the compilation times are with `-O3` these days.
Table updated. I also tried -O3 (just for the benchmarks; didn't rebuild the sysimg with it). -O3 times are given only where significantly different from the normal times. Indeed, it helps some of the worst regressions a lot, in the case of `sparsemul` taking it from slower than 0.4 to faster.
(the sysimage should be defaulting to building with -O3)
JULIA_THREADS=0 or =1 for the new columns?
Ah, right. THREADS=1 in the new columns.
The problem with `add1` and `add1_logical` is with `zip`. Changing
function _unsafe_getindex(::LinearIndexing, src::AbstractArray, I::AbstractArray{Bool})
    shape = index_shape(src, I)
    dest = similar(src, shape)
    size(dest) == shape || throw_checksize_error(dest, shape)
    D = eachindex(dest)
    Ds = start(D)
    for (i, s) in zip(eachindex(I), eachindex(src))
        @inbounds Ii = I[i]
        if Ii
            d, Ds = next(D, Ds)
            @inbounds dest[d] = src[s]
        end
    end
    dest
end
to something similar to what it was before
function _unsafe_getindex(::LinearIndexing, src::AbstractArray, I::AbstractArray{Bool})
    shape = index_shape(src, I)
    dest = similar(src, shape)
    size(dest) == shape || throw_checksize_error(dest, shape)
    D = eachindex(dest)
    Ds = start(D)
    s = 0
    for i in eachindex(I)
        s += 1
        @inbounds if I[i]
            d, Ds = next(D, Ds)
            dest[d] = src[s]
        end
    end
    dest
end
makes them faster than in 0.4. This was introduced in #15434 and friends.
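For anyone wanting to reproduce this, a minimal repro (assumed, not from the thread) that exercises `_unsafe_getindex` through logical indexing:

```julia
x = rand(10^7)
mask = x .> 0.5       # array of Bool
x[mask]               # warm up / compile
@time x[mask]         # dominated by _unsafe_getindex
```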
Good find @pabloferz!
cc @timholy, I thought the point of replacing indexing loops like that in so many places in base with zip and eachindex was that it had gotten to performance parity and wouldn't slow the common case down? Did that regress and become no longer the case?
Reading #15356 I noticed @nalimilan's comment, which might explain some of the findings above when building with -O3.
Dunno; it's also possible that wasn't run on nanosoldier, or that we lack(ed) the test.
The problem with changing it to the old version is that it's going to be deadly slow if `src` is `LinearSlow`. What is it in this test? We might need a version of `_unsafe_getindex` specialized for `LinearFast`.
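A minimal sketch of that suggestion, reusing the internal helpers from the snippets above: keep the zip-based method as the generic fallback, and add the counter-based body only for `Base.LinearFast` sources, where the manual linear index `s` is cheap.

```julia
function _unsafe_getindex(::Base.LinearFast, src::AbstractArray, I::AbstractArray{Bool})
    shape = index_shape(src, I)
    dest = similar(src, shape)
    size(dest) == shape || throw_checksize_error(dest, shape)
    D = eachindex(dest)
    Ds = start(D)
    s = 0                     # linear index into src — cheap for LinearFast
    for i in eachindex(I)
        s += 1
        @inbounds if I[i]
            d, Ds = next(D, Ds)
            dest[d] = src[s]
        end
    end
    dest
end
```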
I wonder if #9080 has gotten worse?
It looks like there's a trivial performance bug in closure lowering (which for some reason was already marked with TODO in julia-syntax.scm), which seems to be responsible for about half of the json difference. Most of the rest appears to be due to JULIA_THREADS (possibly https://github.com/JuliaLang/julia/issues/16804).
@time sumcart_manual(A)
@time sumcart_iter(A)
yields
0.108923 seconds (5 allocations: 176 bytes)
0.589222 seconds (5 allocations: 176 bytes)
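Presumably the two variants being timed above are something like this (a hypothetical reconstruction): manual nested loops vs. `CartesianRange` iteration.

```julia
function sumcart_manual(A::AbstractMatrix)
    s = zero(eltype(A))
    @inbounds for j = 1:size(A, 2), i = 1:size(A, 1)
        s += A[i, j]          # hand-written column-major loop
    end
    return s
end

function sumcart_iter(A)
    s = zero(eltype(A))
    @inbounds for I in CartesianRange(size(A))
        s += A[I]             # same traversal via CartesianIndex
    end
    return s
end
```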
I can probably get rid of all generated functions in CartesianIndex/CartesianRange operations. Is that useful in any event?
Yes, getting rid of generated functions is always nice, if possible.
Looking into the `add1_logical` regression, it looks like LLVM 3.7 (SROA?) is no longer able to optimize `countnz(logical_y)` correctly. As written (and if we make for-loops have implicit `@inbounds`), it should be vectorizable now (as LLVM 3.3 can do on current master), and thus much faster than 0.4 (but LLVM 3.7 fails to do this).
Addendum: for some reason, putting `@inbounds` in front of the for-loop elides bounds checking inside `next`, but I think that's an inliner bug for the effect-free computation of `getindex`.

Edit to addendum: LLVM 3.7 gets much worse at generating code if the `@inbounds` trick is used (unlike LLVM 3.3, which vectorizes it).
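For reference, roughly the kind of reduction under discussion (a simplified stand-in for what `countnz` does over a logical array, not Base's exact code):

```julia
function countnz_sketch(x::Vector{Bool})
    n = 0
    @inbounds for i in eachindex(x)
        n += Int(x[i])   # branch-free accumulate; should vectorize
    end
    return n
end
```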
> Addendum: for some reason, putting @inbounds in front of the for-loop elides bounds checking inside next, but I think that's an inliner bug for the effect-free computation of getindex
If you added the addendum after the merger of #16260, note that `next` now has a `@_propagate_inbounds_meta`:
Progress! We're starting to do pretty well; several of the regressions are fixed, and most of the ones that remain are minor, around ~10%. Regressions marked `(threads)` are only a problem with threads enabled. Most of the numbers in the table are affected by enabling threads, but for these, threading makes the difference between being a regression vs. not.
Shall we move this to 0.5.x then? (And make sure that this time we actually work through the .x issues.)
Has anyone done this yet, and does it work? @nanosoldier runbenchmarks(ALL, vs=":release-0.4")`
Missed a backtick, try again: @nanosoldier runbenchmarks(ALL, vs=":release-0.4")
Will that work on an issue? Try on the latest commit to master maybe?
It doesn't work in issues, since there's no explicit commit that can be assumed by context (though master would be a reasonable default). You can comment on commits though, as Tony mentioned. I triggered the job here.
I did a comparison between the two reports and this is a filtered (not completely exhaustive) summary of what is consistently slower. The number to the right is the slowdown.
["string","join"]
1.19
["string","replace"]
2.54
["sparse","index",("spvec","integer",10000)]
1.81
["sparse","index",("spmat","array",1000)]
3.00
["sparse","index",("spmat","col","array",100)]
3.15
["sort","mergesort",("sortperm! reverse","random")]
1.23
and all other mergesort ones are consistently slower
["sort","issorted",("reverse","random")]
2.71
["sort","issorted",("reverse","ascending")]
2.85
["simd",("two_reductions","Float64",4096)]
1.29
["simd",("local_arrays","Int32",4096)]
27.83
["simd",("local_arrays","Float64",4096)]
18.57
["simd",("conditional_loop!","Float32",4095)]
2.95
["shootout","revcomp"]
23.50
["shootout","fasta"]
1.24
["scalar","fastmath",("add","Complex{Float32}")]
1.50
["scalar","fastmath",("add","Complex{Float64}")]
2.00
Basically all scalar arithmetic on BigInt and BigFloat are slower
["scalar","arithmetic",("div","Complex{BigFloat}","UInt64")]
["problem","stockcorr"]
1.23
["problem","ziggurat"]
1.22
["problem","simplex"]
1.36
["problem","laplacian","laplace_sparse_matvec"]
3.98
["parallel","remotecall",("identity",1024)]
1.20
and the other remotecalls
Basically everything with tridiagonal and co, some examples:
["linalg","arithmetic",("-","Tridiagonal","Tridiagonal",256)]
1.45
["linalg","arithmetic",("-","Vector","Vector",256)]
1.44
["linalg","arithmetic",("-","Bidiagonal","Bidiagonal",1024)]
["array","index",("sumrange","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Float32,2}")]
1.38
["array","index",("sumlogical","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Float32,2}")]
1.20
and as well for other number types
["array","index",("sumlinear","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Int32,2}")]
1.82
["array","index",("sumelt_boundscheck","linspace(1.0,2.0,10000000)")]
2.06
["array","index",("sumcartesian","1:100000000")]
3.99
["array","growth",("push_single!",2048)]
1.78
["array","bool","bitarray_true_load!"]
1.17
["array","bool","boolarray_bool_load!"]
1.28
> If you added the addendum after the merger of #16260, note that next now has a @_propagate_inbounds_meta:
It's good to know that's now there, but I checked my source tree before posting to confirm this was a known inliner bug, as it preceded that PR.
@JeffBezanson Are there concrete actions still required to move this issue to v0.5.x, or what's left that seems unexplained / uncorrected?
The regressions in splitline and gk are a bit worrying, but hopefully we can work on them in the RC period.
I'm more worried about the 27x regression on the `@simd` benchmarks. Anyone know why this loop:
function perf_local_arrays(V)
    # SIMD loop on local arrays declared without type annotations
    T, n = eltype(V), length(V)
    X = rand(T, n)
    Y = rand(T, n)
    Z = rand(T, n)
    @simd for i in eachindex(X)
        @inbounds X[i] = Y[i] * Z[i]
    end
    return X
end
doesn't vectorize? (Input: `rand(Float64, 4096)`)
@ArchRobison ^
It vectorizes on LLVM 3.8
And on 3.7.1 it vectorizes too with `-O3`. When it doesn't, it's because of https://github.com/JuliaLang/julia/issues/15369. All the TBAAs are correctly attached in codegen and maintained by LLVM (at least with https://github.com/JuliaLang/julia/pull/16897).
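One way to check whether the loop vectorized (an assumed workflow, not from the thread) is to inspect the generated IR for `<2 x double>` / `<4 x double>` vector operations:

```julia
V = rand(Float64, 4096)
perf_local_arrays(V)            # compile first
@code_llvm perf_local_arrays(V) # look for vector-width operations
```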
for those who may not have seen it, here's 0.5.0-rc2 vs 0.4.6: https://github.com/JuliaCI/BaseBenchmarkReports/blob/6d82a5518a25740eef4abde8359ea3cdbc630375/0350e57_vs_2e358ce/report.md
With the exception of the horrifying regressions on "simd", fairly satisfying overall. I'm especially pleased with the improvements in so many array benchmarks. Worried about the linalg regressions, but an optimist can always hope that my barrage of PRs this morning may help.
It _feels_ like the array arithmetic is just some inlining problem or something like that.
Closing this, since at this point the experiment will need to be repeated for 0.6.