Truffleruby: Performance optimizations for my Array#to_proc

Created on 4 Aug 2017 · 4Comments · Source: oracle/truffleruby

I wrote the following Array#to_proc. It makes Ruby more concise and pleasant to me.

Unfortunately, it leads to a performance hit on MRI(at the moment). I wonder if it can be optimized on TruffleRuby. Could it possibly become a zero-cost abstraction?

Here is an example:

require 'to_proc/all'

array_of_numbers = [0,1,2,3,4,5]
range = 1..2

array_of_numbers.map { |number| range.include? number }
array_of_numbers.map &[range, :include?]
#=> [false, true, true, false, false, false]


array_of_arrays = [[:a],[:b],[:c],[:d]]
array_to_append = [1,2]

array_of_arrays.map { |array| array + array_to_append }
array_of_arrays.map &[:+, array_to_append]
#=> [[:a, 1, 2], [:b, 1, 2], [:c, 1, 2], [:d, 1, 2]]

I am looking for the ways to contribute to TruffleRuby(probably after JVM 9 released). Performance optimizations for my Array#to_proc use case is one of the topics that interest me the most, at the moment.

I would be grateful if you could point me to the relevant parts of the codebase and relevant materials.

Source

ch1c0t

Most helpful comment

We got interesting results!
I used GraalVM 0.26 from OTN and MRI 2.4.1.

The first change I made to the benchmark is to allow it to run for longer, by adding

  x.iterations = 5

to both Benchmark.ips blocks.
This runs each benchmark five times consecutively, which allow results to stabilize.

Let's run each Benchmark.ips group one by one.
Otherwise it would mean in the second benchmark we run a compiled version of
Array#to_proc that is the combination of the usage in the first and the second benchmark.
To do that, I just commented the Benchmark.ips group to not run.

First benchmark: include?

MRI:
$ ruby -v
ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]

$ ruby bench_array_to_proc_iters.rb 
Warming up --------------------------------------
   #to_proc include?     1.000  i/100ms
      block include?     1.000  i/100ms
   #to_proc include?     1.000  i/100ms
      block include?     1.000  i/100ms
   #to_proc include?     1.000  i/100ms
      block include?     1.000  i/100ms
   #to_proc include?     1.000  i/100ms
      block include?     1.000  i/100ms
   #to_proc include?     1.000  i/100ms
      block include?     1.000  i/100ms
Calculating -------------------------------------
   #to_proc include?      7.063  (± 0.0%) i/s -     36.000  in   5.097487s
      block include?      8.553  (± 0.0%) i/s -     43.000  in   5.027790s
   #to_proc include?      7.028  (± 0.0%) i/s -     36.000  in   5.122740s
      block include?      8.500  (± 0.0%) i/s -     43.000  in   5.061580s
   #to_proc include?      7.036  (± 0.0%) i/s -     36.000  in   5.117119s
      block include?      8.527  (± 0.0%) i/s -     43.000  in   5.043421s
   #to_proc include?      7.022  (± 0.0%) i/s -     36.000  in   5.128119s
      block include?      8.501  (± 0.0%) i/s -     43.000  in   5.060190s
   #to_proc include?      7.030  (± 0.0%) i/s -     36.000  in   5.121864s
      block include?      8.498  (± 0.0%) i/s -     43.000  in   5.062551s

Comparison:
      block include?:        8.5 i/s
   #to_proc include?:        7.0 i/s - 1.21x  slower

So that's like your results, there is a significant overhead on MRI.

$ ~/code/graalvm-0.26/bin/ruby -v
truffleruby 0.26, like ruby 2.3.3 <OpenJDK 64-Bit Graal VM 1.8.0_121-b13 with Graal> [linux-x86_64]

# I reuse the benchmark-ips installed by MRI for convenience here
# One could also do ~/code/graalvm-0.26/bin/ruby -S gem install benchmark-ips,
# but for that TruffleRuby needs to be integrated with a Ruby manager
# (or use TRUFFLERUBY_RESILIENT_GEM_HOME=true).

$ ~/code/graalvm-0.26/bin/ruby -I~/.gem/ruby/2.4.1/gems/benchmark-ips-2.7.2/lib bench_array_to_proc_iters.rb
Warming up --------------------------------------
   #to_proc include?     8.000  i/100ms
      block include?    11.000  i/100ms
   #to_proc include?    15.000  i/100ms
      block include?    13.000  i/100ms
   #to_proc include?    33.000  i/100ms
      block include?    21.000  i/100ms
   #to_proc include?    40.000  i/100ms
      block include?    26.000  i/100ms
   #to_proc include?    40.000  i/100ms
      block include?    26.000  i/100ms
Calculating -------------------------------------
   #to_proc include?    179.213  (±11.2%) i/s -    880.000  in   5.011706s
      block include?    254.029  (±23.6%) i/s -      1.144k in   5.024343s
   #to_proc include?    452.736  (± 2.0%) i/s -      2.280k in   5.038072s
      block include?    282.178  (± 1.8%) i/s -      1.430k in   5.069233s
   #to_proc include?    453.829  (± 1.1%) i/s -      2.280k in   5.024652s
      block include?    283.608  (± 0.7%) i/s -      1.430k in   5.042473s
   #to_proc include?    454.198  (± 1.1%) i/s -      2.280k in   5.020536s
      block include?    282.818  (± 1.1%) i/s -      1.430k in   5.057029s
   #to_proc include?    455.261  (± 0.9%) i/s -      2.280k in   5.008603s
      block include?    273.327  (± 5.1%) i/s -      1.378k in   5.055163s

Comparison:
   #to_proc include?:      455.3 i/s
      block include?:      273.3 i/s - 1.67x  slower

First benchmark: instead of (#to_proc, block) 7/8.5 iterations/s, TruffleRuby runs 455/273 iterations/s.
That's a 65x/32x speedup over MRI!
But wait, #to_proc include? is actually 1.67x faster than the block version?

Actually there is a reason for that!
The benchmark reads the variable range captured 2 lexical scopes above in the inner loop:

array_of_numbers = 1..1_000_000
range = 20..40_000

Benchmark.ips do |x|
  x.iterations = 5

  x.report '#to_proc include?' do
    array_of_numbers.map &[range, :include?]
  end

  x.report 'block include?' do
    array_of_numbers.map { |number| range.include? number }
  end
end

But with Array#to_proc we actually read it only once per iteration,
and then read it from the lambda in #to_proc which folds away.
To be more fair, we can store range in a constant MY_RANGE:

~/code/graalvm-0.26/bin/ruby -I~/.gem/ruby/2.4.1/gems/benchmark-ips-2.7.2/lib bench_array_to_proc_iters_const.rb 
Warming up --------------------------------------
   #to_proc include?     8.000  i/100ms
      block include?    22.000  i/100ms
   #to_proc include?    15.000  i/100ms
      block include?    27.000  i/100ms
   #to_proc include?    34.000  i/100ms
      block include?    31.000  i/100ms
   #to_proc include?    77.000  i/100ms
      block include?    77.000  i/100ms
   #to_proc include?    77.000  i/100ms
      block include?    77.000  i/100ms
Calculating -------------------------------------
   #to_proc include?    176.797  (± 9.6%) i/s -    924.000  in   5.292224s
      block include?    726.082  (±18.5%) i/s -      3.311k in   5.038979s
   #to_proc include?    773.143  (± 1.9%) i/s -      3.927k in   5.081265s
      block include?    774.650  (± 2.2%) i/s -      3.927k in   5.071775s
   #to_proc include?    775.459  (± 1.5%) i/s -      3.927k in   5.065373s
      block include?    750.786  (± 3.2%) i/s -      3.773k in   5.030590s
   #to_proc include?    749.608  (± 2.5%) i/s -      3.773k in   5.036557s
      block include?    749.925  (± 3.9%) i/s -      3.773k in   5.039326s
   #to_proc include?    760.790  (± 3.2%) i/s -      3.850k in   5.065920s
      block include?    767.078  (± 2.7%) i/s -      3.850k in   5.023114s

Comparison:
      block include?:      767.1 i/s
   #to_proc include?:      760.8 i/s - same-ish: difference falls within error

Now there is no significant difference between the two versions, and we are
almost 100x faster than MRI (changing to a constant did not change MRI's score).

Second benchmark: append

This benchmark has the same problem as the one above,
array_to_append should be a constant (ARRAY_TO_APPEND) to be fair.
So we change the benchmark like that.

$ ruby bench_array_to_proc_iters_const.rb                         
Warming up --------------------------------------
     #to_proc append    81.937k i/100ms
        block append   152.283k i/100ms
     #to_proc append    80.402k i/100ms
        block append   142.596k i/100ms
     #to_proc append    78.866k i/100ms
        block append   149.598k i/100ms
     #to_proc append    78.686k i/100ms
        block append   149.088k i/100ms
     #to_proc append    76.223k i/100ms
        block append   147.299k i/100ms
Calculating -------------------------------------
     #to_proc append    926.243k (± 4.1%) i/s -      4.650M in   5.028747s
        block append      1.952M (± 6.7%) i/s -      9.722M in   5.003298s
     #to_proc append    921.311k (± 5.1%) i/s -      4.650M in   5.060526s
        block append      2.008M (± 4.7%) i/s -     10.016M in   5.000673s
     #to_proc append    942.812k (± 3.5%) i/s -      4.726M in   5.019020s
        block append      1.971M (± 4.8%) i/s -      9.869M in   5.018637s
     #to_proc append    883.998k (± 6.9%) i/s -      4.421M in   5.026459s
        block append      2.019M (± 3.5%) i/s -     10.164M in   5.040880s
     #to_proc append    911.273k (± 4.4%) i/s -      4.573M in   5.028715s
        block append      1.982M (± 5.3%) i/s -     10.016M in   5.070194s

Comparison:
        block append:  1981502.9 i/s
     #to_proc append:   911273.3 i/s - 2.17x  slower

That's a pretty big overhead.

$ ~/code/graalvm-0.26/bin/ruby -I~/.gem/ruby/2.4.1/gems/benchmark-ips-2.7.2/lib bench_array_to_proc_iters_const.rb
Warming up --------------------------------------
     #to_proc append   637.580k i/100ms
        block append   780.447k i/100ms
     #to_proc append   749.302k i/100ms
        block append   877.479k i/100ms
     #to_proc append     1.042M i/100ms
        block append     1.066M i/100ms
     #to_proc append   977.468k i/100ms
        block append     1.095M i/100ms
     #to_proc append     1.093M i/100ms
        block append     1.101M i/100ms
Calculating -------------------------------------
     #to_proc append     14.166M (±14.7%) i/s -     65.610M in   5.034431s
        block append     14.582M (±10.2%) i/s -     70.452M in   5.049298s
     #to_proc append     14.632M (± 2.5%) i/s -     73.264M in   5.010349s
        block append     14.578M (± 4.0%) i/s -     73.754M in   5.068358s
     #to_proc append     14.304M (± 4.0%) i/s -     72.170M in   5.054003s
        block append     14.819M (± 1.8%) i/s -     74.855M in   5.053035s
     #to_proc append     14.692M (± 2.3%) i/s -     74.357M in   5.064026s
        block append     14.861M (± 1.5%) i/s -     74.855M in   5.038001s
     #to_proc append     14.744M (± 1.5%) i/s -     74.357M in   5.044541s
        block append     14.876M (± 1.5%) i/s -     74.855M in   5.033147s

Comparison:
        block append: 14876012.3 i/s
     #to_proc append: 14743577.5 i/s - same-ish: difference falls within error

No significant difference. Here again Array#to_proc has no overhead!
And TruffleRuby is (block, #to_proc) 7.5x/16x faster than MRI.
The speedup is less, because this benchmark needs to allocate a lot of new arrays.

The benchmark files I used are in this gist.

eregon on 4 Aug 2017

👍2 ❤1

All 4 comments

Hello,
This looks like it should optimize very well under TruffleRuby, if the method containing the call to to_proc is compiled.

Partial Escape Analysis should see through the literal arrays you convert to a Proc and so constant fold Array#[], Array#empty?, etc.
.is_a? should fold, and send as well.
The *splats should optimize well as these arrays are escape-analyzed as well and are small.

I would propose as the first step to write a benchmark of a few examples you care about like the ones above using benchmark/ips. Then we can assess the current performance, and more easily dig into potential issues.

eregon on 4 Aug 2017

Thank you. Here is a benchmark. The results on MRI 2.4:

Warming up --------------------------------------                   
       Array#to_proc     1.000  i/100ms                             
             default     1.000  i/100ms                             
Calculating -------------------------------------                   
       Array#to_proc      2.676  (± 0.0%) i/s -     14.000  in   5.233228s                                                              
             default      3.301  (± 0.0%) i/s -     17.000  in   5.151718s                                                              

Comparison:                       
             default:        3.3 i/s                                
       Array#to_proc:        2.7 i/s - 1.23x  slower                

Warming up --------------------------------------                   
       Array#to_proc    25.267k i/100ms                             
             default    48.108k i/100ms                             
Calculating -------------------------------------                   
       Array#to_proc    309.345k (± 0.6%) i/s -      1.567M in   5.064304s                                                              
             default    708.505k (± 0.9%) i/s -      3.560M in   5.025035s                                                              

Comparison:                       
             default:   708504.6 i/s                                
       Array#to_proc:   309344.9 i/s - 2.29x  slower

ch1c0t on 4 Aug 2017

We got interesting results!
I used GraalVM 0.26 from OTN and MRI 2.4.1.

The first change I made to the benchmark is to allow it to run for longer, by adding

  x.iterations = 5

to both Benchmark.ips blocks.
This runs each benchmark five times consecutively, which allow results to stabilize.

First benchmark: include?

MRI:
$ ruby -v
ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]

$ ruby bench_array_to_proc_iters.rb 
Warming up --------------------------------------
   #to_proc include?     1.000  i/100ms
      block include?     1.000  i/100ms
   #to_proc include?     1.000  i/100ms
      block include?     1.000  i/100ms
   #to_proc include?     1.000  i/100ms
      block include?     1.000  i/100ms
   #to_proc include?     1.000  i/100ms
      block include?     1.000  i/100ms
   #to_proc include?     1.000  i/100ms
      block include?     1.000  i/100ms
Calculating -------------------------------------
   #to_proc include?      7.063  (± 0.0%) i/s -     36.000  in   5.097487s
      block include?      8.553  (± 0.0%) i/s -     43.000  in   5.027790s
   #to_proc include?      7.028  (± 0.0%) i/s -     36.000  in   5.122740s
      block include?      8.500  (± 0.0%) i/s -     43.000  in   5.061580s
   #to_proc include?      7.036  (± 0.0%) i/s -     36.000  in   5.117119s
      block include?      8.527  (± 0.0%) i/s -     43.000  in   5.043421s
   #to_proc include?      7.022  (± 0.0%) i/s -     36.000  in   5.128119s
      block include?      8.501  (± 0.0%) i/s -     43.000  in   5.060190s
   #to_proc include?      7.030  (± 0.0%) i/s -     36.000  in   5.121864s
      block include?      8.498  (± 0.0%) i/s -     43.000  in   5.062551s

Comparison:
      block include?:        8.5 i/s
   #to_proc include?:        7.0 i/s - 1.21x  slower

So that's like your results, there is a significant overhead on MRI.

$ ~/code/graalvm-0.26/bin/ruby -v
truffleruby 0.26, like ruby 2.3.3 <OpenJDK 64-Bit Graal VM 1.8.0_121-b13 with Graal> [linux-x86_64]

# I reuse the benchmark-ips installed by MRI for convenience here
# One could also do ~/code/graalvm-0.26/bin/ruby -S gem install benchmark-ips,
# but for that TruffleRuby needs to be integrated with a Ruby manager
# (or use TRUFFLERUBY_RESILIENT_GEM_HOME=true).

$ ~/code/graalvm-0.26/bin/ruby -I~/.gem/ruby/2.4.1/gems/benchmark-ips-2.7.2/lib bench_array_to_proc_iters.rb
Warming up --------------------------------------
   #to_proc include?     8.000  i/100ms
      block include?    11.000  i/100ms
   #to_proc include?    15.000  i/100ms
      block include?    13.000  i/100ms
   #to_proc include?    33.000  i/100ms
      block include?    21.000  i/100ms
   #to_proc include?    40.000  i/100ms
      block include?    26.000  i/100ms
   #to_proc include?    40.000  i/100ms
      block include?    26.000  i/100ms
Calculating -------------------------------------
   #to_proc include?    179.213  (±11.2%) i/s -    880.000  in   5.011706s
      block include?    254.029  (±23.6%) i/s -      1.144k in   5.024343s
   #to_proc include?    452.736  (± 2.0%) i/s -      2.280k in   5.038072s
      block include?    282.178  (± 1.8%) i/s -      1.430k in   5.069233s
   #to_proc include?    453.829  (± 1.1%) i/s -      2.280k in   5.024652s
      block include?    283.608  (± 0.7%) i/s -      1.430k in   5.042473s
   #to_proc include?    454.198  (± 1.1%) i/s -      2.280k in   5.020536s
      block include?    282.818  (± 1.1%) i/s -      1.430k in   5.057029s
   #to_proc include?    455.261  (± 0.9%) i/s -      2.280k in   5.008603s
      block include?    273.327  (± 5.1%) i/s -      1.378k in   5.055163s

Comparison:
   #to_proc include?:      455.3 i/s
      block include?:      273.3 i/s - 1.67x  slower

Actually there is a reason for that!
The benchmark reads the variable range captured 2 lexical scopes above in the inner loop:

array_of_numbers = 1..1_000_000
range = 20..40_000

Benchmark.ips do |x|
  x.iterations = 5

  x.report '#to_proc include?' do
    array_of_numbers.map &[range, :include?]
  end

  x.report 'block include?' do
    array_of_numbers.map { |number| range.include? number }
  end
end

But with Array#to_proc we actually read it only once per iteration,
and then read it from the lambda in #to_proc which folds away.
To be more fair, we can store range in a constant MY_RANGE:

~/code/graalvm-0.26/bin/ruby -I~/.gem/ruby/2.4.1/gems/benchmark-ips-2.7.2/lib bench_array_to_proc_iters_const.rb 
Warming up --------------------------------------
   #to_proc include?     8.000  i/100ms
      block include?    22.000  i/100ms
   #to_proc include?    15.000  i/100ms
      block include?    27.000  i/100ms
   #to_proc include?    34.000  i/100ms
      block include?    31.000  i/100ms
   #to_proc include?    77.000  i/100ms
      block include?    77.000  i/100ms
   #to_proc include?    77.000  i/100ms
      block include?    77.000  i/100ms
Calculating -------------------------------------
   #to_proc include?    176.797  (± 9.6%) i/s -    924.000  in   5.292224s
      block include?    726.082  (±18.5%) i/s -      3.311k in   5.038979s
   #to_proc include?    773.143  (± 1.9%) i/s -      3.927k in   5.081265s
      block include?    774.650  (± 2.2%) i/s -      3.927k in   5.071775s
   #to_proc include?    775.459  (± 1.5%) i/s -      3.927k in   5.065373s
      block include?    750.786  (± 3.2%) i/s -      3.773k in   5.030590s
   #to_proc include?    749.608  (± 2.5%) i/s -      3.773k in   5.036557s
      block include?    749.925  (± 3.9%) i/s -      3.773k in   5.039326s
   #to_proc include?    760.790  (± 3.2%) i/s -      3.850k in   5.065920s
      block include?    767.078  (± 2.7%) i/s -      3.850k in   5.023114s

Comparison:
      block include?:      767.1 i/s
   #to_proc include?:      760.8 i/s - same-ish: difference falls within error

Now there is no significant difference between the two versions, and we are
almost 100x faster than MRI (changing to a constant did not change MRI's score).

Second benchmark: append

This benchmark has the same problem as the one above,
array_to_append should be a constant (ARRAY_TO_APPEND) to be fair.
So we change the benchmark like that.

$ ruby bench_array_to_proc_iters_const.rb                         
Warming up --------------------------------------
     #to_proc append    81.937k i/100ms
        block append   152.283k i/100ms
     #to_proc append    80.402k i/100ms
        block append   142.596k i/100ms
     #to_proc append    78.866k i/100ms
        block append   149.598k i/100ms
     #to_proc append    78.686k i/100ms
        block append   149.088k i/100ms
     #to_proc append    76.223k i/100ms
        block append   147.299k i/100ms
Calculating -------------------------------------
     #to_proc append    926.243k (± 4.1%) i/s -      4.650M in   5.028747s
        block append      1.952M (± 6.7%) i/s -      9.722M in   5.003298s
     #to_proc append    921.311k (± 5.1%) i/s -      4.650M in   5.060526s
        block append      2.008M (± 4.7%) i/s -     10.016M in   5.000673s
     #to_proc append    942.812k (± 3.5%) i/s -      4.726M in   5.019020s
        block append      1.971M (± 4.8%) i/s -      9.869M in   5.018637s
     #to_proc append    883.998k (± 6.9%) i/s -      4.421M in   5.026459s
        block append      2.019M (± 3.5%) i/s -     10.164M in   5.040880s
     #to_proc append    911.273k (± 4.4%) i/s -      4.573M in   5.028715s
        block append      1.982M (± 5.3%) i/s -     10.016M in   5.070194s

Comparison:
        block append:  1981502.9 i/s
     #to_proc append:   911273.3 i/s - 2.17x  slower

That's a pretty big overhead.

$ ~/code/graalvm-0.26/bin/ruby -I~/.gem/ruby/2.4.1/gems/benchmark-ips-2.7.2/lib bench_array_to_proc_iters_const.rb
Warming up --------------------------------------
     #to_proc append   637.580k i/100ms
        block append   780.447k i/100ms
     #to_proc append   749.302k i/100ms
        block append   877.479k i/100ms
     #to_proc append     1.042M i/100ms
        block append     1.066M i/100ms
     #to_proc append   977.468k i/100ms
        block append     1.095M i/100ms
     #to_proc append     1.093M i/100ms
        block append     1.101M i/100ms
Calculating -------------------------------------
     #to_proc append     14.166M (±14.7%) i/s -     65.610M in   5.034431s
        block append     14.582M (±10.2%) i/s -     70.452M in   5.049298s
     #to_proc append     14.632M (± 2.5%) i/s -     73.264M in   5.010349s
        block append     14.578M (± 4.0%) i/s -     73.754M in   5.068358s
     #to_proc append     14.304M (± 4.0%) i/s -     72.170M in   5.054003s
        block append     14.819M (± 1.8%) i/s -     74.855M in   5.053035s
     #to_proc append     14.692M (± 2.3%) i/s -     74.357M in   5.064026s
        block append     14.861M (± 1.5%) i/s -     74.855M in   5.038001s
     #to_proc append     14.744M (± 1.5%) i/s -     74.357M in   5.044541s
        block append     14.876M (± 1.5%) i/s -     74.855M in   5.033147s

Comparison:
        block append: 14876012.3 i/s
     #to_proc append: 14743577.5 i/s - same-ish: difference falls within error

The benchmark files I used are in this gist.

eregon on 4 Aug 2017

👍2 ❤1

Thank you. I really appreciate it. Very informative and educational.

ch1c0t on 5 Aug 2017

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Ruby 2.7 support

deepj · 16Comments

Official Docker images of Truffleruby

deepj · 20Comments

oj: pthread_mutex_init is unsupported!

deepj · 31Comments

Support the Nokogiri gem

chrisseaton · 27Comments

Build failure 1.0.0-rc6 with macOS Mojave

hsbt · 28Comments