I wrote the following Array#to_proc. It makes Ruby more concise and pleasant to me.
Unfortunately, it leads to a performance hit on MRI(at the moment). I wonder if it can be optimized on TruffleRuby. Could it possibly become a zero-cost abstraction?
Here is an example:
require 'to_proc/all'
array_of_numbers = [0,1,2,3,4,5]
range = 1..2
array_of_numbers.map { |number| range.include? number }
array_of_numbers.map &[range, :include?]
#=> [false, true, true, false, false, false]
array_of_arrays = [[:a],[:b],[:c],[:d]]
array_to_append = [1,2]
array_of_arrays.map { |array| array + array_to_append }
array_of_arrays.map &[:+, array_to_append]
#=> [[:a, 1, 2], [:b, 1, 2], [:c, 1, 2], [:d, 1, 2]]
I am looking for the ways to contribute to TruffleRuby(probably after JVM 9 released). Performance optimizations for my Array#to_proc use case is one of the topics that interest me the most, at the moment.
I would be grateful if you could point me to the relevant parts of the codebase and relevant materials.
Hello,
This looks like it should optimize very well under TruffleRuby, if the method containing the call to to_proc is compiled.
Array#[], Array#empty?, etc..is_a? should fold, and send as well.I would propose as the first step to write a benchmark of a few examples you care about like the ones above using benchmark/ips. Then we can assess the current performance, and more easily dig into potential issues.
Thank you. Here is a benchmark. The results on MRI 2.4:
Warming up --------------------------------------
Array#to_proc 1.000 i/100ms
default 1.000 i/100ms
Calculating -------------------------------------
Array#to_proc 2.676 (± 0.0%) i/s - 14.000 in 5.233228s
default 3.301 (± 0.0%) i/s - 17.000 in 5.151718s
Comparison:
default: 3.3 i/s
Array#to_proc: 2.7 i/s - 1.23x slower
Warming up --------------------------------------
Array#to_proc 25.267k i/100ms
default 48.108k i/100ms
Calculating -------------------------------------
Array#to_proc 309.345k (± 0.6%) i/s - 1.567M in 5.064304s
default 708.505k (± 0.9%) i/s - 3.560M in 5.025035s
Comparison:
default: 708504.6 i/s
Array#to_proc: 309344.9 i/s - 2.29x slower
We got interesting results!
I used GraalVM 0.26 from OTN and MRI 2.4.1.
The first change I made to the benchmark is to allow it to run for longer, by adding
x.iterations = 5
to both Benchmark.ips blocks.
This runs each benchmark five times consecutively, which allow results to stabilize.
Let's run each Benchmark.ips group one by one.
Otherwise it would mean in the second benchmark we run a compiled version of
Array#to_proc that is the combination of the usage in the first and the second benchmark.
To do that, I just commented the Benchmark.ips group to not run.
MRI:
$ ruby -v
ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
$ ruby bench_array_to_proc_iters.rb
Warming up --------------------------------------
#to_proc include? 1.000 i/100ms
block include? 1.000 i/100ms
#to_proc include? 1.000 i/100ms
block include? 1.000 i/100ms
#to_proc include? 1.000 i/100ms
block include? 1.000 i/100ms
#to_proc include? 1.000 i/100ms
block include? 1.000 i/100ms
#to_proc include? 1.000 i/100ms
block include? 1.000 i/100ms
Calculating -------------------------------------
#to_proc include? 7.063 (± 0.0%) i/s - 36.000 in 5.097487s
block include? 8.553 (± 0.0%) i/s - 43.000 in 5.027790s
#to_proc include? 7.028 (± 0.0%) i/s - 36.000 in 5.122740s
block include? 8.500 (± 0.0%) i/s - 43.000 in 5.061580s
#to_proc include? 7.036 (± 0.0%) i/s - 36.000 in 5.117119s
block include? 8.527 (± 0.0%) i/s - 43.000 in 5.043421s
#to_proc include? 7.022 (± 0.0%) i/s - 36.000 in 5.128119s
block include? 8.501 (± 0.0%) i/s - 43.000 in 5.060190s
#to_proc include? 7.030 (± 0.0%) i/s - 36.000 in 5.121864s
block include? 8.498 (± 0.0%) i/s - 43.000 in 5.062551s
Comparison:
block include?: 8.5 i/s
#to_proc include?: 7.0 i/s - 1.21x slower
So that's like your results, there is a significant overhead on MRI.
$ ~/code/graalvm-0.26/bin/ruby -v
truffleruby 0.26, like ruby 2.3.3 <OpenJDK 64-Bit Graal VM 1.8.0_121-b13 with Graal> [linux-x86_64]
# I reuse the benchmark-ips installed by MRI for convenience here
# One could also do ~/code/graalvm-0.26/bin/ruby -S gem install benchmark-ips,
# but for that TruffleRuby needs to be integrated with a Ruby manager
# (or use TRUFFLERUBY_RESILIENT_GEM_HOME=true).
$ ~/code/graalvm-0.26/bin/ruby -I~/.gem/ruby/2.4.1/gems/benchmark-ips-2.7.2/lib bench_array_to_proc_iters.rb
Warming up --------------------------------------
#to_proc include? 8.000 i/100ms
block include? 11.000 i/100ms
#to_proc include? 15.000 i/100ms
block include? 13.000 i/100ms
#to_proc include? 33.000 i/100ms
block include? 21.000 i/100ms
#to_proc include? 40.000 i/100ms
block include? 26.000 i/100ms
#to_proc include? 40.000 i/100ms
block include? 26.000 i/100ms
Calculating -------------------------------------
#to_proc include? 179.213 (±11.2%) i/s - 880.000 in 5.011706s
block include? 254.029 (±23.6%) i/s - 1.144k in 5.024343s
#to_proc include? 452.736 (± 2.0%) i/s - 2.280k in 5.038072s
block include? 282.178 (± 1.8%) i/s - 1.430k in 5.069233s
#to_proc include? 453.829 (± 1.1%) i/s - 2.280k in 5.024652s
block include? 283.608 (± 0.7%) i/s - 1.430k in 5.042473s
#to_proc include? 454.198 (± 1.1%) i/s - 2.280k in 5.020536s
block include? 282.818 (± 1.1%) i/s - 1.430k in 5.057029s
#to_proc include? 455.261 (± 0.9%) i/s - 2.280k in 5.008603s
block include? 273.327 (± 5.1%) i/s - 1.378k in 5.055163s
Comparison:
#to_proc include?: 455.3 i/s
block include?: 273.3 i/s - 1.67x slower
First benchmark: instead of (#to_proc, block) 7/8.5 iterations/s, TruffleRuby runs 455/273 iterations/s.
That's a 65x/32x speedup over MRI!
But wait, #to_proc include? is actually 1.67x faster than the block version?
Actually there is a reason for that!
The benchmark reads the variable range captured 2 lexical scopes above in the inner loop:
array_of_numbers = 1..1_000_000
range = 20..40_000
Benchmark.ips do |x|
x.iterations = 5
x.report '#to_proc include?' do
array_of_numbers.map &[range, :include?]
end
x.report 'block include?' do
array_of_numbers.map { |number| range.include? number }
end
end
But with Array#to_proc we actually read it only once per iteration,
and then read it from the lambda in #to_proc which folds away.
To be more fair, we can store range in a constant MY_RANGE:
~/code/graalvm-0.26/bin/ruby -I~/.gem/ruby/2.4.1/gems/benchmark-ips-2.7.2/lib bench_array_to_proc_iters_const.rb
Warming up --------------------------------------
#to_proc include? 8.000 i/100ms
block include? 22.000 i/100ms
#to_proc include? 15.000 i/100ms
block include? 27.000 i/100ms
#to_proc include? 34.000 i/100ms
block include? 31.000 i/100ms
#to_proc include? 77.000 i/100ms
block include? 77.000 i/100ms
#to_proc include? 77.000 i/100ms
block include? 77.000 i/100ms
Calculating -------------------------------------
#to_proc include? 176.797 (± 9.6%) i/s - 924.000 in 5.292224s
block include? 726.082 (±18.5%) i/s - 3.311k in 5.038979s
#to_proc include? 773.143 (± 1.9%) i/s - 3.927k in 5.081265s
block include? 774.650 (± 2.2%) i/s - 3.927k in 5.071775s
#to_proc include? 775.459 (± 1.5%) i/s - 3.927k in 5.065373s
block include? 750.786 (± 3.2%) i/s - 3.773k in 5.030590s
#to_proc include? 749.608 (± 2.5%) i/s - 3.773k in 5.036557s
block include? 749.925 (± 3.9%) i/s - 3.773k in 5.039326s
#to_proc include? 760.790 (± 3.2%) i/s - 3.850k in 5.065920s
block include? 767.078 (± 2.7%) i/s - 3.850k in 5.023114s
Comparison:
block include?: 767.1 i/s
#to_proc include?: 760.8 i/s - same-ish: difference falls within error
Now there is no significant difference between the two versions, and we are
almost 100x faster than MRI (changing to a constant did not change MRI's score).
This benchmark has the same problem as the one above,
array_to_append should be a constant (ARRAY_TO_APPEND) to be fair.
So we change the benchmark like that.
$ ruby bench_array_to_proc_iters_const.rb
Warming up --------------------------------------
#to_proc append 81.937k i/100ms
block append 152.283k i/100ms
#to_proc append 80.402k i/100ms
block append 142.596k i/100ms
#to_proc append 78.866k i/100ms
block append 149.598k i/100ms
#to_proc append 78.686k i/100ms
block append 149.088k i/100ms
#to_proc append 76.223k i/100ms
block append 147.299k i/100ms
Calculating -------------------------------------
#to_proc append 926.243k (± 4.1%) i/s - 4.650M in 5.028747s
block append 1.952M (± 6.7%) i/s - 9.722M in 5.003298s
#to_proc append 921.311k (± 5.1%) i/s - 4.650M in 5.060526s
block append 2.008M (± 4.7%) i/s - 10.016M in 5.000673s
#to_proc append 942.812k (± 3.5%) i/s - 4.726M in 5.019020s
block append 1.971M (± 4.8%) i/s - 9.869M in 5.018637s
#to_proc append 883.998k (± 6.9%) i/s - 4.421M in 5.026459s
block append 2.019M (± 3.5%) i/s - 10.164M in 5.040880s
#to_proc append 911.273k (± 4.4%) i/s - 4.573M in 5.028715s
block append 1.982M (± 5.3%) i/s - 10.016M in 5.070194s
Comparison:
block append: 1981502.9 i/s
#to_proc append: 911273.3 i/s - 2.17x slower
That's a pretty big overhead.
$ ~/code/graalvm-0.26/bin/ruby -I~/.gem/ruby/2.4.1/gems/benchmark-ips-2.7.2/lib bench_array_to_proc_iters_const.rb
Warming up --------------------------------------
#to_proc append 637.580k i/100ms
block append 780.447k i/100ms
#to_proc append 749.302k i/100ms
block append 877.479k i/100ms
#to_proc append 1.042M i/100ms
block append 1.066M i/100ms
#to_proc append 977.468k i/100ms
block append 1.095M i/100ms
#to_proc append 1.093M i/100ms
block append 1.101M i/100ms
Calculating -------------------------------------
#to_proc append 14.166M (±14.7%) i/s - 65.610M in 5.034431s
block append 14.582M (±10.2%) i/s - 70.452M in 5.049298s
#to_proc append 14.632M (± 2.5%) i/s - 73.264M in 5.010349s
block append 14.578M (± 4.0%) i/s - 73.754M in 5.068358s
#to_proc append 14.304M (± 4.0%) i/s - 72.170M in 5.054003s
block append 14.819M (± 1.8%) i/s - 74.855M in 5.053035s
#to_proc append 14.692M (± 2.3%) i/s - 74.357M in 5.064026s
block append 14.861M (± 1.5%) i/s - 74.855M in 5.038001s
#to_proc append 14.744M (± 1.5%) i/s - 74.357M in 5.044541s
block append 14.876M (± 1.5%) i/s - 74.855M in 5.033147s
Comparison:
block append: 14876012.3 i/s
#to_proc append: 14743577.5 i/s - same-ish: difference falls within error
No significant difference. Here again Array#to_proc has no overhead!
And TruffleRuby is (block, #to_proc) 7.5x/16x faster than MRI.
The speedup is less, because this benchmark needs to allocate a lot of new arrays.
The benchmark files I used are in this gist.
Thank you. I really appreciate it. Very informative and educational.
Most helpful comment
We got interesting results!
I used GraalVM 0.26 from OTN and MRI 2.4.1.
The first change I made to the benchmark is to allow it to run for longer, by adding
to both Benchmark.ips blocks.
This runs each benchmark five times consecutively, which allow results to stabilize.
Let's run each Benchmark.ips group one by one.
Otherwise it would mean in the second benchmark we run a compiled version of
Array#to_proc that is the combination of the usage in the first and the second benchmark.
To do that, I just commented the Benchmark.ips group to not run.
First benchmark: include?
So that's like your results, there is a significant overhead on MRI.
First benchmark: instead of (#to_proc, block) 7/8.5 iterations/s, TruffleRuby runs 455/273 iterations/s.
That's a 65x/32x speedup over MRI!
But wait,
#to_proc include?is actually 1.67x faster than the block version?Actually there is a reason for that!
The benchmark reads the variable
rangecaptured 2 lexical scopes above in the inner loop:But with Array#to_proc we actually read it only once per iteration,
and then read it from the lambda in #to_proc which folds away.
To be more fair, we can store
rangein a constantMY_RANGE:Now there is no significant difference between the two versions, and we are
almost 100x faster than MRI (changing to a constant did not change MRI's score).
Second benchmark: append
This benchmark has the same problem as the one above,
array_to_appendshould be a constant (ARRAY_TO_APPEND) to be fair.So we change the benchmark like that.
That's a pretty big overhead.
No significant difference. Here again Array#to_proc has no overhead!
And TruffleRuby is (block, #to_proc) 7.5x/16x faster than MRI.
The speedup is less, because this benchmark needs to allocate a lot of new arrays.
The benchmark files I used are in this gist.