$ elixir --version
Erlang/OTP 19 [erts-8.3] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]
Elixir 1.4.4
It seems that calling String.split/2 is much slower than calling String.splitter/2 and piping the output to Enum.to_list when the input string is quite large.
Here's an example where using String.splitter/2 is at least 3 times faster:
iex(1)> str = (1..1000000) |> Enum.join(","); nil
nil
iex(2)> :timer.tc(fn -> str |> String.splitter(",") |> Enum.to_list |> length end)
{842961, 1000000}
iex(3)> :timer.tc(fn -> str |> String.split(",") |> length end)
{2943151, 1000000}
NOTE: for small strings, String.split/2 is faster
String.split/2 should be at least as fast as String.splitter/2 piped to Enum.to_list.
@waj nice catch. Yeah, one wouldn't expect that. I presume @josevalim will come up with a quick solution :)
Never benchmark in the shell, as code there is evaluated rather than compiled. Do you get the same results if you run it inside a module compiled by a Mix project?
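For illustration, a minimal sketch of such a compiled benchmark (the module name is hypothetical; it mirrors the iex session above):

defmodule SplitBench do
  # Built at compile time, so string construction is not part of the measurement.
  @str Enum.join(1..1_000_000, ",")

  def run do
    {split_us, _} = :timer.tc(fn -> @str |> String.split(",") |> length() end)
    {splitter_us, _} = :timer.tc(fn -> @str |> String.splitter(",") |> Enum.to_list() |> length() end)
    IO.puts("split:    #{split_us} μs")
    IO.puts("splitter: #{splitter_us} μs")
  end
end

# From a Mix project: mix run -e "SplitBench.run()"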
I just benchmarked the compiled version, and the report appears to hold up. Below is my benchmarking code, along with the results of the benchmark:
code
defmodule Split.Fast do
  def split(str) do
    str |> String.splitter(",") |> Enum.to_list()
  end
end

defmodule Split.Slow do
  def split(str) do
    String.split(str, ",")
  end
end

defmodule Split.Benchmark do
  def benchmark do
    Benchee.run(%{
      "splitter |> to_list" => fn -> bench_func(Split.Fast) end,
      "split" => fn -> bench_func(Split.Slow) end
    }, time: 10, print: [fast_warning: false])
  end

  def bench_func(module) do
    (1..1000000) |> Enum.join(",") |> module.split()
  end
end

Split.Benchmark.benchmark()
results
~/sandbox/benchee $ mix run samples/string_split.exs
Operating System: macOS
CPU Information: Intel(R) Core(TM) i5-4260U CPU @ 1.40GHz
Number of Available Cores: 4
Available memory: 8.589934592 GB
Elixir 1.4.2
Erlang 19.2.3
Benchmark suite executing with the following configuration:
warmup: 2.00 s
time: 10.00 s
parallel: 1
inputs: none specified
Estimated total run time: 24.00 s
Benchmarking split...
Benchmarking splitter |> to_list...
Name                      ips        average    deviation    median
splitter |> to_list      0.66         1.51 s       ±7.45%    1.47 s
split                    0.21         4.85 s       ±2.04%    4.84 s

Comparison:
splitter |> to_list      0.66
split                    0.21 - 3.21x slower
I realized the above benchmark wasn't great, since it built the input string inside the measured function, so I re-did it. The results are even more drastic.
code
defmodule Split.Fast do
  def split(str) do
    str |> String.splitter(",") |> Enum.to_list()
  end
end

defmodule Split.Slow do
  def split(str) do
    String.split(str, ",")
  end
end

defmodule Split.Benchmark do
  def benchmark do
    Benchee.run(%{
      "splitter |> to_list" => fn -> bench_func(Split.Fast) end,
      "split" => fn -> bench_func(Split.Slow) end
    }, time: 10, print: [fast_warning: false])
  end

  @str Enum.join((1..1000000), ",")

  def bench_func(module) do
    module.split(@str)
  end
end

Split.Benchmark.benchmark()
results
~/sandbox/benchee $ mix run samples/string_split.exs
Operating System: macOS
CPU Information: Intel(R) Core(TM) i5-4260U CPU @ 1.40GHz
Number of Available Cores: 4
Available memory: 8.589934592 GB
Elixir 1.4.2
Erlang 19.2.3
Benchmark suite executing with the following configuration:
warmup: 2.00 s
time: 10.00 s
parallel: 1
inputs: none specified
Estimated total run time: 24.00 s
Benchmarking split...
Benchmarking splitter |> to_list...
Name                      ips        average    deviation    median
splitter |> to_list      2.01         0.50 s      ±18.24%    0.46 s
split                   0.168         5.94 s       ±1.42%    5.94 s

Comparison:
splitter |> to_list      2.01
split                   0.168 - 11.94x slower
The break-even point seems to be around this input string: Enum.join((1..90000), ",")
results
~/sandbox/benchee $ mix run samples/string_split.exs
Operating System: macOS
CPU Information: Intel(R) Core(TM) i5-4260U CPU @ 1.40GHz
Number of Available Cores: 4
Available memory: 8.589934592 GB
Elixir 1.4.2
Erlang 19.2.3
Benchmark suite executing with the following configuration:
warmup: 2.00 s
time: 10.00 s
parallel: 1
inputs: none specified
Estimated total run time: 24.00 s
Benchmarking split...
Benchmarking splitter |> to_list...
Name                      ips        average    deviation    median
splitter |> to_list     24.46       40.88 ms     ±22.31%    39.49 ms
split                   22.41       44.63 ms     ±18.76%    42.20 ms

Comparison:
splitter |> to_list     24.46
split                   22.41 - 1.09x slower
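To sweep sizes more systematically, Benchee's :inputs option (unused in the runs above, hence "inputs: none specified") could drive both functions over several input strings. A sketch, with arbitrarily chosen sizes around the apparent break-even point:

inputs =
  for n <- [10_000, 50_000, 90_000, 200_000, 1_000_000],
      into: %{},
      do: {"#{n} parts", Enum.join(1..n, ",")}

Benchee.run(%{
  "split" => fn str -> String.split(str, ",") end,
  "splitter |> to_list" => fn str -> str |> String.splitter(",") |> Enum.to_list() end
}, inputs: inputs, time: 5)

With :inputs, Benchee passes each input string as the argument to the benchmarked functions, so string construction stays out of the measurement.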
@potatosalad Hi Andrew, I thought you would be interested in this one. Apparently our version that splits strings as we go ends up faster than the Erlang one that is written in C.
Here is what I am using to benchmark:
defmodule Foo do
  def run(count) do
    str = (1..count) |> Enum.join(",")

    IO.write "splitter: "
    Task.async(fn ->
      :timer.tc(fn -> str |> String.splitter(",") |> Enum.to_list end) |> elem(0) |> IO.inspect
    end) |> Task.await()

    IO.write "binary: "
    Task.async(fn ->
      :timer.tc(fn -> str |> :binary.split(",", [:global]) end) |> elem(0) |> IO.inspect
    end) |> Task.await()

    IO.write "tail: "
    Task.async(fn ->
      :timer.tc(fn -> str |> Foo.split(",") end) |> elem(0) |> IO.inspect
    end) |> Task.await()
  end

  def split(string, pattern) do
    pattern = :binary.compile_pattern(pattern)
    do_split(string, pattern)
  end

  defp do_split(string, pattern) do
    case :binary.split(string, pattern) do
      [first, second] -> [first | do_split(second, pattern)]
      [first] -> [first]
    end
  end
end

IO.puts "## warm up"
Foo.run(10)
IO.puts "## 10"
Foo.run(10)
IO.puts "## 1000"
Foo.run(1000)
IO.puts "## 1000000"
Foo.run(1000000)
Any idea if there is something that could be done on the Erlang side to optimize this case?
I played with it a bit more, and it seems the performance of :binary.split/3 (with :global) degrades significantly with the length of the split parts: the longer the parts, the slower it is overall.
When using :binary.copy("a,", 1_000_000) as input, :binary.split/3 is faster; when using :binary.copy("abcdef,", 1_000_000), it is significantly slower.
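A quick sketch to reproduce the part-length effect (a plain script; absolute timings will vary by machine):

for part <- ["a", "abcdef"] do
  # Same inputs as above: 1M parts joined by commas, parts of 1 or 6 bytes.
  str = :binary.copy(part <> ",", 1_000_000)
  {us, _} = :timer.tc(fn -> :binary.split(str, ",", [:global]) end)
  IO.puts("parts of #{byte_size(part)} byte(s): #{us} μs")
end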
@josevalim TL;DR: Problem with reduction counting in OTP? Maybe?
:binary.split/2,3, :binary.match/2,3, and :binary.matches/2,3 all share the same underlying BIF search code in erts/emulator/beam/erl_bif_binary.c, but the specific performance issue can be narrowed down to lines 1,228 to 1,331 in the do_binary_find function.
The quickest solution might be to have String.split/2,3 detect whenever a global split is being requested for a binary larger than, say, 1 MB (byte size larger than 0x100000) and use the tail implementation instead; a sketch follows below.
In many cases with smaller binaries, a global :binary.split/3 tends to be 2-3x faster than the tail implementation. However, this seems to worsen exponentially for larger binaries with large split "parts."
This should also be fixed in OTP, but the underlying cause may be trickier to resolve in a timely manner.
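For illustration, a minimal sketch of that dispatch (the module name, function names, and threshold attribute are hypothetical; the tail helper is the same tail-recursive split used in the benchmarks here):

defmodule SplitDispatch do
  # Hypothetical 1 MB cutoff, matching the suggestion above.
  @large_binary_threshold 0x100000

  # Fall back to a tail-recursive split for large binaries, where the
  # global :binary.split/3 currently degrades.
  def split_global(string, pattern) when byte_size(string) > @large_binary_threshold do
    split_tail(string, pattern)
  end

  def split_global(string, pattern) do
    :binary.split(string, pattern, [:global])
  end

  defp split_tail(string, pattern) do
    case :binary.split(string, pattern) do
      [first, rest] -> [first | split_tail(rest, pattern)]
      [first] -> [first]
    end
  end
end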
I adjusted your benchmark a little to help pinpoint where the slowdown was coming from:
A few notes on the benchmark below:
- BM and AC are references to whether the Boyer-Moore or the Aho-Corasick search algorithm is being used (BM for a single pattern, AC for more than one). For example, count=4, size=2 results in a binary that looks like <<"aa,aa,aa,aa">> for BM mode and <<"aa,aa;aa;aa">> for AC mode.
- Any size higher than 256 took longer to complete than I had patience for (I waited for a few minutes before giving up).
- I added match and matches implementations to see whether the issue was split specific or rather in the "find all" or "global" search portions of the BIF.
- binary and matches have the lowest number of reductions per microsecond, as they make a single call to the BIF, which then decides when to yield control back to the scheduler.

I ran the benchmark using the following command from the root of the Elixir repository:
bin/elixirc foo.ex && erl -noshell -pa . -pa lib/elixir/ebin -s 'Elixir.Foo'
Here are some of the results from OTP-19.3.4 and Elixir master branch (the warm up rounds have been removed for brevity):
## count=1000000, size=1
[BM] splitter: 823286 μs, 6059987 reds, 7.36 reds/μs
[AC] splitter: 782953 μs, 6059987 reds, 7.74 reds/μs
[BM] binary: 974609 μs, 200113 reds, 0.21 reds/μs
[AC] binary: 1008900 μs, 200113 reds, 0.20 reds/μs
[BM] match: 339587 μs, 3031758 reds, 8.93 reds/μs
[AC] match: 346488 μs, 3031758 reds, 8.75 reds/μs
[BM] matches: 1139377 μs, 2201630 reds, 1.93 reds/μs
[AC] matches: 1101599 μs, 2201630 reds, 2.00 reds/μs
[BM] tail: 224853 μs, 2016232 reds, 8.97 reds/μs
[AC] tail: 219852 μs, 2016232 reds, 9.17 reds/μs
## count=1000000, size=32
[BM] splitter: 931063 μs, 9070884 reds, 9.74 reds/μs
[AC] splitter: 837971 μs, 9068879 reds, 10.82 reds/μs
[BM] binary: 11423930 μs, 3301973 reds, 0.29 reds/μs
[AC] binary: 11157636 μs, 3301973 reds, 0.30 reds/μs
[BM] match: 491531 μs, 6030119 reds, 12.27 reds/μs
[AC] match: 389073 μs, 6033833 reds, 15.51 reds/μs
[BM] matches: 11388190 μs, 5301973 reds, 0.47 reds/μs
[AC] matches: 11304604 μs, 5301973 reds, 0.47 reds/μs
[BM] tail: 428681 μs, 5014975 reds, 11.70 reds/μs
[AC] tail: 304430 μs, 5014975 reds, 16.47 reds/μs
Results from OTP-20.0-rc1 and Elixir master branch show a 30% speed improvement in some cases (however, José's tail implementation still wins):
## count=1000000, size=1
[BM] splitter: 692872 μs, 11002192 reds, 15.88 reds/μs
[AC] splitter: 695571 μs, 11019040 reds, 15.84 reds/μs
[BM] binary: 664868 μs, 883980 reds, 1.33 reds/μs
[AC] binary: 673304 μs, 972364 reds, 1.44 reds/μs
[BM] match: 329214 μs, 4627517 reds, 14.06 reds/μs
[AC] match: 310093 μs, 4563137 reds, 14.72 reds/μs
[BM] matches: 724821 μs, 3974871 reds, 5.48 reds/μs
[AC] matches: 774506 μs, 4101751 reds, 5.30 reds/μs
[BM] tail: 227740 μs, 3554956 reds, 15.61 reds/μs
[AC] tail: 225368 μs, 3537852 reds, 15.70 reds/μs
## count=1000000, size=32
[BM] splitter: 1024214 μs, 14605233 reds, 14.26 reds/μs
[AC] splitter: 878970 μs, 14506473 reds, 16.50 reds/μs
[BM] binary: 8113830 μs, 11902886 reds, 1.47 reds/μs
[AC] binary: 8043617 μs, 11992714 reds, 1.49 reds/μs
[BM] match: 540722 μs, 7778967 reds, 14.39 reds/μs
[AC] match: 383299 μs, 7638382 reds, 19.93 reds/μs
[BM] matches: 8099647 μs, 14698099 reds, 1.81 reds/μs
[AC] matches: 8434463 μs, 15092123 reds, 1.79 reds/μs
[BM] tail: 411121 μs, 6599806 reds, 16.05 reds/μs
[AC] tail: 272319 μs, 6557368 reds, 24.08 reds/μs
Here is the code I used:
# foo.ex
defmodule Foo do
  def run(count) do
    run(count, 1)
    run(count, 32)
  end

  def run(count, size) do
    run(1000, size, "warm up")
    run(1000, size, nil)
    run(count, size, nil)
  end

  def run(count, size, label) do
    if label do
      IO.puts("## count=#{count}, size=#{size} (#{label})")
    else
      IO.puts("## count=#{count}, size=#{size}")
    end

    element = :binary.copy("a", size)

    # Aho-Corasick (multiple patterns)
    ac_elements0 = :lists.duplicate(div(count, 2), element)
    ac_elements1 = :lists.duplicate(div(count, 2) + rem(count, 2), element)
    ac_string = Enum.join(ac_elements0, ",") <> ";" <> Enum.join(ac_elements1, ";")
    ac_pattern = :binary.compile_pattern([",", ";"])
    ac = {ac_string, ac_pattern}

    # Boyer-Moore (single pattern)
    bm_elements = :lists.duplicate(count, element)
    bm_string = Enum.join(bm_elements, ",")
    bm_pattern = :binary.compile_pattern(",")
    bm = {bm_string, bm_pattern}

    ^count = measure "splitter", bm, ac, fn (str, pattern) ->
      str |> String.splitter(pattern) |> Enum.to_list()
    end

    ^count = measure "binary", bm, ac, fn (str, pattern) ->
      str |> :binary.split(pattern, [:global])
    end

    ^count = measure "match", bm, ac, fn (str, pattern) ->
      str |> Foo.split_match(pattern)
    end

    ^count = measure "matches", bm, ac, fn (str, pattern) ->
      str |> Foo.split_matches(pattern)
    end

    ^count = measure "tail", bm, ac, fn (str, pattern) ->
      str |> Foo.split_tail(pattern)
    end
  end

  defp measure(label, bm, ac, fun) do
    count = measure("[BM] #{label}", bm, fun)
    ^count = measure("[AC] #{label}", ac, fun)
    count
  end
  defp measure(label, {string, pattern}, fun) do
    IO.write(String.pad_trailing("#{label}:", 15))

    Task.async(fn() ->
      me = :erlang.self()
      ta = :erlang.monotonic_time()
      {:reductions, ra} = :erlang.process_info(me, :reductions)
      result = fun.(string, pattern)
      {:reductions, rb} = :erlang.process_info(me, :reductions)
      tb = :erlang.monotonic_time()
      r = rb - ra
      t = :erlang.convert_time_unit(tb - ta, :native, :microsecond)
      IO.write(String.pad_leading("#{t}", 14))
      IO.binwrite(" μs, ")
      IO.write(String.pad_leading("#{r}", 14))
      # Reductions per microsecond, as shown in the results above.
      IO.write(" reds, #{Float.round(r / t, 2)} reds/")
      IO.binwrite("μs\n")
      length(result)
    end)
    |> Task.await(:infinity)
  end
  def split_match(string, pattern) do
    do_split_match(string, pattern, byte_size(string), 0, 0)
  end

  defp do_split_match(string, pattern, s, p1, l1) do
    case :binary.match(string, pattern, [scope: {p1 + l1, s - p1 - l1}]) do
      {p2, l2} ->
        [:binary.part(string, p1 + l1, p2 - p1 - l1) | do_split_match(string, pattern, s, p2, l2)]

      :nomatch when p1 == 0 and l1 == 0 ->
        [string]

      :nomatch ->
        [:binary.part(string, p1 + l1, s - p1 - l1)]
    end
  end

  def split_matches(string, pattern) do
    matches = :binary.matches(string, pattern)
    do_split_matches(string, [{0, 0} | matches])
  end

  defp do_split_matches(string, [{p1, l1} | matches = [{p2, _} | _]]) do
    [:binary.part(string, p1 + l1, p2 - p1 - l1) | do_split_matches(string, matches)]
  end

  defp do_split_matches(string, [{0, 0}]) do
    [string]
  end

  defp do_split_matches(string, [{p1, l1}]) do
    [:binary.part(string, p1 + l1, byte_size(string) - p1 - l1)]
  end

  def split_tail(string, pattern) do
    do_split_tail(string, pattern)
  end

  defp do_split_tail(string, pattern) do
    case :binary.split(string, pattern) do
      [first, second] -> [first | do_split_tail(second, pattern)]
      [first] -> [first]
    end
  end

  def start() do
    Foo.run(1000000)
    :erlang.halt()
  end
end
On second thought, it might actually be the "result" functions in the BIF, do_split_global_result and do_match_global_result, which convert the found positions into the large list of binaries or position/length tuples. They _should_ trap after counting reductions whenever such a large list is being returned, but they don't. I would bet this is the cause of the performance issue.
I'll play around with it a little and see if it's an easy enough patch to submit to OTP.
@potatosalad :heart: :green_heart: :blue_heart: :yellow_heart: :purple_heart:
I almost have a working solution to submit to OTP, but wanted to share a progress update.
I've identified the main performance issue (every time the BIF traps, it allocates and copies all of the find state memory inefficiently), and the first version of the patch is ~1.4-3.8x faster than the tail implementation.
## count=1000000, size=1
[BM] split: 98341 us, 327151 reds, 3.33 reds/us
[AC] split: 60746 us, 249111 reds, 4.10 reds/us
[BM] tail: 222346 us, 3495534 reds, 15.72 reds/us
[AC] tail: 232267 us, 3540694 reds, 15.24 reds/us
## count=1000000, size=32
[BM] split: 274017 us, 3476576 reds, 12.69 reds/us
[AC] split: 115176 us, 3366868 reds, 29.23 reds/us
[BM] tail: 384976 us, 6475626 reds, 16.82 reds/us
[AC] tail: 267070 us, 6531898 reds, 24.46 reds/us
This is among the finest cases of nerd-sniping I have seen in a long time =)
I will mark this as an Erlang bug then. @potatosalad, please let us know when you submit a pull request. We are definitely interested in seeing the outcome. Thank you!
Just submitted erlang/otp#1480. The changes result in a 2-3x speed improvement over the tail recursion implementation. The biggest improvement, however, shows up against the current implementation: ~140x faster for a ~128 MB binary with 1 KB parts.
See this benchmark for more information.
Closing this in favor of the OTP PR!