$ elixir --version
Erlang/OTP 19 [erts-8.3] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]
Elixir 1.4.4
It seems that calling String.split/2 is much slower than calling String.splitter/2 and piping the output to Enum.to_list when the input string is quite large.
Here's an example where using String.splitter/2 is at least 3 times faster:
iex(1)> str = (1..1000000) |> Enum.join(","); nil
nil
iex(2)> :timer.tc(fn -> str |> String.splitter(",") |> Enum.to_list |> length end)
{842961, 1000000}
iex(3)> :timer.tc(fn -> str |> String.split(",") |> length end)
{2943151, 1000000}
NOTE: for small strings, String.split/2 is faster
String.split/2 should be at least as fast as String.splitter/2 piped to Enum.to_list.
@waj nice catch. Yeah, one wouldn't expect that. I presume @josevalim will come up with a quick solution :)
Never benchmark in the shell, as code there is evaluated rather than compiled. Do you get the same results if you run it inside a module compiled by a Mix project?
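For illustration, a minimal sketch of such a compiled benchmark (the module name is hypothetical; it mirrors the iex session above):

defmodule SplitBench do
  # Built at compile time, so string construction is not part of the measurement.
  @str Enum.join(1..1_000_000, ",")

  def run do
    {split_us, _} = :timer.tc(fn -> @str |> String.split(",") |> length() end)
    {splitter_us, _} = :timer.tc(fn -> @str |> String.splitter(",") |> Enum.to_list() |> length() end)
    IO.puts("split:    #{split_us} μs")
    IO.puts("splitter: #{splitter_us} μs")
  end
end

# From a Mix project: mix run -e "SplitBench.run()"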
I just benchmarked the compiled version, and the report appears to hold up. Below is my benchmarking code, along with the results of the benchmark:
code
defmodule Split.Fast do
  def split(str) do
    str |> String.splitter(",") |> Enum.to_list()
  end
end

defmodule Split.Slow do
  def split(str) do
    String.split(str, ",")
  end
end

defmodule Split.Benchmark do
  def benchmark do
    Benchee.run(%{
      "splitter |> to_list" => fn -> bench_func(Split.Fast) end,
      "split" => fn -> bench_func(Split.Slow) end
    }, time: 10, print: [fast_warning: false])
  end

  def bench_func(module) do
    (1..1000000) |> Enum.join(",") |> module.split()
  end
end

Split.Benchmark.benchmark()
results
~/sandbox/benchee $ mix run samples/string_split.exs
Operating System: macOS
CPU Information: Intel(R) Core(TM) i5-4260U CPU @ 1.40GHz
Number of Available Cores: 4
Available memory: 8.589934592 GB
Elixir 1.4.2
Erlang 19.2.3
Benchmark suite executing with the following configuration:
warmup: 2.00 s
time: 10.00 s
parallel: 1
inputs: none specified
Estimated total run time: 24.00 s
Benchmarking split...
Benchmarking splitter |> to_list...
Name                      ips        average    deviation    median
splitter |> to_list      0.66         1.51 s       ±7.45%    1.47 s
split                    0.21         4.85 s       ±2.04%    4.84 s

Comparison:
splitter |> to_list      0.66
split                    0.21 - 3.21x slower
I realized the above benchmark wasn't great, since it built the input string inside the measured function, so I re-did it. The results are even more drastic.
code
defmodule Split.Fast do
  def split(str) do
    str |> String.splitter(",") |> Enum.to_list()
  end
end

defmodule Split.Slow do
  def split(str) do
    String.split(str, ",")
  end
end

defmodule Split.Benchmark do
  def benchmark do
    Benchee.run(%{
      "splitter |> to_list" => fn -> bench_func(Split.Fast) end,
      "split" => fn -> bench_func(Split.Slow) end
    }, time: 10, print: [fast_warning: false])
  end

  @str Enum.join((1..1000000), ",")

  def bench_func(module) do
    module.split(@str)
  end
end

Split.Benchmark.benchmark()
results
~/sandbox/benchee $ mix run samples/string_split.exs
Operating System: macOS
CPU Information: Intel(R) Core(TM) i5-4260U CPU @ 1.40GHz
Number of Available Cores: 4
Available memory: 8.589934592 GB
Elixir 1.4.2
Erlang 19.2.3
Benchmark suite executing with the following configuration:
warmup: 2.00 s
time: 10.00 s
parallel: 1
inputs: none specified
Estimated total run time: 24.00 s
Benchmarking split...
Benchmarking splitter |> to_list...
Name                      ips        average    deviation    median
splitter |> to_list      2.01         0.50 s      ±18.24%    0.46 s
split                   0.168         5.94 s       ±1.42%    5.94 s

Comparison:
splitter |> to_list      2.01
split                   0.168 - 11.94x slower
The break-even point seems to be around this input string: Enum.join((1..90000), ",")
results
~/sandbox/benchee $ mix run samples/string_split.exs
Operating System: macOS
CPU Information: Intel(R) Core(TM) i5-4260U CPU @ 1.40GHz
Number of Available Cores: 4
Available memory: 8.589934592 GB
Elixir 1.4.2
Erlang 19.2.3
Benchmark suite executing with the following configuration:
warmup: 2.00 s
time: 10.00 s
parallel: 1
inputs: none specified
Estimated total run time: 24.00 s
Benchmarking split...
Benchmarking splitter |> to_list...
Name                      ips        average    deviation    median
splitter |> to_list     24.46       40.88 ms     ±22.31%    39.49 ms
split                   22.41       44.63 ms     ±18.76%    42.20 ms

Comparison:
splitter |> to_list     24.46
split                   22.41 - 1.09x slower
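To sweep sizes more systematically, Benchee's :inputs option (unused in the runs above, hence "inputs: none specified") could drive both functions over several input strings. A sketch, with arbitrarily chosen sizes around the apparent break-even point:

inputs =
  for n <- [10_000, 50_000, 90_000, 200_000, 1_000_000],
      into: %{},
      do: {"#{n} parts", Enum.join(1..n, ",")}

Benchee.run(%{
  "split" => fn str -> String.split(str, ",") end,
  "splitter |> to_list" => fn str -> str |> String.splitter(",") |> Enum.to_list() end
}, inputs: inputs, time: 5)

With :inputs, Benchee passes each input string as the argument to the benchmarked functions, so string construction stays out of the measurement.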
@potatosalad Hi Andrew, I thought you would be interested in this one. Apparently our version that splits strings as we go ends up faster than the Erlang one that is written in C.
Here is what I am using to benchmark:
defmodule Foo do
  def run(count) do
    str = (1..count) |> Enum.join(",")

    IO.write "splitter: "
    Task.async(fn ->
      :timer.tc(fn -> str |> String.splitter(",") |> Enum.to_list end) |> elem(0) |> IO.inspect
    end) |> Task.await()

    IO.write "binary: "
    Task.async(fn ->
      :timer.tc(fn -> str |> :binary.split(",", [:global]) end) |> elem(0) |> IO.inspect
    end) |> Task.await()

    IO.write "tail: "
    Task.async(fn ->
      :timer.tc(fn -> str |> Foo.split(",") end) |> elem(0) |> IO.inspect
    end) |> Task.await()
  end

  def split(string, pattern) do
    pattern = :binary.compile_pattern(pattern)
    do_split(string, pattern)
  end

  defp do_split(string, pattern) do
    case :binary.split(string, pattern) do
      [first, second] -> [first | do_split(second, pattern)]
      [first] -> [first]
    end
  end
end

IO.puts "## warm up"
Foo.run(10)
IO.puts "## 10"
Foo.run(10)
IO.puts "## 1000"
Foo.run(1000)
IO.puts "## 1000000"
Foo.run(1000000)
Any idea if there is something that could be done on the Erlang side to optimize this case?
I played with it a bit more, and it seems the performance of :binary.split/3 (with :global) degrades significantly with the length of the split parts: the longer the parts, the slower it is overall.
When using :binary.copy("a,", 1_000_000) as input, :binary.split/3 is faster; when using :binary.copy("abcdef,", 1_000_000), it is significantly slower.
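A quick sketch to reproduce the part-length effect (a plain script; absolute timings will vary by machine):

for part <- ["a", "abcdef"] do
  # Same inputs as above: 1M parts joined by commas, parts of 1 or 6 bytes.
  str = :binary.copy(part <> ",", 1_000_000)
  {us, _} = :timer.tc(fn -> :binary.split(str, ",", [:global]) end)
  IO.puts("parts of #{byte_size(part)} byte(s): #{us} μs")
end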
@josevalim TL;DR: Problem with reduction counting in OTP? Maybe?
:binary.split/2,3, :binary.match/2,3, and :binary.matches/2,3 all share the same underlying BIF search code in erts/emulator/beam/erl_bif_binary.c, but the specific performance issue can be narrowed down to lines 1,228 to 1,331 in the do_binary_find function.
The quickest solution might be to have String.split/2,3 detect whenever a global split is being requested for a binary larger than, say, 1 MB (byte size larger than 0x100000) and use the tail implementation instead; a sketch follows below.
In many cases with smaller binaries, a global :binary.split/3 tends to be 2-3x faster than the tail implementation. However, this seems to worsen exponentially for larger binaries with large split "parts."
This should also be fixed in OTP, but the underlying cause may be trickier to resolve in a timely manner.
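For illustration, a minimal sketch of that dispatch (the module name, function names, and threshold attribute are hypothetical; the tail helper is the same tail-recursive split used in the benchmarks here):

defmodule SplitDispatch do
  # Hypothetical 1 MB cutoff, matching the suggestion above.
  @large_binary_threshold 0x100000

  # Fall back to a tail-recursive split for large binaries, where the
  # global :binary.split/3 currently degrades.
  def split_global(string, pattern) when byte_size(string) > @large_binary_threshold do
    split_tail(string, pattern)
  end

  def split_global(string, pattern) do
    :binary.split(string, pattern, [:global])
  end

  defp split_tail(string, pattern) do
    case :binary.split(string, pattern) do
      [first, rest] -> [first | split_tail(rest, pattern)]
      [first] -> [first]
    end
  end
end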
I adjusted your benchmark a little to help pinpoint where the slowdown was coming from:
A few notes on the benchmark below:
- BM and AC are references to whether the Boyer-Moore or the Aho-Corasick search algorithm is being used (BM for a single pattern, AC for more than one). For example, count=4, size=2 results in a binary that looks like <<"aa,aa,aa,aa">> for BM mode and <<"aa,aa;aa;aa">> for AC mode.
- Any size higher than 256 took longer to complete than I had patience for (I waited for a few minutes before giving up).
- I added match and matches implementations to see whether the issue was split specific or rather in the "find all" or "global" search portions of the BIF.
- binary and matches have the lowest number of reductions per microsecond, as they make a single call to the BIF, which then decides when to yield control back to the scheduler.

I ran the benchmark using the following command from the root of the Elixir repository:
bin/elixirc foo.ex && erl -noshell -pa . -pa lib/elixir/ebin -s 'Elixir.Foo'
Here are some of the results from OTP-19.3.4 and Elixir master branch (the warm up rounds have been removed for brevity):
## count=1000000, size=1
[BM] splitter: 823286 μs, 6059987 reds, 7.36 reds/μs
[AC] splitter: 782953 μs, 6059987 reds, 7.74 reds/μs
[BM] binary: 974609 μs, 200113 reds, 0.21 reds/μs
[AC] binary: 1008900 μs, 200113 reds, 0.20 reds/μs
[BM] match: 339587 μs, 3031758 reds, 8.93 reds/μs
[AC] match: 346488 μs, 3031758 reds, 8.75 reds/μs
[BM] matches: 1139377 μs, 2201630 reds, 1.93 reds/μs
[AC] matches: 1101599 μs, 2201630 reds, 2.00 reds/μs
[BM] tail: 224853 μs, 2016232 reds, 8.97 reds/μs
[AC] tail: 219852 μs, 2016232 reds, 9.17 reds/μs
## count=1000000, size=32
[BM] splitter: 931063 μs, 9070884 reds, 9.74 reds/μs
[AC] splitter: 837971 μs, 9068879 reds, 10.82 reds/μs
[BM] binary: 11423930 μs, 3301973 reds, 0.29 reds/μs
[AC] binary: 11157636 μs, 3301973 reds, 0.30 reds/μs
[BM] match: 491531 μs, 6030119 reds, 12.27 reds/μs
[AC] match: 389073 μs, 6033833 reds, 15.51 reds/μs
[BM] matches: 11388190 μs, 5301973 reds, 0.47 reds/μs
[AC] matches: 11304604 μs, 5301973 reds, 0.47 reds/μs
[BM] tail: 428681 μs, 5014975 reds, 11.70 reds/μs
[AC] tail: 304430 μs, 5014975 reds, 16.47 reds/μs
Results from OTP-20.0-rc1 and Elixir master branch show a 30% speed improvement in some cases (however, José's tail implementation still wins):
## count=1000000, size=1
[BM] splitter: 692872 μs, 11002192 reds, 15.88 reds/μs
[AC] splitter: 695571 μs, 11019040 reds, 15.84 reds/μs
[BM] binary: 664868 μs, 883980 reds, 1.33 reds/μs
[AC] binary: 673304 μs, 972364 reds, 1.44 reds/μs
[BM] match: 329214 μs, 4627517 reds, 14.06 reds/μs
[AC] match: 310093 μs, 4563137 reds, 14.72 reds/μs
[BM] matches: 724821 μs, 3974871 reds, 5.48 reds/μs
[AC] matches: 774506 μs, 4101751 reds, 5.30 reds/μs
[BM] tail: 227740 μs, 3554956 reds, 15.61 reds/μs
[AC] tail: 225368 μs, 3537852 reds, 15.70 reds/μs
## count=1000000, size=32
[BM] splitter: 1024214 μs, 14605233 reds, 14.26 reds/μs
[AC] splitter: 878970 μs, 14506473 reds, 16.50 reds/μs
[BM] binary: 8113830 μs, 11902886 reds, 1.47 reds/μs
[AC] binary: 8043617 μs, 11992714 reds, 1.49 reds/μs
[BM] match: 540722 μs, 7778967 reds, 14.39 reds/μs
[AC] match: 383299 μs, 7638382 reds, 19.93 reds/μs
[BM] matches: 8099647 μs, 14698099 reds, 1.81 reds/μs
[AC] matches: 8434463 μs, 15092123 reds, 1.79 reds/μs
[BM] tail: 411121 μs, 6599806 reds, 16.05 reds/μs
[AC] tail: 272319 μs, 6557368 reds, 24.08 reds/μs
Here is the code I used:
# foo.ex
defmodule Foo do
  def run(count) do
    run(count, 1)
    run(count, 32)
  end

  def run(count, size) do
    run(1000, size, "warm up")
    run(1000, size, nil)
    run(count, size, nil)
  end

  def run(count, size, label) do
    if label do
      IO.puts("## count=#{count}, size=#{size} (#{label})")
    else
      IO.puts("## count=#{count}, size=#{size}")
    end

    element = :binary.copy("a", size)

    # Aho-Corasick (multiple patterns)
    ac_elements0 = :lists.duplicate(div(count, 2), element)
    ac_elements1 = :lists.duplicate(div(count, 2) + rem(count, 2), element)
    ac_string = Enum.join(ac_elements0, ",") <> ";" <> Enum.join(ac_elements1, ";")
    ac_pattern = :binary.compile_pattern([",", ";"])
    ac = {ac_string, ac_pattern}

    # Boyer-Moore (single pattern)
    bm_elements = :lists.duplicate(count, element)
    bm_string = Enum.join(bm_elements, ",")
    bm_pattern = :binary.compile_pattern(",")
    bm = {bm_string, bm_pattern}

    ^count = measure "splitter", bm, ac, fn (str, pattern) ->
      str |> String.splitter(pattern) |> Enum.to_list()
    end

    ^count = measure "binary", bm, ac, fn (str, pattern) ->
      str |> :binary.split(pattern, [:global])
    end

    ^count = measure "match", bm, ac, fn (str, pattern) ->
      str |> Foo.split_match(pattern)
    end

    ^count = measure "matches", bm, ac, fn (str, pattern) ->
      str |> Foo.split_matches(pattern)
    end

    ^count = measure "tail", bm, ac, fn (str, pattern) ->
      str |> Foo.split_tail(pattern)
    end
  end

  defp measure(label, bm, ac, fun) do
    count = measure("[BM] #{label}", bm, fun)
    ^count = measure("[AC] #{label}", ac, fun)
    count
  end
  defp measure(label, {string, pattern}, fun) do
    IO.write(String.pad_trailing("#{label}:", 15))

    Task.async(fn() ->
      me = :erlang.self()
      ta = :erlang.monotonic_time()
      {:reductions, ra} = :erlang.process_info(me, :reductions)
      result = fun.(string, pattern)
      {:reductions, rb} = :erlang.process_info(me, :reductions)
      tb = :erlang.monotonic_time()
      r = rb - ra
      t = :erlang.convert_time_unit(tb - ta, :native, :microsecond)
      IO.write(String.pad_leading("#{t}", 14))
      IO.binwrite(" μs, ")
      IO.write(String.pad_leading("#{r}", 14))
      # Reductions per microsecond, as shown in the results above.
      IO.write(" reds, #{Float.round(r / t, 2)} reds/")
      IO.binwrite("μs\n")
      length(result)
    end)
    |> Task.await(:infinity)
  end
  def split_match(string, pattern) do
    do_split_match(string, pattern, byte_size(string), 0, 0)
  end

  defp do_split_match(string, pattern, s, p1, l1) do
    case :binary.match(string, pattern, [scope: {p1 + l1, s - p1 - l1}]) do
      {p2, l2} ->
        [:binary.part(string, p1 + l1, p2 - p1 - l1) | do_split_match(string, pattern, s, p2, l2)]

      :nomatch when p1 == 0 and l1 == 0 ->
        [string]

      :nomatch ->
        [:binary.part(string, p1 + l1, s - p1 - l1)]
    end
  end

  def split_matches(string, pattern) do
    matches = :binary.matches(string, pattern)
    do_split_matches(string, [{0, 0} | matches])
  end

  defp do_split_matches(string, [{p1, l1} | matches = [{p2, _} | _]]) do
    [:binary.part(string, p1 + l1, p2 - p1 - l1) | do_split_matches(string, matches)]
  end

  defp do_split_matches(string, [{0, 0}]) do
    [string]
  end

  defp do_split_matches(string, [{p1, l1}]) do
    [:binary.part(string, p1 + l1, byte_size(string) - p1 - l1)]
  end

  def split_tail(string, pattern) do
    do_split_tail(string, pattern)
  end

  defp do_split_tail(string, pattern) do
    case :binary.split(string, pattern) do
      [first, second] -> [first | do_split_tail(second, pattern)]
      [first] -> [first]
    end
  end

  def start() do
    Foo.run(1000000)
    :erlang.halt()
  end
end
On second thought, it might actually be the "result" functions in the BIF, do_split_global_result and do_match_global_result, which convert the found positions into the large list of binaries or position/length tuples. They _should_ trap after counting reductions whenever such a large list is being returned, but they don't. I would bet this is the cause of the performance issue.
I'll play around with it a little and see if it's an easy enough patch to submit to OTP.
@potatosalad :heart: :green_heart: :blue_heart: :yellow_heart: :purple_heart:
I almost have a working solution to submit to OTP, but wanted to share a progress update.
I've identified the main performance issue (every time the BIF traps, it allocates and copies all of the find state memory inefficiently), and the first version of the patch is ~1.4-3.8x faster than the tail implementation.
## count=1000000, size=1
[BM] split: 98341 us, 327151 reds, 3.33 reds/us
[AC] split: 60746 us, 249111 reds, 4.10 reds/us
[BM] tail: 222346 us, 3495534 reds, 15.72 reds/us
[AC] tail: 232267 us, 3540694 reds, 15.24 reds/us
## count=1000000, size=32
[BM] split: 274017 us, 3476576 reds, 12.69 reds/us
[AC] split: 115176 us, 3366868 reds, 29.23 reds/us
[BM] tail: 384976 us, 6475626 reds, 16.82 reds/us
[AC] tail: 267070 us, 6531898 reds, 24.46 reds/us
This is among the finest cases of nerd-sniping I have seen in a long time =)
I will mark this as an Erlang bug then. @potatosalad, please let us know when you submit a pull request. We are definitely interested in seeing the outcome. Thank you!
Just submitted erlang/otp#1480. The changes result in a 2-3x speed improvement over the tail recursion implementation. The biggest improvement, however, shows up against the current implementation: ~140x faster for a ~128 MB binary with 1 KB parts.
See this benchmark for more information.
Closing this in favor of the OTP PR!