Fsharp: Performance of certain build-in functions, esp the ones related to Array manipulation

Created on 3 Jun 2020  Â·  26Comments  Â·  Source: dotnet/fsharp

I've noticed for some time now that quite a few functions that are called in many places are suboptimal in terms of performance. For instance, Array.create, Array.init, Array.concat and Array.filter can be improved upon with sometimes about 1.2-1.5x the speed, sometimes even manyfold (I got Array.replicate times source |> Array.concat 25X faster with a dedicated Array.repeat, so that was a bit of cheating).

Sometimes such improvements relate specifically to large arrays, sometimes specifically to small ones. In line of that, String.xxx functions that use StringBuilder when replaced with ToCharArray() instead improve by about 20-30%.

Before I start making a PR, which would certainly include showing the timings with BenchmarkDotNet for a variety of inputs, would such a PR be welcome? While nothing I tried is really complex, and what I tried is provably correct, as always with optimizations, they will increase the complexity of the code.

To get some idea, here's String.map vs a very simple optimization of str.ToCharArray() |> Array.map mapper |> String, which in many cases wins 40% (needs more input lengths comparisons, though, this was with 12-char string):

| Method | Job | Jit | Platform | Runtime | Mean | Error | StdDev | Min | Max | Ratio | Rank | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------- |----------- |---------- |--------- |----------- |----------:|---------:|---------:|----------:|----------:|------:|-----:|-------:|------:|------:|----------:|
| fastMap | Job-MACJMZ | LegacyJit | X64 | .NET 4.8 | 106.91ns | 1.069ns | 1.000ns | 105.31ns | 108.28ns | 0.65 | 1 | 0.0263 | - | - | 168 B |
| stringMap | Job-MACJMZ | LegacyJit | X64 | .NET 4.8 | 164.29ns | 2.743ns | 2.291ns | 160.02ns | 169.34ns | 1.00 | 2 | 0.0315 | - | - | 201 B |
| | | | | | | | | | | | | | | | |
| fastMap | Job-CZIGBL | LegacyJit | X86 | .NET 4.8 | 118.16ns | 1.094ns | 0.970ns | 116.88ns | 120.14ns | 0.64 | 1 | 0.0219 | - | - | 120 B |
| stringMap | Job-CZIGBL | LegacyJit | X86 | .NET 4.8 | 184.93ns | 1.647ns | 1.540ns | 181.82ns | 187.86ns | 1.00 | 2 | 0.0233 | - | - | 128 B |
| | | | | | | | | | | | | | | | |
| fastMap | Job-FZSSVS | RyuJit | X64 | .NET 4.8 | 95.61ns | 0.974ns | 0.911ns | 94.33ns | 97.43ns | 0.63 | 1 | 0.0261 | - | - | 168 B |
| stringMap | Job-FZSSVS | RyuJit | X64 | .NET 4.8 | 151.22ns | 1.639ns | 1.368ns | 149.13ns | 154.09ns | 1.00 | 2 | 0.0315 | - | - | 201 B |
| | | | | | | | | | | | | | | | |
| fastMap | Job-OXSCJZ | LegacyJit | X86 | .NET 4.6.1 | 117.96ns | 1.202ns | 1.065ns | 116.12ns | 119.48ns | 0.62 | 1 | 0.0227 | - | - | 120 B |
| stringMap | Job-OXSCJZ | LegacyJit | X86 | .NET 4.6.1 | 191.32ns | 3.016ns | 2.673ns | 188.10ns | 196.17ns | 1.00 | 2 | 0.0237 | - | - | 128 B |

Area-Library Feature Improvement

Most helpful comment

From experience in this repo I can say that bigger performance PRs will
sometimes have a hard time to get in.

Abel Braaksma notifications@github.com schrieb am Di., 16. Juni 2020,
20:05:

Please do it as separate PRs so that it's easier to get those accepted

I was figuring to do one PR for the string stuff, which is really just a
few lines in a single file
https://github.com/dotnet/fsharp/blob/master/src/fsharp/FSharp.Core/string.fs,
but if you think it is better to do it function-by-function, I can do that,
but I thought that would make it harder instead of easier...

@cartermp https://github.com/cartermp, @TIHan https://github.com/TIHan,
@dsyme https://github.com/dsyme, any preference here so that I don't
have to do the PR's twice?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/dotnet/fsharp/issues/9390#issuecomment-644922607, or
unsubscribe
https://github.com/notifications/unsubscribe-auth/AAAOANA5G7FVK3G2I2ZPV4DRW6X7LANCNFSM4NR6H2KA
.

All 26 comments

@abelbraaksma great observation. Would you be interested in submitting a PR to adjust these functions?

@cartermp, sure thing, I just wasn't certain how much of "performance sugar" is allowed. I mean, I would assume using Span is not possible because we still support older .NET versions, but heavily-used functions like Array.create benefits tremendously with some loop-unrolling and block-alignment, for instance.

I'll submit a PR (will probably mark it WIP as this may need some iterations) and have you comment on whether certain performance-doping is allowed or not.

@cartermp, one other question, for perf comparison, would it be enough to just run RuyJit with .NET 4.8 on 64 bit, plus one LegacyJit on x86, or should I make more system configurations (.NET Core, Standard, I don't have Mono installed...)? It should be enough to focus on the 80% users group right, and not focus too much on niche configurations?

I think .NET Core (latest) and .NET Framework (any version) would be good enough to run some benchmarks against.

Regarding performance sugar, yes Span can't be used because that would make FSharp.Core inaccessibly anywhere except for .NET Core and Xamarin. But anything else should be fine, so long as it doesn't change the public API or the end result of the functions.

I think a PR would be fine.

After playing around a little with the various ways String.map can be optimized, I'm pasting the summarized results here. I expect to be able to use the "lessons learned" here with the other functions, so that I can settle with less variants to test.

What I tried:

  • Removing StringBuilder, as we know the size of the string beforehand.
  • Inlining the loop as opposed to Array.map and loop unrolling. The latter proved not so effective as can be seen below (x2 means 2 unrolls, x4 means 4)
  • while vs for, which should be indistinguishable mostly, and on .NET Core that's true, but somehow on .NET Framework (not shown here) while was significantly slower
  • several ways of allocating the new string
  • an approach with fixed and String.Copy. These are the fastest by far, and use least memory, but may not be "admissable" as it makes the function unsafe.
  • for comparison, I tried a view approaches with Span and String.Create. These cannot be used, as they require a dependency for .NET Framework. They are a good alternative to fixed, also little memory use and faster than the rest, except on small strings.
  • optimizing for small strings, which is smallMap in the overview. This is the winner on the low range, but slightly slower for larger input (though this may be measurement deviation, the code is the same as fasterMap, except for a few leading ifs).

My own verdict:

  • If possible, I would go for fixed without loop-unrolling. It is simple enough, and it uses least memory, and is 3x to even 4x faster than the current implementation. The larger the input, the bigger the difference.
  • Otherwise, the choice should be between fasterMap, simple and straightforward, but generally fastest without "doping", or smallMap, if we want to have top performance on strings up to 8 chars (above that, they're mostly equal).

Note that any alternative I tried was faster than the current implementation for average input. And almost all of them have less GC pressure and memory use.

I've also tested with very large inputs up until 100MB, usually the perf difference shift in favor of the new algorithms, because much less memory is used. With 100MB, fixed is almost 5x times faster than current.

Benchmarks were run with BDN, using 0.01 max relative error. Originally each test was run with a variety of mapper functions, but that made little change, so I left those variations out.

  • bold is fastest of the likely candidates
  • italics is fastest of the fixed versions
  • I didn't bother to tag Span and Alt, as I don't think they are possible candidates, they are here for comparison.
  • "current" means String.map as it is now in the FSharp.Core library. It is repeated for each block. Candidate means they are simplest to adopt, Alt means also simple, but slightly slower on .NET Framework than shown here, I don't prefer them. Fixed is fixed, and Span is span.
BenchmarkDotNet=v0.12.1, OS=Windows 7 SP1 (6.1.7601.0)
Intel Xeon CPU X5680 3.33GHz, 2 CPU, 12 logical and 12 physical cores
Frequency=3247119 Hz, Resolution=307.9653 ns, Timer=TSC
.NET Core SDK=3.1.300-preview-015135
  [Host]        : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT DEBUG
  .NET Core 3.1 : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT

Job=.NET Core 3.1  Runtime=.NET Core 3.1  

0 and 1 chars

| Method | Categories | Value | Mean | Error | StdDev | Median | Ratio | Rank | Gen 0 | Allocated |
|------------------ |----------- |------ |-------------:|-----------:|-----------:|-------------:|------:|-----:|-------:|----------:|
| current | Candidate | 0 | 12.86ns | 0.101ns | 0.089ns | 12.89ns | 1.00 | 1 | 0.0038 | 24 B |
| fastMap | Candidate | 0 | 13.07ns | 0.078ns | 0.069ns | 13.06ns | 1.02 | 2 | 0.0038 | 24 B |
| fasterMap | Candidate | 0 | 12.75ns | 0.071ns | 0.067ns | 12.76ns | 0.99 | 1 | 0.0038 | 24 B |
| smallMap | Candidate | 0 | 16.57ns | 0.097ns | 0.091ns | 16.53ns | 1.29 | 3 | 0.0038 | 24 B |
|   | | | | | | | | | | |
| current | Alt | 0 | 12.76ns | 0.173ns | 0.154ns | 12.76ns | 1.00 | 1 | 0.0038 | 24 B |
| whileMap | Alt | 0 | 12.88ns | 0.073ns | 0.065ns | 12.89ns | 1.01 | 1 | 0.0038 | 24 B |
| allocateMap | Alt | 0 | 12.86ns | 0.161ns | 0.151ns | 12.89ns | 1.01 | 1 | 0.0038 | 24 B |
| unrollMapX2 | Alt | 0 | 15.88ns | 0.213ns | 0.199ns | 15.78ns | 1.25 | 3 | 0.0038 | 24 B |
| unrollMapX4 | Alt | 0 | 15.43ns | 0.082ns | 0.076ns | 15.43ns | 1.21 | 2 | 0.0038 | 24 B |
|   | | | | | | | | | | |
| current | Fixed | 0 | 12.59ns | 0.172ns | 0.161ns | 12.51ns | 1.00 | 1 | 0.0038 | 24 B |
| fixedCopy | Fixed | 0 | 28.35ns | 0.361ns | 0.338ns | 28.19ns | 2.25 | 2 | 0.0076 | 48 B |
| fixedCopyx2 | Fixed | 0 | 28.36ns | 0.212ns | 0.177ns | 28.42ns | 2.25 | 2 | 0.0076 | 48 B |
| fixedCopyx4 | Fixed | 0 | 28.51ns | 0.300ns | 0.281ns | 28.57ns | 2.26 | 2 | 0.0076 | 48 B |
|   | | | | | | | | | | |
| current | Span | 0 | 12.90ns | 0.126ns | 0.118ns | 12.87ns | 1.00 | 1 | 0.0038 | 24 B |
| spanCopy | Span | 0 | 28.60ns | 0.169ns | 0.141ns | 28.65ns | 2.22 | 2 | 0.0076 | 48 B |
| spanStrCreate | Span | 0 | 35.44ns | 0.395ns | 0.350ns | 35.37ns | 2.75 | 3 | 0.0178 | 112 B |
| spanStrCreateAlt | Span | 0 | 48.54ns | 0.936ns | 0.919ns | 48.79ns | 3.76 | 4 | 0.0229 | 144 B |
| spanStrCreateUnx4 | Span | 0 | 47.61ns | 0.984ns | 1.171ns | 47.65ns | 3.70 | 4 | 0.0229 | 144 B |
|   | | | | | | | | | | |
| current | Candidate | 1 | 69.09ns | 0.852ns | 0.756ns | 69.10ns | 1.00 | 4 | 0.0254 | 160 B |
| fastMap | Candidate | 1 | 52.83ns | 0.654ns | 0.612ns | 52.60ns | 0.76 | 3 | 0.0178 | 112 B |
| fasterMap | Candidate | 1 | 45.80ns | 0.482ns | 0.451ns | 45.85ns | 0.66 | 2 | 0.0127 | 80 B |
| smallMap | Candidate | 1 | 24.11ns | 0.419ns | 0.372ns | 24.11ns | 0.35 | 1 | 0.0076 | 48 B |
|   | | | | | | | | | | |
| current | Alt | 1 | 67.27ns | 0.505ns | 0.473ns | 67.23ns | 1.00 | 4 | 0.0254 | 160 B |
| whileMap | Alt | 1 | 46.82ns | 0.499ns | 0.467ns | 46.75ns | 0.70 | 3 | 0.0127 | 80 B |
| allocateMap | Alt | 1 | 43.11ns | 0.635ns | 0.563ns | 42.95ns | 0.64 | 2 | 0.0127 | 80 B |
| unrollMapX2 | Alt | 1 | 23.80ns | 0.247ns | 0.231ns | 23.87ns | 0.35 | 1 | 0.0076 | 48 B |
| unrollMapX4 | Alt | 1 | 23.59ns | 0.148ns | 0.138ns | 23.64ns | 0.35 | 1 | 0.0076 | 48 B |
|   | | | | | | | | | | |
| current | Fixed | 1 | 69.15ns | 1.132ns | 1.059ns | 69.30ns | 1.00 | 3 | 0.0254 | 160 B |
| fixedCopy | Fixed | 1 | 30.94ns | 0.520ns | 0.486ns | 31.12ns | 0.45 | 2 | 0.0076 | 48 B |
| fixedCopyx2 | Fixed | 1 | 30.36ns | 0.396ns | 0.371ns | 30.20ns | 0.44 | 1 | 0.0076 | 48 B |
| fixedCopyx4 | Fixed | 1 | 31.12ns | 0.271ns | 0.226ns | 31.07ns | 0.45 | 2 | 0.0076 | 48 B |
|   | | | | | | | | | | |
| current | Span | 1 | 68.61ns | 0.356ns | 0.333ns | 68.62ns | 1.00 | 5 | 0.0254 | 160 B |
| spanCopy | Span | 1 | 31.36ns | 0.529ns | 0.494ns | 31.22ns | 0.46 | 1 | 0.0076 | 48 B |
| spanStrCreate | Span | 1 | 51.92ns | 1.075ns | 1.056ns | 51.75ns | 0.76 | 2 | 0.0216 | 136 B |
| spanStrCreateAlt | Span | 1 | 62.80ns | 0.841ns | 0.787ns | 63.05ns | 0.92 | 3 | 0.0267 | 168 B |
| spanStrCreateUnx4 | Span | 1 | 65.55ns | 0.716ns | 0.670ns | 65.73ns | 0.96 | 4 | 0.0267 | 168 B |

5 and 10 chars

| Method | Categories | Value | Mean | Error | StdDev | Median | Ratio | Rank | Gen 0 | Allocated |
|------------------ |----------- |------ |-------------:|-----------:|-----------:|-------------:|------:|-----:|-------:|----------:|
| current | Candidate | 5 | 98.39ns | 0.599ns | 0.531ns | 98.15ns | 1.00 | 4 | 0.0280 | 176 B |
| fastMap | Candidate | 5 | 64.59ns | 1.170ns | 1.094ns | 64.16ns | 0.66 | 3 | 0.0216 | 136 B |
| fasterMap | Candidate | 5 | 54.99ns | 0.511ns | 0.478ns | 54.95ns | 0.56 | 1 | 0.0153 | 96 B |
| smallMap | Candidate | 5 | 61.75ns | 1.051ns | 0.983ns | 61.69ns | 0.63 | 2 | 0.0153 | 96 B |
|   | | | | | | | | | | |
| current | Alt | 5 | 104.15ns | 2.107ns | 2.342ns | 105.23ns | 1.00 | 4 | 0.0280 | 176 B |
| whileMap | Alt | 5 | 55.53ns | 0.439ns | 0.367ns | 55.42ns | 0.53 | 2 | 0.0153 | 96 B |
| allocateMap | Alt | 5 | 52.50ns | 0.750ns | 0.701ns | 52.49ns | 0.50 | 1 | 0.0153 | 96 B |
| unrollMapX2 | Alt | 5 | 55.48ns | 0.780ns | 0.729ns | 55.24ns | 0.53 | 2 | 0.0153 | 96 B |
| unrollMapX4 | Alt | 5 | 59.18ns | 0.919ns | 0.860ns | 59.39ns | 0.57 | 3 | 0.0153 | 96 B |
| | | | | | | | | | | |
| current | Fixed | 5 | 103.54ns | 1.900ns | 1.866ns | 103.56ns | 1.00 | 3 | 0.0280 | 176 B |
| fixedCopy | Fixed | 5 | 37.75ns | 0.698ns | 0.619ns | 37.71ns | 0.36 | 1 | 0.0089 | 56 B |
| fixedCopyx2 | Fixed | 5 | 37.71ns | 0.408ns | 0.382ns | 37.75ns | 0.36 | 1 | 0.0089 | 56 B |
| fixedCopyx4 | Fixed | 5 | 39.36ns | 0.488ns | 0.407ns | 39.38ns | 0.38 | 2 | 0.0089 | 56 B |
| | | | | | | | | | | |
| current | Span | 5 | 102.48ns | 2.076ns | 2.977ns | 101.57ns | 1.00 | 4 | 0.0280 | 176 B |
| spanCopy | Span | 5 | 40.11ns | 0.787ns | 0.966ns | 39.84ns | 0.39 | 1 | 0.0089 | 56 B |
| spanStrCreate | Span | 5 | 63.09ns | 1.292ns | 2.086ns | 62.85ns | 0.62 | 2 | 0.0229 | 144 B |
| spanStrCreateAlt | Span | 5 | 74.00ns | 1.426ns | 1.334ns | 73.76ns | 0.71 | 3 | 0.0280 | 176 B |
| spanStrCreateUnx4 | Span | 5 | 74.88ns | 0.924ns | 0.864ns | 74.88ns | 0.72 | 3 | 0.0280 | 176 B |
|   | | | | | | | | | | |
|  10 chars | | | | | | | | | | |
| current | Candidate | 10 | 137.76ns | 1.850ns | 1.640ns | 137.51ns | 1.00 | 4 | 0.0317 | 200 B |
| fastMap | Candidate | 10 | 81.41ns | 0.725ns | 0.678ns | 81.29ns | 0.59 | 3 | 0.0267 | 168 B |
| fasterMap | Candidate | 10 | 69.10ns | 0.469ns | 0.438ns | 69.18ns | 0.50 | 1 | 0.0191 | 120 B |
| smallMap | Candidate | 10 | 75.97ns | 0.810ns | 0.718ns | 76.21ns | 0.55 | 2 | 0.0191 | 120 B |
|   | | | | | | | | | | |
| current | Alt | 10 | 135.96ns | 1.211ns | 1.011ns | 135.73ns | 1.00 | 3 | 0.0317 | 200 B |
| whileMap | Alt | 10 | 70.25ns | 1.253ns | 1.110ns | 70.04ns | 0.52 | 2 | 0.0191 | 120 B |
| allocateMap | Alt | 10 | 67.91ns | 0.863ns | 0.721ns | 67.95ns | 0.50 | 1 | 0.0191 | 120 B |
| unrollMapX2 | Alt | 10 | 70.09ns | 0.635ns | 0.594ns | 69.92ns | 0.52 | 2 | 0.0191 | 120 B |
| unrollMapX4 | Alt | 10 | 68.89ns | 0.695ns | 0.650ns | 68.92ns | 0.51 | 1 | 0.0191 | 120 B |
|   | | | | | | | | | | |
| current | Fixed | 10 | 137.42ns | 2.783ns | 3.809ns | 135.49ns | 1.00 | 3 | 0.0317 | 200 B |
| fixedCopy | Fixed | 10 | 47.57ns | 0.471ns | 0.394ns | 47.66ns | 0.34 | 1 | 0.0114 | 72 B |
| fixedCopyx2 | Fixed | 10 | 48.22ns | 0.301ns | 0.282ns | 48.20ns | 0.35 | 1 | 0.0114 | 72 B |
| fixedCopyx4 | Fixed | 10 | 49.94ns | 0.730ns | 0.609ns | 49.83ns | 0.36 | 2 | 0.0114 | 72 B |
|   | | | | | | | | | | |
| current | Span | 10 | 136.76ns | 1.311ns | 1.226ns | 136.86ns | 1.00 | 5 | 0.0317 | 200 B |
| spanCopy | Span | 10 | 51.29ns | 0.400ns | 0.355ns | 51.43ns | 0.37 | 1 | 0.0114 | 72 B |
| spanStrCreate | Span | 10 | 75.43ns | 0.913ns | 0.763ns | 75.27ns | 0.55 | 2 | 0.0254 | 160 B |
| spanStrCreateAlt | Span | 10 | 87.43ns | 0.599ns | 0.531ns | 87.50ns | 0.64 | 4 | 0.0305 | 192 B |
| spanStrCreateUnx4 | Span | 10 | 85.80ns | 1.017ns | 0.901ns | 85.81ns | 0.63 | 3 | 0.0305 | 192 B |

31 and 100 chars

| Method | Categories | Value | Mean | Error | StdDev | Median | Ratio | Rank | Gen 0 | Allocated |
|------------------ |----------- |------ |-------------:|-----------:|-----------:|-------------:|------:|-----:|-------:|----------:|
| current | Candidate | 31 | 296.99ns | 5.675ns | 12.217ns | 294.12ns | 1.00 | 4 | 0.0443 | 280 B |
| fastMap | Candidate | 31 | 139.43ns | 1.113ns | 1.041ns | 139.34ns | 0.46 | 3 | 0.0458 | 288 B |
| fasterMap | Candidate | 31 | 122.91ns | 0.920ns | 0.816ns | 123.21ns | 0.41 | 1 | 0.0317 | 200 B |
| smallMap | Candidate | 31 | 135.52ns | 1.314ns | 1.165ns | 135.52ns | 0.45 | 2 | 0.0317 | 200 B |
|   | | | | | | | | | | |
| current | Alt | 31 | 299.02ns | 5.659ns | 5.558ns | 297.41ns | 1.00 | 3 | 0.0443 | 280 B |
| whileMap | Alt | 31 | 120.16ns | 2.376ns | 3.005ns | 117.99ns | 0.41 | 1 | 0.0317 | 200 B |
| allocateMap | Alt | 31 | 129.14ns | 0.867ns | 0.811ns | 129.41ns | 0.43 | 2 | 0.0317 | 200 B |
| unrollMapX2 | Alt | 31 | 118.54ns | 1.121ns | 0.994ns | 118.43ns | 0.40 | 1 | 0.0317 | 200 B |
| unrollMapX4 | Alt | 31 | 119.67ns | 1.613ns | 1.430ns | 119.46ns | 0.40 | 1 | 0.0317 | 200 B |
|   | | | | | | | | | | |
| current | Fixed | 31 | 292.63ns | 3.647ns | 3.412ns | 291.12ns | 1.00 | 4 | 0.0443 | 280 B |
| fixedCopy | Fixed | 31 | 90.22ns | 0.522ns | 0.488ns | 90.32ns | 0.31 | 1 | 0.0178 | 112 B |
| fixedCopyx2 | Fixed | 31 | 92.90ns | 0.598ns | 0.559ns | 92.94ns | 0.32 | 2 | 0.0178 | 112 B |
| fixedCopyx4 | Fixed | 31 | 100.58ns | 1.755ns | 1.642ns | 100.22ns | 0.34 | 3 | 0.0178 | 112 B |
|   | | | | | | | | | | |
| current | Span | 31 | 295.64ns | 1.518ns | 1.267ns | 295.66ns | 1.00 | 5 | 0.0443 | 280 B |
| spanCopy | Span | 31 | 101.58ns | 1.860ns | 1.826ns | 101.79ns | 0.34 | 1 | 0.0178 | 112 B |
| spanStrCreate | Span | 31 | 132.14ns | 2.560ns | 2.395ns | 132.83ns | 0.45 | 2 | 0.0317 | 200 B |
| spanStrCreateAlt | Span | 31 | 146.46ns | 2.965ns | 3.530ns | 147.03ns | 0.49 | 4 | 0.0370 | 232 B |
| spanStrCreateUnx4 | Span | 31 | 140.53ns | 2.264ns | 2.007ns | 141.06ns | 0.47 | 3 | 0.0370 | 232 B |
|   | | | | | | | | | | |
|  100 chars | | | | | | | | | | |
| current | Candidate | 100 | 796.34ns | 15.934ns | 14.125ns | 788.09ns | 1.00 | 3 | 0.0877 | 552 B |
| fastMap | Candidate | 100 | 347.12ns | 5.982ns | 7.986ns | 345.38ns | 0.44 | 2 | 0.1106 | 696 B |
| fasterMap | Candidate | 100 | 311.80ns | 1.806ns | 1.601ns | 312.26ns | 0.39 | 1 | 0.0749 | 472 B |
| smallMap | Candidate | 100 | 344.51ns | 2.428ns | 2.027ns | 345.05ns | 0.43 | 2 | 0.0749 | 472 B |
|   | | | | | | | | | | |
| current | Alt | 100 | 802.38ns | 14.461ns | 13.527ns | 803.31ns | 1.00 | 5 | 0.0877 | 552 B |
| whileMap | Alt | 100 | 310.82ns | 3.155ns | 2.951ns | 310.92ns | 0.39 | 3 | 0.0749 | 472 B |
| allocateMap | Alt | 100 | 334.36ns | 1.520ns | 1.421ns | 334.56ns | 0.42 | 4 | 0.0749 | 472 B |
| unrollMapX2 | Alt | 100 | 283.68ns | 1.994ns | 1.865ns | 284.27ns | 0.35 | 2 | 0.0749 | 472 B |
| unrollMapX4 | Alt | 100 | 276.16ns | 3.283ns | 2.910ns | 276.29ns | 0.34 | 1 | 0.0749 | 472 B |
|   | | | | | | | | | | |
| current | Fixed | 100 | 804.66ns | 5.603ns | 4.679ns | 805.33ns | 1.00 | 4 | 0.0877 | 552 B |
| fixedCopy | Fixed | 100 | 226.07ns | 1.051ns | 0.877ns | 226.04ns | 0.28 | 1 | 0.0393 | 248 B |
| fixedCopyx2 | Fixed | 100 | 234.08ns | 2.609ns | 2.313ns | 234.07ns | 0.29 | 2 | 0.0391 | 248 B |
| fixedCopyx4 | Fixed | 100 | 245.54ns | 2.454ns | 2.175ns | 245.17ns | 0.31 | 3 | 0.0391 | 248 B |
|   | | | | | | | | | | |
| current | Span | 100 | 804.95ns | 8.385ns | 7.844ns | 805.59ns | 1.00 | 5 | 0.0877 | 552 B |
| spanCopy | Span | 100 | 257.32ns | 1.536ns | 1.283ns | 257.22ns | 0.32 | 1 | 0.0391 | 248 B |
| spanStrCreate | Span | 100 | 315.95ns | 1.891ns | 1.677ns | 315.88ns | 0.39 | 3 | 0.0534 | 336 B |
| spanStrCreateAlt | Span | 100 | 344.50ns | 6.737ns | 6.302ns | 342.70ns | 0.43 | 4 | 0.0587 | 368 B |
| spanStrCreateUnx4 | Span | 100 | 294.39ns | 2.441ns | 2.283ns | 293.91ns | 0.37 | 2 | 0.0587 | 368 B |

256 and more chars

| Method | Categories | Value | Mean | Error | StdDev | Median | Ratio | Rank | Gen 0 | Gen 1 | Allocated |
| -------------------- | ------------- | -------- | --------------: | ------------: | ------------: | --------------: | -------: | ----: | ---------: | ---------: | ----------: |
| current | Candidate | 256 | 1,952.41ns | 11.159ns | 9.892ns | 1,951.66ns | 1.00 | 3 | 0.1869 | - | 1176 B |
| fastMap | Candidate | 256 | 800.65ns | 7.703ns | 6.829ns | 802.18ns | 0.41 | 2 | 0.2594 | - | 1632 B |
| fasterMap | Candidate | 256 | 739.04ns | 9.457ns | 8.847ns | 735.24ns | 0.38 | 1 | 0.1745 | - | 1096 B |
| smallMap | Candidate | 256 | 811.24ns | 3.172ns | 2.967ns | 811.42ns | 0.42 | 2 | 0.1745 | - | 1096 B |
|   | | | | | | | | | | | |
| current | Alt | 256 | 1,950.68ns | 16.943ns | 15.849ns | 1,948.46ns | 1.00 | 5 | 0.1869 | - | 1176 B |
| whileMap | Alt | 256 | 730.78ns | 7.339ns | 6.506ns | 731.13ns | 0.37 | 3 | 0.1745 | - | 1096 B |
| allocateMap | Alt | 256 | 791.70ns | 7.619ns | 5.949ns | 790.93ns | 0.41 | 4 | 0.1745 | - | 1096 B |
| unrollMapX2 | Alt | 256 | 663.65ns | 3.941ns | 3.687ns | 663.96ns | 0.34 | 2 | 0.1745 | - | 1096 B |
| unrollMapX4 | Alt | 256 | 641.95ns | 6.201ns | 5.497ns | 640.22ns | 0.33 | 1 | 0.1745 | - | 1096 B |
|   | | | | | | | | | | | |
| current | Fixed | 256 | 1,975.77ns | 24.444ns | 22.865ns | 1,977.10ns | 1.00 | 4 | 0.1869 | - | 1176 B |
| fixedCopy | Fixed | 256 | 551.65ns | 7.620ns | 7.127ns | 549.79ns | 0.28 | 1 | 0.0887 | - | 560 B |
| fixedCopyx2 | Fixed | 256 | 569.96ns | 5.247ns | 4.908ns | 568.99ns | 0.29 | 2 | 0.0887 | - | 560 B |
| fixedCopyx4 | Fixed | 256 | 589.82ns | 4.786ns | 3.997ns | 590.19ns | 0.30 | 3 | 0.0887 | - | 560 B |
|   | | | | | | | | | | | |
| current | Span | 256 | 1,969.66ns | 19.996ns | 18.705ns | 1,970.26ns | 1.00 | 5 | 0.1869 | - | 1176 B |
| spanCopy | Span | 256 | 632.78ns | 6.331ns | 5.922ns | 634.22ns | 0.32 | 1 | 0.0887 | - | 560 B |
| spanStrCreate | Span | 256 | 732.16ns | 5.446ns | 5.094ns | 733.44ns | 0.37 | 3 | 0.1030 | - | 648 B |
| spanStrCreateAlt | Span | 256 | 772.54ns | 5.266ns | 4.668ns | 773.87ns | 0.39 | 4 | 0.1078 | - | 680 B |
| spanStrCreateUnx4 | Span | 256 | 665.30ns | 10.455ns | 9.779ns | 665.44ns | 0.34 | 2 | 0.1078 | - | 680 B |
|   | | | | | | | | | | | |
|  1024 chars | | | | | | | | | | | |
| current | Candidate | 1024 | 7,629.39ns | 60.810ns | 53.907ns | 7,625.62ns | 1.00 | 3 | 0.6714 | - | 4248 B |
| fastMap | Candidate | 1024 | 3,082.48ns | 26.745ns | 25.017ns | 3,070.76ns | 0.40 | 2 | 0.9918 | 0.0038 | 6240 B |
| fasterMap | Candidate | 1024 | 2,815.90ns | 11.005ns | 8.592ns | 2,815.05ns | 0.37 | 1 | 0.6638 | - | 4168 B |
| smallMap | Candidate | 1024 | 2,804.66ns | 35.226ns | 31.227ns | 2,801.53ns | 0.37 | 1 | 0.6638 | - | 4168 B |
|   | | | | | | | | | | | |
| current | Alt | 1024 | 7,653.67ns | 66.666ns | 62.360ns | 7,651.03ns | 1.00 | 5 | 0.6714 | - | 4248 B |
| whileMap | Alt | 1024 | 2,820.49ns | 16.317ns | 15.263ns | 2,820.79ns | 0.37 | 3 | 0.6638 | - | 4168 B |
| allocateMap | Alt | 1024 | 3,046.43ns | 17.272ns | 14.423ns | 3,047.76ns | 0.40 | 4 | 0.6638 | - | 4168 B |
| unrollMapX2 | Alt | 1024 | 2,690.30ns | 26.892ns | 22.456ns | 2,687.61ns | 0.35 | 2 | 0.6638 | - | 4168 B |
| unrollMapX4 | Alt | 1024 | 2,530.67ns | 14.308ns | 13.383ns | 2,531.89ns | 0.33 | 1 | 0.6638 | - | 4168 B |
|   | | | | | | | | | | | |
| current | Fixed | 1024 | 7,647.49ns | 51.204ns | 47.897ns | 7,635.90ns | 1.00 | 4 | 0.6714 | - | 4248 B |
| fixedCopy | Fixed | 1024 | 2,121.33ns | 29.708ns | 27.789ns | 2,118.47ns | 0.28 | 1 | 0.3319 | - | 2096 B |
| fixedCopyx2 | Fixed | 1024 | 2,164.05ns | 11.677ns | 9.117ns | 2,164.54ns | 0.28 | 2 | 0.3319 | - | 2096 B |
| fixedCopyx4 | Fixed | 1024 | 2,263.88ns | 23.129ns | 20.504ns | 2,255.20ns | 0.30 | 3 | 0.3319 | - | 2096 B |
|   | | | | | | | | | | | |
| current | Span | 1024 | 7,669.55ns | 67.241ns | 62.898ns | 7,679.77ns | 1.00 | 4 | 0.6714 | - | 4248 B |
| spanCopy | Span | 1024 | 2,397.11ns | 16.097ns | 14.269ns | 2,398.97ns | 0.31 | 1 | 0.3319 | - | 2096 B |
| spanStrCreate | Span | 1024 | 2,814.25ns | 43.848ns | 41.016ns | 2,815.72ns | 0.37 | 3 | 0.3471 | - | 2184 B |
| spanStrCreateAlt | Span | 1024 | 2,865.30ns | 22.238ns | 20.801ns | 2,866.39ns | 0.37 | 3 | 0.3510 | - | 2216 B |
| spanStrCreateUnx4 | Span | 1024 | 2,448.91ns | 35.363ns | 31.348ns | 2,452.51ns | 0.32 | 2 | 0.3510 | - | 2216 B |
|   | | | | | | | | | | | |
|  8192 chars | | | | | | | | | | | |
| current | Candidate | 8192 | 61,304.72ns | 350.006ns | 327.396ns | 61,285.96ns | 1.00 | 3 | 5.1270 | 0.2441 | 32920 B |
| fastMap | Candidate | 8192 | 25,785.01ns | 305.116ns | 285.405ns | 25,831.80ns | 0.42 | 2 | 7.8125 | 0.3052 | 49248 B |
| fasterMap | Candidate | 8192 | 23,305.23ns | 121.598ns | 113.743ns | 23,323.38ns | 0.38 | 1 | 5.2185 | 0.1526 | 32840 B |
| smallMap | Candidate | 8192 | 23,291.14ns | 125.680ns | 117.561ns | 23,337.55ns | 0.38 | 1 | 5.2185 | 0.1526 | 32840 B |
|   | | | | | | | | | | | |
| current | Alt | 8192 | 61,967.77ns | 536.539ns | 501.879ns | 61,816.82ns | 1.00 | 5 | 5.1270 | 0.2441 | 32920 B |
| whileMap | Alt | 8192 | 23,592.00ns | 328.376ns | 307.163ns | 23,607.09ns | 0.38 | 3 | 5.2185 | 0.1526 | 32840 B |
| allocateMap | Alt | 8192 | 24,950.07ns | 201.715ns | 178.815ns | 24,959.22ns | 0.40 | 4 | 5.2185 | 0.1526 | 32840 B |
| unrollMapX2 | Alt | 8192 | 22,003.28ns | 116.477ns | 97.264ns | 22,005.69ns | 0.36 | 2 | 5.2185 | 0.1526 | 32840 B |
| unrollMapX4 | Alt | 8192 | 21,074.46ns | 333.249ns | 295.416ns | 21,083.36ns | 0.34 | 1 | 5.2185 | 0.1526 | 32840 B |
|   | | | | | | | | | | | |
| current | Fixed | 8192 | 61,401.46ns | 508.529ns | 450.797ns | 61,337.03ns | 1.00 | 4 | 5.1270 | 0.2441 | 32920 B |
| fixedCopy | Fixed | 8192 | 16,758.64ns | 118.942ns | 111.258ns | 16,766.96ns | 0.27 | 1 | 2.5940 | - | 16432 B |
| fixedCopyx2 | Fixed | 8192 | 17,569.22ns | 151.468ns | 141.683ns | 17,581.75ns | 0.29 | 2 | 2.5940 | - | 16432 B |
| fixedCopyx4 | Fixed | 8192 | 18,269.65ns | 119.090ns | 105.570ns | 18,253.54ns | 0.30 | 3 | 2.5940 | - | 16432 B |
|   | | | | | | | | | | | |
| current | Span | 8192 | 62,410.59ns | 785.822ns | 735.058ns | 62,375.61ns | 1.00 | 4 | 5.1270 | 0.2441 | 32920 B |
| spanCopy | Span | 8192 | 19,228.95ns | 113.409ns | 94.702ns | 19,240.35ns | 0.31 | 1 | 2.5940 | - | 16432 B |
| spanStrCreate | Span | 8192 | 22,603.78ns | 168.023ns | 157.169ns | 22,597.73ns | 0.36 | 3 | 2.6245 | - | 16520 B |
| spanStrCreateAlt | Span | 8192 | 21,961.11ns | 321.794ns | 301.006ns | 21,942.30ns | 0.35 | 2 | 2.6245 | - | 16552 B |
| spanStrCreateUnx4 | Span | 8192 | 18,994.96ns | 241.445ns | 214.034ns | 18,928.00ns | 0.30 | 1 | 2.6245 | - | 16552 B |

The code:

```f#
module StringMap =

/// Basic String.map replacement, Array.map over CharArray
/// dotnet mem: 6x(!) string size
let fastMap mapper str =
    if String.IsNullOrEmpty str then String.Empty
    else
        str.ToCharArray()
        |> Array.map mapper
        |> String

/// A faster String.map replacement, inlining the loop instead of Array.map. Uses CharArray.
/// NOTES: used as baseline for StringMapUnsafe
/// dotnet mem: 4x string size
let fasterMap mapper str =
    if String.IsNullOrEmpty str then String.Empty
    else
        let result = str.ToCharArray()
        let mutable i = 0
        for c in result do
            result.[i] <- mapper c
            i <- i + 1

        String result

/// Instead of for..in uses while. Otherwise same as fasterMap
/// dotnet mem: 4x string size
let whileMap mapper str =
    if String.IsNullOrEmpty str then String.Empty
    else
        let result = str.ToCharArray()
        let mutable i = 0
        while i < result.Length do
            result.[i] <- mapper result.[i]
            i <- i + 1

        String result

/// Like fasterMap, but uses Array.zeroCreate with string-length
/// dotnet mem: 4x string data
let allocateMap mapper str =
    if String.IsNullOrEmpty str then String.Empty
    else
        let result = Array.zeroCreate str.Length
        let mutable i = 0
        for c in str do
            result.[i] <- mapper c
            i <- i + 1

        String result

/// String.map equiv. optimized for very small maps. The break-even point is around 4 chars.
let smallMap (mapper: char -> char) (str: string) =
    let len = String.length str
    if len > 4 then
        let result = str.ToCharArray()
        let mutable i = 0
        for c in result do
            result.[i] <- mapper c
            i <- i + 1

        String result

    elif len > 2 then
        if len = 3 
        then String [|mapper str.[0]; mapper str.[1]; mapper str.[2]|]
        else String [|mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3]|]

    elif len = 2 then
        String [|mapper str.[0]; mapper str.[1]|]

    elif len = 1 then
        (mapper str.[0]).ToString()

    else
        String.Empty


/// String.map equiv. that uses safe unrolling. This is only 2-5% faster than String.fasterMap above. It suffers from non-elimination of range-checks.
let unrollMapx2 (mapper: char -> char) (str: string) =
    let len = String.length str

    if len > 4 then
        let result = str.ToCharArray()
        let mutable i = 0
        while i < len - len % 2 do
            result.[i] <- mapper result.[i]
            result.[i + 1] <- mapper result.[i + 1]
            i <- i + 2

        // the remainder
        if len % 2 = 1 then
            result.[i] <- mapper result.[i]
            //i <- i + 1

        String result

    elif len > 2 then
        if len = 3 
        then String [|mapper str.[0]; mapper str.[1]; mapper str.[2]|]
        else String [|mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3]|]

    elif len = 2 then
        String [|mapper str.[0]; mapper str.[1]|]

    elif len = 1 then
        (mapper str.[0]).ToString()

    else
        String.Empty


/// String.map equiv. that uses safe unrolling. This is only 2-5% faster than String.fasterMap above. 
/// It suffers from non-elimination of range-checks.
let unrollMapx4 (mapper: char -> char) (str: string) =
    let len = String.length str

    if len > 4 then
        let result = str.ToCharArray()
        let mutable i = 0
        while i < len - len % 4 do
            result.[i] <- mapper result.[i]
            result.[i + 1] <- mapper result.[i + 1]
            result.[i + 2] <- mapper result.[i + 2]
            result.[i + 3] <- mapper result.[i + 3]
            i <- i + 4

        // the remainder
        while i < len do
            result.[i] <- mapper result.[i]
            i <- i + 1

        String result

    elif len > 2 then
        if len = 3 
        then String [|mapper str.[0]; mapper str.[1]; mapper str.[2]|]
        else String [|mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3]|]

    elif len = 2 then
        String [|mapper str.[0]; mapper str.[1]|]

    elif len = 1 then
        (mapper str.[0]).ToString()

    else
        String.Empty

module StringMapUnsafe =
/// Unrollx2. Read from source as str.[idc], use NativePtr for result, which is created with String.Copy + fixed, return modified string
let fixedCopyx2 (mapper: char -> char) (str: string) =

    let len = String.length str
    let result = String.Copy str
    use shadow = fixed result
    let mutable i = 0
    while i < len - len % 2 do
        NativePtr.set shadow i (mapper str.[i])
        NativePtr.set shadow (i + 1) (mapper str.[i + 1])
        i <- i + 2

    while i < len do
        NativePtr.set shadow i (mapper str.[i])
        i <- i + 1

    result

/// Unrollx4. Read from source as str.[idc], use NativePtr for result, which is created with String.Copy + fixed, return modified string
let fixedCopyx4 (mapper: char -> char) (str: string) =
    let len = String.length str

    let result = String.Copy str
    use shadow = fixed result
    let mutable i = 0
    while i < len - len % 4 do
        NativePtr.set shadow i (mapper str.[i])
        NativePtr.set shadow (i + 1) (mapper str.[i + 1])
        NativePtr.set shadow (i + 1) (mapper str.[i + 2])
        NativePtr.set shadow (i + 1) (mapper str.[i + 3])
        i <- i + 4

    while i < len do
        NativePtr.set shadow i (mapper str.[i])
        i <- i + 1

    result

/// Unrollx4. Create copy with String.Copy + fixed and use NativePtr to read and write from it, return modified string.
let fixedCopy (mapper: char -> char) (str: string) =
    let len = String.length str

    /// a copy to use as result (we don't want to override the input reference string!)
    let result = String.Copy str

    /// pointer to the result-string
    use shadow = fixed result

    // loop-unroll
    let mutable i = 0
    while i < len - len % 4 do
        // NOTES:
        // - read/write to same mem address appears to be somewhat faster than reading from string by index, like with str.[i]
        // - while less IL instr are generated with calculating the address ourselves and then NativePtr.read/write, this appears to be faster (and safer).
        NativePtr.set shadow i (mapper (NativePtr.get shadow i))
        NativePtr.set shadow (i + 1) (mapper (NativePtr.get shadow (i + 1)))
        NativePtr.set shadow (i + 2) (mapper (NativePtr.get shadow (i + 2)))
        NativePtr.set shadow (i + 3) (mapper (NativePtr.get shadow (i + 3)))
        i <- i + 4

    // process the remainder
    while i < len do
        NativePtr.set shadow i (mapper (NativePtr.get shadow i))
        i <- i + 1

    result

/// Unrollx4. Create copy with String.Copy, ReadOnlySpan over it, marshal-as read/write Span<char> over same copy. Return modified underlying string.
/// Performance: ~20-30% faster than fasterMap.
let spanCopy (mapper: char -> char) (str: string) =
    let len = String.length str

    let resultStr = String.Copy str
    let roSpan = str.AsSpan()
    let writeSpan = MemoryMarshal.CreateSpan(&MemoryMarshal.GetReference roSpan, len)

    // loop-unroll
    let mutable i = 0
    while i < len - len % 4 do
        writeSpan.[i] <- mapper roSpan.[i]
        writeSpan.[i + 1] <- mapper roSpan.[i + 1]
        writeSpan.[i + 2] <- mapper roSpan.[i + 2]
        writeSpan.[i + 3] <- mapper roSpan.[i + 3]
        i <- i + 4

    // process the remainder
    while i < len do
        writeSpan.[i] <- mapper roSpan.[i]
        i <- i + 1

    resultStr

/// No unroll, uses String.Create with Span.
/// Performance: 0-20% faster than fasterMap, but fast only on larger (64kb) input
let spanCreate1 (mapper: char -> char) (str: string) =

    let len = String.length str
    String.Create(len, str, fun buf v -> 
        let mutable i = 0
        for c in v do
            buf.[i] <- mapper c
            i <- i + 1)

/// No unroll, uses String.Create with Span, no closure
/// Performance: 0-22% faster than fasterMap, the larger the intput, the bigger the gain (sometimes slower!)
let spanCreate2 (mapper: char -> char) (str: string) =

    let len = String.length str
    String.Create(len, (str, mapper), fun buf (str, f) -> 
        let mutable i = 0
        for c in str do
            buf.[i] <- f c
            i <- i + 1)

/// Unrollx4, uses String.Create with Span. All indexed operations on spans.
/// Performance: 15-32% (!) faster than fasterMap, the larger the input, the bigger the gain
let spanCreateUnrollx4 (mapper: char -> char) (str: string) =

    let len = String.length str
    String.Create(len, (str, mapper), fun target (str, f) -> 
        let mutable i = 0
        let source = str.AsSpan()
        while i < len - len % 4 do
            target.[i] <- f source.[i]
            target.[i + 1] <- f source.[i + 1]
            target.[i + 2] <- f source.[i + 2]
            target.[i + 3] <- f source.[i + 3]
            i <- i + 4

        // process the remainder
        while i < len do
            target.[i] <- f source.[i]
            i <- i + 1)

```

@dsyme or @cartermp, can I get a confirmation that using fixed and pointer arithmetic is off limits (or not)? If it's allowed, I gladly use it, as it gives dramatic reduction in memory usage and increases performance, but obviously, it requires more careful coding and it will automatically mark certain functions as unsafe, which I think is a "no go". Also, it may be considered fine on arrays, but not on strings, I'd totally understand that.

There are already dramatic perf improvements using other methods, but fixed trumps all. And before I get to the array functions, I like to limit my scope (consider the above an exercise in what's possible).

I know Span cannot be used, otherwise that would be a good alternative and gets close to fixed in terms of memory and speed.

Yes, spans are off-limits (unless we decide to move FSharp.Core to also bring in a dependency on System.Memory). I'd actually love to do that but I know that @dsyme is quite allergic to them.

I think we should strive for consistency in implementation here. Array and list internals tend not to use pointers to accomplish things. I'd like for us to eek out as much perf and memory as possible if we can, but I think that ultimately the best way forward is with Span, when it's appropriate to add it to FSharp.Core. It strikes the right balance of perf, readability, and safety. So the second-best implementation is probably sufficient, and a clear upgrade anyways.

unless we decide to move FSharp.Core to also bring in a dependency on System.Memory

If, for ImmutableArray, we decide to add that dependency, then I believe that includes System.Memory:

image

(which in turn gives us also System.Buffers and System.Numerics.Vectors, each of which open new ways of potentially improving certain key performance areas)

Array and list internals tend not to use pointers to accomplish things.
<snip/>but I think that ultimately the best way forward is with Span, when it's appropriate to add it to FSharp.Core. It strikes the right balance of perf, readability, and safety.

Agreed. Yeah, Span would give us wings in some areas :).

Ok, I'll move forward without the "extreme" optimizations than.

Next up: String.concat. This one does nothing more than calling into String.Join, but since it takes a sequence for the concatenation and doesn't special-case the empty separator string, I figured, let's see what happens if I special case for array and list (if anyone know of others that should be special-cased, please let me know. For instance, I wonder if there's a safe way for library functions to get the underlying array or list if the sequence is created with Set.ofArray or Seq.ofList, so that people won't get punished for using that instead of casting().

TLDR conclusions:

  • Code changes for this one are simple and straightforward
  • Speedup of up to ~50% gain for array and up to 20% gain for lists. Seq stays the same.
  • Mem decrease: almost 60% less mem use for arrays, and ~40% for lists
  • GC pressure drops up to 50% except for very small lists and arrays

I think these numbers can be expected considering that String.Join uses an optimized and mutable string under the hood with pre-calculated size for Concat and when an array is passed. String.Join with IEnumerable uses a more resource-hungry StringBuilder.

Below: bold is fastest. I didn't select a champion if it was within margins of error.

BenchmarkDotNet=v0.12.1, OS=Windows 7 SP1 (6.1.7601.0)
Intel Xeon CPU X5680 3.33GHz, 2 CPU, 12 logical and 12 physical cores
Frequency=3247119 Hz, Resolution=307.9653 ns, Timer=TSC
.NET Core SDK=3.1.300-preview-015135
  [Host]     : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT DEBUG
  Job-FRSYQM : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT

MaxRelativeError=0.02  Runtime=.NET Core 3.1  MaxIterationCount=10  
MaxWarmupIterationCount=5  MinIterationCount=5  MinWarmupIterationCount=2  

| Method | Sep | CollSize | CollType | Mean | Error | StdDev | Ratio | Rank | Gen 0 | Gen 1 | Allocated |
| ------------- | ---------------- | ---------- | --------- | -------------: | -----------: | -----------: | -------: | ----: | ---------: | ---------: | ----------: |
| current | | Small | Array | 484.4ns | 7.61ns | 1.98ns | 1.00 | 2 | 0.0467 | - | 296 B |
| concatOpt | | Small | Array | 256.3ns | 4.80ns | 1.25ns | 0.53 | 1 | 0.0420 | - | 264 B |
|   | | | | | | | | | | | |
| current | | Small | List | 554.0ns | 8.99ns | 3.21ns | 1.00 | 2 | 0.0477 | - | 304 B |
| concatOpt | | Small | List | 391.7ns | 7.23ns | 2.58ns | 0.71 | 1 | 0.0710 | - | 448 B |
|   | | | | | | | | | | | |
| current | | Small | Seq | 541.7ns | 10.36ns | 7.49ns | 1.00 | 1 | 0.0467 | - | 296 B |
| concatOpt | | Small | Seq | 556.4ns | 9.02ns | 2.34ns | 1.02 | 2 | 0.0467 | - | 296 B |
|   | | | | | | | | | | | |
| current | | Medium | Array | 10,715.8ns | 148.95ns | 38.68ns | 1.00 | 2 | 2.1667 | 0.0916 | 13656 B |
| concatOpt | | Medium | Array | 4,674.7ns | 92.60ns | 72.29ns | 0.43 | 1 | 0.7553 | 0.0076 | 4784 B |
|   | | | | | | | | | | | |
| current | | Medium | List | 12,422.7ns | 188.90ns | 29.23ns | 1.00 | 2 | 2.1667 | 0.0916 | 13664 B |
| concatOpt | | Medium | List | 7,951.7ns | 152.82ns | 175.98ns | 0.64 | 1 | 1.2665 | 0.0153 | 8008 B |
|   | | | | | | | | | | | |
| current | | Medium | Seq | 12,905.6ns | 435.52ns | 651.86ns | 1.00 | 2 | 2.1667 | 0.0916 | 13656 B |
| concatOpt | | Medium | Seq | 11,461.7ns | 212.56ns | 94.38ns | 0.88 | 1 | 2.1667 | 0.0916 | 13656 B |
|   | | | | | | | | | | | |
| current | &\|& | Small | Array | 594.5ns | 11.54ns | 14.17ns | 1.00 | 2 | 0.0648 | - | 408 B |
| concatOpt | &\|& | Small | Array | 402.5ns | 7.20ns | 2.57ns | 0.69 | 1 | 0.0596 | - | 376 B |
|   | | | | | | | | | | | |
| current | &\|& | Small | List | 654.4ns | 12.02ns | 8.69ns | 1.00 | 2 | 0.0658 | - | 416 B |
| concatOpt | &\|& | Small | List | 524.0ns | 9.51ns | 4.22ns | 0.80 | 1 | 0.0887 | - | 560 B |
|   | | | | | | | | | | | |
| current | &\|& | Small | Seq | 624.0ns | 12.35ns | 12.13ns | 1.00 | 1 | 0.0648 | - | 408 B |
| concatOpt | &\|& | Small | Seq | 626.8ns | 11.61ns | 9.70ns | 1.01 | 1 | 0.0648 | - | 408 B |
|   | | | | | | | | | | | |
| current | &\|& | Medium | Array | 13,245.0ns | 255.18ns | 133.47ns | 1.00 | 2 | 2.5482 | 0.1373 | 16048 B |
| concatOpt | &\|& | Medium | Array | 7,656.2ns | 127.40ns | 66.63ns | 0.58 | 1 | 1.1368 | 0.0229 | 7176 B |
|   | | | | | | | | | | | |
| current | &\|& | Medium | List | 14,149.6ns | 220.37ns | 78.59ns | 1.00 | 2 | 2.5482 | 0.1373 | 16056 B |
| concatOpt | &\|& | Medium | List | 10,824.5ns | 183.83ns | 81.62ns | 0.77 | 1 | 1.6479 | 0.0458 | 10400 B |
|   | | | | | | | | | | | |
| current | &\|& | Medium | Seq | 13,809.2ns | 258.95ns | 154.10ns | 1.00 | 1 | 2.5482 | 0.1373 | 16048 B |
| concatOpt | &\|& | Medium | Seq | 15,213.7ns | 565.78ns | 846.83ns | 1.07 | 2 | 2.5330 | 0.1221 | 16048 B |

Big inputs, 819200 items, 4874240 chars total, ~9.3MB

| Method | Sep | CollType | Mean | Error | StdDev | Ratio | Rank | Gen 0 | Gen 1 | Gen 2 | Allocated |
| ------------- | ---------------- | --------- | ----------: | ----------: | ----------: | -------: | ----: | --------: | -------: | -------: | -----------: |
| current | | Array | 27.00ms | 0.351ms | 0.091ms | 1.00 | 2 | 1656.2500 | 718.7500 | 125.0000 | 18.65 MB |
| concatOpt | | Array | 10.27ms | 0.155ms | 0.069ms | 0.38 | 1 | - | - | - | 9.3 MB |
|   | | | | | | | | | | | |
| current | | List | 32.12ms | 0.201ms | 0.052ms | 1.00 | 2 | 1656.2500 | 718.7500 | 125.0000 | 18.65 MB |
| concatOpt | | List | 22.57ms | 0.682ms | 0.357ms | 0.71 | 1 | - | - | - | 15.55 MB |
|   | | | | | | | | | | | |
| current | | Seq | 29.01ms | 0.140ms | 0.036ms | 1.00 | 1 | 1656.2500 | 718.7500 | 125.0000 | 18.65 MB |
| concatOpt | | Seq | 30.62ms | 0.600ms | 0.314ms | 1.06 | 2 | 62.5000 | 62.5000 | 62.5000 | 31.55 MB |
|   | | | | | | | | | | | |
| current | &\|& | Array | 32.07ms | 0.707ms | 0.109ms | 1.00 | 2 | 2250.0000 | 875.0000 | 187.5000 | 24.92 MB |
| concatOpt | &\|& | Array | 16.51ms | 0.259ms | 0.115ms | 0.51 | 1 | - | - | - | 12.42 MB |
|   | | | | | | | | | | | |
| current | &\|& | List | 36.85ms | 0.505ms | 0.131ms | 1.00 | 2 | 2285.7143 | 928.5714 | 214.2857 | 24.92 MB |
| concatOpt | &\|& | List | 28.60ms | 0.564ms | 0.250ms | 0.78 | 1 | - | - | - | 18.67 MB |
|   | | | | | | | | | | | |
| current | &\|& | Seq | 33.47ms | 0.580ms | 0.151ms | 1.00 | 1 | 2266.6667 | 933.3333 | 200.0000 | 24.92 MB |
| concatOpt | &\|& | Seq | 33.34ms | 0.453ms | 0.118ms | 1.00 | 1 | 2266.6667 | 933.3333 | 200.0000 | 24.92 MB |


The source code:

```f#
let private concatArray sep (strings: string []) =
match String.length sep with
| 0 -> String.Concat strings
| 1 -> String.Join(sep.[0], strings, 0, strings.Length)
| _ -> String.Join(sep, strings, 0, strings.Length)

let concatOpt sep (strings: seq) =
match strings with
| :? array as arr -> concatArray sep arr
| :? list as lst -> lst |> List.toArray |> concatArray sep
| _ ->
// special-casing Seq for sep = String.Empty is not faster due to the involved
// overhead of Seq.toArray and the already well-optimized String.Join
String.Join(sep, strings)
```

I think using screenshots is a little bit easier to read, those tables take too much space horizontally and vertically, making it hard to read.

String.iter, unrolled

There was little that could be improved for String.iter, but unrolling showed a ~25% perf gain:
image

String.iteri, unrolled

Basically same story:

image

String.mapi

Performance characteristics were, unsurprisingly, much the same as for String.map above. An easy gain of 2x performance, and reduction in mem and GC pressure.

Real advancement can be made here with fixed or with Span when the time's there. Just like String.map, unrolling had limited effect, but was more significant than for String.map. I suggest to take the unrollx2 variant (x4 and x8 had a difference of ~2%, not worth the effort). This would lead to 2.5x performance gain.

image

String.filter

Since Array.filter is highly optimized with a bit-mask method, I decided to try a method where I simply use Array.filter on the string. This improved perf dramatically by some 1.7-2.5 times, depending on whether the filter is true on "all" characters, "some" or "none" (false for all).

The buildAndCut method zero-allocates the target array and uses a String-overload to create a new string of only the filters characters. I thought this would have more mem pressure, but result was better in both mem and speed.

```f#
let filterByArray predicate (str: string) =
if String.IsNullOrEmpty str then String.Empty
else
str.ToCharArray()
|> Array.filter predicate
|> String

let filterBuildAndCut predicate (str: string) =
let len = String.length str

if len = 0 then String.Empty
else
    let target = Array.zeroCreate len
    let mutable i = 0
    for c in str do
        if predicate c then 
            target.[i] <- c
            i <- i + 1

    String(target, 0, i)

```

image

Still TODO:

  • String.collect
  • String.init
  • String.replicate
  • String.forall
  • String.exists
  • String.length

Next set:

String.length

At first it seems not much can be done here, but at close inspection, it calls String.get_Length twice on the happy path. Changing String.IsNullOrEmpty to isNull gives a 33% perf improvement. Since this is such a small function, it'll only show when used in tight loops. Anyway, it's as simple as changing the current impl. to this:

```f#
let nullOnly (str:string) =
if isNull str then 0
else str.Length

Fun in a loop to make is measurable, adjusted with `OperationsPerInvoke`:
![image](https://user-images.githubusercontent.com/16015770/84595145-bed0d900-ae56-11ea-997e-921c034ea0fb.png)

(note that this is not a pure measurement, as there's auxiliary code present, but the difference between the two is only the body of the `String.length` function.)

It's interesting to take a peak at the disassembly here. If we zoom in on the body of the loop, which is inlined by the JIT, this is what the original `String.length` gave us. As can be seen, this does two pointer-dereferences to calculate the length:

```assembly
; FSharp.Perf.BenchLength.current()
       sub       rsp,28
       mov       ecx,[rcx+8]
       call      FSharp.Perf.Data.get(Int32)
       xor       edx,edx
M00_L00:  // loop body, i.e., String.length
       test      rax,rax              // rax & rax, in other words: null-check, sets flags
       je        short M00_L01        // if zero (null) jump back to loop
       cmp       dword ptr [rax+8],0  // compare string-length to 0
       jbe       short M00_L01        // if equal, jump back to loop
       mov       ecx,[rax+8]          // if not, get string length fallthrough to loop
M00_L01:
       inc       edx
       cmp       edx,2711
       jl        short M00_L00
       add       rsp,28
       ret
; Total bytes of code 43

When we compare that to the disassembly where we only use isNull, the pointer dereference is gone completely and we have only three short instructions left in the body:

; FSharp.Perf.BenchLength.nullOnly()
       sub       rsp,28
       mov       ecx,[rcx+8]
       call      FSharp.Perf.Data.get(Int32)
       xor       edx,edx
M00_L00:  // loop body, i.e., nullOnly method
       test      rax,rax        // rax & rax, for null-check, sets flags
       je        short M00_L01  // if zero (null) jump back to loop
       mov       ecx,[rax+8]    // get the string-length from the string and fallthrough to loop
M00_L01:
       inc       edx
       cmp       edx,2711
       jl        short M00_L00
       add       rsp,28
       ret
; Total bytes of code 37

As you can see, we are down from five assembly instructions (incl two pointer dereferences) to only three assembly instructions.

Maybe a bit much of an analysis for such a small function, but it was fun to check what goes on under the hood. Note that the assembly can look different in other scenarios because of JIT optimizations.

It's also good to find out that the JIT does not do an extra null-check for the callvirt IL instruction, as it already appears to know that the reference cannot be null at that point. In fact, the whole call to get_Length disappears and essentially boils down to a single mov ecx, [rax+8].

String.exists && String.forall

This method takes a predicate that operates on the char. There's no direct equiv. in the .NET stdlibs. I've tried several alternatives of coding this, esp. since at first sight it seems the original is _not tail-recursive_, but this is recognized by the JIT anyway and it ~is unrolled in~ becomes an efficient loop.

Same is true for String.forall.

The only change to make here is the redundant call to get_Length, which only has an impact when searching very short strings.

tbc...

Next up:

String.replicate

There're some serious optimizations that are possible here, esp. if we would open up use of cpblk or initblk (they've been optimized, see https://github.com/dotnet/coreclr/pull/7198; we'll maybe get there when we start with array functions), but for this round I'll stick to optimizing with a simple Array.Copy, which internally calls an optimized native method already.

This gives us between 1.2x and 8x (!) performance improvements. Occasionally, the Array.Copy method can be a few percents slower, but this seems to be influenced by LOH when doing this on really large strings and counts. Since the general case is likely small strings, this greatly improves the general case.

The number of iterations and memcopies is now O(log(n)) instead of O(n), and there's no redundant memory allocated (other than for the required copy of the string at the end).

The optimization in code:

```f#
let replicateOpt count (str:string) =
if isNull str then
String.Empty
else
let len = str.Length
let target = Array.zeroCreate (len * count)
let source = str.ToCharArray()
Array.Copy(source, 0, target, 0, len)
let mutable i = len
while i * 2 < target.Length do
Array.Copy(target, 0, target, i, i)
i <- i * 2

        Array.Copy(target, 0, target, i, target.Length - i)
        String target
Speed:
![image](https://user-images.githubusercontent.com/16015770/84655772-d8832680-af11-11ea-9bf8-4a310752915d.png)

Memory:
![image](https://user-images.githubusercontent.com/16015770/84655822-f3ee3180-af11-11ea-93bd-5acce3433279.png)

### `String.init`

This one proves to be tricky to optimize, as the performance distribution of both the array-based new version and the existing implementation based on `StringBuilder` is not uniform for both memory and speed. This makes comparison hard. When `Count` is small (number of iterations) then the new version is the clear winner, but at larger counts, esp. if the returned strings are small (range 1-2 chars), the original is much faster and uses less memory.

When it comes a random distribution of strings in the range 0-200 chars, the difference is neglible*:

![image](https://user-images.githubusercontent.com/16015770/84662195-c4dcbd80-af1b-11ea-911f-33c0b2171408.png)

<sup>_* EDIT: the "empty" category wasn't empty, but same as "5or16", by accident. And "random" used `Random::Next`, which caused too much noise. You can disregard those rows. Oh, and `replicateOpt` really is `initOpt`, just missed while renaming._</sup>

Unless I can think of something better (unrolling the original, had no effect either), this one should probably remain as-is, since we won't be able to predict the average result of the `initialize` function.

The code:

```f#
let initOpt count (initializer: int -> string) =
    let target = Array.zeroCreate count
    for i=0 to count - 1 do 
        target.[i] <- initializer i

    String.Concat target

tbc (only String.collect remains)

String.init revisited

I couldn't let go. From the prev. measurements, it was clear that using the highly optimized String.Concat was significantly faster on small counts, so I leveraged that and now anything below count = 8 is between 12x (!!) and 1.1x faster, most over 2x faster, depending on size of count and size of input (larger strings means greater improvement).

Memory use and GC pressure also drops by 2x - 8x.

image

The code (from count > 4 it becomes a paramarray, also note that when count > 8 will be put up top by the compiler):

f# /// Using ParamArray as an optimized version of Array gives much better performance on small counts (up to 10x faster). let initOpt (count:int) (initializer: int-> string) = match count with | 0 -> String.Empty | 1 -> initializer 0 | 2 -> String.Concat(initializer 0, initializer 1) | 3 -> String.Concat(initializer 0, initializer 1, initializer 2) | 4 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3) | 5 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4) | 6 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4, initializer 5) | 7 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4, initializer 5, initializer 6) | 8 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4, initializer 5, initializer 6, initializer 7) | _ when count > 8 -> let res = StringBuilder count for i = 0 to count - 1 do res.Append(initializer i) |> ignore res.ToString() | _ -> failwith "error" // will become invalidArgInputMustBeNonNegative

String.init revisited (2) and final

It turned out that through copy and pasting I made mistakes on the first measurements, wrongly believing that using the array-based approach was so bad. I stabilized "random" between runs, so that all measurements can be compared, and I fixed the code where the wrong functions were called.

  • The fix now special-cases for very small amounts and unrolls the loop a little
  • Special-casing was _less fast_ than array-based above count = 4, so I dropped that
  • If the initialize function only returns very small strings of 1-3 chars, there's a performance degradation
  • If the output of initialize is 4 or 5 chars, it evens out, if it is random or larger than that, or zero chars, the array-based version wins in all cases
  • Memory pressure is, on average 60% less than before, GC pressure oftentimes even 4x better.

Let's get the numbers:

image

The code is now slightly more complex, but still not by much:

```f#
/// 2-10x faster on small counts, about 20% faster for very large count, for all cases, less memory pressure.
let initWithArray count (initializer: int -> string) =
if count < 5 then
match count with
| 0 -> String.Empty
| 1 -> initializer 0
| 2 -> String.Concat(initializer 0, initializer 1)
| 3 -> String.Concat(initializer 0, initializer 1, initializer 2)
| 4 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3)
| _ -> failwith "error" // arg error
else
let target = Array.zeroCreate count
let mutable i = 0
while i < count - count % 4 do
target.[i] <- initializer i
target.[i + 1] <- initializer(i + 1)
target.[i + 2] <- initializer(i + 2)
target.[i + 3] <- initializer(i + 3)
i <- i + 4

    while i < count do
        target.[i] <- initializer i
        i <- i + 1

    String.Concat target

```

If we zoom in on the break-even point for when the old version was faster than the new version, it is here, And as you can see, still less GC pressure:

image

Are you planning to send fixes?

@forki, absolutely, as soon as the string functions are done (one left), I'll send in a PR. I figured it made more sense to first do the exercise, timings etc, so that the reasoning behind the fixes, which aren't always obvious, are at least logged.

Please do it as separate PRs so that it's easier to get those accepted

Abel Braaksma notifications@github.com schrieb am Di., 16. Juni 2020,
20:00:

@forki https://github.com/forki, absolutely, as soon as the string
functions are done (one left), I'll send in a PR. I figured it made more
sense to first do the exercise, so that the reasoning behind the fixes,
which aren't always obvious, are at least logged.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/dotnet/fsharp/issues/9390#issuecomment-644920017, or
unsubscribe
https://github.com/notifications/unsubscribe-auth/AAAOANELTPMSWGDOGWRTFATRW6XLVANCNFSM4NR6H2KA
.

Please do it as separate PRs so that it's easier to get those accepted

@forki I was figuring to do one PR for the string stuff (others for array or whatever else comes my way), which is really just a few lines in a single file, but if you think it is better to do it function-by-function, I can do that, but I thought that would make it harder instead of easier...

@cartermp, @TIHan, @dsyme, any preference here so that I don't have to do the PR's twice?

From experience in this repo I can say that bigger performance PRs will
sometimes have a hard time to get in.

Abel Braaksma notifications@github.com schrieb am Di., 16. Juni 2020,
20:05:

Please do it as separate PRs so that it's easier to get those accepted

I was figuring to do one PR for the string stuff, which is really just a
few lines in a single file
https://github.com/dotnet/fsharp/blob/master/src/fsharp/FSharp.Core/string.fs,
but if you think it is better to do it function-by-function, I can do that,
but I thought that would make it harder instead of easier...

@cartermp https://github.com/cartermp, @TIHan https://github.com/TIHan,
@dsyme https://github.com/dsyme, any preference here so that I don't
have to do the PR's twice?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/dotnet/fsharp/issues/9390#issuecomment-644922607, or
unsubscribe
https://github.com/notifications/unsubscribe-auth/AAAOANA5G7FVK3G2I2ZPV4DRW6X7LANCNFSM4NR6H2KA
.

Yep, the smaller/more contained the better

Ok, got it, I'll make a separate pr for each function then,and I'll link it to the relevant comment.

@KevinRansom, an example that uses buffered StringBuilder, as we discussed, is the String.concat function. On sequences, it performs better than other methods. It uses System.String::Join, which internally uses the buffered SB (but, even the BCL _does not use that if the input is an array_). See the timings here: https://github.com/dotnet/fsharp/issues/9390#issuecomment-641553171

@abelbraaksma , thanks I will look at it, and try to understand why.

@abelbraaksma how far along do you think this one is in terms of getting to closure?

@cartermp, I think there are quite a few improvements already, but I've yet to check in the array functions improvements. I might get another stab at it during the holidays (won't be going anywhere anyway, it's still 2020:))

Was this page helpful?
0 / 5 - 0 ratings