I've noticed for some time now that quite a few functions that are called in many places are suboptimal in terms of performance. For instance, Array.create, Array.init, Array.concat and Array.filter can be improved upon with sometimes about 1.2-1.5x the speed, sometimes even manyfold (I got Array.replicate times source |> Array.concat 25X faster with a dedicated Array.repeat, so that was a bit of cheating).
Sometimes such improvements relate specifically to large arrays, sometimes specifically to small ones. In line of that, String.xxx functions that use StringBuilder when replaced with ToCharArray() instead improve by about 20-30%.
Before I start making a PR, which would certainly include showing the timings with BenchmarkDotNet for a variety of inputs, would such a PR be welcome? While nothing I tried is really complex, and what I tried is provably correct, as always with optimizations, they will increase the complexity of the code.
To get some idea, here's String.map vs a very simple optimization of str.ToCharArray() |> Array.map mapper |> String, which in many cases wins 40% (needs more input lengths comparisons, though, this was with 12-char string):
| Method | Job | Jit | Platform | Runtime | Mean | Error | StdDev | Min | Max | Ratio | Rank | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------- |----------- |---------- |--------- |----------- |----------:|---------:|---------:|----------:|----------:|------:|-----:|-------:|------:|------:|----------:|
| fastMap | Job-MACJMZ | LegacyJit | X64 | .NET 4.8 | 106.91ns | 1.069ns | 1.000ns | 105.31ns | 108.28ns | 0.65 | 1 | 0.0263 | - | - | 168 B |
| stringMap | Job-MACJMZ | LegacyJit | X64 | .NET 4.8 | 164.29ns | 2.743ns | 2.291ns | 160.02ns | 169.34ns | 1.00 | 2 | 0.0315 | - | - | 201 B |
| | | | | | | | | | | | | | | | |
| fastMap | Job-CZIGBL | LegacyJit | X86 | .NET 4.8 | 118.16ns | 1.094ns | 0.970ns | 116.88ns | 120.14ns | 0.64 | 1 | 0.0219 | - | - | 120 B |
| stringMap | Job-CZIGBL | LegacyJit | X86 | .NET 4.8 | 184.93ns | 1.647ns | 1.540ns | 181.82ns | 187.86ns | 1.00 | 2 | 0.0233 | - | - | 128 B |
| | | | | | | | | | | | | | | | |
| fastMap | Job-FZSSVS | RyuJit | X64 | .NET 4.8 | 95.61ns | 0.974ns | 0.911ns | 94.33ns | 97.43ns | 0.63 | 1 | 0.0261 | - | - | 168 B |
| stringMap | Job-FZSSVS | RyuJit | X64 | .NET 4.8 | 151.22ns | 1.639ns | 1.368ns | 149.13ns | 154.09ns | 1.00 | 2 | 0.0315 | - | - | 201 B |
| | | | | | | | | | | | | | | | |
| fastMap | Job-OXSCJZ | LegacyJit | X86 | .NET 4.6.1 | 117.96ns | 1.202ns | 1.065ns | 116.12ns | 119.48ns | 0.62 | 1 | 0.0227 | - | - | 120 B |
| stringMap | Job-OXSCJZ | LegacyJit | X86 | .NET 4.6.1 | 191.32ns | 3.016ns | 2.673ns | 188.10ns | 196.17ns | 1.00 | 2 | 0.0237 | - | - | 128 B |
@abelbraaksma great observation. Would you be interested in submitting a PR to adjust these functions?
@cartermp, sure thing, I just wasn't certain how much of "performance sugar" is allowed. I mean, I would assume using Span is not possible because we still support older .NET versions, but heavily-used functions like Array.create benefits tremendously with some loop-unrolling and block-alignment, for instance.
I'll submit a PR (will probably mark it WIP as this may need some iterations) and have you comment on whether certain performance-doping is allowed or not.
@cartermp, one other question, for perf comparison, would it be enough to just run RuyJit with .NET 4.8 on 64 bit, plus one LegacyJit on x86, or should I make more system configurations (.NET Core, Standard, I don't have Mono installed...)? It should be enough to focus on the 80% users group right, and not focus too much on niche configurations?
I think .NET Core (latest) and .NET Framework (any version) would be good enough to run some benchmarks against.
Regarding performance sugar, yes Span can't be used because that would make FSharp.Core inaccessibly anywhere except for .NET Core and Xamarin. But anything else should be fine, so long as it doesn't change the public API or the end result of the functions.
I think a PR would be fine.
After playing around a little with the various ways String.map can be optimized, I'm pasting the summarized results here. I expect to be able to use the "lessons learned" here with the other functions, so that I can settle with less variants to test.
What I tried:
StringBuilder, as we know the size of the string beforehand. Array.map and loop unrolling. The latter proved not so effective as can be seen below (x2 means 2 unrolls, x4 means 4)while vs for, which should be indistinguishable mostly, and on .NET Core that's true, but somehow on .NET Framework (not shown here) while was significantly slowerfixed and String.Copy. These are the fastest by far, and use least memory, but may not be "admissable" as it makes the function unsafe.Span and String.Create. These cannot be used, as they require a dependency for .NET Framework. They are a good alternative to fixed, also little memory use and faster than the rest, except on small strings.smallMap in the overview. This is the winner on the low range, but slightly slower for larger input (though this may be measurement deviation, the code is the same as fasterMap, except for a few leading ifs).My own verdict:
fixed without loop-unrolling. It is simple enough, and it uses least memory, and is 3x to even 4x faster than the current implementation. The larger the input, the bigger the difference.fasterMap, simple and straightforward, but generally fastest without "doping", or smallMap, if we want to have top performance on strings up to 8 chars (above that, they're mostly equal).Note that any alternative I tried was faster than the current implementation for average input. And almost all of them have less GC pressure and memory use.
I've also tested with very large inputs up until 100MB, usually the perf difference shift in favor of the new algorithms, because much less memory is used. With 100MB, fixed is almost 5x times faster than current.
Benchmarks were run with BDN, using 0.01 max relative error. Originally each test was run with a variety of mapper functions, but that made little change, so I left those variations out.
fixed versionsString.map as it is now in the FSharp.Core library. It is repeated for each block. Candidate means they are simplest to adopt, Alt means also simple, but slightly slower on .NET Framework than shown here, I don't prefer them. Fixed is fixed, and Span is span.BenchmarkDotNet=v0.12.1, OS=Windows 7 SP1 (6.1.7601.0)
Intel Xeon CPU X5680 3.33GHz, 2 CPU, 12 logical and 12 physical cores
Frequency=3247119 Hz, Resolution=307.9653 ns, Timer=TSC
.NET Core SDK=3.1.300-preview-015135
[Host] : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT DEBUG
.NET Core 3.1 : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT
Job=.NET Core 3.1 Runtime=.NET Core 3.1
| Method | Categories | Value | Mean | Error | StdDev | Median | Ratio | Rank | Gen 0 | Allocated |
|------------------ |----------- |------ |-------------:|-----------:|-----------:|-------------:|------:|-----:|-------:|----------:|
| current | Candidate | 0 | 12.86ns | 0.101ns | 0.089ns | 12.89ns | 1.00 | 1 | 0.0038 | 24 B |
| fastMap | Candidate | 0 | 13.07ns | 0.078ns | 0.069ns | 13.06ns | 1.02 | 2 | 0.0038 | 24 B |
| fasterMap | Candidate | 0 | 12.75ns | 0.071ns | 0.067ns | 12.76ns | 0.99 | 1 | 0.0038 | 24 B |
| smallMap | Candidate | 0 | 16.57ns | 0.097ns | 0.091ns | 16.53ns | 1.29 | 3 | 0.0038 | 24 B |
| | | | | | | | | | | |
| current | Alt | 0 | 12.76ns | 0.173ns | 0.154ns | 12.76ns | 1.00 | 1 | 0.0038 | 24 B |
| whileMap | Alt | 0 | 12.88ns | 0.073ns | 0.065ns | 12.89ns | 1.01 | 1 | 0.0038 | 24 B |
| allocateMap | Alt | 0 | 12.86ns | 0.161ns | 0.151ns | 12.89ns | 1.01 | 1 | 0.0038 | 24 B |
| unrollMapX2 | Alt | 0 | 15.88ns | 0.213ns | 0.199ns | 15.78ns | 1.25 | 3 | 0.0038 | 24 B |
| unrollMapX4 | Alt | 0 | 15.43ns | 0.082ns | 0.076ns | 15.43ns | 1.21 | 2 | 0.0038 | 24 B |
| | | | | | | | | | | |
| current | Fixed | 0 | 12.59ns | 0.172ns | 0.161ns | 12.51ns | 1.00 | 1 | 0.0038 | 24 B |
| fixedCopy | Fixed | 0 | 28.35ns | 0.361ns | 0.338ns | 28.19ns | 2.25 | 2 | 0.0076 | 48 B |
| fixedCopyx2 | Fixed | 0 | 28.36ns | 0.212ns | 0.177ns | 28.42ns | 2.25 | 2 | 0.0076 | 48 B |
| fixedCopyx4 | Fixed | 0 | 28.51ns | 0.300ns | 0.281ns | 28.57ns | 2.26 | 2 | 0.0076 | 48 B |
| | | | | | | | | | | |
| current | Span | 0 | 12.90ns | 0.126ns | 0.118ns | 12.87ns | 1.00 | 1 | 0.0038 | 24 B |
| spanCopy | Span | 0 | 28.60ns | 0.169ns | 0.141ns | 28.65ns | 2.22 | 2 | 0.0076 | 48 B |
| spanStrCreate | Span | 0 | 35.44ns | 0.395ns | 0.350ns | 35.37ns | 2.75 | 3 | 0.0178 | 112 B |
| spanStrCreateAlt | Span | 0 | 48.54ns | 0.936ns | 0.919ns | 48.79ns | 3.76 | 4 | 0.0229 | 144 B |
| spanStrCreateUnx4 | Span | 0 | 47.61ns | 0.984ns | 1.171ns | 47.65ns | 3.70 | 4 | 0.0229 | 144 B |
| | | | | | | | | | | |
| current | Candidate | 1 | 69.09ns | 0.852ns | 0.756ns | 69.10ns | 1.00 | 4 | 0.0254 | 160 B |
| fastMap | Candidate | 1 | 52.83ns | 0.654ns | 0.612ns | 52.60ns | 0.76 | 3 | 0.0178 | 112 B |
| fasterMap | Candidate | 1 | 45.80ns | 0.482ns | 0.451ns | 45.85ns | 0.66 | 2 | 0.0127 | 80 B |
| smallMap | Candidate | 1 | 24.11ns | 0.419ns | 0.372ns | 24.11ns | 0.35 | 1 | 0.0076 | 48 B |
| | | | | | | | | | | |
| current | Alt | 1 | 67.27ns | 0.505ns | 0.473ns | 67.23ns | 1.00 | 4 | 0.0254 | 160 B |
| whileMap | Alt | 1 | 46.82ns | 0.499ns | 0.467ns | 46.75ns | 0.70 | 3 | 0.0127 | 80 B |
| allocateMap | Alt | 1 | 43.11ns | 0.635ns | 0.563ns | 42.95ns | 0.64 | 2 | 0.0127 | 80 B |
| unrollMapX2 | Alt | 1 | 23.80ns | 0.247ns | 0.231ns | 23.87ns | 0.35 | 1 | 0.0076 | 48 B |
| unrollMapX4 | Alt | 1 | 23.59ns | 0.148ns | 0.138ns | 23.64ns | 0.35 | 1 | 0.0076 | 48 B |
| | | | | | | | | | | |
| current | Fixed | 1 | 69.15ns | 1.132ns | 1.059ns | 69.30ns | 1.00 | 3 | 0.0254 | 160 B |
| fixedCopy | Fixed | 1 | 30.94ns | 0.520ns | 0.486ns | 31.12ns | 0.45 | 2 | 0.0076 | 48 B |
| fixedCopyx2 | Fixed | 1 | 30.36ns | 0.396ns | 0.371ns | 30.20ns | 0.44 | 1 | 0.0076 | 48 B |
| fixedCopyx4 | Fixed | 1 | 31.12ns | 0.271ns | 0.226ns | 31.07ns | 0.45 | 2 | 0.0076 | 48 B |
| | | | | | | | | | | |
| current | Span | 1 | 68.61ns | 0.356ns | 0.333ns | 68.62ns | 1.00 | 5 | 0.0254 | 160 B |
| spanCopy | Span | 1 | 31.36ns | 0.529ns | 0.494ns | 31.22ns | 0.46 | 1 | 0.0076 | 48 B |
| spanStrCreate | Span | 1 | 51.92ns | 1.075ns | 1.056ns | 51.75ns | 0.76 | 2 | 0.0216 | 136 B |
| spanStrCreateAlt | Span | 1 | 62.80ns | 0.841ns | 0.787ns | 63.05ns | 0.92 | 3 | 0.0267 | 168 B |
| spanStrCreateUnx4 | Span | 1 | 65.55ns | 0.716ns | 0.670ns | 65.73ns | 0.96 | 4 | 0.0267 | 168 B |
| Method | Categories | Value | Mean | Error | StdDev | Median | Ratio | Rank | Gen 0 | Allocated |
|------------------ |----------- |------ |-------------:|-----------:|-----------:|-------------:|------:|-----:|-------:|----------:|
| current | Candidate | 5 | 98.39ns | 0.599ns | 0.531ns | 98.15ns | 1.00 | 4 | 0.0280 | 176 B |
| fastMap | Candidate | 5 | 64.59ns | 1.170ns | 1.094ns | 64.16ns | 0.66 | 3 | 0.0216 | 136 B |
| fasterMap | Candidate | 5 | 54.99ns | 0.511ns | 0.478ns | 54.95ns | 0.56 | 1 | 0.0153 | 96 B |
| smallMap | Candidate | 5 | 61.75ns | 1.051ns | 0.983ns | 61.69ns | 0.63 | 2 | 0.0153 | 96 B |
| | | | | | | | | | | |
| current | Alt | 5 | 104.15ns | 2.107ns | 2.342ns | 105.23ns | 1.00 | 4 | 0.0280 | 176 B |
| whileMap | Alt | 5 | 55.53ns | 0.439ns | 0.367ns | 55.42ns | 0.53 | 2 | 0.0153 | 96 B |
| allocateMap | Alt | 5 | 52.50ns | 0.750ns | 0.701ns | 52.49ns | 0.50 | 1 | 0.0153 | 96 B |
| unrollMapX2 | Alt | 5 | 55.48ns | 0.780ns | 0.729ns | 55.24ns | 0.53 | 2 | 0.0153 | 96 B |
| unrollMapX4 | Alt | 5 | 59.18ns | 0.919ns | 0.860ns | 59.39ns | 0.57 | 3 | 0.0153 | 96 B |
| | | | | | | | | | | |
| current | Fixed | 5 | 103.54ns | 1.900ns | 1.866ns | 103.56ns | 1.00 | 3 | 0.0280 | 176 B |
| fixedCopy | Fixed | 5 | 37.75ns | 0.698ns | 0.619ns | 37.71ns | 0.36 | 1 | 0.0089 | 56 B |
| fixedCopyx2 | Fixed | 5 | 37.71ns | 0.408ns | 0.382ns | 37.75ns | 0.36 | 1 | 0.0089 | 56 B |
| fixedCopyx4 | Fixed | 5 | 39.36ns | 0.488ns | 0.407ns | 39.38ns | 0.38 | 2 | 0.0089 | 56 B |
| | | | | | | | | | | |
| current | Span | 5 | 102.48ns | 2.076ns | 2.977ns | 101.57ns | 1.00 | 4 | 0.0280 | 176 B |
| spanCopy | Span | 5 | 40.11ns | 0.787ns | 0.966ns | 39.84ns | 0.39 | 1 | 0.0089 | 56 B |
| spanStrCreate | Span | 5 | 63.09ns | 1.292ns | 2.086ns | 62.85ns | 0.62 | 2 | 0.0229 | 144 B |
| spanStrCreateAlt | Span | 5 | 74.00ns | 1.426ns | 1.334ns | 73.76ns | 0.71 | 3 | 0.0280 | 176 B |
| spanStrCreateUnx4 | Span | 5 | 74.88ns | 0.924ns | 0.864ns | 74.88ns | 0.72 | 3 | 0.0280 | 176 B |
| | | | | | | | | | | |
| 10 chars | | | | | | | | | | |
| current | Candidate | 10 | 137.76ns | 1.850ns | 1.640ns | 137.51ns | 1.00 | 4 | 0.0317 | 200 B |
| fastMap | Candidate | 10 | 81.41ns | 0.725ns | 0.678ns | 81.29ns | 0.59 | 3 | 0.0267 | 168 B |
| fasterMap | Candidate | 10 | 69.10ns | 0.469ns | 0.438ns | 69.18ns | 0.50 | 1 | 0.0191 | 120 B |
| smallMap | Candidate | 10 | 75.97ns | 0.810ns | 0.718ns | 76.21ns | 0.55 | 2 | 0.0191 | 120 B |
| | | | | | | | | | | |
| current | Alt | 10 | 135.96ns | 1.211ns | 1.011ns | 135.73ns | 1.00 | 3 | 0.0317 | 200 B |
| whileMap | Alt | 10 | 70.25ns | 1.253ns | 1.110ns | 70.04ns | 0.52 | 2 | 0.0191 | 120 B |
| allocateMap | Alt | 10 | 67.91ns | 0.863ns | 0.721ns | 67.95ns | 0.50 | 1 | 0.0191 | 120 B |
| unrollMapX2 | Alt | 10 | 70.09ns | 0.635ns | 0.594ns | 69.92ns | 0.52 | 2 | 0.0191 | 120 B |
| unrollMapX4 | Alt | 10 | 68.89ns | 0.695ns | 0.650ns | 68.92ns | 0.51 | 1 | 0.0191 | 120 B |
| | | | | | | | | | | |
| current | Fixed | 10 | 137.42ns | 2.783ns | 3.809ns | 135.49ns | 1.00 | 3 | 0.0317 | 200 B |
| fixedCopy | Fixed | 10 | 47.57ns | 0.471ns | 0.394ns | 47.66ns | 0.34 | 1 | 0.0114 | 72 B |
| fixedCopyx2 | Fixed | 10 | 48.22ns | 0.301ns | 0.282ns | 48.20ns | 0.35 | 1 | 0.0114 | 72 B |
| fixedCopyx4 | Fixed | 10 | 49.94ns | 0.730ns | 0.609ns | 49.83ns | 0.36 | 2 | 0.0114 | 72 B |
| | | | | | | | | | | |
| current | Span | 10 | 136.76ns | 1.311ns | 1.226ns | 136.86ns | 1.00 | 5 | 0.0317 | 200 B |
| spanCopy | Span | 10 | 51.29ns | 0.400ns | 0.355ns | 51.43ns | 0.37 | 1 | 0.0114 | 72 B |
| spanStrCreate | Span | 10 | 75.43ns | 0.913ns | 0.763ns | 75.27ns | 0.55 | 2 | 0.0254 | 160 B |
| spanStrCreateAlt | Span | 10 | 87.43ns | 0.599ns | 0.531ns | 87.50ns | 0.64 | 4 | 0.0305 | 192 B |
| spanStrCreateUnx4 | Span | 10 | 85.80ns | 1.017ns | 0.901ns | 85.81ns | 0.63 | 3 | 0.0305 | 192 B |
| Method | Categories | Value | Mean | Error | StdDev | Median | Ratio | Rank | Gen 0 | Allocated |
|------------------ |----------- |------ |-------------:|-----------:|-----------:|-------------:|------:|-----:|-------:|----------:|
| current | Candidate | 31 | 296.99ns | 5.675ns | 12.217ns | 294.12ns | 1.00 | 4 | 0.0443 | 280 B |
| fastMap | Candidate | 31 | 139.43ns | 1.113ns | 1.041ns | 139.34ns | 0.46 | 3 | 0.0458 | 288 B |
| fasterMap | Candidate | 31 | 122.91ns | 0.920ns | 0.816ns | 123.21ns | 0.41 | 1 | 0.0317 | 200 B |
| smallMap | Candidate | 31 | 135.52ns | 1.314ns | 1.165ns | 135.52ns | 0.45 | 2 | 0.0317 | 200 B |
| | | | | | | | | | | |
| current | Alt | 31 | 299.02ns | 5.659ns | 5.558ns | 297.41ns | 1.00 | 3 | 0.0443 | 280 B |
| whileMap | Alt | 31 | 120.16ns | 2.376ns | 3.005ns | 117.99ns | 0.41 | 1 | 0.0317 | 200 B |
| allocateMap | Alt | 31 | 129.14ns | 0.867ns | 0.811ns | 129.41ns | 0.43 | 2 | 0.0317 | 200 B |
| unrollMapX2 | Alt | 31 | 118.54ns | 1.121ns | 0.994ns | 118.43ns | 0.40 | 1 | 0.0317 | 200 B |
| unrollMapX4 | Alt | 31 | 119.67ns | 1.613ns | 1.430ns | 119.46ns | 0.40 | 1 | 0.0317 | 200 B |
| | | | | | | | | | | |
| current | Fixed | 31 | 292.63ns | 3.647ns | 3.412ns | 291.12ns | 1.00 | 4 | 0.0443 | 280 B |
| fixedCopy | Fixed | 31 | 90.22ns | 0.522ns | 0.488ns | 90.32ns | 0.31 | 1 | 0.0178 | 112 B |
| fixedCopyx2 | Fixed | 31 | 92.90ns | 0.598ns | 0.559ns | 92.94ns | 0.32 | 2 | 0.0178 | 112 B |
| fixedCopyx4 | Fixed | 31 | 100.58ns | 1.755ns | 1.642ns | 100.22ns | 0.34 | 3 | 0.0178 | 112 B |
| | | | | | | | | | | |
| current | Span | 31 | 295.64ns | 1.518ns | 1.267ns | 295.66ns | 1.00 | 5 | 0.0443 | 280 B |
| spanCopy | Span | 31 | 101.58ns | 1.860ns | 1.826ns | 101.79ns | 0.34 | 1 | 0.0178 | 112 B |
| spanStrCreate | Span | 31 | 132.14ns | 2.560ns | 2.395ns | 132.83ns | 0.45 | 2 | 0.0317 | 200 B |
| spanStrCreateAlt | Span | 31 | 146.46ns | 2.965ns | 3.530ns | 147.03ns | 0.49 | 4 | 0.0370 | 232 B |
| spanStrCreateUnx4 | Span | 31 | 140.53ns | 2.264ns | 2.007ns | 141.06ns | 0.47 | 3 | 0.0370 | 232 B |
| | | | | | | | | | | |
| 100 chars | | | | | | | | | | |
| current | Candidate | 100 | 796.34ns | 15.934ns | 14.125ns | 788.09ns | 1.00 | 3 | 0.0877 | 552 B |
| fastMap | Candidate | 100 | 347.12ns | 5.982ns | 7.986ns | 345.38ns | 0.44 | 2 | 0.1106 | 696 B |
| fasterMap | Candidate | 100 | 311.80ns | 1.806ns | 1.601ns | 312.26ns | 0.39 | 1 | 0.0749 | 472 B |
| smallMap | Candidate | 100 | 344.51ns | 2.428ns | 2.027ns | 345.05ns | 0.43 | 2 | 0.0749 | 472 B |
| | | | | | | | | | | |
| current | Alt | 100 | 802.38ns | 14.461ns | 13.527ns | 803.31ns | 1.00 | 5 | 0.0877 | 552 B |
| whileMap | Alt | 100 | 310.82ns | 3.155ns | 2.951ns | 310.92ns | 0.39 | 3 | 0.0749 | 472 B |
| allocateMap | Alt | 100 | 334.36ns | 1.520ns | 1.421ns | 334.56ns | 0.42 | 4 | 0.0749 | 472 B |
| unrollMapX2 | Alt | 100 | 283.68ns | 1.994ns | 1.865ns | 284.27ns | 0.35 | 2 | 0.0749 | 472 B |
| unrollMapX4 | Alt | 100 | 276.16ns | 3.283ns | 2.910ns | 276.29ns | 0.34 | 1 | 0.0749 | 472 B |
| | | | | | | | | | | |
| current | Fixed | 100 | 804.66ns | 5.603ns | 4.679ns | 805.33ns | 1.00 | 4 | 0.0877 | 552 B |
| fixedCopy | Fixed | 100 | 226.07ns | 1.051ns | 0.877ns | 226.04ns | 0.28 | 1 | 0.0393 | 248 B |
| fixedCopyx2 | Fixed | 100 | 234.08ns | 2.609ns | 2.313ns | 234.07ns | 0.29 | 2 | 0.0391 | 248 B |
| fixedCopyx4 | Fixed | 100 | 245.54ns | 2.454ns | 2.175ns | 245.17ns | 0.31 | 3 | 0.0391 | 248 B |
| | | | | | | | | | | |
| current | Span | 100 | 804.95ns | 8.385ns | 7.844ns | 805.59ns | 1.00 | 5 | 0.0877 | 552 B |
| spanCopy | Span | 100 | 257.32ns | 1.536ns | 1.283ns | 257.22ns | 0.32 | 1 | 0.0391 | 248 B |
| spanStrCreate | Span | 100 | 315.95ns | 1.891ns | 1.677ns | 315.88ns | 0.39 | 3 | 0.0534 | 336 B |
| spanStrCreateAlt | Span | 100 | 344.50ns | 6.737ns | 6.302ns | 342.70ns | 0.43 | 4 | 0.0587 | 368 B |
| spanStrCreateUnx4 | Span | 100 | 294.39ns | 2.441ns | 2.283ns | 293.91ns | 0.37 | 2 | 0.0587 | 368 B |
| Method | Categories | Value | Mean | Error | StdDev | Median | Ratio | Rank | Gen 0 | Gen 1 | Allocated |
| -------------------- | ------------- | -------- | --------------: | ------------: | ------------: | --------------: | -------: | ----: | ---------: | ---------: | ----------: |
| current | Candidate | 256 | 1,952.41ns | 11.159ns | 9.892ns | 1,951.66ns | 1.00 | 3 | 0.1869 | - | 1176 B |
| fastMap | Candidate | 256 | 800.65ns | 7.703ns | 6.829ns | 802.18ns | 0.41 | 2 | 0.2594 | - | 1632 B |
| fasterMap | Candidate | 256 | 739.04ns | 9.457ns | 8.847ns | 735.24ns | 0.38 | 1 | 0.1745 | - | 1096 B |
| smallMap | Candidate | 256 | 811.24ns | 3.172ns | 2.967ns | 811.42ns | 0.42 | 2 | 0.1745 | - | 1096 B |
| | | | | | | | | | | | |
| current | Alt | 256 | 1,950.68ns | 16.943ns | 15.849ns | 1,948.46ns | 1.00 | 5 | 0.1869 | - | 1176 B |
| whileMap | Alt | 256 | 730.78ns | 7.339ns | 6.506ns | 731.13ns | 0.37 | 3 | 0.1745 | - | 1096 B |
| allocateMap | Alt | 256 | 791.70ns | 7.619ns | 5.949ns | 790.93ns | 0.41 | 4 | 0.1745 | - | 1096 B |
| unrollMapX2 | Alt | 256 | 663.65ns | 3.941ns | 3.687ns | 663.96ns | 0.34 | 2 | 0.1745 | - | 1096 B |
| unrollMapX4 | Alt | 256 | 641.95ns | 6.201ns | 5.497ns | 640.22ns | 0.33 | 1 | 0.1745 | - | 1096 B |
| | | | | | | | | | | | |
| current | Fixed | 256 | 1,975.77ns | 24.444ns | 22.865ns | 1,977.10ns | 1.00 | 4 | 0.1869 | - | 1176 B |
| fixedCopy | Fixed | 256 | 551.65ns | 7.620ns | 7.127ns | 549.79ns | 0.28 | 1 | 0.0887 | - | 560 B |
| fixedCopyx2 | Fixed | 256 | 569.96ns | 5.247ns | 4.908ns | 568.99ns | 0.29 | 2 | 0.0887 | - | 560 B |
| fixedCopyx4 | Fixed | 256 | 589.82ns | 4.786ns | 3.997ns | 590.19ns | 0.30 | 3 | 0.0887 | - | 560 B |
| | | | | | | | | | | | |
| current | Span | 256 | 1,969.66ns | 19.996ns | 18.705ns | 1,970.26ns | 1.00 | 5 | 0.1869 | - | 1176 B |
| spanCopy | Span | 256 | 632.78ns | 6.331ns | 5.922ns | 634.22ns | 0.32 | 1 | 0.0887 | - | 560 B |
| spanStrCreate | Span | 256 | 732.16ns | 5.446ns | 5.094ns | 733.44ns | 0.37 | 3 | 0.1030 | - | 648 B |
| spanStrCreateAlt | Span | 256 | 772.54ns | 5.266ns | 4.668ns | 773.87ns | 0.39 | 4 | 0.1078 | - | 680 B |
| spanStrCreateUnx4 | Span | 256 | 665.30ns | 10.455ns | 9.779ns | 665.44ns | 0.34 | 2 | 0.1078 | - | 680 B |
| | | | | | | | | | | | |
| 1024 chars | | | | | | | | | | | |
| current | Candidate | 1024 | 7,629.39ns | 60.810ns | 53.907ns | 7,625.62ns | 1.00 | 3 | 0.6714 | - | 4248 B |
| fastMap | Candidate | 1024 | 3,082.48ns | 26.745ns | 25.017ns | 3,070.76ns | 0.40 | 2 | 0.9918 | 0.0038 | 6240 B |
| fasterMap | Candidate | 1024 | 2,815.90ns | 11.005ns | 8.592ns | 2,815.05ns | 0.37 | 1 | 0.6638 | - | 4168 B |
| smallMap | Candidate | 1024 | 2,804.66ns | 35.226ns | 31.227ns | 2,801.53ns | 0.37 | 1 | 0.6638 | - | 4168 B |
| | | | | | | | | | | | |
| current | Alt | 1024 | 7,653.67ns | 66.666ns | 62.360ns | 7,651.03ns | 1.00 | 5 | 0.6714 | - | 4248 B |
| whileMap | Alt | 1024 | 2,820.49ns | 16.317ns | 15.263ns | 2,820.79ns | 0.37 | 3 | 0.6638 | - | 4168 B |
| allocateMap | Alt | 1024 | 3,046.43ns | 17.272ns | 14.423ns | 3,047.76ns | 0.40 | 4 | 0.6638 | - | 4168 B |
| unrollMapX2 | Alt | 1024 | 2,690.30ns | 26.892ns | 22.456ns | 2,687.61ns | 0.35 | 2 | 0.6638 | - | 4168 B |
| unrollMapX4 | Alt | 1024 | 2,530.67ns | 14.308ns | 13.383ns | 2,531.89ns | 0.33 | 1 | 0.6638 | - | 4168 B |
| | | | | | | | | | | | |
| current | Fixed | 1024 | 7,647.49ns | 51.204ns | 47.897ns | 7,635.90ns | 1.00 | 4 | 0.6714 | - | 4248 B |
| fixedCopy | Fixed | 1024 | 2,121.33ns | 29.708ns | 27.789ns | 2,118.47ns | 0.28 | 1 | 0.3319 | - | 2096 B |
| fixedCopyx2 | Fixed | 1024 | 2,164.05ns | 11.677ns | 9.117ns | 2,164.54ns | 0.28 | 2 | 0.3319 | - | 2096 B |
| fixedCopyx4 | Fixed | 1024 | 2,263.88ns | 23.129ns | 20.504ns | 2,255.20ns | 0.30 | 3 | 0.3319 | - | 2096 B |
| | | | | | | | | | | | |
| current | Span | 1024 | 7,669.55ns | 67.241ns | 62.898ns | 7,679.77ns | 1.00 | 4 | 0.6714 | - | 4248 B |
| spanCopy | Span | 1024 | 2,397.11ns | 16.097ns | 14.269ns | 2,398.97ns | 0.31 | 1 | 0.3319 | - | 2096 B |
| spanStrCreate | Span | 1024 | 2,814.25ns | 43.848ns | 41.016ns | 2,815.72ns | 0.37 | 3 | 0.3471 | - | 2184 B |
| spanStrCreateAlt | Span | 1024 | 2,865.30ns | 22.238ns | 20.801ns | 2,866.39ns | 0.37 | 3 | 0.3510 | - | 2216 B |
| spanStrCreateUnx4 | Span | 1024 | 2,448.91ns | 35.363ns | 31.348ns | 2,452.51ns | 0.32 | 2 | 0.3510 | - | 2216 B |
| | | | | | | | | | | | |
| 8192 chars | | | | | | | | | | | |
| current | Candidate | 8192 | 61,304.72ns | 350.006ns | 327.396ns | 61,285.96ns | 1.00 | 3 | 5.1270 | 0.2441 | 32920 B |
| fastMap | Candidate | 8192 | 25,785.01ns | 305.116ns | 285.405ns | 25,831.80ns | 0.42 | 2 | 7.8125 | 0.3052 | 49248 B |
| fasterMap | Candidate | 8192 | 23,305.23ns | 121.598ns | 113.743ns | 23,323.38ns | 0.38 | 1 | 5.2185 | 0.1526 | 32840 B |
| smallMap | Candidate | 8192 | 23,291.14ns | 125.680ns | 117.561ns | 23,337.55ns | 0.38 | 1 | 5.2185 | 0.1526 | 32840 B |
| | | | | | | | | | | | |
| current | Alt | 8192 | 61,967.77ns | 536.539ns | 501.879ns | 61,816.82ns | 1.00 | 5 | 5.1270 | 0.2441 | 32920 B |
| whileMap | Alt | 8192 | 23,592.00ns | 328.376ns | 307.163ns | 23,607.09ns | 0.38 | 3 | 5.2185 | 0.1526 | 32840 B |
| allocateMap | Alt | 8192 | 24,950.07ns | 201.715ns | 178.815ns | 24,959.22ns | 0.40 | 4 | 5.2185 | 0.1526 | 32840 B |
| unrollMapX2 | Alt | 8192 | 22,003.28ns | 116.477ns | 97.264ns | 22,005.69ns | 0.36 | 2 | 5.2185 | 0.1526 | 32840 B |
| unrollMapX4 | Alt | 8192 | 21,074.46ns | 333.249ns | 295.416ns | 21,083.36ns | 0.34 | 1 | 5.2185 | 0.1526 | 32840 B |
| | | | | | | | | | | | |
| current | Fixed | 8192 | 61,401.46ns | 508.529ns | 450.797ns | 61,337.03ns | 1.00 | 4 | 5.1270 | 0.2441 | 32920 B |
| fixedCopy | Fixed | 8192 | 16,758.64ns | 118.942ns | 111.258ns | 16,766.96ns | 0.27 | 1 | 2.5940 | - | 16432 B |
| fixedCopyx2 | Fixed | 8192 | 17,569.22ns | 151.468ns | 141.683ns | 17,581.75ns | 0.29 | 2 | 2.5940 | - | 16432 B |
| fixedCopyx4 | Fixed | 8192 | 18,269.65ns | 119.090ns | 105.570ns | 18,253.54ns | 0.30 | 3 | 2.5940 | - | 16432 B |
| | | | | | | | | | | | |
| current | Span | 8192 | 62,410.59ns | 785.822ns | 735.058ns | 62,375.61ns | 1.00 | 4 | 5.1270 | 0.2441 | 32920 B |
| spanCopy | Span | 8192 | 19,228.95ns | 113.409ns | 94.702ns | 19,240.35ns | 0.31 | 1 | 2.5940 | - | 16432 B |
| spanStrCreate | Span | 8192 | 22,603.78ns | 168.023ns | 157.169ns | 22,597.73ns | 0.36 | 3 | 2.6245 | - | 16520 B |
| spanStrCreateAlt | Span | 8192 | 21,961.11ns | 321.794ns | 301.006ns | 21,942.30ns | 0.35 | 2 | 2.6245 | - | 16552 B |
| spanStrCreateUnx4 | Span | 8192 | 18,994.96ns | 241.445ns | 214.034ns | 18,928.00ns | 0.30 | 1 | 2.6245 | - | 16552 B |
The code:
```f#
module StringMap =
/// Basic String.map replacement, Array.map over CharArray
/// dotnet mem: 6x(!) string size
let fastMap mapper str =
if String.IsNullOrEmpty str then String.Empty
else
str.ToCharArray()
|> Array.map mapper
|> String
/// A faster String.map replacement, inlining the loop instead of Array.map. Uses CharArray.
/// NOTES: used as baseline for StringMapUnsafe
/// dotnet mem: 4x string size
let fasterMap mapper str =
if String.IsNullOrEmpty str then String.Empty
else
let result = str.ToCharArray()
let mutable i = 0
for c in result do
result.[i] <- mapper c
i <- i + 1
String result
/// Instead of for..in uses while. Otherwise same as fasterMap
/// dotnet mem: 4x string size
let whileMap mapper str =
if String.IsNullOrEmpty str then String.Empty
else
let result = str.ToCharArray()
let mutable i = 0
while i < result.Length do
result.[i] <- mapper result.[i]
i <- i + 1
String result
/// Like fasterMap, but uses Array.zeroCreate with string-length
/// dotnet mem: 4x string data
let allocateMap mapper str =
if String.IsNullOrEmpty str then String.Empty
else
let result = Array.zeroCreate str.Length
let mutable i = 0
for c in str do
result.[i] <- mapper c
i <- i + 1
String result
/// String.map equiv. optimized for very small maps. The break-even point is around 4 chars.
let smallMap (mapper: char -> char) (str: string) =
let len = String.length str
if len > 4 then
let result = str.ToCharArray()
let mutable i = 0
for c in result do
result.[i] <- mapper c
i <- i + 1
String result
elif len > 2 then
if len = 3
then String [|mapper str.[0]; mapper str.[1]; mapper str.[2]|]
else String [|mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3]|]
elif len = 2 then
String [|mapper str.[0]; mapper str.[1]|]
elif len = 1 then
(mapper str.[0]).ToString()
else
String.Empty
/// String.map equiv. that uses safe unrolling. This is only 2-5% faster than String.fasterMap above. It suffers from non-elimination of range-checks.
let unrollMapx2 (mapper: char -> char) (str: string) =
let len = String.length str
if len > 4 then
let result = str.ToCharArray()
let mutable i = 0
while i < len - len % 2 do
result.[i] <- mapper result.[i]
result.[i + 1] <- mapper result.[i + 1]
i <- i + 2
// the remainder
if len % 2 = 1 then
result.[i] <- mapper result.[i]
//i <- i + 1
String result
elif len > 2 then
if len = 3
then String [|mapper str.[0]; mapper str.[1]; mapper str.[2]|]
else String [|mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3]|]
elif len = 2 then
String [|mapper str.[0]; mapper str.[1]|]
elif len = 1 then
(mapper str.[0]).ToString()
else
String.Empty
/// String.map equiv. that uses safe unrolling. This is only 2-5% faster than String.fasterMap above.
/// It suffers from non-elimination of range-checks.
let unrollMapx4 (mapper: char -> char) (str: string) =
let len = String.length str
if len > 4 then
let result = str.ToCharArray()
let mutable i = 0
while i < len - len % 4 do
result.[i] <- mapper result.[i]
result.[i + 1] <- mapper result.[i + 1]
result.[i + 2] <- mapper result.[i + 2]
result.[i + 3] <- mapper result.[i + 3]
i <- i + 4
// the remainder
while i < len do
result.[i] <- mapper result.[i]
i <- i + 1
String result
elif len > 2 then
if len = 3
then String [|mapper str.[0]; mapper str.[1]; mapper str.[2]|]
else String [|mapper str.[0]; mapper str.[1]; mapper str.[2]; mapper str.[3]|]
elif len = 2 then
String [|mapper str.[0]; mapper str.[1]|]
elif len = 1 then
(mapper str.[0]).ToString()
else
String.Empty
module StringMapUnsafe =
/// Unrollx2. Read from source as str.[idc], use NativePtr for result, which is created with String.Copy + fixed, return modified string
let fixedCopyx2 (mapper: char -> char) (str: string) =
let len = String.length str
let result = String.Copy str
use shadow = fixed result
let mutable i = 0
while i < len - len % 2 do
NativePtr.set shadow i (mapper str.[i])
NativePtr.set shadow (i + 1) (mapper str.[i + 1])
i <- i + 2
while i < len do
NativePtr.set shadow i (mapper str.[i])
i <- i + 1
result
/// Unrollx4. Read from source as str.[idc], use NativePtr for result, which is created with String.Copy + fixed, return modified string
let fixedCopyx4 (mapper: char -> char) (str: string) =
let len = String.length str
let result = String.Copy str
use shadow = fixed result
let mutable i = 0
while i < len - len % 4 do
NativePtr.set shadow i (mapper str.[i])
NativePtr.set shadow (i + 1) (mapper str.[i + 1])
NativePtr.set shadow (i + 1) (mapper str.[i + 2])
NativePtr.set shadow (i + 1) (mapper str.[i + 3])
i <- i + 4
while i < len do
NativePtr.set shadow i (mapper str.[i])
i <- i + 1
result
/// Unrollx4. Create copy with String.Copy + fixed and use NativePtr to read and write from it, return modified string.
let fixedCopy (mapper: char -> char) (str: string) =
let len = String.length str
/// a copy to use as result (we don't want to override the input reference string!)
let result = String.Copy str
/// pointer to the result-string
use shadow = fixed result
// loop-unroll
let mutable i = 0
while i < len - len % 4 do
// NOTES:
// - read/write to same mem address appears to be somewhat faster than reading from string by index, like with str.[i]
// - while less IL instr are generated with calculating the address ourselves and then NativePtr.read/write, this appears to be faster (and safer).
NativePtr.set shadow i (mapper (NativePtr.get shadow i))
NativePtr.set shadow (i + 1) (mapper (NativePtr.get shadow (i + 1)))
NativePtr.set shadow (i + 2) (mapper (NativePtr.get shadow (i + 2)))
NativePtr.set shadow (i + 3) (mapper (NativePtr.get shadow (i + 3)))
i <- i + 4
// process the remainder
while i < len do
NativePtr.set shadow i (mapper (NativePtr.get shadow i))
i <- i + 1
result
/// Unrollx4. Create copy with String.Copy, ReadOnlySpan over it, marshal-as read/write Span<char> over same copy. Return modified underlying string.
/// Performance: ~20-30% faster than fasterMap.
let spanCopy (mapper: char -> char) (str: string) =
let len = String.length str
let resultStr = String.Copy str
let roSpan = str.AsSpan()
let writeSpan = MemoryMarshal.CreateSpan(&MemoryMarshal.GetReference roSpan, len)
// loop-unroll
let mutable i = 0
while i < len - len % 4 do
writeSpan.[i] <- mapper roSpan.[i]
writeSpan.[i + 1] <- mapper roSpan.[i + 1]
writeSpan.[i + 2] <- mapper roSpan.[i + 2]
writeSpan.[i + 3] <- mapper roSpan.[i + 3]
i <- i + 4
// process the remainder
while i < len do
writeSpan.[i] <- mapper roSpan.[i]
i <- i + 1
resultStr
/// No unroll, uses String.Create with Span.
/// Performance: 0-20% faster than fasterMap, but fast only on larger (64kb) input
let spanCreate1 (mapper: char -> char) (str: string) =
let len = String.length str
String.Create(len, str, fun buf v ->
let mutable i = 0
for c in v do
buf.[i] <- mapper c
i <- i + 1)
/// No unroll, uses String.Create with Span, no closure
/// Performance: 0-22% faster than fasterMap, the larger the intput, the bigger the gain (sometimes slower!)
let spanCreate2 (mapper: char -> char) (str: string) =
let len = String.length str
String.Create(len, (str, mapper), fun buf (str, f) ->
let mutable i = 0
for c in str do
buf.[i] <- f c
i <- i + 1)
/// Unrollx4, uses String.Create with Span. All indexed operations on spans.
/// Performance: 15-32% (!) faster than fasterMap, the larger the input, the bigger the gain
let spanCreateUnrollx4 (mapper: char -> char) (str: string) =
let len = String.length str
String.Create(len, (str, mapper), fun target (str, f) ->
let mutable i = 0
let source = str.AsSpan()
while i < len - len % 4 do
target.[i] <- f source.[i]
target.[i + 1] <- f source.[i + 1]
target.[i + 2] <- f source.[i + 2]
target.[i + 3] <- f source.[i + 3]
i <- i + 4
// process the remainder
while i < len do
target.[i] <- f source.[i]
i <- i + 1)
```
@dsyme or @cartermp, can I get a confirmation that using fixed and pointer arithmetic is off limits (or not)? If it's allowed, I gladly use it, as it gives dramatic reduction in memory usage and increases performance, but obviously, it requires more careful coding and it will automatically mark certain functions as unsafe, which I think is a "no go". Also, it may be considered fine on arrays, but not on strings, I'd totally understand that.
There are already dramatic perf improvements using other methods, but fixed trumps all. And before I get to the array functions, I like to limit my scope (consider the above an exercise in what's possible).
I know Span cannot be used, otherwise that would be a good alternative and gets close to fixed in terms of memory and speed.
Yes, spans are off-limits (unless we decide to move FSharp.Core to also bring in a dependency on System.Memory). I'd actually love to do that but I know that @dsyme is quite allergic to them.
I think we should strive for consistency in implementation here. Array and list internals tend not to use pointers to accomplish things. I'd like for us to eek out as much perf and memory as possible if we can, but I think that ultimately the best way forward is with Span, when it's appropriate to add it to FSharp.Core. It strikes the right balance of perf, readability, and safety. So the second-best implementation is probably sufficient, and a clear upgrade anyways.
unless we decide to move FSharp.Core to also bring in a dependency on System.Memory
If, for ImmutableArray, we decide to add that dependency, then I believe that includes System.Memory:

(which in turn gives us also System.Buffers and System.Numerics.Vectors, each of which open new ways of potentially improving certain key performance areas)
Array and list internals tend not to use pointers to accomplish things.
<snip/>but I think that ultimately the best way forward is with Span, when it's appropriate to add it to FSharp.Core. It strikes the right balance of perf, readability, and safety.
Agreed. Yeah, Span would give us wings in some areas :).
Ok, I'll move forward without the "extreme" optimizations than.
Next up: String.concat. This one does nothing more than calling into String.Join, but since it takes a sequence for the concatenation and doesn't special-case the empty separator string, I figured, let's see what happens if I special case for array and list (if anyone know of others that should be special-cased, please let me know. For instance, I wonder if there's a safe way for library functions to get the underlying array or list if the sequence is created with Set.ofArray or Seq.ofList, so that people won't get punished for using that instead of casting().
TLDR conclusions:
array and up to 20% gain for lists. Seq stays the same.I think these numbers can be expected considering that String.Join uses an optimized and mutable string under the hood with pre-calculated size for Concat and when an array is passed. String.Join with IEnumerable uses a more resource-hungry StringBuilder.
Below: bold is fastest. I didn't select a champion if it was within margins of error.
BenchmarkDotNet=v0.12.1, OS=Windows 7 SP1 (6.1.7601.0)
Intel Xeon CPU X5680 3.33GHz, 2 CPU, 12 logical and 12 physical cores
Frequency=3247119 Hz, Resolution=307.9653 ns, Timer=TSC
.NET Core SDK=3.1.300-preview-015135
[Host] : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT DEBUG
Job-FRSYQM : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT
MaxRelativeError=0.02 Runtime=.NET Core 3.1 MaxIterationCount=10
MaxWarmupIterationCount=5 MinIterationCount=5 MinWarmupIterationCount=2
| Method | Sep | CollSize | CollType | Mean | Error | StdDev | Ratio | Rank | Gen 0 | Gen 1 | Allocated |
| ------------- | ---------------- | ---------- | --------- | -------------: | -----------: | -----------: | -------: | ----: | ---------: | ---------: | ----------: |
| current | | Small | Array | 484.4ns | 7.61ns | 1.98ns | 1.00 | 2 | 0.0467 | - | 296 B |
| concatOpt | | Small | Array | 256.3ns | 4.80ns | 1.25ns | 0.53 | 1 | 0.0420 | - | 264 B |
| | | | | | | | | | | | |
| current | | Small | List | 554.0ns | 8.99ns | 3.21ns | 1.00 | 2 | 0.0477 | - | 304 B |
| concatOpt | | Small | List | 391.7ns | 7.23ns | 2.58ns | 0.71 | 1 | 0.0710 | - | 448 B |
| | | | | | | | | | | | |
| current | | Small | Seq | 541.7ns | 10.36ns | 7.49ns | 1.00 | 1 | 0.0467 | - | 296 B |
| concatOpt | | Small | Seq | 556.4ns | 9.02ns | 2.34ns | 1.02 | 2 | 0.0467 | - | 296 B |
| | | | | | | | | | | | |
| current | | Medium | Array | 10,715.8ns | 148.95ns | 38.68ns | 1.00 | 2 | 2.1667 | 0.0916 | 13656 B |
| concatOpt | | Medium | Array | 4,674.7ns | 92.60ns | 72.29ns | 0.43 | 1 | 0.7553 | 0.0076 | 4784 B |
| | | | | | | | | | | | |
| current | | Medium | List | 12,422.7ns | 188.90ns | 29.23ns | 1.00 | 2 | 2.1667 | 0.0916 | 13664 B |
| concatOpt | | Medium | List | 7,951.7ns | 152.82ns | 175.98ns | 0.64 | 1 | 1.2665 | 0.0153 | 8008 B |
| | | | | | | | | | | | |
| current | | Medium | Seq | 12,905.6ns | 435.52ns | 651.86ns | 1.00 | 2 | 2.1667 | 0.0916 | 13656 B |
| concatOpt | | Medium | Seq | 11,461.7ns | 212.56ns | 94.38ns | 0.88 | 1 | 2.1667 | 0.0916 | 13656 B |
| | | | | | | | | | | | |
| current | &\|& | Small | Array | 594.5ns | 11.54ns | 14.17ns | 1.00 | 2 | 0.0648 | - | 408 B |
| concatOpt | &\|& | Small | Array | 402.5ns | 7.20ns | 2.57ns | 0.69 | 1 | 0.0596 | - | 376 B |
| | | | | | | | | | | | |
| current | &\|& | Small | List | 654.4ns | 12.02ns | 8.69ns | 1.00 | 2 | 0.0658 | - | 416 B |
| concatOpt | &\|& | Small | List | 524.0ns | 9.51ns | 4.22ns | 0.80 | 1 | 0.0887 | - | 560 B |
| | | | | | | | | | | | |
| current | &\|& | Small | Seq | 624.0ns | 12.35ns | 12.13ns | 1.00 | 1 | 0.0648 | - | 408 B |
| concatOpt | &\|& | Small | Seq | 626.8ns | 11.61ns | 9.70ns | 1.01 | 1 | 0.0648 | - | 408 B |
| | | | | | | | | | | | |
| current | &\|& | Medium | Array | 13,245.0ns | 255.18ns | 133.47ns | 1.00 | 2 | 2.5482 | 0.1373 | 16048 B |
| concatOpt | &\|& | Medium | Array | 7,656.2ns | 127.40ns | 66.63ns | 0.58 | 1 | 1.1368 | 0.0229 | 7176 B |
| | | | | | | | | | | | |
| current | &\|& | Medium | List | 14,149.6ns | 220.37ns | 78.59ns | 1.00 | 2 | 2.5482 | 0.1373 | 16056 B |
| concatOpt | &\|& | Medium | List | 10,824.5ns | 183.83ns | 81.62ns | 0.77 | 1 | 1.6479 | 0.0458 | 10400 B |
| | | | | | | | | | | | |
| current | &\|& | Medium | Seq | 13,809.2ns | 258.95ns | 154.10ns | 1.00 | 1 | 2.5482 | 0.1373 | 16048 B |
| concatOpt | &\|& | Medium | Seq | 15,213.7ns | 565.78ns | 846.83ns | 1.07 | 2 | 2.5330 | 0.1221 | 16048 B |
| Method | Sep | CollType | Mean | Error | StdDev | Ratio | Rank | Gen 0 | Gen 1 | Gen 2 | Allocated |
| ------------- | ---------------- | --------- | ----------: | ----------: | ----------: | -------: | ----: | --------: | -------: | -------: | -----------: |
| current | | Array | 27.00ms | 0.351ms | 0.091ms | 1.00 | 2 | 1656.2500 | 718.7500 | 125.0000 | 18.65 MB |
| concatOpt | | Array | 10.27ms | 0.155ms | 0.069ms | 0.38 | 1 | - | - | - | 9.3 MB |
| | | | | | | | | | | | |
| current | | List | 32.12ms | 0.201ms | 0.052ms | 1.00 | 2 | 1656.2500 | 718.7500 | 125.0000 | 18.65 MB |
| concatOpt | | List | 22.57ms | 0.682ms | 0.357ms | 0.71 | 1 | - | - | - | 15.55 MB |
| | | | | | | | | | | | |
| current | | Seq | 29.01ms | 0.140ms | 0.036ms | 1.00 | 1 | 1656.2500 | 718.7500 | 125.0000 | 18.65 MB |
| concatOpt | | Seq | 30.62ms | 0.600ms | 0.314ms | 1.06 | 2 | 62.5000 | 62.5000 | 62.5000 | 31.55 MB |
| | | | | | | | | | | | |
| current | &\|& | Array | 32.07ms | 0.707ms | 0.109ms | 1.00 | 2 | 2250.0000 | 875.0000 | 187.5000 | 24.92 MB |
| concatOpt | &\|& | Array | 16.51ms | 0.259ms | 0.115ms | 0.51 | 1 | - | - | - | 12.42 MB |
| | | | | | | | | | | | |
| current | &\|& | List | 36.85ms | 0.505ms | 0.131ms | 1.00 | 2 | 2285.7143 | 928.5714 | 214.2857 | 24.92 MB |
| concatOpt | &\|& | List | 28.60ms | 0.564ms | 0.250ms | 0.78 | 1 | - | - | - | 18.67 MB |
| | | | | | | | | | | | |
| current | &\|& | Seq | 33.47ms | 0.580ms | 0.151ms | 1.00 | 1 | 2266.6667 | 933.3333 | 200.0000 | 24.92 MB |
| concatOpt | &\|& | Seq | 33.34ms | 0.453ms | 0.118ms | 1.00 | 1 | 2266.6667 | 933.3333 | 200.0000 | 24.92 MB |
The source code:
```f#
let private concatArray sep (strings: string []) =
match String.length sep with
| 0 -> String.Concat strings
| 1 -> String.Join(sep.[0], strings, 0, strings.Length)
| _ -> String.Join(sep, strings, 0, strings.Length)
let concatOpt sep (strings: seq
match strings with
| :? array
| :? list
| _ ->
// special-casing Seq for sep = String.Empty is not faster due to the involved
// overhead of Seq.toArray and the already well-optimized String.Join
String.Join(sep, strings)
```
I think using screenshots is a little bit easier to read, those tables take too much space horizontally and vertically, making it hard to read.
String.iter, unrolledThere was little that could be improved for String.iter, but unrolling showed a ~25% perf gain:

String.iteri, unrolledBasically same story:

String.mapiPerformance characteristics were, unsurprisingly, much the same as for String.map above. An easy gain of 2x performance, and reduction in mem and GC pressure.
Real advancement can be made here with fixed or with Span when the time's there. Just like String.map, unrolling had limited effect, but was more significant than for String.map. I suggest to take the unrollx2 variant (x4 and x8 had a difference of ~2%, not worth the effort). This would lead to 2.5x performance gain.

String.filterSince Array.filter is highly optimized with a bit-mask method, I decided to try a method where I simply use Array.filter on the string. This improved perf dramatically by some 1.7-2.5 times, depending on whether the filter is true on "all" characters, "some" or "none" (false for all).
The buildAndCut method zero-allocates the target array and uses a String-overload to create a new string of only the filters characters. I thought this would have more mem pressure, but result was better in both mem and speed.
```f#
let filterByArray predicate (str: string) =
if String.IsNullOrEmpty str then String.Empty
else
str.ToCharArray()
|> Array.filter predicate
|> String
let filterBuildAndCut predicate (str: string) =
let len = String.length str
if len = 0 then String.Empty
else
let target = Array.zeroCreate len
let mutable i = 0
for c in str do
if predicate c then
target.[i] <- c
i <- i + 1
String(target, 0, i)
```

Still TODO:
String.collectString.initString.replicateString.forallString.existsString.lengthNext set:
String.lengthAt first it seems not much can be done here, but at close inspection, it calls String.get_Length twice on the happy path. Changing String.IsNullOrEmpty to isNull gives a 33% perf improvement. Since this is such a small function, it'll only show when used in tight loops. Anyway, it's as simple as changing the current impl. to this:
```f#
let nullOnly (str:string) =
if isNull str then 0
else str.Length
Fun in a loop to make is measurable, adjusted with `OperationsPerInvoke`:

(note that this is not a pure measurement, as there's auxiliary code present, but the difference between the two is only the body of the `String.length` function.)
It's interesting to take a peak at the disassembly here. If we zoom in on the body of the loop, which is inlined by the JIT, this is what the original `String.length` gave us. As can be seen, this does two pointer-dereferences to calculate the length:
```assembly
; FSharp.Perf.BenchLength.current()
sub rsp,28
mov ecx,[rcx+8]
call FSharp.Perf.Data.get(Int32)
xor edx,edx
M00_L00: // loop body, i.e., String.length
test rax,rax // rax & rax, in other words: null-check, sets flags
je short M00_L01 // if zero (null) jump back to loop
cmp dword ptr [rax+8],0 // compare string-length to 0
jbe short M00_L01 // if equal, jump back to loop
mov ecx,[rax+8] // if not, get string length fallthrough to loop
M00_L01:
inc edx
cmp edx,2711
jl short M00_L00
add rsp,28
ret
; Total bytes of code 43
When we compare that to the disassembly where we only use isNull, the pointer dereference is gone completely and we have only three short instructions left in the body:
; FSharp.Perf.BenchLength.nullOnly()
sub rsp,28
mov ecx,[rcx+8]
call FSharp.Perf.Data.get(Int32)
xor edx,edx
M00_L00: // loop body, i.e., nullOnly method
test rax,rax // rax & rax, for null-check, sets flags
je short M00_L01 // if zero (null) jump back to loop
mov ecx,[rax+8] // get the string-length from the string and fallthrough to loop
M00_L01:
inc edx
cmp edx,2711
jl short M00_L00
add rsp,28
ret
; Total bytes of code 37
As you can see, we are down from five assembly instructions (incl two pointer dereferences) to only three assembly instructions.
Maybe a bit much of an analysis for such a small function, but it was fun to check what goes on under the hood. Note that the assembly can look different in other scenarios because of JIT optimizations.
It's also good to find out that the JIT does not do an extra null-check for the callvirt IL instruction, as it already appears to know that the reference cannot be null at that point. In fact, the whole call to get_Length disappears and essentially boils down to a single mov ecx, [rax+8].
String.exists && String.forallThis method takes a predicate that operates on the char. There's no direct equiv. in the .NET stdlibs. I've tried several alternatives of coding this, esp. since at first sight it seems the original is _not tail-recursive_, but this is recognized by the JIT anyway and it ~is unrolled in~ becomes an efficient loop.
Same is true for String.forall.
The only change to make here is the redundant call to get_Length, which only has an impact when searching very short strings.
tbc...
Next up:
String.replicateThere're some serious optimizations that are possible here, esp. if we would open up use of cpblk or initblk (they've been optimized, see https://github.com/dotnet/coreclr/pull/7198; we'll maybe get there when we start with array functions), but for this round I'll stick to optimizing with a simple Array.Copy, which internally calls an optimized native method already.
This gives us between 1.2x and 8x (!) performance improvements. Occasionally, the Array.Copy method can be a few percents slower, but this seems to be influenced by LOH when doing this on really large strings and counts. Since the general case is likely small strings, this greatly improves the general case.
The number of iterations and memcopies is now O(log(n)) instead of O(n), and there's no redundant memory allocated (other than for the required copy of the string at the end).
The optimization in code:
```f#
let replicateOpt count (str:string) =
if isNull str then
String.Empty
else
let len = str.Length
let target = Array.zeroCreate (len * count)
let source = str.ToCharArray()
Array.Copy(source, 0, target, 0, len)
let mutable i = len
while i * 2 < target.Length do
Array.Copy(target, 0, target, i, i)
i <- i * 2
Array.Copy(target, 0, target, i, target.Length - i)
String target
Speed:

Memory:

### `String.init`
This one proves to be tricky to optimize, as the performance distribution of both the array-based new version and the existing implementation based on `StringBuilder` is not uniform for both memory and speed. This makes comparison hard. When `Count` is small (number of iterations) then the new version is the clear winner, but at larger counts, esp. if the returned strings are small (range 1-2 chars), the original is much faster and uses less memory.
When it comes a random distribution of strings in the range 0-200 chars, the difference is neglible*:

<sup>_* EDIT: the "empty" category wasn't empty, but same as "5or16", by accident. And "random" used `Random::Next`, which caused too much noise. You can disregard those rows. Oh, and `replicateOpt` really is `initOpt`, just missed while renaming._</sup>
Unless I can think of something better (unrolling the original, had no effect either), this one should probably remain as-is, since we won't be able to predict the average result of the `initialize` function.
The code:
```f#
let initOpt count (initializer: int -> string) =
let target = Array.zeroCreate count
for i=0 to count - 1 do
target.[i] <- initializer i
String.Concat target
tbc (only String.collect remains)
String.init revisitedI couldn't let go. From the prev. measurements, it was clear that using the highly optimized String.Concat was significantly faster on small counts, so I leveraged that and now anything below count = 8 is between 12x (!!) and 1.1x faster, most over 2x faster, depending on size of count and size of input (larger strings means greater improvement).
Memory use and GC pressure also drops by 2x - 8x.

The code (from count > 4 it becomes a paramarray, also note that when count > 8 will be put up top by the compiler):
f#
/// Using ParamArray as an optimized version of Array gives much better performance on small counts (up to 10x faster).
let initOpt (count:int) (initializer: int-> string) =
match count with
| 0 -> String.Empty
| 1 -> initializer 0
| 2 -> String.Concat(initializer 0, initializer 1)
| 3 -> String.Concat(initializer 0, initializer 1, initializer 2)
| 4 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3)
| 5 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4)
| 6 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4, initializer 5)
| 7 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4, initializer 5, initializer 6)
| 8 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3, initializer 4, initializer 5, initializer 6, initializer 7)
| _ when count > 8 ->
let res = StringBuilder count
for i = 0 to count - 1 do
res.Append(initializer i) |> ignore
res.ToString()
| _ ->
failwith "error" // will become invalidArgInputMustBeNonNegative
String.init revisited (2) and finalIt turned out that through copy and pasting I made mistakes on the first measurements, wrongly believing that using the array-based approach was so bad. I stabilized "random" between runs, so that all measurements can be compared, and I fixed the code where the wrong functions were called.
initialize function only returns very small strings of 1-3 chars, there's a performance degradationinitialize is 4 or 5 chars, it evens out, if it is random or larger than that, or zero chars, the array-based version wins in all casesLet's get the numbers:

The code is now slightly more complex, but still not by much:
```f#
/// 2-10x faster on small counts, about 20% faster for very large count, for all cases, less memory pressure.
let initWithArray count (initializer: int -> string) =
if count < 5 then
match count with
| 0 -> String.Empty
| 1 -> initializer 0
| 2 -> String.Concat(initializer 0, initializer 1)
| 3 -> String.Concat(initializer 0, initializer 1, initializer 2)
| 4 -> String.Concat(initializer 0, initializer 1, initializer 2, initializer 3)
| _ -> failwith "error" // arg error
else
let target = Array.zeroCreate count
let mutable i = 0
while i < count - count % 4 do
target.[i] <- initializer i
target.[i + 1] <- initializer(i + 1)
target.[i + 2] <- initializer(i + 2)
target.[i + 3] <- initializer(i + 3)
i <- i + 4
while i < count do
target.[i] <- initializer i
i <- i + 1
String.Concat target
```
If we zoom in on the break-even point for when the old version was faster than the new version, it is here, And as you can see, still less GC pressure:

Are you planning to send fixes?
@forki, absolutely, as soon as the string functions are done (one left), I'll send in a PR. I figured it made more sense to first do the exercise, timings etc, so that the reasoning behind the fixes, which aren't always obvious, are at least logged.
Please do it as separate PRs so that it's easier to get those accepted
Abel Braaksma notifications@github.com schrieb am Di., 16. Juni 2020,
20:00:
@forki https://github.com/forki, absolutely, as soon as the string
functions are done (one left), I'll send in a PR. I figured it made more
sense to first do the exercise, so that the reasoning behind the fixes,
which aren't always obvious, are at least logged.—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/dotnet/fsharp/issues/9390#issuecomment-644920017, or
unsubscribe
https://github.com/notifications/unsubscribe-auth/AAAOANELTPMSWGDOGWRTFATRW6XLVANCNFSM4NR6H2KA
.
Please do it as separate PRs so that it's easier to get those accepted
@forki I was figuring to do one PR for the string stuff (others for array or whatever else comes my way), which is really just a few lines in a single file, but if you think it is better to do it function-by-function, I can do that, but I thought that would make it harder instead of easier...
@cartermp, @TIHan, @dsyme, any preference here so that I don't have to do the PR's twice?
From experience in this repo I can say that bigger performance PRs will
sometimes have a hard time to get in.
Abel Braaksma notifications@github.com schrieb am Di., 16. Juni 2020,
20:05:
Please do it as separate PRs so that it's easier to get those accepted
I was figuring to do one PR for the string stuff, which is really just a
few lines in a single file
https://github.com/dotnet/fsharp/blob/master/src/fsharp/FSharp.Core/string.fs,
but if you think it is better to do it function-by-function, I can do that,
but I thought that would make it harder instead of easier...@cartermp https://github.com/cartermp, @TIHan https://github.com/TIHan,
@dsyme https://github.com/dsyme, any preference here so that I don't
have to do the PR's twice?—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/dotnet/fsharp/issues/9390#issuecomment-644922607, or
unsubscribe
https://github.com/notifications/unsubscribe-auth/AAAOANA5G7FVK3G2I2ZPV4DRW6X7LANCNFSM4NR6H2KA
.
Yep, the smaller/more contained the better
Ok, got it, I'll make a separate pr for each function then,and I'll link it to the relevant comment.
@KevinRansom, an example that uses buffered StringBuilder, as we discussed, is the String.concat function. On sequences, it performs better than other methods. It uses System.String::Join, which internally uses the buffered SB (but, even the BCL _does not use that if the input is an array_). See the timings here: https://github.com/dotnet/fsharp/issues/9390#issuecomment-641553171
@abelbraaksma , thanks I will look at it, and try to understand why.
@abelbraaksma how far along do you think this one is in terms of getting to closure?
@cartermp, I think there are quite a few improvements already, but I've yet to check in the array functions improvements. I might get another stab at it during the holidays (won't be going anywhere anyway, it's still 2020:))
Most helpful comment
From experience in this repo I can say that bigger performance PRs will
sometimes have a hard time to get in.
Abel Braaksma notifications@github.com schrieb am Di., 16. Juni 2020,
20:05: