Runtime: String.StartsWith slower on Linux with some characters

Created on 29 Aug 2019 · 20 comments · Source: dotnet/runtime

string.StartsWith on Linux becomes 2 orders of magnitude slower when the string contains a dash (-).

On Linux:

BenchmarkDotNet=v0.11.3, OS=centos 7
Intel Xeon CPU E5-2630L v3 1.80GHz, 2 CPU, 32 logical and 16 physical cores
  [Host]     : .NET Core 3.0.0-preview8-28405-07 (CoreCLR 4.700.19.37902, CoreFX 4.700.19.40503), 64bit RyuJIT
  Job-UBBGCZ : .NET Core 3.0.0-preview8-28405-07 (CoreCLR 4.700.19.37902, CoreFX 4.700.19.40503), 64bit RyuJIT

Runtime=Core  Toolchain=netcoreapp3.0

         Method |        Mean |      Error |     StdDev |
--------------- |------------:|-----------:|-----------:|
     StartsWith |    35.79 ns |  0.1069 ns |  0.0948 ns |
 StartsWithDash | 4,411.13 ns | 35.0054 ns | 29.2311 ns |

On Windows (only for reference, the hardware is not the same):

BenchmarkDotNet=v0.11.3, OS=Windows 10.0.18362
Intel Xeon CPU E3-1271 v3 3.60GHz, 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview8-013656
  [Host]     : .NET Core 3.0.0-preview8-28405-07 (CoreCLR 4.700.19.37902, CoreFX 4.700.19.40503), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0-preview8-28405-07 (CoreCLR 4.700.19.37902, CoreFX 4.700.19.40503), 64bit RyuJIT


         Method |     Mean |     Error |    StdDev |
--------------- |---------:|----------:|----------:|
     StartsWith | 69.42 ns | 0.2523 ns | 0.2236 ns |
 StartsWithDash | 69.47 ns | 1.4200 ns | 1.6904 ns |

Benchmark code:

```C#
using BenchmarkDotNet.Attributes;

public class StartsWithBenchmark
{
    private string _str1 = "aaaaaaaaaz"; // plain ASCII, no special characters
    private string _str2 = "aaaaaaaaa-"; // same string, but ending with a dash

    [Benchmark]
    public bool StartsWith()
    {
        return _str1.StartsWith("i");
    }

    [Benchmark]
    public bool StartsWithDash()
    {
        return _str2.StartsWith("i");
    }
}
```

The performance issue does not occur if using ordinal comparison.
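
For reference, a minimal sketch of the ordinal variant (a hypothetical StartsWithDashOrdinal benchmark added here for illustration, not part of the original run):

```C#
[Benchmark]
public bool StartsWithDashOrdinal()
{
    // Ordinal comparison skips the culture-aware (ICU) path entirely,
    // so the trailing dash no longer matters.
    return _str2.StartsWith("i", StringComparison.Ordinal);
}
```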

area-System.Globalization tenet-performance

Most helpful comment

@tarekgh It's not truly helpful to simply label this as a corner case. People expect consistent performance, and if you have a method that is suddenly magnitudes slower due to certain characters being used, it's like dropping time bombs into people's code.

All 20 comments

It seems to affect characters outside of the [9], [11-12], [32-38], [40-44], and [47-126] ranges.
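
A rough way to reproduce that observation is to time StartsWith once per trailing character and look for outliers; this is a hypothetical probe written for this write-up, not code from the issue:

```C#
using System;
using System.Diagnostics;

class SlowCharProbe
{
    static void Main()
    {
        for (int c = 0; c < 128; c++)
        {
            // Same shape as the benchmark above: nine 'a's followed by the probed character.
            string s = "aaaaaaaaa" + (char)c;

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 10_000; i++)
            {
                s.StartsWith("i"); // culture-aware overload, same as in the benchmark
            }
            sw.Stop();

            Console.WriteLine($"0x{c:X2}: {sw.Elapsed.TotalMilliseconds:F2} ms");
        }
    }
}
```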

@krwq that's an extraordinary multiplier. I am curious what this is about.

Yes, this is interesting. @kevingosse, what's your current culture set to? Looking at the code, it behaves differently depending on that.

@tarekgh do you know if we (or ICU) have any special optimizations for some ranges?

I believe Linux treats '-' with some sort weights, which can affect the operation with different cultures. On Windows, '-' is treated as normal ASCII and we optimize the call as ordinal in that case.

do you know if we (or ICU) have any special optimizations for some ranges?

We do have special optimization for specific ranges. Look for IsFastSort in https://github.com/dotnet/coreclr/blob/52651008a5fb72eb678467cd2eb42aac4e5334e8/src/System.Private.CoreLib/shared/System/Globalization/CompareInfo.Unix.cs#L242 . - prevents us from taking this fast path because of https://github.com/dotnet/coreclr/blob/ab2f9caaf35c96d029b96aa171ee65d04253cf7c/src/utilcode/util_nodependencies.cpp#L910
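
A simplified C# approximation of that pre-check, built from the character ranges observed earlier in this thread (the real implementation is the native table linked here, so treat this as a sketch, not the actual code):

```C#
// Sketch only: a string qualifies for the fast (ordinal-like) sort path when every
// character falls inside the ranges reported above; characters such as '-' (45) or
// the apostrophe (39) fall outside them and force the slow ICU path.
static bool IsFastSortCandidate(ReadOnlySpan<char> s)
{
    foreach (char c in s)
    {
        bool fast =
            c == 9 ||
            (c >= 11 && c <= 12) ||
            (c >= 32 && c <= 38) ||
            (c >= 40 && c <= 44) ||
            (c >= 47 && c <= 126);

        if (!fast)
            return false;
    }

    return true;
}
```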

This "optimization" is used on Unix only. We used to have it on Windows as well, but removed it there because of it was not very helpful.

One way to fix this problem is to remove this "optimization" on Unix too (PR: https://github.com/dotnet/coreclr/pull/24400/) and replace it with fast path and slow path with smaller degradation.

We need to be careful here, as some characters in the ASCII range are treated differently by ICU than on Windows. Trying to change that will start to deviate from the ICU sorting behavior, which may not be a good thing to do.

Also, in such corner cases, it is usually better to use the ordinal operations and not the linguistic operations.

Yes, this is interesting. @kevingosse, what's your current culture set to? Looking at the code, it behaves differently depending on that.

en-US. With other cultures the performance is bad with any set of characters, but that has been a known issue since 1.0 (https://github.com/dotnet/coreclr/issues/5612).

Also, in such corner cases, it is usually better to use the ordinal operations and not the linguistic operations.

Yes, we're currently adding StringComparer.Ordinal everywhere in our codebase, as this performance drop is a blocker for us. The problem is that developers on .NET Framework or .NET Core on Windows aren't really used to adding StringComparer.Ordinal, since the performance of culture-based string comparisons is pretty good on Windows (so it's mostly seen as a micro-optimization). And then it breaks when we start using those bits on Linux.
I wish there was a way to default all string comparisons to ordinal. There is the invariant mode, but it also breaks explicit culture-based comparisons, which we need in some parts of our code.
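
For reference, the invariant mode mentioned here is switched on via the InvariantGlobalization MSBuild property (or the equivalent System.Globalization.Invariant runtimeconfig knob); a minimal project snippet, with the same caveat about explicit culture-based comparisons:

```xml
<!-- Forces the invariant culture for all globalization operations, which makes
     comparisons effectively ordinal, but it also breaks explicit culture-aware
     comparisons, as noted above. -->
<PropertyGroup>
  <InvariantGlobalization>true</InvariantGlobalization>
</PropertyGroup>
```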

Note that this issue happens only when using some specific characters in the ASCII range while the whole string is otherwise ASCII. These characters are mainly control characters, which are not really used in mainstream cases. The only interesting characters are - and '. So we are talking about a corner case here and not really mainstream scenarios.

You can look at the special characters here https://github.com/dotnet/coreclr/blob/ab2f9caaf35c96d029b96aa171ee65d04253cf7c/src/utilcode/util_nodependencies.cpp#L856

@tarekgh It's not truly helpful to simply label this as a corner case. People expect consistent performance, and if you have a method that is suddenly magnitudes slower due to certain characters being used, it's like dropping time bombs into people's code.

People expect consistent performance, and if you have a method that is suddenly magnitudes slower due to certain characters being used, it's like dropping time bombs into people's code.

How is this different from inserting a non-ASCII character in the string, even on Windows? You will get the same performance hit. Again, this is why we provide the Ordinal option to let users control the behavior.

The issue here is the correctness of the sort behavior versus the perf. We can fix the perf by not specializing the hyphen on Linux, but we should be clear that this can definitely be a problem for some expected sort behavior. We can try to find a middle ground here, which may work, but we need to detect the cases where we can allow ordinal even when a hyphen exists. It just needs some more investigation.

Looking more, it looks like on Windows we don't even try to optimize for ASCII cases and always call the underlying OS, so what @jkotas mentioned before looks reasonable to try.

On a related note, we are running the dotnet/performance benchmarks on CoreCLR and Mono/netcore with the LLVM JIT. I was investigating the spots where Mono performed especially poorly. One of them is the Perf_String.Contains benchmark with StringComparison.CurrentCultureIgnoreCase. It eventually hits the fast-sort optimization here.

The difference in the benchmark between Mono (which does not have this code path) and CoreCLR (which has it) was quite enormous:

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Tests.Perf_String.StartsWith(size: 1000) | 714.91 | 37.47 | 26784.84 | |
| System.Tests.Perf_String.StartsWith(size: 100) | 311.68 | 26.41 | 8233.04 | |
| System.Tests.Perf_String.Contains(text: "This is a very nice sentence", value: " | 41.56 | 190.40 | 7913.76 | |
| System.Tests.Perf_String.Contains(text: "This is a very nice sentence", value: " | 27.03 | 276.06 | 7462.35 | |

However, the numbers are actually lying because the IsFastSort/IsAscii optimization is computed during the benchmark initialization and not accounted for in the benchmark run. To accurately measure the impact we would likely have to get rid of the cache in the object header first and see what happens. Another question is whether people run StartsWith/Contains on the same string more than once.

Looking more, it looks like on Windows we don't even try to optimize for ASCII cases and always call the underlying OS, so what @jkotas mentioned before looks reasonable to try.

Unless I misunderstand, @jkotas's suggestion assumes that the perf disparity comes from the IsFastSort check before falling back to the slow path. I think the whole problem is how incredibly slow the ICU fallback actually is.

I've done a few more benchmarks to get more data. I'm measuring the cost of return string.Concat(new string('a', length), "-").StartsWith("i"); (length being the number appended to the benchmark name). I'm creating a new string every time to make sure we don't fall into the corner case mentioned by @filipnavara. The CreateString benchmark just measures the cost of creating the string, to make sure it's negligible.
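
A sketch of what those benchmarks might look like (only the measured expression is given in the comment, so the attribute setup, method names, and the non-dash suffix 'z' are assumptions; the measured results follow in the table below):

```C#
using System;
using BenchmarkDotNet.Attributes;

public class StartsWithLengthBenchmark
{
    // A fresh string is built on every call, so the fast-sort flag cached in the
    // string object header (mentioned above) never helps across iterations.
    [Benchmark]
    public string CreateString_800() => string.Concat(new string('a', 800), "-");

    [Benchmark]
    public bool StartsWith_800() => string.Concat(new string('a', 800), "z").StartsWith("i");

    [Benchmark]
    public bool StartsWithOrdinal_800() => string.Concat(new string('a', 800), "z").StartsWith("i", StringComparison.Ordinal);

    [Benchmark]
    public bool StartsWithDash_800() => string.Concat(new string('a', 800), "-").StartsWith("i");
}
```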

                Method |         Mean |      Error |     StdDev |
---------------------- |-------------:|-----------:|-----------:|
        CreateString_1 |     34.68 ns |  0.0902 ns |  0.0753 ns |
      CreateString_800 |    452.03 ns |  1.4944 ns |  1.3979 ns |
          StartsWith_1 |     76.85 ns |  0.4202 ns |  0.3725 ns |
        StartsWith_800 |  1,095.46 ns |  4.7791 ns |  4.4703 ns |
 StartsWithOrdinal_800 |  1,085.14 ns |  6.9067 ns |  6.1226 ns |
      StartsWithDash_1 |  4,269.29 ns | 29.9301 ns | 27.9966 ns |
     StartsWithDash_80 |  7,548.37 ns | 57.8701 ns | 51.3003 ns |
    StartsWithDash_160 | 10,363.72 ns | 45.1756 ns | 42.2573 ns |
    StartsWithDash_800 | 34,426.77 ns | 74.0593 ns | 65.6516 ns |

We see that StartsWith_800 and StartsWithOrdinal_800 are pretty much equally fast (10 ns absolute difference, I'd say 20 ns if we take the worst of the confidence interval). So the cost of the prior IsFastSort check is negligible.

On the other hand, we see how the performance degrades with the length of the string. We have an initial 4 µs cost whenever we take the slow path, no matter the length of the string. Then we see the cost increasing linearly (3 µs for 80, 6 µs for 160, 30 µs for 800).

So we really have two issues:

  • The initial cost of 4 µs is ridiculously expensive. I see there have been a lot of optimizations on the ICU path, but clearly there's still something wrong
  • The performance shouldn't degrade linearly with the length of the string, since the first character doesn't match (so StartsWith should exit immediately). Looking at the implementation, I'd say it's because GlobalizationNative_StartsWith is implemented with the same logic as an IndexOf operation, and therefore will inspect the full string until the pattern is found. The ICU API is a mess, but I'm sure we can find a more efficient way

The performance shouldn't degrade linearly with the length of the string, since the first character doesn't match (so StartsWith should exit immediately). Looking at the implementation, I'd say it's because GlobalizationNative_StartsWith is implemented with the same logic as an IndexOf operation, and therefore will inspect the full string until the pattern is found. The ICU API is a mess, but I'm sure we can find a more efficient way

I am sure there are some edge cases I didn't consider, but it should be relatively easy to implement GlobalizationNative_StartsWith in terms of a ucol_strcoll call.

I've profiled the following app:

```C#
using System.Runtime.CompilerServices;

class Program
{
    static void Main()
    {
        while (true)
        {
            Consume(string.Concat(new string('a', 512), "-").StartsWith("i"));
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void Consume<T>(in T _) { }
}
```

with PerfCollect and the most expensive methods are:

Name | Exc % | Exc | Exc Ct | Inc % | Inc
-- | -- | -- | -- | -- | --
libicui18n.so.60.2!unknown | 22.1 | 6,811 | 6,811 | 22.2 | 6,834
libicui18n.so.60.2!icu_60::CollationElementIterator::next | 14.2 | 4,383 | 4,383 | 14.3 | 4,388
libicui18n.so.60.2!icu_60::UCollationPCE::processCE | 9.0 | 2,755 | 2,755 | 9.0 | 2,757
libicui18n.so.60.2!icu_60::UCollationPCE::nextProcessed | 8.5 | 2,615 | 2,615 | 8.5 | 2,617
libicui18n.so.60.2!usearch_search_60 | 7.0 | 2,154 | 2,154 | 7.0 | 2,157
libicui18n.so.60.2!icu_60::UTF16CollationIterator::handleNextCE32 | 6.5 | 2,005 | 2,005 | 6.5 | 2,006
libicuuc.so.60.2!u_strlen_60 | 6.5 | 1,984 | 1,984 | 6.5 | 1,986
libc-2.27.so!unknown | 3.3 | 1,029 | 1,029 | 3.4 | 1,054
libicui18n.so.60.2!icu_60::CollationElementIterator::getOffset | 2.5 | 777 | 777 | 2.5 | 778
libicuuc.so.60.2!unknown | 1.9 | 596 | 596 | 1.9 | 597
libicui18n.so.60.2!icu_60::UTF16CollationIterator::getOffset | 1.7 | 513.291 | 514 | 1.7 | 514.291
libcoreclr.so!StringObject::InternalCheckHighChars | 1.3 | 399 | 399 | 1.3 | 399
libicui18n.so.60.2!usearch_openFromCollator_60 | 1.1 | 349 | 349 | 1.1 | 349
libc-2.27.so!cfree | 0.9 | 280 | 280 | 0.9 | 280
libc-2.27.so!malloc | 0.9 | 263 | 263 | 0.9 | 263
libicui18n.so.60.2!ucol_primaryOrder_60 | 0.7 | 214 | 214 | 0.7 | 214
libicui18n.so.60.2!icu_60::PCEBuffer::reset | 0.7 | 204 | 204 | 0.7 | 204
libicui18n.so.60.2!ucol_tertiaryOrder_60 | 0.6 | 196 | 196 | 0.6 | 197
libicui18n.so.60.2!ucol_secondaryOrder_60 | 0.6 | 190 | 190 | 0.6 | 190
libicuuc.so.60.2!icu_60::CharString::append | 0.5 | 165 | 165 | 0.5 | 165

So IsFastSort is not a problem, but the ICU part is.

@tarekgh is there any reason why we should not implement StartsWith in the following way:

```C#
bool StartsWith(string source, string prefix, StringComparison stringComparison)
    => CompareString(source.AsSpan(start: 0, length: prefix.Length), prefix, stringComparison);
```

Edit: nevermind, I've got an answer from @kevingosse in https://github.com/dotnet/coreclr/pull/26481 ;)

is there any reason why we should not implement StartsWith in the following way:

@kevingosse is right, this will not work except for ordinal cases.

@kevingosse thanks for your measurements. I believe @adamsitnik's PR is going to help some with the StartsWith scenario.

https://github.com/dotnet/coreclr/pull/26759 and https://github.com/dotnet/coreclr/pull/26621 combined together have fixed this problem.

Fun fact: while working on improving the performance of StartsWith on Linux, we found and fixed an 18-year-old bug in ICU https://github.com/unicode-org/icu/pull/840 ;)
