Before 5.0, we were using ICU only on Unix systems. In 5.0 we have decided to use it on Windows by default as well.
This is something that we have done in order to have the same behavior of string-related globalization APIs on every OS.
However, this particular change has affected the performance characteristics of many frequently used methods. Some of them have regressed, some have improved.
Recently we have reported a lot of 5.0 regressions related to that. Since we have done this on purpose and we are most probably not going to revert the switch, I am opening this issue to track the list of all known regressions. When the list becomes complete, we are most probably going to update the 5.0 release docs and close the issue and label it as wont fix
.
Please feel free to edit the list.
Known changes:
cc @danmosemsft @tarekgh @billwert @DrewScoggins @GrabYourPitchforks @jkotas @safern
Tagging subscribers to this area: @tarekgh, @safern, @krwq
See info in area-owners.md if you want to be subscribed.
System.Memory.ReadOnlySpan.IndexOfString 5 regressions, 5 improvements
| Method | input | value | comparisonType | 3.1 Mean | 5.0 ICU | 5.0 NLS |
|-------------- |------------------------------------- |-------- |--------------------------- |----------:|------------:|----------:|
| IndexOfString | AAAAA5AAAA | 5 | InvariantCulture | 79.07 ns | 35.41 ns | 72.05 ns |
| IndexOfString | AAAAAAAAAAAA(...)AAAAAAAAAAAA [1000] | X | Ordinal | 39.72 ns | 38.89 ns | 38.25 ns |
| IndexOfString | AAAAAAAAAAAA(...)AAAAAAAAAAAA [100] | x | InvariantCultureIgnoreCase | 300.45 ns | 214.39 ns | 294.59 ns |
| IndexOfString | AAAAAAAAAAAA(...)AAAAAAAAAAAA [100] | x | OrdinalIgnoreCase | 245.94 ns | 802.33 ns | 240.42 ns |
| IndexOfString | ABCDE | c | InvariantCultureIgnoreCase | 71.81 ns | 30.26 ns | 63.83 ns |
| IndexOfString | Hello Worldb(...)allylong!xyz [186] | w | OrdinalIgnoreCase | 45.80 ns | 102.63 ns | 40.30 ns |
| IndexOfString | Hello Worldb(...)allylong!xyz [187] | ~ | Ordinal | 28.52 ns | 22.28 ns | 21.03 ns |
| IndexOfString | Hello Worldbb(...)bbbbbbbbbbba! [47] | y | Ordinal | 24.15 ns | 14.09 ns | 13.57 ns |
| IndexOfString | More Test's | Tests | OrdinalIgnoreCase | 56.38 ns | 108.93 ns | 50.48 ns |
| IndexOfString | StrIng | string | OrdinalIgnoreCase | 38.79 ns | 55.01 ns | 33.86 ns |
| IndexOfString | foobardzsdzs | rddzs | InvariantCulture | 101.46 ns | 44.93 ns | 94.90 ns |
| IndexOfString | string1 | string2 | InvariantCulture | 91.45 ns | 37.14 ns | 80.95 ns |
| IndexOfString | ? | ? | InvariantCulture | 108.70 ns | 4,069.11 ns | 105.36 ns |
| IndexOfString | ????????????(...)???????????? [100] | ? | Ordinal | 25.08 ns | 16.88 ns | 15.48 ns |
| IndexOfString | ????????????(...)???????????? [1000] | x | Ordinal | 39.70 ns | 39.19 ns | 37.99 ns |
Comment: InvariantCulture
and InvariantCultureIgnoreCase
have improved, while OrdinalIgnoreCase
has regressed. We could implement OrdinalIgnoreCase
path in managed code if it ever becomes a problem
Thank you @adamsitnik good idea I'm glad you created this.
I have this PR opened for addressing all ordinal operations https://github.com/dotnet/runtime/pull/40910
@adamsitnik are you going to close all other bugs complaining about ICU perf against this one?
are you going to close all other bugs complaining about ICU perf against this one?
That's my plan. I will try to do my best to do it for as many as I can.
Also, I want to be clear about the expectation here. I don't think we can fix all perf here as we are limited by calling ICU. we can look at how we can improve it but I am not expecting to get the perf to the point where we used to call NLS. So, it will be good to decide which items in this list is a blocker. The only one was the Ordinal cases which I am addressing in the attached PR. I am not aware of any other blocking scenario. We'll look more of course on other scenarios anyway but I am not sure how much we can do before 5.0 release.
System.Globalization.Tests.StringHash 2 regressions, 4 improvements
| Method | Count | Options | 3.1 Mean | 5.0 ICU Mean | 5.0 NLS Mean |
|------------ |------- |--------------------------- |----------------:|---------------:|----------------:|
| GetHashCode | 128 | (, IgnoreCase) | 1,806.07 ns | 1,806.0 ns | 1,806.10 ns |
| GetHashCode | 128 | (, None) | 1,823.38 ns | 2,284.0 ns | 1,818.16 ns |
| GetHashCode | 128 | (en-US, IgnoreCase) | 1,831.84 ns | 1,816.0 ns | 1,790.90 ns |
| GetHashCode | 128 | (en-US, None) | 1,820.91 ns | 2,296.8 ns | 1,815.97 ns |
| GetHashCode | 128 | (en-US, Ordinal) | 99.33 ns | 100.0 ns | 99.61 ns |
| GetHashCode | 128 | (en-US, OrdinalIgnoreCase) | 123.93 ns | 122.6 ns | 121.49 ns |
| GetHashCode | 131072 | (, IgnoreCase) | 2,207,798.50 ns | 1,773,953.8 ns | 2,193,202.55 ns |
| GetHashCode | 131072 | (, None) | 2,209,727.34 ns | 1,893,550.0 ns | 2,187,722.71 ns |
| GetHashCode | 131072 | (en-US, IgnoreCase) | 2,196,075.39 ns | 1,781,730.1 ns | 2,190,508.15 ns |
| GetHashCode | 131072 | (en-US, None) | 2,214,228.87 ns | 1,899,934.3 ns | 2,185,224.74 ns |
| GetHashCode | 131072 | (en-US, Ordinal) | 96,294.65 ns | 96,513.7 ns | 96,551.91 ns |
| GetHashCode | 131072 | (en-US, OrdinalIgnoreCase) | 113,393.49 ns | 112,940.4 ns | 116,436.80 ns |
Comment: there are two regressions for Count=128
but four improvements for Count=131072
System.Globalization.Tests.StringEquality: 8 regressions, 14 improvements
| Method | Count | Options | 3.1 Mean | 5.0 ICU Mean | 5.0 NLS Mean |
|--------------------------- |------ |--------------------------- |--------------:|--------------:|-------------:|
| Compare_Same | 1024 | (, IgnoreCase) | 996.299 ns | 575.414 ns | 997.39 ns |
| Compare_Same_Upper | 1024 | (, IgnoreCase) | 3,373.889 ns | 6,798.371 ns | 3,356.57 ns |
| Compare_DifferentFirstChar | 1024 | (, IgnoreCase) | 43.730 ns | 37.378 ns | 42.73 ns |
| Compare_Same | 1024 | (, None) | 1,003.408 ns | 572.528 ns | 1,003.25 ns |
| Compare_Same_Upper | 1024 | (, None) | 4,123.875 ns | 6,747.893 ns | 4,033.30 ns |
| Compare_DifferentFirstChar | 1024 | (, None) | 41.883 ns | 34.904 ns | 43.04 ns |
| Compare_Same | 1024 | (en-US, IgnoreCase) | 992.625 ns | 564.239 ns | 1,012.86 ns |
| Compare_Same_Upper | 1024 | (en-US, IgnoreCase) | 3,334.161 ns | 6,789.031 ns | 3,336.49 ns |
| Compare_DifferentFirstChar | 1024 | (en-US, IgnoreCase) | 41.821 ns | 36.251 ns | 41.92 ns |
| Compare_Same | 1024 | (en-US, IgnoreNonSpace) | 1,667.106 ns | 564.521 ns | 1,671.78 ns |
| Compare_Same_Upper | 1024 | (en-US, IgnoreNonSpace) | 9,547.973 ns | 2,103.148 ns | 9,599.69 ns |
| Compare_DifferentFirstChar | 1024 | (en-US, IgnoreNonSpace) | 66.929 ns | 35.786 ns | 59.21 ns |
| Compare_Same | 1024 | (en-US, IgnoreSymbols) | 1,676.295 ns | 569.468 ns | 1,679.39 ns |
| Compare_Same_Upper | 1024 | (en-US, IgnoreSymbols) | 9,100.891 ns | 11,605.391 ns | 9,075.81 ns |
| Compare_DifferentFirstChar | 1024 | (en-US, IgnoreSymbols) | 8,219.648 ns | 21,916.551 ns | 8,311.14 ns |
| Compare_Same | 1024 | (en-US, None) | 988.500 ns | 572.970 ns | 1,005.15 ns |
| Compare_Same_Upper | 1024 | (en-US, None) | 4,040.998 ns | 6,740.818 ns | 4,007.38 ns |
| Compare_DifferentFirstChar | 1024 | (en-US, None) | 41.931 ns | 35.349 ns | 42.87 ns |
| Compare_Same | 1024 | (en-US, Ordinal) | 82.818 ns | 67.527 ns | 68.47 ns |
| Compare_Same_Upper | 1024 | (en-US, Ordinal) | 13.907 ns | 14.796 ns | 14.82 ns |
| Compare_DifferentFirstChar | 1024 | (en-US, Ordinal) | 5.432 ns | 9.946 ns | 10.27 ns |
| Compare_Same | 1024 | (en-US, OrdinalIgnoreCase) | 815.592 ns | 822.866 ns | 834.45 ns |
| Compare_Same_Upper | 1024 | (en-US, OrdinalIgnoreCase) | 1,225.463 ns | 1,218.922 ns | 1,253.77 ns |
| Compare_DifferentFirstChar | 1024 | (en-US, OrdinalIgnoreCase) | 10.517 ns | 11.977 ns | 11.30 ns |
| Compare_Same | 1024 | (pl-PL, None) | 21,378.839 ns | 572.036 ns | 21,617.36 ns |
| Compare_Same_Upper | 1024 | (pl-PL, None) | 22,136.295 ns | 17,518.080 ns | 22,375.84 ns |
| Compare_DifferentFirstChar | 1024 | (pl-PL, None) | 61.794 ns | 37.698 ns | 62.93 ns |
Comment: ICU seems to be faster when inputs are the same or the first character is different but slows down when we are comparing lowercase and uppercase versions of the same words
System.Globalization.Tests.StringSearch: 30 regressions, 32 improvements
| Method | Options | 3.1 Mean | 5.0 ICU Mean | 5.0 NLS Mean |
|---------------------------- |---------------------------------- |--------------:|--------------:|--------------:|
| IsPrefix_FirstHalf | (, IgnoreCase, False) | 279.556 ns | 204.122 ns | 270.881 ns |
| IsPrefix_DifferentFirstChar | (, IgnoreCase, False) | 58.887 ns | 18.055 ns | 53.431 ns |
| IsSuffix_SecondHalf | (, IgnoreCase, False) | 264.358 ns | 191.353 ns | 262.566 ns |
| IsSuffix_DifferentLastChar | (, IgnoreCase, False) | 61.441 ns | 18.846 ns | 56.044 ns |
| IndexOf_Word_NotFound | (, IgnoreCase, False) | 802.657 ns | 628.962 ns | 798.926 ns |
| LastIndexOf_Word_NotFound | (, IgnoreCase, False) | 975.995 ns | 620.031 ns | 965.195 ns |
| IsPrefix_FirstHalf | (, IgnoreCase, True) | 1,012.810 ns | 3,677.834 ns | 1,019.808 ns |
| IsPrefix_DifferentFirstChar | (, IgnoreCase, True) | 72.985 ns | 853.547 ns | 72.039 ns |
| IsSuffix_SecondHalf | (, IgnoreCase, True) | 2,552.799 ns | 7,467.664 ns | 2,575.991 ns |
| IsSuffix_DifferentLastChar | (, IgnoreCase, True) | 5,376.895 ns | 1,288.707 ns | 5,329.497 ns |
| IndexOf_Word_NotFound | (, IgnoreCase, True) | 3,931.968 ns | 11,275.663 ns | 3,831.631 ns |
| LastIndexOf_Word_NotFound | (, IgnoreCase, True) | 3,142.703 ns | 17,463.707 ns | 3,042.517 ns |
| IsPrefix_FirstHalf | (, None, False) | 211.472 ns | 173.105 ns | 201.325 ns |
| IsPrefix_DifferentFirstChar | (, None, False) | 56.298 ns | 17.402 ns | 53.396 ns |
| IsSuffix_SecondHalf | (, None, False) | 182.540 ns | 174.536 ns | 179.486 ns |
| IsSuffix_DifferentLastChar | (, None, False) | 58.275 ns | 18.166 ns | 55.200 ns |
| IndexOf_Word_NotFound | (, None, False) | 699.751 ns | 500.508 ns | 678.148 ns |
| LastIndexOf_Word_NotFound | (, None, False) | 850.232 ns | 502.581 ns | 845.351 ns |
| IsPrefix_FirstHalf | (, None, True) | 1,052.320 ns | 3,636.952 ns | 1,021.712 ns |
| IsPrefix_DifferentFirstChar | (, None, True) | 73.568 ns | 848.922 ns | 72.912 ns |
| IsSuffix_SecondHalf | (, None, True) | 2,595.612 ns | 7,554.091 ns | 2,540.565 ns |
| IsSuffix_DifferentLastChar | (, None, True) | 5,306.733 ns | 1,317.628 ns | 5,348.958 ns |
| IndexOf_Word_NotFound | (, None, True) | 3,935.667 ns | 11,177.835 ns | 3,803.958 ns |
| LastIndexOf_Word_NotFound | (, None, True) | 3,083.463 ns | 16,993.198 ns | 3,029.761 ns |
| IsPrefix_FirstHalf | (en-US, IgnoreCase, False) | 270.134 ns | 205.387 ns | 273.552 ns |
| IsPrefix_DifferentFirstChar | (en-US, IgnoreCase, False) | 57.763 ns | 18.110 ns | 54.154 ns |
| IsSuffix_SecondHalf | (en-US, IgnoreCase, False) | 258.893 ns | 190.355 ns | 260.361 ns |
| IsSuffix_DifferentLastChar | (en-US, IgnoreCase, False) | 58.943 ns | 19.272 ns | 56.319 ns |
| IndexOf_Word_NotFound | (en-US, IgnoreCase, False) | 807.236 ns | 622.869 ns | 798.647 ns |
| LastIndexOf_Word_NotFound | (en-US, IgnoreCase, False) | 975.083 ns | 623.089 ns | 955.210 ns |
| IsPrefix_FirstHalf | (en-US, IgnoreCase, True) | 1,018.976 ns | 3,666.344 ns | 1,026.411 ns |
| IsPrefix_DifferentFirstChar | (en-US, IgnoreCase, True) | 76.465 ns | 887.811 ns | 72.117 ns |
| IsSuffix_SecondHalf | (en-US, IgnoreCase, True) | 2,563.798 ns | 7,529.526 ns | 2,565.718 ns |
| IsSuffix_DifferentLastChar | (en-US, IgnoreCase, True) | 5,305.949 ns | 1,304.493 ns | 5,346.831 ns |
| IndexOf_Word_NotFound | (en-US, IgnoreCase, True) | 3,892.140 ns | 11,776.383 ns | 3,926.784 ns |
| LastIndexOf_Word_NotFound | (en-US, IgnoreCase, True) | 3,090.867 ns | 17,430.155 ns | 3,072.910 ns |
| IsPrefix_FirstHalf | (en-US, IgnoreNonSpace, False) | 1,020.808 ns | 178.017 ns | 1,017.612 ns |
| IsPrefix_DifferentFirstChar | (en-US, IgnoreNonSpace, False) | 67.336 ns | 18.067 ns | 65.861 ns |
| IsSuffix_SecondHalf | (en-US, IgnoreNonSpace, False) | 2,552.693 ns | 178.425 ns | 2,544.768 ns |
| IsSuffix_DifferentLastChar | (en-US, IgnoreNonSpace, False) | 4,921.510 ns | 18.199 ns | 4,858.353 ns |
| IndexOf_Word_NotFound | (en-US, IgnoreNonSpace, False) | 3,906.457 ns | 506.498 ns | 3,792.084 ns |
| LastIndexOf_Word_NotFound | (en-US, IgnoreNonSpace, False) | 3,076.625 ns | 507.002 ns | 3,048.544 ns |
| IsPrefix_FirstHalf | (en-US, IgnoreSymbols, False) | 1,022.075 ns | 16,888.924 ns | 1,020.346 ns |
| IsPrefix_DifferentFirstChar | (en-US, IgnoreSymbols, False) | 70.421 ns | 29,267.090 ns | 66.428 ns |
| IsSuffix_SecondHalf | (en-US, IgnoreSymbols, False) | 3,437.616 ns | 19,397.341 ns | 3,411.783 ns |
| IsSuffix_DifferentLastChar | (en-US, IgnoreSymbols, False) | 5,308.903 ns | 34,515.843 ns | 5,333.271 ns |
| IndexOf_Word_NotFound | (en-US, IgnoreSymbols, False) | 3,310.485 ns | 11,414.777 ns | 3,285.185 ns |
| LastIndexOf_Word_NotFound | (en-US, IgnoreSymbols, False) | 3,468.180 ns | 16,640.286 ns | 3,393.434 ns |
| IsPrefix_FirstHalf | (en-US, None, False) | 205.249 ns | 176.915 ns | 200.000 ns |
| IsPrefix_DifferentFirstChar | (en-US, None, False) | 59.057 ns | 17.779 ns | 53.980 ns |
| IsSuffix_SecondHalf | (en-US, None, False) | 184.039 ns | 177.072 ns | 181.258 ns |
| IsSuffix_DifferentLastChar | (en-US, None, False) | 59.873 ns | 18.232 ns | 55.897 ns |
| IndexOf_Word_NotFound | (en-US, None, False) | 688.002 ns | 504.985 ns | 681.283 ns |
| LastIndexOf_Word_NotFound | (en-US, None, False) | 841.277 ns | 506.640 ns | 859.611 ns |
| IsPrefix_FirstHalf | (en-US, None, True) | 1,026.332 ns | 3,676.392 ns | 1,129.406 ns |
| IsPrefix_DifferentFirstChar | (en-US, None, True) | 72.958 ns | 860.499 ns | 73.269 ns |
| IsSuffix_SecondHalf | (en-US, None, True) | 2,555.954 ns | 7,625.414 ns | 2,556.183 ns |
| IsSuffix_DifferentLastChar | (en-US, None, True) | 5,320.978 ns | 1,349.628 ns | 5,373.210 ns |
| IndexOf_Word_NotFound | (en-US, None, True) | 3,783.346 ns | 11,472.314 ns | 3,884.931 ns |
| LastIndexOf_Word_NotFound | (en-US, None, True) | 3,080.818 ns | 17,512.610 ns | 3,077.301 ns |
| IsPrefix_FirstHalf | (en-US, Ordinal, False) | 12.480 ns | 14.469 ns | 12.584 ns |
| IsPrefix_DifferentFirstChar | (en-US, Ordinal, False) | 6.229 ns | 8.496 ns | 8.925 ns |
| IsSuffix_SecondHalf | (en-US, Ordinal, False) | 14.058 ns | 12.034 ns | 14.248 ns |
| IsSuffix_DifferentLastChar | (en-US, Ordinal, False) | 19.645 ns | 17.446 ns | 16.225 ns |
| IndexOf_Word_NotFound | (en-US, Ordinal, False) | 39.451 ns | 37.728 ns | 39.207 ns |
| LastIndexOf_Word_NotFound | (en-US, Ordinal, False) | 137.367 ns | 97.667 ns | 98.362 ns |
| IsPrefix_FirstHalf | (en-US, OrdinalIgnoreCase, False) | 73.307 ns | 82.618 ns | 83.575 ns |
| IsPrefix_DifferentFirstChar | (en-US, OrdinalIgnoreCase, False) | 9.844 ns | 9.585 ns | 9.731 ns |
| IsSuffix_SecondHalf | (en-US, OrdinalIgnoreCase, False) | 109.128 ns | 83.181 ns | 80.728 ns |
| IsSuffix_DifferentLastChar | (en-US, OrdinalIgnoreCase, False) | 203.785 ns | 149.656 ns | 141.373 ns |
| IndexOf_Word_NotFound | (en-US, OrdinalIgnoreCase, False) | 701.811 ns | 3,125.640 ns | 705.609 ns |
| LastIndexOf_Word_NotFound | (en-US, OrdinalIgnoreCase, False) | 617.823 ns | 3,161.417 ns | 621.380 ns |
| IsPrefix_FirstHalf | (pl-PL, None, False) | 2,594.625 ns | 5,704.863 ns | 2,558.898 ns |
| IsPrefix_DifferentFirstChar | (pl-PL, None, False) | 101.082 ns | 857.463 ns | 98.650 ns |
| IsSuffix_SecondHalf | (pl-PL, None, False) | 8,633.560 ns | 8,435.046 ns | 8,586.963 ns |
| IsSuffix_DifferentLastChar | (pl-PL, None, False) | 19,423.848 ns | 1,216.271 ns | 17,376.208 ns |
| IndexOf_Word_NotFound | (pl-PL, None, False) | 9,534.684 ns | 14,329.673 ns | 9,455.980 ns |
| LastIndexOf_Word_NotFound | (pl-PL, None, False) | 8,689.332 ns | 18,014.702 ns | 8,657.101 ns |
System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse: 1 big regression for ja
culture, 5 improvements
| Method | culturestring | 3.1 Mean | 5.0 ICU | 5.0 NLS|
|--------------------- |-------------- |---------:|-----------:|---------:|
| ToStringHebrewIsrael | ? | 540.1 ns | 463.3 ns | 518.3 ns |
| ToString | | 254.0 ns | 248.9 ns | 245.7 ns |
| Parse | | 466.9 ns | 423.7 ns | 424.7 ns |
| ToString | da | 247.5 ns | 234.0 ns | 241.1 ns |
| Parse | da | 551.2 ns | 449.3 ns | 515.6 ns |
| ToString | fr | 251.5 ns | 246.4 ns | 244.3 ns |
| Parse | fr | 466.5 ns | 424.7 ns | 440.2 ns |
| ToString | ja | 258.1 ns | 243.5 ns | 245.4 ns |
| Parse | ja | 482.1 ns | 5,078.9 ns | 456.9 ns |
Comment: this is a known ICU issue: https://github.com/dotnet/runtime/issues/31273
are you going to close all other bugs complaining about ICU perf against this one?
@tarekgh I've gone through all System.Memory
and System.Globalization
issues with performance
tag and updated the list. It should be complete now.
@symbai I see you've thumbs-down. Could you share your concerns?
@danmosemsft Its not about adamsitnik work of collecting the regressions which I find very helpful. Its about the fact that switching to ICU has introduced MAJOR regressions (+4,500ns, ~ 4x slower) while the "improvements" are usually around ~2ns which I bet is only some noise and running them a couple of times will show there are no improvements at all.
I'm not into ICU and I dont know why this change was made at all but seeing all the performance improvements in .NET Core the last versions and now such a big regression on hot code path that is literally being used anywhere, makes me really wonder what in god's name can be that much useful about ICU that it worth THIS regression. I've thumbs-down because I disagree on this ICU change and I disagree on statements which say that "we might not fix some of these regressions". Especially without telling people that switching to .NET 5 will make their code much more slower without a benefit (there might be a benefit in ICU, but I bet it won't affect most people... unlike this performance regression).
@Symbai you have the option to switch back to use NLS if you want to.
ICU is the future direction in general for the .NET and Windows too. ICU will give the opportunity to have a consistency between OS's and OS versions. The benefit of using ICU is really worth it. ICU will give opportunity to the apps to customize the globalization behavior too.
Also note that this is already what is used on Linux and Mac, and increasingly used within Windows OS itself. So we are aligning with the industry here. If and where it is slow - we all benefit from making it faster. The .NET team have contributed bug reports and performance improvements to libicu in the past, and I expect we will do so again.
As a side note, ICU is open source and we can contribute to make hot paths faster
I am happy to see cross-platform consistency based on ICU but it would be sad to find slowing down on hot paths, specially in PowerShell whose Engine is very sensitive to string operation performance. I hope MSFT team continues to invest in the area in next milestone.
As side notice, we could have an extension package with managed implementation for applications which are performance critical and has not memory limitations (can utilize more large and fast tables). Of course, if other ways are not more effective.
@iSazonov we already fixed all ordinal[IgnoreCase] operations perf across all functionality). as pointed before, you still have the option to switch back using NLS if needed and you continue work on Windows as used to.
As side notice, we could have an extension package with managed implementation for applications which are performance critical and has not memory limitations (can utilize more large and fast tables). Of course, if other ways are not more effective.
I assume you are talking about PowerShell? of course we'll not do that in .NET as we avoid carrying any globalization data and implementing all functionalities like collations wouldn't be trivial.
by the way, Powershell already running on Linux for awhile with ICU, why you are concerned now?
we already fixed all ordinal[IgnoreCase] operations perf across all functionality)
I see. Many thanks!
you still have the option to switch back using NLS
I think it makes no sense because because more and more users work in a heterogeneous environment and with ICU there is less chance of getting different results.
by the way, Powershell already running on Linux for awhile with ICU, why you are concerned now?
Questions like why the performance/behavior is different on Windows and on Unix is very inconvenient. Questions like this sometimes appear in PowerShell repo (not related to the topic). With moving to ICU the likelihood of such questions decreases and this is great.
PowerShell is highly dependent on OrdinalIgnoreCase and any improvements here have a positive impact on it. This is my only concern.
Thanks again for your great work!
PowerShell is highly dependent on OrdinalIgnoreCase and any improvements here have a positive impact on it. This is my only concern.
Thanks for the info, it is very helpful. could you please try the latest .NET builds which include the Ordinal perf improvements and let's know if you see the differences now? let me know if you need help with that.
Just wanted to update this thread with the auto-filed results showing big wins (70%) in the System.Memory.ReadOnlySpan benchmarks that regressed.
Also seeing wins for System.Globalization.Tests.StringSearch in some of the IndexOf OrdinalIgnoreCase benchmarks. We are seeing that across x64, x86 Windows and x64 Ubuntu.
Thanks for the info, it is very helpful. could you please try the latest .NET builds which include the Ordinal perf improvements and let's know if you see the differences now? let me know if you need help with that.
A day before PowerShell MSFT team tried to move to .Net 5.0 Preview8 but without success. I guess I can do some measurements only after we move to RC1.
/cc @SteveL-MSFT @daxian-dbw for information.
Just curious @tarekgh, why do we still see significant differences between Windows (with your improvement) and Ubuntu? eg
https://github.com/DrewScoggins/performance-2/issues/1283 --- about 240-260 ns
https://github.com/DrewScoggins/performance-2/issues/1334 -- back down from 160ns to 120ns
@danmosemsft looking https://github.com/DrewScoggins/performance-2/issues/1283 I am seeing the ordinal ignore casing cases is improved. other none ordinal cases I am not expecting to improve with my change. am I reading the data correctly?
@tarekgh your change is good. I was just curious why Linux and Windows perf were still significantly different but I realized I compared the wrong rows.
@danmosemsft I was not really sure if I am looking at or interpreting the perf data correctly. that is why I was asking. Thanks for clarifying.
Would it be in the cards to contribute to ICU so that this operation is not by-design slow? I imagine that many projects relying on ICU would like a fast IndexOf
that does require caching a searcher object.
@tarekgh I made simple test for PowerShell string comparisons (in Russian locale). I see a huge regression after 5.0-Preview.7 but it is still significant better than 3.1.6.
I hope .Net and PowerShell teams will discuss this and make more reliable tests.
/cc @SteveL-MSFT @daxian-dbw
| PowerShell version | Duration, sec | .Net version
|-|-|-|
| PowerShell-7.0.3 | 13.8 | 3.1.6
| PowerShell-7.1.0-Preview.5 | 5.5 | 5.0-Preview.6
| PowerShell-7.1.0-Preview.6 | 5.5 | 5.0-Preview.7
| PowerShell-7.1.0-Preview.7 | 8.7 | 5.0-Preview.8
PowerShell test script:
$value = (Get-Date).ToString() # Russian locale
$array1 = @()
for ($i=0; $i -lt 5000; $i++) {
$array1 += $value
}
$value = (Get-Date).ToString()
$array2 = @()
for ($i=0; $i -lt 5000; $i++) {
$array2 += $value
}
Measure-Command { Compare-Object -ReferenceObject $array1 -DifferenceObject $array2 }
@iSazonov are you running on Linux or Windows? On Windows it is expected you see some regression with the Linguistic scenarios because switching to use ICU there. As mentioned before you still have the option to switch back to use NLS if needed. We have addressed all ordinal scenarios though. Also, there were a lot of work going on during these previews but at least we still significantly faster than 3.1 which is a good news.
@tarekgh It was Windows only test.
I am surprised now this is faster than 3.1 then. are you sure of your results? could you send me a couple of string cases you used in the comparison using the Russian culture?
I am surprised now this is faster than 3.1 then. are you sure of your results?
I measured _PowerShell_ not .Net - I mean perhaps there were other optimizations in PowerShell and .Net that could affect the result.
We could use PerfView to investigate in depth.
Sample string from my tests:
14 сентября 2020 г. 23:25:22
The second test line only differs in seconds.
@GSPP I don't think we have extensively studied what contributions we can make here. Regardless, ICU and NLS have different operational philosophies when it comes to this, and ICU's consumers are absolutely used to caching the search object. The canonical scenario for linguistic searching in ICU is for a UI-based application. You open a browser window or word document, enter your search term, then see all matches highlighted in the document and use Next / Previous to iterate through them.
In our case, we case about non-linguistic (ordinal) searching. If this were to be contributed back to ICU it would be in the form of a brand new API. Furthermore, the type of ordinal comparison we're using here (conversion to uppercase) is different than Unicode's own recommendations (conversion to case-fold). All of this is to say that I'm not hopeful of a specialty API like the one .NET is using making its way through.
@GrabYourPitchforks @GSPP just to let you know, I am looking at the linguistic IndexOf scenario and experimenting some changes that may help in the perf (around internally caching some ICU objects too). no promise yet as I am still in the middle of looking at that.
Also, we are following up with Windows team as they are trying to do some perf enhancement on ICU too. It is another win situation.
Last, as ICU is open source project, we have contributed some changes before which means it is possible we'll contribute more if it is really required to enhance .NET scenarios. but in general this something we may look at for 6.0 version and beyond.
@iSazonov I tried your scenario (using .NET without PS) and I am seeing the similar results as you have reported it. This is very interesting. I am going to look more on the details to understand what is going on.
I have collected the ETW benchmark data and now I am seeing the logical results. I am seeing 3.1 is faster (as expected). dotnet benchmark somehow is reporting wrong numbers or there is something causing noises there. here is the ETW data:
Name Inc % Inc Exc % Exc
system.private.corelib.il!CompareInfo.CompareString 95.7 42,349 2.0 893
+ kernel32!CompareStringExStub 92.6 40,960 0.5 202
|+ kernelbase!SortCompareString 92.1 40,754 2.2 962
|+ ntoskrnl!? 0.0 4 0.0 3
+ coreclr!? 0.6 253 0.6 253
+ system.private.corelib.il!CompareInfo.GetNativeCompareFlags 0.5 219 0.5 219
+ ?!? 0.0 14 0.0 14
+ ntoskrnl!? 0.0 10 0.0 8
Name Inc % Inc Exc % Exc
System.Private.CoreLib.il!CompareInfo.Compare 91.0 66,116 2.6 1,888
+ System.Private.CoreLib.il!CompareInfo.Compare 88.3 64,159 3.3 2,405
|+ System.Private.CoreLib.il!CompareInfo.IcuCompareString 84.8 61,633 5.4 3,941
||+ coreclr!GlobalizationNative_CompareString 74.7 54,287 5.0 3,666
|||+ icu!ucol_strcoll 66.5 48,311 4.9 3,530
|||+ ntdll!? 1.6 1,188 1.6 1,182
|||+ coreclr!GetCollatorFromSortHandle 1.5 1,105 1.5 1,081
|||+ ntoskrnl!? 0.0 16 0.0 8
|||+ nvlddmkm!? 0.0 1 0.0 1
||+ coreclr!JIT_InitPInvokeFrame 4.6 3,370 4.6 3,353
||+ ntoskrnl!? 0.0 24 0.0 9
||+ coreclr!JIT_PInvokeBegin 0.0 7 0.0 7
||+ coreclr!JIT_PInvokeEnd 0.0 4 0.0 4
And here is the results (which I believe is wrongly reported from dotnet benchmark) from exact same test.
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------- |---------:|--------:|---------:|------:|------:|------:|----------:|
| StringCompare | 141.8 ns | 3.52 ns | 10.39 ns | - | - | - | - |
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| StringCompare | 705.3 ns | 14.09 ns | 35.86 ns | - | - | - | - |
and here is the method code I used:
```C#
private static CultureInfo s_culture = CultureInfo.GetCultureInfo("ru-RU");;
private static string s_string1 = "14 сентября 2020 г. 23:25:22";
private static string s_string2 = "14 сентября 2020 г. 23:25:24";
[Benchmark] public int StringCompare() => String.Compare(s_string1, s_string2, s_culture, CompareOptions.None);
```
@adamsitnik @DrewScoggins do you have any guess why the perf numbers I am getting from dotnet benchmark tool reported that way which not matching what ETW data reported from the exact same test?
@DrewScoggins I have merged today the other optimization work which targeting string search operations (IndexOf/LastIndexOf/IsPrefix/IsSuffix/StartsWith/EndsWith). could you please watch the perf results after running my changes and update this issue? Thanks a lot.
Yes, I will keep an eye out for these to see if we are seeing any improvement.
@tarekgh I've run the benchmark that you have provided and got similar results that confirm that 5.0 using ICU is faster in this particular case.
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.1082 (1909/November2018Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-rc.1.20452.10
[Host] : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
Job-KYODGR : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
Job-PPECTW : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), X64 RyuJIT
| Method | Job | Runtime | Toolchain | Mean | Error | StdDev | Median | Ratio |
|-------------- |----------- |-------------- |-------------- |----------:|---------:|---------:|----------:|------:|
| StringCompare | Job-KYODGR | .NET Core 3.1 | netcoreapp3.1 | 377.30 ns | 4.741 ns | 4.435 ns | 376.14 ns | 1.00 |
| StringCompare | Job-PPECTW | .NET Core 5.0 | netcoreapp5.0 | 68.72 ns | 1.342 ns | 3.055 ns | 67.42 ns | 0.19 |
To compare apples to apples I've set the invocation count (the number of benchmark invocations per iteration) to the same number and filtered the ETW trace file to the last benchmark iteration (description of the mentioned filtering):
--invocationCount 2097152 --profiler ETW
For 3.1 a single iteration (2097152 invocations) takes 791ms:
For 5.0 a single iteration (2097152 invocations) takes 147ms:
@adamsitnik the scenario you mentioned is the string compare which is different than string search. I was expecting my recent changes affect the string search and not the string compare. maybe I am missing something here.
the scenario you mentioned
@tarekgh I have used the scenario that you have provided (but it was a long time ago - on the 14th of September)
maybe I am missing something here.
most probably in this particular scenario ICU is simply faster
I have used the scenario that you have provided (but it was a long time ago - on the 14th of September)
What I mentioned in Sept 14th was using 3.1 on Windows which is using NLS and not ICU. I am expecting 3.1 still be faster.
I am expecting 3.1 still be faster.
So I've verified that 5.0 is faster for this particular case.
After the most recent check in (https://github.com/dotnet/runtime/pull/43065) that Tarek made, we are seeing some good improvements in the lab.
Architecture | x64
-- | --
OS | Windows 10.0.18362
Changes | diff
Benchmark | Baseline | Test | Test/Base | Modality | Baseline Outlier | Baseline ETL | Comapre ETL
-- | -- | -- | -- | -- | -- | -- | --
IndexOf_Word_NotFound | 13.06 μs | 9.15 μs | 0.70 | | True
LastIndexOf_Word_NotFound | 18.87 μs | 15.01 μs | 0.80 | | True
LastIndexOf_Word_NotFound | 20.55 μs | 16.44 μs | 0.80 | | True
IsSuffix_SecondHalf | 22.05 μs | 17.57 μs | 0.80 | | False
IsPrefix_FirstHalf | 19.68 μs | 15.59 μs | 0.79 | | False
IndexOf_Word_NotFound | 12.96 μs | 9.05 μs | 0.70 | | True
IsPrefix_DifferentFirstChar | 34.04 μs | 29.81 μs | 0.88 | | True
IndexOf_Word_NotFound | 16.56 μs | 12.95 μs | 0.78 | | True
IsSuffix_DifferentLastChar | 39.49 μs | 34.97 μs | 0.89 | | True
IndexOf_Word_NotFound | 13.19 μs | 9.30 μs | 0.70 | | True
LastIndexOf_Word_NotFound | 19.46 μs | 15.26 μs | 0.78 | | True
IndexOf_Word_NotFound | 12.83 μs | 9.16 μs | 0.71 | | False
IndexOf_Word_NotFound | 13.19 μs | 9.27 μs | 0.70 | | True
LastIndexOf_Word_NotFound | 19.20 μs | 14.99 μs | 0.78 | | False
LastIndexOf_Word_NotFound | 19.41 μs | 15.28 μs | 0.79 | | False
LastIndexOf_Word_NotFound | 18.57 μs | 14.01 μs | 0.75 | | True
Historical Data in Reporting System
git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Globalization.Tests.StringSearch*'
[ 8961.400 ; 9663.790) | @@@@@@@@@@@@@@
[ 9663.790 ; 10366.179) |
[10366.179 ; 11068.568) |
[11068.568 ; 11770.957) |
[11770.957 ; 12473.347) |
[12473.347 ; 12726.514) |
[12726.514 ; 13428.903) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[13428.903 ; 14213.285) | @@@
[14612.596 ; 15349.017) | @@@@@@@@@@@@@@
[15349.017 ; 16056.794) |
[16056.794 ; 16764.571) |
[16764.571 ; 17472.348) |
[17472.348 ; 18217.629) |
[18217.629 ; 18831.044) | @
[18831.044 ; 19538.820) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[19538.820 ; 20122.311) | @@@
[16251.652 ; 16961.645) | @@@@@@@@@@@@@@
[16961.645 ; 17671.638) |
[17671.638 ; 18381.630) |
[18381.630 ; 19091.623) |
[19091.623 ; 19801.616) |
[19801.616 ; 20174.942) |
[20174.942 ; 20884.934) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[20884.934 ; 21604.903) | @@@
[17072.752 ; 17964.482) | @@@@@@@@@@@@@@
[17964.482 ; 18779.784) |
[18779.784 ; 19595.085) |
[19595.085 ; 20410.387) |
[20410.387 ; 21514.014) |
[21514.014 ; 22329.316) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[22329.316 ; 23330.116) | @@@@@@@@@@
[15229.897 ; 15975.829) | @@@@@@@@@@@@@@
[15975.829 ; 16721.761) |
[16721.761 ; 17467.693) |
[17467.693 ; 18213.625) |
[18213.625 ; 18959.557) |
[18959.557 ; 19302.824) |
[19302.824 ; 20048.756) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[20048.756 ; 20855.951) | @@@@@
[ 8797.707 ; 9499.196) | @@@@@@@@@@@@@@
[ 9499.196 ; 10200.684) |
[10200.684 ; 10902.172) |
[10902.172 ; 11603.661) |
[11603.661 ; 12305.149) |
[12305.149 ; 12684.041) |
[12684.041 ; 13385.530) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[13385.530 ; 14220.333) | @@
[29477.421 ; 30235.312) | @@@@@@@@@@@@@@
[30235.312 ; 30993.203) |
[30993.203 ; 31751.095) |
[31751.095 ; 32508.986) |
[32508.986 ; 33769.724) |
[33769.724 ; 35223.347) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[12609.085 ; 13265.894) | @@@@@@@@@@@@@
[13265.894 ; 13768.769) | @
[13768.769 ; 14425.579) |
[14425.579 ; 15082.389) |
[15082.389 ; 15739.199) |
[15739.199 ; 16287.555) |
[16287.555 ; 16944.365) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[16944.365 ; 17648.441) | @@
[34563.746 ; 35375.219) | @@@@@@@@@@@@@@
[35375.219 ; 36186.692) |
[36186.692 ; 36998.165) |
[36998.165 ; 37809.638) |
[37809.638 ; 39015.124) | @
[39015.124 ; 39826.597) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[39826.597 ; 40796.807) | @@@@@
[ 8890.906 ; 9625.004) | @@@@@@@@@@@@@@
[ 9625.004 ; 10333.195) |
[10333.195 ; 11041.386) |
[11041.386 ; 11749.577) |
[11749.577 ; 12457.769) |
[12457.769 ; 13023.837) | @
[13023.837 ; 13732.028) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[13732.028 ; 14434.832) | @@@@
[14805.514 ; 15554.325) | @@@@@@@@@@@@@@
[15554.325 ; 16303.136) |
[16303.136 ; 17051.948) |
[17051.948 ; 17800.759) |
[17800.759 ; 18662.288) |
[18662.288 ; 19285.851) | @@
[19285.851 ; 20034.662) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[20034.662 ; 20599.685) | @@@@
[ 9024.075 ; 9674.928) | @@@@@@@@@@@@@@
[ 9674.928 ; 10325.780) |
[10325.780 ; 10976.632) |
[10976.632 ; 11627.484) |
[11627.484 ; 12278.336) |
[12278.336 ; 12637.952) |
[12637.952 ; 13288.804) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[13288.804 ; 13899.140) | @@
[ 8882.220 ; 9650.252) | @@@@@@@@@@@@@@
[ 9650.252 ; 10352.069) |
[10352.069 ; 11053.887) |
[11053.887 ; 11755.704) |
[11755.704 ; 12457.521) |
[12457.521 ; 12812.048) |
[12812.048 ; 13513.865) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[13513.865 ; 14286.340) | @@@@@
[14587.344 ; 15366.223) | @@@@@@@@@@@@@@
[15366.223 ; 16127.006) |
[16127.006 ; 16887.789) |
[16887.789 ; 17648.572) |
[17648.572 ; 18409.355) |
[18409.355 ; 18787.408) |
[18787.408 ; 19548.191) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[19548.191 ; 20346.411) | @@@@@@
[14863.073 ; 15658.737) | @@@@@@@@@@@@@@
[15658.737 ; 16397.914) |
[16397.914 ; 17137.092) |
[17137.092 ; 17876.269) |
[17876.269 ; 18615.446) |
[18615.446 ; 18981.118) |
[18981.118 ; 19720.295) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[19720.295 ; 20512.543) | @@@@
[13785.427 ; 14537.175) | @@@@@@@@@@@@@@
[14537.175 ; 15288.923) |
[15288.923 ; 16040.671) |
[16040.671 ; 16792.419) |
[16792.419 ; 17544.167) |
[17544.167 ; 18045.402) |
[18045.402 ; 18964.029) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[18964.029 ; 19744.608) | @@
Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository
thanks @DrewScoggins for the updates.
With #43065 that provided some really nice improvements, we are getting even closer to contributing to ICU itself (as we run out of ideas on our side).
While reading the ICU source code I got the impression that it was not written with performance in mind (very few hacks, tricks, and comments explaining them). But I assumed that the code is not doing that because ICU is using PGO that does this for ICU.
But what if ICU is not using PGO? Perhaps this could be our first perf contribution?
I've searched ICU repo and bug tracker for "PGO" and did not find anything:
https://github.com/unicode-org/icu/search?q=PGO
https://unicode-org.atlassian.net/issues/?jql=text%20~%20%22PGO%22
Just to give you some numbers: when Chromium started using Microsoft's PGO they got +15% improvement for "new tab page load time" and "startup time": https://blog.chromium.org/2016/10/making-chrome-on-windows-faster-with-pgo.html
@tarekgh @GrabYourPitchforks what are your thoughts on this?
cc @brianrob
For who don't know what is PGO, it is Profile-guided optimization.
also cc @lewing (although I guess his focus is just size)
For information only. PowerShell 7.1 was released on .Net 5.0 and we already got performance feedback https://github.com/PowerShell/PowerShell/issues/14087
@iSazonov did you have chance to investigate more and know where is the regression is coming from?
Also, could you please measure the same scenario on .NET 6.0? we have some more optimization there it may help?
@danmosemsft we care about performance a lot too, we just had disproportionately large (small?) size needs here. @tarekgh great work on all this.
PGO of the unmanaged code looks complicated in our case because we'd need a way for our compiler/toolchain to consume the the profile.
@iSazonov did you have chance to investigate more and know where is the regression is coming from?
@tarekgh I believe it is an expected regression after moving to ICU. I shared the feedback to demonstrate what users can see in real scenarios.
Also, could you please measure the same scenario on .NET 6.0? we have some more optimization there it may help?
PowerShell repository is still on .Net 5.0. I guess PowerShell MSFT team will move to .Net 6.0 after .Net 6 Preview1 will be available to them. Then we could compare.
If users see such a noticeable regression, I would expect your latest improvements to be ported to .Net 5.0.
As for PGO, I think this could be a good step towards reaping benefits quickly with minimal cost. I assume you already have these statistics from MSFT services. But in general it is obvious which functions are the most used.
More important is how this performance is achieved. I see that there is a desire to make Runtime more compact. This is important for scenarios like WASM. But in other cases, this is not critical. If hardware intrinsics use double and triple implementations, why not do the same for strings? - a compact but slower implementation for WASM and etc. and a faster but more capacious implementation for other scenarios - it is not problem for PowerShell script execution to have additional 10 buffers by 20 Kb or even by 1 Mb.
I can speculate that no one will understand why the Binq service is 20% slower just for the sake of WASM running in the browser. Based on the same test set WASM could be get more compact runtime by means of conditional compiling to use only compact implementations.
I believe it is an expected regression after moving to ICU. I shared the feedback to demonstrate what users can see in real scenarios.
@iSazonov I am still unclear what exactly regressed in the user scenario? did you investigate PowerShell and looked where the regression is happening? what calls to .NET is regressed?
the referenced issue (https://github.com/PowerShell/PowerShell/issues/14087) mentions:
(...) on many lines of text to remove lines matching a criteria in a 85MB text file
So it could be string.IndexOf
, string.Contains
, string.Remove
, string.Replace
or even RegEx
. Let's wait for the user to provide more details
A script of mine that reliably took 26 or 27 minutes to run in 7.0 now takes 40 minutes in 7.1.
It should take less than a minute to handle 85 MB text file. We might be able to provide some recommendations as well.
I've looked into the PowerShell trace files and it looks like the regression is not related to ICU, but Regex.Replace
(cc @stephentoub). For more details please see https://github.com/PowerShell/PowerShell/issues/14087#issuecomment-729053865
We should expect a new issue to be created soon by PowerShell contributors (https://github.com/PowerShell/PowerShell/issues/14087#issuecomment-729040556)
To tell, @jefgen thankfully is in doing some optimization work from the ICU side
https://github.com/unicode-org/icu/pull/1471
https://github.com/unicode-org/icu/pull/1473
You can see the perf numbers in the ticket https://unicode-org.atlassian.net/browse/ICU-21388
@jefgen very nice improvements! did Windows Team consider using PGO (Profile Guided Optimization) to improve ICU perf?
@jefgen very nice improvements!
Thanks!
FWIW, I had a few more ideas that might help to improve the performance more, but I haven't had time to investigate them as of yet.
did Windows Team consider using PGO (Profile Guided Optimization) to improve ICU perf?
Yes the version of ICU that ships as part of the OS with Windows (icu.dll
) is indeed using PGO. However, IIUC, I think that the "training data" used for it is from scenarios for the old/legacy Edge browser. I wonder if it might be possible to somehow use training data from .NET Core instead? (I'm not sure if that is possible though, and/or how it might work).
That said, the pre-built Windows binary downloads for the public ICU don't use PGO though.
The Nuget package that our team produces for ICU (with data changes for Microsoft products) also doesn't currently use PGO either. (Both of these would/could be used for the .NET app-local ICU feature).
TBH, I'm not super familiar with PGO though, so I wonder if it might be good to sync up to discuss more sometime perhaps?.
I measured PowerShell startup scenario. Command line is pwsh -c exit
. PowerShell processes args[] and for the command line it calls _once_ string.ToLowerInvariant()
for args[0] that is "c". This takes 26 ms. I guess it is slow ICU initialization.
SpeedScope file:
PerfViewData-7.2-Startup-AI-DT.speedscope.json.zip
Update: I found another slow ICU initialization in the scenario - we mistakenly called string.StartsWith() with a default StringComparison.CurrentCulture - it calls slow icu!ucol_openElements
. I replaced with with Ordinal but
the fact remains - ICU is initialized too slow for PowerSHell startup scenario.
Most helpful comment
After the most recent check in (https://github.com/dotnet/runtime/pull/43065) that Tarek made, we are seeing some good improvements in the lab.
Run Information
Architecture | x64
-- | --
OS | Windows 10.0.18362
Changes | diff
Regressions in System.Globalization.Tests.StringSearch
Benchmark | Baseline | Test | Test/Base | Modality | Baseline Outlier | Baseline ETL | Comapre ETL
-- | -- | -- | -- | -- | -- | -- | --
IndexOf_Word_NotFound | 13.06 μs | 9.15 μs | 0.70 | | True
LastIndexOf_Word_NotFound | 18.87 μs | 15.01 μs | 0.80 | | True
LastIndexOf_Word_NotFound | 20.55 μs | 16.44 μs | 0.80 | | True
IsSuffix_SecondHalf | 22.05 μs | 17.57 μs | 0.80 | | False
IsPrefix_FirstHalf | 19.68 μs | 15.59 μs | 0.79 | | False
IndexOf_Word_NotFound | 12.96 μs | 9.05 μs | 0.70 | | True
IsPrefix_DifferentFirstChar | 34.04 μs | 29.81 μs | 0.88 | | True
IndexOf_Word_NotFound | 16.56 μs | 12.95 μs | 0.78 | | True
IsSuffix_DifferentLastChar | 39.49 μs | 34.97 μs | 0.89 | | True
IndexOf_Word_NotFound | 13.19 μs | 9.30 μs | 0.70 | | True
LastIndexOf_Word_NotFound | 19.46 μs | 15.26 μs | 0.78 | | True
IndexOf_Word_NotFound | 12.83 μs | 9.16 μs | 0.71 | | False
IndexOf_Word_NotFound | 13.19 μs | 9.27 μs | 0.70 | | True
LastIndexOf_Word_NotFound | 19.20 μs | 14.99 μs | 0.78 | | False
LastIndexOf_Word_NotFound | 19.41 μs | 15.28 μs | 0.79 | | False
LastIndexOf_Word_NotFound | 18.57 μs | 14.01 μs | 0.75 | | True
Historical Data in Reporting System
Repro
Histogram
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (en-US, None, True))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (, None, True))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (pl-PL, None, False))
System.Globalization.Tests.StringSearch.IsSuffix_SecondHalf(Options: (en-US, IgnoreSymbols, False))
System.Globalization.Tests.StringSearch.IsPrefix_FirstHalf(Options: (en-US, IgnoreSymbols, False))
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (en-US, IgnoreSymbols, False))
System.Globalization.Tests.StringSearch.IsPrefix_DifferentFirstChar(Options: (en-US, IgnoreSymbols, False))
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (pl-PL, None, False))
System.Globalization.Tests.StringSearch.IsSuffix_DifferentLastChar(Options: (en-US, IgnoreSymbols, False))
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (, IgnoreCase, True))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (en-US, IgnoreCase, True))
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (, None, True))
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (en-US, IgnoreCase, True))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (en-US, None, True))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (, IgnoreCase, True))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (en-US, IgnoreSymbols, False))
Docs
Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository