Runtime: [Proposal] Add string.RemoveCharacters(string)

Created on 24 Aug 2020 · 17Comments · Source: dotnet/runtime

Background and Motivation

How often do we have written something like:

string stripped = "cd cd ab cd ab cd cd".Replace("ab", string.Empty);

only to remove all string occurrences from a string. We did not used the existing string.Remove method becausestring.Remove has no overload that takes a string. In order to achieve the same we would have to create a loop calling string.IndexOf first to find all occurrences and call string.Remove on each of them. This is a lot of code.

And even when removing a single character, we probably have written it as a string, because there is no such as an empty char.

string stripped = "c d c d a c d a cd cd".Replace("a", string.Empty);

Instead of replacing a string with string.Empty (which defacto removes it so we abused the API), there should be a method which removes the string or char directly.

Proposed API

namespace System
{
    public sealed class String {

+        public string RemoveCharacters(this string, string value)
+        public string RemoveCharacters(this string, char value)
+        public string RemoveCharacters(this string, params string[] value)

    }
}

Usage Examples

``` C#
string stripped = "cd cd ab cd ab cd cd".RemoveCharacters("ab");
string stripped = "c d c d a c d a cd cd".RemoveCharacters('a'); //can slightly be more faster when providing a char instead of string

string stripped = string.RemoveCharacters("rn", " ", "-")` //instead of string.Replace("rn", string.Empty).Replace(" ", string.Empty).Replace("-", string.Empty)

## Alternative Designs

Alternatively we take the existing [string.Remove](https://docs.microsoft.com/en-us/dotnet/api/system.string.remove?view=netcore-3.1) method and only add overloads to it instead of introducing a new method name.

```diff
namespace System
{
    public sealed class String {

+        public string Remove(this string, string value)
+        public string Remove(this string, char value)
+        public string Remove(this string, params string[] value)

    }
}

Risks

None

//FYI: Edited because I've noticed there is a string.Remove but accepts only a range / integers instead of a string or char to look for and remove all occurrences. So I've renamed my method to RemoveCharacters. But string.Remove feels better and avoids adding a new method, we could just add more overloads to it. I leave the final name up to the .NET team.

api-suggestion area-System.Runtime

Source

Symbai

👍7

Most helpful comment

FWIW most of the real-world empty replacement examples I see are things like string.Replace("\n", "") and string.Replace("-", "").

GrabYourPitchforks on 24 Aug 2020

👍3

All 17 comments

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

Dotnet-GitSync-Bot on 24 Aug 2020

FWIW most of the real-world empty replacement examples I see are things like string.Replace("\n", "") and string.Replace("-", "").

GrabYourPitchforks on 24 Aug 2020

👍3

The name ReplaceCharacters would be misleading.
Such name as "ReplaceCharacters" suggests that the argument is more likely a set of characters rather than a sequence to replace. From a method with such a name i would expect it to replace each individual character, if it is found in the argument string.

george-polevoy on 24 Aug 2020

@Symbai

Instead of replacing a string with string.Empty (which defacto removes it so we abused the API), there should be a method which removes the string or char directly.

Why? Do you think such code would be easier to understand? Or that it would be faster? Or some other reason?

If it's (at least partially) about performance, can you provide benchmarks showing how much faster the new methods would be?

svick on 24 Aug 2020

The name ReplaceCharacters would be misleading.

Yes, I fully agree. Actually it should still be "Remove" but just with overloads for providing a string and a char. But I still leave this to the .NET team, as the name of the method is a minor thing to me. Just want to avoid of someone says "Ha, this already exist" while the existing is not helpful on this.

Why? Do you think such code would be easier to understand? Or that it would be faster? Or some other reason?

Both. If a developer writes something like this, he never means to replace something but to remove something. Otherwise he wouldn't use string.Empty or "". Regarding a benchmark, this will be difficult. You expect me to write a full API method. To be honest this is out of scope for me. I expect a method to be faster since it doesn't have to replace but only to remove characters from a string. If it turns out that there is a little to nothing performance improvements, I wouldn't mind, although I still believe there could be some benefit (perhaps using some Span.Slice?!). But let me ask you something instead. Why do you expect developers to use a replace method, when they usually want to remove a string? The only answer to this can be that "it just works" or "it has been like that for years". I am right? So what's your point here being against this API that just make things more clear in worst scenario?

Symbai on 25 Aug 2020

So what's your point here being against this API that just make things more clear in worst scenario?

I'm not necessarily against this API. But every new API has its costs: apart from the obvious ones (it has to be implemented, reviewed and documented), there are also others (it has to be maintained for a very long time, it makes figuring out which method you should use harder, the opportunity cost, etc.). And that means every new API has to properly justify its existence.

So, if the new API was only marginally better, "it just works" sounds like a good justification to keep only the old API to me. And it's up to you, and others who agree with you, to convince the decision makers (and BTW, I am not one of those) that the costs are worth it.

svick on 25 Aug 2020

👍1

As @svick says, API's have costs, which he listed very well. Consider someone new to .NET, who has a string and hits dot. They have to hunt through the list of API's. Maybe they see Replace(char, char) and RemoveCharacters(char). They are looking at code that does one, should they use the other? Is it faster? Maybe the code now has a mixture.

This seems to have no performance, efficiency, or discoverability benefits over an API that folks have used successfully for a long time.

danmosemsft on 25 Aug 2020

Interesting https://lemire.me/blog/2017/01/20/how-quickly-can-you-remove-spaces-from-a-string/

iSazonov on 25 Aug 2020

I have to disagree with @danmosemsft that there is negative value for new users with this API, especially if we name it simply Remove(string, string). When I was new to .NET and had to remove a substring from string for the first time, the method I thought would be the answer was string.Remove, but that didn't have the right signature, so I went and googled the solution. Granted, it was easy, but if we had string.Remove(string, string) back then, I would have just used that, no SO or docs required.

It always seemed to me that string.Replace("whatever", string.Empty) is a bit of a hack. Yes, hacks that are commonly used become idioms, but I personally do not think this is a good one.

SingleAccretion on 25 Aug 2020

👍2

Maybe they see Replace(char, char) and RemoveCharacters(char). They are looking at code that does one, should they use the other? Is it faster? Maybe the code now has a mixture.

_Honestly I think people will be more confused about things like Directory.GetFiles and DirectoryInfo.EnumerateFiles (which one is better? Which one is faster?) and all the different ways of retrieving the application executable path where even the answers on stackoverflows are differently because there are so many ways and APIs for it_.

In our case you always choose the API that makes sense. A replace method replaces, a remove method removes. If you want to remove, you choose the remove method. We do HAVE a string.Remove already. All it needs and what this proposal is about is an overload that accepts a string or a char, because its extremely often used. The reason why developers dont use the existing string.Remove is because they first would call string.IndexOf and also because they often want to replace all occurrences and not just one, which then requires to write a loop. This is why most take the replace method.

So, if the new API was only marginally better

IF, and what if not!? What if the implementation also allows passing multiple strings to remove and we can write something like this: string.Remove("\r\n", " ", "-") instead of string.Replace("\r\n").Replace(" ").Replace("-") ? It not only will be faster but also allocates less. Interestingly I've seen the last one also quite often myself and I would also write it myself this way. But only because the alternative and correct version as of now requires much more code and a loop.

Symbai on 25 Aug 2020

👍1

What if the implementation also allows passing multiple strings to remove and we can write something like this

That was only added to the proposal 6 hours ago. It wasn't part of the suggested API surface when we looked at it.

Anything else that might be missing? Do we need overloads that take params string[], etc.?

GrabYourPitchforks on 25 Aug 2020

That was only added to the proposal 6 hours ago.

The normal proposal should be still faster. It takes one parameter less. Svick for some reason doubt that and brought up performance as a reason against this proposed overload. Since I don't have written the implementation and cannot give him benchmark numbers, I gave him an extreme example where its obvious. Please note that I've put this in an "extension" note. That means it can be added but doesn't has to. Its meant to be as an "things we can do to improve it further, once the original proposed API is accepted".

Do we need overloads that take params string[]?

Yes it might be better considering these APIs also take an array as parameter:

string.IndexOfAny
string.Join
string.Concat
string.Trim

So it would line up with other existing APIs. Should I edit the first post again?

Symbai on 25 Aug 2020

So it would line up with other existing APIs. Should I edit the first post again?

Sure - feel free to edit the post as many times as you need to. :)

I can see use cases for removing all occurrences of certain characters like '-' or '\r' from a string instance. Do you think it's common to want to remove multiple characters: '-' _and_ '\n' together? What kind of code do you see benefitting from this?

GrabYourPitchforks on 25 Aug 2020

What kind of code do you see benefitting from this?

It heavily depends on the scenario but an extremely common real world scenario is when the line separator is unknown. Different OS, different line separator and .NET Core is cross platform compatible. Thus something like string.Replace("\r\n", string.Empty).Replace("\n", string.Empty) is expected. More specific scenarios are:

Stripping invalid file path characters
Stripping wildcard characters
Stripping alphabetic or numbers
Stripping invalid URL characters
etc.

These are common scenarios. But I also see something like string.Replace(string1, string.Empty).Replace(string2, string.Empty) quite often, where string1 and string2 are project depending strings. So its hard to tell you a common scenario for this, but yeah it happens.

Symbai on 25 Aug 2020

Stripping invalid file path characters

Stripping invalid URL characters

I wouldn't recommend this API for those scenarios. For scenarios such as automatic slug or filename generation, the best thing to do is to keep characters which are allowed, throwing away everything else.

Stripping wildcard characters

Stripping alphabetic or numbers

Are these all that common? For alphanumeric characters in particular, I don't think we'd want to encourage people to call the API string.RemoveChars(<here's an array of 62 chars>). Once we start getting into those scenarios we're basically talking string.RemoveWhere(Predicate<char> filter), at which point your code would look like:

string myString = GetSomeString();
myString = myString.RemoveWhere(char.IsLetter); // strawman API

GrabYourPitchforks on 26 Aug 2020

I don't think we'd want to encourage people to call the API

Yes, this was a bad idea, sorry. When we talk about smaller arrays someone would like to remove from a string AND the content of the array are common strings/characters I can currently only think about everything which has the same meaning can be written differently. Line Endings: \n \r\n, quotes: " ', thousand separators: . , etc.

I've downloaded this, WPF and Winforms repo and did a regex seach for .Replace().Replace() pattern and like I said previously, most of the time it wants to remove project depending strings, some examples:

https://github.com/dotnet/runtime/blob/def5b48f55646d59faee7a92d26e648e67a104c4/src/libraries/System.ComponentModel.Annotations/src/System/ComponentModel/DataAnnotations/FileExtensionsAttribute.cs#L34-L35
https://github.com/dotnet/runtime/blob/def5b48f55646d59faee7a92d26e648e67a104c4/src/libraries/System.Numerics.Tensors/tests/TensorTests.cs#L2166

The proposed overloads which removes a single string/character are used more often.

https://github.com/dotnet/runtime/blob/def5b48f55646d59faee7a92d26e648e67a104c4/src/libraries/Common/tests/System/Diagnostics/DebuggerAttributes.cs#L154
https://github.com/dotnet/runtime/blob/def5b48f55646d59faee7a92d26e648e67a104c4/tools-local/tasks/installer.tasks/BuildTools.Publish/CloudTestTasks/AzureHelper.cs#L137

Symbai on 26 Aug 2020

Add

public string Remove(this string, params char[] value)

kronic on 31 Aug 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Default hashcode for struct with double or float field is wrong

jkotas · 3Comments

CoreCLR fails to run when mlock is unavailable

omajid · 3Comments

System.Text.Json default DateTimeZoneHandling

btecu · 3Comments

Getting error while installing the VSTS Agent on Linux RedHat

sahithreddyk · 3Comments

Add Bson integration

GitAntoinee · 3Comments