String.Intern currently accepts a string parameter. When processing a ReadOnlySpan<char>
it would be nice to be able to intern a string without a necessary string allocation upfront. This would be especially useful for strings that would be used frequently.
For this reason I would request the following overloads added to string:
public class String
{
// Existing APIs
public static string Intern(string str);
public static string? IsInterned(string str);
// New APIs
public static string Intern(ReadOnlySpan<char> str);
public static bool IsInterned(ReadOnlySpan<char> str);
}
public static string string.Intern(ReadOnlySpan<char> str) public static string string.Intern(Span<char> str)
How are these supposed to work without allocating a string? You mean if a matching string is already in the intern pool?
@khellang
Here is an example I think could use this API. I want to parse some token that we know is very common. So when we read the data we want to pass it to string.Intern(string)
to ensure we are not creating duplicate strings.
But today we have to do this:
Token ProcessToken(Span<char> token)
{
var temp = new String(token);
return new Token {
Value = string.Intern(temp);
}
}
temp
gets thrown away as garbage 99.9% of the time.
This would be nicer:
Token ProcessToken(Span<char> token)
{
return new Token {
Value = string.Intern(token);
}
}
Note that string.Intern
uses a global singleton. It is required to intern the string 100% of the time, across all threads. It means that it needs to take locks, some of the time at least. And the current implementation never releases the string once it has been interned.
You typically get much better performance results by using a local interning table that does imperfect interning (i.e. it is ok for it to create duplicate strings in rare cases - hash collisions, multiple threads competing to create the same string, etc.)
We can still add this API, but it is unlikely to be very useful to actually make things faster.
Wouldn't we always recommend against string.Intern
anyway for reasons @jkotas states? If so, I think we should close this rather than extend its API.
@jkotas @danmosemsft now that the C# compiler will be getting module initializers, and it already has support for embedding byte arrays in PE file can we please revisit this? With module initializers there becomes a more defined place (even if you consider it as a convention) to do this work.
This would be very useful to the Bing team.
I do not see how module initializers change the equation on this.
Could you please provide details for how you would like to use this API, and what do you expect to save by doing that?
We have a ton of string data (not strings, but UTF-16 bytes in our PE files). We would use module initializers to access this data in the file and call string.Intern on them.
Ok, I have been thinking about this and I see situations where this may be useful.
A big concern of mine is that folks might see a spanified version of string.Intern
and be tempted to pass untrusted data through it in an attempt to be more allocation-efficient. For instance:
ReadOnlySpan<char> requestLine = GetRequestLine(); // = "GET /path HTTP/1.1" (as a span)
ReadOnlySpan<char> httpVerb = requestLine[..requestLine.IndexOf(' ')]; // = "GET" (as a span)
string verbAsInternedString = string.Intern(httpVerb); // = "GET" (as a string)
Basically, the sample above shows an "allocation-free" way of extracting the literal strings "GET" / "POST" / etc. from the request line, but it introduces a security flaw in the application. This allows the request payload to send arbitrary nonsensical verbs to the application, and those verbs will get added to a static singleton table which is never cleared. Eventually this will result in OOMing the web server process: a denial of service attack.
To discourage such use, I would instead recommend the following API:
public sealed class string
{
public static bool TryGetInternedString(ReadOnlySpan<char> str, [NotNullWhen(true)] out string? internedString);
}
This allows you to query whether the provided buffer exists in the intern table using a familiar try pattern. If the buffer does not exist in the intern table, it outs null. If you want to add the buffer to the intern table, you're forced to bounce through ROS<char>.ToString
first. The call site clearly performs an allocation to perform the interning operation, which reduces the odds that somebody sees this as convenient "allocation-free" API that should be called on a per-request basis or within a hot path.
@GrabYourPitchforks Good catch as always.
TryGetInternedString
The pit of failure with API like this is that it depends on the global intern table being populated with the string you care about. The population of the global intern table is not very predictable. For example, the behavior of this method would depend on how the JIT inliner works (inlining happens to populate the global intern table as side-effect). The existing IsInterned
method has this problem too, but TryGetInternedString
would increase the exposure.
This suggests that we should go back to the drawing board with this API.
inlining happens to populate the global intern table as side-effect
This isn't quite my area of expertise. But doesn't this mean that if somebody calls string.Intern("Foo")
at app start, then when a method containing ldstr "Foo"
is compiled it'll end up pointing to the previously-interned string instance?
Rather than integrating this with string.Intern, at least the scenarios I'm aware of here would be addressed if we had a Dictionary<string, ...>.TryGetValue(ReadOnlySpan<char>, ...)
extension method, ala https://github.com/dotnet/runtime/issues/27229. The sticking point I've seen there ends up being the comparer; we can do the "right thing" for comparers we know about (e.g. any built-in StringComparer), but for arbitrary comparers, we'd need to ToString the span to be able to invoke the comparer, which is far from ideal. I wonder whether we could stomach throwing from such a TryGetValue if the dictionary was using a comparer we didn't understand; in the future if we introduced an ISpanEqualityComparer or some such thing, we could make it work in more cases. But I imagine the 99% case for such an API will use one of the built-in StringComparers, probably Ordinal or OrdinalIgnoreCase.
I'm not clear why we would want to double down on String.Intern, which was sort of recommended against as long as I can remember for reasons mentioned above and one of those is the one Levi points to. What might be worth doing is offering a first class interning interface that is pluggable. This was discussed here already:
https://github.com/dotnet/runtime/issues/20073#issuecomment-276829637
The scenario where @mjsabby 's wanted to use this API was a static initialization of large configuration data. It does not have the problem with denial-of-service and there are benefits from interning the program string constants and the configuration data keys against the same pool. It may be too niche case to warrant adding the API.
Most helpful comment
Note that
string.Intern
uses a global singleton. It is required to intern the string 100% of the time, across all threads. It means that it needs to take locks, some of the time at least. And the current implementation never releases the string once it has been interned.You typically get much better performance results by using a local interning table that does imperfect interning (i.e. it is ok for it to create duplicate strings in rare cases - hash collisions, multiple threads competing to create the same string, etc.)
We can still add this API, but it is unlikely to be very useful to actually make things faster.