I came across this idea after seeing that Java has implemented it, see JEP 254 for the full details, plus this InfoQ article for some extra info.
I'm aware of the CoreFX Labs effort to implement a UTF-8 string library, but this proposal doesn't require using a different library; it changes the implementation of all strings in the CoreCLR (a bit ambitious, I know!).
Note: this optimisation works because many types of strings in web-based applications can actually be encoded as ASCII or ISO-8859-1 (Latin-1), for instance URLs, HTTP headers, etc. So even in internationalised applications there are still some possible savings.
One way of implementing this is to change the internal string layout to the following, with an extra `m_type` field:

``` C#
public sealed class String : IComparable, ICloneable, IConvertible, IEnumerable,
    IComparable<String>, IEnumerable<char>, IEquatable<String>
{
    [NonSerialized] private int  m_stringLength;
    [NonSerialized] private byte m_type;
    [NonSerialized] private char m_firstChar;
    ....
```
Assuming that there are no padding implications for adding a field like this?
Another approach is to have an abstract `String` base class and then use either a `UTF16` or `Latin1` implementation via virtual dispatch, but this may have performance implications.
Then the (pseudo) code for the `Equals` method becomes:

``` C#
public bool Equals(string other)
{
    if (other == null)
        return false;
    if (this.type != other.type)
        return false;
    if (type == LATIN1)
        return StringLatin1.Equals(this, other);
    else
        return StringUTF16.Equals(this, other);
}
```
Obviously I know this is not a small undertaking; it would require changes to the `String` and `StringBuilder` classes, plus several others. Also the JIT or intrinsics would have to be involved to ensure that all the methods that require an extra `if (type == LATIN1)` check are as fast as possible (`Length`, `IndexOf`, etc).
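To make that extra check concrete, an `IndexOf(char)` overload might dispatch roughly like the sketch below (purely illustrative; `StringLatin1`/`StringUTF16` are hypothetical internal helpers mirroring the `Equals` pseudocode above):

``` C#
// Illustrative only: assumes the hypothetical m_type/LATIN1 field from above
// and hypothetical per-encoding helpers; not the actual CoreCLR implementation.
public int IndexOf(char value)
{
    if (m_type == LATIN1)
    {
        // A char above 0xFF can never occur in a Latin-1 encoded string.
        if (value > 0xFF)
            return -1;
        return StringLatin1.IndexOf(this, (byte)value);  // byte-wise scan
    }
    return StringUTF16.IndexOf(this, value);             // char-wise scan
}
```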
Trade-offs:
- Strings become more cache-friendly, which _may_ give better performance
- Breaks COM or P/Invoke interop that relies on the current string format
- For strings that can't be encoded as ASCII or ISO-8859-1 (Latin-1), this will have an overhead for no gain

This feature is not worth doing if it doesn't actually save space for common types of .NET applications. So that this can be measured I wrote HeapStringAnalyser, which makes use of CLR MD (see Analysing .NET Memory Dumps with CLR MD for more info). @nickcraver from Stack Overflow was kind enough to run my tool on one of their memory dumps and it gave the following output:
```
...
Overall 30,703,367 "System.String" objects take up 4,320,235,704 bytes (4,120.10 MB)
Of this underlying byte arrays (as Unicode) take up 3,521,948,162 bytes (3,358.79 MB)
Remaining data (object headers, other fields, etc) is 798,287,542 bytes (761.31 MB), at 26 bytes per object

Actual Encoding that the "System.String" could be stored as (with corresponding data size)
    3,347,868,352 bytes are ASCII
        5,078,902 bytes are ISO-8859-1 (Latin-1)
      169,000,908 bytes are Unicode (UTF-16)
Total: 3,521,948,162 bytes (expected: 3,521,948,162)

Compression Summary:
    1,676,473,627 bytes Compressed (to ISO-8859-1 (Latin-1))
      169,000,908 bytes Uncompressed (as Unicode/UTF-16)
       30,703,367 bytes EXTRA to enable compression (one byte field, per "System.String" object)
Total: 1,876,177,902 bytes, compared to 3,521,948,162 before compression
```
In this case the `String` data could be compressed from ~3,358 MB down to ~1,789 MB, a pretty impressive saving! But this test would have to be repeated on a far wider range of apps to see if the memory savings are repeated.
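For reference, the per-string classification that drives those numbers boils down to scanning each character, roughly like the sketch below (illustrative only, not the actual HeapStringAnalyser code):

``` C#
// Minimal sketch of the per-string classification: a string is ASCII if every
// char is <= 0x7F, Latin-1 if every char is <= 0xFF, otherwise it needs full UTF-16.
enum StringEncodingClass { Ascii, Latin1, Utf16 }

static StringEncodingClass Classify(string s)
{
    var result = StringEncodingClass.Ascii;
    foreach (char c in s)
    {
        if (c > 0xFF)
            return StringEncodingClass.Utf16;    // needs 2 bytes per char
        if (c > 0x7F)
            result = StringEncodingClass.Latin1; // fits in 1 byte, but not ASCII
    }
    return result;
}
```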
Update: there's a really thorough write-up by @evincarofautumn who's actually implemented this suggestion against the Mono code-base
HTTP header field values were updated to ASCII rather than ISO-8859-1 in RFC 7230, 3.2.4 Field Parsing.
Latin-1 conflicts with UTF-8 whereas ASCII is compatible (so you can cross-detect between the two).
Interestingly the current string format does already have an `IsAscii` flag on it.
Though it considers `'` and `-` to not be ASCII :frowning:
However changing the string to be 8-bit rather than 16-bit based on this flag would likely break `fixed` statements on strings without special handling (and passing to P/Invoke etc).
ASCII Strings is something that @evincarofautumn experimented with a lot for Mono: the proposal is here http://www.mono-project.com/docs/advanced/runtime/docs/ascii-strings/ (and there are a few more details in the mono-devel list thread about it).
I'm sure he can go into great detail about the pros/cons :smile:
However changing the string to be 8-bit rather than 16-bit based on this flag would likely break `fixed` statements on strings without special handling (and passing to P/Invoke etc).
Hmm, I didn't think about compatibility with code that uses `fixed` against strings, yeah that's a problem. Looks like the Mono experiment solved it by disabling `fixed` on strings.
Looks like the mono experiment solved it by disabling fixed on strings.
Could always `stackalloc` length * 2 bytes at `fixed` start, then copy and widen with `_mm_unpacklo_epi8` & `_mm_unpackhi_epi8`, then compress back at `fixed` close. Would be slow though...
Or allow both `fixed byte*` and `fixed char*`.
Why use ASCII instead of Latin-1 for the 1-byte version? Latin-1 would permit using one byte for a lot of accented characters of Western Europe such as 'è', 'ò', 'ù'...
Could it make sense to have a third type at this point, for UTF-32, in the case of strings containing characters that cannot be represented in UTF-16 (Chinese characters, Klingon, emoticons, ...) without resorting to the kludge of surrogate pairs (which are not really supported in .NET[1])? It probably would not accomplish a lot of compression in this case but the implementation could be more correct.
[1] De facto .NET strings are really UCS-2, not UTF-16.
Why use ASCII instead of Latin 1
ASCII and UTF-8 are compatible; Latin-1 and UTF-8 are not, so they require conversions between the two.
UTF-8 is just as compact for the non-accented characters and is currently the general interchange format, so using Latin-1 would add conversion load, and allocation, to and from UTF-8; whereas ASCII requires no conversion.
Also, using Latin-1 in places where only ASCII is allowed (HTTP headers etc) adds a validation step to and from. There are fast ASCII algorithms that don't apply to Latin-1, e.g. case-insensitive compare (ignore a single bit), upper and lower casing (flip a bit) etc.
e.g. If a string is known to be ASCII (1 byte per char) it requires no validation steps and its bytes can be used directly, and conversion to UTF-8 is instant, as its bytes are already also UTF-8.
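The single-bit tricks mentioned above can be sketched like this (a minimal illustration; it assumes the bytes are already validated as ASCII, and the case-insensitive compare only holds for letters):

``` C#
// ASCII-only bit tricks: upper- and lower-case letters differ only in bit 0x20.
static byte ToUpperAscii(byte b) =>
    (byte)(b >= (byte)'a' && b <= (byte)'z' ? b & ~0x20 : b);

// For two ASCII letters, ignoring bit 0x20 compares case-insensitively.
static bool EqualsIgnoreCaseAsciiLetters(byte a, byte b) =>
    (a | 0x20) == (b | 0x20);
```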
The reason Mono has disabled fixed on strings is purely to help us locate the code paths that have fixed in as many places as possible and upgrade the code.
The idea is to perform a copy of the data on demand if necessary when a "fixed" that has not been upgraded is found, preserving compatibility at the expense of a string copy.
I once also tried to make the case that .NET was UCS-2 and not UTF-16 in front of the whole ECMA committee, only to find out that I was wrong :-)
@benaadams, I live in Europe and we use Latin-1. Consider the cultural impact of using Latin-1 from my European perspective. For the US, ASCII might seem like a slam dunk, but if you think about this with European eyes it's not.
Could a compiler switch be a good middle ground? Use ASCII as default, but open up for Latin-1...
@HansOlavS I don't think Latin-1 would be great; I'm not thinking of it from a US perspective (I live in Europe also), more from high-throughput interchange formats.
Latin-1 (ISO-8859-1) is also a superset of ASCII but requires conversion to and from String, UTF-8 and Windows native. Only 6% of webpages use ISO-8859-1. It also doesn't actually cover all European languages, with ISO-8859 having 16 variants for the fuller coverage.
Relatedly, you need to be using ISO-8859-7 (Latin/Greek), ISO-8859-15 (Latin-9) or ISO-8859-16 (Fantasy mix) to include the Euro symbol €, as it's not included in Latin-1.
The Latin formats don't have the quick processing properties of ASCII, e.g. case-insensitive compare, uppercasing, lowercasing, and they require character translation look-ups; so at best they become a slightly more compressed format but with translation penalties.
It might be worth having a string8byte opaque intermediate format that supports string operations such as indexOf, split, etc, and then requires translation to String, UTF-8 and ASCII (for subset validation).
However, I'm not sure that provides a great advantage over using `Span<byte>` or `byte[]` and the `Encoder` namespace?
Disclaimer: this is way out of my league... ;)
I get that ASCII aligns neatly with a lot of low-level protocols (like HTTP), but my concern was memory footprint and reduction when loading up my .NET app that contains a lot of Latin-1 compatible strings. It would be super-nice to have those stored as 8-bit chars in memory instead of the 16 bits .NET now uses.
If you're familiar with Delphi (Anders Hejlsberg's predecessor to C#), it had 'string' (which was an ANSI string, depending on your computer's localization settings or whatever) and 'widestring' which was Unicode (not sure of the actual encoding).
Actually, my very first reaction to C# and .NET (reading a C# book back in 2002) was that all strings were "Unicode" and 16 bits. I thought that was a strange choice regarding memory consumption.
And regarding special cases like €, I don't care as that is an edge case and I'm happy to have it stored in UTF-16.
ASCII has other benefits: when marshaling out to UTF-8 it requires no processing, it is a direct copy.
Anything that uses the top bit would prevent this optimization from working. That was one of the reasons for Mono's work to use ASCII.
I get the ASCII-to-UTF-8 compatibility and marshaling and P/Invoke and such. I guess I'm still a bit uneasy with .NET's decision to have all strings in memory as 16-bit UTF-16, and I miss the old days in Delphi where I could have ANSI (Latin-1) string and widestring and make a conscious decision on where to use what. But maybe this concern is totally beside the point in this thread and I misunderstood.
By Latin-1 I meant the "proper" implementation done by Microsoft: https://en.wikipedia.org/wiki/Windows-1252.
The opened issue is about compressing the strings, and wasting half of the byte values to stay compatible with ASCII does not make sense to me. To check whether a Latin-1 character is valid ASCII, could you not simply check if its byte is < 128?
Apart from Java, Python has done a similar thing to the one I was thinking of:
http://legacy.python.org/dev/peps/pep-0393/
Characters that are not representable in UCS-2 would use UCS-4 (4 bytes).
If we were interested in being compatible with UTF-8 we could use UTF-8 directly, but it is not the most compact representation: a lot of simple characters such as 'è' become 2 bytes, and the more complex ones can become up to 4 bytes long!
I was pretty sure that the characters between 128 and 255 were the same between Latin-1 and UTF-16, am I wrong?
Why not use a separate type?
@HansOlavS It would be possible to use Latin-1 as the compact encoding. Scanning would be just as fast because the second Unicode block (U+0080–U+00FF) corresponds to the upper range of Latin-1, as @fanol rightly pointed out. We could also do some clever hacks with the range U+0080–U+009F, which contains rarely used control characters; for example, encoding € as U+0091 ("Private Use 1").
Some ASCII-only optimisations would no longer work. Marshalling would require an extra scan per string. And since Latin-1 has incomplete coverage for many languages, it might not make much of a difference; we should measure real-world apps.
I don't like that this optimisation privileges some languages over others, but frankly it's because the character encodings are already biased.
@zabulus All the existing APIs are using `String`, so if we change the representation of `String` then existing code will benefit without modification, as long as it's not using `unsafe`/`fixed`. If we add a new type, only new code will be able to take advantage of the optimisation.
@zabulus
Why not use a separate type?
Sure, you can use a separate type; that's what the UTF-8 string library is trying to achieve.
However this proposal was to see if something more wide-ranging and more deeply embedded in the CLR would be possible, with the side-effect that you don't have to make any code changes to use it.
@zabulus All the existing APIs are using String, so if we change the representation of String then existing code will benefit without modification, as long as it's not using unsafe/fixed. If we add a new type, only new code will be able to take advantage of the optimisation.
I see one more concern about such strings, besides unsafe/fixed: marshaling such strings with P/Invoke, especially when using the new strings with the W(ide) versions of the Win32 APIs.
As Latin-1 is a subset of UTF-16, the marshalling code, if it has to convert to a C/C++ wstring, would simply expand each byte by adding a 0 byte; probably one could make this faster using SSE intrinsics.
If I'm not wrong, the algorithm described there that works for the ASCII part of a UTF-8 string would work to convert a Latin-1 string to UTF-16 too:
https://woboq.com/blog/utf-8-processing-using-simd.html
It is obvious that in some cases this implementation will have a performance cost; it depends on whether it is more important to use less memory or to be faster...
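A minimal sketch of that widening step using `System.Numerics.Vector` (illustrative only; `Vector.Widen` zero-extends each byte into a ushort lane, which is exactly the "add a 0 byte" operation, with a scalar tail for the remainder):

``` C#
using System;
using System.Numerics;
using System.Runtime.InteropServices;

// Widen Latin-1 bytes to UTF-16 code units (each char is just the byte value,
// since U+0000..U+00FF map 1:1 onto Latin-1). Illustrative sketch only;
// assumes destination.Length >= source.Length.
static void WidenLatin1(ReadOnlySpan<byte> source, Span<char> destination)
{
    Span<ushort> dest16 = MemoryMarshal.Cast<char, ushort>(destination);
    int i = 0;

    // Vectorised path.
    while (i + Vector<byte>.Count <= source.Length)
    {
        var bytes = new Vector<byte>(source.Slice(i));
        Vector.Widen(bytes, out Vector<ushort> lo, out Vector<ushort> hi);
        lo.CopyTo(dest16.Slice(i));
        hi.CopyTo(dest16.Slice(i + Vector<ushort>.Count));
        i += Vector<byte>.Count;
    }

    // Scalar tail: zero-extend one byte at a time.
    for (; i < source.Length; i++)
        destination[i] = (char)source[i];
}
```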
By Latin-1 I meant the "proper" implementation done by Microsoft: https://en.wikipedia.org/wiki/Windows-1252.
I was pretty sure that the characters between 128 and 255 were the same between Latin-1 and UTF-16, am I wrong?
Characters 128 - 159 differ between the Windows-1252 variant and UTF-16.
To check whether a Latin-1 character is valid ASCII, could you not simply check if its byte is < 128?
Then additionally in the fast-path algorithms you need to do a validation check per character, which makes them less fast-path... (you can't use a straight int, long or vector bitwise OR).
Sorry, I guess I didn't actually ask any questions when I proposed this. I guess I'm wondering:
@fanol Right, marshalling to UTF-16 would still be cheap. You could do it with ~2 instructions for each 8 bytes of input (`movq` + `punpcklbw`). UTF-8 would be slightly more involved.
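In C# terms that instruction pair maps roughly onto the `System.Runtime.Intrinsics.X86` API as below (a hedged sketch, not production marshalling code):

``` C#
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Zero-extend 8 Latin-1/ASCII bytes into 8 UTF-16 code units with SSE2.
// Sketch only: assumes Sse2.IsSupported and that 'dst' has room for 8 ushorts.
static unsafe void Widen8(byte* src, ushort* dst)
{
    Vector128<byte> bytes = Sse2.LoadScalarVector128((long*)src).AsByte(); // movq: load 8 bytes
    Vector128<ushort> widened =
        Sse2.UnpackLow(bytes, Vector128<byte>.Zero).AsUInt16();            // punpcklbw with zero
    Sse2.Store(dst, widened);                                              // write 16 bytes
}
```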
@mattwarren Around the time we had a discussion on mono-devel-list about my proposal, some MS people (@vancem) indicated that they've thought about doing this in the past. The objections were mainly that it's a large, invasive change (we don't want to break stability & backward compatibility) and the `get_OffsetToStringData` API used internally by `fixed` is less than ideal.
Around the time we had a discussion on mono-devel-list about my proposal, some MS people (@vancem) indicated that they've thought about doing this in the past. The objections were mainly that it's a large, invasive change (we don't want to break stability & backward compatibility) and the `get_OffsetToStringData` API used internally by `fixed` is less than ideal.
@evincarofautumn that's interesting to know, thanks
Right, marshalling to UTF-16 would still be cheap. You could do it with ~2 instructions for each 8 bytes of input (movq + punpcklbw).
Though not for the Windows-1252 variant, due to the conflict in characters 128 - 159.
OK, the Microsoft version of Latin-1, while more clever in its use of the available characters, is not suitable because it is not a subset of UTF-16, while the real Latin-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1) is, so it is this that should be used. Never mind '€' and the other characters in the block 128 - 159; strings containing them will simply not be compressible.
To answer @mattwarren's question on whether changing the internal representation of a string has been considered before, the short answer is YES. In fact it has been a pet desire of mine for probably over a decade now. What was clear then and has held true for quite some time is that:
- Typical apps have 20% of their GC heap as strings.
- Most of the 16-bit characters have 0 in their upper byte. Thus you can save 10% of typical heaps by encoding in various ways that eliminate these pointless upper bytes.
Now the first issue you have is that while this saves GC heap size, it is not guaranteed to give a SPEED improvement. Let's just assume that the new representation is equally good at everything, but half the size. For things that fit in cache with the bigger size, you probably will get very little speed improvement. We know from work comparing 32 vs 64 bit applications that the GC heap expands by 1.6X when you go to 64 bit (pointers are bigger), and that results in a 10-15% performance loss. Thus we can compute that changing the representation of strings is likely to save 10% of the GC heap and thus is likely to save about 1.5 to 2.5% throughput. This is significant, but not dramatic. Keep in mind this assumes that all operations are equal.
Now it turns out that all operations are NOT equal, and thankfully the advantage is toward the NEW layout (if done well) most of the time. We know that the most common operations on strings include:
- Comparison
- Copy/Substring
- IndexOf
- Hashing
The good news is that all of these operations are SCANS, and thus if the encodings are SHORTER they are FASTER (fewer iterations). This bodes well.
What are the disadvantages? As you might suspect, compatibility is the big issue. There are two basic issues:
- The C# 'fixed' operator on strings
- Interop (marshalling) code that depends on the current UTF-16 layout
So the real thorny issue is this issue of 'fixed'. I believe the semantics of it are solvable (first, going forward you introduce a real operator, and for compat you simply recognize certain patterns that are emitted by our existing compilers and simply fail on the others).
However this only solves the correctness issue. The perf issue is still there. Most likely you solve this by introducing a string iterator and asking people to move to it, and/or use JIT optimization to allow some patterns to simply work well (e.g. foreach over the string using integer indexes).
Because this is not perfect, we are likely to allow some compat flag where you can ask to keep things the old way (UTF16) for those who have performance issues because of it.
This probably requires a bit of an audit of where 'fixed' is being used and determining just how bad the proposed solution above is, so we can determine what the guidance will be for replacing 'fixed' in the important cases at least.
But you can see the broad outline above: changing strings will give a 10% improvement for most GC heap sizes and is likely to yield a 1.5-2.5% throughput improvement, and maybe more (when the code is heavy in string operations that benefit from the new format), but there will definitely be people who are impacted (those using fixed, and those with hot interop paths passing strings). The main challenge is dealing with fixed, but there is also frankly at least a few man-months of simply dealing with the places in the runtime where we took a dependency on the layout of string (in the runtime, interop, and things like stringbuilder, and all the uses of 'fixed' in corefx).
Thus it IS doable, but it is at least moderately expensive (man-months), and the payoff is non-trivial but not huge.
Given that, I think the next steps would be to do a good inventory of where 'fixed' is used and design the mitigations (will we make a string iterator class? exactly what compiler magic will we have? ...). Also having microbenchmarks for important cases of 'fixed' (before and after optimization), as well as a microbenchmark for interop, would be useful.
Finally, I note that all of this just lets us CHOOSE what representation we want. We actually have A LOT of flexibility on exactly what the representation is. My recommendation is that we allow that to be flexible (it does not have to be UTF8 or ASCII with a type code; I am actually partial to fixed Huffman compression at runtime, which allows it to work well on Chinese). However, let's set that aside for right now. We can start with one of the suggestions above and change to something else later with basically no impact.
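To make the suggested microbenchmarks concrete, a minimal BenchmarkDotNet sketch comparing a `fixed` scan against a plain indexer scan could look like this (illustrative only; the workload and names are made up):

``` C#
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Illustrative microbenchmark for 'fixed' over strings vs. plain indexing.
// The workload (summing chars) is a stand-in for any scanning operation.
// Requires <AllowUnsafeBlocks> in the project file.
public class FixedStringScan
{
    private readonly string _text = new string('a', 4096);

    [Benchmark(Baseline = true)]
    public int IndexerScan()
    {
        int sum = 0;
        for (int i = 0; i < _text.Length; i++)
            sum += _text[i];
        return sum;
    }

    [Benchmark]
    public unsafe int FixedScan()
    {
        int sum = 0;
        fixed (char* p = _text)
        {
            for (int i = 0; i < _text.Length; i++)
                sum += p[i];
        }
        return sum;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<FixedStringScan>();
}
```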
I would love, love, love it if `string` were synonymous with `IString<T> where T : byte or char or uint` (or similar) and not `String`; that would allow custom implementations of `IString` such as `StringUtf8`, `StringAscii`, `StringUtf16`, etc. Of course, use of P/Invoke would require the signature to use the correct implementation of `IString`.
I for one would love to see UTF-8 everywhere.
The C# Fixed operator on strings
Um, developers should not assume that characters are the size of a `char` anyways, as UTF-16 supports multiple-`char` characters. In fact, I'd say changing to the suggested model would help developers avoid a whole class of bugs caused by very poor assumptions.
As for `unsafe` operations, the underlying array should remain accessible regardless of the implementation, otherwise we will be throwing an immense amount of performance optimization out with the bath water.
Given that, I think the next steps would be to do a good inventory of where 'fixed' is used and design the mitigations
Using wider types (`long*` on x64 and `int*` on x86) and `Vector<ushort>` over strings; which would also perform faster if the string was a more compact type.
Um, developers should not assume that characters are the size of a char anyways, as UTF-16 supports multiple char characters.
Normally a validate-to-ASCII pass is one step, with a fallback to using the `Encoding` namespace (which accepts `char*`) when the data is not ASCII. Equally, there is pressure to keep ASCII as `byte[]` for as long as possible for performance, but then there is no interop with any function that accepts strings, so the type must be widened at some point and converted to string.
An example of widening with validation is the unrolled `AsciiUtilities.TryGetAsciiString` in Kestrel's internals.
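The core of such a widen-with-validation step can be sketched as follows (a simplified scalar take on the same idea, not the unrolled Kestrel code):

``` C#
using System;

// Simplified widen-with-validation: copy bytes into chars, failing if any byte
// has the high bit set (i.e. is not ASCII).
static bool TryWidenAscii(ReadOnlySpan<byte> source, Span<char> destination)
{
    if (destination.Length < source.Length)
        return false;

    for (int i = 0; i < source.Length; i++)
    {
        byte b = source[i];
        if (b > 0x7F)
            return false;         // not ASCII: caller falls back to a full Encoding path
        destination[i] = (char)b;
    }
    return true;
}
```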
@vancem thanks for taking the time to answer my query so comprehensively! There are a lot of trade-offs I hadn't even considered, particularly around the impact on `fixed` code.
It's nice to know that someone has been thinking about this and that it may, someday, make its way into the CLR.
but there is also frankly at least a few man-months of simply dealing with the places in the runtime where we took a dependency on the layout of string (in the runtime, interop, and things like stringbuilder, and all the uses of 'fixed' in corefx).
Yeah I can imagine. For fun I modified a version of the CoreCLR source so that `String` had the following layout:

``` C#
public sealed class String
{
    private int  m_stringLength;
    private byte m_type;
    private char m_firstChar;
    ....
```

I struggled to get a "hello world" app to run; I got stuck fixing `StringBuilder`. The main problem is that when `String` is broken, nothing works, and you don't even get meaningful error messages because they rely on `String` and/or `StringBuilder` working!
A systemic change like this has to be done _very_ incrementally to retain any ability to debug the inevitable failures. It's a difficult and disruptive feature to implement. But, if I might be so bold, it's our responsibility as systems programmers and programming language implementors to do exactly this kind of hard work for the sake of others. I can't name an industry in which an improvement of 10% on a highly valued metric would be considered marginal.
A systemic change like this has to be done very incrementally
Absolutely agree. A very good first step would be to design and implement an alternative expansion of the `fixed` statement for strings that is compatible with this scheme.
Yes, the first steps are to fix the string abstraction so that it is 'complete' (that is, there is only a modest amount of code that depends on its layout). That includes a scheme for dealing with 'fixed' on strings and fixing up runtime/class library code to use the new abstractions. Note that this work does NOT break anything and can be done incrementally. After that abstraction is in place, we are in a much better place to actually change the internal format.
A slight wrinkle with the `fixed` expansion is: if an 8-bit string was changed to a 16-bit string during the time when it was `fixed` (by writing to the high bytes), it would need reallocating as a 16-bit string when leaving the `fixed` statement.
My expectation of how 'fixed' works is that it unconditionally copies the string data into an array of char and points it at that. The original data is never modified. You are not allowed to modify string data using the pointer a 'fixed' statement gives you (thankfully).
You are not allowed to modify string data using the pointer a 'fixed' statement gives you (thankfully).
You can currently; it's a pointer to the data with no copy, else `fixed` would be very slow (that doesn't mean you should, though).
Obviously trouble lies that way: if the string has ever been used, its flags such as `isAscii` (which is lazily evaluated) would go out of sync; or if it has been passed to anything, or is an embedded/interned string, it will all go to hell...
While, as you say, preventing the mutation of strings using 'fixed' is not enforced, it is illegal (large parts of the system assume that strings are read-only, including the security system). If we found anyone doing this (within or outside Microsoft), we would tell them to fix their code.
There is also a limit to what we will do for compatibility's sake. Fixing legitimate uses of a relatively rare but useful feature is one thing, making broken code work is another.
We should not spend effort trying to make the string-mutation case work. The problem is bad enough without it, and we will always have the 'compat hammer' of allowing users with evil code to use the old string layout (globally).
we will always have the 'compat hammer' of allowing users with evil code to use the old string layout (globally).
Seems fair :smile: And if you are doing it for speed you will likely change your code to work with the new way. You'd probably want a string constructor taking a `byte*` for stackalloc scenarios.
Will there be explicit conversions between types in the new `string`? `string.AsAscii()`? `string.AsUtf8()`? `string.AsUtf16()`?
Is there a way to know what type the string is? `string.IsAscii()`? `string.IsUtf8()`? `string.IsUtf16()`?
I don't think `char` can be one or two bytes depending on where it came from. Will there be a `string.ToByteArray()`?
@paulomorgado I think the idea here is to change the implementation details of `string`, not the public interface. I don't see why you would need any of those methods (except maybe for debugging, to ensure some specific string is compact).
And `char` is always two bytes, so, for example, the indexer would be implemented something like this pseudocode:
``` C#
public char this[int index]
{
    get
    {
        if (type == LATIN1)
            return (char)bytePointer[index];
        else
            return charPointer[index];
    }
}
```
Converting a `string` to a byte array can already be done using `Encoding.ASCII.GetBytes()`. I imagine that method would be optimized to just create a copy of the underlying byte array, when applicable.
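A hedged sketch of what that fast path could look like internally (hypothetical names; assumes the compact one-byte-per-char layout discussed above):

``` C#
using System;
using System.Runtime.InteropServices;

// Hypothetical fast path for Encoding.ASCII.GetBytes over a compact string.
// 'payload' stands in for the string's internal storage and 'isOneBytePerChar'
// for the m_type check discussed earlier; neither is a real CoreCLR API.
static byte[] GetAsciiBytes(ReadOnlySpan<byte> payload, bool isOneBytePerChar)
{
    if (isOneBytePerChar)
        return payload.ToArray();                 // straight copy, no transcoding needed

    // Otherwise the payload is UTF-16: narrow each 2-byte code unit,
    // substituting '?' for anything outside ASCII (as Encoding.ASCII does).
    ReadOnlySpan<char> chars = MemoryMarshal.Cast<byte, char>(payload);
    var result = new byte[chars.Length];
    for (int i = 0; i < chars.Length; i++)
        result[i] = chars[i] <= 0x7F ? (byte)chars[i] : (byte)'?';
    return result;
}
```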
@svick, I understood that and I think that's a good idea because it, as usual, will benefit already built applications.
However, if the change is made, future code might rely on knowing the difference.
Web APIs will almost exclusively use UTF-8 (and ASCII for some parts). On the other hand, Windows APIs will use UTF-16. What about mixed applications like Windows applications that use web APIs?
Let's take a feed reader Windows application as an example. All communications with the feed providers will use UTF-8, but all presentation will use UTF-16. Instead of relying on the runtime/framework to do the conversion, the Windows application developer might decide to handle all strings as UTF-16 (because that's what Windows uses) while the developer of the web API might choose to use UTF-8 strings (because that's what the web uses).
Is this such an odd or unusual scenario?
At least initially we would completely hide the internal details of the representation from the client of string (thus it will not know how the string is encoded). This proposal, at least initially, would be to NOT have multiple representations but to uniformly use UTF8 internally. Now maybe we would take that next step (of having multiple possible representations) but first things first. First we make it so that we don't HAVE to be UTF16.
No, the thread proposal was not to use UTF-8, which would take 3 bytes for a lot of characters that in UTF-16 would be represented with 2, but to use ASCII or Latin-1 when possible (using 1 byte) and to use UTF-16 (2 bytes) for anything else.
If not, what do we achieve using UTF-8?
@vancem You mention:
My expectation of how 'fixed' works is that it unconditionally copies the string data into an array of char and points it at that.
I thought the whole point of the fixed statement was for better performance, e.g. eliding range checks in hot paths. Copying the entire string to a new char array each time would seem to go against that.
I know I'm a little bit late to the party, but here are my thoughts as someone who has written code for `String`:
- The `string(char[])` constructor would get a lot slower.
- Extra checks on `m_type` would be needed in many methods. This may also prevent String methods from being inlined, or if they are inlined it will introduce more jitted code bloat.

I generally agree, though, that the memory savings we can gain from UTF-8 strings are desirable. But perhaps we should take a look in the other direction: a separate `Utf8String` type that people can opt into, writing e.g. `var utf8 = u8"Hello, world!";` to get the memory benefits.
Apps that benefit significantly from using UTF-8 strings in string-processing code can do so; the rest of the time they can stick to UTF-16.
@jamesqo
+1. And one more point: corefxlab has a low-memory solution for dynamic string construction, System.Text.Formatting.
Note that this one covers not only memory consumption for the final strings but also allocations and overall performance of operations over strings (which is much more important for a generic web site/service app).
I don't think there's a reason to keep both.
@jamesqo I'm not sure using `Utf8String` would help much.
Regarding the number of conversions:
- If you have `char[]`, then you have to convert whether you're using compact `string` or `Utf8String`.
- If you have `string`, then you have to convert if you're using `Utf8String`, but not if you're using compact `string`.

As for the amount of code, a lot of that additional code would still be required even with `Utf8String`, just maybe placed somewhere else.
If you have char[], then you have to convert whether you're using compact string or Utf8String.
Yes; I was saying that if you have a `char[]` today and you want to convert to a string, or vice versa, you don't have to perform any conversions between encodings. The proposed change would make existing code that calls `string(char[])`, `ToCharArray`, `CopyTo`, etc. much slower since we can no longer just memcpy the characters.
If `new string('\0', length)` continued to return a UTF-16 structured string then that would handle most of the `fixed` modifications in the system, and probably elsewhere. That doesn't help with pointer increments needing to be `byte` instead of `ushort` on compact strings, though.
Compact strings are not a clear win due to the expense of determining whether you can create one or not. If dollars are being spent on string, I would vote to put them all on UTF-8 rather than the pursuit of better ASCII. It solves the memory issue as well as serialization.
The proposed change would make existing code that calls `string(char[])` ...
Just create a UTF-16 string for that and use the current paths.
I thought the whole point of the fixed statement was for better performance, e.g. eliding range checks in hot paths. Copying the entire string to a new char array each time would seem to go against that.
Yes please. If C# started performing a `memcpy` every time I needed a pointer to a `string`, I'd be forced back into the "use native for strings" model; and while I do love my C, I find strings something C handles rather poorly (compared to a memory-managed language like C#).
As somebody who is forced into reading and writing UTF-8 constantly for compat with Linux binaries, I use UTF-8 A LOT: I really love the idea that C# could treat arbitrarily encoded strings as first-class citizens. I'm also a huge believer in UTF-8 everywhere, if for no other reason than a frighteningly large proportion of developers believe that indexing a `char[]` is akin to indexing a UTF-16 encoded `string` (hint: they're not).
Lastly, there are several important scenarios where having a `fixed (char* ptr = mystring)` is necessary, for both read and write, because performance is critical. Let's just consider directory diving for a second:
Allocating a rather largish string up front (say `string buffer = new string('\0', 4096)`; yes, yes, the Windows shell is limited to 255 useful chars), passing the pointer up and down the stack, modifying the value as necessary, and then just using it as a buffer to pass to P/Invoke is far preferable to allocating thousands of temporary string objects and just absolutely abusing the heap, the garbage collector, and my sense of right vs. wrong.
Is there someplace I can officially register my opinion that "`fixed` should never copy"?
@ig-sinicyn
I don't think there's a reason to keep both.
Well, there's the whole Windows compatibility thing. NetFx, on Windows, calls into the Win32 API a lot, and having to convert to and from `wchar_t` values at every call site would get prohibitively expensive. Ideally, the developer would know which strings are expected to interact with externals which require `wchar_t` and have access to the necessary formats internally, thus minimizing the number of encoding operations performed.
@whoisj sorry, my previous comment was definitely unclear.
My point is: if there are good old UTF-16 strings _and_ there is a specialized library for high-throughput UTF-8 string construction, why do we need ASCII strings at all?
@benaadams `CopyTo` and `ToCharArray` would still suffer for ASCII data. Also, I don't know how many strings are created from `char*` or `char[]`, but I'm guessing that just using UTF-16 for all of those would significantly diminish the memory savings from this change.
why do we need ascii strings at all?
Because ASCII is not encoded beyond "byte value equals character". While UTF-8 can, and does, correctly encode compatible ASCII characters, it also encodes beyond that, and there are occasions when you need to know that the ASCII value is simply that: the ASCII value.
Also, many consoles still rely on ANSI VT100 strings for colorized output, and having an `Encoding.ASCII` value which performs no value checking or assertions would allow for efficient VT100 usage.
@jamesqo I completely agree. However, `string.ToCharArray` isn't very safe as there are many code points outside of the 16-bit range. You should never assume that `char` is a character. Honestly, `char` should likely be a 24-bit type (Unicode is a 21-bit standard after all).
To clarify some points here:
Regarding `fixed` being made to perform a copy: I had to go and manually patch a bunch of code to either be encoding-aware or use a new encoding-agnostic API, and copying is the only safe option for managing the transition. This is charity, basically; we could just say "unsafe code is unsafe, end of story" and break people's code freely, but we try not to.
Strings are immutable; if you're writing into a string, you're invoking undefined behavior.
If you're writing to a string you "do not own" then I agree; otherwise it should be considered an unsafe necessary evil. I regularly have to do `byte* utf8str = stackalloc byte[strlen];` or similar currently.
"UTF-24" would be deathly slow due to misalignment.
Yes, yes it would. Hence the existence of UTF-32, which is a significant memory waster.
VT100 codes are representable in regular ASCII. (I'm not sure if you meant to imply they're not.)
I did not.
This is not a very expensive optimisation; you pay a small CPU cost to compress and decompress strings as needed, and you get a fairly large memory savings. And presumably it would be opt-in.
I do not disagree; as I've stated, I'm all for UTF-8 everywhere. However, I'm still of the belief that 16 bits is insufficient for a `char`, and having constant-time index operators on `string` is basically recommending writing errors for many developers.
Copying is the only safe option for managing the transition. This is charity, basically; we could just say "unsafe code is unsafe, end of story" and break people's code freely, but we try not to.
Sure. Under the assumption that C# is for part-time developers or developers who care nothing about performance; but then why bother even considering UTF-8 strings? Personally, I believe C# needs to get more focused on performance, or at least offer better avenues for developers to write performance-critical code in C#.
Making an operator like `fixed` suddenly perform a `memcpy` of an arbitrarily large allocation seems like a trap and not a feature at all. We should always consider the edge cases here: the 2 GiB string and the single-`char` string when deciding language features.
Currently strings are `\0`-terminated; if that were the same for `byte[]` strings then you could do `fixed byte*` or `fixed char*` for both, as the terminator would round the size up to the nearest char if it was an odd number of bytes, and be an ignored overflow (as now) if it was an even number; so both would be safe (though you would have to know the type when indexing on the pointer, to know what `Length` means).
I'm not such a fan of mucking around in the internals of String, and would be more inclined to support a separate type where compact strings are desirable, with easy conversion back and forth and a compatible interface (e.g., create a common IString interface).
For the internal implementation of this new type, I would suggest breaking up the string value into runs of 8-bit and 16-bit substrings and storing them in an array of structs, where each struct contains a flag (to determine the number of bytes per character) and a byte array. E.g.:
``` C#
struct CompactSubstring {
    bool IsTwoByte;   // true: Value holds UTF-16 code units; false: one byte per character
    byte[] Value;     // the raw bytes of this run
}

public class CompactString : IString {
    CompactSubstring[] substrings;   // the string is the concatenation of these runs
}
```
(Twists on the implementation could involve using a linked list of substrings, or using a String for the 2-byte substrings--lots of pros and cons, I think the above approach would be fastest to copy in memory.)
This would be nearly as compact as UTF-8 for 7-bit ASCII, and potentially _more_ compact than UTF-8 for strings with characters > 127. It would avoid issues that UTF-8 presents with common methods like Substring, while not being limited strictly to ASCII values.
Operations could be optimized for the 1-byte substrings, improving performance over String for the same values. Some operations could even be faster. For example, if you call IndexOf() for a high-order character, it could skip past all 1-byte substrings during the search.
The implementation could establish a minimum number of characters before it creates a 1-byte substring, which would avoid potential higher overhead when a string flips back and forth between low and high characters.
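A rough sketch of that IndexOf optimisation over the run-based layout (self-contained and illustrative; the types here mirror, but are not, the ones proposed above):

``` C#
// Fields are public here so the example compiles standalone.
public struct CompactSubstringExample
{
    public bool IsTwoByte;   // true: UTF-16 code units; false: 1 byte per character
    public byte[] Value;
}

public static class CompactSearch
{
    // IndexOf for a char above 0xFF: 1-byte runs can be skipped wholesale,
    // because they cannot possibly contain it.
    public static int IndexOfHighChar(CompactSubstringExample[] runs, char value)
    {
        int offset = 0;                          // position in the logical string
        foreach (var run in runs)
        {
            if (!run.IsTwoByte)
            {
                offset += run.Value.Length;      // skip the whole 1-byte run
                continue;
            }
            for (int i = 0; i + 1 < run.Value.Length; i += 2)
            {
                // little-endian UTF-16 code unit
                char c = (char)(run.Value[i] | (run.Value[i + 1] << 8));
                if (c == value)
                    return offset + i / 2;
            }
            offset += run.Value.Length / 2;
        }
        return -1;
    }
}
```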
breaking up the string value into runs of 8-bit and 16-bit substrings
And breaking with every standard convention on text encoding established pretty much everywhere? How about interop with native stacks which are expecting `null`-terminated runs of encoded bits?
Personally, I'd love to see C# support 4 kinds of strings: ASCII, UTF-8, UTF-16, and UTF-32 strings, for maximum compatibility with pretty much every native stack that exists. Everything else is just having fun with structures to be different.
@whoisj, as I saw it, the original question was related to how strings are _stored_ in memory (especially related to caching), not so much about how they are passed around.
Whether we use a traditional encoding (as you're suggesting) or a data structure (as I'm suggesting), translation to and from a normal UCS-2 representation would be needed for most native calls, would it not? (I don't deal with a lot of interop, so I plead ignorance on how string values are usually passed to native libraries on various platforms.)
I think the structured approach could result in faster conversions than converting traditional encodings, because the converter would be able to work with _regions_ of characters (i.e., each of a known width) rather than having to convert one character at a time (as would be required with, say, UTF-8).
What I'm suggesting would also be a new, fully-optional type that wouldn't require rewiring the dark innards of System.String.
@richardtallent
What I'm suggesting would also be a new, fully-optional type that wouldn't require rewiring the dark innards of System.String.
See https://github.com/dotnet/coreclr/issues/7083#issuecomment-245713571 for why changing `string` instead of creating a new type might be preferable.
I've added this to our list of potential post 2.0 investments - this is potentially a BIG one though and it's certainly not going to get a yay/nay right here in this issue. This remains in future and on our radar.
I'm asking myself: to make the String type more compact, doesn't the problem in reality lie in the Char type? Would it be possible to encode Char in UTF-8?
Doing this we get:
How do we express this new Char in a way that it effectively occupies 1 byte, 2 bytes, 3 bytes... as the character code requires? Could a stack-allocated array do this?
``` C#
struct Char
{
    Char value;
    fixed Byte data[];  // this implies that fixed could be used in a "safe" context

    // Used by the compiler to create a Char literal
    unsafe public Char(byte* data, int len)
    {
        this.data = stackalloc(len);
        this.data = data;
    }
}
```
Probably there is a reason why no one (Java, Python) has taken this route and they have instead changed the String to not be an array of characters any more.
On second thought, the 'data' field could be declared as `Span<byte>`, so we have safe code and the effect of it having size 1 byte, 2 bytes, 3 bytes or 4 bytes as needed for the Char represented.
@fanoI I do not believe the CLR has a facility for two instances of the same type to be different sizes. And we definitely wouldn't want to use indirection to the heap.
Use of `Span<byte>` seems to be the most creative solution. :smile:
Perhaps we'll need to make `char` 32-bits like it is in many Linux projects. :persevere:
How do we express this new Char in a way that it effectively occupies 1 byte, 2 bytes, 3 bytes... as the character code requires? Could a stack-allocated array do this?
Well, it's quite simple: you don't. If changing `char` were actually a possibility then the only reasonable option would be to make it 4 bytes in size so it can actually represent a code point.
Anyway, any attempt at changing the `String` encoding in such a way that chars are no longer fixed-size is doomed to fail.
If I understood well how a Span works, it effectively lives on the stack and could have any size, is this correct?
I'm unsure if a `Span<byte> = { 0xf0, 0xaf, 0xa0, 0x80 }` will really occupy 4 bytes on the stack or if there is overhead.
@whoisj making Char 4 bytes will not be useful for this proposal to make Strings more compact; they would become larger instead. And for interop / unsafe scenarios it seems clear that the useful property of .NET that a String is simply an array of Char should be maintained (fixed should never copy!).
making Char 4 bytes will not be useful for this proposal to make ...
Yes yes, of course - I was being pedantic. :smile:
If I understood well how a Span works, it effectively lives on the stack and could have any size, is this correct?
Like pretty much all types, Span has a fixed size. In general, the idea of a variable-sized type is pretty bizarre.
seems clear that the useful property of .NET that a String is simply an array of Char should be maintained (fixed should never copy!)
And that implies that char has to have a fixed size. You can't have an array of variable-sized types.
Just a few thoughts of mine:
For the question of whether to use ASCII or Latin-1 for the "byte strings": the type tag could differentiate three types (ASCII, Latin-1 and UTF-16). Thus, Latin-1 could still profit from the smaller storage, while we could also exploit the fact that ASCII is a UTF-8 subset. We could even distinguish a fourth kind for strings which exceed Latin-1 but do stay within the BMP (and thus have no surrogates).
For comparison: Python strings are also immutable, but they have a different internal representation. The details are too much to rehash here; it's all described nicely in PEP 393: https://www.python.org/dev/peps/pep-0393/ - while I think it cannot be applied to .NET in a 1:1 fashion, some of the ideas could provide useful inspiration... :-)
Using an "indirect" API (where the string contents are allocated in a separate heap block) might have advantages for string slicing (like "Substring"), as we could reference the same backing store (strings are immutable, after all...).