The content pipeline tool fails with the following exception when a SpriteFont contains characters from the Unicode emoji character range (surrogate characters):
Importer 'FontDescriptionImporter' had unexpected failure!
System.FormatException: String must be exactly one character long.
at System.Xml.XmlConvert.ToChar(String s)
at Microsoft.Xna.Framework.Content.Pipeline.Serialization.Intermediate.CharSerializer.Deserialize(String[] inputs, Int32& index)
The failing SpriteFont file is attached (built on Windows).
Interesting... we certainly should support this if possible. Just needs a little debugging I think.
The .NET Framework natively supports UTF-16, but the Char type is a 16-bit integer. Therefore anything that requires a multi-character sequence, such as the emoji character range, is not currently supported. This includes the SpriteFont internal file format. It could be done, but it gets messy and breaks compatibility. Even using UTF-32, the emoji characters are above the legal range for a single UCS-4 character (0x0 - 0x10FFFF per ISO 10646).
FreeType supports unsigned 32-bit integer for the character code. How to get it to support it properly I'm not sure. Maybe see if it accepts character codes beyond the legal range.
> Therefore anything that requires a multi-character sequence, such as the emoji character range, is not currently supported.
Right... that is a bigger fix than I thought then.
Well the first thing is to maybe handle this more gracefully... reporting a reasonable error like "We do not support multi-character glyphs" or something?
@dtaylorus Where did you get those character ranges from? They are above the Unicode 8.0 character range (up to 0x10FFFF). The Unicode emoji page lists a range of character codes, ranging from simple one character codes to sequences containing up to seven or eight character codes (still within the range listed above). These are not surrogate pairs in the normal sense however, as they are individual glyphs overlaid on each other.
@KonajuGames The spritefont file contains decimal values for the coding range. The equivalent hexadecimal values are 0x1F300 - 0x1F5FF. They should be in the valid character range.
Ahh yeah. I often get that XML encoding confused. It is still above the upper limit for the 16-bit Char type used in the SpriteFont data. Strings are also natively 16-bit encoded in .NET, so supporting these character ranges could be a challenge because it means we wouldn't be able to use String or Char in any of those parts of the SpriteFont content pipeline.
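To make the size mismatch concrete, here is a small sketch that relies only on the standard `Char.ConvertFromUtf32` helper. The class and method names are made up for illustration. It shows how a code point such as 0x1F300 (decimal 127744, the start of the region in the attached file) is stored as two `char` values in a .NET string.

```csharp
using System;

public static class SurrogateDemo
{
    // Returns the UTF-16 code units .NET uses to store a single code point.
    public static char[] ToUtf16Units(int codePoint)
    {
        // ConvertFromUtf32 returns a one-char string for code points up to
        // 0xFFFF and a two-char surrogate pair for anything above that.
        return Char.ConvertFromUtf32(codePoint).ToCharArray();
    }
}
```

Calling `SurrogateDemo.ToUtf16Units(0x1F300)` yields two chars, the high surrogate 0xD83C followed by the low surrogate 0xDF00, which is why a single Char cannot hold the value.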
The only way we could support it without breaking SpriteFont would be to add some sort of remapping to a 16-bit character code.
How to remap? The user would have to know the remapping so they use those remapped characters in their strings passed to SpriteBatch.DrawString().
> The user would have to know the remapping so they use those remapped characters in their strings passed to SpriteBatch.DrawString().
Yes that would be the case. I imagine a case where a game supports some fixed number of emoji and provides some menu option to select from them. Or the game automatically replaces text like :) with appropriate emojis by replacing with the special character code for each.
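The text replacement idea could look roughly like the sketch below. Everything here is hypothetical: the class name, the table contents, and the choice of Private Use Area codes (0xE000 and up) as the remapped 16-bit values are assumptions for illustration, not something MonoGame defines.

```csharp
using System.Collections.Generic;
using System.Text;

public static class EmoticonRemapper
{
    // Hypothetical table: emoticon text -> the 16-bit code the emoji glyph
    // was remapped to when the font was built.
    private static readonly Dictionary<string, char> Map = new Dictionary<string, char>
    {
        { ":)", '\uE000' },
        { ":(", '\uE001' },
    };

    // Replaces every known emoticon with its remapped single-char code.
    public static string Apply(string text)
    {
        var builder = new StringBuilder(text);
        foreach (var pair in Map)
        {
            builder.Replace(pair.Key, pair.Value.ToString());
        }
        return builder.ToString();
    }
}
```

The game would then pass the result of `EmoticonRemapper.Apply(...)` to SpriteBatch.DrawString() in place of the raw text.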
I would be happy with remapping to UTF-16.
I imagine being able to specify where to map the characters in the CharacterRegion element. For example, to map the range of characters from 0x1F300 - 0x1F5FF to the range beginning at the 16-bit character code 32:
<CharacterRegion>
<Start>🌀</Start>
<End>🗿</End>
<MapTo> </MapTo>
</CharacterRegion>
It would be great if that was plumbed all the way through to DrawString(), but even if it wasn't, it would at least enable me to create the font and do my own string replacement.
> I imagine being able to specify where to map the characters in the CharacterRegion element.
Yeah maybe something like that could work... we would need to define a format for specifying this that doesn't break existing fonts.
This might be the first step towards some more interesting font processing features. So I think it would be a good thing for someone to work on.
Would it make sense to create a derived class from SpriteFont to handle this? Like a UnicodeSpriteFont? Just a thought.
Before you start changing things around and deriving classes.
I have been looking at this class a lot lately and it's pretty tightly coupled.
So much so that I'm not sure where to start, other than to say it needs some restructuring.
> Before you start changing things around and deriving classes.
There could be some improvements to SpriteFont, but don't derail this thread. Start a new issue for that discussion.
Sorry, to stay on topic:
If you look at the SpriteFont.DrawInto method, the char it uses internally is only used as a dictionary key to find a glyph.
The Glyph struct in SpriteFont defines the char as the key:
public struct Glyph
{
    public char Character;
    // ...
}
It could just as well use an integer as a key, which you could add to that struct, and then change DrawInto to use it as the key to get the glyph as well.
Charsource might have to change too, though.
MeasureString also uses the char as a key.
By the time DrawInto is done, it doesn't pass out any more text at all; it just makes the call to draw the texture.
An alternate solution is to simply have some way to extend SpriteFont, particularly so you could make your own SpriteFont.DrawString(), along with a way to call drawinternal() from within that method.
The issue is well before that in the content pipeline. The parsing of the SpriteFont XML file is not able to process multi-character UTF-16 sequences (0x1F300 is a two character sequence), and iterating a string at runtime cannot handle multi-character sequences either.
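The importer failure from the original report can be reproduced in isolation with the standard `XmlConvert.ToChar` call that appears in the stack trace. This is only a sketch; the wrapper class and method name are made up for illustration.

```csharp
using System;
using System.Xml;

public static class CharParseDemo
{
    // Returns true if XmlConvert.ToChar accepts the text as a single Char,
    // false if it rejects it with a FormatException.
    public static bool CanParseAsChar(string text)
    {
        try
        {
            XmlConvert.ToChar(text);
            return true;
        }
        catch (FormatException)
        {
            return false;
        }
    }
}
```

`CharParseDemo.CanParseAsChar("A")` succeeds, while passing the two-char string "\uD83C\uDF00" (the UTF-16 form of 0x1F300) fails, because the string is not exactly one Char long.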
This problem would never have existed if MonoGame supported TTF fonts :p
Since you use SDL, could you support SDL_ttf?
> This problem would never have existed if MonoGame supported TTF fonts :p
The .NET type Char does not support multi-character sequences. The Emoji code points use multi-character sequences. So it is not just as simple as supporting TTF at runtime. To fully support these Unicode ranges, it requires changes to everywhere a string or character is used in relation to fonts, in both the content pipeline and at runtime. That is why we suggested remapping these characters into the valid range for a single-character sequence. This minimizes the changes required now and is the simplest _first step_ to better support for the entire Unicode code point range.
In case it's helpful, you can see the changes I made to the MonoGame.Extended project to enable their BitmapFont class to draw Unicode surrogate characters to a SpriteBatch here:
Commit: Support Unicode emoji characters in BitmapFont
Probably the most interesting change was the GetUnicodeCodePoints method, which enumerates the Unicode code points of a string as Int32 values and properly handles 2-character surrogates.
The BMFont generator and the MonoGame.Extended content pipeline processor for the generated .fnt files already fully supported Unicode, so no changes were necessary there.
A nice thing that I found along the way is that changing a Char type to an Int32 is not a breaking change, since you can pass a Char value to an Int32 parameter without casting in C#.
> A nice thing that I found along the way is that changing a Char type to an Int32 is not a breaking change, since you can pass a Char value to an Int32 parameter without casting in C#.
Where it is a breaking change in SpriteFont is that the XNB file format contains a single-sequence Char value for each glyph. It could be extended to support UCS-4 instead of UCS-2, but it wouldn't be compatible with SpriteFont. This is what we have to take into consideration.
> but it wouldn't be compatible with SpriteFont.
Hence my suggestion to derive a class from SpriteFont. Essentially as a way to do a check for that type elsewhere and handle its special use case. A UnicodeSpriteFont class could simply add some extended properties to handle support for unicode multi-char surrogates. And perhaps an overload for DrawString in SpriteBatch could take in an instance of a UnicodeSpriteFont. Then, the format for a UnicodeSpriteFont XNB file won't break compatibility, yes?
Personally, I would not want to have to make code changes in my game or content project simply to add Unicode support, so I don't like the subclass idea.
I don't know much about the .xnb file format, but hopefully the SpriteFont XNB can be extended with a new field without breaking backward compatibility.
(My understanding is that a SpriteFont XNB generated with the old XNA content processor should continue to work in MonoGame, but that there is no requirement to have a SpriteFont XNB file generated by the new MonoGame content processor continue to work in an old XNA game.)
If that's the case, then you could do something like add a new List<Int32> field to the SpriteFont XNB if and only if the SpriteFont contains 32-bit characters.
Then when reading the XNB, you would continue to use the old List<Char> for 16-bit characters, and optionally append the new List<Int32> Unicode code points if that field exists.
The mapping approach works as well, though I'm not sure how difficult it is to make it transparent to game code. There may be less impact on existing code in the content pipeline processor if we used a <MapFrom> element instead of <MapTo>, so no changes are required in parsing the <Start> and <End> elements of CharacterRegion in the .spritefont file.
For example, to map surrogate Unicode characters to code points 32..544:
<CharacterRegion>
<Start> </Start>
<End>Ƞ</End>
<MapFrom>🌀</MapFrom>
</CharacterRegion>
> I would not want to have to make code changes in my game or content project simply to add Unicode support
I agree. Ideally we could deal with this within SpriteFont with some extended information that only exists when these extended Unicode characters are used. This way any increase in content size or text rendering performance is limited to fonts that use this feature.
> The mapping approach works as well,
I think remapping is a good feature independent of whatever we do here for multi-character sequences.
> Ideally we could deal with this within SpriteFont with some extended information that only exists when these extended Unicode characters are used.
This makes sense to me. So then, would the <MapFrom>🌀</MapFrom> element inside the .spritefont file, simply map to a new MapFrom Int32 property on the SpriteFont class?
Am I right in thinking we can key off whether or not a value is provided to this new hypothetical MapFrom property (it could be a nullable Int32 type) of the SpriteFont class in order to handle the logic for drawing unicode glyphs?
There is a Version member in the XNB that we could use to identify an extended file format in a SpriteFont XNB. The XNB would no longer be compatible with XNA because it wouldn't recognize version 2, but that is less of an issue.
The version 2 format could use Int32 for the glyph code point instead of Char, and we should be able to have the SpriteFont class handle the differences internally.
> we should be able to have the SpriteFont class handle the differences internally.
Can we do this easily without any extra performance cost? Like what is the cost of detecting these multi-character cases during parsing of the string and generating quads?
lol, tom you just proved yet another reason why there should be another separate class to handle this. It doesn't have to be a derived class of SpriteFont, but considering performance and compatibility characteristics, it makes sense to create a separate and optional "opt-in" style solution.
new UnicodeSpriteFont class (derived or copied from SpriteFont)
new .unicodespritefont XNB file
new DrawUnicodeString method of SpriteBatch
If we go this route, then any performance implications can be understood and accepted by a user of the framework. Yes, this means if you want to support Unicode fonts in your game/app then you'll have to "opt-in" to these new methods. But this is the safest way to implement this feature going forward. It also sidesteps the issue brought up by willmotil, in that we don't have to worry about twisting SpriteFont into doing something it wasn't originally intended to support.
> lol, tom you just proved yet another reason why there should be another separate class to handle this.
That is why I asked the question. :) So what is the answer?
If the cost is minimal then I wouldn't want any second class at all. Even if there is a cost then where is it? Likely all we need is a special DrawUnicodeString that does the extra checks needed.
> That is why I asked the question.
Oh! Sorry man. I see what you mean now. :)
Ah, yes. Well I wish I had more knowledge and experience in answering the "what" of your performance cost question. But definitely having a new DrawUnicodeString (or some other overload of DrawString) would be the way to go.
I'll let somebody else jump in who can answer the performance cost question. :)
The cost should be minimal. I can do a proof of concept this weekend. I've
got a good idea of what needs to be done now and how to handle it with
least impact.
I really wish this whole conversation was put on hold.
Personally, I have a more sweeping suggestion which would seriously impact the structural relationship between SpriteBatch and SpriteFont. I think it is needed, but I wasn't quite ready to put that argument on the table yet, off the top of my head.
If you're going to tackle SpriteFont, I think you should seriously consider the following first.
Move the following:
The method Spritefont.DrawInto to SpriteBatch
The Charsource struct to SpriteBatch
The internal version of MeasureString to SpriteBatch
Create a public MeasureString in SpriteBatch that takes a SpriteFont as an object and returns a size by return value or out
You've got some valid suggestions there, but they are completely separate from the issues in this thread and much larger in scope. As has been requested previously, please open a new issue and don't derail this thread.
But they are kind of related; that's why I'm saying it. I'll open a new thread.
I posted a fully working draft DrawInto with the matrices unrolled to straight linear math, drawtext.
I'm thinking Tom's mapping idea is a good one, but how do we use the mappings with regular text?
How does this work: DrawString(,"some text"+ ":-)",,,);
Btw, I'm seeing the same amount of garbage with Charsource as with a regular StringBuilder.
Numeric value string representations seem to generate garbage no matter what I do; it takes a ton of loops, running at 2500 to 3500 fps, to see it steadily increase and then get collected.
I have a working solution for the mapping idea.
Check the latest commit here: https://github.com/stromkos/MonoGame/tree/EmojiSpriteFont
The code needs major cleanup before it could be considered for a PR.
The CharacterRegion has an added element
The following code will display the uppercase characters as lowercase.
<CharacterRegion>
<Start>A</Start>
<End>Z</End>
<DisplayAs>97</DisplayAs>
</CharacterRegion>
It replaces the char with a Dictionary<char, uint>, using the uint as the glyph and storing it as the char in the xnb.
The output file is identical to a standard spritefont xnb, and can be loaded using the current framework.
@KonajuGames, you said that "iterating a string at runtime cannot handle multi-character sequences either." and "The .NET type Char does not support multi-character sequences.", however, System.Char does have IsSurrogate, IsSurrogatePair and ConvertToUtf32 methods, so in theory, something workable could be made out of those.
Yes, a char is actually UTF-16, but emojis, and more practically the current UTF standard, take up to 4 bytes (or up to 4 code points?); it works off chained combinations of character code points, if I understand it right.
(_mind you, I'm no expert at this_)
At the moment, though, DrawString doesn't consider anything but a single character to look up and match to a glyph, be it UTF-8, UTF-16 or UTF-2048 (that's a joke, of course); that is all it understands.
You could of course, even now, simply add emojis to any current SpriteFont by adding some images to a texture and adding a glyph, then parsing a string for combinations and replacing those found combos with the UTF-16 character glyph value you assigned.
I doubt that is the proper way to do it, which would be... the Unicode way...
(but about here is where the fog falls on me as well.)
_I simply haven't researched it enough to know how it's all supposed to or can fit together exactly._
Basically, I think we need to amass a lot of encoding information; at least I would.
stuff like this...
https://unicode.org/
https://msdn.microsoft.com/en-us/library/system.text.encoding.utf8(v=vs.110).aspx
// Remember, .NET's char is UTF-16
https://msdn.microsoft.com/en-us/library/system.text.encoding.unicode(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/system.text.encoding.utf32(v=vs.110).aspx
Most browsers support the below.
The question is how we want MonoGame to support this seamlessly, and at what cost.
U+0061 LATIN SMALL LETTER A: a
Nº: 97
UTF-8: 61
UTF-16: 00 61
U+00A9 COPYRIGHT SIGN: ©
Nº: 169
UTF-8: C2 A9
UTF-16: 00 A9
U+2122 TRADE MARK SIGN: ™
Nº: 8482
UTF-8: E2 84 A2
UTF-16: 21 22
U+2603 SNOWMAN: ☃
Nº: 9731
UTF-8: E2 98 83
UTF-16: 26 03
U+260E BLACK TELEPHONE: ☎
Nº: 9742
UTF-8: E2 98 8E
UTF-16: 26 0E
U+2614 UMBRELLA WITH RAIN DROPS: ☔
Nº: 9748
UTF-8: E2 98 94
UTF-16: 26 14
U+263A WHITE SMILING FACE: ☺
Nº: 9786
UTF-8: E2 98 BA
UTF-16: 26 3A
U+2691 BLACK FLAG: ⚑
Nº: 9873
UTF-8: E2 9A 91
UTF-16: 26 91
U+269B ATOM SYMBOL: ⚛
Nº: 9883
UTF-8: E2 9A 9B
UTF-16: 26 9B
U+1F4A9 PILE OF POO: 💩
Nº: 128169
UTF-8: F0 9F 92 A9
UTF-16: D8 3D DC A9
U+1F680 ROCKET: 🚀
Nº: 128640
UTF-8: F0 9F 9A 80
UTF-16: D8 3D DE 80
Strangely enough, nobody pointed out how to calculate how many bytes one Unicode char takes. Here is the rule for UTF-8 encoded strings:
Binary Hex Comments
0xxxxxxx 0x00..0x7F Only byte of a 1-byte character encoding
10xxxxxx 0x80..0xBF Continuation bytes (1-3 continuation bytes)
110xxxxx 0xC0..0xDF First byte of a 2-byte character encoding
1110xxxx 0xE0..0xEF First byte of a 3-byte character encoding
11110xxx 0xF0..0xF4 First byte of a 4-byte character encoding
So the quick answer is: it takes 1 to 4 bytes, depending on the first one which will indicate how many bytes it'll take up.
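The lead-byte table above translates directly into a lookup. The following is a minimal sketch that follows the ranges exactly as listed in the table; the class and method names are made up for illustration, and it does not validate the continuation bytes that follow.

```csharp
public static class Utf8Length
{
    // Returns the number of bytes in the UTF-8 sequence that starts with
    // leadByte, or 0 if the byte is a continuation byte (0x80..0xBF) or
    // lies outside the lead-byte ranges in the table.
    public static int FromLeadByte(byte leadByte)
    {
        if (leadByte <= 0x7F) return 1;                     // 0xxxxxxx
        if (leadByte >= 0xC0 && leadByte <= 0xDF) return 2; // 110xxxxx
        if (leadByte >= 0xE0 && leadByte <= 0xEF) return 3; // 1110xxxx
        if (leadByte >= 0xF0 && leadByte <= 0xF4) return 4; // 11110xxx
        return 0;
    }
}
```

Checked against the examples listed earlier: 0x61 (a) gives 1, 0xC2 (copyright sign) gives 2, 0xE2 (snowman) gives 3, and 0xF0 (rocket) gives 4.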
We would need to lay out a plan of attack to tackle the problems, and identify what those problems will be.
What are the steps that must be taken to allow it on the rendering side and on the loading side?
What are the considerations for strings or StringBuilders when passed to DrawString, if any?
The way this works in .NET is that you convert a string of chars into Unicode code points, which are int values. For example, here's the code I wrote to do this:
````C#
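private static IEnumerable<int> GetUnicodeCodePoints(string text)
{
    if (!String.IsNullOrEmpty(text))
    {
        for (int i = 0; i < text.Length; i += 1)
        {
            // Skip the second half of a surrogate pair; it was already
            // consumed together with the preceding high surrogate.
            if (Char.IsLowSurrogate(text, i))
            {
                continue;
            }
            yield return Char.ConvertToUtf32(text, i);
        }
    }
}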
````
The returned code points can then be mapped to Unicode glyphs, e.g., rectangular regions in a texture asset.
For example, using a BitmapFont asset generated with BMFont and loaded with MonoGame.Extended (which already properly supports Unicode as a good alternative to SpriteFont), I use:
````C#
foreach (var codePoint in GetUnicodeCodePoints(text))
{
var characterRegion = bitmapFont.GetCharacterRegion(codePoint);
// Draw the source rectangle texture specified by characterRegion
// ...
}
````
So the bottom line is that (strangely) you want to use Int32 values instead of Char values when dealing with individual characters in Unicode strings.
I just realized that with a cast you can maintain the Char type for backward compatibility instead of having to switch everything from Char to Int32:
````C#
private static IEnumerable<char> GetUtf32Chars(string text)
{
    if (!String.IsNullOrEmpty(text))
    {
        for (int i = 0; i < text.Length; i += 1)
        {
            if (!Char.IsLowSurrogate(text, i))
            {
                yield return (char)Char.ConvertToUtf32(text, i);
            }
        }
    }
}
````
This enumerator now simply enumerates a string as a sequence of UTF-32 chars instead of multibyte UTF-8 chars.
I think this should make things pretty straightforward, since the only thing that needs to change is to call the GetUtf32Chars function when enumerating the characters in the string.
@dtaylorus Why cast back to char? You'll mess up the codepoints that require 4 bytes in UTF-16. Do you mean to combine the above snippet with a remapping as is proposed above?
@Jjagg Sorry, I haven't actually looked at the internals of the existing code to see what's necessary. If you can use the Int32 values, the cast would be unnecessary. But I don't think you lose anything by casting, since a Char is still 4 bytes.
I would only propose you include the cast if it makes the code change smaller, for example, if there is a bunch of existing code that assumes the char type, or if it means that you can maintain backward compatibility for a public API. Not sure if either of those are the case.
@Jjagg Sorry, it looks like I misspoke. Char is only 2 bytes internally as you say. So I think you're right that the cast is not a good option. So I guess Char really is pretty useless as a data type in a Unicode world. :)
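To illustrate why the cast loses information, here is a tiny sketch (class and method names are made up for illustration). Casting an Int32 code point to Char keeps only the low 16 bits.

```csharp
public static class CastDemo
{
    // Casts a code point down to a 16-bit Char. For code points above
    // 0xFFFF the upper bits are silently discarded.
    public static char Truncate(int codePoint)
    {
        return unchecked((char)codePoint);
    }
}
```

For example, `CastDemo.Truncate(0x1F4A9)` returns the char 0xF4A9, a different and unrelated code point, so two distinct emoji can also collide after the cast.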
> So I guess Char really is pretty useless as a data type in a Unicode world.
Well, it's just that the CLR uses UTF-16 to represent strings, so it makes sense to have 2 byte units to represent characters (especially since most of the commonly used characters - especially Western ones - use 2 bytes in UTF-16).
Here's a suggestion to support the full Unicode range without breaking API/xnb format and without forcing users to remap characters.
IMO the best approach would be to switch the internal character representation to 32-bit integers so we don't have to force users to remap. We can do this without breaking the API by adding the necessary API to fully support 32-bit integer representations and to convert to 16-bit for the old API. We either keep the old API or mark it as deprecated so we can drop it in MG 4.0.
An example of a backwards-compatible change:
````C#
public struct Glyph
{
    public char Character
    {
        get => (char) CodePoint; // should probably return '⁇' for 4 byte characters or something
        set => CodePoint = (int) value;
    }

    public int CodePoint;
    // ...
}
````
I just checked the implementation for the processor and the runtime and I think we can do a similar modification everywhere in the public API.
We can even avoid breaking the xnb format by making two additions to the format and modifying SpriteFont reading/writing accordingly:
Write two chars for 4 byte characters
When reading a char that is the first half of a surrogate pair, read an additional character to get the entire character

Note that these additions are compatible with the current format because they only handle extra cases and don't make any changes for 2 byte characters. Meaning old .xnb files will get loaded just fine without special measures.
cc @dtaylorus @KonajuGames for feedback :)
Getting the unicode code points is exactly what is done in Extended for bitmap fonts.
@Jjagg I'm not sure how the Glyph structure is used, but is it reasonable to use it just like Char, so for example, you may have to enumerate two Glyph structs to decode one Unicode code point? If so, this could potentially avoid changing any public API's.
The backward compatible idea for the .xnb encoding looks good to me.
@Jjagg I took a brief look at how the Glyph structure is used and your solution to add the int CodePoint field also looks good to me.
What do you plan to do with the public Dictionary<char, Glyph> GetGlyphs() method and the public ReadOnlyCollection<char> Characters property? Will you make a breaking change to switch from char to int?
We can add int variants (CharactersUtf32 and GetGlyphsUtf32) and cast down to char when the old versions are used, omitting the 4-byte UTF-16 characters. That's what I meant by:
> We can do this without breaking the API by adding the necessary API to fully support 32-bit integer representations and to convert to 16-bit for the old API.
It won't break the API, but I think it would be best to mark the old API as deprecated and fully replace it in MG 4.0 because it's pretty crappy to have two variants like that. Still better than breaking the API in a minor version upgrade though.
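As a rough sketch of the "cast down and omit" behaviour for the old char-based members, something like the following could back them. The class and method names are hypothetical, and this is not the actual MonoGame implementation.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CodePointFilter
{
    // Builds the legacy 16-bit view of a glyph set: code points that fit in
    // a single Char are cast down, anything above 0xFFFF is left out.
    public static List<char> ToLegacyChars(IEnumerable<int> codePoints)
    {
        return codePoints
            .Where(cp => cp >= 0 && cp <= 0xFFFF)
            .Select(cp => (char)cp)
            .ToList();
    }
}
```

Given the code points 0x41, 0x1F300 and 0x2603, the legacy view would contain only 'A' and the snowman; the emoji stays reachable solely through the int-based variant.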
Ah, so you will filter out the 4-byte characters for these properties. That's the part I was missing. Sounds good!
Couple of notes on the latest conversation:
@Jjagg Use UInt32 instead of Int32; Otherwise multibyte chars would be serialized as negative values. Depending on read method used, the lower bytes may be 2's complemented.
I see two possibilities to continue:
Create a second content type mirroring SpriteFont.
You would have to duplicate each piece of code (the serializer/deserializer, Xnb file, font description, FontDescriptionImporter, SpriteFont, character region, glyph and DrawString) and include the 32-bit code point value.
Perform automatic mapping of characters into UTF-16 space (8191 chars max), store a second mapping table either as an additional table at the end of the Xnb (unsure if it would be ignored by the parser) or as a separate file. Map it back to UTF-32 when loading the SpriteFont.
Thanks for your input @stromkos.
> @Jjagg Use UInt32 instead of Int32; Otherwise multibyte chars would be serialized as negative values. Depending on read method used, the lower bytes may be 2's complemented.
If you read and write as int it doesn't really matter. And since the .NET API uses int for UTF-32 encoded characters IMO it makes sense for us to do the same.
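A quick way to check this point is a round trip through the standard `BinaryWriter` and `BinaryReader`. The sketch below uses a made-up class name and an in-memory stream rather than a real .xnb file. Since the largest code point, 0x10FFFF, is far below `int.MaxValue`, valid code points never come out negative.

```csharp
using System.IO;

public static class CodePointIo
{
    // Writes a code point as a signed 32-bit integer and reads it back.
    public static int RoundTrip(int codePoint)
    {
        using (var stream = new MemoryStream())
        {
            var writer = new BinaryWriter(stream);
            writer.Write(codePoint);
            writer.Flush();

            stream.Position = 0;
            var reader = new BinaryReader(stream);
            return reader.ReadInt32();
        }
    }
}
```

`CodePointIo.RoundTrip(0x10FFFF)` returns 0x10FFFF unchanged.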
> Create a second content type mirroring SpriteFont.
This would make maintaining harder, clutters the codebase and confuses users, so it's not a good solution.
> Perform automatic mapping of characters into UTF-16 space (8191 chars max), store a second mapping table either as an additional table at the end of the Xnb (unsure if it would be ignored by the parser) or as a separate file. Map it back to UTF-32 when loading the SpriteFont.
UTF-16 is a variable-length encoding that can represent all Unicode characters (with either 2 or 4 bytes). We don't have to change the xnb format, we just have to read characters out properly instead of assuming only 2-byte characters and store them in int. I covered this in my previous comments.
@Jjagg, I have created a working proof of concept of your ideas here: https://github.com/stromkos/MonoGame/tree/EndToEndUTF32
There are many problems with style, comments and code placement.
The only breaking change is the issue with (CharactersUtf32 and GetGlyphsUtf32),
I created Utf16 versions and made the original return CharEx. Mainly for debugging purposes(find all internal uses).
I created a new value type for UTF-32 called CharEx (I put it under the Framework.Utilities namespace, thinking it would be an extension, not a type), and supplied Readers, Writers and Serializer/Deserializer.
Very minor changes to spritefont to properly measure string lengths when they contain extended chars.
Hey @stromkos! Sorry for the delay. Could you set up a pull request so I can comment inline?
Based on the previous responses to my PR's, I am not willing to submit a non-working PR. If you have major concerns or comments, I would be glad to hear them. The version posted has many bugs, as it was intended as a proof of concept only. I have corrected some/most of those bugs. Unit testing is a blessing.
I will post a different branch and submit it as a PR when it is at a minimally functional level.
I would be glad to create an orphaned project with unstable updates, if you are only analyzing source and not attempting to build. (I will intentionally leave compile errors to let me know what needs to be fixed -- I realize there are better ways to indicate this, but OLD habits are hard to break :)
Current status: