At https://github.com/tinymce/tinymce/issues/1971 and https://core.trac.wordpress.org/ticket/30130 @Zodiac1978 describes an occasional problem when pasting Unicode text into TinyMCE:
I have a PDF file with German umlauts (üöäÜÖÄ), and if I copy & paste them into the TinyMCE from WordPress I get the vowel (uoaUOA) followed by a combining diaeresis (http://www.fileformat.info/info/unicode/char/0308/index.htm) instead of just one precomposed character.
This results in some problems:
- Search for words with umlauts doesn't work
- Proofreading fails
- W3C validation fails with warning "Text run is not in Unicode Normalization Form C." because precomposed characters are preferred (See: http://www.w3.org/International/docs/charmod-norm/#choice-of-normalization-form)
We're probably in a good place to fix this in our JavaScript code.
With ES6 we have a normalize function in JS:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
And there is a polyfill for older browsers:
https://github.com/walling/unorm
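To make the difference concrete, here is a minimal sketch of what `String.prototype.normalize` does to one of the affected characters. The variable names are only illustrative; the code assumes an ES6-capable runtime (or a polyfill such as unorm loaded beforehand).

```typescript
// Decomposed form: "u" followed by U+0308 COMBINING DIAERESIS,
// which is what the paste from the PDF produces.
const decomposed: string = "u\u0308";

// Precomposed form: U+00FC LATIN SMALL LETTER U WITH DIAERESIS.
const precomposed: string = "\u00FC";

// Both render as the same glyph, but the strings are not equal,
// which is why search and proofreading break.
const equalBefore: boolean = decomposed === precomposed; // false

// NFC composes the base letter and the combining mark into one code point.
const normalized: string = decomposed.normalize("NFC");
const equalAfter: boolean = normalized === precomposed; // true
```

After normalization the string is a single code unit long and compares equal to the precomposed character.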
Unless there's a misunderstanding of the issue here, this appears to be blocked by the upstream issue https://github.com/tinymce/tinymce/issues/1971 . If I'm wrong, can you clarify the action steps necessary for Gutenberg specifically?
From today's editor bug scrub: https://wordpress.slack.com/archives/C02QB2JS7/p1518111268000525
Oh, I remember this, 3 years ago:
We could use normalize in JavaScript, but with limited browser support.
This is not a TinyMCE problem though; it affects all input. A PHP solution might be better, but we could also consider cleaning this up for the editor on paste.
Moving back to formatting.
Not blocked by the TinyMCE issue. We can clean this up on paste in the visual editor at least, maybe also in text. I'll have a look once #5966 is merged.
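A rough sketch of what the paste cleanup could look like, under a few assumptions: `normalizePastedText` is a hypothetical helper name, not an existing TinyMCE or Gutenberg function, and hooking it into the actual paste handling is left out here. It feature-detects `normalize` so that older browsers without a polyfill simply keep the text unchanged.

```typescript
// Hypothetical helper, not part of TinyMCE, Gutenberg, or WordPress.
function normalizePastedText(text: string): string {
  // String.prototype.normalize is missing in pre-ES6 browsers.
  // Without a polyfill, return the text unchanged rather than throw.
  if (typeof String.prototype.normalize !== "function") {
    return text;
  }
  // Compose base letters and combining marks into precomposed characters.
  return text.normalize("NFC");
}

// Example: "Mu" + U+0308 + "nchen", as it might arrive from a PDF.
const pasted: string = "Mu\u0308nchen";
const cleaned: string = normalizePastedText(pasted);
// cleaned is now "M\u00FCnchen" (München with a precomposed ü).
```

Text that is already in NFC passes through unchanged, so running this on every paste should be safe for the common case.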