We stumbled across this, when trying to json_encode some content from SilverStripe and the method suddenly returned nothing (although json_encode should be capable of encoding _any_ proper UTF-8 string).
The issue only arises when using methods that work on portions of strings, such as HTMLText::FirstSentence, HTMLText::Summary etc. When running content through some of these methods, you might end up with invalid UTF-8.
SilverStripe Framework 2.x - 4.x
String Fieldtype classes (Text, HTMLText,…), Convert class, maybe the shortcode-parser.
Edge-case, but has a high WTF-factor when it's happening.
The methods that work on substrings use methods that are not multibyte safe. In my tests, I found out that preg_replace and preg_split on whitespace (/\s+/) can split a multibyte character in half, which will result in invalid UTF-8.
The problem only seems to occur if your system uses an UTF-8 version of a locale.
// Must set an UTF-8 variant of a locale
setlocale(LC_ALL, "en_US.UTF-8");
// Generate a non-breaking space, U+00A0
$str = html_entity_decode(' ',ENT_COMPAT, 'UTF-8' );
// replace whitespace
echo preg_replace('/\s+/', ' ', $str);
This will output non-valid UTF-8, you'll see the dreaded: �
When using an UTF-8 locale, preg_replace seems to detect the byte A0 as white-space and replace it with the byte 20. So the byte sequence C2A0 (which is the non-breaking space), becomes C220, which is _not_ a valid UTF-8 character. It's very likely, that other UTF-8 characters could be affected by this problem as well (not only the non-breaking space).
All instances of preg_replace('/\s+/', …) and preg_split('/\s+/',…) should be replaced with the multibyte-safe variants, mb_ereg_replace and mb_split. Other string operations should be looked into, preferably all string replacements and regular expressions should operate in a multibyte-safe manner.
Update: instead of using the mb_ string methods, one could also use the unicode-modifier for the preg_* functions, eg. using preg_replace('/\s+/u', …) also solves the issue. Not sure what other implications the unicode-modifier has… it seems to at least require a PCRE version with unicode-support.
let's add the unicode modifier - can you PR for it and I'll get it merged and all the way up from 3.5 -> master
@dhensby PR against 3.5 branch, then?
yes please. And thanks for investigating this in such detail and taking the time to report it
@silverstripe/core-team I'd like some feedback on which approach is best...
We currently require mbstring extension and so using the mbstring functions shouldn't be a no-go.
Do we want to:
u flag to all our preg_* functions; ORpreg_* functions to mbstring equivalents?Further reading, using the u flag on preg_* methods can still be problematic when using PREG_OFFSET_CAPTURE (https://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php/1725329#1725329)
@kinglozzer I've brought the impact down because this is quite an edge case and I feel we'd have heard about this a lot more if it were a high impact problem
Due to the design of UTF-8, this issue actually shouldn't happen. To me this is more of a bug in the underlying PCRE implementation, that a single A0 byte gets replaced to 20. Mind you, this is only happening when an UTF-8 locale is being set! Without an .UTF-8 locale, the preg_ methods behave as excepted.
Since there are several UTF-8 characters that contain the A0 byte (eg. à , Ä , Š…), this could lead to malformed UTF-8 when using preg_ methods.
The most toxic regular expression is splitting or replacing /\s+/, as this will replace all the A0 bytes. Combined regexes are less of an issue, eg. something like |<img[^>]*alt\s*=\s*"([^"]*)"[^>]*>| is not problematic since there are characters surrounding \s.
I think there are several factors why this isn't a very common problem:
Summary or FirstSentence methodsMy unit-tests that should prove the issue seem to pass… this might be specific to a certain server-configuration. Could people reading this check what the following script returns on their webserver?
<?php
$locale = setlocale(LC_ALL,'en_US.UTF-8');
$output = preg_replace('/\s+/', ' ', 'Ã ');
echo '<pre>';
printf(
"Locale %s generates %s UTF-8. Output: %s\nPHP: %s / PCRE: %s / %s",
$locale,
mb_check_encoding($output, 'UTF-8') == true ? 'VALID' : 'INVALID',
$output,
phpversion(),
PCRE_VERSION,
php_uname('s')
);
echo '</pre>';
I get the following on my local machine:
Locale en_US.UTF-8 generates INVALID UTF-8. Output: �
PHP: 5.6.30 / PCRE: 8.38 2015-11-23 / Darwin
On Linux machines I get:
Locale en_US.UTF-8 generates VALID UTF-8. Output: Ã
PHP: 5.6.30 / PCRE: 8.37 2015-04-28 / Linux
Locale en_US.UTF-8 generates VALID UTF-8. Output: Ã
PHP: 7.0.20 / PCRE: 8.37 2015-04-28 / Linux
Maybe this is a Mac OS X bug… or related to the PCRE version?
Other reports so far, from Slack:
faloude:
Locale en_US.UTF-8 generates VALID UTF-8. Output: Ã
PHP: 5.6.29 / PCRE: 8.20 2011-10-21 / Linux
Question on StackOverflow that seems to support the theory of an OS X Bug: https://stackoverflow.com/questions/18432253/unexpected-behavior-of-preg-replace-with-osx
I get this on my local Ubuntu16.04 machine:
Locale en_US.UTF-8 generates VALID UTF-8. Output:
PHP: 7.0.18-0ubuntu0.16.04.1 on: Linux
I'm not keen to move from preg to ereg, which is my beef with mb_ereg_replace. Adding /u modifier seems like a good idea.
PREG_OFFSET_CAPTURE is used in ShortcodeParser and php-peg (i.e. SSViewer).
From what I can read, the issue is that it turns byte-offsets, rather than character-offsets. As long as these are passed to substr and not mb_substr (which it is) that should be fine. So I can't see a problem in principle with preg/u.
Can we make sure we get some tests in that cover this issue?
@sminnee I've tried to add some tests, but they passed. It's either a bug that's limited to Mac OS X, or related to a certain version of PCRE. I'd love to see some output of a non Mac OS machine that runs PCRE 8.38…
The PREG_OFFSET_CAPTURE is only an issue if you happen to get an index in the middle of a multibyte sequence… utf-8 was designed in a way that this shouldn't happen, but with the bug we're dealing with here, it could happen still…
Output on CentOS 7.3:
Locale en_US.UTF-8 generates VALID UTF-8. Output: Ã
PHP: 7.1.6 / PCRE: 8.38 2015-11-23 / Linux
@dhensby Well, seems to be an OS X only issue, then… maybe also test BSD. Not sure if it's worth a lot of effort to fix this for just OS X? Of course I'd be happy if it works properly on my system, but I could also patch local installations for dev-work.
if we can patch and fix the issue with the /u flag and it solves your problem, then I see no reason against doing it, even if it's an edge case.
Locale en_US.UTF-8 generates VALID UTF-8. Output: Ã
PHP: 7.1.6 / PCRE: 8.40 2017-01-11 / FreeBSD
even tough it's looks like a osX only issue, I would appreciate if it gets fixed, since it's a real pain if you run into it and still quite a few developers use osX.
It looks like this was fixed in a6db16b2298738e1ef1329329cbef7c6b33f993e.
Most helpful comment
@silverstripe/core-team I'd like some feedback on which approach is best...
We currently require
mbstringextension and so using the mbstring functions shouldn't be a no-go.Do we want to:
uflag to all ourpreg_*functions; ORpreg_*functions to mbstring equivalents?Further reading, using the
uflag onpreg_*methods can still be problematic when usingPREG_OFFSET_CAPTURE(https://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php/1725329#1725329)