Silverstripe-framework: BUG: String tokenizing methods return garbled content with some UTF-8 characters

Created on 5 Jul 2017 · 18Comments · Source: silverstripe/silverstripe-framework

Problem description

We stumbled across this, when trying to json_encode some content from SilverStripe and the method suddenly returned nothing (although json_encode should be capable of encoding _any_ proper UTF-8 string).

The issue only arises when using methods that work on portions of strings, such as HTMLText::FirstSentence, HTMLText::Summary etc. When running content through some of these methods, you might end up with invalid UTF-8.

Affected Versions

SilverStripe Framework 2.x - 4.x
String Fieldtype classes (Text, HTMLText,…), Convert class, maybe the shortcode-parser.

Severity

Edge-case, but has a high WTF-factor when it's happening.

Source of the issue

The methods that work on substrings use methods that are not multibyte safe. In my tests, I found out that preg_replace and preg_split on whitespace (/\s+/) can split a multibyte character in half, which will result in invalid UTF-8.

The problem only seems to occur if your system uses an UTF-8 version of a locale.

Illustration of the problem

// Must set an UTF-8 variant of a locale
setlocale(LC_ALL, "en_US.UTF-8");
// Generate a non-breaking space, U+00A0
$str = html_entity_decode('&nbsp;',ENT_COMPAT, 'UTF-8' );
// replace whitespace
echo preg_replace('/\s+/', ' ', $str);

This will output non-valid UTF-8, you'll see the dreaded: �

Explanation

When using an UTF-8 locale, preg_replace seems to detect the byte A0 as white-space and replace it with the byte 20. So the byte sequence C2A0 (which is the non-breaking space), becomes C220, which is _not_ a valid UTF-8 character. It's very likely, that other UTF-8 characters could be affected by this problem as well (not only the non-breaking space).

Possible fix

All instances of preg_replace('/\s+/', …) and preg_split('/\s+/',…) should be replaced with the multibyte-safe variants, mb_ereg_replace and mb_split. Other string operations should be looked into, preferably all string replacements and regular expressions should operate in a multibyte-safe manner.

affectv3 affectv4 efformedium feedback-requirecore-team impaclow typbug

Source

bummzack

Most helpful comment

@silverstripe/core-team I'd like some feedback on which approach is best...

We currently require mbstring extension and so using the mbstring functions shouldn't be a no-go.

Do we want to:

:heart: Add u flag to all our preg_* functions; OR
🎉 Move current preg_* functions to mbstring equivalents?

Further reading, using the u flag on preg_* methods can still be problematic when using PREG_OFFSET_CAPTURE (https://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php/1725329#1725329)

dhensby on 5 Jul 2017

❤3

All 18 comments

Update: instead of using the mb_ string methods, one could also use the unicode-modifier for the preg_* functions, eg. using preg_replace('/\s+/u', …) also solves the issue. Not sure what other implications the unicode-modifier has… it seems to at least require a PCRE version with unicode-support.

bummzack on 5 Jul 2017

let's add the unicode modifier - can you PR for it and I'll get it merged and all the way up from 3.5 -> master

dhensby on 5 Jul 2017

@dhensby PR against 3.5 branch, then?

bummzack on 5 Jul 2017

yes please. And thanks for investigating this in such detail and taking the time to report it

dhensby on 5 Jul 2017

@silverstripe/core-team I'd like some feedback on which approach is best...

We currently require mbstring extension and so using the mbstring functions shouldn't be a no-go.

Do we want to:

:heart: Add u flag to all our preg_* functions; OR
🎉 Move current preg_* functions to mbstring equivalents?

dhensby on 5 Jul 2017

❤3

@kinglozzer I've brought the impact down because this is quite an edge case and I feel we'd have heard about this a lot more if it were a high impact problem

dhensby on 5 Jul 2017

👍1

Due to the design of UTF-8, this issue actually shouldn't happen. To me this is more of a bug in the underlying PCRE implementation, that a single A0 byte gets replaced to 20. Mind you, this is only happening when an UTF-8 locale is being set! Without an .UTF-8 locale, the preg_ methods behave as excepted.

Since there are several UTF-8 characters that contain the A0 byte (eg. à, Ġ, Š …), this could lead to malformed UTF-8 when using preg_ methods.
The most toxic regular expression is splitting or replacing /\s+/, as this will replace all the A0 bytes. Combined regexes are less of an issue, eg. something like |<img[^>]*alt\s*=\s*"([^"]*)"[^>]*>| is not problematic since there are characters surrounding \s.

I think there are several factors why this isn't a very common problem:

You'd have to specifically set your locale to be UTF-8 (the Fluent module does this by default)
The characters that cause the issue are not very commonly used
You'd have to use the Summary or FirstSentence methods
I think HTML output is more forgiving than something like encoding to JSON. In the worst case, you have the � symbol in your HTML output.

bummzack on 5 Jul 2017

My unit-tests that should prove the issue seem to pass… this might be specific to a certain server-configuration. Could people reading this check what the following script returns on their webserver?

<?php
$locale = setlocale(LC_ALL,'en_US.UTF-8');
$output = preg_replace('/\s+/', ' ', 'à');
echo '<pre>';
printf(
    "Locale %s generates %s UTF-8. Output: %s\nPHP: %s / PCRE: %s / %s",
    $locale,
    mb_check_encoding($output, 'UTF-8') == true ? 'VALID' : 'INVALID',
    $output,
    phpversion(),
    PCRE_VERSION,
    php_uname('s')
);
echo '</pre>';

I get the following on my local machine:

Locale en_US.UTF-8 generates INVALID UTF-8. Output: �
PHP: 5.6.30 / PCRE: 8.38 2015-11-23 / Darwin

bummzack on 5 Jul 2017

On Linux machines I get:

Locale en_US.UTF-8 generates VALID UTF-8. Output: à
PHP: 5.6.30 / PCRE: 8.37 2015-04-28 / Linux

Locale en_US.UTF-8 generates VALID UTF-8. Output: à
PHP: 7.0.20 / PCRE: 8.37 2015-04-28 / Linux

Maybe this is a Mac OS X bug… or related to the PCRE version?

bummzack on 5 Jul 2017

Other reports so far, from Slack:

faloude:

Locale en_US.UTF-8 generates VALID UTF-8. Output: à
PHP: 5.6.29 / PCRE: 8.20 2011-10-21 / Linux

Question on StackOverflow that seems to support the theory of an OS X Bug: https://stackoverflow.com/questions/18432253/unexpected-behavior-of-preg-replace-with-osx

bummzack on 5 Jul 2017

I get this on my local Ubuntu16.04 machine:

Locale en_US.UTF-8 generates VALID UTF-8. Output:
PHP: 7.0.18-0ubuntu0.16.04.1 on: Linux

SpiritLevel on 5 Jul 2017

I'm not keen to move from preg to ereg, which is my beef with mb_ereg_replace. Adding /u modifier seems like a good idea.

PREG_OFFSET_CAPTURE is used in ShortcodeParser and php-peg (i.e. SSViewer).

From what I can read, the issue is that it turns byte-offsets, rather than character-offsets. As long as these are passed to substr and not mb_substr (which it is) that should be fine. So I can't see a problem in principle with preg/u.

Can we make sure we get some tests in that cover this issue?

sminnee on 6 Jul 2017

@sminnee I've tried to add some tests, but they passed. It's either a bug that's limited to Mac OS X, or related to a certain version of PCRE. I'd love to see some output of a non Mac OS machine that runs PCRE 8.38…

The PREG_OFFSET_CAPTURE is only an issue if you happen to get an index in the middle of a multibyte sequence… utf-8 was designed in a way that this shouldn't happen, but with the bug we're dealing with here, it could happen still…

bummzack on 6 Jul 2017

Output on CentOS 7.3:

Locale en_US.UTF-8 generates VALID UTF-8. Output: à
PHP: 7.1.6 / PCRE: 8.38 2015-11-23 / Linux

dhensby on 6 Jul 2017

👍1

@dhensby Well, seems to be an OS X only issue, then… maybe also test BSD. Not sure if it's worth a lot of effort to fix this for just OS X? Of course I'd be happy if it works properly on my system, but I could also patch local installations for dev-work.

bummzack on 6 Jul 2017

if we can patch and fix the issue with the /u flag and it solves your problem, then I see no reason against doing it, even if it's an edge case.

dhensby on 6 Jul 2017

👍1

Locale en_US.UTF-8 generates VALID UTF-8. Output: à PHP: 7.1.6 / PCRE: 8.40 2017-01-11 / FreeBSD

even tough it's looks like a osX only issue, I would appreciate if it gets fixed, since it's a real pain if you run into it and still quite a few developers use osX.