Uglifyjs: Unicode escaping in regex is incorrectly minified.

Created on 9 Dec 2017  ·  11 Comments  ·  Source: mishoo/UglifyJS

Bug report

ES5

Uglify version (uglifyjs -V)
uglify-js 3.2.1

JavaScript input

new RegExp("<object[^>]*>.*?<\/object>|<span[^>]*>.*?<\/span>|<(?:object|embed|svg|img|div|span|p|a)[^>]*>|(?:\uD83C\uDFF3)\uFE0F?\u200D?(?:\uD83C\uDF08)|(?:\uD83D\uDC41)\uFE0F?\u200D?(?:\uD83D\uDDE8)\uFE0F?|[#-9]\uFE0F?\u20E3|(?:(?:\uD83C\uDFF4)(?:\uDB40[\uDC60-\uDCFF]){1,6})|(?:\uD83C[\uDDE0-\uDDFF]){2}|(?:(?:\uD83D[\uDC68\uDC69]))\uFE0F?(?:\uD83C[\uDFFA-\uDFFF])?\u200D?(?:[\u2695\u2696\u2708]|\uD83C[\uDF3E-\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92])|(?:\uD83D[\uDC68\uDC69]|\uD83E[\uDDD0-\uDDDF])(?:\uD83C[\uDFFA-\uDFFF])?\u200D?[\u2640\u2642\u2695\u2696\u2708]?\uFE0F?|(?:(?:\u2764|\uD83D[\uDC66-\uDC69\uDC8B])[\u200D\uFE0F]{0,2}){1,3}(?:\u2764|\uD83D[\uDC66-\uDC69\uDC8B])|(?:(?:\u2764|\uD83D[\uDC66-\uDC69\uDC8B])\uFE0F?){2,4}|(?:\uD83D[\uDC68\uDC69\uDC6E\uDC71-\uDC87\uDD75\uDE45-\uDE4E]|\uD83E[\uDD26\uDD37]|\uD83C[\uDFC3-\uDFCC]|\uD83E[\uDD38-\uDD3E]|\uD83D[\uDEA3-\uDEB6]|\u26f9|\uD83D\uDC6F)\uFE0F?(?:\uD83C[\uDFFB-\uDFFF])?\u200D?[\u2640\u2642]?\uFE0F?|(?:[\u261D\u26F9\u270A-\u270D]|\uD83C[\uDF85-\uDFCC]|\uD83D[\uDC42-\uDCAA\uDD74-\uDD96\uDE45-\uDE4F\uDEA3-\uDECC]|\uD83E[\uDD18-\uDD3E])\uFE0F?(?:\uD83C[\uDFFB-\uDFFF])|(?:[\u2194-\u2199\u21a9-\u21aa]\uFE0F?|[\u0023\u002a]|[\u3030\u303d]\uFE0F?|(?:\ud83c[\udd70-\udd71]|\ud83c\udd8e|\ud83c[\udd91-\udd9a])\uFE0F?|\u24c2\uFE0F?|[\u3297\u3299]\uFE0F?|(?:\ud83c[\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|\ud83c[\ude32-\ude3a]|\ud83c[\ude50-\ude51])\uFE0F?|[\u203c\u2049]\uFE0F?|[\u25aa-\u25ab\u25b6\u25c0\u25fb-\u25fe]\uFE0F?|[\u00a9\u00ae]\uFE0F?|[\u2122\u2139]\uFE0F?|\ud83c\udc04\uFE0F?|[\u2b05-\u2b07\u2b1b-\u2b1c\u2b50\u2b55]\uFE0F?|[\u231a-\u231b\u2328\u23cf\u23e9-\u23f3\u23f8-\u23fa]\uFE0F?|\ud83c\udccf|[\u2934\u2935]\uFE0F?)|[\u2700-\u27bf]\uFE0F?|[\ud800-\udbff][\udc00-\udfff]\uFE0F?|[\u2600-\u26FF]\uFE0F?|[\u0030-\u0039]\uFE0F", "g");

This line of code comes from the emojione library (emojione.js#L160).

The uglifyjs CLI command executed or minify() options used.

uglifyjs --compress --beautify beautify=false,semicolons=false --mangle -- emojione.js

JavaScript output or error produced.

new RegExp("<object[^>]*>.*?</object>|<span[^>]*>.*?</span>|<(?:object|embed|svg|img|div|span|p|a)[^>]*>|(?:🏳)️??(?:🌈)|(?:👁)️??(?:🗨)️?|[#-9]️?⃣|(?:(?:🏴)(?:\udb40[\udc60-\udcff]){1,6})|(?:\ud83c[\udde0-\uddff]){2}|(?:(?:\ud83d[\udc68�]))️?(?:\ud83c[\udffa-\udfff])??(?:[⚕⚖✈]|\ud83c[\udf3e-\udfed]|\ud83d[\udcbb�\udd27�\ude80�])|(?:\ud83d[\udc68�]|\ud83e[\uddd0-\udddf])(?:\ud83c[\udffa-\udfff])??[♀♂⚕⚖✈]?️?|(?:(?:❤|\ud83d[\udc66-\udc69�])[‍️]{0,2}){1,3}(?:❤|\ud83d[\udc66-\udc69�])|(?:(?:❤|\ud83d[\udc66-\udc69�])️?){2,4}|(?:\ud83d[\udc68�\udc6e�-\udc87�\ude45-\ude4e]|\ud83e[\udd26�]|\ud83c[\udfc3-\udfcc]|\ud83e[\udd38-\udd3e]|\ud83d[\udea3-\udeb6]|⛹|👯)️?(?:\ud83c[\udffb-\udfff])??[♀♂]?️?|(?:[☝⛹✊-✍]|\ud83c[\udf85-\udfcc]|\ud83d[\udc42-\udcaa�-\udd96�-\ude4f�-\udecc]|\ud83e[\udd18-\udd3e])️?(?:\ud83c[\udffb-\udfff])|(?:[↔-↙↩-↪]️?|[#*]|[〰〽]️?|(?:\ud83c[\udd70-\udd71]|🆎|\ud83c[\udd91-\udd9a])️?|Ⓜ️?|[㊗㊙]️?|(?:\ud83c[\ude01-\ude02]|🈚|🈯|\ud83c[\ude32-\ude3a]|\ud83c[\ude50-\ude51])️?|[‼⁉]️?|[▪-▫▶◀◻-◾]️?|[©®]️?|[™ℹ]️?|🀄️?|[⬅-⬇⬛-⬜⭐⭕]️?|[⌚-⌛⌨⏏⏩-⏳⏸-⏺]️?|🃏|[⤴⤵]️?)|[✀-➿]️?|[\ud800-\udbff][\udc00-\udfff]️?|[☀-⛿]️?|[0-9]️","g")

The above snippet is unparsable.


Some version bisecting shows that this last worked correctly in v3.0.25 and was first broken in v3.1.0. I took a look at the compare for those two versions, but I just don't possess the required skill set to debug this stuff. I wish I could be more help.

bug

All 11 comments

That library is probably using some illegal unpaired surrogate. See: https://github.com/mishoo/UglifyJS2/issues/2242

This works:

$ bin/uglifyjs emojione.js -m -c -b beautify=false,ascii_only | node -p
/<object[^>]*>.*?<\/object>|<span[^>]*>.*?<\/span>|<(?:object|embed|svg|img|div|span|p|a)[^>]*>|(?:🏳)️?‍?(?:🌈)|(?:👁)️?‍?(?:🗨)️?|[#-9]️?⃣|(?:(?:🏴)(?:�[�-�]){1,6})|(?:�[�-�]){2}|(?:(?:�[��]))️?(?:�[�-�])?‍?(?:[⚕⚖✈]|�[�-�]|�[������])|(?:�[��]|�[�-�])(?:�[�-�])?‍?[♀♂⚕⚖✈]?️?|(?:(?:❤|�[�-��])[‍️]{0,2}){1,3}(?:❤|�[�-��])|(?:(?:❤|�[�-��])️?){2,4}|(?:�[����-���-�]|�[��]|�[�-�]|�[�-�]|�[�-�]|⛹|👯)️?(?:�[�-�])?‍?[♀♂]?️?|(?:[☝⛹✊-✍]|�[�-�]|�[�-��-��-��-�]|�[�-�])️?(?:�[�-�])|(?:[↔-↙↩-↪]️?|[#*]|[〰〽]️?|(?:�[�-�]|🆎|�[�-�])️?|Ⓜ️?|[㊗㊙]️?|(?:�[�-�]|🈚|🈯|�[�-�]|�[�-�])️?|[‼⁉]️?|[▪-▫▶◀◻-◾]️?|[©®]️?|[™ℹ]️?|🀄️?|[⬅-⬇⬛-⬜⭐⭕]️?|[⌚-⌛⌨⏏⏩-⏳⏸-⏺]️?|🃏|[⤴⤵]️?)|[✀-➿]️?|[�-�][�-�]️?|[☀-⛿]️?|[0-9]️/g

Reduced test case:

$ bin/uglifyjs -V
uglify-es 3.2.1
$ cat regex.js 
new RegExp("[\udc42-\udcaa\udd74-\udd96\ude45-\ude4f\udea3-\udecc]");



$ cat regex.js | bin/uglifyjs 
new RegExp("[\udc42-\udcaa�-\udd96�-\ude4f�-\udecc]");



$ cat regex.js | bin/uglifyjs -b ascii_only
new RegExp("[\udc42-\udcaa\udd74-\udd96\ude45-\ude4f\udea3-\udecc]");

$ cat regex.js | bin/uglifyjs | bin/uglifyjs -b ascii_only
new RegExp("[\udc42-\udcaa\ufffd-\udd96\ufffd-\ude4f\ufffd-\udecc]");

Unpaired surrogates must be output in ascii in the default binary output mode.

uglify-es parser recognises surrogate pairs - so the question is does this work on harmony?

If so, I guess it's just a matter of backporting part of that logic onto master.

"uglify-es parser recognises surrogate pairs - so the question is does this work on harmony?"

Issue #2242 has the background on how master and harmony diverged, and on how node streams cannot handle unpaired binary surrogates.

The reduced test case above was done with harmony.

Other ES parsers keep and use the raw string. Uglify does not. We probably have to step through the string char by char to output it properly in binary mode.
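To make the "step through the string char by char" idea concrete, here is a minimal sketch. The function name `escapeLoneSurrogates` is hypothetical and this is not UglifyJS's actual implementation. It assumes the input is the already-decoded string value, and it only handles surrogates: quotes, backslashes and other characters that a real string printer must also escape are left alone.

```javascript
// Hypothetical helper (not part of UglifyJS): walk the string one UTF-16
// code unit at a time. Well-formed surrogate pairs are emitted as-is;
// unpaired surrogates are emitted as ASCII \uXXXX escapes so they survive
// being written out as UTF-8.
function escapeLoneSurrogates(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    const isHigh = code >= 0xd800 && code <= 0xdbff;
    const isLow = code >= 0xdc00 && code <= 0xdfff;
    if (isHigh && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // High surrogate followed by a low surrogate: a valid pair.
        out += str[i] + str[i + 1];
        i++; // skip the low half we just consumed
        continue;
      }
    }
    if (isHigh || isLow) {
      // Unpaired surrogate: surrogate code units always have 4 hex digits.
      out += "\\u" + code.toString(16);
    } else {
      out += str[i];
    }
  }
  return out;
}

// The character class from the reduced test case consists only of lone
// low surrogates, so every one of them is escaped.
console.log(escapeLoneSurrogates("[\udc42-\udcaa\udd74-\udd96]"));
```

A paired sequence such as "\ud83d\udc41" passes through unchanged, which keeps the output compact for ordinary emoji.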

@kzc Awesome. Thanks so much for helping produce the reduced test case.

@alexlamsl Thanks for the fast response / tagging.

Even though it can probably be accommodated by uglify, I still have my doubts that the original string with unpaired surrogates - even in ascii form - is valid ECMAScript. The ES spec is silent on the use of unpaired surrogates in both strings and RegExp. It's probably a de facto browser thing.

Even node converts such a regex to a string by replacing the lone surrogates with the Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD):

$ cat regex.js | node -p
/[�-��-��-��-�]/

$ cat regex.js | node -p | xxd
0000000: 2f5b efbf bd2d efbf bdef bfbd 2def bfbd  /[...-......-...
0000010: efbf bd2d efbf bdef bfbd 2def bfbd 5d2f  ...-......-...]/
0000020: 0a                                       .
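The same replacement can be reproduced without a shell pipeline. The sketch below uses only Node's built-in `Buffer`; the variable names are illustrative. It shows why the damage cannot be undone afterwards: two different lone surrogates encode to the same three bytes.

```javascript
// Encode two *different* lone surrogates as UTF-8, as happens when
// output is written to a UTF-8 stream.
const a = Buffer.from("\udd74", "utf8");
const b = Buffer.from("\ude45", "utf8");

// Both become the bytes of U+FFFD (ef bf bd), matching the xxd dump.
console.log(a.toString("hex"));
console.log(b.toString("hex"));

// Decoding gives back the replacement character, not the original unit.
console.log(a.toString("utf8") === "\ufffd");

// A well-formed surrogate pair round-trips through UTF-8 unchanged.
const pair = Buffer.from("\ud83d\udc41", "utf8");
console.log(pair.toString("utf8") === "\ud83d\udc41");
```

Because the original code unit is lost at encoding time, the only safe option is to emit the escape in ASCII before the string reaches the stream.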

Related unicode regular expression spec:

"It is permissible, but not required, to match an isolated surrogate code point (such as \u{D800}), which may occur in Unicode Strings."
http://unicode.org/reports/tr18/#Supplementary_Characters

and related discussion:

"lone surrogates cannot be part of any valid UTF"
http://unicode.org/pipermail/unicode/2015-October/002979.html

Those affected can use the -b beautify=false,ascii_only workaround in the meantime.

The fix will be in v3.2.3.

Awesome!! Thank you guys.

I am on uglify 3.3.1 and it still breaks my regexes, which breaks the prod builds. And if I set ascii_only=true, the next compiler doesn't see the .factory method.

Minifying commonmark.js when building via the most recent create-react-app yields this error:

[Screenshot: error output, 2018-03-30 at 19:58:18]

I've installed v3.3.9 of uglify-js
