Bug report
ES5
Uglify version (uglifyjs -V)
uglify-js 3.2.1
JavaScript input
new RegExp("<object[^>]*>.*?<\/object>|<span[^>]*>.*?<\/span>|<(?:object|embed|svg|img|div|span|p|a)[^>]*>|(?:\uD83C\uDFF3)\uFE0F?\u200D?(?:\uD83C\uDF08)|(?:\uD83D\uDC41)\uFE0F?\u200D?(?:\uD83D\uDDE8)\uFE0F?|[#-9]\uFE0F?\u20E3|(?:(?:\uD83C\uDFF4)(?:\uDB40[\uDC60-\uDCFF]){1,6})|(?:\uD83C[\uDDE0-\uDDFF]){2}|(?:(?:\uD83D[\uDC68\uDC69]))\uFE0F?(?:\uD83C[\uDFFA-\uDFFF])?\u200D?(?:[\u2695\u2696\u2708]|\uD83C[\uDF3E-\uDFED]|\uD83D[\uDCBB\uDCBC\uDD27\uDD2C\uDE80\uDE92])|(?:\uD83D[\uDC68\uDC69]|\uD83E[\uDDD0-\uDDDF])(?:\uD83C[\uDFFA-\uDFFF])?\u200D?[\u2640\u2642\u2695\u2696\u2708]?\uFE0F?|(?:(?:\u2764|\uD83D[\uDC66-\uDC69\uDC8B])[\u200D\uFE0F]{0,2}){1,3}(?:\u2764|\uD83D[\uDC66-\uDC69\uDC8B])|(?:(?:\u2764|\uD83D[\uDC66-\uDC69\uDC8B])\uFE0F?){2,4}|(?:\uD83D[\uDC68\uDC69\uDC6E\uDC71-\uDC87\uDD75\uDE45-\uDE4E]|\uD83E[\uDD26\uDD37]|\uD83C[\uDFC3-\uDFCC]|\uD83E[\uDD38-\uDD3E]|\uD83D[\uDEA3-\uDEB6]|\u26f9|\uD83D\uDC6F)\uFE0F?(?:\uD83C[\uDFFB-\uDFFF])?\u200D?[\u2640\u2642]?\uFE0F?|(?:[\u261D\u26F9\u270A-\u270D]|\uD83C[\uDF85-\uDFCC]|\uD83D[\uDC42-\uDCAA\uDD74-\uDD96\uDE45-\uDE4F\uDEA3-\uDECC]|\uD83E[\uDD18-\uDD3E])\uFE0F?(?:\uD83C[\uDFFB-\uDFFF])|(?:[\u2194-\u2199\u21a9-\u21aa]\uFE0F?|[\u0023\u002a]|[\u3030\u303d]\uFE0F?|(?:\ud83c[\udd70-\udd71]|\ud83c\udd8e|\ud83c[\udd91-\udd9a])\uFE0F?|\u24c2\uFE0F?|[\u3297\u3299]\uFE0F?|(?:\ud83c[\ude01-\ude02]|\ud83c\ude1a|\ud83c\ude2f|\ud83c[\ude32-\ude3a]|\ud83c[\ude50-\ude51])\uFE0F?|[\u203c\u2049]\uFE0F?|[\u25aa-\u25ab\u25b6\u25c0\u25fb-\u25fe]\uFE0F?|[\u00a9\u00ae]\uFE0F?|[\u2122\u2139]\uFE0F?|\ud83c\udc04\uFE0F?|[\u2b05-\u2b07\u2b1b-\u2b1c\u2b50\u2b55]\uFE0F?|[\u231a-\u231b\u2328\u23cf\u23e9-\u23f3\u23f8-\u23fa]\uFE0F?|\ud83c\udccf|[\u2934\u2935]\uFE0F?)|[\u2700-\u27bf]\uFE0F?|[\ud800-\udbff][\udc00-\udfff]\uFE0F?|[\u2600-\u26FF]\uFE0F?|[\u0030-\u0039]\uFE0F", "g");
This line of code comes from the emojione library (emojione.js#L160).
The uglifyjs CLI command executed or minify() options used.
uglifyjs --compress --beautify beautify=false,semicolons=false --mangle -- emojione.js
JavaScript output or error produced.
new RegExp("<object[^>]*>.*?</object>|<span[^>]*>.*?</span>|<(?:object|embed|svg|img|div|span|p|a)[^>]*>|(?:🏳)️??(?:🌈)|(?:👁)️??(?:🗨)️?|[#-9]️?⃣|(?:(?:🏴)(?:\udb40[\udc60-\udcff]){1,6})|(?:\ud83c[\udde0-\uddff]){2}|(?:(?:\ud83d[\udc68�]))️?(?:\ud83c[\udffa-\udfff])??(?:[⚕⚖✈]|\ud83c[\udf3e-\udfed]|\ud83d[\udcbb�\udd27�\ude80�])|(?:\ud83d[\udc68�]|\ud83e[\uddd0-\udddf])(?:\ud83c[\udffa-\udfff])??[♀♂⚕⚖✈]?️?|(?:(?:❤|\ud83d[\udc66-\udc69�])[️]{0,2}){1,3}(?:❤|\ud83d[\udc66-\udc69�])|(?:(?:❤|\ud83d[\udc66-\udc69�])️?){2,4}|(?:\ud83d[\udc68�\udc6e�-\udc87�\ude45-\ude4e]|\ud83e[\udd26�]|\ud83c[\udfc3-\udfcc]|\ud83e[\udd38-\udd3e]|\ud83d[\udea3-\udeb6]|⛹|👯)️?(?:\ud83c[\udffb-\udfff])??[♀♂]?️?|(?:[☝⛹✊-✍]|\ud83c[\udf85-\udfcc]|\ud83d[\udc42-\udcaa�-\udd96�-\ude4f�-\udecc]|\ud83e[\udd18-\udd3e])️?(?:\ud83c[\udffb-\udfff])|(?:[↔-↙↩-↪]️?|[#*]|[〰〽]️?|(?:\ud83c[\udd70-\udd71]|🆎|\ud83c[\udd91-\udd9a])️?|Ⓜ️?|[㊗㊙]️?|(?:\ud83c[\ude01-\ude02]|🈚|🈯|\ud83c[\ude32-\ude3a]|\ud83c[\ude50-\ude51])️?|[‼⁉]️?|[▪-▫▶◀◻-◾]️?|[©®]️?|[™ℹ]️?|🀄️?|[⬅-⬇⬛-⬜⭐⭕]️?|[⌚-⌛⌨⏏⏩-⏳⏸-⏺]️?|🃏|[⤴⤵]️?)|[✀-➿]️?|[\ud800-\udbff][\udc00-\udfff]️?|[☀-⛿]️?|[0-9]️","g")
The above snippet is unparsable.
Bisecting versions shows that this last worked correctly in v3.0.25 and first broke in v3.1.0. I took a look at the compare for those two versions, but I just don't possess the required skillset to debug this stuff. I wish I could be more help.
That library is probably using some illegal unpaired surrogate. See: https://github.com/mishoo/UglifyJS2/issues/2242
This works:
$ bin/uglifyjs emojione.js -m -c -b beautify=false,ascii_only | node -p
/<object[^>]*>.*?<\/object>|<span[^>]*>.*?<\/span>|<(?:object|embed|svg|img|div|span|p|a)[^>]*>|(?:🏳)️??(?:🌈)|(?:👁)️??(?:🗨)️?|[#-9]️?⃣|(?:(?:🏴)(?:�[�-�]){1,6})|(?:�[�-�]){2}|(?:(?:�[��]))️?(?:�[�-�])??(?:[⚕⚖✈]|�[�-�]|�[������])|(?:�[��]|�[�-�])(?:�[�-�])??[♀♂⚕⚖✈]?️?|(?:(?:❤|�[�-��])[️]{0,2}){1,3}(?:❤|�[�-��])|(?:(?:❤|�[�-��])️?){2,4}|(?:�[����-���-�]|�[��]|�[�-�]|�[�-�]|�[�-�]|⛹|👯)️?(?:�[�-�])??[♀♂]?️?|(?:[☝⛹✊-✍]|�[�-�]|�[�-��-��-��-�]|�[�-�])️?(?:�[�-�])|(?:[↔-↙↩-↪]️?|[#*]|[〰〽]️?|(?:�[�-�]|🆎|�[�-�])️?|Ⓜ️?|[㊗㊙]️?|(?:�[�-�]|🈚|🈯|�[�-�]|�[�-�])️?|[‼⁉]️?|[▪-▫▶◀◻-◾]️?|[©®]️?|[™ℹ]️?|🀄️?|[⬅-⬇⬛-⬜⭐⭕]️?|[⌚-⌛⌨⏏⏩-⏳⏸-⏺]️?|🃏|[⤴⤵]️?)|[✀-➿]️?|[�-�][�-�]️?|[☀-⛿]️?|[0-9]️/g
Reduced test case:
$ bin/uglifyjs -V
uglify-es 3.2.1
$ cat regex.js
new RegExp("[\udc42-\udcaa\udd74-\udd96\ude45-\ude4f\udea3-\udecc]");
$ cat regex.js | bin/uglifyjs
new RegExp("[\udc42-\udcaa�-\udd96�-\ude4f�-\udecc]");
$ cat regex.js | bin/uglifyjs -b ascii_only
new RegExp("[\udc42-\udcaa\udd74-\udd96\ude45-\ude4f\udea3-\udecc]");
$ cat regex.js | bin/uglifyjs | bin/uglifyjs -b ascii_only
new RegExp("[\udc42-\udcaa\ufffd-\udd96\ufffd-\ude4f\ufffd-\udecc]");
Unpaired surrogates must be output as ASCII escape sequences (\uXXXX), even in the default binary output mode.
uglify-es parser recognises surrogate pairs - so the question is does this work on harmony?
If so, I guess it's just a matter of backporting part of that logic onto master.
uglify-es parser recognises surrogate pairs - so the question is does this work on harmony?
The reduced test case above was done with harmony.
Other ES parsers keep and use the raw string. Uglify does not. We probably have to step through the string char by char to output it properly in binary mode.
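A minimal sketch of the char-by-char idea, for illustration only. This is not UglifyJS's actual code; the helper name `escapeLoneSurrogates` is hypothetical. It walks the string one UTF-16 code unit at a time, leaves well-formed surrogate pairs untouched, and rewrites any unpaired surrogate as a `\uXXXX` escape so it survives being written out as UTF-8:

```javascript
// Hypothetical helper (an assumption for illustration, not part of UglifyJS).
// Returns `str` with every unpaired surrogate replaced by a literal \uXXXX escape.
function escapeLoneSurrogates(str) {
    var out = "";
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        if (code >= 0xd800 && code <= 0xdbff) {
            // High surrogate: check whether a low surrogate follows.
            var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
            if (next >= 0xdc00 && next <= 0xdfff) {
                // Well-formed pair: emit both code units unchanged.
                out += str.charAt(i) + str.charAt(i + 1);
                i++;
                continue;
            }
            // High surrogate with no partner: escape it.
            out += "\\u" + code.toString(16);
        } else if (code >= 0xdc00 && code <= 0xdfff) {
            // Low surrogate not preceded by a high surrogate: escape it.
            out += "\\u" + code.toString(16);
        } else {
            out += str.charAt(i);
        }
    }
    return out;
}

// The character class from the reduced test case contains only lone low surrogates.
console.log(escapeLoneSurrogates("[\udc42-\udcaa\udd74-\udd96]"));
```

Applied to the reduced test case, every code unit inside the character class is escaped, while a real pair such as `"\ud83c\udf08"` passes through unchanged.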
@kzc Awesome. Thanks so much for helping produce the reduced test case.
@alexlamsl Thanks for the fast response / tagging.
Even though it can probably be accommodated by uglify, I still have my doubts that the original string with unpaired surrogates - even in ASCII form - is valid ECMAScript. The ES spec is silent on the use of unpaired surrogates in both strings and RegExp. It's probably a de facto browser thing.
Even node converts such a regex to a string by replacing the lone surrogates with the Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD):
$ cat regex.js | node -p
/[�-��-��-��-�]/
$ cat regex.js | node -p | xxd
0000000: 2f5b efbf bd2d efbf bdef bfbd 2def bfbd /[...-......-...
0000010: efbf bd2d efbf bdef bfbd 2def bfbd 5d2f ...-......-...]/
0000020: 0a .
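The `ef bf bd` byte sequences in the dump above are the UTF-8 encoding of U+FFFD. A small sketch that reproduces this with Node's `Buffer` (the variable names are mine, chosen for illustration): a lone surrogate has no UTF-8 encoding, so it is replaced on output, while a proper surrogate pair encodes to four bytes.

```javascript
// A lone low surrogate has no UTF-8 representation, so Node substitutes U+FFFD.
var loneHex = Buffer.from("\udc42", "utf8").toString("hex");
console.log(loneHex); // efbfbd

// A well-formed surrogate pair (U+1F308) encodes to four UTF-8 bytes.
var pairHex = Buffer.from("\ud83c\udf08", "utf8").toString("hex");
console.log(pairHex); // f09f8c88
```

This is presumably the same substitution that turns the unescaped surrogates into U+FFFD in uglify's binary output once it is written to stdout or a file.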
Related unicode regular expression spec:
"It is permissible, but not required, to match an isolated surrogate code point (such as u{D800}), which may occur in Unicode Strings."
http://unicode.org/reports/tr18/#Supplementary_Characters
and related discussion:
"lone surrogates cannot be part of any valid UTF"
http://unicode.org/pipermail/unicode/2015-October/002979.html
Those affected can use the -b beautify=false,ascii_only workaround in the meantime.
The fix will be in v3.2.3.
Awesome!! Thank you guys.
I am on uglify 3.3.1 and it still mangles my regexes, breaking the prod builds. And if I set ascii_only=true, the next compiler doesn't see the .factory method.
Minifying commonmark.js when building via the most recent create-react-app yields this error:

I've installed v3.3.9 of uglify-js