Uglifyjs: Issue with compressing escaped UTF8 characters

Created on 23 Nov 2012 · 25 Comments · Source: mishoo/UglifyJS

I am running into an issue when attempting to compress Acorn.js or Esprima.js for browser use:

Both libraries contain regular expressions that are full of escaped UTF-8 characters (look for NonAsciiIdentifierStart in Esprima, for example).

UglifyJS reads these, resolves the escaped characters, and then outputs them as literal UTF-8 characters.

Loading this compressed library on WebKit then causes an exception:

Invalid regular expression: range out of order in character class

Using UglifyJS1, I could pass --ascii to the compressor, solving this issue, but leading to an increased file-size of the result (about 4kb more, since all non-ascii chars would now be escaped).

In UglifyJS2, when passing -b ascii_only=true, I do not get any compressed output anymore. I believe, since the beautifier is activated, -c is ignored.

Ideally, though, I would like an option that would allow me to just preserve these strings in their current form, including all the escaping that they contain.

Depending on the JS parser in use, this might be hard to implement though.

What do you suggest?

lehni

Most helpful comment

While preserving UTF-8 as is would be a good feature for uglify to have, for those who arrived at this page looking for a solution to the "Invalid regular expression: range out of order in character class" error: you can at least add charset="utf-8" as an attribute on your script tag, and also make sure your HTML markup declares the same charset, to get rid of the error.

michaelwsk on 14 Oct 2013

👍13 ❤5 😄3 🎉2

All 25 comments

Using -b ascii_only=true works, but -b automatically implies “beautify”. One way around this is -b ascii_only=true,beautify=false. I'll add a separate --ascii-only flag, since I imagine it will be a common need, but in the mean time you can use this hack (it'll continue to work).

mishoo on 24 Nov 2012
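
To make the effect of ascii_only concrete, here is a minimal sketch of that kind of escaping. The helper name toAsciiOnly is hypothetical and this is not UglifyJS's implementation; it only illustrates why the output grows, since every non-ASCII UTF-16 code unit becomes a six-character \uXXXX sequence.

```javascript
// Hypothetical helper (not part of UglifyJS): escape every non-ASCII
// UTF-16 code unit in `text` as a \uXXXX sequence.
function toAsciiOnly(text) {
  return text.replace(/[\u0080-\uffff]/g, function (ch) {
    var hex = ch.charCodeAt(0).toString(16);
    // Left-pad the hex code to four digits.
    return "\\u" + "0000".slice(hex.length) + hex;
  });
}

// A piece of minified output that contains one raw non-ASCII character.
var literal = "var s=\"" + String.fromCharCode(0xe9) + "\";";
var escaped = toAsciiOnly(literal);

console.log(escaped);
console.log(literal.length, escaped.length);
```

Each escaped character costs five extra characters in the output, which is where the size increase discussed in this thread comes from.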

That's good to know! But what do you think of an option to preserve the strings as they are? Escaping all unicode chars will unnecessarily increase the size of these libraries by 4kb otherwise.

lehni on 24 Nov 2012

That's more tricky because we don't store any information about how the actual string/regexp was written. Perhaps --regexp-ascii-only would be an acceptable compromise?

mishoo on 24 Nov 2012

In this particular case, the regexps are constructed from strings, so I don't think that would work?

lehni on 24 Nov 2012

Indeed, it won't help then. Thinking about it, in Acorn, apart from those long regexps there's probably no other place where Unicode chars are used literally, which means that even with an option to keep literals as they are in the original source, you won't get smaller output. So I still don't see a good reason to implement this.

Here's another suggestion. I am assuming you compress multiple files and would like the ascii_only to apply only to Acorn. I could add support for per-file compression options, specified in the source file. So you'd add to acorn.js for example:

// this is a no-op statement, UglifyJS will drop it from the output
UglifyJS: ({
    codegen: {
        ascii_only: true
    }
});

This isn't a trivial change either :( but perhaps it would be best in the long run.

mishoo on 24 Nov 2012

Would it not be possible to keep the original source text in the AST and, when generating output, emit that instead of the newly constructed source?

rvanvelzen on 26 Nov 2012

It's true that in Acorn, the size would not differ, since all unicode-chars are also escaped in the original file already.

As to the suggested change to allow configuring UglifyJS differently per file: This is not necessary for Paper.js, since we minify the library at the end, so we are just specifying "-b ascii_only=true,beautify=false" there now, and that works for us.

Feel free to close this bug, if you consider it a "wont fix".

lehni on 25 Dec 2012

> Feel free to close this bug, if you consider it a "wont fix".

I don't; in fact I'd like to add an option to keep literals as in the original source, at some point, so let's keep the ticket open.

mishoo on 26 Dec 2012

This problem exists also for ES5-Shim/Sham and the workaround works. Thanks.

kriskowal on 7 May 2013

The problem exists when minifying SockJS as well: https://github.com/sockjs/sockjs-client/blob/3f7e5f4ef5673dc15d3d3db890bc82a725f0c91d/lib/utils.js#L182-L211

They solve it in a pre-minification step here: https://github.com/sockjs/sockjs-client/blob/master/bin/render.coffee#L33-L38

However, when I use their non-minified source and run my own minification, I of course run into trouble.
IE in particular does not like this transformation (original first, minified output second):

var foo = {
    "\ufff0": "\\ufff0",
    "\ufff1": "\\ufff1"
};

var foo={"￰":"\\ufff0","￱":"\\ufff1"};

IE sees both property names as the same, firing a warning: multiple definitions of a property not allowed in strict mode

Munter on 21 May 2013

Do I understand correctly that the regexp problem only occurs when files aren't served with the right encoding information? I've just added another fat unicode regexp to one of my libraries (to recognize extending characters in CodeMirror), and it'd be great if users had the option of using the UTF-8 encoding of this regexp (which will be about three times smaller than the escaped version).

UglifyJS seems to always use the original source for regexps now, but unless there are browser issues preventing this from working, I would like some way to force it to spit out the unescaped version of the regexp somehow.

marijnh on 9 Jan 2014

@marijnh the way a literal regexp goes out is via its toString() method. While UglifyJS does no (un)escaping (except when ascii_only is passed to the code generator), it seems possible that some particular implementation of RegExp::toString can modify the output. Can you send a test case?

mishoo on 10 Jan 2014

I'm testing on node (v0.11.9), the following program:

var string = "\u00e9"; var re = /\u00e9/;

gets compressed to

var string="é";var re=/\u00e9/;

i.e. the string does get unescaped, but the encoding for the regexp is preserved.

What I'm mostly curious about is what the intended behavior is, and why it is that way.

marijnh on 10 Jan 2014
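
The difference described here can be reproduced without UglifyJS. A small sketch, assuming a V8-based runtime such as Node: a string literal's escapes are resolved when the source is parsed, so the value no longer records how it was written, whereas a regexp object keeps its pattern text in .source, which is what toString() prints.

```javascript
var string = "\u00e9";
var re = /\u00e9/;

// The string value is just the single character; the escape is gone.
console.log(string === String.fromCharCode(0xe9));
console.log(string.length);

// The regexp still carries the pattern text as it was written.
console.log(re.source);
console.log(String(re));
```

This matches the statement earlier in the thread that no information is stored about how a string was originally written, while a literal regexp is printed through its own toString().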

Got it. Seems it was always this way for regexps. I'm trying out unescaping them, but it's not valid for all cases, for example if I unescape the following regexp: /^[ \n\t\x0C\u2028\u2029\xA0]+/ I get an "invalid regexp" error in NodeJS. I'm not sure what other cases there are... do you happen to know how to decide what characters to replace and what to remain escaped?

mishoo on 10 Jan 2014

\u2028 and \u2029 count as line separators in JavaScript, so they cannot appear unescaped in regexps or strings.

marijnh on 10 Jan 2014
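
The regexp half of this can be checked directly. The sketch below builds source text with a raw U+2028 or U+2029 inside a regexp literal and asks the engine to parse it; the helper name parses is made up for illustration. Newer engines accept these two characters raw inside string literals, so the sketch only exercises the regexp case.

```javascript
// Hypothetical helper: does `source` parse as a JavaScript function body?
function parses(source) {
  try {
    new Function(source); // only parses the body, never calls it
    return true;
  } catch (e) {
    return false;
  }
}

var LS = String.fromCharCode(0x2028); // raw LINE SEPARATOR
var PS = String.fromCharCode(0x2029); // raw PARAGRAPH SEPARATOR

var escapedForm = "return /[\\u2028\\u2029]/;"; // source text contains the escapes
var rawLS = "return /[" + LS + "]/;";           // source text contains the raw character
var rawPS = "return /[" + PS + "]/;";

console.log(parses(escapedForm));
console.log(parses(rawLS));
console.log(parses(rawPS));
```

The escaped form parses; both raw forms are rejected, because a line terminator ends the regexp literal before its closing slash.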

Pushed it. I hope there aren't other invalid cases... :)

mishoo on 10 Jan 2014

Should be okay; those two (along with \r and \n) are the only line separators.

marijnh on 10 Jan 2014

There's at least one case missing: '/'. UglifyJS itself uses this in

if (options.inline_script)
    ret = ret.replace(/<\x2fscript([>\/\t\n\f\r ])/gi, "<\\/script$1");

However, even when 47 is added to the code list, minifying UglifyJS itself (exported via uglifyjs --self) seems to produce invalid JS.

lautis on 18 Jan 2014

@lautis Good point, thanks. Fixed. I'll publish a new version.

mishoo on 18 Jan 2014

@marijnh turns out the list was rather large. :-) Hopefully I got them all this time.

mishoo on 21 Jan 2014

JSLint has a list of characters it sees as unsafe. It may be a good substitute.
\u0000-\u001f \u007f-\u009f \u00ad \u0600-\u0604 \u070f \u17b4 \u17b5 \u200c-\u200f \u2028-\u202f \u2060-\u206f \ufeff \ufff0-\uffff

rvanvelzen on 9 Feb 2014
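
As a sketch of how such a list could drive the decision discussed above (which characters to emit raw and which to keep escaped), the character class below is copied from that list. The function names are hypothetical, and this is not how UglifyJS or JSLint implement it.

```javascript
// Character class copied from the JSLint list quoted above.
var UNSAFE = /[\u0000-\u001f\u007f-\u009f\u00ad\u0600-\u0604\u070f\u17b4\u17b5\u200c-\u200f\u2028-\u202f\u2060-\u206f\ufeff\ufff0-\uffff]/g;

// Hypothetical helper: format one UTF-16 code unit as \uXXXX.
function toUnicodeEscape(ch) {
  var hex = ch.charCodeAt(0).toString(16);
  return "\\u" + "0000".slice(hex.length) + hex;
}

// Hypothetical helper: escape only the characters on the unsafe list,
// leaving everything else (including other non-ASCII characters) raw.
function escapeUnsafe(text) {
  return text.replace(UNSAFE, toUnicodeEscape);
}

console.log(escapeUnsafe("caf\u00e9 \ufff0 \u2028"));
```

With this approach ordinary accented letters stay raw and compact, while the \ufff0 and \ufff1 property names from the SockJS example and the two line separators stay escaped. Characters that are only a problem in a particular context, such as "/" inside a regexp literal, are outside the list and would need separate handling.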

Hello

I couldn't quite tell whether this issue is supposed to be solved; I personally have a problem with special escape characters in strings.

The offending code:

    function escapeString(value) {
        return value.replace(/[\0\n\r\b\t\\\'\"\x1a]/g, function(s) {
            switch(s) {
                case '\0':
                    return '\\0';
                case '\n':
                    return '\\n';
                case '\r':
                    return '\\r';
                case '\b':
                    return '\\b';
                case '\t':
                    return '\\t';
                case '\x1a':
                    return '\\Z';
                default:
                    return '\\' + s;
            }
        });
    }

(this is a JS version of mysql_real_escape_string, not my own code)

The problem is that \t and \x1a are replaced by the actual tab and SUB characters,
and this seems to cause issues sometimes.

mistic100 on 14 Aug 2014

Hi @mishoo,

the character sequence \x0B also gets converted, which can create problems in certain circumstances. Would it be possible to just leave it as is? :)

Regards,
J