Uglifyjs: Issue with compressing escaped UTF8 characters

Created on 23 Nov 2012  ·  25 Comments  ·  Source: mishoo/UglifyJS

I am running into an issue when attempting to compress Acorn.js or Esprima.js for browser use:

Both libraries contain regular expressions that are full of escaped UTF-8 characters (look for NonAsciiIdentifierStart in Esprima, for example).

UglifyJS reads these, resolves the escaped characters, and then outputs them as real UTF-8 characters.

Loading this compressed library on WebKit then causes an exception:

Invalid regular expression: range out of order in character class

Using UglifyJS1, I could pass --ascii to the compressor, solving this issue, but leading to an increased file-size of the result (about 4kb more, since all non-ascii chars would now be escaped).

In UglifyJS2, when passing -b ascii_only=true, I do not get any compressed output anymore. I believe, since the beautifier is activated, -c is ignored.

Ideally, though, I would like an option that would allow me to just preserve these strings in their current form, including all escaping that they contain.

Depending on the JS parser in use, this might be hard to implement though.

What do you suggest?

All 25 comments

Using -b ascii_only=true works, but -b automatically implies “beautify”. One way around this is -b ascii_only=true,beautify=false. I'll add a separate --ascii-only flag, since I imagine it will be a common need, but in the meantime you can use this hack (it'll continue to work).

That's good to know! But what do you think of an option to preserve the strings as they are? Escaping all unicode chars will unnecessarily increase the size of these libraries by 4kb otherwise.
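
To make the size trade-off concrete, here is a minimal sketch of what escaping to ASCII amounts to. `toAsciiOnly` and `utf8ByteLength` are hypothetical helpers written for illustration, not UglifyJS's actual code generator, and the sketch always emits the six-character `\uXXXX` form.

```javascript
// Hypothetical helper, for illustration only: replace every non-ASCII
// UTF-16 code unit with its \uXXXX escape sequence.
function toAsciiOnly(text) {
  return text.replace(/[\u0080-\uffff]/g, function (ch) {
    var hex = ch.charCodeAt(0).toString(16);
    return "\\u" + "0000".slice(hex.length) + hex;
  });
}

// Size of a string in bytes once it is encoded as UTF-8.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

var literal = "\u00e9\u0370\u2118"; // three non-ASCII characters
var escaped = toAsciiOnly(literal);

console.log(escaped);
console.log(utf8ByteLength(literal), "bytes as literal UTF-8");
console.log(utf8ByteLength(escaped), "bytes when escaped");
```

Each escaped character costs six bytes, against two or three bytes for the same character written literally in UTF-8, which is where the extra kilobytes come from.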

That's more tricky because we don't store any information about how the actual string/regexp was written. Perhaps --regexp-ascii-only would be an acceptable compromise?

In this particular case, the regexps are constructed from strings, so I don't think that would work?

Indeed, it won't help then. Thinking about it, in Acorn, apart from those long regexps there's probably no other place where Unicode chars are used literally, which means that even with an option to keep literals as they are in the original source, you won't get smaller output. So I still don't see a good reason to implement this.

Here's another suggestion. I am assuming you compress multiple files and would like the ascii_only to apply only to Acorn. I could add support for per-file compression options, specified in the source file. So you'd add to acorn.js for example:

// this is a no-op statement, UglifyJS will drop it from the output
UglifyJS: ({
    codegen: {
        ascii_only: true
    }
});

This isn't a trivial change either :( but perhaps it would be best in the long run.

Would it not be possible to keep the original source text in the AST and, when generating output, emit that instead of the newly constructed source?

It's true that in Acorn, the size would not differ, since all unicode-chars are also escaped in the original file already.

As to the suggested change to allow configuring UglifyJS differently per file: This is not necessary for Paper.js, since we minify the library at the end, so we are just specifying "-b ascii_only=true,beautify=false" there now, and that works for us.

Feel free to close this bug, if you consider it a "wont fix".

> Feel free to close this bug, if you consider it a "wont fix".

I don't; in fact I'd like to add an option to keep literals as in the original source, at some point, so let's keep the ticket open.

This problem exists also for ES5-Shim/Sham and the workaround works. Thanks.

The problem exists when minifying SockJS as well: https://github.com/sockjs/sockjs-client/blob/3f7e5f4ef5673dc15d3d3db890bc82a725f0c91d/lib/utils.js#L182-L211

They solve it in a pre-minification step here: https://github.com/sockjs/sockjs-client/blob/master/bin/render.coffee#L33-L38

However, when I use their non-minified source and run my own minification, I of course get into trouble.
IE in particular does not like this transformation:

var foo = {
    "\ufff0": "\\ufff0",
    "\ufff1": "\\ufff1"
};

to

var foo={"￰":"\\ufff0","￱":"\\ufff1"};

IE sees both property names as the same, firing a warning: multiple definitions of a property not allowed in strict mode

While preserving UTF-8 as is would be a good feature for UglifyJS to have, for those who arrived at this page looking for a solution to the "Invalid regular expression: range out of order in character class" error: you can at least add charset="utf-8" to your script tag, and also make sure your HTML markup declares the same encoding (for example with a <meta charset="utf-8"> tag), to get rid of the error.
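
One plausible way the error arises is sketched below, under stated assumptions: the minified file contains raw UTF-8 bytes, but the browser decodes them as windows-1252 because no charset was declared. The character class `[\u00c0-\u00d6]` is just an example range chosen for illustration, `compiles` is a hypothetical helper, and the snippet assumes a runtime whose `TextDecoder` supports the `windows-1252` label (for instance Node.js built with full ICU).

```javascript
// A character class as it looks after the escapes were resolved.
var pattern = "[\u00c0-\u00d6]";

// The bytes written to disk when the file is saved as UTF-8.
var utf8Bytes = new TextEncoder().encode(pattern);

// What a browser sees if it decodes those bytes as windows-1252 instead.
var misdecoded = new TextDecoder("windows-1252").decode(utf8Bytes);

// Hypothetical helper: does this source compile as a regular expression?
function compiles(source) {
  try {
    new RegExp(source);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(compiles(pattern));    // correctly decoded pattern
console.log(compiles(misdecoded)); // mis-decoded pattern
```

In the mis-decoded text each two-byte character has become two unrelated characters, so the hyphen now sits between a different pair of code points, and the resulting range runs backwards. Declaring the charset avoids the mis-decoding in the first place.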

Do I understand correctly that the regexp problem only occurs when files aren't served with the right encoding information? I've just added another fat unicode regexp to one of my libraries (to recognize extending characters in CodeMirror), and it'd be great if users have the option of using the UTF8 encoding of this regexp (which will be about three times smaller than the escaped version), somehow.

UglifyJS seems to always use the original source for regexps now, but unless there are browser issues preventing this from working, I would like some way to force it to spit out the unescaped version of the regexp somehow.

@marijnh the way a literal regexp goes out is via its toString() method. While UglifyJS does no (un)escaping (except when ascii_only is passed to the code generator), it seems possible that some particular implementation of RegExp::toString can modify the output. Can you send a test case?

I'm testing on node (v0.11.9), the following program:

var string = "\u00e9"; var re = /\u00e9/;

gets compressed to

var string="é";var re=/\u00e9/;

i.e. the string does get unescaped, but the encoding for the regexp is preserved.
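
The difference can be observed without UglifyJS at all. The snippet below only illustrates standard JavaScript behaviour as seen in engines such as V8, not UglifyJS internals: a string literal evaluates to the decoded character, whereas a regexp literal keeps its source text.

```javascript
var string = "\u00e9";
var re = /\u00e9/;

// The string value holds a single decoded character.
console.log(string.length);                 // 1

// The regexp keeps the six-character escape in its source text,
// which is also what toString() returns between the slashes.
console.log(re.source.length);              // 6
console.log(re.toString() === "/\\u00e9/"); // true

// Both forms still refer to the same character at runtime.
console.log(re.test(string));               // true
```

A code generator that prints strings from their value and regexps from their source text would therefore produce exactly the mixed output shown above.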

What I'm mostly curious about is what the intended behavior is, and why it is that way.

Got it. Seems it was always this way for regexps. I'm trying out unescaping them, but it's not valid for all cases, for example if I unescape the following regexp: /^[ \n\t\x0C\u2028\u2029\xA0]+/ I get an "invalid regexp" error in NodeJS. I'm not sure what other cases there are... do you happen to know how to decide what characters to replace and what to remain escaped?

\u2028 and \u2029 count as line separators in JavaScript, so they cannot appear unescaped in regexps or strings.

Pushed it. I hope there aren't other invalid cases... :)

Should be okay: those two (along with \r and \n) are the only line separators.

There's at least one case missing: '/'. UglifyJS itself uses this in

if (options.inline_script)
    ret = ret.replace(/<\x2fscript([>\/\t\n\f\r ])/gi, "<\\/script$1");

However, even when 47 is added to the code list, minifying UglifyJS itself (exported via uglifyjs --self) seems to produce invalid JS.
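
Why these characters have to stay escaped inside a regexp literal can be checked directly. The sketch below uses a hypothetical helper, `parsesAsScript`, which hands a piece of source text to the `Function` constructor and reports whether it parses. It is an illustration only and not part of UglifyJS.

```javascript
// Hypothetical helper: returns true if `source` is syntactically valid
// as a function body, false if the parser rejects it.
function parsesAsScript(source) {
  try {
    new Function(source);
    return true;
  } catch (err) {
    return false;
  }
}

// Escaped forms are accepted inside a regexp literal.
var escapedSeparator = parsesAsScript("return /\\u2028/;");
var escapedSlash = parsesAsScript("return /<\\x2fscript/;");

// The same literals with the escapes resolved to raw characters.
var rawSeparator = parsesAsScript("return /\u2028/;");
var rawSlash = parsesAsScript("return /</script/;");

console.log(escapedSeparator, escapedSlash);
console.log(rawSeparator, rawSlash);
```

A raw line separator ends the line in the middle of the literal, and a raw slash ends the literal itself, so in both cases the remaining text no longer parses as intended.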

@lautis Good point, thanks. Fixed. I'll publish a new version.

@marijnh turns out the list was rather large. :-) Hopefully I got them all this time.

JSLint has a list of characters it sees as unsafe. It may be a good substitute.
\u0000-\u001f \u007f-\u009f \u00ad \u0600-\u0604 \u070f \u17b4 \u17b5 \u200c-\u200f \u2028-\u202f \u2060-\u206f \ufeff \ufff0-\uffff
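
One way such a list could be used, as a rough sketch only: resolve a `\uXXXX` escape to its literal character unless the character is ASCII or falls in the unsafe set quoted above. `unescapeRegexpSource` and `UNSAFE` are hypothetical names, the approach is not taken from UglifyJS's code, keeping surrogate halves escaped is an extra assumption of this sketch, and escapes other than `\uXXXX` are deliberately left untouched.

```javascript
// The character ranges listed above; these stay escaped in the output.
var UNSAFE = /[\u0000-\u001f\u007f-\u009f\u00ad\u0600-\u0604\u070f\u17b4\u17b5\u200c-\u200f\u2028-\u202f\u2060-\u206f\ufeff\ufff0-\uffff]/;

// Hypothetical helper: takes regexp source text and resolves the \uXXXX
// escapes that look safe to emit as literal characters.
function unescapeRegexpSource(source) {
  // Consume escapes one at a time so that an escaped backslash
  // followed by "u00e9" is not mistaken for a Unicode escape.
  return source.replace(/\\(?:u([0-9a-fA-F]{4})|[\s\S])/g, function (match, hex) {
    if (hex === undefined) return match;                 // some other escape
    var code = parseInt(hex, 16);
    if (code < 0x80) return match;                       // keep ASCII escaped
    if (code >= 0xd800 && code <= 0xdfff) return match;  // assumption: keep surrogates
    var ch = String.fromCharCode(code);
    return UNSAFE.test(ch) ? match : ch;
  });
}

console.log(unescapeRegexpSource("^[\\u00e9\\u2028\\u2029\\xA0]+"));
```

Restricting the rewrite to non-ASCII code points also keeps cases such as an escaped slash untouched, at the cost of leaving some harmless escapes in place.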

Hello

I didn't really understand whether the issue should be considered solved. I personally have a problem with special escape chars in strings.

The offending code:

    function escapeString(value) {
        return value.replace(/[\0\n\r\b\t\\\'\"\x1a]/g, function(s) {
            switch(s) {
                case '\0':
                    return '\\0';
                case '\n':
                    return '\\n';
                case '\r':
                    return '\\r';
                case '\b':
                    return '\\b';
                case '\t':
                    return '\\t';
                case '\x1a':
                    return '\\Z';
                default:
                    return '\\' + s;
            }
        });
    }

(this is a JS version of mysql_real_escape_string, not my own code)

The problem is that \t and \x1a are replaced by the actual tab and SUB characters, and this seems to cause issues sometimes.

Hi @mishoo,

the character sequence \x0B also gets converted, which can create problems in certain circumstances. Would it be possible to just leave it as is? :)

Regards,
J

Please use ascii_only from --beautify if you have non-standard requirements.
