Ecma262: RegExp processing unicode+ignoreCase of \W is not the same as !\w when matching "S" or "K"

Created on 1 Apr 2016  路  25Comments  路  Source: tc39/ecma262

A bug has been reported against both V8 bug and JSC bug concerning whether or not the strings "K" and "S" match against the regular expression /\W/ui, i.e. not a word character with the unicode and ignoreCase flags. The submitter of at least one of those bugs describes the behavior here.

This raises a possible issue with what was intended for case insensitive Unicode regular expressions. A direct reading the relevant sections of the ECMAScript庐 2016 standard is that both /\W/ui.test("K") and /\W/ui.test("S") should return true, but that may not be what the standard intends.

The \W and related \w, are described in section 21.2.2.12 CharacterClassEscape. It states that word characters are matched with the \w CharacterClassEscape which consist of the set of {a .. z A .. Z 0 .. 9 _}. The \W CharacterClassEscape is the inverse of \w.

The creation of the character class from a character class escape is described in section 21.2.2.9 AtomEscape. In all cases, we create a character class with the inverse flag as false.

The case folding rules are defined in section 21.2.2.8.2 Runtime Semantics: Canonicalize. It states that for unicode case folding, the table in the file CaseFolding.txt from the Unicode Character Database is consulted and if there are common or simple case folding mappings for a character, that mapping is returned, otherwise the original character is returned.

The matching rules for character set matching is defined in section 21.2.2.8.1 Runtime Semantics: CharacterSetMatcher Abstract Operation. It states in step 1.c. that a character is retrieved from a valid index in the subject string. In step 1.d, that character is canonicalized. In steps 1.e.i. or 1.f.i., depending on the _invert_ flag, the result of the Canonicalize() from step 1.d. is compared against the the members of the set after each of the members is processed by Canonicalize(). In step 1.e.i., if there isn't a character in the set whose Canonicalize() result is the same as the result from step 1.d. return failure. In step 1.f.i., if there is a character in the set whose Canonicalize() result is the same as the result from step 1.d. return failure.

Putting these sections together, let's walk through these sections for the construct /\W/ui.test("S")..

The regular expression consists of the character class escape '\W', which includes all the characters not in the \w word character class. It also has the unicode and ignoreCase flags set. Matching will happen according to CharacterSetMatcher Abstract Operation, with the set A being all characters not in the \w class and the invert flag being false. At step 1.d., the 'S' character will be read and passed to Canonicalize(). The result according to CaseFolding.txt, will be 's'. Since _invert_ is false, the matching will proceed to step 1.e.i.. There exists one character in set A, \u017f (lower case long s) that will also case fold to 's'. Therefore we will not return failure, but will instead match. The same analysis of "K" holds as there exists one character \u212a (Kelvin symbol) that will canonicalize to 'k'.

The issue is whether or not this behavior was intended. Without the unicode flag, the processing of \W is the same as \w. For the unicode behavior to match the non-unicode behavior, the description for producing a character class from a character class escape needs to change for all of the upper case character class escape to produce a character class identical to the lower case equivalent with the inverse flag as true.

All 25 comments

The issue is whether or not this behavior was intended.

It was: https://bugs.ecmascript.org/show_bug.cgi?id=3145#c1

Applying the principle of least surprise, it seems to me that/\W/ui.test("S") should be false given that /\w/ui.test("S") is true. The character class escapes are arranges as lower case escapes being is X and the upper case escapes being is NOT X.

\d is digit character and \D is not digit character
\s is space character and \S is not space character
\w is word character and \W is not word character
Furthermore, !\d matches the same as \D, !\s matches the same as \S, and !\w matches the same as \W.

This holds true for all subject characters, all character class escapes and all combinations of the unicode and ignoreCase flags with the exceptions of /\W/.ui and the characters 'k', 'K', 's' and 'S'.

cc @hashseed

I don't think we should set the precedent of deviating from the unicode standard by adding exceptions to avoid surprises. These exceptions would themselves be surprising.
\w and \W should not be used in combination with /u. We should rather speed up the specification on \p and \P so that we can use \p{Letter} instead. \w is not a word character in the unicode context. It's simply a shorthand for a-zA-Z0-9_.

Alternatively, you could bring this up with the unicode standard body to have it fixed there, at the source. I don't think TC39 is the right place to debate how case folding should work.

I'm not suggesting that we deviate from the unicode standard. What I am suggesting is that we change the ECMAScript spec so that \W has the same set of characters as \w, but with the invert flag set to true. Although it won't change any other results, I would also suggest the same for \D and \S.

Having iu enable case-folding in most cases but not in others would be equally confusing/surprising IMHO.

Having iu enable case-folding in most cases but not in others would be equally confusing/surprising IMHO.

Agreed. I am not suggesting ignoring case-folding rules. I am suggesting that \W be the same thing as !\w and not something different.

Agreed. I am not suggesting ignoring case-folding rules.

Isn鈥檛 that exactly what you鈥檙e suggesting here:

I am suggesting that \W be the same thing as !\w and not something different.

One implies the other. The fact that \W is not !\w in iu mode is due to case folding / canonicalization.

@mathiasbynens, I think you're misunderstanding @msaboff's suggestion.

Currently (21.2.2.9, last algorithm), /\W/ evaluates to CharacterSetMatcher(A, false), where A is the set of characters different from A-Za-z0-9_.

What @msaboff is suggesting, is that /\W/ evaluate to CharacterSetMatcher(B, true), where B is the set of characters consisting of A-Za-z0-9_.

With this change, \W will match the characters not matched by \w (and vice versa), because the condition in step e.i of CharacterSetMatcher is the exact negation of condition in step f.i.

In particular, with this change, both "趴" (long s) and "S" will be matched by /\w/ui, and none of them will be matched by /\W/ui.

@claudepache You鈥檙e right, that鈥檚 not how I understood it. Thanks for making it clear to me.

This suggestion seems sensible (assuming there is no existing code in the wild that relies on the current behavior of /\W/iu). Thanks for bringing this up again, @msaboff!

@claudepache, thanks for explaining what I am proposing another way.

This change seems much more logical and intuitive for programmers.

Same should apply to \D and \S for symmetry, right?

@hashseed Agreed (although it won鈥檛 make a difference there).

Sounds good to me as well.

How do you interpret a class ranges that includes such a class escape?

Currently, the following are equivalent:
/\W/
/[\W]/
/[^\w]/
Consequently, these are also equivalent:
/\W/ui
/[\W]/ui
/[^\w]/ui
With the proposed change, I don't know how to interpret /[\W]/. And would these two below still be equivalent?
/[^\W]/ui
/[\w]/ui
What would /[^\W123]/ even mean? Would I need to desugar that into /\w|[123]/?
According to [0], CharSets inside the ClassRange are merged. But that assumes that CharSets are not negated.

[0] https://tc39.github.io/ecma262/#sec-nonemptyclassranges

How do you interpret a class ranges that includes such a class escape?

Yes, it is the difficulty I'm encountering when trying to write a spec for the change. A more difficult case is: /[^\d\W]/.

Currently, the following are equivalent:
/\W/
/[\W]/
/[^\w]/
Consequently, these are also equivalent:
/\W/ui
/[\W]/ui
/[^\w]/ui

Precisely not: /[\W]/ui is _not_ equivalent to /[^\w]/ui (under the current semantics): the first one matches "S", but not the second one. It is because

CharacterSetMatcher(A, false)

is _not_ equivalent to

CharacterSetMatcher(B, true), where B is the set of characters not contained in A

when the CharSet A is not stable under the Canonicalize() abstract operation.

Thanks for the correction. The point regarding class ranges stands though.

A possibility is to specify \w to evaluate to the set of characters _ch_ such that Canonicalize(_ch_) is in A-Za-z0-9_. With that semantics, \w will evaluate to a larger set than A-Za-z0-9_ when both the IgnoreCase and the Unicode flags are set to true.

Or simply have \w be a synonym for \p{Letter} for unicode regexps, as in, characters that have the general category "Letter". Both are breaking changes.

@hashseed There was a proposal to make \d and \w (and \b) more Unicode-aware, but it was rejected.

Created pull request #516 with suggested changes.

Another solution to consider is to throw a SyntaxError when \w or \W is used together with /ui flags, alerting the coder that those character classes are fundamentally incompatible with these flags, without remedy.

@claudepache I think it's a little late for that, given the shipped browser support.

This is resolved now I believe!

Was this page helpful?
0 / 5 - 0 ratings