Julia: Option to use PCRE2_UCP in regular expressions ("u" regex flag)

Created on 12 May 2018 · 11 Comments · Source: JuliaLang/julia

If the PCRE2_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types, as follows:

\d any character that matches \p{Nd} (decimal digit)
\s any character that matches \p{Z} or \h or \v
\w any character that matches \p{L} or \p{N}, plus underscore

(This is only a subset of the changes; other parts of the documentation mention changes to the interpretation of POSIX character classes, among other things.)
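
For illustration, the UCP meanings of \d and \w quoted above can be spelled out with explicit Unicode property classes, which PCRE2 accepts whether or not PCRE2_UCP is set. This is only a minimal sketch; the constant names and sample strings are made up for the example:

```julia
# Explicit spellings of what \d and \w mean under PCRE2_UCP.
# The constant names are illustrative and not part of Base.
const UCP_DIGITS = r"\p{Nd}+"          # what \d+ means under UCP
const UCP_WORD   = r"[\p{L}\p{N}_]+"   # what \w+ means under UCP

word   = match(UCP_WORD, "~~caf\u00e9_2!! ")      # "\u00e9" is a precomposed é
digits = match(UCP_DIGITS, "x = \u0663\u0664")    # Arabic-Indic digits three and four

println(word.match)
println(digits.match)
```

Because the classes are written out explicitly, the result does not depend on whether \w itself is Unicode-aware.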

There's a const UCP = UInt32(0x00020000) defined in base/pcre_h.jl, but it doesn't seem to be used for anything. It would be very useful to have a regex flag that allowed us to specify to the regex library when we wanted UCP set.
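
In the absence of a flag, one way to experiment might be to pass the option bits directly. The sketch below assumes the three-argument Regex(pattern, compile_options, match_options) constructor and the unexported names Base.DEFAULT_COMPILER_OPTS, Base.DEFAULT_MATCH_OPTS and Base.PCRE.UCP. These are internals rather than a documented API, so they may differ between Julia versions:

```julia
# Assumption: these unexported Base names exist in the Julia version in use.
# OR-ing in the UCP bit is harmless if the defaults already contain it.
ucp_opts = Base.DEFAULT_COMPILER_OPTS | Base.PCRE.UCP

# Build the regex from explicit compile and match options instead of letter flags.
re_ucp = Regex("\\w+", ucp_opts, Base.DEFAULT_MATCH_OPTS)

ucp_match = match(re_ucp, "caf\u00e9")
println(ucp_match.match)
```

This is a workaround for trying the behaviour out, not a substitute for a proper flag.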

Perl seems to set its /u flag automatically (http://perldoc.perl.org/perlre.html, under "Character set modifiers") when in 'unicode_strings' mode, which is the default mode in recent Perl versions. Julia could do the same, since Julia is also in 'unicode strings' mode by default, and make PCRE2_UCP the default. However, the PCRE2 documentation says "Matching these sequences is noticeably slower when PCRE2_UCP is set". If that is still the case in practice and there is a performance impact, UCP could instead be a non-default flag that the user enables explicitly, e.g. match(r"\w+"u, "~~கசடதபற_2!! "). Currently this returns:

ERROR: LoadError: ArgumentError: unknown regex flag: u
Stacktrace:
 [1] Regex(::String, ::String) at .\regex.jl:43
 [2] @r_str(::LineNumberNode, ::Module, ::Any, ::Vararg{Any,N} where N) at .\regex.jl:83
in expression starting at REPL[76]:1
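
To make the request concrete, here is a rough sketch of how a "u" flag could map onto the existing option bit. The table FLAG_BITS and the function flags_to_options are hypothetical and are not how base/regex.jl is actually written. Only the Base.PCRE constants come from Julia itself, and since they are unexported, relying on them is an assumption too:

```julia
# Hypothetical flag table: letter flag => PCRE2 compile option bit.
const FLAG_BITS = Dict(
    'i' => Base.PCRE.CASELESS,
    'm' => Base.PCRE.MULTILINE,
    's' => Base.PCRE.DOTALL,
    'x' => Base.PCRE.EXTENDED,
    'u' => Base.PCRE.UCP,       # the proposed addition
)

# Fold a flag string such as "iu" into a single option bitmask.
function flags_to_options(flags::AbstractString)
    options = UInt32(0)
    for f in flags
        haskey(FLAG_BITS, f) || throw(ArgumentError("unknown regex flag: $f"))
        options |= FLAG_BITS[f]
    end
    return options
end

println(flags_to_options("u") == Base.PCRE.UCP)
```

With a table like this, supporting the new flag would amount to one extra entry.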

unicode

digital-carver

All 11 comments

Good catch. The current behavior doesn't sound very intuitive given that our strings are supposed to be Unicode:

julia> match(r"\w+", "café")
RegexMatch("caf")

So maybe we should make UCP the default, with a flag to disable it when performance is a concern and the caller only cares about ASCII.

nalimilan on 12 May 2018

So maybe we should make UCP the default

Though that is not backward compatible?

elextr on 13 May 2018

Yes, that's why it would have to happen before 1.0.

nalimilan on 13 May 2018

Does anybody know what other languages do?

JeffBezanson on 15 May 2018

(Just some tentative info from some searching around - keep grains of salt handy, and please point out any errors you find in this.)

Perl automatically matches as if PCRE2_UCP was set, and so implicitly does Unicode Character Property-based matching by default.

Python barely has any Unicode regex support in the default re module (Python 2 has practically no support). The regex module is commonly recommended for doing any Unicode matching in Python, and that uses UCP matching automatically, unless it's a bytestring or a string marked with the ASCII flag.

PHP, as far as I can tell, doesn't use PCRE2_UCP at all; things like \w and [:alpha:] always match only within the ASCII range. It's possible to use Unicode categories and script names with a /u suffix, which enables things like \pL and \p{Greek}. Based on those, people bake their own versions of the Unicode equivalents of \w, [:alnum:], etc. (I believe Julia's current behaviour matches PHP's, except that Julia doesn't need a suffix to enable \p patterns.)

.NET documentation says it automatically uses UCP based matching for strings with Unicode encoding, which seems to be the default encoding.

Ruby above version 1.9 uses the Onigmo library, which is halfway between explicit and implicit UCP support: for POSIX character classes like [:alpha:], it uses Unicode properties automatically, as if PCRE2_UCP was set. For the shorthand character classes like \w, the Unicode property mode has to be enabled with (?u):

$ ruby -e 'print(/[[:alpha:]]+/.match("$café."))'
café
$ ruby -e 'print(/\w+/.match("$café."))'
caf
$ ruby -e 'print(/(?u)\w+/.match("$café."))'
café

Java 7 and above work similarly to PHP, in that UCP support has to be explicitly enabled (with Pattern.UNICODE_CHARACTER_CLASS or (?U)).

digital-carver on 15 May 2018

❤3

Additional data:

Rust has a custom implementation which behaves like PCRE with UCP.
Swift apparently doesn't have built-in regexes (though they are planned). Currently it includes NSRegularExpression, which is based on ICU, which in turn behaves like PCRE with UCP.
Go uses the re2 engine, which treats things like \w and [:alpha:] as sets of ASCII characters (and provides a different syntax for matching Unicode categories).

Overall it really looks like we should set UCP by default.

nalimilan on 15 May 2018

It does seem like we should use UCP by default. Possible transition path:

0.7: add flags for turning UCP mode on and off
0.7: deprecate not using either one — use the off flag to keep old non-UCP behavior
1.0: remove warning for no-flag, flip its behavior from non-UCP to UCP

It's kind of annoying for everyone to need to add a flag to all their regexes and then delete it again, especially when they may well have wanted UCP behavior in the first place. We could potentially be smart about it and only give the warning for regular expressions whose meaning has changed.

StefanKarpinski on 16 May 2018

👍1

Agreed. We could check whether the regex contains common patterns like \w, [:alnum:], etc. IIUC the number of patterns affected by UCP is limited.
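
A naive sketch of such a check might look like the following. The helper name affected_by_ucp and the list of sequences it looks for are assumptions made for illustration. The list is not exhaustive, and the check does not try to tell an escaped backslash (as in \\w) apart from a real \w:

```julia
# Hypothetical detector for pattern text whose meaning changes under UCP:
# the shorthand classes \d \D \s \S \w \W, the word boundaries \b \B,
# and POSIX classes such as [:alnum:] or [:^alpha:].
const UCP_SENSITIVE = r"\\[dDsSwWbB]|\[:\^?[a-z]+:\]"

# True if the pattern source contains any of the sequences above.
affected_by_ucp(pattern::AbstractString) = occursin(UCP_SENSITIVE, pattern)

println(affected_by_ucp("\\w+"))            # contains \w
println(affected_by_ucp("[[:alnum:]]+"))    # contains a POSIX class
println(affected_by_ucp("[a-z]+\\.txt"))    # only explicit ranges and \.
```

A deprecation warning could then be limited to patterns for which this returns true.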

nalimilan on 16 May 2018

There is some dissenting opinion: https://discourse.julialang.org/t/regex-pcre2-and-the-pcre2-ucp-ucp-flag/10930.

StefanKarpinski on 17 May 2018

I think we should just change the default and add a flag to enable the old behavior.

JeffBezanson on 17 May 2018

👍3

I agree. If you really want only the ASCII ones, just use [a-zA-Z0-9_] or enable the ASCII flag.
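
As a small sketch of the explicit-class option (the constant name and sample string are made up for the example), an ASCII-only word class gives the same result whether or not UCP is in effect:

```julia
# ASCII-only stand-in for \w+; independent of the UCP setting.
const ASCII_WORD = r"[a-zA-Z0-9_]+"

first_word = match(ASCII_WORD, "caf\u00e9_2")   # "\u00e9" is a precomposed é
all_words  = [m.match for m in eachmatch(ASCII_WORD, "caf\u00e9_2")]

println(first_word.match)
println(all_words)
```

The match stops at the é and resumes after it, which is the old non-UCP behaviour of \w+ written out by hand.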