Too often in code reviews (just saw @lrhn's code review) we are instructing developers to avoid creating new RegExp('...') as a class-level or local field (I've even seen it in tight-loops!). Can the RegExp class support a const constructor that is canonicalized?
assert(identical(const RegExp('a-z'), const RegExp('a-z'));
(Even if the implementation for the VM is not-so, we'd like this in dart2js badly)
/cc @rakudrama
No, sadly it can't support that.
The problem is that the RegExp constructor parses the string and throws if it isn't valid.
That's not something a const constructor can do. If instead the constructor only stores the string, and using the RegExp parses it, we have the problem that a const object can't update its state, so we have to parse it every time we use the RegExp (or use an Expando to store the extra state, which is an option, but not a very nice one for something that's not a big problem).
RegExp originally had a const constructor, but it was cheating and not really being const, which was a problem for some implementations, so we fixed the glitch.
What _would_ solve this problem is C-like local static variables. Then you could do:
String theFoo(String x) {
static re = new RegExp("flimflamflupper");
if (re.hasMatch(x)) return "Yey!";
return "Nay!";
}
and the re variable initializer would only be executed once, the first time the function is executed.
I would support that :)
It's equivalent to a lazily initialized global/static variable (well, almost, the initializer expression would be evaluated in the function's scope the one time it is evaluated).
We could at least make the cost of repeated regexps cheaper. I did a small experiment and added a one-element cache, and my small benchmark improved from 3secs to 1.5secs. (Moving the regexp out of the loop yields 1sec).
realRun(list) {
var counter = 0;
for (String str in list) {
var re = new RegExp(".*33.*");
var match = re.firstMatch(str);
if (match != null) counter++;
}
return counter;
}
run(list) {
try {
return realRun(list);
} catch(e) {
}
}
main(args) {
var list = [];
for (int i = 0; i < 10000; i++) {
list.add(i.toString());
}
var sw = new Stopwatch()..start();
for (int i = 0; i < 1000; i++) { run(list); }
print(sw.elapsedMilliseconds);
}
$ git diff
diff --git a/sdk/lib/_internal/js_runtime/lib/regexp_helper.dart b/sdk/lib/_internal/js_runtime/lib/regexp_helper.dart
index 35a42d1c..379fcb3 100644
--- a/sdk/lib/_internal/js_runtime/lib/regexp_helper.dart
+++ b/sdk/lib/_internal/js_runtime/lib/regexp_helper.dart
@@ -79,12 +79,29 @@ class JSSyntaxRegExp implements RegExp {
bool get _isMultiLine => JS("bool", "#.multiline", _nativeRegExp);
bool get _isCaseSensitive => JS("bool", "!#.ignoreCase", _nativeRegExp);
+ static String lastSource = "";
+ static String lastModifiers = "";
+ static RegExp lastRegExp = JS('', r'new RegExp("", "")');
+
static makeNative(
String source, bool multiLine, bool caseSensitive, bool global) {
- checkString(source);
String m = multiLine == true ? 'm' : '';
String i = caseSensitive == true ? '' : 'i';
String g = global ? 'g' : '';
+ String modifiers = "$m$i$g";
+
+ if (lastSource == source && lastModifiers == modifiers) {
+ return lastRegExp;
+ }
+ var result = makeCheckedNative(source, modifiers);
+ lastSource = source;
+ lastModifiers = modifiers;
+ lastRegExp = result;
+ return result;
+ }
+
+ static makeCheckedNative(String source, String modifiers) {
+ checkString(source);
// We're using the JavaScript's try catch instead of the Dart one to avoid
// dragging in Dart runtime support just because of using RegExp.
var regexp = JS('',
@@ -95,8 +112,8 @@ class JSSyntaxRegExp implements RegExp {
} catch (e) {
return e;
}
- })(#, # + # + #)''',
- source, m, i, g);
+ })(#, #)''',
+ source, modifiers);
if (JS('bool', '# instanceof RegExp', regexp)) return regexp;
// The returned value is the JavaScript exception. Turn it into a
// Dart exception.
/cc @sigmundch
@lrhn @floitschG - apologies that I don't have the full background, but have we considered letting the compiler fail with an error in cases where a const RegExp parsing would throw? It means duplicating the parsing/error checking of regexp in the compiler, but it's probably doable.
@rakudrama - could we treat new RegExp(constant) as a value we can GVN? In other words, in the example from Florian above, could we reach a point where we can do code motion and get the regexp out of the loop?
I think we should have RegExp literals. They are compact and can be compiled ahead of time. A compiler implementation can pattern match constructors and optimize them (with unpredictable reliability), but a literal gives a language level guarantee.
@lrhn static should be an expression modifier: static _e_.
It reduces cognitive overload by putting the value next to its use.
This relieves you of the hardest part: choosing a name.
It is syntax for introducing a new name and lazy initializer in the nearest static scope (class or top level).
if (str.contains(static new RegExp('*.33.*'))) ...
var re = static new RegExp('hipp*o'); // I insist on a name!
@sigmundch We could try GVN but my hopes are not high - you can only hoist a throwing operation out of a loop if it is the first observable effect.
@floitschG Interesting experiment. There is a similar implementation neutral experiment where you import "const_regexp.dart" and it shadows core.RegExp and uses an expando to do the lazy init of the delegate. There will be some difficulty in the runtime (e.g. String.replaceAll(Pattern p, String replacement) special-cases p is RegExp.)
A language feature that would help with this is a separate namepace for const and new constructors, so new const_regexp.RegExp is a factory redirecting to core.RegExp.
It's not a problem that String.replaceAll special-cases RegExp, you can just implement RegExp from dart:core. Example: https://dartpad.dartlang.org/3135557b3d3f6d000e6f9163709f6817
I'm personally opposed to baking RegExp grammar into the language grammar. It locks you into one specific RegExp dialect without really giving you anything except static checking. Adding static checking of RegExp constructor calls with String literals should be fairly easy for any source processing tool, the language doesn't need to do it.
Dart has raw strings, which makes RegExp writing much easier because you don't have to escape backslashes, so building RegExp literals into the language isn't as big an advantage as it is in, e.g., JavaScript.
The actualy problem that needs to be solved here is a way to only create one RegExp, even if the constructor is called more than once. Static expressions or static declarations could do that.
The workaround is to declare the regexp somewhere else, or use a caching factory function instead of new RegExp.
RegExp doesn't do anything with side-effects, it's just doing some precomputation... exactly the kind of thing that would be perfect for a constexpr... if C++ can do it, why not Dart? :-)
Because Dart's const is not C++'s const. The accidental syntax similarity should not be taken to mean anything at all.
The C++ const is about not having side-effects. It allows some optimizations of the runtime execution (typically moving the computation before or after other computations).
Dart's const is about being compile-time computable. We cannot write Dart code that validates the RegExp grammar without using some constructs that are not allowed in const code (e.g., loops).
Obviously we could special-case the RegExp constructor in the compiler and check anyway, but we have chosen not to do that (any more). It's a feature that the analyzer might want to consider, but it's not something we're going to put into the language.
@lrhn it's not about checking, it's about achieving RegExp caching without manually caching it. Consider code like:
static final reFoo = new RegExp("...");
foo1(s) {
reFoo.match(s);
}
foo2(s) {
const RegExp("...").match(s);
}
Which of these two is easier to read? I would argue that foo2 is easier because regular expression is brought right next to the place which uses it.
Dart const expressions are severely limited to the point that they often can't be used for anything useful in compile time - contradicting their raison d'锚tre to make things compile-time computable. We should learn from languages like D which just allow anything to run in compile time.
I don't disagree with @mraleph. The Dart const design is rather limited. It wasn't designed to make "things compile-time computable" in general, but to carve out a small subset of things that were so simple that making them compile-time computable was trivial. Within that design, we can't do the RegExp constructor because it actually checks the syntax. It _is_ about checking - because const is about more than just ensuring that something is evaluated only once.
The const in Dart has three separate effects - compile-time/single computation, global canonicalization and immutability. In different cases it's being used for just one or two of these. This case is about the "compile-time/single computation". We don't really need (program wide) canonicalization, just that we don't compute the same value more than once at this particular point. If we had a feature that did just that, like the suggested static expr expression, there wouldn't be any need to make the RegExp constructor const, just (static new RegExp(r"...")).match(s).
(As a curiosity, JavaScript originally had a clause that allowed implementations to evaluate a RegExp literal only once and reuse the value. That was dropped in later versions because JavaScript RegExps are not actually immutable).
Dart only has const to address all these things, so that's why every problem is being expressed in terms of how const can solve it if we just extend it a bit more. I'm not generally inclined to extend const (because it's a never ending story unless we _do_ go for "run anything at compile time" - and boy, and I'm going to do compile-time computed Uint8Lists then!).
I would prefer to look for alternatives that directly solve the actual problems, not ways to make more things fit into the const solution.
Another solution is to guarantee that the implementation optimizes simple uses of a new RegExp(string-literal) well enough, so that it doesn't matter that you're technically supposed to create a new object.
We cannot write Dart code that validates the RegExp grammar without using some constructs that are not allowed in const code (e.g., loops).
We can first allow loops in const code, which is something we can do if we only allow loops that have no side-effects, and then we can write the Dart code to validate the RegExp grammar.
Dart's const is about being compile-time computable.
Loops are compile-time computable.
The Dart const design is rather limited.
The whole point of issues like this one and https://github.com/dart-lang/sdk/issues/29276 (which both probably depend on https://github.com/dart-lang/sdk/issues/29277) is to change that design. :-)
I would prefer to look for alternatives that directly solve the actual problems
That would be fine too. :-)
Loops were an example. I'm pretty sure a RegExp parser will require recursive function calls too. At that point, there isn't really anything left. Also, if you think of non-termination as a "side"-effect, loops and recursion both have that, even if their bodies seem safe. So, we're back to either keeping the const subset small or not having a subset - allowing everything as a compile-time evaluated code, with the obvious problem of not being able to determine whether compilation terminates.
Well there are definitely things that shouldn't be allowed, e.g. anything that does system calls.
I agree that allowing features that result in allowing non-termination is unsatisfactory. On the other hand, so is the fact that there's tons of classes that we can't make const just because they happen to take a RegExp, or because in debug mode they try to do an assert that involves a loop, or whatnot.
Most helpful comment
I think we should have RegExp literals. They are compact and can be compiled ahead of time. A compiler implementation can pattern match constructors and optimize them (with unpredictable reliability), but a literal gives a language level guarantee.
@lrhn static should be an expression modifier: static _e_.
It reduces cognitive overload by putting the value next to its use.
This relieves you of the hardest part: choosing a name.
It is syntax for introducing a new name and lazy initializer in the nearest static scope (class or top level).
@sigmundch We could try GVN but my hopes are not high - you can only hoist a throwing operation out of a loop if it is the first observable effect.
@floitschG Interesting experiment. There is a similar implementation neutral experiment where you import "const_regexp.dart" and it shadows core.RegExp and uses an expando to do the lazy init of the delegate. There will be some difficulty in the runtime (e.g. String.replaceAll(Pattern p, String replacement) special-cases p is RegExp.)
A language feature that would help with this is a separate namepace for const and new constructors, so new const_regexp.RegExp is a factory redirecting to core.RegExp.