Packages: [Regular Expression] Unified scope names for (embedded) Regular Expressions

Created on 16 Apr 2019 · 2 Comments · Source: sublimehq/Packages

Various languages have built-in support for Regular Expressions, and some default syntax packages provide rules that apply suitable scopes within RegExp strings. However, these scope names differ in various places. For example, the standalone RegExp syntax and Clojure use keyword.operator.alternation.regexp for the | symbol, while JavaScript, Python and PHP use keyword.operator.or.regexp. Another example is character classes such as \d or \w, which get the scope keyword.control.character-class.regexp in the standalone RegExp syntax, constant.other.character-class.escape.backslash.regexp in JavaScript, and constant.character.character-class.regexp in Python and PHP. Other languages such as Tcl and Ruby recognize RegExp strings but do not apply any specific scopes other than string.regexp, which prevents syntax highlighting of Regular Expressions in these languages.
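To make the differences easier to see at a glance, here is a minimal Python sketch that groups the syntaxes by the scope they assign to a token. The scope names are copied from the paragraph above and have not been re-checked against the packages; the names REPORTED_SCOPES and group_by_scope are made up for this illustration.

```python
# Scope names as reported in the paragraph above (not re-verified against
# the actual syntax definitions). "RegExp" is the standalone syntax.
REPORTED_SCOPES = {
    "|": {
        "RegExp": "keyword.operator.alternation.regexp",
        "Clojure": "keyword.operator.alternation.regexp",
        "JavaScript": "keyword.operator.or.regexp",
        "Python": "keyword.operator.or.regexp",
        "PHP": "keyword.operator.or.regexp",
    },
    r"\d": {
        "RegExp": "keyword.control.character-class.regexp",
        "JavaScript": "constant.other.character-class.escape.backslash.regexp",
        "Python": "constant.character.character-class.regexp",
        "PHP": "constant.character.character-class.regexp",
    },
}


def group_by_scope(token):
    """Return {scope: sorted list of syntaxes that use it} for one token."""
    groups = {}
    for syntax, scope in REPORTED_SCOPES[token].items():
        groups.setdefault(scope, []).append(syntax)
    return {scope: sorted(names) for scope, names in groups.items()}


if __name__ == "__main__":
    for token in REPORTED_SCOPES:
        print(token)
        for scope, names in group_by_scope(token).items():
            print(f"  {scope}: {', '.join(names)}")
```

A token that ends up with more than one group is one where a color scheme needs more than one rule.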

I want to refine my color scheme for consistent RegExp highlighting, but the currently used scope names make it difficult to find common highlighting rules for all languages. My knowledge of syntax definitions is somewhat limited, but as far as I know there is the possibility to embed a syntax within another language syntax (e.g. CSS in HTML). Maybe this could be applied to parse RegExp strings and allow consistent syntax highlighting of Regular Expressions in more languages.

Regular Expression syntax:

    (?<=(T|t)he\s)(cat)$
(?#  ^^^ constant.other.assertion )
(?#       ^ keyword.operator.alternation )
(?#            ^^ keyword.control.character-class )
(?#                    ^ keyword.control.anchors )

JavaScript syntax:

var regex = /(?<=(T|t)he\s)(cat)$/;
//            ^^^ punctuation.definition.group.assertion
//                 ^ keyword.operator.or
//                      ^^ constant.other.character-class.escape.backslash
//                              ^ keyword.control.anchor

Python syntax:

regex = r'(?<=(T|t)he\s)(cat)$'
#          ^^^ constant.other.assertion
#               ^ keyword.operator.or
#                    ^^ constant.character.character-class
#                            ^ keyword.control.anchor

Clojure syntax:

#"(?<=(T|t)he\s)(cat)$"
;  ^^^ constant.other.assertion
;       ^ keyword.operator.alternation
;            ^^ keyword.control.character-class
;                    ^ keyword.control.anchors
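To illustrate why a single color scheme rule is hard to write, the following Python sketch computes the longest dot-separated prefix shared by all four syntaxes for each token in the examples above. It assumes that a scope selector matches a scope when it is a dot-separated prefix of it, which is a simplification of how selectors work. The names EXAMPLE_SCOPES and common_selector are hypothetical, and the scopes are taken from the examples as written (without the .regexp suffix).

```python
# Scopes copied from the four examples above, in the order
# RegExp, JavaScript, Python, Clojure.
EXAMPLE_SCOPES = {
    "?<=": [
        "constant.other.assertion",
        "punctuation.definition.group.assertion",
        "constant.other.assertion",
        "constant.other.assertion",
    ],
    "|": [
        "keyword.operator.alternation",
        "keyword.operator.or",
        "keyword.operator.or",
        "keyword.operator.alternation",
    ],
    r"\s": [
        "keyword.control.character-class",
        "constant.other.character-class.escape.backslash",
        "constant.character.character-class",
        "keyword.control.character-class",
    ],
    "$": [
        "keyword.control.anchors",
        "keyword.control.anchor",
        "keyword.control.anchor",
        "keyword.control.anchors",
    ],
}


def common_selector(scopes):
    """Longest dot-separated prefix shared by every scope, or '' if none."""
    parts = [scope.split(".") for scope in scopes]
    shared = []
    for segment in zip(*parts):
        if len(set(segment)) != 1:
            break  # first position where the syntaxes disagree
        shared.append(segment[0])
    return ".".join(shared)


if __name__ == "__main__":
    for token, scopes in EXAMPLE_SCOPES.items():
        print(f"{token!r}: {common_selector(scopes)!r}")
```

Under that assumption, | can only be targeted in all four syntaxes by the very broad keyword.operator, $ only by keyword.control, and the lookbehind and \s share no prefix at all, so each needs several rules.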

RFC

jwortmann

👍4

Most helpful comment

I definitely think number 2 is the biggest contributor; at least, that is why I haven't switched any of the syntax definitions I have worked on to use the "generic" one (where number 1 applies). (It's not generic: it's designed with ST's Find functionality in mind. For example, whether \< is an unnecessarily escaped character or a meta character depends on the engine used.)
I hadn't compared performance, but it doesn't surprise me, as the embedded regex definitions are generally much simpler and less accurate than the main standalone one (not referring to it as "generic" any more ;)).
That said, there is clearly room for improvement/unification of scopes. Maybe the embedded ones could include contexts from the standalone one, if we design it in such a way that those contexts are generic enough to apply to multiple regex parser/engine implementations, so that we don't duplicate work, scopes, etc.

keith-hall on 19 Apr 2019

❤1 👍1

All 2 comments

Maybe this could be applied to parse RegExp strings and allow consistent syntax highlighting of Regular Expressions in more languages.

A very good point. I also wonder why a dedicated regexp syntax exists while different syntaxes use their own implementations. I can imagine three possible reasons:

1. historical development
2. different feature levels and implementations of the underlying regexp engines of the various languages, which make merging everything together impossible without things being highlighted incorrectly in individual syntaxes
3. the dedicated regexp syntax seems quite heavy compared to some others and causes significant slowdowns in parsing when embedded into other languages. After embedding that syntax into a new Tcl implementation to overcome the string.regexp.tcl limitations, parsing of some official Tcl library sources slowed down by 20 to 30%.

That said, I agree that the regexp syntaxes are a bit inconsistent in their scope naming. I'd guess the scopes were chosen based on existing color schemes rather than on logical structure. One reason might be that there is no clear set of rules for how to name the different parts of a regexp.

I'd never call ?<= a constant, for instance. As the definition of a lookbehind, it would need to be scoped as keyword.operator or punctuation.definition.lookbehind. The same goes for all the parentheses: they are not operators but punctuation, and so on.

\d and \w and friends are constant.character.escape.