Linguist: does not work syntax highlighting when used with \b and the Russian alphabet

Created on 23 Oct 2016  ·  48Comments  ·  Source: github/linguist

Good afternoon. Create syntax highlighting for 1C:Enterprise, which supports the Russian key words.
When we use \b(Если|If)\b in github https://github-lightshow.herokuapp.com keywords are not highlighted. Do like this (?<=[^\w-а-яё\.]|^)(Если|If)(?=[^\w-а-яё\.]|$) works. Files in UTF-8.

Stale

Most helpful comment

Ah right, that makes sense then. I wasn't aware the flag imposed performance penalties (at least as far as the PCRE library is concerned).

I'd say your proposed solution is a good compromise. =) We could also only enable it for languages which use Unicode-sensitive grammars for highlighting, too (like 1C Enterprise's one does). Personally, I feel that's a more conservative approach; we could add a new option to languages.yml like unicode_pcre: true or something.

All 48 comments

Well, the effect of the \b sequence is partly dependent on the engine's Unicode settings... the Lightshow app is using a PCRE-driven implementation, as opposed to the Oniguruma-based one used by others.

Can you post an example of the code you're trying to highlight?

Hello. Most of the code in 1C Enterprise is written in russian - https://github.com/silverbulleters/vanessa-agiler/blob/master/src/cf/CommonModules/%D0%9E%D0%B1%D1%89%D0%B5%D0%B3%D0%BE%D0%9D%D0%B0%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D1%8F%D0%9A%D0%BB%D0%B8%D0%B5%D0%BD%D1%82%D0%A1%D0%B5%D1%80%D0%B2%D0%B5%D1%80/Ext/Module.bsl

Grammar was added in #2773

When we started that PR in december, we saw, that lightshow doen't work properly with \b and cyrillic, so we changed the regexps from \b to positive look behind/forward.

\b with cyrillic works in Atom, VSC and Sublime text.

Is it a bug with lightshow or Github itself can't handle it?

Ah! That proves it then. Yes, I'd say it's because of the two different engines being used. Oniguruma is Unicode-aware by default, which means it matches "word boundaries" on a more world-wise definition (instead of [A-Za-z_0-9]). PCRE, on the other hand, probably doesn't do this, although I'm not familiar with its inner workings.

I do know, however, that \b sequences are a pain in the neck... =) I like to think \b really stands for \BAD.

Assertions are definitely the best way to go, in all cases.

I have to point out, too, that Lightshow can sometimes give very misleading results. See #3130.

The app is closed source, sadly, so I have no way of knowing what's really going on in there...

The question is what engine github uses. :)

Oh! Sorry, I wasn't clear.

GitHub uses PCRE, same as Lightshow. =)

Okay... Is it easy to change one engine to another? Can we just ask the Github guys to make this change?

@arfon / @bkeepers Do you guys know what it'll take to get PCRE's PCRE_UCP flag used on the site?

More info on that here.

@alhadis got 404 on your link
UPDATE: need to change htm to html at the end

@arfon / @bkeepers Do you guys know what it'll take to get PCRE's PCRE_UCP flag used on the site?

I do not sorry. @aroben might though 😁

regular expression with \b
(?i:\b(Неопределено|Undefined|Истина|True|Ложь|False|NULL)\b)
regular expression with positive look behind/forward
(?i:(?<=[^\wа-яё\.]|^)(Неопределено|Undefined|Истина|True|Ложь|False|NULL)(?=[^\wа-яё\.]|$))

Wouldn't it be easier to narrow it down to what characters _are_ valid, instead of which ones aren't...?

For example:

(?xi)

# Following:
(?<= ^       # Start of line
   | \s      # Whitespace
   | \(      # Open bracket
   | [-+~/*] # Possible operators
   | ...     # Etc
   )

(Неопределено|Undefined|Истина|True|Ложь|False|NULL)

# Followed by:
(?= $        # End of line
  | \s       # Whitespace
  | ...      # Etc...

I'm not familiar with the language's syntax, so I can't be more specific than that, but I think you get the idea. =)

@Alhadis (?<=[^\wа-яё\.]|^) literally is a \b-before-word realisation. \b is ^\w, so we've just complete it with а-яё and adiotional . to avoid highlighting of properties with the same name as keyword.

Thing is, different engines have different notions of what constitutes a "word character". When writing portable expressions, I avoid the assumption that consuming engines will be Unicode-aware. Mostly because of scenarios like this.

In any case, there's no reason for the site's highlighting engine to _not_ be Unicode-aware, given the multicultural nature of GitHub.

@arfon / @bkeepers Do you guys know what it'll take to get PCRE's PCRE_UCP flag used on the site?

@vmg may be able to answer this

If @vmg pings @arfon for an answer, I'm outta here.

If @vmg pings @arfon for an answer, I'm outta here.

Hahaha. I can answer this: we don't currently use the PCRE_UCP flag because it's a really significant performance degradation. I acknowledge this is not an ideal answer -- I'm looking into the option to only enabling this flag on documents that we know contain extended Unicode characters, but we can't enable it by default because it really slows down syntax highlighting. 😢

I don't have an ETA but I promise we plan to look into improving highlighting for non-English, non-ASCII documents.

Ah right, that makes sense then. I wasn't aware the flag imposed performance penalties (at least as far as the PCRE library is concerned).

I'd say your proposed solution is a good compromise. =) We could also only enable it for languages which use Unicode-sensitive grammars for highlighting, too (like 1C Enterprise's one does). Personally, I feel that's a more conservative approach; we could add a new option to languages.yml like unicode_pcre: true or something.

ersonally, I feel that's a more conservative approach; we could add a new option to languages.yml like unicode_pre: true or something.

this is a good variant. @vmg @arfon is it possible to make that thing?

Personally, I feel that's a more conservative approach; we could add a new option to languages.yml like unicode_pre: true or something.

It would be great!

this is a good variant. @vmg @arfon is it possible to make that thing?

That may be the right approach but I'd rather wait until we have a solid plan for supporting something like a language flag before going ahead and adding this to the languages.yml file.

Don't screw it up. The Russians are watching.

@arfon what are our next steps?
What kind of plan do you need?

In my world the plan is:

  • add a new entry in languages.yml
  • add to the languages.yml reader at linguist-side a read step. maybe we need to create some collection or array of flags (for future uses)
  • add flag-reader at the github-highlighter side.

anything else?

@arfon what are our next steps?

Right now we wait. @vmg above pointed out that we need to do some work on the GitHub side (i.e. not here) to add support for this.

  • add to the languages.yml reader at linguist-side a read step. maybe we need to create some collection or array of flags (for future uses)
  • add flag-reader at the github-highlighter side.

Possibly. Please don't start opening pull requests implementing these changes yet as we won't be able to do anything with them :smile:

Please don't start opening pull requests implementing these changes yet as we won't be able to do anything with them

How about these in the mean time? :D

Just so we're guarded against the ol' syntax-highlighting related questions...

@vmg Can you provide a link to the issue on the Github side?

@Alhadis Hello.
Any news? There is a chance to solve this problem.

What do you mean?

@arfon write https://github.com/github/linguist/issues/3291#issuecomment-255996041
There is a solution? More than half a year has passed.

Not quite... arfon said work needs to be done on the GitHub side of things before we can do anything here. AFAIK we're no closer than before.

If you really need a solution :soon:, you're only option is to implement the workaround you initially pointed out.

/cc @github/data-engineering

Thank you. I usually expect a response from engineers at github.com

(apologies in advance, i don't intent to hijack this conversation)

In C# too, we are struggling to have C# 7.2 syntax support due to the fact that upstream has moved to Oniguruma grammer. @damieng has chalked out some ideas at https://github.com/atom/language-csharp/issues/112#issuecomment-379094384 by which we can try to make progress.

Since Oniguruma is mentioned in this thread, how feasible is it for linguist to support multiple engines? Is it a too huge effort, or totally out of the scope of this project?

Since Oniguruma is mentioned in this thread, how feasible is it for linguist to support multiple engines? Is it a too huge effort, or totally out of the scope of this project

“Totally out of the scope of this project” 😀 This isn’t a limitation of Linguist. It’s a limitation (possibly even a design decision - I don’t know as it pre-dates my time at GitHub) of the parser used on GitHub.com that parses the grammars.

“Totally out of the scope of this project” 😀 This isn’t a limitation of Linguist. It’s a limitation (possibly even a design decision - I don’t know as it pre-dates my time at GitHub) of the parser used on GitHub.com that parses the grammars.

Just to clarify, is there any chance that someone on GitHub will come and replace the good old parser with a newer one, providing support for Oniguruma and other complicated regex syntaxes?

C# language highlighting on GitHub is absolutely disgusting and it would be nice to figure out how this can be fixed in the nearest future. Other engines like BitBucket or GitLab provide a much better highlighting, sad.

Thanks in advance!

providing support for Oniguruma and other complicated regex syntaxes?

@worldbeater, PCRE supports every feature that Oniguruma does, it just uses different syntax. The perceived difference in regex support you see on GitHub is a consequence of grammar authors using Oniguruma-specific syntax instead of PCRE. This happens because Oniguruma is used by every editor which supports TextMate grammars – GitHub is a lone exception due to its use of PCRE.

If the engines were reversed, you'd be asking us to provide support for PCRE's "complicated syntax" too, because the Oniguruma engine isn't good enough.

PCRE supports every feature that Oniguruma does, it just uses different syntax

Well, now I understand, thanks! Could you please provide more info on which PCRE version GitHub is using now? Seems work on porting Oniguruma-based syntax to PCRE-based is stuck due to some incompatibilities, I don't know: https://github.com/atom/language-csharp/issues/112#issuecomment-380134954

That is all because Microsoft uses Oniguruma expressions in grammar for Visual Studio Code and Atom editors right now, but it can't be simply copied and pasted to play well with GitHub.

Thanks in advance, @Alhadis!

@worldbeater I doubt the PCRE version has anything to do with this. It's running in ASCII-only mode, that's all I know (I'm not staff). But of check the manpage for pcresyntax(3), you'll find documentation for variances between regular expression engines (starting at the section "Backreferences").

These are the likely discrepancies that that're affecting the C# grammar (indeed, most TextMate grammars which use Oniguruma extensions):

Syntax Description Engines
(?R) recurse whole pattern All
(?n) call subpattern by absolute number All
(?+n) call subpattern by relative number All
(?-n) call subpattern by relative number All
(?&name) call subpattern by name Perl, PCRE
(?P>name) call subpattern by name Python
\g<name> call subpattern by name Oniguruma
\g'name' call subpattern by name Oniguruma
\g<n> call subpattern by absolute number Oniguruma
\g'n' call subpattern by absolute number Oniguruma
\g<+n> call subpattern by relative number PCRE
\g'+n' call subpattern by relative number PCRE
\g<-n> call subpattern by relative number PCRE
\g'-n' call subpattern by relative number PCRE

Thank you for that accurate list, @Alhadis. The most common occurrence of incompatible syntax is the \g<name> that some VSCode grammars use. Obviously it's trivially fixable by simply referring to the subpattern by absolute number instead of name -- but the replacement can be tricky.

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

i think this issue should be opened. the problem still exists.

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

Any news form GitHub.com?
If this issue has no place in gh backlog it could be closed (sadly)

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

One more ping, guys. (sorry)

Need help this issue

Nothing has changed on the GitHub side of things... https://github.com/github/linguist/issues/3291#issuecomment-255750842 still applies:

... we don't currently use the PCRE_UCP flag because it's a really significant performance degradation. I acknowledge this is not an ideal answer -- I'm looking into the option to only enabling this flag on documents that we know contain extended Unicode characters, but we can't enable it by default because it really slows down syntax highlighting. 😢

I don't have an ETA but I promise we plan to look into improving highlighting for non-English, non-ASCII documents.

This isn't likely to change in the immediate future. If you really want this issue resolved sooner rather than later, the best option is to change the grammar as discussed in various points earlier in this issue.

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

arfon picture arfon  ·  6Comments

TimothyGu picture TimothyGu  ·  5Comments

RafaelPAndrade picture RafaelPAndrade  ·  4Comments

FranklinYu picture FranklinYu  ·  4Comments

oliviertassinari picture oliviertassinari  ·  5Comments