linguist 🚀 - does not work syntax highlighting when used with \b and the Russian alphabet

Well, the effect of the \b sequence is partly dependent on the engine's Unicode settings... the Lightshow app is using a PCRE-driven implementation, as opposed to the Oniguruma-based one used by others.

Can you post an example of the code you're trying to highlight?

Alhadis on 23 Oct 2016

Hello. Most of the code in 1C Enterprise is written in russian - https://github.com/silverbulleters/vanessa-agiler/blob/master/src/cf/CommonModules/%D0%9E%D0%B1%D1%89%D0%B5%D0%B3%D0%BE%D0%9D%D0%B0%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D1%8F%D0%9A%D0%BB%D0%B8%D0%B5%D0%BD%D1%82%D0%A1%D0%B5%D1%80%D0%B2%D0%B5%D1%80/Ext/Module.bsl

Grammar was added in #2773

When we started that PR in december, we saw, that lightshow doen't work properly with \b and cyrillic, so we changed the regexps from \b to positive look behind/forward.

\b with cyrillic works in Atom, VSC and Sublime text.

Is it a bug with lightshow or Github itself can't handle it?

nixel2007 on 23 Oct 2016

Ah! That proves it then. Yes, I'd say it's because of the two different engines being used. Oniguruma is Unicode-aware by default, which means it matches "word boundaries" on a more world-wise definition (instead of [A-Za-z_0-9]). PCRE, on the other hand, probably doesn't do this, although I'm not familiar with its inner workings.

I do know, however, that \b sequences are a pain in the neck... =) I like to think \b really stands for \BAD.

Assertions are definitely the best way to go, in all cases.

Alhadis on 23 Oct 2016

I have to point out, too, that Lightshow can sometimes give very misleading results. See #3130.

The app is closed source, sadly, so I have no way of knowing what's really going on in there...

Alhadis on 23 Oct 2016

The question is what engine github uses. :)

nixel2007 on 23 Oct 2016

Oh! Sorry, I wasn't clear.

GitHub uses PCRE, same as Lightshow. =)

Alhadis on 23 Oct 2016

Okay... Is it easy to change one engine to another? Can we just ask the Github guys to make this change?

nixel2007 on 23 Oct 2016

👍1

@arfon / @bkeepers Do you guys know what it'll take to get PCRE's PCRE_UCP flag used on the site?

More info on that here.

Alhadis on 23 Oct 2016

@alhadis got 404 on your link
UPDATE: need to change htm to html at the end

nixel2007 on 23 Oct 2016

@arfon / @bkeepers Do you guys know what it'll take to get PCRE's PCRE_UCP flag used on the site?

I do not sorry. @aroben might though 😁

arfon on 23 Oct 2016

ava57r on 24 Oct 2016

Wouldn't it be easier to narrow it down to what characters _are_ valid, instead of which ones aren't...?

For example:

(?xi)

# Following:
(?<= ^       # Start of line
   | \s      # Whitespace
   | \(      # Open bracket
   | [-+~/*] # Possible operators
   | ...     # Etc
   )

(Неопределено|Undefined|Истина|True|Ложь|False|NULL)

# Followed by:
(?= $        # End of line
  | \s       # Whitespace
  | ...      # Etc...

I'm not familiar with the language's syntax, so I can't be more specific than that, but I think you get the idea. =)

Alhadis on 24 Oct 2016

@Alhadis (?<=[^\wа-яё\.]|^) literally is a \b-before-word realisation. \b is ^\w, so we've just complete it with а-яё and adiotional . to avoid highlighting of properties with the same name as keyword.

nixel2007 on 24 Oct 2016

Thing is, different engines have different notions of what constitutes a "word character". When writing portable expressions, I avoid the assumption that consuming engines will be Unicode-aware. Mostly because of scenarios like this.

In any case, there's no reason for the site's highlighting engine to _not_ be Unicode-aware, given the multicultural nature of GitHub.

Alhadis on 24 Oct 2016

@arfon / @bkeepers Do you guys know what it'll take to get PCRE's PCRE_UCP flag used on the site?

@vmg may be able to answer this

aroben on 24 Oct 2016

👍1

If @vmg pings @arfon for an answer, I'm outta here.

Alhadis on 24 Oct 2016

If @vmg pings @arfon for an answer, I'm outta here.

Hahaha. I can answer this: we don't currently use the PCRE_UCP flag because it's a really significant performance degradation. I acknowledge this is not an ideal answer -- I'm looking into the option to only enabling this flag on documents that we know contain extended Unicode characters, but we can't enable it by default because it really slows down syntax highlighting. 😢

I don't have an ETA but I promise we plan to look into improving highlighting for non-English, non-ASCII documents.

vmg on 24 Oct 2016

👍4

Ah right, that makes sense then. I wasn't aware the flag imposed performance penalties (at least as far as the PCRE library is concerned).

I'd say your proposed solution is a good compromise. =) We could also only enable it for languages which use Unicode-sensitive grammars for highlighting, too (like 1C Enterprise's one does). Personally, I feel that's a more conservative approach; we could add a new option to languages.yml like unicode_pcre: true or something.

Alhadis on 24 Oct 2016

👍5

ersonally, I feel that's a more conservative approach; we could add a new option to languages.yml like unicode_pre: true or something.

this is a good variant. @vmg @arfon is it possible to make that thing?

nixel2007 on 24 Oct 2016

👍1

Personally, I feel that's a more conservative approach; we could add a new option to languages.yml like unicode_pre: true or something.

It would be great!

ava57r on 24 Oct 2016

this is a good variant. @vmg @arfon is it possible to make that thing?

That may be the right approach but I'd rather wait until we have a solid plan for supporting something like a language flag before going ahead and adding this to the languages.yml file.

arfon on 24 Oct 2016

👍1

Don't screw it up. The Russians are watching.

Alhadis on 24 Oct 2016

😄3

@arfon what are our next steps?
What kind of plan do you need?

In my world the plan is:

add a new entry in languages.yml
add to the languages.yml reader at linguist-side a read step. maybe we need to create some collection or array of flags (for future uses)
add flag-reader at the github-highlighter side.

anything else?

nixel2007 on 25 Oct 2016

@arfon what are our next steps?

Right now we wait. @vmg above pointed out that we need to do some work on the GitHub side (i.e. not here) to add support for this.

add to the languages.yml reader at linguist-side a read step. maybe we need to create some collection or array of flags (for future uses)

add flag-reader at the github-highlighter side.

Possibly. Please don't start opening pull requests implementing these changes yet as we won't be able to do anything with them :smile:

arfon on 25 Oct 2016

Please don't start opening pull requests implementing these changes yet as we won't be able to do anything with them

How about these in the mean time? :D

Just so we're guarded against the ol' syntax-highlighting related questions...

Alhadis on 25 Oct 2016

@vmg Can you provide a link to the issue on the Github side?

ava57r on 3 Dec 2016

@Alhadis Hello.
Any news? There is a chance to solve this problem.

ava57r on 26 May 2017

What do you mean?

Alhadis on 26 May 2017

@arfon write https://github.com/github/linguist/issues/3291#issuecomment-255996041
There is a solution? More than half a year has passed.

ava57r on 26 May 2017

Not quite... arfon said work needs to be done on the GitHub side of things before we can do anything here. AFAIK we're no closer than before.

If you really need a solution :soon:, you're only option is to implement the workaround you initially pointed out.

/cc @github/data-engineering

lildude on 26 May 2017

Thank you. I usually expect a response from engineers at github.com

ava57r on 26 May 2017

(apologies in advance, i don't intent to hijack this conversation)

In C# too, we are struggling to have C# 7.2 syntax support due to the fact that upstream has moved to Oniguruma grammer. @damieng has chalked out some ideas at https://github.com/atom/language-csharp/issues/112#issuecomment-379094384 by which we can try to make progress.

Since Oniguruma is mentioned in this thread, how feasible is it for linguist to support multiple engines? Is it a too huge effort, or totally out of the scope of this project?

ghost on 6 Apr 2018

👍3

Since Oniguruma is mentioned in this thread, how feasible is it for linguist to support multiple engines? Is it a too huge effort, or totally out of the scope of this project

“Totally out of the scope of this project” 😀 This isn’t a limitation of Linguist. It’s a limitation (possibly even a design decision - I don’t know as it pre-dates my time at GitHub) of the parser used on GitHub.com that parses the grammars.

lildude on 7 Apr 2018

“Totally out of the scope of this project” 😀 This isn’t a limitation of Linguist. It’s a limitation (possibly even a design decision - I don’t know as it pre-dates my time at GitHub) of the parser used on GitHub.com that parses the grammars.

Just to clarify, is there any chance that someone on GitHub will come and replace the good old parser with a newer one, providing support for Oniguruma and other complicated regex syntaxes?

C# language highlighting on GitHub is absolutely disgusting and it would be nice to figure out how this can be fixed in the nearest future. Other engines like BitBucket or GitLab provide a much better highlighting, sad.

Thanks in advance!

worldbeater on 9 Apr 2018

👍1

providing support for Oniguruma and other complicated regex syntaxes?

@worldbeater, PCRE supports every feature that Oniguruma does, it just uses different syntax. The perceived difference in regex support you see on GitHub is a consequence of grammar authors using Oniguruma-specific syntax instead of PCRE. This happens because Oniguruma is used by every editor which supports TextMate grammars – GitHub is a lone exception due to its use of PCRE.

If the engines were reversed, you'd be asking us to provide support for PCRE's "complicated syntax" too, because the Oniguruma engine isn't good enough.

Alhadis on 9 Apr 2018

👍2

PCRE supports every feature that Oniguruma does, it just uses different syntax

Well, now I understand, thanks! Could you please provide more info on which PCRE version GitHub is using now? Seems work on porting Oniguruma-based syntax to PCRE-based is stuck due to some incompatibilities, I don't know: https://github.com/atom/language-csharp/issues/112#issuecomment-380134954

That is all because Microsoft uses Oniguruma expressions in grammar for Visual Studio Code and Atom editors right now, but it can't be simply copied and pasted to play well with GitHub.

Thanks in advance, @Alhadis!

worldbeater on 13 Apr 2018

@worldbeater I doubt the PCRE version has anything to do with this. It's running in ASCII-only mode, that's all I know (I'm not staff). But of check the manpage for pcresyntax(3), you'll find documentation for variances between regular expression engines (starting at the section "Backreferences").

These are the likely discrepancies that that're affecting the C# grammar (indeed, most TextMate grammars which use Oniguruma extensions):

Syntax	Description	Engines
`(?R)`	recurse whole pattern	All
`(?n)`	call subpattern by absolute number	All
`(?+n)`	call subpattern by relative number	All
`(?-n)`	call subpattern by relative number	All
`(?&name)`	call subpattern by name	Perl, PCRE
`(?P>name)`	call subpattern by name	Python
`\g<name>`	call subpattern by name	Oniguruma
`\g'name'`	call subpattern by name	Oniguruma
`\g<n>`	call subpattern by absolute number	Oniguruma
`\g'n'`	call subpattern by absolute number	Oniguruma
`\g<+n>`	call subpattern by relative number	PCRE
`\g'+n'`	call subpattern by relative number	PCRE
`\g<-n>`	call subpattern by relative number	PCRE
`\g'-n'`	call subpattern by relative number	PCRE

Alhadis on 13 Apr 2018

👍3 🎉1

Thank you for that accurate list, @Alhadis. The most common occurrence of incompatible syntax is the \g<name> that some VSCode grammars use. Obviously it's trivially fixable by simply referring to the subpattern by absolute number instead of name -- but the replacement can be tricky.

vmg on 17 Apr 2018

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale[bot] on 6 Nov 2018

i think this issue should be opened. the problem still exists.

nixel2007 on 6 Nov 2018

👍1

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale[bot] on 6 Dec 2018

Any news form GitHub.com?
If this issue has no place in gh backlog it could be closed (sadly)

nixel2007 on 6 Dec 2018

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale[bot] on 4 Feb 2019

One more ping, guys. (sorry)

nixel2007 on 4 Feb 2019

👍1

Need help this issue

ava57r on 5 Feb 2019

Nothing has changed on the GitHub side of things... https://github.com/github/linguist/issues/3291#issuecomment-255750842 still applies:

... we don't currently use the PCRE_UCP flag because it's a really significant performance degradation. I acknowledge this is not an ideal answer -- I'm looking into the option to only enabling this flag on documents that we know contain extended Unicode characters, but we can't enable it by default because it really slows down syntax highlighting. 😢

I don't have an ETA but I promise we plan to look into improving highlighting for non-English, non-ASCII documents.

This isn't likely to change in the immediate future. If you really want this issue resolved sooner rather than later, the best option is to change the grammar as discussed in various points earlier in this issue.

lildude on 6 Feb 2019

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale[bot] on 8 Mar 2019

This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.

stale[bot] on 22 Mar 2019

Linguist: does not work syntax highlighting when used with \b and the Russian alphabet

Most helpful comment

All 48 comments

Related issues