Linguist: Allow overriding with custom language name and/or implementation name.

Created on 25 Feb 2019 · 20 Comments · Source: github/linguist

I have read the following issues entirely: #2627, #2360, #2598. However, this feature request addresses a different problem with a couple of entirely different solutions that also address the concerns mentioned in these issues.

Preliminary Steps

Please confirm you have...

[X] reviewed How Linguist Works,
[X] reviewed the Troubleshooting docs,
[X] considered implementing an override,
[X] verified an issue has not already been logged for your issue (linguist issues).

Problem Description

Different implementations or specifications (specifically major language versions) of the same language may sometimes result in incompatibilities such as different syntax, new/deprecated language features, functions, builtin libraries, etc.

Some projects are forced to maintain legacy code which was written in old iterations of the same language, but is still supported, or otherwise, code that targets a specific implementation that ended up being a language fork and is no longer conforming to the standardized specification.

Notable examples include Fortran (.F vs .f90, almost two different languages), Lua (LuaJIT vs Lua 5.3, which involves mutually incompatible features and entirely different interpreters for different purposes), C# (or any CLR language really: C# 6.0 in Mono vs C# 8.0 in .NET Core vs older C# versions that targeted older iterations of the .NET Framework), and Python (2 vs 3). Any language that was forked in some way, for any reason, is also subject to this problem.

The repository could declare which exact versions or implementations of a language are being used in what files/directories to clarify and monitor that in the language statistics, therefore the problem is cosmetic in nature.

Possible Solutions

This issue has 3 different possible solutions in terms of feature request:

Allow the repository to declare a custom language-implementation attribute in .gitattributes as an arbitrary string, and display this value in parentheses next to the name of the language that is already detected by Linguist.

Hypothetical examples:

65.00% C++
15.00% C
10.00% Lua
10.00% Other

Would appear as:

20.00% C++
45.00% C++ (C++98)
15.00% C
10.00% Lua (LuaJIT)
10.00% Other

Another example with Fortran:

100.00% Fortran

Would appear as:

40.00% Fortran
20.00% Fortran (FORTRAN 77)
20.00% Fortran (FORTRAN 66)
20.00% Fortran (FORTRAN IV)

The same as above, but in case the name of the implementation/version was mentioned, then replace the label entirely:

20.00% C++
45.00% C++98
15.00% C
10.00% LuaJIT
10.00% Other

And:

40.00% Fortran
20.00% FORTRAN 77
20.00% FORTRAN 66
20.00% FORTRAN IV

Allow defining a custom language with a name, a color, and optionally an existing syntax highlighter known to Linguist (you can already tell Linguist that a file detected as Java is actually C#, or vice versa). This solution is the most desirable because it covers the problem described above and also lets other developers associate their files with their own private languages or forks of known languages, and indicate this to other GitHub users.

*.cc linguist-language=C(name: "C with Classes", color: #abcdef)

This would group all files that end with .cc and show them as _C with Classes_ using the specified color and the existing syntax highlighter for C that is provided by Linguist. In global search results you could either count it as plain C (because the user chose the C highlighter) or ignore it altogether.
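A minimal sketch of how such a line could be read, assuming the hypothetical `linguist-language=Base(name: "...", color: #rrggbb)` syntax shown above. Neither Git nor Linguist supports this syntax today; the `CustomLanguage` type and the `parse_custom_language` function are names made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class CustomLanguage(NamedTuple):
    pattern: str          # file pattern, e.g. "*.cc"
    base: str             # existing language whose highlighter is reused
    name: str             # label to show in the language bar
    color: Optional[str]  # hex color, or None if not given


# Hypothetical syntax from the proposal; not a real .gitattributes feature.
_LINE = re.compile(
    r'^(?P<pattern>\S+)\s+linguist-language=(?P<base>[^(\s]+)'
    r'(?:\((?P<opts>.*)\))?\s*$'
)
_NAME = re.compile(r'name:\s*"(?P<name>[^"]*)"')
_COLOR = re.compile(r'color:\s*(?P<color>#[0-9a-fA-F]{6})')


def parse_custom_language(line: str) -> Optional[CustomLanguage]:
    """Parse one attribute line; return None if it is not an override."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    opts = match.group("opts") or ""
    name = _NAME.search(opts)
    color = _COLOR.search(opts)
    return CustomLanguage(
        pattern=match.group("pattern"),
        base=match.group("base"),
        # Without a custom name, fall back to the base language's name.
        name=name.group("name") if name else match.group("base"),
        color=color.group("color") if color else None,
    )


entry = parse_custom_language(
    '*.cc linguist-language=C(name: "C with Classes", color: #abcdef)'
)
print(entry)
```

Only plain strings are extracted here, and a line without the parenthesized part still parses as an ordinary override with the base language's name.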

Regarding #2360:

If the user specified that the custom language uses the syntax highlighter of an existing language, then you could treat the custom language as the language it derives its syntax from when narrowing GitHub search results or global language trends, or ignore it altogether.
If the user didn't specify a syntax highlighter at all, then exclude it completely from any search results outside of the repository, if the concern is keeping search results clean of obscure or unknown user-defined/forked languages.

Regarding both #2627 and #2598:

The solutions I have proposed do not require the execution of any code (such as would be required to define a whole new syntax highlighter), but only the input of custom string values from the .gitattributes file, thus they don't pose any potential security vulnerabilities or legal issues with licensing.
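To illustrate that the first two proposals need nothing beyond string labels, here is a rough sketch of the statistics step. It assumes each file already carries a size, the language detected by Linguist, and an optional user-declared implementation string. The `breakdown` function, its `replace_label` flag, and the file sizes are all made up for this example and do not reflect how Linguist computes its statistics internally.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# (size in bytes, detected language, optional user-declared implementation)
FileInfo = Tuple[int, str, Optional[str]]


def breakdown(files: Iterable[FileInfo], replace_label: bool = False) -> Dict[str, float]:
    """Return the percentage of bytes per display label.

    replace_label=False -> "Lua (LuaJIT)"  (first proposal)
    replace_label=True  -> "LuaJIT"        (second proposal)
    """
    totals: Dict[str, int] = defaultdict(int)
    for size, language, implementation in files:
        if implementation is None:
            label = language
        elif replace_label:
            label = implementation
        else:
            label = f"{language} ({implementation})"
        totals[label] += size
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: 100.0 * size / grand_total for label, size in totals.items()}


# Made-up sizes chosen to mirror the C++/C/Lua example in the proposal.
files = [
    (2000, "C++", None),
    (4500, "C++", "C++98"),
    (1500, "C", None),
    (1000, "Lua", "LuaJIT"),
    (1000, "Other", None),
]
for label, pct in breakdown(files).items():
    print(f"{pct:.2f}% {label}")
```

Passing `replace_label=True` yields the second variant, where `C++98` and `LuaJIT` replace the parenthesized labels.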

Stale

david-von-tamar

👍2

Most helpful comment

I think this would be beneficial to language hobbyists. I was working on an experimental (immutable) Lisp dialect at some point and it was quite discouraging to think that my repositories would be labeled as "Common Lisp" instead of the fruit of my work. Similar experience with an Actor-based language, a Flow-Based language and friends expressing the same concerns with a bunch of little DSLs.

nitrix on 2 Mar 2019

👍4

All 20 comments

I understand where you're coming from, but this would quickly become problematic for lesser-known languages which evolve much quicker, and/or with less noticeable changes to syntax or semantics.

Moreover, how would this benefit users aside from (possibly) improved highlighting? Even syntax highlighting grammars can be improved to accommodate for implementation-specific discrepancies, either with creative TextMate hacks or simply revising scope-name choices (which affect the colours used to highlight code on GitHub).

We're already in the process of disambiguating what Linguist considers to be a "language group" (something which was ill-defined to start with), and introducing a new categorical tier is going to make it hard, if not impossible, to tell where the boundaries lie between "group", "language", and "implementation". For C/Fortran, the distinction is obvious, but much less so for entries like Assembly, which cover a vast multitude of dialects, revisions, and what one might call "implementations".

Alhadis on 25 Feb 2019

I understand where you're coming from, but this would quickly become problematic for lesser-known languages which evolve much quicker, and/or with less noticeable changes to syntax or semantics.

I don't see how this could harm the popularity of young and evolving languages, given that the custom names a user may set are relevant only within the local repository.

Could you provide a brief example?

Moreover, how would this benefit users aside from (possibly) improved highlighting? Even syntax highlighting grammars can be improved to accommodate for implementation-specific discrepancies, either with creative TextMate hacks or simply revising scope-name choices (which affect the colours used to highlight code on GitHub).

I wasn't suggesting user-defined custom highlighting because it was already dismissed in the past several times as "prone to Turing-complete vulnerabilities" if I inferred that correctly (that arbitrary code may run on other machines and cause unpredictable side effects). My suggestion was far more simple than that with no potential side effects at all, because all that it does is merely grouping files under a new name & color.

We're already in the process of disambiguating what Linguist considers to be a "language group" (something which was ill-defined to start with), and introducing a new categorical tier is going to make it hard, if not impossible, to tell where the boundaries lie between "group", "language", and "implementation". For C/Fortran, the distinction is obvious, but much less so for entries like Assembly, which cover a vast multitude of dialects, revisions, and what one might call "implementations".

I've seen that effort. I think it addresses something quite different. The main differences are:

#4291 attempts to change the way languages are classified on GitHub by default, whereas my issue addresses the situation where a user insists on indicating that they're using a specific language implementation, version, fork or dialect which the maintainers of Linguist may not even be aware of, or may simply disagree should be classified as a separate language, thus leaving the user without any ability to classify their own repository as they wish.
#4291 doesn't address a situation where a user decided to define their own dedicated language for their specific project needs (#4291 either assumes that the language must be widely used, or otherwise nonexistent).
#4291 is agnostic to language implementations (as you have noted), whereas this is at the center of my issue. If I create a new project in _LuaJIT_, I most definitely don't want it to be classified or associated with the standard PUC _Lua 5.3_ (they're literally incompatible, not just syntactically but also functionally; the sole reason they're both "Lua" for Linguist is that it cannot make a distinction without false positives in this case), or with _eLua_. Those three target entirely different platforms and use cases, and you'd also notice different programming styles in each of them due to the nature of their implementations. And yet they're all under one language in Linguist. And that's the main problem with it. Linguist can help a user to classify a repository at first look, but it should not enforce itself upon a repository when the user knows better.

david-von-tamar on 25 Feb 2019

Could you provide a brief example?

JavaScript. New features are being added every year, with each year witnessing a new (formally defined and named) implementation of the language. Things get even messier when you consider precompilers like TypeScript (which are effective supersets of JavaScript) and JSX extensions (which blur the lines between non-standard features and user-submitted proposals).

Note how many "Presets" are listed by Babel's REPL. Those have only come into existence in the last ~5 years: this is the level of fragmentation we need to consider.

My suggestion was far more simple than that with no potential side effects at all, because all that it does is merely grouping files under a new name & color.

This is the real deal-breaker:

… under a new name & color.

User-defined languages aren't a possibility, given the mechanics of Linguist and GitHub's indexing engine, and although I don't purport to know the exact logic behind it all, I can tell you this feature would involve a full-blown overhaul of GitHub's internals, affecting everything from language searches to trending repository listings.

In the end, we're benefiting only a minority of users, whilst impacting millions of others. We can't cater to everybody, and the classification system we have in place at the moment is the end-result of years of feedback and refinement. I don't think I've seen this feature suggested before, so I'm inclined to think most users wouldn't use/need it.

If two implementations of a language are decidedly different enough to be considered distinct, then they should be considered separate languages (e.g., Perl / Perl 6).

Alhadis on 25 Feb 2019

I should point out that topics are a helpful way of classifying repositories with author-defined details. For example, there are 262 repositories tagged with luajit, 424 repositories tagged with mono, 67 repos tagged with fortran90, and so forth.

This is arguably a better solution for making implementation details visible to users, and you aren't limited to defining implementation-related keywords either.
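As a rough illustration of the topics approach, the sketch below builds a repository search by topic against GitHub's public REST search endpoint. The helper names `topic_search_url` and `count_repositories` are made up here, and the counts quoted above will have changed since, so no output is shown for the network call.

```python
import json
import urllib.parse
import urllib.request

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def topic_search_url(topic: str) -> str:
    """Build a search URL for repositories tagged with the given topic."""
    query = urllib.parse.urlencode({"q": f"topic:{topic}"})
    return f"{SEARCH_ENDPOINT}?{query}"


def count_repositories(topic: str) -> int:
    """Return how many repositories carry the topic (performs a network request)."""
    request = urllib.request.Request(
        topic_search_url(topic),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["total_count"]


# Only the URL is built here; call count_repositories("luajit") to query GitHub.
print(topic_search_url("luajit"))
```

Note that this counts whole repositories by tag; it says nothing about which files inside a repository belong to which dialect.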

Alhadis on 25 Feb 2019

Could you provide a brief example?

JavaScript.

Oh, it's ECMAScript! This time the name of the implementation actually won. (I'm not trying to make any point with it, I just mentioned a fact, that's all.)

This is the real deal-breaker:

… under a new name & color.

Cosmetic options are deal breakers. I see.

User-defined languages aren't a possibility, given the mechanics of Linguist and GitHub's indexing engine, and although I don't purport to know the exact logic behind it all, I can tell you this feature would involve a full-blown overhaul of GitHub's internals, affecting everything from language searches to trending repository listings.

If I recall correctly, users may actually define new 'topics' for their repositories. Those topics are later used by GitHub's search engine and other internals as well.

Shouldn't the same interface apply to programming languages at some point?

The classification system we have in place at the moment is the end-result of years of feedback and refinement.

Those "years of feedback" also included desperate requests from users who wanted to classify their own repositories with their own shell dialects or obscure Domain Specific Languages.

I don't think I've seen this feature suggested before, so I'm inclined to think most users wouldn't use/need it.

I've linked at least 3 issues that included almost identical feature requests by other users.

If two implementations of a language are decidedly different enough to be considered distinct, then they should be considered separate languages (e.g., Perl / Perl 6).

If you'd want me to classify Lua this way I'd end up with at least four distinct dialects (<5, 5.1, 5.3, JIT) and it keeps changing. Code from 5.1 is incompatible with Lua 5.3, for example. Same goes for JIT which is not 5.1 exactly nor 5.2, but something in between, and has its own innovations too (like _FFI Semantics_ as the author puts it).

Another problem in trying to classify Lua dialects with Linguist is that they are syntactically similar and share the same file extensions, yet they function quite differently, with many features being added and deprecated.

But if Fortran's grouping is acceptable to you, then how could I even make a case for Lua? I'm also puzzled how did Fortran end up being a group of languages while ECMAScript diverged into separate implementations?

david-von-tamar on 25 Feb 2019

Those "years of feedback" also included desperate requests from users who wanted to classify their own repositories with their own shell dialects or obscure Domain Specific Languages.

I believe the missing feature is support for user-defined languages, and how a user decides to define a language is arbitrary. They might be authoring a new language, or, like you, have a wish to differentiate between dialects and major language revisions. Your suggestion specifically concerns the latter, and would be adequately addressed by the addition of user-defined language support. Which, yes, is a well-acknowledged limitation of ~~Linguist~~ GitHub in general.

I've linked at least 3 issues that included almost identical feature requests by other users.

Almost identical? That's quite a leap from the OP:

However, this feature request addresses a different problem with a couple of entirely different solutions

… how did Fortran end up being a group of languages while ECMAScript diverged into separate implementations?

Are you seriously comparing a 62-year old, pioneering language with one that evolved in barely two decades and started life as a proprietary scripting language?

Shouldn't the same interface apply to programming languages at some point?

Why? How is the topics feature inadequate?

If you'd want me to classify Lua this way I'd end up with at least four distinct dialects (<5, 5.1, 5.3, JIT) and it keeps changing.

If you're unsatisfied with the way Lua and Fortran are currently classified, then I recommend submitting a pull-request to break them into separate languages. Changing site-wide mechanics to benefit a handful of languages is neither feasible nor practical.

If I recall correctly, users may actually define new 'topics' for their repositories. Those topics are later used by GitHub's search engine and other internals as well.

That feature was added more recently, and should already be adequate for declaring things like dialects or language versions.

Alhadis on 25 Feb 2019

@david-tamar Please don't be discouraged and close this issue before anyone other than @Alhadis has had a chance to look into it and give their opinion. I often agree with @Alhadis on these issues, but here, I'm not sure I fully understand what you want, and I'd prefer to understand before I make up my mind.

given that the custom names a user may set are relevant only within the local repository.

Are you proposing that the custom language name only be taken into account inside the repository? So users couldn't search for that language on the whole GitHub.com? Is the idea only to give more detailed information in the language bar (e.g., C++98 vs. C++)?

I wasn't suggesting user-defined custom highlighting because it was already dismissed in the past several times as "prone to Turing-complete vulnerabilities" if I inferred that correctly (that arbitrary code may run on other machines and cause unpredictable side effects). My suggestion was far more simple than that with no potential side effects at all, because all that it does is merely grouping files under a new name & color.

As I understand your suggestion, Linguist would still be in charge of selecting grammars and the users would just choose between them? If there's a better grammar for a language (even a dialect), we usually welcome pull requests to apply that grammar, even in cases where it requires breaking a language down into its dialects to apply a different grammar to each (whilst still grouping these dialects under the same parent language). Did I misunderstand your proposal? As I understand it, I'm not sure when it would be useful (?).

pchaigno on 25 Feb 2019

@david-tamar Please don't be discouraged and close this issue before anyone other than @Alhadis has had a chance to look into it and give their opinion. I often agree with @Alhadis on these issues, but here, I'm not sure I fully understand what you want, and I'd prefer to understand before I make up my mind.

OK. I'll reopen this issue for more feedback then. I lost nearly all enthusiasm once it was made clear that GitHub's current limitations render this feature request unfeasible. On top of being dismissed as a redundant feature request by @Alhadis.

I wanted to stress that 'topics' are not a solution, because topics cannot track language statistics within the repository per file like Linguist does (resulting in an up-to-date % breakdown of the entire repository according to its actual contents).

Are you proposing that the custom language name only be taken into account inside the repository? So users couldn't search for that language on the whole GitHub.com? Is the idea only to give more detailed information in the language bar (e.g., C++98 vs. C++)?

At first I proposed that, yes, because I wanted to take the path of least resistance in the hope that it wouldn't lead to concerns such as "it'll pollute GitHub's language statistics with duplicate or nonexistent languages".

However then I realized that people are already polluting GitHub's search results & statistics with duplicate topic tags anyway, so it might even make language tags such as "C++98", "CPP98", "cpp98" or "cxx98" as equally legitimate search terms as their equivalent topics would otherwise be.

At the moment having the names of the implementations or dialects indicated within the local repository only would suffice too. My desire is to have precise language statistics in my repository so I can let other people differentiate between source files that belong to different dialects or implementations within my own repository.

As I understand your suggestion, Linguist would still be in charge of selecting grammars and the users would just choose between them?

Yes, I think Linguist does a good job at providing grammar & syntax highlighting, but that shouldn't prevent the user from grouping files under different dialects that may utilize the same grammar (files grouped under C++98 and files grouped under C++17 could both use the same generic/default C++ grammar for the most part, so it's not a major problem as long as you just want to see two separate groups in your language statistics).

If there's a better grammar for a language (even a dialect), we usually welcome pull requests to apply that grammar, even in cases where it requires breaking a language down into its dialects to apply a different grammar to each (whilst still grouping these dialects under the same parent language). Did I misunderstand your proposal? As I understand it, I'm not sure when it would be useful (?).

I'm aware that I may suggest new grammar for languages that I view as distinct dialects. But dialects don't always have drastically different grammar from each other.

This may cause false-positive classifications very often, since the major differences between them are not grammatical but lie in the way they function or are compiled (especially in Lisp dialects or Lua dialects).

Therefore grammar is not necessarily the reason one may want to separate between source files of the same language into different groups.

Features that may differ between specifications and implementations include:

Changes in the standard library.
Altered semantics that are not easily differentiable in the text itself.
Implicit type casting in weakly typed languages.

For example, Lua 5.3 has both integers and doubles, but previous implementations such as Lua 5.1 had only doubles. This cannot be determined from the text itself, because Lua is weakly typed and the files are all .lua with similar grammar.

Generally speaking, code written for 5.1 won't work on 5.3 or vice versa, and the same goes for the JIT dialect and other old Lua dialects. Those differences are not easily seen in the text itself until you attempt to execute the sources while targeting the wrong interpreter or implementation.
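The integer/double point can be modeled in a few lines. This is a Python model of the behavior, not Lua code: it assumes a default Lua 5.1 build where every number is a double, and Lua 5.3 where an integer literal within the 64-bit range stays an integer. The function names are made up for illustration.

```python
# A Python model of the number semantics; not Lua code.
BIG = 2**53 + 1  # 9007199254740993, not exactly representable as a double


def lua51_number(literal: int) -> float:
    """Model of a default Lua 5.1 build: every number is a double."""
    return float(literal)


def lua53_number(literal: int) -> int:
    """Model of Lua 5.3: an integer literal in 64-bit range stays an integer."""
    if not -2**63 <= literal < 2**63:
        raise ValueError("outside the 64-bit integer range modeled here")
    return literal


# The same two source literals compared under each model:
print(lua51_number(BIG) == lua51_number(BIG - 1))  # both round to the same double
print(lua53_number(BIG) == lua53_number(BIG - 1))  # the integers stay distinct
```

The source text is identical in both cases; only the targeted implementation decides whether the two values compare equal, which is the kind of difference a grammar-based classifier cannot see.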

david-von-tamar on 25 Feb 2019

It's worth mentioning that the language stat-bar won't always be visible depending on the viewer's device and/or platform; e.g., no statistics are displayed on mobile, where only a "View code" button is offered.

This means information the author intends to display to users won't always be available, depending on how they're browsing your repository.

Alhadis on 25 Feb 2019

It's worth mentioning that the language stat-bar won't always be visible depending on the viewer's device and/or platform; e.g., no statistics are displayed on mobile, where only a "View code" button is offered.

This means information the author intends to display to users won't always be available, depending on how they're browsing your repository.

Well, that has more to do with the shortcomings of the mobile interface than with this issue in particular then.

Anyway, this is not entirely accurate. I've opened up GitHub on my phone right now and search results do show the name of the language used in each repository.

This information is taken from the language statistics that are calculated by Linguist.

david-von-tamar on 25 Feb 2019

nitrix on 2 Mar 2019

👍4

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale[bot] on 1 Apr 2019

This issue is still relevant as it affects the discoverability and growth of projects for the entire community. I am still interested in seeing repositories adopt and advertise their use of different languages, tools, implementations or even dialects, we just need the right individuals at GitHub to want to work on the problem.

What is proposed so far is completely in the realm of the possible, Git attributes for example would work immediately out of the box. It's a political game at this point.

nitrix on 1 Apr 2019

stale[bot] on 1 May 2019

Ignoring a problem doesn't make it go away.

nitrix on 1 May 2019

stale[bot] on 31 May 2019

This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.

stale[bot] on 14 Jun 2019

Since I found my way here, I probably have to implement syntax highlighting in Atom by writing a package for it. It's okay if it's plain when rendered in Markdown...

francis94c on 2 Jan 2020

Has there been any progress on this? For tiny homebrew languages this would indeed be a really useful and motivating change, otherwise it's like I'd even turn off the language statistics entirely if I could because if they're plain wrong what's the point. And as I think everyone agrees, including every tiny language that might be discontinued soon into linguist's github-wide language list isn't sensible, but on a per-repo basis it absolutely might be. Surely it can't be that hard to allow a different colored entry on the bar based on some .github-linguist.yml in a repo which could overwrite a file extension?

Also can't this issue be marked such that the stale bot stops messing with it? It's not like it'll magically solve itself.

Edit: as for @francis94c, it seems like you possibly took a wrong turn, maybe try here? Atom isn't really related to this issue.

etc0de on 22 Apr 2020

👍1

Has there been any progress on this?

Nope because this requires more than just changes in Linguist and thus requires buy-in and "product sponsorship" for the GitHub.com engineering side of things first.

I opened an issue in the private GitHub org repo for this back in 2018 and regularly update it with new requests.

Also can't this issue be marked such that the stale bot stops messing with it? It's not like it'll magically solve itself.

As this is dependent on changes outside of Linguist, I don't think there's any value in keeping it open here, hence I've allowed it to auto-close.

lildude on 22 Apr 2020
