Linguist: linguist-language=xyz appears to be ignored on GitHub if 'xyz' is not a known language. Proposal: make it appear as unknown language

Created on 5 Jun 2017 · 27 comments · Source: github/linguist

linguist-language=xyz appears to be ignored on GitHub if xyz is not a known language. Of course, at first glance this seems like very reasonable behavior. I'm not saying it's a bug, but I suggest it might be worth changing:

Your policy for including a language in Linguist is "we prefer that each new file extension be in use in hundreds of repositories before supporting them in Linguist", which is very reasonable; otherwise you'd have hundreds of obscure languages that die after a few months piled up in your language files. However, this means that for small new language projects, the statistics of what they are written in might be grossly incorrect, without any indication of that in the UI.

Therefore I suggest the following:

"*.myl linguist-language=my-language" in .gitattributes for an unknown my-language should result in a stats entry of "Unknown language" for all .myl files in the repository. This gives any reader of the stats an indication that there is a language involved that github doesn't know, while avoiding the pitfalls of not having a proper language color, syntax highlighting, language id or all the other things that are not available because the language isn't actually in lib/linguist/languages.yml.

All 27 comments

Any chance this will ever be added?

As I mentioned in #3911, the lack of this feature makes GitHub's language statistics grossly wrong and useless for many of my projects and those of many other upcoming language creators. This would be a neat alternative approach that allows a fix for everyone affected who cares to make use of it, without requiring every super arcane language to be added to Linguist itself (which, as you state pretty clearly in CONTRIBUTING.md, you're not gonna do).

Alternatively, you could also just let each repository define new languages with appropriate colors and grammars - but this proposal here with "Unknown language" is IMHO a nice middle ground if you don't want repos to mess too much with the language statistics and UI in unanticipated ways.

I think this is a good, practical idea. @lildude What do you think? Is there anything preventing this on GitHub's side?

It does indeed sound like a good idea. I don't know off the top of my head if there'd be a problem on the GitHub side of things; I'd need to whip something up and then experiment to be 100% sure. I don't have the bandwidth for this at the moment, but will add it to my ToDo list.

It would be nice if the language bar used the name provided, just for display. That way it would be more friendly for users of obscure or application specific languages.

However, if that would be too much work, simply lumping it all together as "unknown language" would be a big step up from where it is now.

> It would be nice if the language bar used the name provided, just for display. That way it would be more friendly for users of obscure or application specific languages.

I can say for sure that that is not likely to happen as it's a huge security risk, would require additional validation and is open to abuse:

*.md linguist-language=lildude-is-a-plonker

😆🤣

I'm still trying to find the bandwidth to see about implementing an "unknown" option.

Security risk? What exact attack are you thinking of? If it's just the number of languages, set an upper limit. As for bad words, I don't see it either: couldn't you just put them into the project title or README already? Who would gain anything from specifically putting them into the language bar? And the remedy is the same: remove the repo, nothing really changes. It's just one more spot to put bad words, next to 20 other easier spots.

> Security risk? What exact attack are you thinking of?

The languages are added to the database and then displayed in the UI. If you can insert any string into the DB without extensive validation etc., that string can appear on the GitHub interface for any repo and be used for XSS or many other attacks.

Extensive validation? You mean a simple HTML escape which is hopefully already used anyway? (if not that'd be quite a problem on its own.) I can see a lot of reasons not to implement this, but honestly, security isn't one of them. If such a small UI change is causing you such a security / escaping safety headache, I think you have a fundamental problem elsewhere.

Edit: then again, I guess I don't know the code. If your escaping really is that fragile, I'm not in a position to tell you that's wrong. It just surprised me a little, that's all.

It’s more than just escaping and validation. As I originally mentioned, this is only one concern. You’re free to add whatever you like to your own repo. Allowing another person to potentially affect the appearance (the language can potentially appear on any repo that Linguist identifies as that language due to the current implementation) of another user’s repo, unchecked, is a whole different story.

Yes, we could possibly change the implementation, but that’s not a priority right now, and may never be.

Keep in mind that nowhere else in the GitHub UI do you have as much control over the UI as you do with the languages, courtesy of Linguist, and it’s this freedom we need to protect, even if it seems trivial to some to “fix” in order to get a desired outcome. It’s better to err on the side of overly cautious than lock this freedom down, which would almost certainly happen when people start abusing it.

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

> It does indeed sound like a good idea. I don't know off the top of my head if there'd be a problem on the GitHub side of things; I'd need to whip something up and then experiment to be 100% sure. I don't have the bandwidth for this at the moment, but will add it to my ToDo list.

@lildude Assigning this to you.

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

> Allowing another person to potentially affect the appearance (the language can potentially appear on any repo that Linguist identifies as that language due to the current implementation) of another user’s repo, unchecked, is a whole different story.

That's the behavior for the known languages that are detected by Linguist, but isn't the suggestion here the ability to override specific files with a potentially unknown language?

I cannot foresee any security risk or other impact on other repositories with that proposal. The only effect would be discoverability in searches, and that would be beneficial, not detrimental.

Just to reinforce my position: why should it be "unknown"? Let it be whatever language people specify it is. They're already going through the trouble of specifying it manually. Controlling language dialects appears overzealous to me.

Looks a lot like GitHub is discriminating against little-known languages, with a bogus argument that "lildude-is-a-plonker" could be abused this way. But that doesn't prevent "lildude-is-a-plonker" from emerging as a well-known language, and then you'd have the same problem. And if you don't know how to untaint a user-supplied string in a webpage, you shouldn't be allowed to maintain this project…

I commonly use a couple of specialty languages in my research. These languages are not known to Linguist. That's fine. I just want to be able to configure the override file in my repos so that (for my own repos only) the language bar shows the names of these specialty languages.

Exactly as above, I'm interested in a manual override with .gitattributes without requiring any changes to the way Linguist performs language detection.
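For illustration, here is a minimal Python sketch of the override-only behavior proposed in this thread. It is not Linguist's actual code (Linguist is written in Ruby). `KNOWN_LANGUAGES`, `parse_overrides` and `language_for` are hypothetical names, the language table is a tiny stand-in for languages.yml, and pattern matching is simplified to `fnmatch` on the file name rather than git's real attribute matching rules.

```python
from fnmatch import fnmatch

# Assumption: a tiny stand-in for the names defined in lib/linguist/languages.yml.
KNOWN_LANGUAGES = {"python": "Python", "ruby": "Ruby", "smalltalk": "Smalltalk"}
UNKNOWN_LABEL = "Unknown language"


def parse_overrides(gitattributes_text):
    """Return (pattern, language_name) pairs for each linguist-language attribute."""
    overrides = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        for attr in attrs:
            key, _, value = attr.partition("=")
            if key == "linguist-language" and value:
                overrides.append((pattern, value))
    return overrides


def language_for(path, overrides):
    """Resolve a path to a display name, or None when no override matches."""
    name = None
    for pattern, value in overrides:
        # Simplification: match on the file name only, not git's full rules.
        if fnmatch(path.rsplit("/", 1)[-1], pattern):
            name = value  # a later matching line wins
    if name is None:
        return None  # no override: normal detection stays untouched
    # Unrecognized names collapse into one fixed label instead of being ignored.
    return KNOWN_LANGUAGES.get(name.lower(), UNKNOWN_LABEL)


overrides = parse_overrides("*.myl linguist-language=my-language\n")
print(language_for("src/main.myl", overrides))  # Unknown language
```

Files without a matching override return `None` here, which stands for "fall back to whatever detection already does", so the sketch changes nothing for repositories that don't opt in.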

I think it simplifies the scope a lot. Out of curiosity, if the community were to provide such a PR, can we expect it to be received positively or is it a waste of time? @pchaigno @lildude

I may even do so myself, considering I've had that same discussion with coworkers at different jobs and I foresee a lot of happy fellow developers if this came to fruition. It'd give GitHub that small edge over GitLab ;)

> Curiously, if the community were to provide such a PR, can we expect it to be received positively or is it a waste of time?

This would be nice, if it was that easy, but I'm afraid we wouldn't be able to accept it.

Outside of all things mentioned before, there is an expected functionality associated with the language bar in that users expect to be able to click the language bar, then the language and get search results for that language. This will fail to find any files as GitHub will not know about the language and thus return zero search results. This in turn will put an unfair burden on us in the Linguist community to keep explaining this (generally only three of us respond to these kinds of issues/questions) and the GitHub support personnel.

Adding support for custom names in the language bar requires a lot more than a few tweaks in Linguist as the GitHub-side changes are far from trivial and would require an internally commissioned project to implement it.

> Adding support for custom names in the language bar requires a lot more than a few tweaks in Linguist as the GitHub-side changes are far from trivial and would require an internally commissioned project to implement it.

My understanding was that the Unknown Language would not be clickable in the language bar, as is already the case for the Other statistic. Then, the Unknown Language would have tm_scope: none syntax highlighting. Any unrecognized linguist-language override would be assigned the Unknown Language.

Considering this, are there any other changes on GitHub's side besides supporting a special Unknown language entry that is only shown in the language bar? The update of statistics, the mapping of files to that language, and the selection of syntax highlighting are all on Linguist's side.

I came here to ask for this exact thing. 😄 I have zero expectation of Linguist adding my toy languages to its languages.yml (that would be terrible), but it is also somewhat nasty to have zero accounting of files unknown to Linguist.

_At the moment they're not even accounted for in the "Other" percentage._

Eg: https://github.com/nikodemus/foolang reports Other 1.3%.

Currently 13.25% of that repository is in a language unknown to Linguist.

If it said Other 14.55% that would seem much more reasonable and useful to me.

Being able to locally configure the language name via .gitattributes would be supernice, but just accounting for the files in the first place would be enough -- there's already plenty of mechanisms for telling Linguist to ignore files if this is undesirable in some cases.
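To make the accounting problem concrete, here is a small Python sketch of how shares change depending on whether unrecognized files are counted. The byte counts are made up for illustration and are not taken from the repository mentioned above; `language_shares` is a hypothetical helper, not part of Linguist.

```python
def language_shares(bytes_by_language, unknown_bytes=0, count_unknown=False):
    """Return percentage shares per language, rounded to two decimals."""
    totals = dict(bytes_by_language)
    if count_unknown and unknown_bytes:
        # Account for unrecognized files under one fixed label.
        totals["Unknown language"] = unknown_bytes
    total = sum(totals.values())
    if total == 0:
        return {}
    return {name: round(100 * size / total, 2) for name, size in totals.items()}


# Hypothetical repository: mostly files in a language Linguist does not know.
recognized = {"Shell": 500, "C": 1500}

print(language_shares(recognized, unknown_bytes=8000))
# {'Shell': 25.0, 'C': 75.0}
print(language_shares(recognized, unknown_bytes=8000, count_unknown=True))
# {'Shell': 5.0, 'C': 15.0, 'Unknown language': 80.0}
```

With the unknown files left out, the percentages are computed over the recognized bytes only, so a handful of recognized files can end up dominating the bar.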

Linking this to a version I opened later on.

https://github.com/github/linguist/issues/4166

This feature seems very useful. The statistics have confused me before when one language not in the languages.yml was recognized as 2-3 different unrelated languages.

Just noting that I changed the repository above to lie to Linguist, claiming my toy language is Smalltalk - because the portion of the unreported files started getting on my nerves. (It's not, and the syntax highlighting doesn't work right, but ... meh.)

Which spurred me to take another try on this windmill.

One option that should be simple, from what I understand of how Linguist is structured, would be to add a few dummy languages called e.g. "Custom", "DSL", "Unknown", all highlighting as text. The repositories suffering from this issue could specify *.xyz linguist-language=Custom, and 99% of the issue would be gone.

  • It would be opt-in, so no one who doesn't want it gets it.
  • It would be safe, because users cannot control the name, or any of the code or rules associated with it.
  • It _should_ be simple, just one or few entries added to languages.yml.

If this sounds like it would be acceptable to Linguist maintainers, I'm happy to submit a PR providing those dummy languages.
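As a rough Python sketch of why fixed placeholder names would be safe: an override is honored only if it names an existing language or one of the maintainers' placeholders, and anything else is ignored as it is today. `KNOWN`, `PLACEHOLDERS` and `apply_override` are hypothetical names for illustration, and the case-insensitive lookup is an assumption, not a description of Linguist's internals.

```python
# Assumption: stand-in for language names already present in languages.yml.
KNOWN = {"Python", "Ruby", "Shell", "Markdown"}
# Hypothetical placeholder entries, chosen by maintainers, not by repository owners.
PLACEHOLDERS = {"Custom", "DSL", "Unknown"}

_ALLOWED = {name.lower(): name for name in KNOWN | PLACEHOLDERS}


def apply_override(requested, detected):
    """Pick the language to report for one file.

    requested: the linguist-language value from .gitattributes, or None.
    detected:  whatever normal detection produced, or None.
    """
    if requested is not None:
        canonical = _ALLOWED.get(requested.lower())
        if canonical is not None:
            return canonical  # opt-in: a known language or a fixed placeholder
    return detected  # arbitrary names are ignored, exactly as before


print(apply_override("Custom", None))                      # Custom
print(apply_override("lildude-is-a-plonker", "Markdown"))  # Markdown
```

The repository owner can pick from the list but cannot add to it, so no user-supplied string ever reaches the language bar.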

@nikodemus The idea does indeed seem to solve the issue easily enough.
I just encountered this problem when we migrated our Windev projects to GitHub (and while it would be straightforward enough to add those to the known languages, I doubt it sees much usage on the platform, so that's probably not that useful).
It just triggers my OCD that the project gets classified as Shell because there are a couple of *.env files in there and because Linguist ignores the 2000+ files it doesn't recognize. Setting a Custom tag would be sufficient IMO.

@lildude Thanks for your work on this issue to date. Might you be able to provide a status update from your end?

No change. The last part of my last comment still stands:

> Adding support for custom names in the language bar requires a lot more than a few tweaks in Linguist as the GitHub-side changes are far from trivial and would require an internally commissioned project to implement it.

In short, this won't be implemented until such time as GitHub's Product team commission a project for it.

I understand. What about the idea posted above by @nikodemus to add one custom language to Linguist called "Unknown", "Custom", "DSL", or "Other"?

That would allow users to do *.xyz linguist-language=Custom, which I think would satisfy those of us interested in this issue, while not requiring GitHub to support custom names in the language bar.

Linguist is responsible for language detection. If that language is explicitly specified in .gitattributes, that makes its job even easier and it can simply use that instead of guessing with heuristics.

The final breakdown that the tool produces, and the bar shown on the GitHub website are separate concerns.

If GitHub wants to filter which languages have their "official" stamp of approval, they can do that.
Just like it currently shows "Other" if it's given too many languages with small percentages in a repository.

That's a display concern.

I think the concerns are being conflated and for some reason, linguist is gate-keeping the issue from escalating to the proper team internally at GitHub. I'm still willing to help if help is needed.
