Linguist: Many *.g and *.gd files are mis-recognized as GAP, how to handle it?

Created on 15 Sep 2014  ·  29 Comments  ·  Source: github/linguist

I just had a look at https://github.com/trending?l=gap and noticed that the top projects there are not using GAP at all, but rather are misdetected. They all seem to be using something called "G-code", see http://reprap.org/wiki/G-code -- format for RepRap 3D printers (it seems to essen

Some concrete examples (randomly picked):

The official extension for such files seems to be ".gco", but clearly many people use ".g".

This is a bit annoying. What can be done to fix this? One solution would be to add RepRap G-code to the list of "languages" detected by linguist. Note that this is not a general programming language, but rather a set of commands controlling the RepRap "3D printer". Still, it is some kind of machine code.

Should one add a language entry for this, then? What would be the name of the "language", "G-code"? Or is there another way to deal with such misdetections?

Misidentified Language

All 29 comments

I don't see any other solution than the one you proposed.
Since this language seems widespread on GitHub, it could be integrated into Linguist.
G-code seems to be the official name.

These files look like data files (type could maybe be set to data).
They all seem to be generated from some kind of graphical program.

Since #1529 and #1539 have been merged, I hope this will be resolved with the next linguist release (once it's been rolled out to GitHub, that is). We'll see... :)

@fingolfin does this look better now?

@bkeepers What exactly? Was there any recent change that might affect this? As far as I can tell, https://github.com/trending?l=gap is exactly as wrong as it was before.

Aha, but my G-code PR was merged. Alas, it seems GitHub does not recompute language stats unless a repository is in use (and even then it seems to sometimes fail -- I reported this already in the past to github). So, e.g. https://github.com/reprappro/Mendel still is listed as containing a lot of GAP code (in fact: 100% when it really contains 0%...). But if I run the latest linguist version (from git) manually on it, I get

96.94%  OpenSCAD
1.56%   Prolog
1.50%   Shell

The Prolog and Shell files are mis-detections, but otherwise this is much better. I dunno if it recognized the .g files (is there a way to make it print detected "data" files, too?), but at least no more fake GAP files.

Ah well some more https://github.com/trending?l=gap entries:

But on the "plus side", two GAP projects made it into the list of top scilab projects, yay: https://github.com/trending?l=scilab lists https://github.com/gap-system/SingularInterface and https://github.com/gap-system/anupq

But those won't be fixed without #1541 or #1523 or some other specific action...

Ok, looks like we still have some work to do.

Aha, but my G-code PR was merged. Alas, it seems GitHub does not recompute language stats unless a repository is in use (and even then it seems to sometimes fail -- I reported this already in the past to github)

GitHub currently only runs analysis on changed files when you push. @arfon and I worked on some changes that we should be rolling out in the next week or so to re-analyze all files when we update linguist.

@bkeepers Awesome, you guys rock :-)

For the https://github.com/BLLIP/bllip-parser repository, several files are misdetected as GAP, even though they are G-Code, which is in theory "known" to linguist. Alas:

~/Projekte/foreign/linguist (git:master+)$ LINGUIST_DEBUG=2 bundle exec linguist bllip-parser/first-stage/DATA/EN/l.g
bllip-parser/first-stage/DATA/EN/l.g: 0 lines (0 sloc)
  type:      Text
  mime type: text/plain
     #    G-code       GAP
-   11    40.767         -
A   45         -   289.200
e   11   133.413         -
    G-code =   -633.134 +  -5.547 =   -638.681
       GAP =   -518.114 +  -4.988 =   -523.101
  language:  GAP
  blob is too large to be shown

So GAP clearly "wins" because these data files happen to contain the letters "A" and "e" as "words"... Ugh. G-Code files are almost completely made up of numbers, but it seems numbers are not treated as "keywords". So the classifier fails utterly in this case. Not sure what to do about that...
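The arithmetic in the debug output above can be replayed in a few lines of Ruby. This is only a sketch: the numbers are copied from that output, while the `SCORES` name and the reading of the two terms as "token score plus language prior" are assumptions, not something the output states. It shows that the result is simply the larger (less negative) total, with no way to answer "neither".

```ruby
# Scores copied from the LINGUIST_DEBUG output above.
# Assumption: first term = summed token log-score, second term = language log-prior.
SCORES = {
  "G-code" => [-633.134, -5.547],
  "GAP"    => [-518.114, -4.988],
}

# Add the two terms per language, then pick the language with the highest total.
totals = SCORES.transform_values { |token_score, prior| token_score + prior }
winner = totals.max_by { |_language, total| total }.first

totals.each { |language, total| puts format("%-6s = %.3f", language, total) }
puts "language: #{winner}"
```

The totals agree with the debug output up to rounding of the displayed terms. Since only three tokens ("-", "A" and "e") contribute, the two occurrences-heavy letters are enough to tip the total towards GAP.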

@fingolfin Maybe a new heuristic rule could help...?

:flags: flagging this as stale.

First off, as I just learned, those files in BLLIP are not G-Code at all. That in turn means that the files lm.g and rm.g should be removed from the G-code sample set (let me know if you need me to file a pull request for that).

But of course even with that change, those files in BLLIP (and any other .g file that is neither GAP nor G-Code) will be misdetected. I still think this is a design flaw in linguist (see issue #1571)...

Anyway, I'll have a look to see if I can figure out how these "heuristics" work and whether I can get them to help... (I guess they are yet another attempt to work around that fundamental issue I mentioned... _sigh_)

So... actually... how does one add a disambiguation heuristic rule for two languages that produces the result "neither of these languages matches"?

let me know if you need me to file a pull request for that

They can probably be removed in #1027.

I guess they are yet another attempt to work around that fundamental issue I mentioned...

No, they would still be here even if we had a "not a code" option ;)

So... actually... how does one add a disambiguation heuristic rule for two languages that produces the result "neither of these languages matches"?

Just return nothing in the case where none of the languages match. For instance:

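disambiguate "Hack", "PHP" do |data|
  # Hack files open with "<?hh".
  if data.include?("<?hh")
    Language["Hack"]
  # "<?" followed by anything but "h" is treated as PHP ("?" must be escaped).
  elsif /<\?[^h]/.match(data)
    Language["PHP"]
  end
  # Neither branch taken: the block returns nil, i.e. no language.
end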
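Applied to the `.g` case, the same "return nothing" idea can be sketched as plain Ruby outside of linguist. Everything here is hypothetical: the function name `guess_g_language` and both patterns are illustrative assumptions, not rules taken from linguist. The patterns rely only on G-code lines typically starting with a `G` or `M` command word followed by digits, and on GAP using `:=` for assignment and `;` to end statements. Inside linguist, the branches would return `Language["..."]` from a `disambiguate` block as in the example above.

```ruby
# Hypothetical patterns, not taken from linguist.
# G-code: a line starting with a G or M command word, e.g. "G1 X10 Y20".
GCODE_LINE = /^[GM]\d+(\s|$)/
# GAP: an assignment with ":=" or a line ending in ";".
GAP_LINE = /:=|;\s*$/

# Returns "G-code", "GAP", or nil when neither pattern matches.
def guess_g_language(data)
  if GCODE_LINE.match?(data)
    "G-code"
  elsif GAP_LINE.match?(data)
    "GAP"
  end
end

p guess_g_language("G28\nG1 X10 Y20 F1500\n")               # => "G-code"
p guess_g_language("f := function(x) return x^2; end;\n")   # => "GAP"
p guess_g_language("11 A 45 e\n")                           # => nil
```

This is a crude check: a GAP file with a line starting with a variable named like `G1` would be taken for G-code, so real rules would need input from people who know both languages.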

:+1: This is (mostly) misidentified as GAP, though it is GDScript.... https://github.com/KOBUGE-Games/ringed

Also, I suspect projects that rely on ANTLR also have *.g files identified as GAP... whereas those files are used to define the grammar of a language.

@fingolfin Are you still working on a heuristic rule?

No, I haven't looked at linguist since my last comment. There are multiple reasons for that:

  • While I know GAP, I don't know any of the other languages, so in order to develop a meaningful heuristic, I'd have to dig into them. And even then, the best I can hope for is a crude hack.
  • If instead linguist used text classifiers that allow for classifying text as "unknown", this problem would not exist, or at least would be much less severe. Working on manual disambiguation hacks for linguist seems to be an endless whack-a-mole game, for which I don't have the nerve.
  • As I explained above, several of these misdetection issues are (or at least: used to be... haven't verified that recently) already fixed by the addition of G-Code as a language. But since github does not rerun the detection for repositories that are not changed, e.g. https://github.com/reprappro/Mendel is still reported as GAP code, even though linguist at one point correctly handled it. Clearly, no effort I'd ever spend on linguist would fix that problem.

Hi, I'm working on a heuristic rule for GDScript.
I've tested with some GDScript and GAP repos. I think it's ok with the rules.

I don't know GAP and G-code to write a rule for GAP :/

Also, I added a color to GDScript but tests are failing in the color thresholds...

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

It is still relevant because the issues described here are issues. However, I have not checked whether the underlying design flaw (of not allowing negative detection) is still present in linguist.

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

See above. Do we now have to tick this "keep alive" check box every month? What is the idea behind this "stale bot"? Are old bugs somehow fermenting into features perhaps? ;-)

See above. Do we now have to tick this "keep alive" check box every month?

Until someone provides a solution, yes. Lots of peeps have said they're looking into heuristic improvements - even you at one point 😉 - yet the PRs aren't coming. This is something that really needs to be addressed by people that know the languages involved.

What is the idea behind this "stale bot"?

To nudge people to double-check that the issues they've raised are still valid and contribute to fixing them if they are (it's almost working on this issue too 😉). An issue sitting looking pretty isn't going to get resolved unless it gets attention 😄

See above. Do we now have to tick this "keep alive" check box every month?

Until someone provides a solution, yes. Lots of peeps have said they're looking into heuristic improvements - even you at one point 😉 - yet the PRs aren't coming. This is something that really needs to be addressed by people that know the languages involved.

But to do that, we'd need the ability to classify content as "spam" -- i.e., to say "this is definitely not the right language". This is something that really needs to be addressed by people that know linguist in and out.

What is the idea behind this "stale bot"?

To nudge people to double-check that the issues they've raised are still valid and contribute to fixing them if they are (it's almost working on this issue too 😉). An issue sitting looking pretty isn't going to get resolved unless it gets attention 😄

I don't see how it is working on this issue. It is annoying, sure, but I've been waiting to be able to work on this issue for years, but AFAIK (but I might be wrong, I didn't check recently) the fundamental flaws in the architecture of linguist remain, so I can't really work on it.

Anyway, I guess then we are better off closing it.

Lots of peeps have said they're looking into heuristic improvements

Wait a second, why should heuristics be involved? This should be addressable from a configuration level:

  • [ ] Maximum times @stale can flag a thread as stale after somebody's responded
  • [ ] Whether or not those with write-access can influence the above setting
  • [ ] Prevent bot from flagging issues labelled as ongoing or something

I think @lildude referred to language recognition heuristics, not the settings for the stale bot.

Anyway, to me, the stale bot is an unfriendly passive-aggressive way of saying "we are not interested in issue reports, at least for non-trivial issues, only for PRs". I'd find it more honest if the issues were simply closed with such a comment, instead of badgering people who tried to contribute by taking time to carefully document and describe an issue. I am not interested in contributing to a project that works this way. No config tuning will help with that.

But to do that, we'd need the ability to classify content as "spam" -- i.e., to say "this is definitely _not_ the right language".

That is the complete opposite of the purpose of Linguist... Linguist is designed to tell you what it thinks the language is based on the information it has. In order for Linguist to get this right, it needs more data, whether that's heuristics or more samples for the classifier. Only the repo owner can tell Linguist what it is _not_, on a repo-by-repo basis, by telling it what it is using an override. Linguist, however, doesn't learn from this, and really shouldn't either as one person's preference is not necessarily correct for all, and that's why we need PRs for Linguist itself.

Wait a second, why should heuristics be involved? This should be addressable from a configuration level:

None of those actually address _this_ issue, just the stalebot behaviour 😉. Adding one of these labels would certainly stop stalebot.

We could certainly add the "Help wanted" label and keep this open until someone offers the necessary updates, but I fear this won't happen given it's been over four years already.

@fingolfin I hear you and understand your sentiments. I implemented stalebot to keep the development of Linguist active and "fresh" and encourage ongoing help from those knowledgable and affected. We rely heavily on the community as it's impractical for us to know each and every supported language in enough depth to resolve all the issues raised, especially as most of these are around language identification/matching, as is this issue, and often they're resolved over time thanks to other PRs. I'd love to dig in myself, but I know absolutely diddly about any of the languages involved here.

Yeah, I agree. Bots are annoying, but they give routine contributors fewer headaches from running around poking old issues with a stick to see which ones care, and which don't. This sort of thing becomes easier to manage when responsive users are ones who reply, and the ones who clearly aren't motivated enough to keep an issue open will let it grow stale. =)

Either way, thank you for your patience, and apologies for any frustration caused.
