I just had a look at https://github.com/trending?l=gap and noticed that the top projects there are not using GAP at all, but rather are misdetected. They all seem to be using something called "G-code", see http://reprap.org/wiki/G-code -- format for RepRap 3D printers (it seems to essen
Some concrete examples (randomly picked):
The official extension for such files seems to be ".gco", but clearly many people use ".g".
This is a bit annoying. What can be done to fix this? One solution would be to add RepRap G-code to the list of "languages" detected by linguist. Note that this is not a general programming language, but rather a set of commands controlling the RepRap "3D printer". Still, it is some kind of machine code.
Should one add a language entry for this, then? What would be the name of the "language", "G-code"? Or is there another way to deal with such misdetections?
I don't see any other solution than the one you proposed.
Since this language seems widespread on GitHub, it could be integrated into Linguist.
G-code seems to be the official name.
These files look like data files (type could maybe be set to data).
They all seem to be generated from some kind of graphical program.
Since #1529 and #1539 have been merged, I hope this will be resolved with the next linguist release (once it's been rolled out to GitHub, that is). We'll see... :)
@fingolfin does this look better now?
@bkeepers What exactly? Was there any recent change that might affect this? As far as I can tell, https://github.com/trending?l=gap is exactly as wrong as it was before.
Aha, but my G-code PR was merged. Alas, it seems GitHub does not recompute language stats unless a repository is in use (and even then it seems to sometimes fail -- I reported this already in the past to github). So, e.g. https://github.com/reprappro/Mendel still is listed as containing a lot of GAP code (in fact: 100% when it really contains 0%...). But if I run the latest linguist version (from git) manually on it, I get
96.94% OpenSCAD
1.56% Prolog
1.50% Shell
The Prolog and Shell files are mis-detections, but otherwise this is much better. I dunno if it recognized the .g files (is there a way to make it print detected "data" files, too?), but at least no more fake GAP files.
Ah well some more https://github.com/trending?l=gap entries:
But on the "plus side", two GAP projects made it into the list of top scilab projects, yay: https://github.com/trending?l=scilab lists https://github.com/gap-system/SingularInterface and https://github.com/gap-system/anupq
But those won't be fixed without #1541 or #1523 or some other specific action...
Ok, looks like we still have some work to do.
Aha, but my G-code PR was merged. Alas, it seems GitHub does not recompute language stats unless a repository is in use (and even then it seems to sometimes fail -- I reported this already in the past to github)
GitHub currently only runs analysis on changed files when you push. @arfon and I worked on some changes that we should be rolling out in the next week or so to re-analyze all files when we update linguist.
@bkeepers Awesome, you guys rock :-)
For the https://github.com/BLLIP/bllip-parser repository, several files are misdetected as GAP, even though they are G-Code, which is in theory "known" to linguist. Alas:
~/Projekte/foreign/linguist (git:master+)$ LINGUIST_DEBUG=2 bundle exec linguist bllip-parser/first-stage/DATA/EN/l.g
bllip-parser/first-stage/DATA/EN/l.g: 0 lines (0 sloc)
type: Text
mime type: text/plain
# G-code GAP
- 11 40.767 -
A 45 - 289.200
e 11 133.413 -
G-code = -633.134 + -5.547 = -638.681
GAP = -518.114 + -4.988 = -523.101
language: GAP
blob is too large to be shown
So GAP clearly "wins" because these data files happen to contain the letters "A" and "e" as "words"... Ugh. G-Code files consist almost entirely of numbers, but it seems numbers are not treated as "keywords". So the classifier fails utterly in this case. Not sure what to do about that...
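To illustrate why a few stray letters can decide the outcome, here is a toy sketch of naive-Bayes-style scoring in plain Ruby. It is not Linguist's actual classifier: the tokenizer, the token probabilities, the priors, and the smoothing value are all made-up assumptions, chosen only to show how dropping numeric tokens leaves a handful of word tokens to settle the result.

```ruby
# Toy illustration only; every number below is a hypothetical assumption.

# Assumption: numeric tokens are discarded and only word-like tokens are kept.
def tokenize_words(text)
  text.split.reject { |t| t.match?(/\A[-+]?\d+(\.\d+)?\z/) }
end

# Score = sum of log P(token | language) + log P(language).
# Tokens the language has never seen get a small smoothing probability.
def score(tokens, token_probs, prior, unseen = 1e-7)
  tokens.sum { |t| Math.log(token_probs.fetch(t, unseen)) } + Math.log(prior)
end

# Hypothetical training statistics for the two candidate languages.
GCODE_PROBS = { "e" => 0.001 }.freeze
GAP_PROBS   = { "A" => 0.02 }.freeze
GCODE_PRIOR = 0.004
GAP_PRIOR   = 0.007

sample = "A 45 289.200\nA 12 17.5\ne 11 133.413\n"
tokens = tokenize_words(sample)   # only "A", "A", "e" survive

gcode_score = score(tokens, GCODE_PROBS, GCODE_PRIOR)
gap_score   = score(tokens, GAP_PROBS, GAP_PRIOR)

winner = gap_score > gcode_score ? "GAP" : "G-code"
puts "G-code = #{gcode_score.round(3)}, GAP = #{gap_score.round(3)}, language: #{winner}"
```

With these made-up numbers, GAP gets the higher (less negative) score purely because "A" happens to be among its known tokens, even though nearly all of the file is numeric.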
@fingolfin Maybe a new heuristic rule could help...?
:flags: flagging this as stale.
First off, as I just learned, those files in BLLIP are not G-Code at all. That in turn means that the files lm.g and rm.g should be removed from the G-code sample set (let me know if you need me to file a pull request for that).
But of course even with that change, those files in BLLIP (and any other .g files that are neither GAP nor G-Code) will be misdetected. I still think this is a design flaw in linguist (see issue #1571)...
Anyway, I'll have a look to see if I can figure out how these "heuristics" work and whether I can get them to help... (I guess they are yet another attempt to workaround that fundamental issue I mentioned... _sigh_)
So... actually... how does one add a disambiguation heuristic rule for two languages that can yield the result "neither of these languages matches"?
let me know if you need me to file a pull request for that
They can probably be removed in #1027.
I guess they are yet another attempt to workaround that fundamental issue I mentioned...
No, they would still be here even if we had a "no a code" option ;)
So... actually... how does one add a disambiguation heuristic rule for two languages that can yield the result "neither of these languages matches"?
Just return nothing in the case where none of the languages match. For instance:
disambiguate "Hack", "PHP" do |data|
  if data.include?("<?hh")
    Language["Hack"]
  elsif /<\?[^h]/.match(data)  # "?" is escaped so it matches a literal "<?"
    Language["PHP"]
  end
  # Falls through to nil when neither branch matches
end
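Applied to the .g case, the same "return nothing" idea might look like the sketch below. It is written as a plain Ruby method rather than with the Linguist DSL so that it runs on its own. The method name and both patterns are hypothetical and only roughly characterize G-code and GAP; a real rule would need to be tuned by someone who knows both languages.

```ruby
# Hypothetical patterns, chosen for illustration; not taken from Linguist.
GCODE_LINE = /^\s*[GM]\d+(\s|$)/   # e.g. "G1 X10 Y20" or "M104 S200"
GAP_HINT   = /:=|\bend;/           # e.g. "x := 1;" or a closing "end;"

# Returns "G-code", "GAP", or nil when neither pattern matches.
def guess_g_language(data)
  if GCODE_LINE.match?(data)
    "G-code"
  elsif GAP_HINT.match?(data)
    "GAP"
  end
  # An if/elsif without an else evaluates to nil, so no explicit return is needed.
end

puts guess_g_language("G1 X10 Y20\nM104 S200\n").inspect
puts guess_g_language("f := function(x) return x; end;\n").inspect
puts guess_g_language("grammar Expr;\nexpr : term ;\n").inspect
```

The third call stands in for an ANTLR grammar: neither pattern matches, so the method returns nil instead of forcing a choice between the two known languages.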
:+1: This is (mostly) misidentified as GAP, though it is GDScript.... https://github.com/KOBUGE-Games/ringed
Also, I suspect projects that rely on ANTLR also have *.g files identified as GAP... whereas those files are used to define the grammar of a language.
@fingolfin Are you still working on a heuristic rule?
No, I haven't looked at linguist since my last comment. There are multiple reasons for that:
Hi, I'm working on a heuristic rule for GDScript.
I've tested with some GDScript and GAP repos. I think it's ok with the rules.
I don't know GAP and G-code to write a rule for GAP :/
Also, I added a color to GDScript but tests are failing in the color thresholds...
This affects a lot of ANTLR code: https://github.com/search?l=GAP&q=extension%3Ag+antlr&type=Code
This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.
It is still relevant because the issues described here are issues. However, I have not checked whether the underlying design flaw (of not allowing negative detection) is still present in linguist.
This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.
See above. Do we now have to tick this "keep alive" check box every month? What is the idea behind this "stale bot"? Are old bugs somehow fermenting into features perhaps? ;-)
See above. Do we now have to tick this "keep alive" check box every month?
Until someone provides a solution, yes. Lots of peeps have said they're looking into heuristic improvements - even you at one point 😉 - yet the PRs aren't coming. This is something that really needs to be addressed by people that know the languages involved.
What is the idea behind this "stale bot"?
To nudge people to double-check that the issues they've raised are still valid and contribute to fixing them if they are (it's almost working on this issue too 😉). An issue sitting looking pretty isn't going to get resolved unless it gets attention 😄
See above. Do we now have to tick this "keep alive" check box every month?
Until someone provides a solution, yes. Lots of peeps have said they're looking into heuristic improvements - even you at one point 😉 - yet the PRs aren't coming. This is something that really needs to be addressed by people that know the languages involved.
But to do that, we'd need the ability to classify content as "spam" -- i.e., to say "this is definitely not the right language". This is something that really needs to be addressed by people that know linguist in and out.
What is the idea behind this "stale bot"?
To nudge people to double-check that the issues they've raised are still valid and contribute to fixing them if they are (it's almost working on this issue too 😉). An issue sitting looking pretty isn't going to get resolved unless it gets attention 😄
I don't see how it is working on this issue. It is annoying, sure, but I have been waiting for years to be able to work on this issue. AFAIK (I might be wrong, I didn't check recently) the fundamental flaws in the architecture of linguist remain, so I can't really work on it.
Anyway, I guess then we are better off closing it.
Lots of peeps have said they're looking into heuristic improvements
Wait a second, why should heuristics be involved? This should be addressable from a configuration level:
I think @lildude referred to language recognition heuristics, not the settings for the stale bot.
Anyway, to me, the stale bot is an unfriendly passive-aggressive way of saying "we are not interested in issue reports, at least for non-trivial issues, only for PRs". I'd find it more honest if the issues were simply closed with such a comment, instead of badgering people who tried to contribute by taking time to carefully document and describe an issue. I am not interested in contributing to a project that works this way. No config tuning will help with that.
But to do that, we'd need the ability to classify content as "spam" -- i.e., to say "this is definitely _not_ the right language".
That is the complete opposite of the purpose of Linguist... Linguist is designed to tell you what it thinks the language is based on the information it has. In order for Linguist to get this right, it needs more data, whether that's heuristics or more samples for the classifier. Only the repo owner can tell Linguist what it is _not_, on a repo-by-repo basis, by telling it what it is using an override. Linguist, however, doesn't learn from this, and really shouldn't either as one person's preference is not necessarily correct for all, and that's why we need PRs for Linguist itself.
Wait a second, why should heuristics be involved? This should be addressable from a configuration level:
None of those actually address _this_ issue, just the stalebot behaviour 😉. Adding one of these labels would certainly stop stalebot.
We could certainly add the "Help wanted" label and keep this open until someone offers the necessary updates, but I fear this won't happen given it's been over four years already.
@fingolfin I hear you and understand your sentiments. I implemented stalebot to keep the development of Linguist active and "fresh" and encourage ongoing help from those knowledgeable and affected. We rely heavily on the community as it's impractical for us to know each and every supported language in enough depth to resolve all the issues raised, especially as most of these are around language identification/matching, as is this issue, and often they're resolved over time thanks to other PRs. I'd love to dig in myself, but I know absolutely diddly about any of the languages involved here.
Yeah, I agree. Bots are annoying, but they give routine contributors fewer headaches from running around poking old issues with a stick to see which ones care, and which don't. This sort of thing becomes easier to manage when responsive users are ones who reply, and the ones who clearly aren't motivated enough to keep an issue open will let it grow stale. =)
Either way, thank you for your patience, and apologies for any frustration caused.