Linguist: Completely wrong VimL percentage (includes repo link on GitHub which reproduces this)

Created on 18 Nov 2017 · 5 Comments · Source: github/linguist

Hi, I have a project found here which has

  • less than 150 lines of VimL

  • more than 2k lines of Python

GitHub's / linguist's concluding statistics: Vim script 41.9% Python 32.1%

I'm a clumsy person when it comes to adding up numbers, but even to me that seems grossly incorrect. Maybe this would be worth looking into? Something seems to go very wrong with those numbers, or I'm fundamentally misunderstanding how these statistics are supposed to be read.

All 5 comments

First, to quote the README:

The Language stats bar displays languages percentages for the files in the repository. The percentages are calculated based on the bytes of code for each language as reported by the List Languages API.

So the number of lines isn't really relevant, but I understand your point.

The total is also calculated only _after_ files that are considered vendored or generated have been excluded.

With this in mind, I've taken a look at your repo and only the following files count towards the stats:

Python:

  • misc/docs-transform.py

Vim script:

  • misc/highlighting/vim/ftdetect/tetherasm.vim

  • misc/highlighting/vim/ftdetect/tethercode.vim

  • misc/highlighting/vim/syntax/tetherasm.vim

  • misc/highlighting/vim/syntax/tethercode.vim

Makefile:

  • Makefile

If we look at this in terms of (kilo) bytes of code we get:

Vim script 4.22 KB
Python 3.23 KB
Makefile 2.62 KB

... which is where the current percentage breakdown comes from.
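As a quick check of that arithmetic, here is a minimal sketch that turns the three sizes quoted above into percentages. The function name `language_percentages` is made up for illustration; this is not Linguist's own code, only the same bytes-over-total calculation the README describes.

```python
# Sizes per language as quoted in this comment (in KB).
sizes_kb = {"Vim script": 4.22, "Python": 3.23, "Makefile": 2.62}


def language_percentages(sizes):
    """Return each language's share of the total size, in percent (one decimal)."""
    total = sum(sizes.values())
    return {lang: round(100 * size / total, 1) for lang, size in sizes.items()}


percentages = language_percentages(sizes_kb)
for lang, pct in percentages.items():
    print(f"{lang}: {pct}%")
```

With these inputs the Vim script and Python shares come out at 41.9% and 32.1%, matching the stats bar quoted in the original report.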

Your next question is probably: "But the search results for Python show four Python files. Why aren't they all being included in the stats?"

That's where the vendored code check comes in. Linguist treats anything under ^tools/ as vendored code (i.e. someone else's code) and excludes it from the stats thanks to: https://github.com/github/linguist/blob/7be6fb013864b59e7c9ba76e3c0062dd95cedc76/lib/linguist/vendor.yml#L21-L27

Yes, the comment says for C, but it applies to all languages.
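To illustrate how a path pattern like this removes files from the stats regardless of language, here is a rough sketch. Only the `^tools/` pattern comes from this thread; the real vendor.yml contains many more entries, the two `tools/...` file names below are hypothetical, and `is_vendored` is an illustrative helper, not Linguist's API.

```python
import re

# Only ^tools/ is taken from the discussion; the real list is much longer.
VENDORED_PATTERNS = [re.compile(r"^tools/")]


def is_vendored(path):
    """True if the repository-relative path matches any vendored pattern."""
    return any(pattern.search(path) for pattern in VENDORED_PATTERNS)


# (path, language) pairs. The tools/ entries are hypothetical placeholders;
# the others are files named in this thread.
files = [
    ("tools/hypothetical_a.py", "Python"),
    ("tools/hypothetical_b.py", "Python"),
    ("misc/docs-transform.py", "Python"),
    ("misc/highlighting/vim/syntax/tetherasm.vim", "Vim script"),
    ("Makefile", "Makefile"),
]

# Files that would still count towards the language stats.
counted = [(path, lang) for path, lang in files if not is_vendored(path)]
for path, lang in counted:
    print(f"{lang}: {path}")
```

Because the pattern is anchored with `^`, only a top-level tools/ directory matches; a nested path such as misc/tools/ would not be excluded by this particular entry.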

It's also worth pointing out that the search results are completely independent of Linguist, hence you see more files there than are counted in the stats.

So how do we solve this? With a manual override in a .gitattributes file (using the linguist-vendored attribute) to say everything under tools/ should _not_ be considered vendored within your repo.

This is probably more than you were expecting, but I'm in a descriptive mood this morning 😉

Linguist considers anything under ^tools/ to be considered vendored code
…
Yes, the comment says for C, but it applies to all languages.

Just weighing in on this: I find this to be a mistake. I've been bitten by this more than once in the past, and I personally feel the rule should target C dependencies specifically, or not at all.

The link to Joyent's old repository is quite puzzling. If Node.js is the main reason for vendoring tools and deps, then I'd say they're the ones who should be using a manual override, not other users.

Yup, tools/ is part of the project; the repo has no externally written code right now. So the exclusion doesn't really make much sense, and that does indeed appear to be the reason it's off by so much.

Linguist's current project management has taken the unfortunate stance (my personal opinion) of offering neither an easy way to include stats for relatively unknown languages nor any way to mark files as an "Unknown" language so that they are represented in some meaningful way. Given that, the stats are extremely misleading/wrong beyond this VimL weirdness anyway, and I don't have much incentive to work around further breakage with more linguist-specific special code. I just figured the stats were wrong beyond the unknown-language part, hence this bug report.

@lildude - I agree with @Alhadis and @JonasT and think we should remove tools/ from the vendored entries.

I agree. https://github.com/github/linguist/pull/3919 opened to remove it.
