Linguist: What problem does showing the stats solve?

Created on 27 Aug 2020 · 5 Comments · Source: github/linguist

This is a broad question that would be great to document. I'm a maintainer of an open-source project, and the stats I see reported for the project seem disconnected from reality. I'm looking into providing a more accurate presentation of the reality to the developers. However, I'm lacking a critical piece of information. What's the convention? What is commonly accepted as the source of a project?

Should we take the tests into account?
Should we take into account the demos, which show how the code can be composed to solve bigger problems?
Etc.

Maybe I'm simply going after: what problem does showing these stats solve? With the answer to the original problem these stats were meant to solve, I think I could work my way backward.

Thanks!

oliviertassinari

All 5 comments

what problem does showing these stats solve?

None. People like knowing what technologies were involved in the creation of a project. Myself included.

Alhadis on 2 Sep 2020

@oliviertassinari I'm looking into providing a more accurate presentation of the reality to the developers.

Depending on what your end goal is and how you model the problem, you'll probably need to work on better representations of code size (e.g. sloccount, cloc) and on categorizing code (production code, tests, examples, build system, etc.). If you're looking at language evolution, you'll probably want to model diffs to capture the significance of a change.

Linguist solves only a subproblem: mainly language identification, plus some convenience features to detect vendored and generated files. You'll first need a better problem statement, and then you'll probably need a bigger system where Linguist may be just a small piece.
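
To make the "categorizing code" idea concrete, here is a minimal Python sketch. It is not part of Linguist, sloccount, or cloc: the directory-to-category rules, the `categorize` and `summarize` names, and the sample paths are all hypothetical and would need adjusting per project. It totals bytes per category and file extension, since byte size is a simple stand-in for code size.

```python
from __future__ import annotations

from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical rules: directory name -> category. Adjust for your project.
CATEGORY_DIRS = {
    "test": "tests",
    "tests": "tests",
    "__tests__": "tests",
    "examples": "examples",
    "demo": "examples",
    "demos": "examples",
    "docs": "docs",
    "scripts": "build",
}


def categorize(path: str) -> str:
    """Return the category of the first matching directory, else 'production'."""
    for part in PurePosixPath(path).parts[:-1]:
        if part in CATEGORY_DIRS:
            return CATEGORY_DIRS[part]
    return "production"


def summarize(sizes: dict[str, int]) -> dict[str, dict[str, int]]:
    """Total bytes per category, split by file extension."""
    totals: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for path, size in sizes.items():
        ext = PurePosixPath(path).suffix or "(none)"
        totals[categorize(path)][ext] += size
    return {cat: dict(by_ext) for cat, by_ext in totals.items()}


if __name__ == "__main__":
    # Made-up repository contents: path -> size in bytes.
    sample = {
        "src/button.ts": 1200,
        "src/__tests__/button.test.ts": 800,
        "examples/basic/index.html": 300,
        "scripts/release.js": 150,
    }
    for category, by_ext in summarize(sample).items():
        print(category, by_ext)
```

A breakdown like this answers "how much of the repo is tests or demos?" separately from "which languages are used?", which is the part Linguist covers.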

smola on 6 Sep 2020

what's your end goal

My end goal is to deeply understand the problem GitHub language stats are solving for the developers, and why they were introduced by GitHub. With this information, I can make sure the project I'm working on is correctly configured, that is, configured to match GitHub's objectives for sharing language stats.

Simply put, I'm aiming to minimize surprises for developers browsing the project's page on GitHub.

oliviertassinari on 6 Sep 2020

My end goal is to deeply understand the problem GitHub language stats are solving for the developers, and why they were introduced by GitHub.

Woah! Good luck! 😆 You're asking a lot for some of the oldest code on GitHub.

This functionality was initially implemented in the main GitHub repo waaaay back in 2008 by @defunkt, long before pull requests, so any information about the whys not written in commit messages is lost to time.

The initial purpose of what is now Linguist was language detection, and this is still reflected in the purpose detailed in the README and even in the very first commit in this repo (part of extracting the code out of the main GitHub repo). The stats were added later as part of adding language popularity and ranking information to GitHub, which I can only guess was added to "drive engagement" and for informational purposes. After all, GitHub's original intent was to be a form of social network... the logo even had "social coding" in it for many years... so things like this would have been seen as "cool" nice-to-have features. But in order to show a pretty graph or rankings, you needed to gather stats to produce them.

So to me, the implementation of the stats themselves is not trying to solve a problem. Nothing is broken or fixed by having the stats. To me they're merely adding to the experience by providing a quick-look representation of the code breakdown within a repository.

With this information, I can make sure the project I'm working on is correctly configured, that is, configured to match GitHub's objectives for sharing language stats.

I think you're trying too hard to find a reason/purpose and are looking at this the wrong way: you appear to be trying to use the stats to match GitHub's objectives. Instead I think you should be looking at this from your own perspective: what do _you_ want to convey with the repo stats in your repo?

With the overrides, you can convey that:

- you think they're a true indicator of your skills, so you want them to be 100% reflective of every single line of code or data committed to the repo, regardless of the opinionated decisions made by Linguist,
- you don't care about the stats at all and are happy to let the magic happen without intervention,
- you think they're rubbish, so you disable them entirely,

... or any combination of the above. It's entirely up to you. This is why we have overrides.
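
As a rough illustration of those stances, here is a sketch of a `.gitattributes` file using Linguist's documented override attributes (`linguist-vendored`, `linguist-generated`, `linguist-documentation`, `linguist-detectable`). The paths are hypothetical, and the options are alternatives to pick from, not a file to copy whole. Check the Linguist README for the exact behavior before relying on it.

```gitattributes
# Option 1: count as much as possible, including paths and file types
# that Linguist would skip by default (paths below are hypothetical).
docs/**    -linguist-documentation
vendor/**  -linguist-vendored
dist/**    -linguist-generated
*.json     linguist-detectable

# Option 2: do nothing. With no overrides, Linguist's defaults apply.

# Option 3: exclude every file from the stats. The expectation here is
# that no language breakdown is shown. Uncomment to use.
# * -linguist-detectable
```

Later lines take precedence over earlier ones for the same attribute, so a catch-all pattern such as `*` belongs at the top if more specific rules should still win.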

So ultimately, I think it comes down to asking _yourself_ one question:

What do _you_ want the stats to reflect for _your_ repositories?

Once you know that answer, you can then decide how _you_ want to use and represent the stats.