Linguist: Incorrect line count on files

Created on 22 Jun 2019 · 9Comments · Source: github/linguist

I noticed through GitHub's file view that the Linguist report on files generates an incorrect line count. With the exception of files containing incomplete lines (no newline marker at the last line) the displayed number is always +1.

This can be easily verified locally with any file or directly at GitHub. Linguist license, e.g., has 22 lines on the content view yet is reported as having 23 lines at the top.

Preliminary Steps

Please confirm you have...

[x] reviewed How Linguist Works,
[x] reviewed the Troubleshooting docs,
[x] considered implementing an override,
[x] verified an issue has not already been logged for your issue (linguist issues).

Problem Description

This problem happens with any file that has POSIX lines, which per definition are _"A sequence of zero or more non-\ characters plus a terminating \ character. [3.206]"_. It seems to come from the way lines are split in BlobHelper:

https://github.com/github/linguist/blob/001ca526694a32a5ac4c977a79032ced0b8c0aca/lib/linguist/blob_helper.rb#L274

I imagine that -1 was given as a second argument to split to prevent truncation of empty lines at the end of the file. However this is also what keeps the last null field after the last newline character stored in the lines array — then silently considered as a full line in loc:

https://github.com/github/linguist/blob/001ca526694a32a5ac4c977a79032ced0b8c0aca/lib/linguist/blob_helper.rb#L330-L337

The solution seems to be simple enough if we added a chomp to remove the last newline (without side effects on incomplete files or files with empty lines at their end):

data.chomp.split(encoded_newlines_re, -1)

I'm not sure if this will play well with the different encodings though, so I decided to avoid a direct PR before hearing specialists. If not, a solution using regex (similar to encoded_newlines_re) and the \z anchor could be used, maybe? Are there other important facts which I'm missing here?

Source

tessarin

Most helpful comment

Reopening as Linguist isn't responsible for this on the GitHub.com side of things.

I've finally sorted out the GitHub side of things and you can see it in all it's glory now - the line count at the top now matches the content 🎉

lildude on 20 Nov 2019

🎉2

All 9 comments

@lildude Does the line count actually comes from Linguist?

pchaigno on 15 Jul 2019

I was wondering that as well. Nevertheless, it wouldn't remove the fact that Linguist itself is reporting the incorrect count.

tessarin on 15 Jul 2019

👍1

@lildude Does the line count actually comes from Linguist?

Sorry about the delay. Yes, the line count comes from Linguist.

lildude on 28 Jul 2019

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale[bot] on 27 Aug 2019

I'm not sure if this will play well with the different encodings though, so I decided to avoid a direct PR before hearing specialists. If not, a solution using regex (similar to encoded_newlines_re) and the \z anchor could be used, maybe? Are there other important facts which I'm missing here?

@tessarin I think you should go ahead and open a pull request with the first proposed solution.

pchaigno on 27 Aug 2019

👍1

stale[bot] on 26 Sep 2019

Reopening as Linguist isn't responsible for this on the GitHub.com side of things.

lildude on 3 Oct 2019

Reopening as Linguist isn't responsible for this on the GitHub.com side of things.

I've finally sorted out the GitHub side of things and you can see it in all it's glory now - the line count at the top now matches the content 🎉

lildude on 20 Nov 2019

🎉2

Thank you very much @lildude!

tessarin on 21 Nov 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Language not showing up in this repos

lucasrodes · 6Comments

Update Julia syntax

pfitzseb · 5Comments

Github repo shows wrong language

BnSalahFahmi · 3Comments

16-bit assembly misclassified as ActionScript

RafaelPAndrade · 4Comments

Object-C code detected while absolute zero files are present, also, no files are detected.

siscia · 6Comments