Linguist: Incorrect line count on files

Created on 22 Jun 2019  Â·  9Comments  Â·  Source: github/linguist

I noticed through GitHub's file view that the Linguist report on files generates an incorrect line count. With the exception of files containing incomplete lines (no newline marker at the last line) the displayed number is always +1.

This can be easily verified locally with any file or directly at GitHub. Linguist license, e.g., has 22 lines on the content view yet is reported as having 23 lines at the top.

Preliminary Steps

Please confirm you have...

Problem Description

This problem happens with any file that has POSIX lines, which per definition are _"A sequence of zero or more non-\ characters plus a terminating \ character. [3.206]"_. It seems to come from the way lines are split in BlobHelper:

https://github.com/github/linguist/blob/001ca526694a32a5ac4c977a79032ced0b8c0aca/lib/linguist/blob_helper.rb#L274

I imagine that -1 was given as a second argument to split to prevent truncation of empty lines at the end of the file. However this is also what keeps the last null field after the last newline character stored in the lines array — then silently considered as a full line in loc:

https://github.com/github/linguist/blob/001ca526694a32a5ac4c977a79032ced0b8c0aca/lib/linguist/blob_helper.rb#L330-L337

The solution seems to be simple enough if we added a chomp to remove the last newline (without side effects on incomplete files or files with empty lines at their end):

data.chomp.split(encoded_newlines_re, -1)

I'm not sure if this will play well with the different encodings though, so I decided to avoid a direct PR before hearing specialists. If not, a solution using regex (similar to encoded_newlines_re) and the \z anchor could be used, maybe? Are there other important facts which I'm missing here?

Most helpful comment

Reopening as Linguist isn't responsible for this on the GitHub.com side of things.

I've finally sorted out the GitHub side of things and you can see it in all it's glory now - the line count at the top now matches the content 🎉

All 9 comments

@lildude Does the line count actually comes from Linguist?

I was wondering that as well. Nevertheless, it wouldn't remove the fact that Linguist itself is reporting the incorrect count.

@lildude Does the line count actually comes from Linguist?

Sorry about the delay. Yes, the line count comes from Linguist.

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

I'm not sure if this will play well with the different encodings though, so I decided to avoid a direct PR before hearing specialists. If not, a solution using regex (similar to encoded_newlines_re) and the \z anchor could be used, maybe? Are there other important facts which I'm missing here?

@tessarin I think you should go ahead and open a pull request with the first proposed solution.

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

Reopening as Linguist isn't responsible for this on the GitHub.com side of things.

Reopening as Linguist isn't responsible for this on the GitHub.com side of things.

I've finally sorted out the GitHub side of things and you can see it in all it's glory now - the line count at the top now matches the content 🎉

Thank you very much @lildude!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

lucasrodes picture lucasrodes  Â·  6Comments

pfitzseb picture pfitzseb  Â·  5Comments

BnSalahFahmi picture BnSalahFahmi  Â·  3Comments

RafaelPAndrade picture RafaelPAndrade  Â·  4Comments

siscia picture siscia  Â·  6Comments