Linguist: Classify "testdata" as vendored (Go)

Created on 13 Feb 2019  Â·  17Comments  Â·  Source: github/linguist

Preliminary Steps

Please confirm you have...

Problem Description

In Go (Golang) any folder named "testdata" is considered to be ancillary data, a.k.a. fixtures. I noticed that you already list some fixture patterns in your vendor.yml so I figured I might as well ask if it would make sense to add this folder name as a known pattern? I am working on a (private) repository that is currently listed as 98% HTML, which is simply because it performs scraping and to test that scraping, we use the raw HTML (which can be a ton for certain pages) from the pages we scrape.

We added a .gitattributes with the line testdata/* linguist-vendored but that didn't work for some reason (neither locally nor on GitHub), but in theory, it should work and is an obvious alternative solution.

URL of the affected repository:

Private

Last modified on:

?

Expected language:

I would expect the repository to be identified as mostly Go code since that's essentially what it is if you ignore the testdata.

Detected language:

Lots of HTML.

Stale

All 17 comments

We added a .gitattributes with the line testdata/* linguist-vendored but that didn't work for some reason (neither locally nor on GitHub)

It should do as can be seen in the dummy repo I created at https://github.com/lildude/expert-system to illustrate the point, however there's one caveat: GitHub will not re-analyse your repo and adjust the language stats when adding or modifying the .gitattributes file alone. You need to modify one of the "counted" files in order for the repo to be re-analysed.

I figured I might as well ask if it would make sense to add this folder name as a known pattern?

I don't see why not as Go itself ignores the directory:

$ go help packages | tail -2
Directory and file names that begin with "." or "_" are ignored
by the go tool, as are directories named "testdata".
$

Feel free to submit a PR.

Alright looks like it should work. I would assume that the local one should work no matter if any other files changes, but maybe I'm missing something else. I'll be pushing some other changes to GitHub tomorrow so will see then if the language stats updates go through. I'll make sure to submit a PR tomorrow as well :)

GitHub will not re-analyse your repo and adjust the language stats when adding or modifying the .gitattributes file alone.

@lildude - This kinda looks like a bug. Linguist has code to clear the cache if any .gitattributes file is touched, so unless some closed-sourced code supersedes this, modifying .gitattributes should be enough to update the statistics.

Note that if there is indeed that kind of closed-sourced code for performance reasons, we could improve things a bit on our side by only recomputing statistics for files affected by the new .gitattributes rules.

This kinda looks like a bug. Linguist has code to clear the cache if any .gitattributes file is touched, so unless some closed-sourced code supersedes this, modifying .gitattributes should be enough to update the statistics.

Hmm. I've just tested this again and I now realise I was thinking of the scenario where syntax highlighting isn't updated on a file influenced by a language change in the .gitattributes until that file itself is changed as the syntax highlighting is cached based on the blob.

I've played with this a bit more this morning, and I think this is a chicken & egg scenario that only happens when adding the .gitattributes via the UI. I have found that the .gitattributes file needs to exist in the repo first in order for all subsequent edits of it or any other file via the UI to trigger its content to take effect.

If I add or modify the file from the command line and push, it takes effect immediately.

So I don't think this is a bug but rather more of a chicken & egg scenario that only seems to affect adding the file via the UI. I'll discuss it with my colleagues when they come online later.

I would assume that the local one should work no matter if any other files changes, but maybe I'm missing something else.

It does, but you need to commit the file in order for it to be picked up by linguist:

$ git init foo
Initialized empty Git repository in /Users/lildude/Downloads/trash/foo/.git/
$ cd foo
$ mkdir testdata                                      
$ echo "<html></html>" > testdata/foo.html            
$ echo "package foo" > foo.go                         
$ git add .                                           
$ git commit -m 'Foo'                                 
[master (root-commit) de9cf52] Foo
 2 files changed, 2 insertions(+)
 create mode 100644 foo.go
 create mode 100644 testdata/foo.html
$ github-linguist .                                   
53.85%  HTML
46.15%  Go
$ echo "testdata/* linguist-vendored" > .gitattributes
$ github-linguist .                                   
53.85%  HTML
46.15%  Go
$ git add .gitattributes                              
$ git commit -m 'attribs'                             
[master 5673d14] attribs
 1 file changed, 1 insertion(+)
 create mode 100644 .gitattributes
$ github-linguist .                                   
100.00% Go
$

@lildude
Sorry I forgot to mention that testdata should be ignored no matter where in the repo it sits (e.g. there can be plenty those kind of folders sprinkled throughout a repo).

fgblomqvist:~$ git init foo
Initialized empty Git repository in /home/fgblomqvist/foo/.git/
fgblomqvist:~$ cd foo
fgblomqvist:~/foo$ mkdir bar
fgblomqvist:~/foo$ cd bar
fgblomqvist:~/foo/bar$ mkdir testdata
fgblomqvist:~/foo/bar$ echo "<html></html>" > testdata/foo.html
fgblomqvist:~/foo/bar$ echo "package bar" > bar.go
fgblomqvist:~/foo/bar$ cd ..
fgblomqvist:~/foo$ echo "package foo" > foo.go
fgblomqvist:~/foo$ git add .
fgblomqvist:~/foo$ git commit -m 'Foo'
[master (root-commit) a260874] Foo
 3 files changed, 3 insertions(+)
 create mode 100644 bar/bar.go
 create mode 100644 bar/testdata/foo.html
 create mode 100644 foo.go
fgblomqvist:~/foo$ github-linguist
63.16%  Go
36.84%  HTML
fgblomqvist:~/foo$ echo "testdata/* linguist-vendored" > .gitattributes
fgblomqvist:~/foo$ github-linguist 
63.16%  Go
36.84%  HTML
fgblomqvist:~/foo$ git add .gitattributes 
fgblomqvist:~/foo$ git commit -m "attribs"
[master 178861e] attribs
 1 file changed, 1 insertion(+)
 create mode 100644 .gitattributes
fgblomqvist:~/foo$ github-linguist 
63.16%  Go
36.84%  HTML

The pattern looks correct to me (e.g. it works in gitignore) so maybe it's a bug in relation to the pattern parser?

(Gonna hold off on a PR until I/we figure this one out)

Yeah, .gitattributes is a little different to .gitignore when it comes the path/pattern matching for some reason.

This should do the trick:

**/testdata/** linguist-vendored

@lildude Slightly off-topic, but is there any chance we could support both of the following forms?

~gitattributes
foo linguist-vendored
foo linguist-vendored=true
~

As well as:

~gitattributes
bar linguist-generated
bar linguist-generated=true
~

I can never remember which attributes take the =next assignment, and which don't. It's very ambiguous and an easy mistake to make.

Slightly off-topic, but is there any chance we could support both of the following forms?

We already do. Both forms work right now.

https://github.com/github/linguist/blob/e4560984058b4726010ca4b8f03ed9d0f8f464db/lib/linguist/lazy_blob.rb#L103-L106

... is where the magic happens and is used thusly...

https://github.com/github/linguist/blob/e4560984058b4726010ca4b8f03ed9d0f8f464db/lib/linguist/lazy_blob.rb#L48-L62

... and confirmed with testing...

$ find . -name foo.html
./testdata/foo.html
./duff/fud/mud/testdata/foo.html
./bar/testdata/foo.html
$ /bin/cat .gitattributes
**/testdata/** linguist-vendored=true
$ github-linguist .      
100.00% Go
$

Really? Well, I'll be damned. Had no idea. :|

We should probably update the readme so the override examples use consistent syntax. Otherwise, readers might read too far into it like I did... 😅

Yeah, .gitattributes is a little different to .gitignore when it comes the path/pattern matching for some reason.

This should do the trick:

**/testdata/** linguist-vendored

Ah that does it. The README confused me a bit since it says "Add a .gitattributes file to your project and use standard git-style path matchers for the files you want to override using..." in the first sentence here: https://github.com/github/linguist/blob/master/README.md#using-gitattributes

Might want to clarify there either what type of pattern matching is used (if it sticks to a standard), where one can find some documentation on how to write such patterns, or at the very least, clarify that it uses similar but not the same patterns as git.

@Alhadis But if we use consistent syntax, they might think only linguist-xxx=true works! ^^

Might want to clarify there either what type of pattern matching is used

@fgblomqvist To quote gitattributes(5):

The rules by which the pattern matches paths are the same as in .gitignore files (see gitignore(5)), with a few exceptions:

• negative patterns are forbidden

• patterns that match a directory do not recursively match paths inside that directory (so using the trailing-slash path/ syntax is pointless in an attributes file; use path/** instead)

The second point is what's causing this grievance.

But if we use consistent syntax, they might think only linguist-xxx=true works! ^^

We may as well drop the =true part from the examples altogether, possibly with a footnote explaining that =true is valid syntax, but redundant.

@Alhadis aah didn't realize .gitattributes was an official git file :man_facepalming:
Imo adding a link to that very page (in the section I mentioned) could be useful to prevent future misunderstanding, but totally understand if that's just too much idiot-proofing.

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

BnSalahFahmi picture BnSalahFahmi  Â·  3Comments

Sanchez3 picture Sanchez3  Â·  4Comments

arfon picture arfon  Â·  6Comments

GabLeRoux picture GabLeRoux  Â·  6Comments

henrywright picture henrywright  Â·  6Comments