Linguist: YML files detected as HTML

Created on 13 Oct 2020 · 9 Comments · Source: github/linguist

Problem Description


There seems to be a regression in the 7.11.x release.
I'm testing this project https://github.com/gmontard/issue-sync-redmine-github and the yml files in it seem to cause the detected language to be HTML instead of Ruby.

I've tested with this script

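require 'rugged'
require 'linguist'

# Print the Linguist version under test
puts Linguist::VERSION

# Analyse the local clone at its current HEAD commit
repo = Rugged::Repository.new('/tmp/issue-sync-redmine-github/')
project = Linguist::Repository.new(repo, repo.head.target_id)
# Primary language, then the per-language byte counts
puts project.language
puts project.languages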

7.11.x gives me this output

7.11.0
HTML
{"Ruby"=>9254, "HTML"=>20124}

If I remove the provisioning folder from the repo

7.11.0
Ruby
{"Ruby"=>9254}

7.10.x gives me this output

7.10.0
Ruby
{"Ruby"=>9254}

URL of the affected repository:

https://github.com/gmontard/issue-sync-redmine-github

Last modified on:

2020-10-13

Expected language:

Ruby

Detected language:

HTML

All 9 comments

You're looking at an enhancement, not a regression, and it has nothing to do with yaml 😁

$ github-linguist . --breakdown
68.50%  HTML
31.50%  Ruby

Ruby:
Gemfile
Gemfile.lock
Rakefile
app.rb
config.ru
config/environments.rb
config/mapping.rb
db/migrate/20141020110537_issues.rb
db/schema.rb
models/github_issue.rb
models/issue.rb
models/redmine_issue.rb

HTML:
provisioning/roles/postgresql/templates/etc_monit_conf.d_postgresql.j2
provisioning/roles/postgresql/templates/pg_hba.conf.j2
provisioning/roles/postgresql/templates/postgresql.conf.j2

$

Support for the .j2 files was added in https://github.com/github/linguist/pull/4963 which first shipped in v7.11.0.

As you can see, YAML isn't included in the stats, and nor should it be as it's considered data by default:

https://github.com/github/linguist/blob/3d39c1c1a0d34c1af890ecfd18cf9ed595c68132/lib/linguist/languages.yml#L6308-L6315
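
As a rough illustration of that rule, here is a minimal Ruby sketch of type-based filtering. The `EXAMPLE_TYPES` table, the `COUNTED_TYPES` list and the `counted_languages` helper are hypothetical stand-ins written for this example, not Linguist's actual API; the sketch assumes that only programming and markup languages count towards the stats by default, while data languages such as YAML are skipped.

```ruby
# Hypothetical, simplified stand-in for a few languages.yml entries.
EXAMPLE_TYPES = {
  "Ruby" => :programming,
  "HTML" => :markup,
  "YAML" => :data
}.freeze

# Assumption for this sketch: only these types contribute to the stats.
COUNTED_TYPES = [:programming, :markup].freeze

# Keep only the languages whose type is counted.
def counted_languages(sizes)
  sizes.select { |name, _bytes| COUNTED_TYPES.include?(EXAMPLE_TYPES[name]) }
end

# Ruby and HTML byte counts come from the output above;
# the YAML figure is an arbitrary placeholder, not a measured value.
sizes = { "Ruby" => 9254, "HTML" => 20124, "YAML" => 5000 }
p counted_languages(sizes)
```

However large the YAML figure is, it never reaches the result, which is consistent with removing the yml files leaving the stats unchanged.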

Your two outputs also show no difference between the before and after stats, confirming the yaml file isn't involved:

7.11.x gives me this output

7.11.0
HTML
{"Ruby"=>9254, "HTML"=>20124}

If I remove the yml file from the repo

7.11.0
HTML
{"Ruby"=>9254, "HTML"=>20124}

Sorry! I updated the description. TL;DR HTML is not showing up

馃 are you sure? It should do. If I perform the same thing I see no change, which I'd expect as as I said, yaml files don't count:

$ git rm **/*.yml
rm 'config/database.yml'
rm 'config/newrelic.yml'
rm 'provisioning/playbook.yml'
rm 'provisioning/roles/postgresql/defaults/main.yml'
rm 'provisioning/roles/postgresql/handlers/main.yml'
rm 'provisioning/roles/postgresql/meta/main.yml'
rm 'provisioning/roles/postgresql/tasks/configure.yml'
rm 'provisioning/roles/postgresql/tasks/databases.yml'
rm 'provisioning/roles/postgresql/tasks/extensions.yml'
rm 'provisioning/roles/postgresql/tasks/extensions/contrib.yml'
rm 'provisioning/roles/postgresql/tasks/extensions/dev_headers.yml'
rm 'provisioning/roles/postgresql/tasks/extensions/postgis.yml'
rm 'provisioning/roles/postgresql/tasks/install.yml'
rm 'provisioning/roles/postgresql/tasks/main.yml'
rm 'provisioning/roles/postgresql/tasks/monit.yml'
rm 'provisioning/roles/postgresql/tasks/users.yml'
rm 'provisioning/roles/postgresql/test.yml'
$ git commit -m 'Remove all yaml files'
[master 9272e06] Remove all yaml files
 17 files changed, 1136 deletions(-)
 delete mode 100755 config/database.yml
 delete mode 100644 config/newrelic.yml
 delete mode 100644 provisioning/playbook.yml
 delete mode 100755 provisioning/roles/postgresql/defaults/main.yml
 delete mode 100755 provisioning/roles/postgresql/handlers/main.yml
 delete mode 100755 provisioning/roles/postgresql/meta/main.yml
 delete mode 100755 provisioning/roles/postgresql/tasks/configure.yml
 delete mode 100755 provisioning/roles/postgresql/tasks/databases.yml
 delete mode 100755 provisioning/roles/postgresql/tasks/extensions.yml
 delete mode 100755 provisioning/roles/postgresql/tasks/extensions/contrib.yml
 delete mode 100755 provisioning/roles/postgresql/tasks/extensions/dev_headers.yml
 delete mode 100755 provisioning/roles/postgresql/tasks/extensions/postgis.yml
 delete mode 100755 provisioning/roles/postgresql/tasks/install.yml
 delete mode 100755 provisioning/roles/postgresql/tasks/main.yml
 delete mode 100755 provisioning/roles/postgresql/tasks/monit.yml
 delete mode 100755 provisioning/roles/postgresql/tasks/users.yml
 delete mode 100755 provisioning/roles/postgresql/test.yml
$ github-linguist . --breakdown
68.50%  HTML
31.50%  Ruby

Ruby:
Gemfile
Gemfile.lock
Rakefile
app.rb
config.ru
config/environments.rb
config/mapping.rb
db/migrate/20141020110537_issues.rb
db/schema.rb
models/github_issue.rb
models/issue.rb
models/redmine_issue.rb

HTML:
provisioning/roles/postgresql/templates/etc_monit_conf.d_postgresql.j2
provisioning/roles/postgresql/templates/pg_hba.conf.j2
provisioning/roles/postgresql/templates/postgresql.conf.j2

$

The only way I can see this making a difference is if you removed the entire provisioning/roles/postgresql/ or provisioning/roles/postgresql/templates/ directories or the .j2 files within provisioning/roles/postgresql/templates/.

Did you maybe do one of those?

Support for the .j2 files was added in #4963 which first shipped in v7.11.0.

Hum... ok I see...
I removed the provisioning folder thinking that it would be because of the yml files but apparently that was because of the .j2 files. Sorry for the bad description and bug report! :(

No probs. Glad to help.

@lildude Isn't it a bug that those j2 files are detected as HTML and why the scoring of those files make the scoring so high that while I'm expecting Ruby I'm getting HTML as a first language here?

Isn't it a bug that those j2 files are detected as HTML

No. Those files are part of the "HTML+Django" language which is part of the HTML group of languages, hence you see HTML:

https://github.com/github/linguist/blob/b3664e4f242c842e356bd011090f85514d147948/lib/linguist/languages.yml#L2060-L2068
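
To make the grouping concrete, here is a small Ruby sketch of rolling byte counts up to a parent language. The `EXAMPLE_GROUPS` table and the `roll_up_by_group` helper are hypothetical and exist only for this illustration; they are not how Linguist implements groups internally. The only fact taken from the thread is that "HTML+Django" belongs to the HTML group.

```ruby
# Hypothetical, simplified group table: a language without an entry is its own group.
EXAMPLE_GROUPS = { "HTML+Django" => "HTML" }.freeze

# Sum byte counts per group, so grouped languages are reported under the parent name.
def roll_up_by_group(sizes)
  sizes.each_with_object(Hash.new(0)) do |(name, bytes), totals|
    totals[EXAMPLE_GROUPS.fetch(name, name)] += bytes
  end
end

# The 20124 bytes are the .j2 templates, reported under HTML in the thread.
p roll_up_by_group({ "Ruby" => 9254, "HTML+Django" => 20124 })
```

Under this assumption the templates never appear under their own name in the summary; their bytes are attributed to HTML.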

why the scoring of those files make the scoring so high that while I'm expecting Ruby I'm getting HTML as a first language here?

Because of their cumulative size. From the README:

The percentages are calculated based on the bytes of code for each language as reported by the List Languages API.

The combined size of those three files far exceeds all the Ruby added up:

  19.65 KB  68.50%  HTML
   9.04 KB  31.50%  Ruby

Ruby:
   1.96 KB  models/redmine_issue.rb
   1.49 KB  models/issue.rb
   1.33 KB  Gemfile.lock
   1.18 KB  app.rb
   996.0 B  db/schema.rb
   668.0 B  config/environments.rb
   363.0 B  config/mapping.rb
   307.0 B  models/github_issue.rb
   302.0 B  Gemfile
   179.0 B  Rakefile
   176.0 B  db/migrate/20141020110537_issues.rb
   168.0 B  config.ru

HTML:
  17.95 KB  provisioning/roles/postgresql/templates/postgresql.conf.j2
   1.38 KB  provisioning/roles/postgresql/templates/pg_hba.conf.j2
   336.0 B  provisioning/roles/postgresql/templates/etc_monit_conf.d_postgresql.j2

... with provisioning/roles/postgresql/templates/postgresql.conf.j2 being the clear outlier with the most impact on the stats: it alone accounts for 17.95 KB of the 19.65 KB total.
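
The arithmetic behind those percentages can be checked with a short Ruby sketch. The `language_percentages` helper is a hypothetical name used only here; it assumes each percentage is a language's bytes divided by the total bytes across all counted languages, which matches the README wording quoted above.

```ruby
# Convert per-language byte counts into percentages of the total,
# rounded to two decimal places.
def language_percentages(sizes)
  total = sizes.values.sum.to_f
  sizes.transform_values { |bytes| (bytes / total * 100).round(2) }
end

# Byte counts taken from the project.languages output earlier in the thread.
p language_percentages({ "Ruby" => 9254, "HTML" => 20124 })
```

With 9254 bytes of Ruby and 20124 bytes of HTML, the total is 29378 bytes, which gives the 31.50% and 68.50% split shown by --breakdown.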

Thanks for the explanation @lildude
I'm closing the issue then :)

No. Those files are part of the "HTML+Django"

If it reports a wrong language, it looks like a bug to me.
I'm facing the same problem on 2 repositories:

Maybe .j2 files should be considered as what they are: "jinja templates"
