Linguist: Incorrect C++ identification

Created on 26 Jun 2013  ·  12 Comments  ·  Source: github/linguist

Linguist incorrectly identifies some files in one of my repositories, k-os, as containing 1.7% C++. This is most likely due to incorrect header file (".h") identification.

All 12 comments

This is happening with my repository, github.com/88Alex/lithium-os, as well.

Happening in some of my repos too, aliceos-kernel and nasu2 specifically.

Ditto for a few headers in mCtrl.

Because detection is based on a Bayesian analysis of file contents, there will always be a (hopefully small) share of misdetections, especially for languages that share a filename extension (.h) and have similar contents (for example, C being more or less a subset of C++).

Perhaps detection would improve if the Bayesian classifier also gave some weight to rules based on the broader project context. For example:

  • If the project contains a file foo.c, then foo.h is more likely C than C++.
  • If the project contains a file bar.cc, then bar.h is more likely C++ than C.
  • When a project has many *.c files and no *.C, *.cc, *.cpp, or *.cxx files, all *.h files are more likely plain C, not C++.
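
The three rules above could be sketched roughly as follows. This is only an illustration in Python, not Linguist's actual code (Linguist itself is written in Ruby): the function name `guess_header_language` and the extension sets are assumptions made for the example, the rules are applied as hard decisions rather than as classifier weights, and "many *.c files" is simplified to "at least one".

```python
from pathlib import PurePosixPath

# Extension sets taken from the rules above.
C_EXTS = {".c"}
CPP_EXTS = {".C", ".cc", ".cpp", ".cxx"}


def guess_header_language(header, project_files):
    """Return "C", "C++", or None (no hint) for a .h file.

    Hypothetical helper: `project_files` is a list of repo-relative paths.
    """
    stem = PurePosixPath(header).stem

    # Rules 1 and 2: look for a source file with the same base name
    # anywhere in the project (foo.c next to foo.h, bar.cc next to bar.h).
    sibling_exts = {
        PurePosixPath(f).suffix
        for f in project_files
        if f != header and PurePosixPath(f).stem == stem
    }
    has_c = bool(sibling_exts & C_EXTS)
    has_cpp = bool(sibling_exts & CPP_EXTS)
    if has_c and not has_cpp:
        return "C"
    if has_cpp and not has_c:
        return "C++"

    # Rule 3: the project has C sources and no C++ sources at all.
    all_exts = {PurePosixPath(f).suffix for f in project_files}
    if all_exts & C_EXTS and not all_exts & CPP_EXTS:
        return "C"

    return None  # leave the decision to the content-based classifier
```

A real implementation would more likely fold such a hint into the classifier as a prior than let it override the content-based result outright.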

+1.

On 8 August 2013 08:36, Martin Mitáš wrote:

Because detection is based on a Bayesian analysis of file contents,
there will always be a (hopefully small) share of misdetections,
especially for languages that share a filename extension (.h) and have
similar contents (for example, C being more or less a subset of C++).

Perhaps detection would improve if the Bayesian classifier also gave
some weight to rules based on the broader project context.
For example:

  • If the project contains a file foo.c, then foo.h is more likely C
    than C++.
  • If the project contains a file bar.cc, then bar.h is more likely C++
    than C.
  • When a project has many *.c files and no *.C, *.cc, *.cpp, or *.cxx
    files, all *.h files are more likely plain C, not C++.

Regards,

Alexander Kitaev

This is happening with my repository too: https://github.com/gagarin79/charon. It contains only C++ code (*.hh and *.cc files), but the statistics show C 85.8% and C++ 14.2%.

Now I'm getting headers detected as Objective-C! D:

This issue is very annoying. Even though my project is entirely C, it reports 15% C++ and 5% Objective-C.

I was having a similar issue with my pure C project. Linguist was marking some headers as C++ instead of C. It appeared to happen to headers without any function declarations or C-style code. Here's a snippet of a header detected as C++:

#ifndef WORLDBOX_CLIENT_H
#define WORLDBOX_CLIENT_H

#include <worldbox/worldbox.h>
#include <worldbox/core.h>
#include <worldbox/graphics.h>

#endif /* WORLDBOX_CLIENT_H */

I'm unfamiliar with how Linguist works or whether this is already implemented, but here is an idea that could resolve these inaccuracies: in addition to the existing hack that checks for similarly named C files, scan the file contents for namespace declarations, templates, classes, etc., and assume that their absence means the header is C.
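
A minimal sketch of that content scan, written in Python for illustration only (it is not how Linguist does it). The names `CPP_ONLY`, `strip_comments`, and `looks_like_cpp` are hypothetical, and the keyword list is a small assumed sample rather than a complete set of C++-only constructs. The comment stripping is deliberately naive and does not handle string literals.

```python
import re

# A few constructs that begin a line in C++ but are not valid C.
# Assumed, incomplete list: namespaces, templates, classes, using-directives.
CPP_ONLY = re.compile(
    r"^\s*(?:namespace\s+\w+|template\s*<|class\s+\w+|using\s+namespace\b)",
    re.MULTILINE,
)

# Block comments and line comments (naive: ignores string literals).
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def strip_comments(source):
    """Remove C and C++ style comments so keywords inside them are ignored."""
    return COMMENT.sub("", source)


def looks_like_cpp(source):
    """Return True if the header text contains a C++-only construct."""
    return CPP_ONLY.search(strip_comments(source)) is not None


if __name__ == "__main__":
    # The include-only header quoted above contains none of these constructs.
    header = (
        "#ifndef WORLDBOX_CLIENT_H\n"
        "#define WORLDBOX_CLIENT_H\n"
        "#include <worldbox/worldbox.h>\n"
        "#endif /* WORLDBOX_CLIENT_H */\n"
    )
    print(looks_like_cpp(header))
```

Under the proposed assumption, a header for which `looks_like_cpp` returns False would be treated as C. That assumption would misfire on C++ headers that, like the example above, contain nothing but includes.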

And to preemptively address issues where C++ headers are being marked as C, perhaps headers included from C++ files can be assumed to be C++ headers.

This gets to a whole other issue with include paths and such, but I think it would be reasonable to assume headers are in the root directory of the repository, or in a folder named include or something similar.
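
The include-based idea, with the simplified search path described above, might look something like this. It is a rough Python sketch under stated assumptions: `headers_included_from_cpp` and `SEARCH_ROOTS` are hypothetical names, the repository is represented as a plain dict of path to file text, and only direct includes are followed (headers included by other headers are not traced).

```python
import re
from pathlib import PurePosixPath

# Matches #include "name" and #include <name>.
INCLUDE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)

CPP_SOURCE_EXTS = {".C", ".cc", ".cpp", ".cxx"}

# Assumed include roots: the repository root and a folder named "include".
SEARCH_ROOTS = ("", "include")


def headers_included_from_cpp(files):
    """Return repo headers that are directly included by a C++ source file.

    `files` maps repo-relative paths to file contents.
    """
    found = set()
    for path, text in files.items():
        source = PurePosixPath(path)
        if source.suffix not in CPP_SOURCE_EXTS:
            continue
        for name in INCLUDE.findall(text):
            # Try the including file's own directory first, then the roots.
            candidates = [source.parent / name]
            candidates += [PurePosixPath(root) / name for root in SEARCH_ROOTS]
            for candidate in candidates:
                if str(candidate) in files:
                    found.add(str(candidate))
                    break
    return found
```

System headers such as `<vector>` simply fail to resolve and are skipped, since they are not part of the repository.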

Yep, thanks. We're currently working on, and testing, various heuristics that go beyond the ordinary statistical classifier, specifically to address C++ and C issues.

Thanks for the report. This is a common issue. We're tracking progress for a fix in #1626.
