Linguist: Linguist is reporting my project as a Jupyter Notebook

Created on 3 Nov 2016  ·  18 Comments  ·  Source: github/linguist

As you can see, I have some notebooks, but this is mostly a Python project.

https://github.com/ICTatRTI/researchnet

Did I do something wrong?

Jupyter notebooks have an inflated number of lines of code, since they store a lot of metadata. So it doesn't take many notebooks to "take over" a project.

Does anybody actually write these files out by hand? Because it sounds like they're generated primarily from a webapp:

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

And if that's the case, well, I'd say these generated files should be marked as exactly that: generated.

/cc @pchaigno /resident Python-guy

Whatever action is taken, it would be best to maintain the searchability and identifiability of notebook-only repositories.

Possibly the best course of action is to convert the reported lines of code into an "equivalent lines of code" measure that takes the unavoidable boilerplate into account. For instance, a source line consisting of the single character π may turn into this monstrosity in the .ipynb file:

  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "π = 3.1415926535897..."
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "π"
   ]
  },

That thing is about as long as the value's floating-point component itself.
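
To put a rough number on that boilerplate, here is a minimal Python sketch that serializes the cell above and compares its line count with the number of source lines. The one-space indent is an assumption meant to mirror the listing above, and the cell dict is simply that example rebuilt by hand.

```python
import json

# The cell from the example above, rebuilt as a Python dict.
cell = {
    "cell_type": "code",
    "execution_count": 1,
    "metadata": {"collapsed": False},
    "outputs": [
        {
            "data": {"text/plain": ["π = 3.1415926535897..."]},
            "execution_count": 1,
            "metadata": {},
            "output_type": "execute_result",
        }
    ],
    "source": ["π"],
}

# Assumption: one-space indentation, as in the listing above.
serialized = json.dumps(cell, indent=1, ensure_ascii=False)
serialized_lines = len(serialized.splitlines())
source_lines = len(cell["source"])

print(f"{source_lines} source line(s) -> {serialized_lines} serialized lines")
```

An "equivalent lines of code" measure would report something close to the first number rather than the second.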

All we'd need to mark these things as generated is to match against a common pattern that's consistently used in webapp-created notebooks. Usually it's something like Generated by AppName 1.1.1.1.1.1-betasemverasfuck0 or what-have-you.

You could maybe match against

 "metadata": {
   // [ stuff in here varies ]
 },
 "nbformat": 4,
 "nbformat_minor": 1

but wouldn't this make notebook-only repositories classify incorrectly?
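
The match suggested above could be made less ambiguous by parsing the file as JSON and checking its top-level keys instead of pattern-matching on raw lines. The Python sketch below only illustrates that idea: Linguist itself is written in Ruby, and looks_like_notebook is a hypothetical helper, not part of Linguist.

```python
import json


def looks_like_notebook(text: str) -> bool:
    """Return True if text parses as JSON with the top-level notebook keys.

    Hypothetical helper: it checks for the "nbformat", "nbformat_minor"
    and "metadata" keys shown above. It is not Linguist's own logic.
    """
    try:
        doc = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    if not isinstance(doc, dict):
        return False
    return (
        isinstance(doc.get("nbformat"), int)
        and isinstance(doc.get("nbformat_minor"), int)
        and isinstance(doc.get("metadata"), dict)
    )


empty_notebook = '{"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 1}'
print(looks_like_notebook(empty_notebook))   # a minimal notebook
print(looks_like_notebook('{"name": "x"}'))  # ordinary JSON
print(looks_like_notebook("print('hi')"))    # not JSON at all
```

A structural check like this identifies the file type only; it says nothing about whether the file should count towards a repository's statistics.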

Marking them as generated simply omits them from the language-statistics bar. We already have a number of generated-file detection routines that filter files that would otherwise unfairly skew a repository's stats. Here's the logic for detecting generated PostScript, for example. You can imagine how many projects would be incorrectly classified as PostScript if we left every .eps file unchecked.

And while that snippet you've posted might work, it should ideally be 100% unambiguous; that is, it should leave no room for misidentification. The existing rules which test against single-line patterns are all very specific:

Et cetera.

The difference between PostScript and Jupyter, though, is that all Jupyter notebooks are "generated" (either by the web app or by IPython's CLI). And unlike PostScript, human effort generally needs to go into every cell of a Jupyter notebook; it's just that each cell ends up taking a lot of lines of code.

Here are some empty, newly-created notebooks with Julia and Python kernels.

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Julia 0.5.0",
   "language": "julia",
   "name": "julia-0.5"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "0.5.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}

and

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}

For what it's worth, I personally think a good solution would be to estimate how many lines of a Jupyter notebook are "source" and how many are "generated". Source lines (which are all written by a human) generally look like this:

   "source": [
    "import Base: +\n",
    "\n",
    "+{T<:Number}(x::DualNumber{T}, y::DualNumber{T}) = DualNumber{T}(x.re + y.re, x.ep + y.ep)\n",
    "\n",
    "DualNumber(10.0, 17.0) + DualNumber(5.0, 9.0)"
   ]

Can linguist already handle partial file identifications?
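
The estimate described above could be sketched in Python roughly as follows. This assumes the nbformat 4 layout shown in the empty notebooks (a top-level "cells" list, with the kernel language under metadata.language_info.name). The function name source_line_stats and the sample notebook are hypothetical, and nothing here reflects how Linguist actually counts.

```python
import json


def source_line_stats(notebook_text: str) -> dict:
    """Count hand-written source lines versus total lines on disk.

    Assumes the nbformat 4 layout shown above: a top-level "cells" list
    whose entries carry a "source" field (a list of lines or one string).
    """
    doc = json.loads(notebook_text)
    language = doc.get("metadata", {}).get("language_info", {}).get("name", "unknown")
    source_lines = 0
    for cell in doc.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", [])
        if isinstance(source, str):
            source = source.splitlines()
        source_lines += len(source)
    return {
        "language": language,
        "source_lines": source_lines,
        "total_lines": len(notebook_text.splitlines()),
    }


# Hypothetical one-cell notebook, serialized with one-space indentation.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": [
                "import Base: +\n",
                "\n",
                "DualNumber(10.0, 17.0) + DualNumber(5.0, 9.0)",
            ],
        }
    ],
    "metadata": {"language_info": {"name": "julia"}},
    "nbformat": 4,
    "nbformat_minor": 1,
}
stats = source_line_stats(json.dumps(notebook, indent=1))
print(stats)
```

Attributing only source_lines to the kernel language would credit the code the author wrote without counting metadata or outputs.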

I am having the same problem: about 3-4 of the 144 files in my repo are .ipynb files, and the rest are mainly Java and Scala. An option to make Linguist report based on the number of files rather than their size would be helpful.

For now, I added *.ipynb linguist-vendored to my .gitattributes file in my repository.
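
The workaround above is a single line in .gitattributes. As a small Python sketch, the hypothetical helper add_override below appends that line to a file's contents only if it is not already present. The helper name and the sample "*.pdf binary" entry are illustrative assumptions; the rule itself is the one quoted in the comment.

```python
def add_override(gitattributes_text: str, rule: str = "*.ipynb linguist-vendored") -> str:
    """Return the .gitattributes content with `rule` appended if missing."""
    lines = gitattributes_text.splitlines()
    if rule in (line.strip() for line in lines):
        return gitattributes_text  # already present, nothing to do
    lines.append(rule)
    return "\n".join(lines) + "\n"


# Starting from an empty file and from a file with an unrelated entry.
print(add_override(""), end="")
print(add_override("*.pdf binary\n"), end="")
```

Writing the returned text back to .gitattributes at the repository root and committing it is what makes the override take effect.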

:wave: It looks like the original repo is no longer classified as an ipython notebook and I don't see a .gitattributes file in the repo. Can someone clarify if this is still an issue?

As @Caged mentioned, things appear to be working now on the original repo. As there hasn't been an update since 3 May, I'm closing this on the basis this has been resolved.

@lildude, @Caged I can confirm that things are not working regarding Jupyter notebooks. It's still the same issue as before: A Jupyter notebook consists of Python code that the author wrote, and of generated code that makes it an interactive environment that can be displayed in a web browser. The generated code usually makes up a lot more lines than the Python code that the author wrote.

The first problem here is that for the purpose of what linguist is trying to achieve (i.e. a breakdown of the programming languages the author used in the repo) "Jupyter notebook" should not be considered a language at all. For all intents and purposes it's just a container that holds Python code.

The second problem is that simply excluding Jupyter notebooks from the statistics also ignores all of the actually relevant Python code inside them.

Thanks for confirming this and for the explanation @pierluigiferrari. Now I have a better understanding having looked into it, and given your two points, I don't think this is something that can easily, if ever, be addressed automatically.

The biggest limiting factor that I can see is the fact that Jupyter notebooks combine written and generated language within the same file. Linguist doesn't support partial file classification and isn't likely to ever do so as I'd imagine this would be incredibly resource intensive and probably highly unreliable when it comes to even attempting to differentiate between human and computer written code within the same file. Our current classifier is already hugely inefficient as it is.

The next limiting factor is preference. Some want Jupyter notebooks recognised for what they are, others prefer them to be identified by the language they're written in, and others still don't want the files counted at all.

I think our current implementation (implemented in https://github.com/github/linguist/pull/2746 via https://github.com/github/linguist/pull/2763) combined with manual overrides is probably the best compromise for all.

Jupyter notebooks are also far too prevalent on GitHub to change the default behaviour without major backlash.

Linguist doesn't support partial file classification and isn't likely to ever do so as I'd imagine this would be incredibly resource intensive and probably highly unreliable when it comes to even attempting to differentiate between human and computer written code within the same file.

... which is where an idea of mine may hold the answer. ;) I regurgitated sleep-deprived explanations which, through weighting averages assigned to specific scopes, could yield a more rational Python Notebook usage. E.g., the number of lines the programmer actually did pen of their own hand.

@lildude I understand. As you said, it seems like the best solution for Jupyter notebook users is to use manual override. Thanks for clarifying why it is the way it is and why it will likely remain this way!

What does the manual override for language statistics on GitHub mean? Is it the .gitattributes file?
In my opinion, it would be fairer if only the source lines of .ipynb files were counted, not the metadata and generated outputs.

@Borda Please see the last paragraph of how Linguist works and Linguist overrides.
