Linguist: Blob size for Jupyter notebooks is misleading

Created on 27 Feb 2017 · 9 Comments · Source: github/linguist

Jupyter notebooks are a hybrid of markup and code, usually with a fairly small set of input lines and a large autogenerated set of outputs (boilerplate markup, images, etc.).

Whenever I include a Jupyter notebook in a repository, it tends to dominate the code breakdown even though it might contain a very small number of lines of code. One example is here - in this case, there are only 65 lines of code (although it lists it as 144 sloc), but the file size is 30 KB due to image and autogenerated markup data.

After looking through this repository, I realize that for this particular library I can move my notebook to a doc or example folder or explicitly ignore it in a dotfile. I still wanted to open an issue, because I think this behavior is generally misleading and goes against the purpose of Linguist. Jupyter notebooks are tricky because sometimes it really does make sense to count parts of them towards the repository composition, but it almost never makes sense to count all of them. Linguist's current treatment of them obscures rather than reveals useful details about the repository.

It would be very easy to write custom logic to compute their _actual_ size -- I think JSON.parse(notebook_file)['cells'].map{|c| c['source'] }.join.bytesize would work -- but I'm not sure if this would be easy to incorporate into linguist.
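A minimal sketch of that one-liner as a standalone function, assuming the nbformat 4 layout in which each cell's "source" is either a string or an array of strings. The names notebook_source_bytesize and SAMPLE_NOTEBOOK are hypothetical and are not part of Linguist.

```ruby
require 'json'

# Sum the bytes of every cell's "source" field, ignoring outputs and metadata.
# Assumes nbformat 4, where "source" is a string or an array of strings.
def notebook_source_bytesize(notebook_json)
  cells = JSON.parse(notebook_json).fetch('cells', [])
  cells.sum { |cell| Array(cell['source']).join.bytesize }
end

# A tiny hand-written notebook, used only for illustration.
SAMPLE_NOTEBOOK = JSON.generate({
  'cells' => [
    { 'cell_type' => 'markdown', 'source' => ["# Demo\n"] },
    { 'cell_type' => 'code',
      'source' => ["import math\n", 'print(math.pi)'],
      'outputs' => [{ 'output_type' => 'stream', 'text' => ['3.14159' * 200] }] }
  ],
  'nbformat' => 4
})

puts "raw size:    #{SAMPLE_NOTEBOOK.bytesize} bytes"
puts "source size: #{notebook_source_bytesize(SAMPLE_NOTEBOOK)} bytes"
```

Even in this toy notebook the stored output is far larger than the source, which is the same imbalance described above.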

asross

👍3

Most helpful comment

Yeah. I will just note that over 125,000 GitHub repositories are now listed as Jupyter notebooks, which is significantly more than many of the languages listed as "popular" in the advanced search dropdown (such as Haskell, ActionScript, Clojure, CoffeeScript, Lua, etc.).

Given the simplicity of the overrides required for BlobHelper#lines and Blob#size, would it make sense to revisit this @lildude and @pchaigno? If you're concerned about performance, you could always skip the JSON.parse call for large files, and given the structure of notebook JSON it _might_ be possible to implement BlobHelper#lines using a regex, which would make it just as efficient as the current implementation...
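To make the shape of such overrides concrete, here is a rough sketch using a hypothetical stand-in class. NotebookBlob, MAX_PARSE_BYTES, and the 1 MB cutoff are all assumptions for illustration; this is not Linguist's Blob or BlobHelper, and it uses JSON.parse rather than the regex idea. It counts only code cells and falls back to the raw data when the file is too large or is not valid notebook JSON.

```ruby
require 'json'

# Hypothetical stand-in for a blob; NOT Linguist's actual Blob class.
class NotebookBlob
  # Assumed cutoff above which parsing is skipped; the value is arbitrary.
  MAX_PARSE_BYTES = 1_000_000

  attr_reader :data

  def initialize(data)
    @data = data
  end

  # Lines of code-cell source, or raw lines if parsing is skipped or fails.
  def lines
    (source_text || data).lines.map(&:chomp)
  end

  # Byte size of code-cell source, or the raw size as a fallback.
  def size
    (source_text || data).bytesize
  end

  private

  # Memoized so the JSON is parsed at most once per blob.
  def source_text
    return @source_text if defined?(@source_text)

    @source_text = parse_source
  end

  def parse_source
    return nil if data.bytesize > MAX_PARSE_BYTES

    parsed = JSON.parse(data)
    return nil unless parsed.is_a?(Hash) && parsed['cells'].is_a?(Array)

    code_cells = parsed['cells'].select { |cell| cell['cell_type'] == 'code' }
    code_cells.map { |cell| Array(cell['source']).join }.join("\n")
  rescue JSON::ParserError
    nil
  end
end
```

The size cutoff bounds the parsing cost per file, which is the performance concern raised earlier in the thread; whether that cost is acceptable at GitHub's scale is a separate question this sketch does not answer.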

asross on 16 Apr 2019

👍3

All 9 comments

Maybe this goes without saying, but a Jupyter notebook should be counted not as a language in itself, but as the language its code is actually written in.
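As a sketch of how that language could be looked up, assuming an nbformat 4 notebook that records it under metadata.language_info.name (with metadata.kernelspec.language as a fallback). The function name notebook_language is hypothetical, and notebooks without this metadata simply yield nil.

```ruby
require 'json'

# Return the language named in the notebook metadata, or nil if none is recorded.
# Assumes nbformat 4 metadata keys; older or hand-edited notebooks may lack them.
def notebook_language(notebook_json)
  parsed = JSON.parse(notebook_json)
  metadata = parsed.is_a?(Hash) ? parsed.fetch('metadata', {}) : {}
  metadata.dig('language_info', 'name') || metadata.dig('kernelspec', 'language')
end

sample = JSON.generate({
  'cells' => [],
  'metadata' => { 'language_info' => { 'name' => 'python' } }
})
puts notebook_language(sample).inspect
```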

odanoburu on 25 May 2017

> One example is here - in this case, there are only 65 lines of code (although it lists it as 144 sloc), but the file size is 30 KB due to image and autogenerated markup data.

I agree this is misleading, but unfortunately, I'm pretty sure parsing all Jupyter Notebooks to improve the statistic counts would be too expensive. /cc @lildude

pchaigno on 17 Aug 2018

> I'm pretty sure parsing all Jupyter Notebooks to improve the statistic counts would be too expensive.

Yes, this is likely to be overly expensive for the small amount of reward.

lildude on 17 Aug 2018

I'm closing this as a wont-fix, and we can revisit if other solutions are found. In the meantime, I'd recommend using Linguist overrides to approximate the expected result.

pchaigno on 27 Aug 2018

Thanks for the consideration! Maybe in the future I'll check out Linguist overrides.

asross on 27 Aug 2018

If anyone's wondering about a workaround for this issue, I have decided to exclude one massive Jupyter notebook file, and that tweaked the language statistics to more realistic fractions.
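The comment does not say how the file was excluded. One possibility, shown here as a sketch, is a Linguist override in a .gitattributes file at the repository root: marking a path as linguist-vendored removes it from the language statistics. The notebook path below is hypothetical.

```
# .gitattributes
# Hypothetical path; replace it with the notebook that dominates the statistics.
notebooks/analysis.ipynb linguist-vendored
```

A glob such as *.ipynb in place of the path would exclude every notebook, which may be more than is wanted if some notebooks are a real part of the codebase.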