Jupyter notebooks are a hybrid form between markup and code, usually with a fairly small set of input lines and a large autogenerated set of outputs (that consist of boilerplate markup, images, etc).
Whenever I include a Jupyter notebook in a repository, it tends to dominate the code breakdown even though it might contain a very small number of lines of code. One example is here - in this case, there are only 65 lines of code (although it lists it as 144 sloc), but the file size is 30 KB due to image and autogenerated markup data.
After looking through this repository, I realize that for this particular library I can move my notebook to a doc or example folder, or explicitly ignore it in a dotfile. I still wanted to create an issue, because I think this is generally misleading behavior that goes against the purpose of linguist. Jupyter notebooks are tricky because sometimes it really does make sense to count parts of them towards the repository composition, but it almost never makes sense to count all of them. Linguist's treatment of them right now obscures rather than reveals useful details about the repository.
It would be very easy to write custom logic to compute their _actual_ size -- I think `JSON.parse(notebook_file)['cells'].map{|c| c['source'] }.join.bytesize` would work -- but I'm not sure if this would be easy to incorporate into linguist.
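A minimal sketch of that idea, for illustration only. The helper name `notebook_source_bytesize` and the inline sample notebook are hypothetical and not part of Linguist; the sketch assumes the nbformat 4 layout, where `source` may be either a string or an array of line strings.

```ruby
require 'json'

# Hypothetical helper (not part of Linguist): counts only the bytes of cell
# sources, ignoring outputs, embedded images, and metadata.
def notebook_source_bytesize(notebook_json)
  cells = JSON.parse(notebook_json).fetch('cells', [])
  # `source` may be a String or an Array of line strings; Array() plus join
  # normalises both forms to a single string.
  cells.sum { |cell| Array(cell['source']).join.bytesize }
end

# Tiny made-up notebook: one code cell with an output, one markdown cell.
sample = JSON.generate(
  'cells' => [
    { 'cell_type' => 'code',
      'source' => ["import math\n", 'print(math.pi)'],
      'outputs' => [{ 'output_type' => 'stream', 'text' => ["3.141592653589793\n"] }] },
    { 'cell_type' => 'markdown', 'source' => '# Title' }
  ]
)

puts notebook_source_bytesize(sample)
```

Like the one-liner above, this counts markdown cells as well as code cells; filtering on `cell['cell_type'] == 'code'` would restrict it to code.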
Maybe this goes without saying, but Jupyter Notebook should be counted not as a language itself, but as the actual language its code is written in.
> One example is here - in this case, there are only 65 lines of code (although it lists it as 144 sloc), but the file size is 30 KB due to image and autogenerated markup data.
I agree this is misleading, but unfortunately, I'm pretty sure parsing all Jupyter Notebooks to improve the statistic counts would be too expensive. /cc @lildude
> I'm pretty sure parsing all Jupyter Notebooks to improve the statistic counts would be too expensive.
Yes, this is likely to be overly expensive for the small amount of reward.
I'm closing this as a wont-fix, and we can revisit if other solutions are found. In the meantime, I'd recommend using Linguist overrides to approximate the expected result.
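For readers unfamiliar with Linguist overrides: they are set in a `.gitattributes` file at the repository root. The fragment below is only a sketch, and the notebook path in it is a made-up example. Normally one of these approaches would be picked rather than all three.

```text
# Exclude every notebook from the language statistics.
*.ipynb linguist-vendored

# Or exclude just one large notebook (hypothetical path).
# notebooks/analysis.ipynb linguist-documentation

# Or attribute notebooks to the language they contain. The full file size,
# outputs included, is still counted, so this only approximates the result.
# *.ipynb linguist-language=Python
```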
Thanks for the consideration! Maybe in the future I’ll check out linguist override.
If anyone's wondering about a workaround for this issue, I have decided to exclude one massive Jupyter notebook file, and that tweaked the language statistics to more realistic fractions.
Yeah. I will just note that over 125,000 GitHub repositories are currently listed as Jupyter notebooks, which is significantly more than many of the languages listed as "popular" in the advanced search dropdown (such as Haskell, ActionScript, Clojure, CoffeeScript, Lua, etc).
Given the simplicity of the overrides required for `BlobHelper#lines` and `Blob#size`, would it make sense to revisit this @lildude and @pchaigno? If you're concerned about performance, you could always skip the `JSON.parse` call for large files, and given the structure of notebook JSON it _might_ be possible to implement `BlobHelper#lines` using a regex, which would make it just as efficient as the current implementation...
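A rough sketch of the size-guard idea, written as a standalone function rather than as an actual override of `BlobHelper#lines`. The name `notebook_source_lines` and the `MAX_PARSE_BYTES` cutoff are assumptions made for illustration; nobody in this thread proposed a specific threshold, and the regex variant is not attempted here.

```ruby
require 'json'

# Hypothetical cutoff; no specific value was proposed in the discussion.
MAX_PARSE_BYTES = 1_000_000

# Returns the lines of all cell sources, or nil when the file is too large or
# is not a JSON object, so a caller could fall back to the raw-file behaviour.
def notebook_source_lines(data)
  return nil if data.bytesize > MAX_PARSE_BYTES

  doc = JSON.parse(data)
  return nil unless doc.is_a?(Hash)

  # `source` may be a String or an Array of line strings in nbformat 4.
  doc.fetch('cells', []).flat_map { |cell| Array(cell['source']).join.lines }
rescue JSON::ParserError
  nil
end

# Made-up notebook: two source lines of code plus one line of markdown.
sample = JSON.generate(
  'cells' => [
    { 'cell_type' => 'code', 'source' => ["import math\n", 'print(math.pi)'], 'outputs' => [] },
    { 'cell_type' => 'markdown', 'source' => '# Title' }
  ]
)

puts notebook_source_lines(sample).length
```

Returning `nil` instead of raising keeps the expensive path optional: oversized or malformed notebooks would simply be counted the way they are today.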
How about committing `.ipynb` files without cell output?
As a bonus, git diff becomes much nicer.
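One way to do that, sketched in Ruby to match the snippet earlier in the thread. The helper `strip_notebook_outputs` is hypothetical, and it assumes the nbformat 4 layout in which code cells carry `outputs` and `execution_count` fields. Note that `JSON.pretty_generate` indents differently from Jupyter's own writer, so the first commit after stripping may show whitespace-only changes.

```ruby
require 'json'

# Hypothetical helper: returns the notebook JSON with all code-cell outputs
# and execution counts cleared, leaving cell sources untouched.
def strip_notebook_outputs(notebook_json)
  notebook = JSON.parse(notebook_json)
  notebook.fetch('cells', []).each do |cell|
    next unless cell['cell_type'] == 'code'

    cell['outputs'] = []
    cell['execution_count'] = nil
  end
  JSON.pretty_generate(notebook)
end

# Made-up notebook with one executed code cell.
sample = JSON.generate(
  'cells' => [
    { 'cell_type' => 'code',
      'execution_count' => 1,
      'source' => ['print(1 + 1)'],
      'outputs' => [{ 'output_type' => 'stream', 'name' => 'stdout', 'text' => ["2\n"] }] }
  ],
  'nbformat' => 4,
  'nbformat_minor' => 2
)

puts strip_notebook_outputs(sample)
```

A script like this could be run by hand before committing or wired into a pre-commit hook.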
Thanks!