Plots.jl: package has a large filesystem footprint

Created on 18 May 2016 · 66 comments · Source: JuliaPlots/Plots.jl

The Plots.jl filesystem footprint is around 110MB, mostly from its git history.

$ ~/.julia/v0.4/Plots $ du -ks .
110248  .
$ ~/.julia/v0.4/Plots $ cd .git/
$ ~/.julia/v0.4/Plots/.git $ du -ks .
108760  .

Most of that seems to come from files that were deleted quite some time back.

size(kB)   packed(kB)  location
27420      7343        examples/meetup/wine.ipynb
11348      3266        examples/meetup/nnet.ipynb
4368       1395        examples/meetup/nnet.ipynb
2709       2482        img/gadfly/gadfly_example_2.gif
2690       2466        img/immerse/immerse_example_2.gif
2630       2400        img/gadfly/gadfly_example_2.gif
2508       697         examples/meetup/wine.ipynb
2399       2274        img/pyplot/pyplot_example_2.gif
2310       2174        img/pyplot/pyplot_example_2.gif
2296       2177        img/pyplot/pyplot_example_2.gif
2234       2117        img/pyplot/pyplot_example_2.gif
1845       1821        examples/meetup/iheartplots.gif
1797       1772        examples/meetup/iheartplots.gif
1761       659         examples/meetup/nnet.ipynb
1568       1092        examples/meetup/wine.ipynb
1483       1047        img/qwt/qwt_example_2.gif
1459       829         examples/palettes.ipynb
1431       1079        examples/meetup/wine.ipynb
1233       944         examples/meetup/wine.ipynb
1203       905         examples/meetup/wine.ipynb
...

It would be good to purge or reduce it in some way.
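For anyone who wants a listing like the one above for their own clone, here is a minimal sketch. It is not necessarily the command used to produce that table, and the function name `largest_blobs` is made up for illustration. It reports uncompressed blob sizes in bytes (not kB) and has no packed-size column.

```shell
# List the largest blobs reachable from any ref in a repository's history.
# Output: one line per blob, "<size in bytes> <path>", largest first.
# Usage: largest_blobs <repo-dir> [count]   (hypothetical helper name)
largest_blobs() {
    repo=${1:-.}
    count=${2:-20}
    # rev-list prints "<sha> <path>"; cat-file resolves each sha and echoes the path via %(rest).
    git -C "$repo" rev-list --objects --all |
        git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        # Keep blobs only and drop the leading "blob " (5 characters).
        awk '$1 == "blob" { print substr($0, 6) }' |
        sort -k1,1nr |
        head -n "$count"
}
```

Running `largest_blobs ~/.julia/v0.4/Plots 20` should show files that were deleted long ago but are still in history, since every ref is walked rather than just the current checkout.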

I agree! Do you have any experience with this? Last time I tried to do
something like this I nearly corrupted the whole repo (never ever use git
lfs). I'd like to do this without screwing anything up.

Thanks @KristofferC I think this looks like a good option. I'm going to put it off a little longer until my dev work slows down.

You realize that this will make all previous releases void, since all commit SHAs will change. People will not be able to just run Pkg.update. It definitely requires some thought about how best to do it.

Yes, appears difficult to do without some sort of support in the package manager.

Yes... I assume I'll have to update METADATA with new commit info for each tag? I'm not ready to tackle that mess yet.

@viralbshah @StefanKarpinski

As mentioned above, the suggestion was to use https://rtyley.github.io/bfg-repo-cleaner/. I think this will require updates to the commit shas in METADATA... do you foresee any other issues?

I see 116MB total... 114MB is in the .git directory, and 1.2MB is from the plotly-latest.min.js file which is no longer bundled with Plots. So without the bloated git history, Plots should be about 1MB.

tom@tom-office-ubuntu:~/.julia/v0.5/Plots$ du  |sort -n |tail -n20
56  ./.git/objects/f6
72  ./src/deprecated/backends
140 ./.git/refs/tags
144 ./src/deprecated
180 ./src/backends
612 ./src
688 ./.git/logs/refs/remotes/cache/pull
688 ./.git/refs/remotes/cache/pull
704 ./.git/logs/refs/remotes/cache
704 ./.git/refs/remotes/cache
736 ./.git/refs/remotes
748 ./.git/logs/refs/remotes
800 ./.git/logs/refs
844 ./.git/logs
912 ./.git/refs
1204    ./deps
108084  ./.git/objects/pack
112648  ./.git/objects
114540  ./.git
116476  .

@StefanKarpinski says this is impossible. I guess the only real option is to wait for Pkg3. With libgit2, we can't even do shallow clones.

I don't understand why this would be impossible. @StefanKarpinski maybe you can explain a little more? If the repo had a fresh commit history and we updated METADATA appropriately, why wouldn't this work? It might require people to manually delete their local download of Plots, I suppose? (which wouldn't make it impossible, just annoying)

I think having people remove Plots.jl once to clean this up is OK, while the package is still in relatively early days and before it really explodes.

Also cc @keno @tkelman

We could just overwrite all the SHA1s for versions in METADATA, but that seems like a huge risk to me – it's basically indistinguishable from an attack on METADATA and I'm not sure how the package manager will react. We could do some testing first and see what happens?

It would likely break updating for anyone who has an existing clone. Try testing with single-branch clones first, which wouldn't rewrite the existing tags completely.

If we ever decide to do this, I tried locally and I think these are the non-METADATA steps:

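# Keep a copy of the working clone in case anything goes wrong.
cp -R ~/.julia/v0.5/Plots /tmp/Plots_backup
cd /tmp
# A mirror clone is bare and lands in Plots.jl.git, not Plots.jl.
git clone --mirror git@github.com:tbreloff/Plots.jl.git
# Strip blobs over 10K from history, except those at the tips of the listed branches.
java -jar ~/Downloads/bfg-1.12.13.jar --strip-blobs-bigger-than 10K --protect-blobs-from master,dev,backports,sd/dev Plots.jl.git
cd Plots.jl.git
# Physically drop the now-unreferenced objects.
git reflog expire --expire=now --all && git gc --prune=now --aggressive
# Pushes all rewritten refs back to the remote.
git push
# cross fingers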

I have a local repo that seems to be small and intact:

tom@tom-office-ubuntu:~/.julia/v0.5/Reinforce$ du -h /opt/Plots_purge_test2
44K /opt/Plots_purge_test2/.git/hooks
8.0K    /opt/Plots_purge_test2/.git/logs/refs/heads
12K /opt/Plots_purge_test2/.git/logs/refs
20K /opt/Plots_purge_test2/.git/logs
8.0K    /opt/Plots_purge_test2/.git/refs/heads
4.0K    /opt/Plots_purge_test2/.git/refs/tags
16K /opt/Plots_purge_test2/.git/refs
8.0K    /opt/Plots_purge_test2/.git/info
2.4M    /opt/Plots_purge_test2/.git/objects/pack
8.0K    /opt/Plots_purge_test2/.git/objects/info
2.4M    /opt/Plots_purge_test2/.git/objects
4.0K    /opt/Plots_purge_test2/.git/branches
2.6M    /opt/Plots_purge_test2/.git
28K /opt/Plots_purge_test2/test
8.0K    /opt/Plots_purge_test2/deps
180K    /opt/Plots_purge_test2/src/backends
72K /opt/Plots_purge_test2/src/deprecated/backends
144K    /opt/Plots_purge_test2/src/deprecated
612K    /opt/Plots_purge_test2/src
3.3M    /opt/Plots_purge_test2

The real question is how will Pkg interact with it. Is there any way to simulate or test that?

Pkg3 was mentioned, but I haven't heard anything about it since JuliaCon... what's the latest?

Pkg3 would only be relevant here if it's planning to move entirely away from using git repos and instead using non-repo tarball downloads of packages.

Updating from an existing install to a repo where the history has been rewritten is not likely to work smoothly. Pkg.rm doesn't actually delete the package which will make this messier to deal with for users, and there's the separate .cache bare clone to worry about.

Just for discussion's sake, what would be the proper way to purge a repo from a user's system so that Pkg would download a fresh copy and start from scratch? (i.e. the most conservative way to make sure it's gone)

How about creating Plots.jl as a new package with a new name, and then eventually renaming this package to the new one so that we retain the issues and PRs?

The downside is that we lose the nice Plots.jl name.

You'd need to delete Plots from all the .cache folders (there are several, sometimes but not always with symlinks shared between julia versions), all copies of Plots from .trash, and from the Pkg.dir for each version of Julia where it's been installed. A new repo would be safer.
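A cautious first step is just to list where the copies live before removing anything. Below is a minimal sketch, assuming all the copies sit somewhere under a single root such as ~/.julia. The function name `find_package_copies` is hypothetical, and it only prints matches; it deletes nothing.

```shell
# Print every directory named <pkg> under <root>, including hidden
# folders such as .cache and .trash. Nothing is removed.
# Usage: find_package_copies <root> <pkg>   (hypothetical helper name)
find_package_copies() {
    root=$1
    pkg=$2
    # -prune stops find from descending into a match, so nested
    # directories inside a package copy are not reported again.
    find "$root" -type d -name "$pkg" -prune -print
}
```

For example, `find_package_copies ~/.julia Plots` would show each candidate directory, which can then be reviewed by hand. Symlinked cache folders are not followed, so those may need a separate look.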

My understanding is that this needs a new repo, if we don't want to inconvenience users - and that there is no clean way otherwise. Shouldn't we do it sooner rather than later? Perhaps UnionOfPlots.jl. :-)

I would hate to lose the issues and such, but perhaps github support can help us migrate them over.

My understanding on Pkg3 from @StefanKarpinski is that it is expected in the 0.6 release timeframe. However, code readiness and migration to Pkg3 are completely different things, and perhaps even if Pkg3 is ready optimistically by around Jan 2017, it may take another 2-3 months to work out the kinks and migrate.

The name is not changing. Users can be inconvenienced once if they need to be.

I will only say that I wish this package was not 100MB. I am ok with whatever solution you choose to go with.

I'm going to do this today, so prepare for the carnage. My plan:

  • create a backup repo at JuliaPlots/PlotsBackup.jl
  • follow the steps I listed here: https://github.com/tbreloff/Plots.jl/issues/264#issuecomment-251167584
  • prepare a METADATA PR that removes most old tags and updates the sha for any commits that I keep
  • after the PR is merged, post an announcement to remove and re-install Plots

If you break the existing tags, we'll be redirecting METADATA to point to a fork. Please don't break existing tags. Deleting them from METADATA won't be merged.

Backed up: https://github.com/JuliaPlots/PlotsBackup.jl

@tkelman would you prefer to change the url to point to the backup until the dust settles?

I believe the repo name of the backup will have to still be Plots.jl for it to work correctly.

This is not going to go well.
It would be much much less disruptive to everyone if we experiment with this as a single, new, cleaned-up branch without breaking the ability to check out commits that were on the existing branches. If Pkg can correctly handle single-branch clones, we may be able to get the desired outcome without massively breaking everything.

> if we experiment with this as a single cleaned-up branch

You're going to have to go into a lot more detail about what you mean.

Don't overwrite the existing branches. Do the repo cleaning, but push the result here as a new clean branch, without overwriting the existing branches and SHAs. A git clone --single-branch of that branch should then be small.
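To make the single-branch idea concrete, here is a minimal sketch of cloning only one branch with plain git. Whether Pkg and libgit2 can do the equivalent is exactly the open question in this thread. The function name and the branch name in the usage note are placeholders, not anything that exists in the Plots repo.

```shell
# Clone only the history leading to one branch; commits reachable solely
# from other branches are not fetched.
# Usage: clone_single_branch <url> <branch> <dest>   (hypothetical helper name)
clone_single_branch() {
    url=$1
    branch=$2
    dest=$3
    git clone --quiet --single-branch --branch "$branch" "$url" "$dest"
}
```

Something like `clone_single_branch https://github.com/tbreloff/Plots.jl.git clean /tmp/Plots_clean` would apply, where `clean` stands in for whatever the cleaned branch ends up being called. Later fetches in that clone only update the one tracked branch unless more refspecs are added.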

I'm testing it out locally because I don't like the options of either bloating it more or renaming.

tl;dr I changed the url to point to PlotsBackup.jl and everything worked fine.

My steps:

  • cd METADATA
  • git checkout -b plotsbackup
  • update Plots/url to point to JuliaPlots/PlotsBackup.jl
  • git commit -am "plotsbackup url"
  • git remote rename origin old_origin
  • git remote rename forked origin
  • git push -u origin plotsbackup
  • change META_BRANCH to plotsbackup
  • in a fresh session, Pkg.update()
  • using MetaPkg; MetaPkg.purge("Plots") which does a Pkg.rm and then deletes all traces of Plots from the .julia directory

Then in a fresh session, this worked fine:

julia> Pkg.add("Plots")
INFO: Installing Plots v0.9.4
INFO: Building Plots
INFO: Cannot find deps/plotly-latest.min.js... downloading latest version.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1782k  100 1782k    0     0  1797k      0 --:--:-- --:--:-- --:--:-- 1796k
INFO: Package database updated

julia> using Plots; plot(rand(10))
INFO: Precompiling module Plots.
INFO: Recompiling stale cache file /home/tom/.julia/lib/v0.5/GR.ji for module GR.

and the repo is pointing to the backup repo:

tom@tom-office-ubuntu:~/.julia/v0.5$ cd Plots
tom@tom-office-ubuntu:~/.julia/v0.5/Plots$ git remote -v
origin  https://github.com/JuliaPlots/PlotsBackup.jl.git (fetch)
origin  git@github.com:JuliaPlots/PlotsBackup.jl.git (push)

And then for the final test, I purged Plots again and did a Pkg.clone("git@github.com:tbreloff/Plots.jl"), then a Pkg.update() in a fresh session. Everything worked fine. I even did Pkg.add("EEG"), which is one of the few packages with a Plots dependency, and it worked as expected.

So users will get the snapshot of what PlotsBackup.jl looks like today, and you can manually clone tbreloff/Plots.jl to get the cleaned repo and future commits. If there's not a clean way to let METADATA point at Plots.jl again, then I just won't tag anything until Pkg3.

Changing the METADATA url won't actually help anyone who has an existing clone (you should be able to see how many clones that is at https://github.com/tbreloff/Plots.jl/graphs/traffic) if the branches here get deleted or overwritten. Everyone would have to go through that purge exercise if you do any history rewriting in place; otherwise updates and checkouts won't modify the existing remotes that are set up. Usually url changes due to moving to an organization or changing username are helped by GitHub redirects; this one wouldn't be.

> Everyone would have to go through that purge exercise

Yeah. Which is why I'd just prefer to rewrite the old tags. People will complain, we'll tell them to remove it and add it again, and we're good to go. I don't quite see how renaming the repo, or creating a new standalone branch, or anything else will avoid some sort of "reset" on the part of users. Better to just rip the bandaid off.

Creating a new repo will work fine and would be much simpler for everyone to deal with. Otherwise this is going to cause frustration for (at least?) hundreds of users, many of whom are far less technically savvy with git than we are.

You're also breaking the reproducibility of old code if you don't recreate all the prior version tags with equivalent code in the rewritten version. This is kind of the whole point of having version numbers in the first place.

If Pkg and libgit2 can correctly handle single-branch clones (open question) and only fetch the history of the large legacy branches if asked to checkout or downgrade to an older version, then git might just do the right thing with no reset required. You'd only need a reset if you want to decrease the size of an existing clone. New clones would be small (if single-branch clones work correctly with Pkg), existing clones would continue working.

> Creating a new repo will work fine

Sure... for you.

> You're also breaking the reproducibility of old code

Look. It's pre-1.0. Combine that with the fact that there's very little in the ecosystem that depends on Plots directly, and that no one really wants to check out a tag from 6 months ago. Making a new repo brings a whole host of issues, annoyances, and frustrations (which don't affect you) that will still impact users and force them to do some sort of manual update. Links will be wrong. StackOverflow tags will be wrong. A name change will mean massive required code updates. (Not to mention I do not want to change the name.) And METADATA doesn't really allow you to reproduce anything. You'll almost certainly get a different set of packages than you did during your original Pkg.update, because version dependencies are necessarily so loose.

Applying exact version bounds to a REQUIRE file is perfectly reproducible. Feasibility may change if bounds have tightened, but you'll get the same versions.

Yup... and how many packages do that?

Registered packages, none directly. JuliaBox does. Shipping products do.

> there's very little in the ecosystem that depends on Plots directly

Except users' analyses that they generated for a publication that needs revision a month later.

Is it possible to use months-old versions with any of these at this point though?

If I understand the implications of the output below, only the packages ExperimentalAnalysis and ImplicitEquations will possibly force older versions of Plots to be installed:

tom@tom-office-ubuntu:~/.julia/v0.5/METADATA$ grep Plots */versions/*/requires
ApproxFun/versions/0.1.0/requires:Plots 0.5
ApproxFun/versions/0.2.0/requires:Plots 0.7.5
ApproxFun/versions/0.2.1/requires:Plots 0.7.5
ApproxFun/versions/0.2.2/requires:Plots 0.8.1
ApproxFun/versions/0.3.0/requires:Plots 0.8.2
ApproxFun/versions/0.3.1/requires:Plots 0.8.2
ApproxFun/versions/0.3.2/requires:Plots 0.8.2
ApproxFun/versions/0.3.3/requires:Plots 0.8.2
AverageShiftedHistograms/versions/0.2.0/requires:TextPlots
AverageShiftedHistograms/versions/0.2.1/requires:TextPlots
AverageShiftedHistograms/versions/0.2.2/requires:UnicodePlots
AverageShiftedHistograms/versions/0.3.0/requires:UnicodePlots
AverageShiftedHistograms/versions/0.3.0/requires:Plots
AverageShiftedHistograms/versions/0.4.0/requires:UnicodePlots
AverageShiftedHistograms/versions/0.5.0/requires:UnicodePlots
AverageShiftedHistograms/versions/0.5.1/requires:UnicodePlots
AverageShiftedHistograms/versions/0.5.2/requires:UnicodePlots
BenchmarkProfiles/versions/0.0.1/requires:Plots
ControlSystems/versions/0.1.1/requires:Plots
ControlSystems/versions/0.1.2/requires:Plots
ControlSystems/versions/0.1.3/requires:Plots
ControlSystems/versions/0.1.4/requires:Plots v0.7.4
ControlSystems/versions/0.2.0/requires:Plots v0.7.4
DifferentialEquations/versions/0.0.1/requires:Plots
DifferentialEquations/versions/0.0.2/requires:Plots
DifferentialEquations/versions/0.0.3/requires:Plots
DifferentialEquations/versions/0.1.0/requires:Plots
DifferentialEquations/versions/0.1.1/requires:Plots
DifferentialEquations/versions/0.1.2/requires:Plots
DifferentialEquations/versions/0.1.3/requires:Plots
DifferentialEquations/versions/0.1.4/requires:Plots
DifferentialEquations/versions/0.2.0/requires:Plots
DifferentialEquations/versions/0.2.1/requires:Plots
DifferentialEquations/versions/0.3.0/requires:Plots
DifferentialEquations/versions/0.4.0/requires:Plots 0.9.2
DifferentialEquations/versions/0.4.1/requires:Plots 0.9.2
DifferentialEquations/versions/0.4.2/requires:Plots 0.9.2
EEG/versions/0.0.3/requires:Plots 0.0 0.7
EEG/versions/0.0.4/requires:Plots 0.8.0 0.9.0
EEG/versions/0.1.0/requires:Plots 0.8.0 0.9.0
EEG/versions/0.1.1/requires:Plots
EEG/versions/0.2.0/requires:Plots
ExperimentalAnalysis/versions/0.0.1/requires:Plots 0.0 0.7
ExperimentalAnalysis/versions/0.0.2/requires:Plots 0.0 0.7
ImplicitEquations/versions/0.1.0/requires:Plots 0.5.0 0.5.1
JWAS/versions/0.1.1/requires:Plots
PlotRecipes/versions/0.0.1/requires:Plots
PlotRecipes/versions/0.0.2/requires:Plots
PlotRecipes/versions/0.0.3/requires:Plots
PlotRecipes/versions/0.0.4/requires:Plots
PlotRecipes/versions/0.0.5/requires:Plots
PlotRecipes/versions/0.0.5/requires:StatPlots
PlotRecipes/versions/0.0.6/requires:Plots
PlotRecipes/versions/0.0.6/requires:StatPlots
PlotRecipes/versions/0.1.0/requires:Plots
PlotRecipes/versions/0.1.0/requires:StatPlots
Robotlib/versions/0.0.1/requires:Plots
Robotlib/versions/0.0.2/requires:Plots
StatPlots/versions/0.0.1/requires:Plots
StatPlots/versions/0.0.2/requires:Plots
StatPlots/versions/0.0.3/requires:Plots
StatPlots/versions/0.1.0/requires:Plots
StatPlots/versions/0.1.1/requires:Plots
SymPy/versions/0.2.29/requires:Plots 0.4.0
SymPy/versions/0.2.30/requires:Plots 0.4.0
SymPy/versions/0.2.31/requires:Plots
SymPy/versions/0.2.32/requires:Plots
SymPy/versions/0.2.33/requires:Plots
SymPy/versions/0.2.34/requires:Plots
SymPy/versions/0.2.35/requires:Plots
SymPy/versions/0.2.36/requires:Plots
SymPy/versions/0.2.37/requires:Plots
SymPy/versions/0.2.38/requires:Plots
SymPy/versions/0.2.39/requires:Plots
SymPy/versions/0.2.40/requires:Plots
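To pick out the relevant lines from a listing like this, here is a minimal sketch. It assumes the Pkg2 REQUIRE layout of package name, lower bound, then an optional upper bound, so a line with three fields caps Plots at an older version. The function name `upper_bounded_plots` is made up for illustration.

```shell
# Read `grep Plots */versions/*/requires` output on stdin and keep only
# the entries for Plots itself (not TextPlots, StatPlots, ...) that carry
# both a lower and an upper version bound.
# Usage: grep Plots */versions/*/requires | upper_bounded_plots   (hypothetical helper name)
upper_bounded_plots() {
    # Field 1 is "<path>:<package>"; fields 2 and 3 are the bounds.
    awk '$1 ~ /:Plots$/ && NF == 3'
}
```

On the output above this would also surface the older EEG releases (0.0.3, 0.0.4, 0.1.0), but newer EEG versions have no upper bound, so only the packages whose every release is capped would be stuck on an old Plots.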

Renaming Plots to PlotsBackup isn't the problem since GitHub handles redirection for you. The problem is then putting something in the place where Plots used to be but which is a completely unrelated git repo, which will confuse anyone's installation who has Plots installed. We went through this with the Stats/StatsBase renaming and it was rough – it seems like a bad experience to foist on Plots users. I'm not sure about the implications of changing tagged versions but I don't really like it. It seems likely to cause problems. @tbreloff, if you want to go that way, you should make a fork of METADATA and try it (and get some other people to try it as well with Plots previously installed). Otherwise, I think what @tkelman is proposing using single-branch clones is the best way to go, although I have some technological doubts there tbh.

> GitHub handles redirection for you

Only if you actually do a rename under the settings. If you just push a separate copy as a brand new repository, then there aren't any redirections.

Well, the repo size doesn't really bother me. And I've spent about 10 hours too many on this issue. I'm just burnt out with Julia package management. If you guys care so much about the repo size, then let me break old tags. If you don't care enough, then I'll close the issue.

It is quite likely that Plots.jl has fewer users than the stats packages did. I'd rather make a clean break personally and get a smaller package. Or just leave it as it is - since clearly there isn't a solution that satisfies all constraints.

For my own use, I'll probably make my own clean copy of the Plots.jl repo, if the size bothers me too much going forward. Feel free to close it.

It'd be great to have this done, keeping the name and infrastructure of Plots. With the current development of Pkg3, does it look like that will support a good solution?

AFAIU Pkg3 will cause a large enough change that everything has to be "redone", and then the SHAs for old versions could be changed.

This will be fixed by Pkg3, I've added wontfix but won't close until Pkg3 is out.

Repo size is preventing me from using Plots on slower internet connections (which unfortunately I deal with a lot out in rural areas of USA). Anything that can be done to trim package sizes helps productivity for me and anyone leveraging what I'm doing.

I'm currently waiting for the Pkg.add("Plots") to finish and have no idea if I'm close or far, due to Julia's lack of progress bar.

Pkg3 will use tarballs for standard packages, which tend to be less than a megabyte. So this issue will be resolved when Pkg3 is merged into Base, which will happen for 0.7.

Close this as something that won't be fixed in Pkg2? Pkg3 will be out "soon" and this will be fixed automatically. Until then there is nothing we can do, so...

Latest Plots archive is 192KB which is what will be downloaded in Pkg3. Could keep this open just to have something to close when Pkg3 lands ;)

I think Pkg3 landing will leave you plenty of issues to close, but if you need something to do that afternoon we can keep this open :)

As a teaser, this is installing Plots from scratch (in real time):

https://giphy.com/gifs/xUOwG0sbnLhrPjd9hm

Holy Moly, that's amazing. @pkofod and @KristofferC we could also close this issue ritually at some point where we're in a position to clink cold beer bottles together?

Sounds like a plan!

Just wait until the build steps are handled by BinDep2 and we spend some more time optimizing the resolver. Then this will _really_ fly.

I think it is time - and I will provide the cold beer bottles to clink together :tada: ! So, when and where? :-)

Could we do it tomorrow? :)

Do it when you tag :)

Missed the beer but I think this can be closed now anyway :tada:

I'll buy both of you beer when I get the chance. And anyone else in this thread :-)

A Plots.jl with a small filesystem footprint... This is a whole new world!
