@echowhisky noted that conda clean with the -f or --force-pkgs-dirs flag can significantly reduce image sizes, but we do not fully grasp the consequences of doing so. I created this post to gather discussion about this and about which flags we should use with this command!
I found the following documentation about the flag:
https://conda.io/projects/conda/en/latest/commands/clean.html#Removal%20Targets
In the docs they write:
Remove all writable package caches. This option is not included with the --all flag. WARNING: This will break environments with packages installed using symlinks back to the package cache.
The concrete question I think we need to answer is: what flags do we pass to the clean command? The flags specify what to clean up.
Currently we are using the -tipsy flag, but I fail to find any documentation about it. I think it is deprecated in favor of other flags, so we should probably update this no matter what.
conda clean building up to datascience-notebook
Woops, so tipsy is simply the list of flags -t -i -p -s -y.
-t or --tarballs
-i or --index-cache
-p or --packages
-s or ... eh that was not documented, it probably relates to --source-cache and this issue
-y or --yes
Then I ask myself:
-l or --lock flag that is part of the --all flag?
-f or --force-pkgs-dirs that is not part of the --all flag?
-s or --source-cache flag that is deprecated?
You beat me to it. I had planned on making this issue, and had also noticed the -s flag seems deprecated. It's likely that conda clean --all -y is what is wanted, but it's worth doing some testing between that and including -f and see how it goes.
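To make the -tipsy expansion above concrete, here is a minimal Python sketch that splits a combined short-flag argument into the long flags named in this thread. The mapping is an assumption taken from the list above (including the guess that -s means --source-cache), and `expand_short_flags` is a hypothetical helper, not part of conda.

```python
# Mapping copied from the flag list in this thread; not read from conda itself.
SHORT_TO_LONG = {
    "t": "--tarballs",
    "i": "--index-cache",
    "p": "--packages",
    "s": "--source-cache",  # assumed from the deprecation warning; deprecated
    "y": "--yes",
}


def expand_short_flags(combined):
    """Expand a combined short-flag argument such as '-tipsy' into long flags."""
    if not combined.startswith("-") or combined.startswith("--"):
        raise ValueError(f"not a combined short-flag argument: {combined!r}")
    # Each letter after the leading dash is its own flag.
    return [SHORT_TO_LONG[letter] for letter in combined[1:]]


if __name__ == "__main__":
    print(expand_short_flags("-tipsy"))
```

Seen this way, replacing -tipsy mostly means dropping the deprecated -s and deciding whether -f should be added on top of --all -y.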
IIRC, @jakirkham and I set up some of the initial clean commands a long time ago, probably when we were still on conda 3.x? I have little doubt that there are better options today.
Perhaps we could add some tests to the datascience-notebook where we do something similar to the test below from base-notebook, but run a command to verify that various import statements work.
With the -tipsy flags, the -s causes the following warning to show up in the build output for the base container.
WARNING: 'conda clean --source-cache' is deprecated.
Use 'conda build purge-all' to remove source cache files.
The problem is conda build isn't installed, and I'm not sure it makes sense to install something extra to try to reduce size.
I have a branch for this issue now with the initial changes. I'll have an additional commit to test the -f option, but for now wanted to validate just the --all flag as a replacement for -tipsy. It builds to exactly the same size in my testing.
@echowhisky I reduced my 3.6 GB image that is a bloated datascience-notebook image built from scratch to 3.4 GB by going for --all -f as command line arguments instead of -tipsy.
@parente do you have an idea for how to test various Python import statements as part of a test? If I could get a bit more confident about a sensible way to go about it, I'd be happy to add such a test along with a PR to change the conda clean command parameters.
Furthermore, would import statements for various packages be a good enough test to tell whether we broke something? I'm not confident about these things at all...
@betatim - it is my understanding from what you said, without technically understanding it, that you think it would make sense to have the -f flag for docker images, is this correct?
My docker image built with --all -f shrank in file size while remaining functional for various packages with lots of dependencies.
@consideRatio If you're planning to write a pytest test, you could mimic what we do in the base notebook tests to launch a docker container and run ipython -c "import whatever" or the equivalent to test what you want. If you want to take the approach in #406, you'd write a notebook and we'd add some mechanism to invoke nbconvert with the execute flag to run the notebook top to bottom and error if any cell failed.
@parente hmmm, and the #406 approach would be to do something within the Makefile? I wonder how we would make the notebook accessible to the container if we go with a notebook, whether we use the Makefile or pytest.
Can you recommend an approach for me? ;)
@consideRatio I would start with a single pytest setup like https://github.com/jupyter/docker-stacks/blob/master/base-notebook/test/test_container_options.py#L8 where the test body is something like the following pseudocode:
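def test_pandas(container):
    """Should import pandas inside the container without error"""
    # `container` is the fixture used by the existing base-notebook tests
    c = container.run(
        # No inner quotes: each list item reaches the process as-is
        command=['start.sh', 'ipython', '-c', 'import pandas']
    )
    rv = c.wait(timeout=5)
    # Older docker SDK versions return an int, newer ones return a dict
    assert rv == 0 or rv["StatusCode"] == 0


def test_conda_install(container):
    """Should install a conda library and then import it without error"""
    # exercise for the reader :P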
This page in the doc (https://jupyter-docker-stacks.readthedocs.io/en/latest/contributing/tests.html) describes where to put the test file so that it executes against the correct image. You'll need to create a pipenv/venv/conda env containing the contents of the requirements-dev.txt file at the root of the project in order to run the tests locally.
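As a rough illustration of the "run import statements and check the exit code" idea, here is a standalone Python sketch that runs each import in a fresh interpreter and collects the failures. It is not the docker-stacks test harness: `failed_imports` is a hypothetical helper, and the module names are stdlib stand-ins for packages such as pandas that would be checked inside the image.

```python
import subprocess
import sys


def failed_imports(modules, python=sys.executable, timeout=60):
    """Return the subset of `modules` that fail to import in a fresh interpreter."""
    failures = []
    for name in modules:
        # A separate process per module mirrors `ipython -c "import whatever"`:
        # a broken package shows up as a non-zero exit code.
        result = subprocess.run(
            [python, "-c", f"import {name}"],
            capture_output=True,
            timeout=timeout,
        )
        if result.returncode != 0:
            failures.append(name)
    return failures


if __name__ == "__main__":
    # Stand-in module names; swap in the packages the image is expected to ship.
    print(failed_imports(["json", "os"]))
```

An import check like this only catches packages that fail to load at all (for example, files removed by an overly aggressive clean); it says nothing about whether the package behaves correctly afterwards.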
@betatim - it is my understanding from what you said, without technically understanding it, that you think it would make sense to have the -f flag for docker images, is this correct?
We have been doing what I think is the equivalent of -f in repo2docker for a few weeks(?) now and so far no one has complained. This means that, at least for repo2docker, I think we will keep deleting the package directory with the explicit rm command we have. Eventually we should switch to --all -y -f, but it isn't urgent. However, take into account that my last comment, after poking around docker images, layers and hard-links, concluded that there is something I don't understand because the world is inconsistent: hardlinks work within a layer, but deleting one link reduces the image size, which shouldn't happen because I think all files affected here have at least two links to them.
I would not be surprised if there is a bug in how Docker is reporting hard-links in image size. This has happened before ( https://github.com/moby/moby/issues/9283 ).
I finally learned more about hard-links: https://en.wikipedia.org/wiki/Hard_link
For comparison, in the base-notebook container run without -f the /opt/conda/ directory has the following sizes:
jovyan@6e6dbadc1fe3:/opt/conda$ du -sch *
182M bin
2.4M compiler_compat
8.0K condabin
7.3M conda-meta
4.0K envs
44K etc
13M include
312M lib
8.0K LICENSE.txt
56K man
135M pkgs
412K sbin
3.5M share
12K shell
24K ssl
12K x86_64-conda_cos6-linux-gnu
655M total
And then when I built the same container with the -f flag:
jovyan@145c591d5eea:/opt/conda$ du -sch *
182M bin
2.4M compiler_compat
8.0K condabin
7.3M conda-meta
4.0K envs
44K etc
13M include
312M lib
8.0K LICENSE.txt
56K man
796K sbin
39M share
24K shell
336K ssl
12K x86_64-conda_cos6-linux-gnu
556M total
Even though the pkgs directory is gone in the second one, some of the other directories appear to increase in size (sbin, share, ssl). They didn't really. This is due to hard-links. du is smart enough to only count a hard-linked file once when summarizing a directory. In the first example, it found the hard-linked files in pkgs first and counted their size, ignoring it as part of the summaries for the later directories. With pkgs gone in the second instance, du counts them where it finds them causing an appearance of size increase.
Not sure any of this is super important, but it was a hard-link-related artifact that tripped me up at first.
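The du behaviour described above can be reproduced in miniature. The following Python sketch is an illustration under assumptions, not du itself: it sums apparent file sizes (st_size) rather than disk blocks, and `du_like` and `demo` are hypothetical names. It hard-links one file from a pkgs directory into a share directory, then shows that the size is attributed to whichever directory is visited first.

```python
import os
import tempfile


def du_like(root, subdirs):
    """Sum file sizes per subdirectory, counting each inode only once."""
    seen = set()
    totals = {}
    for sub in subdirs:
        total = 0
        for dirpath, _dirnames, filenames in os.walk(os.path.join(root, sub)):
            for filename in filenames:
                st = os.lstat(os.path.join(dirpath, filename))
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue  # hard link already counted under an earlier directory
                seen.add(key)
                total += st.st_size
        totals[sub] = total
    return totals


def demo():
    """Return the per-directory sizes with and without the pkgs directory."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "pkgs"))
        os.mkdir(os.path.join(root, "share"))
        original = os.path.join(root, "pkgs", "data.bin")
        with open(original, "wb") as f:
            f.write(b"x" * 4096)
        # Second name for the same inode, like conda linking out of pkgs.
        os.link(original, os.path.join(root, "share", "data.bin"))
        with_pkgs = du_like(root, ["pkgs", "share"])
        # Removing pkgs drops one link; the data survives through share.
        os.remove(original)
        os.rmdir(os.path.join(root, "pkgs"))
        without_pkgs = du_like(root, ["share"])
    return with_pkgs, without_pkgs


if __name__ == "__main__":
    print(demo())
```

With pkgs present, the 4096 bytes are charged to pkgs and share reports 0; once pkgs is gone, share reports the 4096 bytes, even though nothing in share changed.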
In the end, using -f makes the size of the base-notebook container (on disk, not the compressed container size) about 100M less. That seems good.
The issues with the R environment failing seem to have been resolved, but if they turn back up they might be fixable by using the --copy flag with conda install, if they are caused by some soft-linking issue.
I've updated my fork's commits (167c686) to be in sync with the latest updates to the master (2662627) but haven't submitted a PR yet, waiting to see if there's additional testing coming. If I can get my head around it a bit better and carve out the time, I'll try to contribute to the testing pieces as well.
@echowhisky, do you think we could raise a simplified version of this to the Docker team as an issue and see if they have any thoughts?
@echowhisky @consideRatio Given what we found in repo2docker, I think it's OK to proceed with the -f addition without waiting for tests.
@jakirkham check out https://github.com/jupyter/repo2docker/pull/666#issuecomment-489348883 for a "minimal" example on hardlinks. Seems like docker images reports the numbers I'd expect it to report.
I think I read somewhere in the conda docs that it will try to hardlink what it can from pkgs but that not everything in that repo can be hardlinked and instead will be copied/symlinked. So it could well be that cleaning out that directory "only" saves 100MB.
@jakirkham - the info I quoted above wasn't a docker issue, just the normal reduction from tweaking the conda clean flags and standard behavior of hard-links. I haven't tested beyond the base-notebook container for size savings, but went ahead and submitted the PR.
There are very likely some additional size savings that can be squeezed out, possibly using some other techniques like cleaning out all of the compiled python (.pyc) files and some of the other pieces mentioned in the conversation you referenced, but that probably deserves its own issue thread.
If the Docker Hub tag size reports are to be believed, PR #867 caused the following changes in image size, starting in tag 4d7dd95017ed:
| | new_size_mb | old_size_mb | delta_mb |
|:---------------------|--------------:|--------------:|-----------:|
| all-spark-notebook | 2145.16 | 2145.16 | 0 |
| base-notebook | 216.035 | 259.499 | -43.4638 |
| datascience-notebook | 2047.63 | 2115.08 | -67.4586 |
| minimal-notebook | 1040.8 | 1083.59 | -42.7885 |
| pyspark-notebook | 1809.2 | 1809.2 | 0 |
| r-notebook | 1472.44 | 1530.85 | -58.4122 |
| scipy-notebook | 1384.29 | 1444.94 | -60.6465 |
| tensorflow-notebook | 1629.26 | 1558.54 | 70.7198 |
The Spark images have not changed in size because they failed to rebuild on Docker Hub (#871 is addressing this). I don't have an explanation for how the tensorflow image increased in size, but maybe there's evidence in the build manifests (https://github.com/jupyter/docker-stacks/wiki/tensorflow-notebook-4d7dd95017ed vs https://github.com/jupyter/docker-stacks/wiki/tensorflow-notebook-2662627f26e0).
I used this notebook / binder to fetch the stats from Docker Hub.
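For readers who want to re-check the delta column, here is a small Python sketch that recomputes it as new size minus old size. This is not the notebook mentioned above: the sizes are copied by hand from the table, the Spark images are left out because they did not rebuild, and `deltas` is a hypothetical helper. Since the table sizes are already rounded, the results can differ from the reported deltas in the second decimal.

```python
# (new_size_mb, old_size_mb) copied from the table in this thread.
SIZES_MB = {
    "base-notebook": (216.035, 259.499),
    "datascience-notebook": (2047.63, 2115.08),
    "minimal-notebook": (1040.8, 1083.59),
    "r-notebook": (1472.44, 1530.85),
    "scipy-notebook": (1384.29, 1444.94),
    "tensorflow-notebook": (1629.26, 1558.54),
}


def deltas(sizes):
    """Return new minus old per image, rounded to two decimals."""
    return {name: round(new - old, 2) for name, (new, old) in sizes.items()}


if __name__ == "__main__":
    for name, delta in deltas(SIZES_MB).items():
        print(f"{name:22s} {delta:+.2f} MB")
```

A negative delta means the image got smaller after the change; tensorflow-notebook is the only rebuilt image with a positive delta.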
Thanks for the summary @parente ! Closing time?
No clue about how tensorflow-notebook increased in size, perhaps there was some package that got itself a version bump?
Thanks for the discussion and work put in here, folks. Less is more FTW!