Custom Dockerfiles are going to be a source of never-ending support problems as we make assumptions and requirements about what is present in the image. We already have requirements about JupyterHub version, and will probably develop new requirements over time, such as additional extensions, notebook server version, etc. These aren't feasible unless we are performing non-optional installation steps in the image ourselves.
Pushing people toward a clear 'run here' script that is purely in addition to our own image setup should be hugely more supportable in the long run. Following the well-established pattern of .travis.yml should allow us to cover things much more rigorously while allowing users to perform additional actions, such as build steps or downloads (post-processing on the repo, essentially).
I would consider ~every case that can only be solved by Dockerfiles to be a bug requiring higher-level support elsewhere. The only one that I can't really see a way to support is building from custom base images. I don't see how that's really supportable in the long run, though.
I feel conflicted on this one. I agree that supporting Dockerfiles is going to be a continuing source of pain. They are hard to predict and to support.
On the other hand, many of the people who'd be the biggest "power users" use dockerfiles quite a lot. E.g., some projects that have their own interests in reproducibility have curated Dockerfiles they use for their software stack. I'm at a hack week thing right now and one of the people working on nipype (a pipelining tool stack for fMRI) ran into this problem.
If we didn't support Dockerfiles at all, then I doubt we could expect them to turn that Dockerfile into a bunch of requirements.txt and postBuild files just so that they could run Binder.
That said, I think this is a topic we should revisit every few months. I can definitely see the upside here, just don't want to unintentionally shut out some of the more "reproducibility-minded" users.
I agree that there are plenty of use cases that need Dockerfiles now, and I don't want to shut them down outright. I would just like to work from the perspective that ~every case that requires a Dockerfile is a bug in binder to be addressed via some more maintainable mechanism.
We could also consider changing our perspective on custom Dockerfiles, where we perform a bunch of installation of everything we need to work with Binder. Basically treat Dockerfiles as base images, and then we build a derivative that we ensure works with Binder, which can change over time.
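To make the derivative-image idea concrete, here is a minimal sketch in Python. It is not BinderHub's actual build logic: the function name `derivative_dockerfile`, the example image tag, the example version number, and the `pip install` line are all assumptions for illustration, and it assumes `pip` is available in the user's image.

```python
def derivative_dockerfile(base_image: str, jupyterhub_version: str) -> str:
    """Return Dockerfile text that layers platform requirements on a user image.

    `base_image` is assumed to be the image already built from the user's own
    Dockerfile. The install step is a placeholder for whatever Binder needs.
    """
    lines = [
        # Start from the user's image instead of running it directly.
        f"FROM {base_image}",
        # Platform-controlled step: the user cannot skip or override it.
        f"RUN pip install --no-cache-dir notebook jupyterhub=={jupyterhub_version}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical image tag and version, for illustration only.
    print(derivative_dockerfile("user-repo-build:abc123", "0.8.0"))
```

Because the second layer is generated by the platform, it can change when the requirements change without asking users to edit their Dockerfiles.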
What I don't think we can do realistically is support custom Dockerfiles without intervention, relying on users to build Dockerfiles that always work with our changing requirements for what's in the image.
> just don't want to unintentionally shut out some of the more "reproducibility-minded" users.
That makes sense. Though I would point out that Dockerfiles are the least likely mechanism to be reproducible in binder right now over a longer time period. An upgrade to the version of jupyterhub on binder will immediately break all repos with Dockerfiles, as is about to happen. I'm working on making JUPYTERHUB_VERSION a build arg to protect against this, but all existing Dockerfiles will have to be modified for this to work.
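As a rough illustration of why existing Dockerfiles would need a change, a Dockerfile only receives a value passed with `docker build --build-arg JUPYTERHUB_VERSION=...` if it declares a matching `ARG` instruction. The sketch below checks for that declaration. The function name `accepts_jupyterhub_version` is hypothetical, and this is a simple text check, not a full Dockerfile parser.

```python
import re

# Matches an `ARG JUPYTERHUB_VERSION` instruction at the start of a line.
# Instruction keywords are case-insensitive in Dockerfiles; ARG names are not.
_ARG_LINE = re.compile(
    r"^[ \t]*[Aa][Rr][Gg][ \t]+JUPYTERHUB_VERSION(?:=|\s|$)",
    re.MULTILINE,
)


def accepts_jupyterhub_version(dockerfile_text: str) -> bool:
    """Return True if the Dockerfile declares the JUPYTERHUB_VERSION build arg."""
    return _ARG_LINE.search(dockerfile_text) is not None


if __name__ == "__main__":
    example = (
        "FROM python:3.6\n"
        "ARG JUPYTERHUB_VERSION\n"
        "RUN pip install jupyterhub==$JUPYTERHUB_VERSION\n"
    )
    print(accepts_jupyterhub_version(example))
```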
Ironically, the legacy binder Dockerfile builder is right now much more reproducible than arbitrary Dockerfiles, because it supports intervention after the user Dockerfile. I suspect we are going to have to do this for all Dockerfiles if we want to keep them working for extended periods of time.
I think those are all good points. I suspect @yuvipanda has thoughts on this too. Probably worth scoping out a group proposal that formalizes some of the points you make above (e.g., exactly what we will do when people provide Dockerfiles). In the short term I agree with you that we should really try to get across the message that Dockerfiles should be an absolute last resort. If you've got thoughts on how to get this across more cleanly, feel free to make more comments on the dockerfile docs PR and I'll make changes accordingly!
Looking at the gitter channel now - one other way we could try to minimize the moving parts with Dockerfiles is to restrict them somewhat. E.g., I bet we'll run into this problem of sourcing an image from latest multiple times. What if we prohibited people from using latest in their Dockerfiles to keep this kind of thing from happening? Similar to how jupyterhub works...
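A check like this could be sketched as a small lint pass over the `FROM` lines. The version below is an assumption about what such a rule might look like, not an existing BinderHub feature. The function name `unpinned_base_images` is hypothetical, and variables in `FROM` lines (e.g. `FROM ${BASE}`) are not expanded.

```python
from typing import List


def unpinned_base_images(dockerfile_text: str) -> List[str]:
    """Return base images that use `latest` or carry no tag at all."""
    flagged = []
    stage_names = set()  # names introduced by `FROM ... AS <name>`
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Drop flags such as --platform=linux/amd64.
        refs = [p for p in parts[1:] if not p.startswith("--")]
        if not refs:
            continue
        image = refs[0]
        if len(refs) >= 3 and refs[1].upper() == "AS":
            stage_names.add(refs[2])
        # Earlier build stages, `scratch`, and digest-pinned images are fine.
        if image in stage_names or image == "scratch" or "@" in image:
            continue
        # Only look for a tag after the last "/", so a registry port
        # (localhost:5000/img) is not mistaken for a tag.
        _, sep, tag = image.rsplit("/", 1)[-1].partition(":")
        if not sep or tag == "latest":
            flagged.append(image)
    return flagged


if __name__ == "__main__":
    print(unpinned_base_images("FROM jupyter/base-notebook:latest\nRUN echo hi\n"))
```

An untagged image is flagged as well, because Docker resolves a missing tag to `latest`.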
One question for this: would we still support people sourcing Docker images? If so, we'd need to figure out a build spec for this. Could be as simple as a file called SOURCE with a single line specifying the name/tag of an image.
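As a sketch of how small such a spec could be, the function below reads the contents of a SOURCE file and returns the image name and tag. The exact rules (one non-empty line, an explicit tag required) are assumptions, and `parse_source` is a hypothetical name.

```python
from typing import Tuple


def parse_source(text: str) -> Tuple[str, str]:
    """Parse SOURCE file contents into (image name, tag)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 1 or len(lines[0].split()) != 1:
        raise ValueError("SOURCE must contain exactly one image reference")
    # Split on the last ":" so a registry port stays part of the name.
    name, sep, tag = lines[0].rpartition(":")
    # A "/" in the tag means the ":" belonged to a registry port, not a tag.
    if not sep or not name or not tag or "/" in tag:
        raise ValueError("expected an image reference of the form name:tag")
    return name, tag


if __name__ == "__main__":
    # Hypothetical image reference, for illustration only.
    print(parse_source("jupyter/scipy-notebook:abc123\n"))
```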
There are lots of useful things developing in a few fields around creating base docker images for various workflow stacks, e.g.:
Quick note before I leave - while I agree we should deprecate those, I don't think we can do it anytime soon - it is, for better or worse, the standard for reproducibility and we should support it. What we can and should do is enforce better standards for it, and refuse to run Dockerfiles that fail that standard. And make it very clear that using a Dockerfile is an 'expert' case, we expect you to follow all these guidelines, and they might fail even then in the future. Better messaging and linting!
Not allowing Dockerfiles will limit adoption of Binder significantly IMO, and I don't think we can do that right now.
@choldgraf Which gitter channel are you referring to?
> What we can and should do is enforce better standards for it, and refuse to run Dockerfiles that fail that standard. And make it very clear that using a Dockerfile is an 'expert' case, we expect you to follow all these guidelines, and they might fail even then in the future.
This seems like the best approach. I do agree with @minrk that without some boundaries we are creating a maintenance/support problem.
@willingc this one was in the binder gitter: https://gitter.im/binder-project/binder (this is still the active one)
re: best approach, I agree w/ validator approach...see : https://github.com/jupyterhub/binderhub/issues/102
@choldgraf Thanks. I didn't know that any discussions were happening on that gitter channel too.
yep - that's the legacy binder gitter channel...was originally us spot-checking problems with old binder, but has become new-binder-focused until we switch to the jupyterhub binder channel :-)
Probably time to clarify where discussions are happening in general related to binder, helm charts, jupyterhub, etc. Please copy me on any emails as fyi too.
yeah I think that's a good idea. Thus far for binder we haven't really directed to anywhere yet...just been using the old channels for communication. Now that we actually have the new binder channel, we could start directing people there...
Up until recently the binder technical discussions were happening on the jupyterhub channel. I had no idea that you, @yuvipanda and @minrk were discussing on the old binder channel as of the past two days.
Sorry about that @willingc. I think people are still showing up in the original binder gitter for support questions, and other discussion can bleed in there while that's going on. We should make sure we have links pointing to the new room in all the right places.
I don't think we should drop Dockerfiles right now, I just want to make sure that docs pointing users to Dockerfiles (like https://github.com/jupyterhub/binder/pull/11) emphasize that:
For this reason, a /repo/git-ref/ binder link containing a Dockerfile rarely if ever makes sense, because it can't be expected to work for very long.
If we change the requirements for what's in a container (e.g. #103), then we might be able to be more lenient about what makes a valid Dockerfile for binder.
sounds good - I'll update the Dockerfile docs PR to make sure these thoughts are included in it
As I've watched the size of the docker-stacks images grow and grow, I've often wondered if we should provide some kind of web UI that lets users pick high-level features they want (languages, kernels, conda-forge packages, etc.) and get what we'd consider a well-formed Dockerfile (and/or environment.yaml, requirements.txt, ...) for their personal build environment, or now binderhub. It feels like this would be one way to trim the docker-stacks images back to the opinionated, starter pack concept and perhaps help people create good environment definitions for use in binderhub, for whatever definition of good we have at the moment.
I recognize this isn't a panacea: whatever the tool emits will certainly need updates over time as best practices shift.
@parente I think something like this is a good idea as well...for example see the neurodocker project:
https://hub.docker.com/r/kaczmarj/neurodocker/
Basically it lets you specify things that you want in a docker image (in this case all centered around neuroimaging) and then it creates a Dockerfile for you.
Thanks for the ref @choldgraf. I haven't done much looking into what exists and I suspect there are others. Perhaps I'll do a bit of research.
One thing I was hoping to avoid with the suggestion of a web UI is the invention of yet another file format or API/CLI that needs to be spec'ed and maintained. My current thinking is that the cost of a web app (UI design, hosting, operations) is better known and easier to contain.
https://phpdocker.io/generator is in the vein of what I'm suggesting, minus the techie options like port.
Understanding, using, and supporting containerization and the open container standard as a portable compute model definitely seems to have legs in the data science community in both academia and industry. Supporting this standard as a first-class citizen in the binder platform seems like a reasonable goal.
It's super easy to create plenty of headaches in the existing approach anyway (e.g. try mixing some conda channels with conflicting opinions about system libraries), and to end up tracking down incompatibilities or build failures because someone's spatial data analysis package needs a different version of the postgis driver or whatnot. If people bring their own container, then you can reasonably punt on the maintenance issue while letting them bring containerized solutions from their particular community, which is probably best placed to answer such thorny dependency config issues anyhow.
Yep I agree @cboettig - it seems that we are converging on allowing containers, but treating them as advanced use-cases where YMMV on how they interact with BinderHub (if at all). There's some extra magic we can add in if people use the "build files" approach that we won't be able to add in with the Dockerfiles approach, but I think that's fine.
@minrk are you OK with me closing this until it makes sense to open it back up for discussion in the future? Right now it sounds like removing Dockerfile support would be fairly difficult from a social standpoint, and we now have documentation that tries to make it clear how / when to use dockerfiles.
Let's go ahead and close it. We can open a new issue that limits usage or functionality at a later time.