Hi!
Our BinderHub lives in an on-prem Kubernetes cluster, which requires a proxy to access the internet. I didn't find an option to specify which proxy to use throughout all BinderHub-related pods, and the conventional HTTP[S]_PROXY, NO_PROXY or their lowercase variants don't seem to be honored. Without a proxy most builds fail either before starting a build (you cannot resolve GitHub-URLs, for example), or during it (because you cannot download any dependencies).
For now, we have worked around this by patching a few Python files in the respective images, which is admittedly ugly. We changed: 1) the calls to tornado's httpclient in BinderHub itself, 2) the environment variables in the build pod definitions and 3) the Docker build arguments in repo2docker. The last one is obviously not part of BinderHub itself, but since the build pods are set up by BinderHub, it would be nice to have a way to pass a proxy configuration on to repo2docker. Perhaps another issue has to be opened for the repo2docker side of this integration.
Is there any work being done towards this, or did I perhaps miss a configuration option? I could contribute code for this, but without any promises about when it's finished.
Thanks for posting and setting up a BinderHub behind a proxy!
Is there any work being done towards this, or did I perhaps miss a configuration option?
I am not aware of any work, AFAIK you are the first person to stop by who has this need.
I could contribute code for this, but without any promises about when it's finished.
I think that is par for the course in open-source. In light of this it might be worth starting by making a list of all the places that need changing and any dependencies between them. Basically trying to make a plan where each step can be tackled by one person and merged one at a time. This way it doesn't matter if the person who started has to switch to other work.
It would also be worth exploring what options kubernetes provides to redirect/proxy traffic on a "whole cluster" scale. I've never investigated this but it would be attractive because we need to modify less code and it could cover things like non-HTTPS traffic as well (git clone [email protected]...) and we already use https://kubernetes.io/docs/concepts/services-networking/network-policies/ to restrict traffic.
(Just discussing this issue IRL with @betatim @minrk)
Here are the links to changes @bdrian needed to do, including some changes to try out kaniko:
in repo2docker-image:
https://zivgitlab.uni-muenster.de/a_broe10/jupyter.repo2docker/blob/master/patches/app.py.patch#L42
in binderhub-image:
https://zivgitlab.uni-muenster.de/a_broe10/jupyterhub.k8s-binderhub/blob/master/patches/build.py.patch#L44
https://zivgitlab.uni-muenster.de/a_broe10/jupyterhub.k8s-binderhub/blob/master/patches/tornado-httpclient.py.patch
Probably also possible with Kubernetes (metadata configuration for the pods)
Here are the links to changes @bdrian needed to do, including some changes to try out kaniko:
These changes are really not a good example of what could be integrated into BinderHub, but rather ugly hacks for testing things out. The last one in particular has to be changed; a clean way would be to provide a proxy (if configured) when making calls to the outside world, like talking to GitHub.
It would also be worth exploring what options kubernetes provides to redirect/proxy traffic on a "whole cluster" scale. I've never investigated this but it would be attractive because we need to modify less code and it could cover things like non-HTTPS traffic as well (git clone [email protected]...) and we already use https://kubernetes.io/docs/concepts/services-networking/network-policies/ to restrict traffic.
I can see the appeal of modifying less code to reach the same goal, but I'm not aware of any possibilities in Kubernetes itself to make this happen. Non-HTTP traffic can also be supported through an HTTP proxy by using HTTP CONNECT, which is also a common way of supporting HTTPS connections.
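To illustrate the HTTP CONNECT point, here is a minimal sketch of the request a client sends to a proxy to ask for a raw TCP tunnel. The function name `build_connect_request` is hypothetical and only builds the bytes; it does not open a connection. Whether a given proxy allows tunnels to ports other than 443 depends on how that proxy is configured.

```python
def build_connect_request(target_host: str, target_port: int) -> bytes:
    """Return the bytes of an HTTP CONNECT request for host:port.

    After the proxy answers with a 2xx status, the same socket carries
    whatever protocol the client wants to speak (TLS, SSH, ...).
    """
    authority = f"{target_host}:{target_port}"
    lines = [
        f"CONNECT {authority} HTTP/1.1",
        f"Host: {authority}",
        "",  # empty line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

In the Python standard library, `http.client.HTTPConnection.set_tunnel()` performs this handshake when connecting through a proxy.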
In light of this it might be worth starting by making a list of all the places that need changing and any dependencies between them. Basically trying to make a plan where each step can be tackled by one person and merged one at a time.
Thanks for the suggestion! I'll try to make such a list soon.
"soon"â„¢
Overall, proxy settings would be important in the following places:
Web calls in BinderHub are somewhat special, because the tornado library it uses doesn't support proxy environment variables (https://github.com/tornadoweb/tornado/issues/754).
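Since tornado does not read the proxy variables itself, the calling code has to translate them. A minimal sketch, assuming the proxy is given as a full URL with an explicit port in `https_proxy` or `HTTPS_PROXY`; the helper name `proxy_kwargs_from_env` is hypothetical and this is not what BinderHub currently does:

```python
import os
from urllib.parse import urlsplit


def proxy_kwargs_from_env(environ=None):
    """Translate https_proxy / HTTPS_PROXY into proxy_* keyword arguments.

    Returns an empty dict when no proxy is configured.
    Assumes a full URL such as http://user:pw@proxy.example.org:3128.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("https_proxy") or environ.get("HTTPS_PROXY")
    if not raw:
        return {}
    parts = urlsplit(raw)
    if not parts.hostname or parts.port is None:
        raise ValueError(f"proxy URL needs a host and a port: {raw!r}")
    kwargs = {"proxy_host": parts.hostname, "proxy_port": parts.port}
    if parts.username:
        kwargs["proxy_username"] = parts.username
    if parts.password:
        kwargs["proxy_password"] = parts.password
    return kwargs
```

The keys match the `proxy_host`, `proxy_port`, `proxy_username` and `proxy_password` arguments of tornado's `HTTPRequest`, so the result could be passed as `HTTPRequest(url, **proxy_kwargs_from_env())`. Tornado only honors these arguments with `curl_httpclient`, so the client would also have to be switched via `AsyncHTTPClient.configure("tornado.curl_httpclient.CurlAsyncHTTPClient")`.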
BinderHub does not necessarily have to pass its own environment variables to repo2docker's pod. Alternatively, one could use a Pod Preset in one's own cluster, or bundle it with BinderHub's Helm chart.
In repo2docker most (all?) things work automatically once the right environment is set: urllib, which repo2docker uses, takes care of the proxy settings, as does git. For the build itself, repo2docker has to pass its proxy environment to the build containers through build_args.
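The build argument part could be sketched as follows. This is a minimal illustration, not repo2docker's actual code; the helper name `proxy_build_args` is hypothetical, and mirroring each value to both spellings is a design choice made here because tools inside the build may read only one of them.

```python
import os

# Docker predefines these names as build args, so a Dockerfile does not
# need an ARG instruction for them.
PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy")


def proxy_build_args(environ=None):
    """Collect proxy settings from the environment as Docker build args."""
    environ = os.environ if environ is None else environ
    args = {}
    for name in PROXY_VARS:
        # Prefer the lowercase spelling, fall back to uppercase.
        value = environ.get(name) or environ.get(name.upper())
        if value:
            args[name] = value
            args[name.upper()] = value
    return args
```

With the Docker SDK for Python, a dict like this would be passed as the `buildargs` parameter of the build call.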
A possible implementation for BinderHub's side of things follows.
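As a minimal sketch of the build pod part only, and not the patch linked earlier in this thread: the proxy variables of the BinderHub process can be turned into `env` entries for the build container. The helper name `proxy_env_for_pod` is hypothetical, and plain dicts stand in for the Kubernetes client's `V1EnvVar` objects to keep the example self-contained.

```python
import os

PROXY_NAMES = {"http_proxy", "https_proxy", "no_proxy"}


def proxy_env_for_pod(environ=None):
    """Return the proxy variables that are set as pod-spec env entries.

    Each entry has the {"name": ..., "value": ...} shape used by the
    `env` list of a container in a Kubernetes pod spec.
    """
    environ = os.environ if environ is None else environ
    return [
        {"name": name, "value": value}
        for name, value in sorted(environ.items())
        if name.lower() in PROXY_NAMES and value
    ]
```

The returned list would be appended to whatever `env` the build container already gets, so that repo2docker inside the pod sees the same proxy settings as BinderHub.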
This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:
https://discourse.jupyter.org/t/binder-behind-outbound-proxy/7428/2