This is a meta issue to discuss scaling JupyterHub up to a potentially much larger number of users. In particular, since a single hub can only support a couple thousand users, it would be nice to coordinate many hubs. A single hub is also a single point of failure. @yuvipanda came up with the following (which I'll attempt to describe), and I've looked at implementing it.
Instead of having one hub, we want many hubs, each with its own users.
Once a user has reached a hub, they should always get to that same hub again. The issue is with authentication.
To know which hub a user should go to, you need to authenticate them. To authenticate, you need to reach a hub. So what we can do is deploy a fleet of hubs, with one specific hub responsible only for authentication and for telling the proxy to dispatch the user to a given hub. Here is a schema:
                           |
               +-----------+
               |           |
               |           v
               |  +-------------------------+  Cookie Set
               |  |                         |
               |  | Configurable HTTP Proxy +---------+------------+---...
               |  |       (aka CHP)         |         |            |
               |  +-------------------------+         |            |
               |           |                          |            |
               |           |                          v            v
 Set Cookie    |           | No Cookies          +---------+  +---------+
 and redirect  |           |                     |         |  |         |
               |           |                     |  Hub A  |  |  Hub B  |
               |           |                     |         |  |         |
               |           |                     +---------+  +---------+
               |           v
               |  +--------------------------+
               |  |                          |
               |  |      Hub Dispatcher      |
               |  |                          |
               |  |  - Authenticate          |
               |  |  - Which Hub For User    |
               +--+                          |
                  +--------------------------+
                               ^
                               |
                               |
                               v
                  +--------------------------+
                  |                          |
                  |   DataBase or User/Hub   |
                  |                          |
                  +--------------------------+
Note, (see SVGBOB)
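To make the routing decision in the schema concrete, here is a minimal sketch of just the "cookie set / no cookie" branch. It is not CHP's actual API; the cookie name `hub-route`, the hub names, and all URLs are hypothetical placeholders.

```python
from typing import Mapping

# Hypothetical addresses and cookie name, for illustration only.
DISPATCHER = "http://hub-dispatcher:8000"
ROUTING_COOKIE = "hub-route"
HUBS = {
    "hub-a": "http://hub-a:8000",
    "hub-b": "http://hub-b:8000",
}


def pick_target(cookies: Mapping[str, str]) -> str:
    """Return the upstream that should handle a request.

    A request carrying a routing cookie that names a known hub goes to
    that hub. A missing or unknown cookie falls back to the dispatcher,
    which authenticates the user and sets the cookie.
    """
    hub_name = cookies.get(ROUTING_COOKIE)
    return HUBS.get(hub_name, DISPATCHER)
```

In a real deployment the cookie value would also need to be signed so that a user cannot pick a hub by editing it.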
The hub dispatcher is kinda like a hub, except it only:
We want to minimize changes to the current authenticators; if possible, no change at all.
I'm diving through the code, which I haven't touched in a while. Seemed relatively easy at first glance but I'm unsure it actually is.
1) An Authenticator can register arbitrary routes for login; OAuth authenticators typically have /oauth_login and /oauth_callback. As we don't really know which handler does what, we can't blindly rewrite the response to set_cookie.
2) Is there any authentication flow where the (final) hub actually needs the credentials of the user who is going to connect (e.g. to decrypt a home dir)? In that case, just carrying the token in a cookie with a shared secret does not help, and we would need HubDispatcher<->Hubs communication?
3) It looks like we can make that a pure fake Authenticator (dynamically subclass and override/extend set_login_cookie). But that feels dirty. Do we want to "hack the Hub" to serve only as a login node, potentially exposing more services (I'm not a huge fan), or start conservatively with an application that "just" exposes the authentication flow?
4) Alternatively, don't distinguish login nodes from normal hubs, and just have authentication set an accepted but "not me" value?
Thoughts welcome.
This is great!
Some additional notes:
I think we should use nchp instead of chp to avoid a single point of failure, like @yuvipanda said.
I haven't really looked into the authentication code, but I'm really interested in the scaling-up work.
I have thought about this issue before. I think we need a way to share the redirect path. I would prefer an approach where any hub can serve any user, instead of tying each user to one hub.
@zsluedem because of the state in a given Hub corresponding to a user (it's not all in the database, so things will go awry if multiple Hubs try to manage the same user), it is important that a given user always be routed to the same Hub.
The lightest-possible implementation for me is a small, dedicated application for authentication that does:
And use a special Authenticator in the Hubs that talks to this service, rather than trying to set cookies for the Hubs on the dispatcher. There is an example of this that uses the Apache REMOTE_USER header with shibboleth, which is the sort of pattern I would probably choose:
In particular, I would probably not base the dispatcher on JupyterHub. At least, I would probably not allow it to set the cookies that the Hubs would set. To do that, you need to make sure that the dispatcher and all Hubs are talking to the same database and use the same cookie secret in order for the cookies set by the dispatcher to be transferable. Instead, a dedicated cookie/token that is understood by the dedicated Authenticator is probably simplest. You could use an Authenticator object to do the login in the dispatcher, but I'm not sure how much that gets you, since tornado and nginx, etc. tend to have their own support for things like Google OAuth already.
is there any authentication flow where the (final) hub actually need the credential of the user that is going to connect
In theory, eventually. But not at the moment, so I would ignore this for now. If you want this, I think the dispatcher does have to use a JupyterHub Authenticator, and can then store the response for the Hub's "AskTheDispatcherAuthenticator" to retrieve later.
I made a sketch of a simple tornado oauth application that logs in with Google. It's pretty simple, and probably a good deal simpler than anything that tries to integrate more deeply with JupyterHub.
There would be a corresponding Authenticator that uses the token to identify users and trigger the regular login process.
+1 on the cookie being set by the hub itself - the dispatcher should only set a meta-cookie that can be authenticated by the dispatcher proxy (which probably would be a thing by itself - not even just nchp). The hubs wouldn't know about the dispatcher directly, and the dispatcher wouldn't know about the hubs directly either. Something based off the REMOTE USER authenticator could work - I don't see it verifying the authenticity of the User header (but maybe I missed how Shibboleth works?) so any user who can hit the hub directly can pretend to be whoever. Trivially fixable tho.
While I do agree that we can make this simpler by not making it compatible with Hub's Authenticators, I think that'll be a long term maintenance burden. It'll also make scaling from 1 hub to 2 much easier. I think being able to reuse hub authenticators should be a hard requirement...
This looks very similar to the structure we use for deploying JupyterHub on Quantopian. I'm giving a talk about this at JupyterCon in August, but here's the rough cliffnotes:
We have an application we call the "hub discovery" service, that we use to persistently map users to jupyterhub instances. When a user goes to quantopian.com/research, our frontend (which is actually a Ruby on Rails app), sends a request to the discovery service asking which hub to route the user to. If the user has already been allocated, then the discovery service just returns the uri for their hub's server (which is running a CHP, and a JupyterHub using a heavily-customized Dockerspawner). If the user hasn't been allocated, then the discovery service chooses the hub server with the smallest number of allocated users. Once the frontend has the uri for the server, it renders an iframe for that server, and from that point on it's just a regular JupyterHub connection. Our hubs are totally ephemeral (we store users' notebooks in PGContents), so we actually just use in-memory sqlite for our hub db. I looked in the early stages of the project at having a single database shared between multiple hubs, but I wasn't sure I could make it work without major changes. Having no shared state between the hubs sidesteps a large class of potential problems.
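As a rough illustration of the allocation rule described above (sticky mapping, otherwise the hub with the fewest allocated users), here is a minimal in-memory sketch. It is not Quantopian's implementation; the class name, method names, and URIs are all hypothetical, and a real service would persist the mapping.

```python
from typing import Dict


class HubDiscovery:
    """Toy user-to-hub mapping; a real service would persist this state."""

    def __init__(self) -> None:
        self._hub_uris: Dict[str, str] = {}      # hub name -> hub URI
        self._user_to_hub: Dict[str, str] = {}   # username -> hub name

    def register_hub(self, name: str, uri: str) -> None:
        """Called by a hub on startup to announce itself."""
        self._hub_uris[name] = uri

    def hub_for(self, username: str) -> str:
        """Return the URI of the user's hub, allocating one if needed."""
        if username not in self._user_to_hub:
            if not self._hub_uris:
                raise LookupError("no hubs registered")
            # Count allocated users per hub, including hubs with none.
            counts = {name: 0 for name in self._hub_uris}
            for hub_name in self._user_to_hub.values():
                counts[hub_name] += 1
            # Pick the least-loaded hub; ties go to the earliest registered.
            target = min(counts, key=lambda name: counts[name])
            self._user_to_hub[username] = target
        return self._hub_uris[self._user_to_hub[username]]
```

A frontend would call `hub_for(username)` on each visit and embed or redirect to the returned URI; repeated calls for the same user always return the same hub.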
A couple implementation notes that might be of interest:
JupyterHub.initialize in our subclass to make the hub register itself with the discovery service on startup. The uri and credentials for discovery are part of our hub's configuration.
@ssanderson awesome! That sounds perfect. I look forward to hearing more at JupyterCon.
That's awesome, @ssanderson!
The one big difference seems to be that your hubs are somewhat interchangeable because of pgcontents, which makes things a lot easier!
The one big difference seems to be that your hubs are somewhat interchangeable because of pgcontents, which makes things a lot easier!
Yup. I think the two important differences are this and the fact that we already had a separate frontend service to act as the top-level proxy for routing users to the right hub. This is nice because it means we don't have to have any client-side code for hub-routing; our rails app just renders an iframe with the right hub's URI embedded in the larger page.
So I gave it a try as a custom App that accepts any authenticator. It ends up duplicating almost half the code of JupyterHubApp, so I'm unconvinced it is the right way. It does have a custom proxy-authenticator and no-op spawner that could work, though.
@ssanderson it seems like a very practical way to deploy!! Looking forward to the con now.
@minrk
because of the state in a given Hub corresponding to a user (it's not all in the database, so things will go awry if multiple Hubs try to manage the same user), it is important that a given user always be routed to the same Hub.
Is there a way to store all of the state in a database so that each of the hubs can access it?
I found that we can use a service discovery tool like consul or zookeeper, as @ssanderson said above, to store the state that each hub holds, so that a given user can be routed to a different Hub.
Wow, that's a lot of stuff you had to keep, @Carreau! Let's see what we can decouple out of JupyterHub.
A note on timelines - I'm perfectly happy for us to run HubDispatcher experimentally and with fast-changing requirements for the short term. I'd want us to get JupyterHub 0.8 out the door asap first, since it has a lot of really good changes and it's been a while since our last release.
@Carreau would you be at BIDS tomorrow?
Wow, that's a lot of stuff you had to keep
Yes, there might be a way to remove some, but there is a lot of coupling.
It works, but the dispatcher still needs to spawn something and poll for it. I need to subclass User for that (thinking of making that an option).
@Carreau would you be at BIDS tomorrow?
No, I'm in SF. I'm working with Paul on the Jupyter Talk. I'm thinking the earlier discussions are right: as a first pass we should do a completely different authenticator, work _toward_ decoupling in 0.9 or 0.10, and then have an easy migration path forward.
Yup, that makes sense, @Carreau!
I'll now figure out what kinda authenticator we'll need for our use case, and figure out a small separate service we can write for that.
Do we have this on the roadmap for a release version? Our notebook platform architecture currently uses JupyterHub, but there have been continuous questions about the hub being a SPOF. I understand that a hub restart does not affect logged-in users, but the mere fact of having a single-server deployment is making it difficult for people to accept.
I'll be more than happy to contribute if there is a story around this.
Hi @ckbhatt. I'll probably start working on this in a week or two. We want to deploy this sometime in October.
I've spent some time over the last few weeks working on some part of this. https://github.com/berkeley-dsep-infra/data8xhub is the beginnings of the infrastructure.
I've now gotten to an architecture where the users' home directories can actually be shared easily across hubs in a scalable way. So I can now load balance the hubs, sharding only the home directory locations. Hooray!
So how to load balance the hub? Me and @minrk brainstormed this earlier today, and here is a summary:
With this, we should be able to dynamically scale number of hubs up and down.
Adding a new hub:
Removing a hub:
Does this accurately capture what we talked about, @minrk?
I believe so. Some further details on (3) for the shared auth state: cookie authentication is handled by storing a UUID for each user as user.cookie_id. This is the value that is encrypted and stored in a cookie when a user logs in, and the value used to look up the user. When a user makes a request with a cookie, the value is decrypted (to cookie_id) and the Hub looks up the user in the database by cookie_id. If cookie_id were deterministic based on the username (e.g. HMAC(shared secret + username)), then the cookie_id would be the same on all Hubs for a given user.
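For illustration, a deterministic cookie_id along those lines could be derived as in the sketch below. The function name and the choice of SHA-256 are assumptions, and how this would be wired into JupyterHub's User model is not shown.

```python
import hashlib
import hmac


def deterministic_cookie_id(shared_secret: bytes, username: str) -> str:
    """Derive the same cookie_id for a username on every Hub.

    Any Hub that holds the same shared secret computes the same value,
    so no coordination between Hubs is needed. SHA-256 is an assumed
    choice of digest, not something the discussion specifies.
    """
    return hmac.new(
        shared_secret, username.encode("utf-8"), hashlib.sha256
    ).hexdigest()
```

Because the value depends only on the secret and the username, rotating the shared secret invalidates every user's cookie at once.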
Consequences of this:
The one piece we missed in the shared cookie scenario is that for the cookie_id lookup to work, the corresponding User row must already exist in the database. So instead of just overriding how cookie_id is set, I think you'll also need to override _get_user_cookie to work before the user exists, and perhaps set_login_cookie to write a cookie that has both name and digest, so that your get_user_cookie can check if the name matches the digest without having to do a lookup in the database. That's a bit more complicated and more overriding than we hoped.
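The "name and digest" check can be sketched independently of JupyterHub as below. The helper names and the `name:digest` layout are hypothetical; this only shows the value that would go inside the cookie, not the overrides of set_login_cookie or _get_user_cookie themselves.

```python
import hashlib
import hmac
from typing import Optional


def _digest(secret: bytes, username: str) -> str:
    return hmac.new(secret, username.encode("utf-8"), hashlib.sha256).hexdigest()


def make_login_cookie(secret: bytes, username: str) -> str:
    """Build a cookie value carrying both the name and its digest."""
    return f"{username}:{_digest(secret, username)}"


def username_from_cookie(secret: bytes, value: str) -> Optional[str]:
    """Return the username if the digest matches, without a database lookup."""
    # Split on the last colon: a hex digest never contains one,
    # so usernames that include colons still round-trip.
    username, sep, digest = value.rpartition(":")
    if not sep or not username:
        return None
    expected = _digest(secret, username)
    # Compare as bytes in constant time; bytes avoid a TypeError on non-ASCII input.
    if hmac.compare_digest(digest.encode("utf-8"), expected.encode("utf-8")):
        return username
    return None
```

Any Hub with the shared secret can validate such a value before the User row exists, which is the property the comment above is after.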
@minrk if the user is already running on the target hub, the row should exist there right? and in the 'source' hub (where authentication has just finished) the user would exist too (at least temporarily - we can perhaps delete it before doing the redirect).
I agree this is starting to feel somewhat fragile from the complexity.
if the user is already running on the target hub, the row should exist there right?
Ah, yes. I forgot that the user is only redirected across hubs when they are already running and thus guaranteed to exist on another hub. Then that should be fine and most of my comment is irrelevant.