Hello all, this is somewhat related to #304
I am trying to build a custom orchestration/spawner/launcher system for deploying a distributed cluster. Ideally this system would manage session security, to prevent arbitrary clients from connecting to a scheduler. (We're interested in this from the perspective of a multi-user host.)
Because this is all based on Tornado, it would be very convenient to be able to just use ssl_options and communicate SSL Client Certificates before launching anything in distributed. However this doesn't work for the Client object, as this keyword isn't exposed. It looks like ssl_options would need to be propagated to the Client somehow from this line up. (And maybe for Scheduler and Worker as well, I haven't read those implementations very closely yet.)
Based on reading issues/blogs/etc it looks like managing security isn't an immediate (or even necessarily future) goal of distributed/dask, but would it be acceptable to expose a little more of Tornado's framework to permit programmatic management of SSL for more custom deployments? Ideally distributed could just pass some kwargs around and not consider the problem further.
If you're amenable to the idea, I would be happy to work towards a PR (given a little API guidance).
Edit: I see that there is an existing PR #537 that sort of addresses this, but it looks like it may have stalled out a little, and it doesn't seem to have a way to pass the ssl_options into TCPClient.connect in the same way that would be natural for the TCPServer objects.
Definitely amenable to a PR helping dask.distributed to pass through Tornado keywords :)
So I've had more time to look over how dask.distributed is structured, and I think it's probably worth addressing a couple API questions before moving forward.
Both Scheduler and Worker appear to act as both servers and clients in specific situations. There's no real problem with that, but it does impact the meaning of ssl_options for the higher-level interface of dask.distributed.
For brevity I'll use the Tornado base classes from here on:
TCPServer - (Scheduler, Worker)
TCPClient - (distributed.core:connect used by Client, Scheduler, and Worker)
Note: while the SSL/TLS client/server relationship doesn't need to reflect who initiated the socket connection, Tornado assumes that the initiator is always the client, so the above classes also serve to represent the SSL/TLS notion of client and server.
The "problem" is that the ssl.SSLContext (ssl_options) object needs to be configured on both sides of the socket for things to work. This means that passing ssl_options via **kwargs down to the TCPServer (as is currently possible) won't work because TCPClient doesn't realize it needs to handshake. With that in mind there are a couple options:
The same ssl_options are passed to the TCPServer and TCPClient (a la #537). This would work, however it means you have mutual authentication. It also means that whatever certificate you use needs to be signed in the context of a server and a client (from the TLS standpoint).
This would work completely fine for me as I only need message authentication within the network (effectively using the same ad-hoc SSL cert to generate a shared secret for message digests), but I don't want to try and push something upstream that only fits my purposes.
If we wanted to support using SSL/TLS in a more traditional way (e.g. some unauthenticated client system authenticates an institutional scheduler^^^) we would need to differentiate between a TCPClient and TCPServer context. That way the TCPClient could request authentication of the TCPServer, but the TCPServer wouldn't authenticate TCPClients.
As an example, this would be something like client_ssl_options and server_ssl_options. This seems like a good middle ground approach, as mutual authentication is still possible, as is the more typical server-only authentication.
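To make the client/server split a bit more concrete, here is a minimal sketch using only the standard library ssl module. The helper names make_server_context and make_client_context are hypothetical and not part of distributed or Tornado; the assumption is that the resulting contexts would be handed to Tornado as ssl_options on the TCPServer and TCPClient side respectively. Certificate loading is left as a comment because the file paths depend on the deployment.

```python
import ssl


def make_server_context(require_client_cert=False):
    """Hypothetical helper: context for the listening (TCPServer) side."""
    # Purpose.CLIENT_AUTH yields a server-side context.
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # A real deployment would also call ctx.load_cert_chain(certfile, keyfile)
    # here, and ctx.load_verify_locations(cafile) when verifying clients.
    if require_client_cert:
        # Mutual authentication: the server demands a client certificate.
        ctx.verify_mode = ssl.CERT_REQUIRED
    else:
        # Server-only authentication: clients stay anonymous at the TLS level.
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def make_client_context():
    """Hypothetical helper: context for the connecting (TCPClient) side."""
    # Purpose.SERVER_AUTH yields a client-side context that verifies the server.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    # For mutual authentication the client would also call
    # ctx.load_cert_chain(certfile, keyfile) with its own certificate.
    return ctx


server_ssl_options = make_server_context()                     # server-only auth
mutual_ssl_options = make_server_context(require_client_cert=True)
client_ssl_options = make_client_context()
```

With two separate contexts, both policies fall out of how the server side is configured, and the client side does not have to change its verification of the server.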
The above still isn't the most flexible option, as the above approach is insensitive to the type of TCPServer that it is connecting to as a TCPClient. This means that if you needed to support unauthenticated Workers, but authenticated Schedulers, you would have to disambiguate ssl_options for when you are connecting to another Worker (no ssl at all (or maybe a different configuration)) and when you are connecting to a Scheduler (ssl needed with authentication). I can only imagine this being interesting if you were trying to build something like a \
Given the above, it seems like there are a couple reasonable API options:
1) Give up and do nothing, solving it the "right" way isn't worth the effort.
2) Just enable passing ssl_options down to distributed.core:connect and take the approach of mutual authentication or none at all.
3) Add client_ssl_options and server_ssl_options (or some variant thereof) to the constructors of Client, Scheduler, and Worker. (My current understanding is the Client wouldn't actually need server_ssl_options)
4) Create a method to set the SSL options instead of using the constructor. e.g. set_ssl_options(ssl_mode: {'server', 'client'}, ssl_options). I think this may be a really good approach as it doesn't change the constructor API and it enables you to in the future decide that this was a terrible idea and deprecate the method altogether. Or if it proves very useful, extend the method with another keyword enabling differentiating ssl_options based on who you are talking to (Worker -> Worker vs Worker -> Scheduler).
5) Config file and skip the constructor. This can be extended as needed, but it makes dynamic management weird.
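As a rough illustration of option 4, the method could look something like the sketch below. Everything here is an assumption for discussion purposes: SSLOptionsMixin, get_ssl_options, and FakeWorker are hypothetical names, and nothing in it reflects how distributed actually stores or uses these options.

```python
import ssl


class SSLOptionsMixin:
    """Hypothetical mixin: stores one ssl_options object per TLS role."""

    _SSL_MODES = ("server", "client")

    def set_ssl_options(self, ssl_mode, ssl_options):
        # Reject unknown modes early so typos do not silently disable TLS.
        if ssl_mode not in self._SSL_MODES:
            raise ValueError("ssl_mode must be 'server' or 'client'")
        if not hasattr(self, "_ssl_options"):
            self._ssl_options = {}
        self._ssl_options[ssl_mode] = ssl_options

    def get_ssl_options(self, ssl_mode):
        # None means "no TLS configured for this role".
        return getattr(self, "_ssl_options", {}).get(ssl_mode)


class FakeWorker(SSLOptionsMixin):
    """Stand-in for a Worker; it acts as both a server and a client."""


worker = FakeWorker()
worker.set_ssl_options("server", ssl.create_default_context(ssl.Purpose.CLIENT_AUTH))
worker.set_ssl_options("client", ssl.create_default_context(ssl.Purpose.SERVER_AUTH))
```

The listening code would then look up the "server" entry and the connecting code the "client" entry. Extending this later with a peer-type keyword (Worker -> Worker vs Worker -> Scheduler) would only change the lookup key.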
At the end of the day, I don't want to make dask.distributed handle the SSL/TLS itself, but a naive plumbing of kwargs isn't going to really be useful either. So it seems like a middle ground approach which understands only enough to pass different ssl_options down to the appropriate Tornado objects would be ideal.
I hope the above is a fair-shakedown of the situation given my still limited understanding of the codebase. And once again, I am more than happy to do the legwork on this, but I don't want to start something if you guys aren't on-board.
^^^ An aside: Conversely, it also seems reasonable that an institution would desire to authenticate client systems to prevent public access, however this would require something like a scheduler plugin to recognize when a Client issues a command and then some institutional logic would check the TCPClient certificate and accept/reject the processing of the message from there. (This doesn't sound possible at this time and the plugin would need an opportunity to observe the stream object to get the certificate information, at that point, it seems wiser to just have "first class" ssl support)
Also, sorry for the really long write-up. I don't want to discourage this idea by making it sound hard, but I also don't want to push dask.distributed into a corner by implementing option 2 and calling it a day without further exploration.
cc @hussainsultan @pitrou
I'm still not sure what kind of model would make most sense. The general question is what security people are expecting, at a higher level. For example, should both ends of a connection always be authenticated?
@pitrou For your given hypothetical example, even if it was decided that mutual authentication is required, it still doesn't necessarily provide a guide on how to implement.
My example of option 2 (which is a very easy way to enforce mutual auth) requires the cert to be signed in two contexts, which is a little "weird", most of the time you would have separate certs for the client and server parts of the TLS handshake. This means someone else will inevitably request the differentiation of the ssl_options so that they can use separate certs for client and server mode (which as a side-effect happens to enable more security policies than just mutual auth)
It strikes me that SSL/TLS can be used to enforce almost any arbitrary security policy assuming you have granular enough control over the ssl_options. Is it necessary to even have a high-level notion of security, or can it be left strictly to initial setup and SSL/TLS?
If we wanted to plan our security independently of the notions of "client" and "server", we could do that (for example with a starttls-like mechanism inside the protocol). It would be more work, but still doable.
I agree that as a first approximation, we can use the same cert for every communication of a given role (either Client, Scheduler or Worker).
I like the idea of some kind of starttls message, and going in that direction would give some really obvious places for hooks to do more granular authentication, but that sounds like a very big design choice.
I can get started on a first-approximation (single-cert, pass ssl_options from each constructor) as this would at least make the existing undocumented ssl_options work for some security policies.
If we are ok with going that route, does it make more sense to make it an explicit keyword parameter (necessitating an order in the signature), or should it just be something that is pop-ed from **kwargs (and could even remain undocumented for now)? I think I would have to add **kwargs to Client in that case, but the other two constructors can stay the same.
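For comparison, here is a small sketch of the two signature styles in question. Both classes are hypothetical stand-ins, not the real Client, Scheduler, or Worker constructors, and the address parameter is only there to give the signatures some shape.

```python
class ExplicitKeyword:
    """Hypothetical: ssl_options is a documented, named parameter."""

    def __init__(self, address, ssl_options=None, **kwargs):
        self.address = address
        self.ssl_options = ssl_options
        self.extra = kwargs  # whatever remains is forwarded elsewhere


class PoppedKeyword:
    """Hypothetical: ssl_options is quietly pulled out of **kwargs."""

    def __init__(self, address, **kwargs):
        self.address = address
        # pop() removes the key so it is not forwarded a second time.
        self.ssl_options = kwargs.pop("ssl_options", None)
        self.extra = kwargs


a = ExplicitKeyword("127.0.0.1:8786", ssl_options="ctx", timeout=3)
b = PoppedKeyword("127.0.0.1:8786", ssl_options="ctx", timeout=3)
```

Callers who pass ssl_options by keyword see the same behaviour either way; the difference is mostly whether the parameter shows up in the signature and docs.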
I can think of two options:
SSLContext parameter (and only that)
Either way, I am busy refactoring the I/O layer, so it's not a good idea to start work on this immediately.
Given http://distributed.readthedocs.io/en/latest/tls.html, this can be closed, I think.