Distributed: Config setting to disallow `run_on_scheduler`

Created on 6 Mar 2019 · 9Comments · Source: dask/distributed

We run an adaptive deployment wherein the dask Scheduler runs in an environment with elevated permissions compared with the workers. Consequently, we'd rather not have an easily accessible method for accidentally executing code sent from a worker on the scheduler.

I know that this functionality was introduced mainly for debugging purposes in https://github.com/dask/distributed/pull/808, so was hoping we could introduce a config setting that turned this feature off (probably by just implementing a simple if / else in https://github.com/dask/distributed/blob/master/distributed/scheduler.py#L3069).

I'm happy to make the PR but wanted to open this issue for discussion before doing so.

Source

cicdw

All 9 comments

Yes, that seems reasonable to me. To be honest this feature request is long anticipated :)

You'll also want to look everywhere that pickle.loads is called, which includes the run function in worker.py::run as well as scheduler.py::Scheduler.feed (this is currently used by the Progress bars).

One rather comprehensive way to solve this problem would be to make the change in pickle.loads, which lives in distributed/protocol/

mrocklin on 6 Mar 2019

👍1

Ah ok nice!

Re: the comprehensive solution, it seems I'd have to update pickle.loads to only block if its being run on the scheduler? Is there an obvious way I could infer that?

cicdw on 6 Mar 2019

My thought was that you would block pickle.loads in config generally, and then only set that config on the process that's running the scheduler.

mrocklin on 6 Mar 2019

It would also be totally fine to handle this on the Scheduler itself. In that case you would probably remove some of the less safe routes in the constructor.

mrocklin on 6 Mar 2019

More generally, we might add a blocked_handlers keyword to core.py::Server. We would check that collection when we receive a message before we grab the handler. This would be pretty configurable, and would let us do things on the workers if we needed to.

As an example, I can imagine someone wanting to remove the shutdown route as well on the scheduler so that clients weren't allowed to shut down the cluster. This is similar to your running functions concern, but wouldn't be handled by a code fix that targetted just the run and feed functions. It would be nice to solve this problem and whatever future problem arises at the same time.

mrocklin on 6 Mar 2019

Interesting; so that could be as simple as providing a list of strings to the blocked_handlers kwarg, and then pop those keys from self.handlers in the Server init, correct?

And then layering a config setting on top that automatically adds ['terminate', 'feed', 'run_function'] to the blocked_handlers for the Scheduler.

cicdw on 7 Mar 2019

Interesting; so that could be as simple as providing a list of strings to the blocked_handlers kwarg, and then pop those keys from self.handlers in the Server init, correct?

Could pop, or could check the blocked_handlers on each message and raise an informative error.

if op in self.blocked_handlers:
    raise Exception("The op OP has been explicitly disallowed in this TYPENAME, possibly due to security concerns")
else:
    func = self.handlers[op]
    ...

mrocklin on 7 Mar 2019

👍1

And then layering a config setting on top that automatically adds ['terminate', 'feed', 'run_function'] to the blocked_handlers for the Scheduler.

Before we bake in a config setting I'm inclined to see if we might just have the blocked handlers as the config, and then have docs that encourage people to use a particular set in the scheduler. I'm trying to avoid having many small special cased options and instead have a few relatively general purpose ones until we see very common patterns emerge.

mrocklin on 7 Mar 2019

👍1

Dumb question: I implemented a check + raising a ValueError here: https://github.com/dask/distributed/blob/master/distributed/core.py#L333

And attempted to modify this test to catch the ValueError. The _behavior_ of the test changed (the comm was closed), but the ValueError wasn't raised anywhere that I could see.

I'll admit I was just pattern matching for the tests; a nudge in a more nuanced direction for writing better tests would be appreciated.

edit: Nvm, I think I have a satisfactory solution.

cicdw on 7 Mar 2019

Was this page helpful?

0 / 5 - 0 ratings