Presto rolling restart

Created on 10 Dec 2018 · 9 comments · Source: prestodb/presto

Hi,

We have a Presto cluster, and there is usually no window where zero queries are running, so deploying new plugin versions typically disrupts users' work for a few minutes. I have two questions here:

  1. Is it possible to somehow mark the coordinator as not accepting any new queries - just finishing the ongoing ones?
  2. Is it possible to mark a worker as "decommissioned" so that it finishes all current tasks and doesn't accept any new ones - and can therefore be safely restarted?

Maybe there are some workarounds. One idea I have in mind:

When a new version of a plugin needs to be deployed, set up a parallel cluster running the newer version, route all new user queries to it, wait for the queries on the previous cluster to finish, and then shut the previous cluster down.

But this requires a lot of effort to implement. I'm just wondering if there is something available out of the box?

Thank you

All 9 comments

We have work planned related to this. @ggreg can describe it more.

> Is it possible to somehow mark the coordinator as not accepting any new queries - just finishing the ongoing ones?

There's no explicit way of doing this. You could do this with a proxy in front of the Presto clusters that's aware of when clusters are marked as inactive. There's a proxy implementation in https://github.com/prestodb/presto/tree/master/presto-proxy, but it doesn't yet support that mode of operation.

> Is it possible to mark a worker as "decommissioned" so that it finishes all current tasks and doesn't accept any new ones - and can therefore be safely restarted?

Yes, you can submit a PUT to /v1/info/state with "SHUTTING_DOWN" in the body. That will cause the worker to transition to a state where it will continue to process existing work but be unavailable for new work.
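For example, using Java's built-in HTTP client (the worker host and port here are placeholders for one of your nodes):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GracefulShutdown
{
    public static void main(String[] args)
            throws Exception
    {
        // Placeholder address of the worker to decommission
        String worker = "http://worker-1.example.com:8080";

        // The body is the JSON-encoded string "SHUTTING_DOWN"
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(worker + "/v1/info/state"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString("\"SHUTTING_DOWN\""))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
    }
}
```

The same request can of course be sent with curl or any other HTTP client.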

One thing to note: Presto will not communicate with nodes running a different version of Presto, so you cannot use a rolling restart to upgrade Presto.

> One thing to note: Presto will not communicate with nodes running a different version of Presto, so you cannot use a rolling restart to upgrade Presto.

Hm... this should enable a rolling upgrade, no? Updated nodes won't communicate with the older coordinator, but will connect to the new one once it becomes the master coordinator.

> One thing to note: Presto will not communicate with nodes running a different version of Presto, so you cannot use a rolling restart to upgrade Presto.

> Hm... this should enable a rolling upgrade, no? Updated nodes won't communicate with the older coordinator, but will connect to the new one once it becomes the master coordinator.

It depends on how you define "rolling upgrade". I guess this is a form of rolling, but to me this mode is more like migrating existing nodes to a new cluster.

Regarding (1), at Facebook, some significant deployments have multiple clusters. We currently rely on an external process, the gateway, that redirects queries to the coordinator that will actually execute them. The gateway dispatches a query simply by replying with an HTTP redirect to the URL of another endpoint better suited to handle the request. That endpoint can be a coordinator or another process that returns a further redirect.

The gateway uses an internal discovery service (something similar to Consul) to maintain a map from some attributes of a query, such as the schema, to coordinators. To drain a cluster, we mark its coordinator as draining in the discovery service so that the gateway ignores it and no longer redirects any queries to it. Then another process polls the list of queued and running queries until no blocking query is still running, where "blocking" means a query that must complete before the coordinator is considered drained. The coordinator can still execute queries, and we still send ping queries (simple Presto SQL statements) to monitor the health of the cluster.
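As a rough illustration, such a drain-watcher could poll the coordinator's /v1/query endpoint until nothing is left running. This is only a sketch: the host is a placeholder, and a real implementation would parse the JSON and distinguish blocking queries from ping queries rather than doing a substring check:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DrainWatcher
{
    public static void main(String[] args)
            throws Exception
    {
        // Placeholder coordinator address
        String coordinator = "http://coordinator.example.com:8080";
        HttpClient client = HttpClient.newHttpClient();

        while (true) {
            // /v1/query returns a JSON array of query infos, including their state
            HttpResponse<String> response = client.send(
                    HttpRequest.newBuilder(URI.create(coordinator + "/v1/query")).GET().build(),
                    HttpResponse.BodyHandlers.ofString());

            // Crude check: no query reports a RUNNING or QUEUED state
            String body = response.body();
            if (!body.contains("\"RUNNING\"") && !body.contains("\"QUEUED\"")) {
                System.out.println("Coordinator drained");
                return;
            }
            Thread.sleep(10_000);
        }
    }
}
```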

The gateway is a simple HTTP server that mainly figures out which coordinator should handle the query and then returns a redirect to the Presto client. There is no logic specific to Presto, which makes it easy to write in any language.

In your case, you could deploy a similar gateway process that:

  1. Maintains a list of active coordinators
  2. When you deploy a new version, lets the newly spawned cluster register itself with the gateway (through some service that could be as simple as DNS, etcd, Consul, ...)
  3. Dispatches queries by sending redirects according to some policy. For example, you can progressively redirect more queries to the new cluster, which acts as a canary, either by a simple percentage of queries or by looking at some attributes of the query (schema, source, client tags, ...)
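A bare-bones sketch of such a redirect gateway, using only the JDK; the coordinator URLs are placeholders, and a real implementation would pull the active list from a discovery service and apply a canary policy instead of plain round-robin:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RedirectGateway
{
    // Placeholder coordinators; in practice, fetched from DNS, etcd, Consul, ...
    private static final List<String> COORDINATORS = List.of(
            "http://coordinator-a.example.com:8080",
            "http://coordinator-b.example.com:8080");

    private static final AtomicInteger NEXT = new AtomicInteger();

    public static void main(String[] args)
            throws Exception
    {
        HttpServer server = HttpServer.create(new InetSocketAddress(8888), 0);
        server.createContext("/", exchange -> {
            // Round-robin over active coordinators; a routing policy
            // (percentage canary, schema, client tags, ...) would go here
            String target = COORDINATORS.get(
                    Math.floorMod(NEXT.getAndIncrement(), COORDINATORS.size()));
            exchange.getResponseHeaders().add("Location", target + exchange.getRequestURI());

            // 307 keeps the request method and body intact across the redirect
            exchange.sendResponseHeaders(307, -1);
            exchange.close();
        });
        server.start();
    }
}
```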

I implemented the gateway service with haproxy. It doesn't inspect the query string in the request. All clients connect to the haproxy layer to run their queries, and the proxy routes each request to a coordinator. There are two coordinators, each with its own discovery service. These discovery services are externalized (we are not using the discovery service built into the coordinator).

Workers are configured to go through the proxy layer to determine which discovery server they connect to. The following are the steps I have used in the past to upgrade the cluster without impacting any queries.

  1. Upgrade the standby coordinator to the new version and restart it.
  2. Pick a few workers and send the PUT request to /v1/info/state to shut them down gracefully (as shown earlier).
  3. Upgrade the Presto version on those workers and force them to join the standby coordinator's discovery service.
  4. Shut down the discovery server associated with the active/primary coordinator.
    This forces haproxy to conclude that the primary coordinator is no longer healthy, while the standby coordinator is marked healthy since its coordinator and discovery service are up and running (see the haproxy sketch after this list).

  5. Fresh queries will immediately start flowing to the standby coordinator.

  6. Gracefully shut down the remaining workers in batches and upgrade the software on them. When these workers come back online, they will join the standby coordinator, as haproxy will route them there.

  7. All the active queries on the current coordinator will continue to run and complete without failures. Most of the workers will swing to the standby coordinator, as haproxy will route the workers' announcement requests to the standby discovery server. For 15+ seconds, the workers will still see commands from the original coordinator and will see two coordinators, but queries won't fail.

  8. Once the active queries on the primary coordinator are done, shut down the Presto process there and upgrade the software. Restart that coordinator and the associated discovery service.

  9. If needed, follow the steps from step 4 onwards to move traffic from the currently active coordinator back to this upgraded coordinator.
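For illustration, an haproxy configuration in the spirit of the setup above might look like the sketch below; the hostnames, ports, and health-check path are assumptions, not the exact configuration used:

```
frontend presto
    bind *:8080
    default_backend coordinators

backend coordinators
    # Health-check the externalized discovery service rather than the
    # coordinator itself, so stopping the primary's discovery server
    # (step 4) triggers the failover to the standby.
    option httpchk GET /v1/service
    server primary coordinator-a.example.com:8080 check addr discovery-a.example.com port 8411
    server standby coordinator-b.example.com:8080 check addr discovery-b.example.com port 8411 backup
```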

We use scripts to perform all these activities, so it is not that complicated when we upgrade Presto versions.

We have built and open-sourced presto-gateway - https://github.com/lyft/presto-gateway. Please take a look.

  • It can route ad hoc and scheduled queries to separate sets of clusters.
  • The default routing policy is round-robin.
  • It has APIs to deactivate/activate backend Presto clusters as needed.
  • It stores query history locally in an in-memory circular buffer and provides links to the query details page on the underlying Presto backend.

So if you have more than one Presto cluster, you can put them behind presto-gateway, deactivate one cluster, perform the upgrade, and so on, and once done, activate it again.

Nice write-up on different approaches; thank you all for sharing the details.

@puneetjaiswal I am using presto-gateway and appreciate your team sharing it with the community.

I had a question regarding graceful shutdown. I hope I am not hijacking the thread, and I hope the question is related to the broader theme of rolling restarts. How long does it take in your environment for a worker to gracefully shut down under load? I know it depends on the load and the particular tasks/splits the worker is handling, but I am just looking for an approximate number based on experience (I think @sajjoseph hinted they saw somewhere around 15+ seconds). My main intention is to compare it with the 2-minute termination grace period for AWS spot instances.
