Presto: High Availability

Created on 6 Mar 2019  ·  9 Comments  ·  Source: prestosql/presto

Goals

At the highest level, our goal is to create highly available Presto setups. More specifically, we would like to accomplish the following:

  • Reduce impact of crashed coordinator
  • Simplify deployment architecture (e.g., fast failover, load balancers, etc.)
  • Reduce failures due to upgrades

Work Items

  • ~Dispatcher split~

    • ~Move queue management into separate internal service (#95)~

    • ~Secure submission query API~

    • ~Split queued and executing client API~

  • Dispatcher-only or coordinator-only server

    • Start only the services needed for the server's role

  • Multiple dispatchers

    • Proxy for dumb load balancer

    • Durable resource groups

  • Multiple coordinators in a cluster

    • Add configuration to run shared discovery across coordinators

    • Add coordination for OOM killer and reserved memory pool

  • Multiple clusters

    • Cluster/coordinator discovery (each cluster has independent discovery)

    • Cluster selection system for queries

Deployment Architectures

There are multiple ways to combine the above work items into different setups to achieve different goals. The following sections describe the most common setups:

Multiple coordinators with Shared Queue

In this setup there is a single dispatcher containing the shared queue, and multiple coordinators in the cluster. If the dispatcher crashes, all queued queries fail, but executing queries will continue. If a coordinator crashes, all queries managed by that coordinator fail, but other queued queries and queries managed by other coordinators will continue.

This setup requires the following:

  • Dispatcher split
  • Dispatcher-only or coordinator-only server
  • Multiple coordinators in a cluster
  • (Optional) Multiple dispatchers
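
The failure behaviour described for this setup can be illustrated with a toy model. This is only a sketch of the semantics stated above, not Presto code: the `Cluster` class, its fields, and the query and coordinator names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model: one dispatcher holding the shared queue, several coordinators."""

    queued: set = field(default_factory=set)        # query ids waiting in the dispatcher
    executing: dict = field(default_factory=dict)   # query id -> coordinator name
    failed: set = field(default_factory=set)        # query ids that have failed

    def crash_dispatcher(self):
        # All queued queries fail; executing queries are unaffected.
        self.failed |= self.queued
        self.queued = set()

    def crash_coordinator(self, name):
        # Only queries managed by the crashed coordinator fail.
        lost = {q for q, c in self.executing.items() if c == name}
        self.failed |= lost
        for q in lost:
            del self.executing[q]
```

For example, with `q1` and `q2` queued, `q3` running on `coord-a` and `q4` on `coord-b`, crashing `coord-a` fails only `q3`, and a later dispatcher crash fails `q1` and `q2` while `q4` keeps running.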

Highly Available Coordinators

In this setup there are multiple coordinators and each contains a dispatcher. Queued queries are durable, but executing queries can fail.

This setup requires the following:

  • Dispatcher split
  • Multiple dispatchers
  • Multiple coordinators in a cluster

Multiple Clusters with Shared Queue

In this setup there is a dispatcher tier that is managing a shared queue for multiple clusters. In the simplest form, there is a single dispatcher for all clusters and a single coordinator for each cluster. As above, if either fails, it only fails the queries being managed by that instance. This project has multiple followup projects to make the dispatcher HA, and to allow multiple coordinators in a cluster.

This setup requires the following:

  • Dispatcher split
  • Dispatcher-only or coordinator-only server
  • Multiple clusters
  • (Optional) Multiple dispatchers
  • (Optional) Multiple coordinators

All 9 comments

We hope this patch will also:

  • Simplify the collaboration with resource orchestration platforms (K8S / Mesos)
  • Allow 0 downtime rollout upgrade
  • Allow load balancing between coordinators (for high concurrency queries where coordinator becomes the bottleneck)

Path 1: Multiple coordinators with Shared Queue

This approach requires the "Multiple coordinators in a cluster" implementation, which I think would be a difficult task. The cooperative scheduling of workers between coordinators could be complicated. I'd prefer to take Path 3 first and then move toward the outcome of Path 1.

Path 2: Highly Available Coordinators

I don't like this approach. The main issue is that the client has to decide which coordinator to connect to, or we have to add a load balancer in front of the coordinators.
Also, this doesn't allow Dispatcher-only mode, and we must always have two or more coordinators online.

Path 3: Multiple Clusters with Shared Queue

The multiple-clusters model makes zero-downtime rolling upgrades easy. I think the most important advantage of this path is that it splits the big problem (highly available Presto) into many smaller problems. Once the dispatcher module is implemented, different people can work on different follow-up projects, including the load-balancing algorithm, multiple dispatchers, multiple coordinators in a cluster, etc.

I'll vote for Path 3 for its extensibility and step-by-step sprint development approach.

@oneonestar:

  • Simplify the collaboration with resource orchestration platforms (K8S / Mesos)

This isn't one of my goals, but if this helps great.

  • Allow 0 downtime rollout upgrade

All three of the designs will allow for this. For multi-cluster it is straightforward: shut down one cluster, upgrade it, and bring it back online. For multi-coordinator, it is similar. The two key things to know are that Presto coordinators and workers must have the same version, and that there is a config option for a minimum number of workers. So you can upgrade a single coordinator, and the coordinator will become visible to the dispatchers, but it won't accept queries until it has enough workers it can use. Then you simply upgrade workers one at a time. When the number of upgraded workers crosses the threshold, the coordinator starts accepting work. The full solution is a bit more complex.
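
The gating described here can be sketched as follows. This is a simplified illustration under stated assumptions, not Presto's implementation: `accepts_queries`, `rolling_upgrade`, the version strings, and the `min_workers` parameter are hypothetical names standing in for the version-match rule and the minimum-workers config option mentioned above.

```python
def accepts_queries(coordinator_version, worker_versions, min_workers):
    """Return True once enough workers match the coordinator's version."""
    usable = sum(1 for v in worker_versions if v == coordinator_version)
    return usable >= min_workers


def rolling_upgrade(worker_versions, new_version, min_workers):
    """Upgrade workers one at a time.

    Yields (upgraded_count, accepting) after each step, where `accepting`
    says whether the already-upgraded coordinator would take queries.
    """
    workers = list(worker_versions)
    for i in range(len(workers)):
        workers[i] = new_version
        yield i + 1, accepts_queries(new_version, workers, min_workers)
```

With four workers and a threshold of three, the upgraded coordinator stays idle for the first two worker upgrades and starts accepting work after the third.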

  • Allow load balancing between coordinators (for high concurrency queries where coordinator becomes the bottleneck)

Only solutions with "Multiple coordinators" would do this.

Path 1: Multiple coordinators with Shared Queue
This approach requires the "Multiple coordinators in a cluster" implementation, which I think would be a difficult task. The cooperative scheduling of workers between coordinators could be complicated. I'd prefer to take Path 3 first and then move toward the outcome of Path 1.

From the first version of Presto, the system has always done "cooperative scheduling of workers between coordinators", so this isn't too difficult. Actually, there was a period at FB where we made all workers coordinators by accident, and everything just worked.

The complex part is really around decisions that must be globally made. Specifically, which query to promote to the reserved pool, or which query to kill in low memory, must be coordinated to prevent multiple coordinators making conflicting decisions at the same time.
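
A toy illustration of why such decisions need coordination. It assumes, purely for the example, that each coordinator only sees the memory usage of its own queries and that the victim is the query using the most memory; neither assumption comes from this thread, and all function names are hypothetical.

```python
def pick_victim(visible_queries):
    """Pick the query using the most memory; ties are broken by query id."""
    return max(visible_queries.items(), key=lambda kv: (kv[1], kv[0]))[0]


def independent_kills(views):
    """Each coordinator decides from its own view, so several queries may die."""
    return {pick_victim(view) for view in views if view}


def coordinated_kill(views):
    """One decision over the merged view kills exactly one query."""
    merged = {}
    for view in views:
        merged.update(view)
    return {pick_victim(merged)} if merged else set()
```

With views `{"q1": 40, "q2": 10}` and `{"q3": 30}`, independent decisions kill both `q1` and `q3`, while a single coordinated decision kills only `q1`.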

Path 2: Highly Available Coordinators
I don't like this approach. The main issue is that the client has to decide which coordinator to connect to, or we have to add a load balancer in front of the coordinators.
Also, this doesn't allow Dispatcher-only mode, and we must always have two or more coordinators online.

In this design, the client contacts a dispatcher service which happens to be running in the same process as the coordinators. This is the "Multiple dispatchers" project mentioned above, and it requires "Proxy for dumb load balancer". Anything that has HA for the front end will require some kind of dumb load balancer, but even DNS should work.

In this design, you could scale down all the way to a single coordinator (containing a dispatcher). The dispatcher service would accept and queue queries, but would not start them until enough workers showed up. If the dispatcher failed, another one would need to be started before the clients timed out, but assuming it did, it would rebuild the queues and the clients would continue.

Path 3: Multiple Clusters with Shared Queue
The multiple-clusters model makes zero-downtime rolling upgrades easy. I think the most important advantage of this path is that it splits the big problem (highly available Presto) into many smaller problems. Once the dispatcher module is implemented, different people can work on different follow-up projects, including the load-balancing algorithm, multiple dispatchers, multiple coordinators in a cluster, etc.

I think those points are true of all of the above designs.


My main concern with approach 3 is that I think it might only be good for really large users that have multiple clusters and are OK with an additional tier of dispatcher machines. Additionally, I don't think it actually makes Presto more highly available, which is what people have been asking for. Anyway, I am happy to work on whatever the community wants done first.

Something that could be considered for the first iteration of HA for the coordinators is to implement a system similar to what HashiCorp's Vault uses, where there is only ever one coordinator (master node) active. Leadership election determines which coordinator is the leader. The leader handles all of the coordinator processes, and its healthcheck endpoint returns 200 so that a VIP knows it should be active.

The other nodes return a different HTTP code (503 Service Unavailable most likely makes sense here) so that they would be unavailable in any VIPs they reside behind. Alternatively, they could just trigger a 307 Temporary Redirect to the leader coordinator.

That way the overall code for how a coordinator works is the same. It just gets wrapped up in a leadership election to determine who should actually service the request.
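
The healthcheck behaviour suggested above could look roughly like this. It is a minimal sketch, not an existing Presto or Vault API: `health_response` and its parameters are hypothetical, and the leadership election itself is assumed to happen elsewhere and simply supplies `leader_id`.

```python
def health_response(node_id, leader_id, leader_url=None):
    """Return (status_code, headers) for a coordinator's healthcheck endpoint.

    The leader answers 200 so a VIP keeps it in rotation. A standby either
    redirects to the leader (307) when it knows the leader's URL, or reports
    503 so the VIP takes it out of rotation.
    """
    if node_id == leader_id:
        return 200, {}
    if leader_url is not None:
        return 307, {"Location": leader_url}
    return 503, {}
```

A VIP that only routes to nodes answering 200 would then send all traffic to the current leader, and failover amounts to a new node winning the election and starting to answer 200.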

Is this feature already planned? We expect to be able to use Presto for ETL, which requires a stable cluster and task queue.

@wpf375516041 Facebook uses Presto extensively for ETL without this feature. This is accomplished by running queries using an external scheduler (similar to Apache Airflow) that handles retry on failure. If the coordinator restarts (or is moved to a different machine), the queue is lost, which is bad, but the queries will still run because the scheduler retries them.

This feature helps, but you still want external retry anyway: queries fail for various reasons, the machine submitting the query crashes, etc.
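
The external retry described here can be sketched in a few lines. This is a generic illustration, not how any particular scheduler works: `run_with_retry` is a hypothetical helper, and `submit` stands for whatever callable sends the query to Presto and raises on failure.

```python
import time


def run_with_retry(submit, max_attempts=3, backoff_seconds=0.0, sleep=time.sleep):
    """Call `submit()` until it succeeds or `max_attempts` is exhausted."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except Exception as error:  # e.g. the coordinator restarted mid-query
            last_error = error
            if attempt < max_attempts:
                # Linear backoff before the next attempt.
                sleep(backoff_seconds * attempt)
    raise RuntimeError(f"query failed after {max_attempts} attempts") from last_error
```

If the coordinator restarts and the queue is lost, the first attempt fails and the scheduler simply submits the query again.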

@electrum
Thank you very much for your reply. How can I better handle long-running tasks in the event of a queue loss?
Is it possible to provide a shared task queue?
I think this may be a very big change, but could it avoid having to repeat the parts of the plan that have already been executed?

The “shared queue” is only for queuing queries before they start executing. Once they start executing, they run on a single coordinator, exactly as they do today.

For long-running queries, it is important to have reliable hardware. Failures should be rare. Use enough machines and concurrency so that the wall time is no more than a few hours. If you have queries longer than this, try to break them up into shorter queries.

Is this feature already planned?

Has there been any more progress made on high availability?
