It would be great to implement a clean shutdown mode that handles a signal and waits for all child processes to complete before actually shutting down.
This is closely related to upgrades in setups where the Atlantis server code is cleanly separated from its data, for example running the Atlantis server in a Docker container with the atlantis user's home directory mounted as a Docker volume.
_Tracking child processes can be done nicely with cgroups, but I would discourage that implementation because it does not have a close analog in the Docker/k8s world. AFAIK anyways._
We run Atlantis in AWS Fargate, and for upgrading Atlantis or pushing out configuration changes we first block ingress to the Fargate task on its AWS ELB. Then after some time we assume all Atlantis Terraform processes have completed, and we recreate the Fargate task with the new configuration or container image tag using local Terraform.
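To make the idea concrete, here is a minimal sketch of signal-driven clean shutdown in Go using only the standard library. The `tracker` type and its `start`/`wait` methods are hypothetical names for illustration, not Atlantis code, and `sleep 1` stands in for a Terraform invocation. The demo sends itself SIGTERM so that it terminates on its own; in a real deployment Docker or Kubernetes would send the signal.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"sync"
	"syscall"
)

// tracker (hypothetical) records child processes that are still running.
type tracker struct{ wg sync.WaitGroup }

// start launches a child process and tracks it until it exits.
func (t *tracker) start(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return err
	}
	t.wg.Add(1)
	go func() {
		defer t.wg.Done()
		_ = cmd.Wait() // exit status is ignored in this sketch
	}()
	return nil
}

// wait blocks until every tracked child has exited. A real server would
// also have to stop accepting new work before calling it.
func (t *tracker) wait() { t.wg.Wait() }

func main() {
	// ctx is cancelled when SIGTERM or Ctrl-C arrives.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	t := &tracker{}
	if err := t.start("sleep", "1"); err != nil { // stand-in for terraform
		fmt.Println("could not start child:", err)
	}

	// Demo only: deliver SIGTERM to ourselves so the example finishes.
	if p, err := os.FindProcess(os.Getpid()); err == nil {
		_ = p.Signal(syscall.SIGTERM)
	}

	<-ctx.Done()
	fmt.Println("signal received, waiting for child processes")
	t.wait()
	fmt.Println("all child processes finished, exiting")
}
```

This approach tracks children in-process rather than through cgroups, so it behaves the same inside a container.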
But we don't reliably know when all the Terraform tasks have completed. I think this could be improved by adding an API endpoint that returns the count of Terraform processes currently running.
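A rough sketch of what such a counting endpoint could look like in Go follows. Everything here is an assumption for illustration: the `processCounter` type, the `/terraform/processes` path, and the JSON shape are invented and are not part of Atlantis.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
)

// processCounter (hypothetical) counts Terraform commands in flight.
type processCounter struct {
	running atomic.Int64
}

// track wraps one Terraform invocation so the count is decremented
// however fn returns.
func (c *processCounter) track(fn func() error) error {
	c.running.Add(1)
	defer c.running.Add(-1)
	return fn()
}

// ServeHTTP reports the current count as a small JSON document.
func (c *processCounter) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]int64{"running": c.running.Load()})
}

// countAt queries the handler in-process and returns the response body.
func countAt(c *processCounter) string {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/terraform/processes", nil)
	c.ServeHTTP(rec, req)
	return strings.TrimSpace(rec.Body.String())
}

func main() {
	c := &processCounter{}
	fmt.Println("idle:", countAt(c))
	_ = c.track(func() error {
		// Queried while one (simulated) Terraform command is running.
		fmt.Println("busy:", countAt(c))
		return nil
	})
	fmt.Println("idle again:", countAt(c))
}
```

An upgrade script could poll this endpoint after blocking ingress and recreate the task once the count reaches zero.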
Then our upgrade process would be:
The work needed to know how many TF processes are running is the same as the work needed to properly pass a context through to everything and then keep the Atlantis process running until all the TF processes have stopped, so I'm not sure we need an API endpoint.
Yea that would be fine as long as terraform applying fargate task changes can support waiting for the clean shutdown to happen - we use https://github.com/terraform-aws-modules/terraform-aws-atlantis
I could see this being a problem if a clean shutdown is waiting an hour for a long terraform process to finish
Hmm, it looks like there's a 2m max (https://forums.aws.amazon.com/thread.jspa?messageID=907417) so that wouldn't necessarily work. Maybe an API endpoint like /drain or something would be necessary.
Hi,
I'm working on it (implementing a drain).
I need it since we are deploying Atlantis with Atlantis in a K8s cluster, so we need RollingUpgrades + a clean pod termination.
As you suggested, I'm implementing a drain endpoint, with a POST to start the drain and a GET to check its completion.
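For illustration, a drain endpoint of this shape could be sketched in Go roughly as below. This is not the actual implementation: the `drainer` type, its method names, and the JSON field names are assumptions. A POST flips the server into draining mode, a GET reports whether in-flight work has finished, and new work is refused once draining has started.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// drainStatus is the (assumed) JSON body returned by the endpoint.
type drainStatus struct {
	Draining  bool `json:"draining"`
	InFlight  int  `json:"in_flight"`
	Completed bool `json:"completed"`
}

// drainer (hypothetical) gates new operations and tracks running ones.
type drainer struct {
	mu       sync.Mutex
	draining bool
	inFlight int
}

// tryBegin registers a new operation; it returns false once draining.
func (d *drainer) tryBegin() bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.draining {
		return false
	}
	d.inFlight++
	return true
}

// end marks one operation as finished.
func (d *drainer) end() {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.inFlight--
}

// ServeHTTP starts the drain on POST and reports status on GET and POST.
func (d *drainer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet && r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	d.mu.Lock()
	if r.Method == http.MethodPost {
		d.draining = true
	}
	st := drainStatus{
		Draining:  d.draining,
		InFlight:  d.inFlight,
		Completed: d.draining && d.inFlight == 0,
	}
	d.mu.Unlock()
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(st)
}

// call issues an in-process request and returns the response body.
func call(d *drainer, method string) string {
	rec := httptest.NewRecorder()
	d.ServeHTTP(rec, httptest.NewRequest(method, "/drain", nil))
	return strings.TrimSpace(rec.Body.String())
}

func main() {
	d := &drainer{}
	d.tryBegin()                                   // one operation is running
	fmt.Println(call(d, http.MethodPost))          // start the drain
	fmt.Println("new work accepted:", d.tryBegin()) // refused while draining
	d.end()                                        // the operation finishes
	fmt.Println(call(d, http.MethodGet))           // drain is now complete
}
```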
I will probably implement an operation like "atlantis shutdown" which will call this endpoint locally and wait for completion of the drain, so that it can be used in the preStop hook on K8s. I will probably have a working prototype before the end of the week.
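The client side of such a shutdown operation might look something like the sketch below: POST once to start the drain, then poll with GET until the server reports completion or a deadline passes. The function names, the `completed` JSON field, and the fake server in `runDemo` are all assumptions made for this example; the real command and endpoint may differ.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"sync/atomic"
	"time"
)

// queryDrain sends one request and reports whether the drain completed.
func queryDrain(ctx context.Context, method, url string) (bool, error) {
	req, err := http.NewRequestWithContext(ctx, method, url, nil)
	if err != nil {
		return false, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("drain endpoint returned %s", resp.Status)
	}
	var st struct {
		Completed bool `json:"completed"` // assumed field name
	}
	if err := json.NewDecoder(resp.Body).Decode(&st); err != nil {
		return false, err
	}
	return st.Completed, nil
}

// drainAndWait starts the drain with POST, then polls with GET until the
// server reports completion or ctx is done.
func drainAndWait(ctx context.Context, url string, interval time.Duration) error {
	method := http.MethodPost
	for {
		done, err := queryDrain(ctx, method, url)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		method = http.MethodGet
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}

// runDemo exercises drainAndWait against a fake server that reports
// completion from the third request onwards.
func runDemo() error {
	var requests atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		done := requests.Add(1) >= 3
		_ = json.NewEncoder(w).Encode(map[string]bool{"completed": done})
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return drainAndWait(ctx, srv.URL, 10*time.Millisecond)
}

func main() {
	if err := runDemo(); err != nil {
		fmt.Println("drain failed:", err)
		os.Exit(1)
	}
	fmt.Println("drain complete")
}
```

For a preStop hook, the deadline would need to fit within the pod's termination grace period, otherwise the container is killed before the drain finishes.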
We will battle-test it asap on our cluster.
This means that I will also propose a chart update for RollingUpgrades + a preStop hook in lifecycle.