From what I observed, if the ECS Fargate service's tasks fail to start, the service keeps retrying, and I never saw the deployment actually time out.
Is there anything we can do from CDK to time out a deployment?
It will time out eventually, but it takes a while (an hour or so).
Whether we can set a timeout is a good question. I actually don't know the answer to that.
cc @SoManyHs ?
I ran into this today as well. My fix was to kill CDK, which was hung, and go into the AWS console to "Cancel update stack". From there the stack was rolled back to its original state.
Seems like in these cases you'd want to fail fast(er) with a timeout override and do the above steps automatically to get back to a known state.
That is a good point. You can use the console to interrupt the deployment. You don't even have to kill CDK to do it: if you cancel from the console, CDK will show the rollback starting and exit with an error at the end.
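To make the "fail faster and cancel automatically" idea concrete, here is a minimal sketch of a deadline wrapper. It is not an existing CDK feature. `deployWithDeadline`, `deploy`, and `cancel` are hypothetical names: `deploy` stands for whatever starts and awaits the deployment, and `cancel` is assumed to wrap something like CloudFormation's CancelUpdateStack call.

```typescript
// Sketch only: `deploy` and `cancel` are hypothetical callbacks supplied by the caller.
// `cancel` is assumed to request a rollback, e.g. via CloudFormation's CancelUpdateStack.
type DeployOutcome<T> =
  | { status: "finished"; value: T }
  | { status: "cancelled" };

async function deployWithDeadline<T>(
  deploy: () => Promise<T>,
  cancel: () => Promise<void>,
  deadlineMs: number,
): Promise<DeployOutcome<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Resolves with a marker value once the deadline has passed.
  const deadline = new Promise<"deadline">((resolve) => {
    timer = setTimeout(() => resolve("deadline"), deadlineMs);
  });

  try {
    // Whichever settles first wins: the deployment or the deadline.
    const winner = await Promise.race([
      deploy().then((value) => ({ status: "finished" as const, value })),
      deadline,
    ]);
    if (winner === "deadline") {
      // The deployment is still running: ask for a rollback instead of waiting.
      await cancel();
      return { status: "cancelled" };
    }
    return winner;
  } finally {
    // Always clear the timer so a finished deployment does not keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

If the deployment itself fails before the deadline, the error propagates to the caller unchanged, so only the "still hanging" case triggers the cancel.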
Another issue I found related to this: after cancelling the deploy (and deleting the stack in CloudFormation), the log groups created by the stack with static names were not removed, which makes the next cdk deploy fail.
I had to manually remove those log groups.
That is actually on purpose: logs are always retained because they might be important.
There is a parameter to control that, but I suppose we can improve the default based on whether a static name was specified or not.
> Another issue I found related to this: after cancelling the deploy (and deleting the stack in CloudFormation), the log groups created by the stack with static names were not removed, which makes the next cdk deploy fail.
> I had to manually remove those log groups.
Yes, I ran into this as well. IMO, the defaults should be such that destroy -> deploy should be idempotent.
That is only true if you haven't used your stack in the meantime, I'd think.
If you've accepted money from someone, you're now required by law to keep logs on that. Also, for security reasons, we don't want to make it too easy to destroy those logs, or customer data, or whatever state you've accumulated in the meantime.
In any case, if it's bothering you, you can always pass retainLogGroup: false
I don't think there's any way to control the per-resource timeout in CloudFormation. We can control the stack-wide timeout instead though. Should be a toolkit feature.
@rix0rrr Note that stack-wide timeouts seem to be limited to creation rather than update :(
It looks like there is already a creation timeout for nested stacks. Is there any way to specify a timeout for updating a nested stack?
> We can control the stack-wide timeout instead though. Should be a toolkit feature.
It would be an awesome feature ☝️ ❤️, and hopefully it will help avoid having to set up a complex manual check for an ECS deployment crash loop like this: https://aws.amazon.com/blogs/compute/automating-rollback-of-failed-amazon-ecs-deployments/
On the other hand, I can confirm that currently an ECS EC2 service deployment takes 3 hours to decide that it has failed when the error comes from the application layer inside the Docker container.
3/6 | 6:33:37 AM | UPDATE_IN_PROGRESS | AWS::ECS::Service | xxx
...
5/6 | 9:35:07 AM | UPDATE_FAILED | AWS::ECS::Service | xxx Service arn:aws:ecs:xxx:xxx:service/xxx did not stabilize.
The current workaround is updating the service's desired count to 0.
What if the CDK CLI trapped the WINCH signal (SIGWINCH) and responded by sending a cloudformation:CancelUpdateStack request? If we did this, then the user could run CDK like this: timeout -sWINCH 15m cdk deploy ...
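As a rough illustration of that signal-trapping proposal (not something the CDK CLI does today, as far as this thread establishes), a handler could look like the sketch below. `cancelOnSignal`, `SignalSource`, and `cancelUpdate` are hypothetical names. `cancelUpdate` is assumed to wrap the cloudformation:CancelUpdateStack call for the stack being deployed.

```typescript
// Minimal shape of a signal source. Node's `process` object is intended to be
// passed here; a small fake works for testing.
interface SignalSource {
  once(signal: string, listener: () => void): unknown;
}

// `cancelUpdate` is a hypothetical callback, assumed to send
// cloudformation:CancelUpdateStack for the stack currently being deployed.
function cancelOnSignal(
  source: SignalSource,
  signal: string,
  cancelUpdate: () => Promise<void>,
  onError: (err: unknown) => void = console.error,
): void {
  // `once` so that a repeated signal does not send a second cancel request.
  source.once(signal, () => {
    // Report a failed cancel request instead of leaving the rejection unhandled.
    cancelUpdate().catch(onError);
  });
}
```

In a Node-based CLI this would presumably be wired up as `cancelOnSignal(process, "SIGWINCH", cancelUpdate)` before the deployment starts. The CLI would then keep running after the signal, so it can still show the rollback and exit with an error at the end.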