Argo: Workflow steps fail with a 'pod deleted' message.

Created on 19 Nov 2020 · 35 comments · Source: argoproj/argo

Summary

Maybe related to #3381?

Some of the workflow steps end up in Error state with the message pod deleted. I am not sure which of the following data points are relevant, but listing all observations:

  • the workflow uses PodGC: strategy: OnPodSuccess.
  • we are seeing this for ~5% of workflow steps.
  • affected steps are part of a withItems loop
  • the workflow is not large: ~170 to 300 concurrent nodes
  • this has been observed since deploying v2.12.0rc2 yesterday, including the v2.12.0rc2 executor image. We were previously on v2.11.6 and briefly on v2.11.7, and had not seen this issue on either.
  • k8s events confirm the pods ran to completion
  • cluster scaling has been ruled out as the cause - this is observed on multiple k8s nodes, all of which are still running
  • we have not tried the same workflow without PodGC yet.
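For context, a minimal sketch of the relevant workflow shape (names, items, and image are hypothetical; the real workflow fans out to ~170-300 concurrent nodes):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: podgc-repro-      # hypothetical name
spec:
  entrypoint: fan-out
  podGC:
    strategy: OnPodSuccess        # pods are deleted as soon as they succeed
  templates:
    - name: fan-out
      steps:
        - - name: work
            template: do-work
            arguments:
              parameters:
                - name: item
                  value: "{{item}}"
            withItems: ["a", "b", "c"]   # the real loop is much wider
    - name: do-work
      inputs:
        parameters:
          - name: item
      container:
        image: alpine:3.12
        command: [echo, "{{inputs.parameters.item}}"]
```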

Diagnostics

What Kubernetes provider are you using?

docker

What version of Argo Workflows are you running?

v2.12.0rc2 for all components



Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

bug regression

All 35 comments

I'm betting this is caused by #4024, maybe when it interacts with #4025.

Can you please try running your controller with INFORMER_WRITE_BACK=false?
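Assuming the controller runs as a Deployment named workflow-controller in the argo namespace (adjust both for your install), the flag can be set with:

```shell
# Set the diagnostic-only flag on the controller;
# the Deployment rollout restarts the controller automatically.
kubectl -n argo set env deployment/workflow-controller INFORMER_WRITE_BACK=false
```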

Can confirm that we are seeing zero occurrences of this with INFORMER_WRITE_BACK=false, on a 10x larger workflow. We'll leave it like this for now. Are there any performance/reliability issues to look out for with this flag?

INFORMER_WRITE_BACK=false only points to a solution - it is not the solution itself and it comes with a performance cost

If I create an engineering build - can you please test it?

Certainly! will it also include what we discussed re: #4560 ?

Please can you try argoproj/workflow-controller:grace? It allows a short grace period before marking a node as errored; please check whether the grace period shows up in your logs.
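A sketch of switching the controller to the engineering build (Deployment name and namespace are assumptions; adjust for your install):

```shell
# Point the controller Deployment at the engineering build
kubectl -n argo set image deployment/workflow-controller \
  workflow-controller=argoproj/workflow-controller:grace
# Watch the rollout complete
kubectl -n argo rollout status deployment/workflow-controller
```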

@alexec should I also remove the INFORMER_WRITE_BACK envvar?

Yes. Remove INFORMER_WRITE_BACK.

Still seeing ~3% of pods being deleted on :grace, and no log messages with grace-period. I'd like to revert to 2.12.0rc2 and set INFORMER_WRITE_BACK=false for now - is that safe aside from the performance cost?

@ebr - no - that is an unsupported development flag.

This is a blocking issue for v2.12 and does need to be fixed.

I've just updated the build with a longer grace period and more logging. Could you please try it out?

no longer seeing this on :grace! but no log messages either.

can you tell me if you see recentlyStarted in your logs?

No recentlyStarted.... I'm going to try reproducing this again on 2.12rc2

That's really odd. You should absolutely see it. What is your log level? info? Is your imagePullPolicy set to Always?

Log level debug as we set in our Zoom troubleshooting and I haven't changed it; imagePullPolicy: Always.

there are definitely dozens of pods being started and completed, and none of the given strings appear in the controller logs.... I double checked that I'm still on :grace.
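One way to check for the diagnostic strings mentioned in this thread, assuming the same controller Deployment and namespace as above:

```shell
# Search recent controller logs for the strings under discussion
kubectl -n argo logs deploy/workflow-controller --tail=100000 \
  | grep -E 'recentlyStarted|Pod for node deleted' || echo "no matches"
```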

Do you see Pod for node deleted? You should not.

No, I'm not seeing that. but I'm no longer seeing deleted pods either

Ok, I did not docker push the very latest version. I have now pushed it, if you have time to test and check for recentlyStarted in your logs.

Why am I asking again? Seeing recentlyStarted: true will give me confidence that the fix is correct, and that it wasn't something else that changed.

Sounds good, I will try right now before signing off for the day. thanks for all your attention to this!

I wasn't able to reproduce this at all (still on :grace). And see none of the above log messages...
Tomorrow I'm going to try reproducing this again on 2.12.0rc2
A silly thought perhaps, but could this have been caused by another workflow? The only thing that changed throughout the day (aside from my testing) was that we deleted some previously failed workflows.

Update: we did see quite a lot of deleted pods through the night - hard to say how many exactly because we were running over 12,000 pods, but it's definitely in the thousands. this is still on :grace. So I guess it's not as immediately reproducible, but definitely still happening.
I can't say whether there were any log messages with recentlyStarted: true, as the controller is pumping out such massive log volume that they've already been rotated out
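To avoid losing the interesting lines to log rotation, one option is to stream and filter the controller log into a file (path and match strings are illustrative):

```shell
# Follow the controller log, keep only the lines of interest,
# and persist them so rotation on the pod side doesn't matter
kubectl -n argo logs -f deploy/workflow-controller \
  | grep --line-buffered -E 'recentlyStarted|grace' \
  >> /tmp/argo-grace-diagnostics.log
```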

Not fixed - but less of a problem?

Ah - the :grace version does not have any of the fixes from :mot (which is now merged to :latest). If I create a new engineering build of :grace that includes those - could you test them together?

Done. argoproj/workflow-controller:grace has both fixes.

hey @alexec - thank you - I was offline for a couple of days, but will test it as soon as I can. It is unlikely to happen today.

Not fixed - but less of a problem?

This is correct. Definitely not as reproducible anymore. For what it's worth, my team has been running jobs on the :mot build since Friday (4 days), and many very large workflows have completed perfectly, while some smaller ones experienced pod deletion. There doesn't seem to be any systematic pattern to it - it's very intermittent. I'll try the new :grace build when it's safe to do so; we're running some really critical workloads right now so I can't promise to do it soon, unfortunately.

Some more deleted pods today; I switched to :grace. Seeing this:

[screenshot attached]

  • significant increase in RAM usage of the workflow controller
  • still some deleted pods, very few and very intermittent
  • still not seeing any log messages like recentlyStarted: true or grace, but with the frequent controller restarts and infrequent pod deletions, I can't be sure.

The workflows seem to be running successfully and without changes though; I'm debating whether to revert to :mot or just leave it alone. The increased RAM usage is not an issue for us.

  • recentlyStarted was logged at debug level. I've increased this to info level and published a new :grace image.
  • I've added RECENTLY_STARTED_POD_DURATION=10s you could try setting this to higher values, e.g. RECENTLY_STARTED_POD_DURATION=30s to see if that improves the issue.
  • :mot is merged to master and the fix is on :latest.
  • I would not expect to see increased RAM usage. Are you able to test :latest so we can be sure that the issue is with grace not something else?
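Assuming the same controller Deployment as above, the new knob from the comment could be raised like so (30s is the suggested experiment, not a recommendation):

```shell
kubectl -n argo set env deployment/workflow-controller RECENTLY_STARTED_POD_DURATION=30s
```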

It is Thanksgiving at the moment - I'll look in more detail on Monday.

no worries and no rush - happy Thanksgiving! I actually did change to :latest shortly after posting the last reply, as I wasn't comfortable with the spurious restarts. And yes, on :latest the restarts stopped, and the RAM usage went back down and flattened. I noticed that on :grace the RAM usage increased proportionately to the number of workflow pods in the cluster - which is not the case on :latest.

I've been at debug loglevel the entire time, so should've seen the recentlyStarted messages, but perhaps the issue is too intermittent. We don't have log aggregation in place, and tailing the logs doesn't seem to be sufficient, there's just too much volume relative to the frequency of these occurrences. Anyway, I'm going to avoid changing builds at least today and tomorrow, so this may have to wait until Monday for me as well.

Am I the only user reporting this? Perhaps it would be helpful for someone else to also try reproducing the issue, just to be sure it's not a red herring caused by something related to our EKS cluster set up (I don't see how it could be, but stranger things have happened).

@ebr I think I've heard of one or two users mention this (but not reporting this by creating a ticket), so I assume they are not seriously impacted.

Let me know how you get on with :grace with RECENTLY_STARTED_POD_DURATION=30s.

I believe this has likely been fixed/mitigated on master.

We can confirm the issue is gone in our runs on a build off commit https://github.com/argoproj/argo/commit/625e3ce265e17df9315231e82e9a346aba400b14 on master.
