Argo: Workflow steps fail with a 'pod deleted' message.

Created on 19 Nov 2020 · 35 comments · Source: argoproj/argo

Summary

Maybe related to #3381?

Some of the workflow steps end up in Error state with the message pod deleted. I am not sure which of the following data points are relevant, but listing all observations:

  • the workflow uses PodGC: strategy: OnPodSuccess.
  • we are seeing this for ~5% of workflow steps.
  • affected steps are part of a withItems loop
  • the workflow is not large: ~170 to 300 concurrent nodes
  • this has been observed since deploying v2.12.0rc2 yesterday, including the v2.12.0rc2 executor image. We were previously on v2.11.6 and briefly on v2.11.7, and had not seen this issue on either.
  • k8s events confirm the pods ran to completion
  • cluster scaling has been ruled out as the cause - this is observed on multiple k8s nodes, all of which are still running
  • we have not tried the same workflow without PodGC yet.
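For context, a minimal sketch of the relevant workflow shape (names, items, and image are hypothetical; the real workflow fans out to ~170-300 concurrent nodes):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: podgc-repro-      # hypothetical name
spec:
  entrypoint: fan-out
  podGC:
    strategy: OnPodSuccess        # pods are deleted as soon as they succeed
  templates:
    - name: fan-out
      steps:
        - - name: work
            template: do-work
            arguments:
              parameters:
                - name: item
                  value: "{{item}}"
            withItems: ["a", "b", "c"]   # the real loop is much wider
    - name: do-work
      inputs:
        parameters:
          - name: item
      container:
        image: alpine:3.12
        command: [echo, "{{inputs.parameters.item}}"]
```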

Diagnostics

What Kubernetes provider are you using?

docker

What version of Argo Workflows are you running?

v2.12.0rc2 for all components



Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

bug regression

All 35 comments

I'm betting this is caused by #4024, maybe when it interacts with #4025.

Can you please try running your controller with INFORMER_WRITE_BACK=false?
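Assuming the controller runs as a Deployment named workflow-controller in the argo namespace (adjust both for your install), the flag can be set with:

```shell
# Set the diagnostic-only flag on the controller;
# the Deployment rollout restarts the controller automatically.
kubectl -n argo set env deployment/workflow-controller INFORMER_WRITE_BACK=false
```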

Can confirm that we are seeing zero occurrences of this with INFORMER_WRITE_BACK=false, on a 10x larger workflow. We'll leave it like this for now. Are there any performance/reliability issues to look out for with this flag?

INFORMER_WRITE_BACK=false only points to a solution - it is not the solution itself and it comes with a performance cost

If I create an engineering build - can you please test it?

Certainly! will it also include what we discussed re: #4560 ?

Please can you try argoproj/workflow-controller:grace? It allows a short grace period before marking a node as errored; please check whether the grace period shows up in your logs.
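A sketch of switching the controller to the engineering build (Deployment name and namespace are assumptions; adjust for your install):

```shell
# Point the controller Deployment at the engineering build
kubectl -n argo set image deployment/workflow-controller \
  workflow-controller=argoproj/workflow-controller:grace
# Watch the rollout complete
kubectl -n argo rollout status deployment/workflow-controller
```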

@alexec should I also remove the INFORMER_WRITE_BACK envvar?

Yes. Remove INFORMER_WRITE_BACK.

Still seeing ~3% of pods being deleted on :grace, and no log messages with grace-period. I'd like to revert to 2.12.0rc2 and set INFORMER_WRITE_BACK=false for now - is that safe aside from the performance cost?

@ebr - no - that is an unsupported development flag.

This is a blocking issue for v2.12 and does need to be fixed.

I've just updated the build with a longer grace period and more logging. Could you please try it out?

no longer seeing this on :grace! but no log messages either.

can you tell me if you see recentlyStarted in your logs?

No recentlyStarted.... I'm going to try reproducing this again on 2.12rc2

That's really odd. You should absolutely see it. What is your log level? info? Is your imagePullPolicy set to Always?

Log level debug as we set in our Zoom troubleshooting and I haven't changed it; imagePullPolicy: Always.

there are definitely dozens of pods being started and completed, and none of the given strings appear in the controller logs.... I double checked that I'm still on :grace.
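One way to check for the diagnostic strings mentioned in this thread, assuming the same controller Deployment and namespace as above:

```shell
# Search recent controller logs for the strings under discussion
kubectl -n argo logs deploy/workflow-controller --tail=100000 \
  | grep -E 'recentlyStarted|Pod for node deleted' || echo "no matches"
```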

Do you see Pod for node deleted? You should not.

No, I'm not seeing that. but I'm no longer seeing deleted pods either

Ok, I did not docker push the very latest version. I have now pushed it, if you have time to test and check for recentlyStarted in your logs.

Why am I asking again? Seeing recentlyStarted: true will give me confidence that the fix is correct, and that it wasn't something else that changed.

Sounds good, I will try right now before signing off for the day. thanks for all your attention to this!

I wasn't able to reproduce this at all (still on :grace). And see none of the above log messages...
Tomorrow I'm going to try reproducing this again on 2.12.0rc2
A silly thought perhaps, but could this have been caused by another workflow? The only thing that changed throughout the day (aside from my testing) was that we deleted some previously failed workflows.

Update: we did see quite a lot of deleted pods through the night - hard to say how many exactly because we were running over 12,000 pods, but it's definitely in the thousands. this is still on :grace. So I guess it's not as immediately reproducible, but definitely still happening.
I can't say whether there were any log messages with recentlyStarted: true, as the controller is pumping out such massive log volume that they've already been rotated out
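To avoid losing the interesting lines to log rotation, one option is to stream and filter the controller log into a file (path and match strings are illustrative):

```shell
# Follow the controller log, keep only the lines of interest,
# and persist them so rotation on the pod side doesn't matter
kubectl -n argo logs -f deploy/workflow-controller \
  | grep --line-buffered -E 'recentlyStarted|grace' \
  >> /tmp/argo-grace-diagnostics.log
```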

Not fixed - but less of a problem?

Ah - the :grace version does not have any of the fixes from :mot (which is now merged to :latest). If I create a new engineering build of :grace that includes those - could you test them together?

Done. argoproj/workflow-controller:grace has both fixes.

hey @alexec - thank you - I was offline for a couple of days, but will test it as soon as I can. It is unlikely to happen today.

Not fixed - but less of a problem?

This is correct. Definitely not as reproducible anymore. For what it's worth, my team has been running jobs on the :mot build since Friday (4 days), and many very large workflows have completed perfectly, while some smaller ones experienced pod deletion. There doesn't seem to be any systematic pattern to it - it's very intermittent. I'll try the new :grace build when it's safe to do so; we're running some really critical workloads right now so I can't promise to do it soon, unfortunately.

Some more deleted pods today; I switched to :grace. Seeing this:

[screenshot attached]

  • significant increase in RAM usage of the workflow controller
  • still some deleted pods, very few and very intermittent
  • still not seeing any log messages like recentlyStarted: true or grace, but with the frequent controller restarts and infrequent pod deletions, I can't be sure.

The workflows seem to be running successfully and without changes though; I'm debating whether to revert to :mot or just leave it alone. The increased RAM usage is not an issue for us.

  • recentlyStarted was logged at debug level. I've increased this to info level and published a new :grace image.
  • I've added RECENTLY_STARTED_POD_DURATION=10s you could try setting this to higher values, e.g. RECENTLY_STARTED_POD_DURATION=30s to see if that improves the issue.
  • :mot is merged to master and the fix is on :latest.
  • I would not expect to see increased RAM usage. Are you able to test :latest so we can be sure that the issue is with grace not something else?
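Assuming the same controller Deployment as above, the new knob from the comment could be raised like so (30s is the suggested experiment, not a recommendation):

```shell
kubectl -n argo set env deployment/workflow-controller RECENTLY_STARTED_POD_DURATION=30s
```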

It is Thanksgiving at the moment - I'll look in more detail on Monday.

no worries and no rush - happy Thanksgiving! I actually did change to :latest shortly after posting the last reply, as I wasn't comfortable with the spurious restarts. And yes, on :latest the restarts stopped, and the RAM usage went back down and flattened. I noticed that on :grace the RAM usage increased proportionately to the number of workflow pods in the cluster - which is not the case on :latest.

I've been at debug loglevel the entire time, so should've seen the recentlyStarted messages, but perhaps the issue is too intermittent. We don't have log aggregation in place, and tailing the logs doesn't seem to be sufficient, there's just too much volume relative to the frequency of these occurrences. Anyway, I'm going to avoid changing builds at least today and tomorrow, so this may have to wait until Monday for me as well.

Am I the only user reporting this? Perhaps it would be helpful for someone else to also try reproducing the issue, just to be sure it's not a red herring caused by something related to our EKS cluster set up (I don't see how it could be, but stranger things have happened).

@ebr I think I've heard of one or two users mention this (but not reporting this by creating a ticket), so I assume they are not seriously impacted.

Let me know how you get on with :grace with RECENTLY_STARTED_POD_DURATION=30s.

I believe this has likely been fixed/mitigated on master.

We can confirm the issue is gone in our runs on a build off commit https://github.com/argoproj/argo/commit/625e3ce265e17df9315231e82e9a346aba400b14 on master.
