What happened:
In multi-root workflows in a single pod, the wait container gets stuck. It is not possible to terminate the workflow. The only way to stop it is to delete it.
The logs of the stuck wait container:
$ kubectl logs workflow-zdkm8-552438975 -n argo -c wait
time="2020-02-19T06:06:48Z" level=info msg="Creating a docker executor"
time="2020-02-19T06:06:48Z" level=info msg="Executor (version: v0.0.0+unknown, build_date: 1970-01-01T00:00:00Z) initialized (pod: <redacted> )}}}"
time="2020-02-19T06:06:48Z" level=info msg="Waiting on main container"
time="2020-02-19T06:06:49Z" level=info msg="main container started with container ID: 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6"
time="2020-02-19T06:06:49Z" level=info msg="Starting annotations monitor"
time="2020-02-19T06:06:49Z" level=info msg="Execution control set from API: {2020-02-19 10:06:47 +0000 UTC false}"
time="2020-02-19T06:06:49Z" level=info msg="docker wait 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6"
time="2020-02-19T06:06:49Z" level=info msg="Starting deadline monitor"
time="2020-02-19T06:06:53Z" level=info msg="Main container completed"
time="2020-02-19T06:06:53Z" level=info msg="Saving logs"
time="2020-02-19T06:06:53Z" level=info msg="[docker logs 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6]"
time="2020-02-19T06:06:53Z" level=info msg="Annotations monitor stopped"
time="2020-02-19T06:06:53Z" level=info msg="Deadline monitor stopped"
The processes running in the container are:
root@pod-m8-552438975:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.4 145816 34196 ? Ssl 06:06 0:00 argoexec wait
root 22 0.0 0.3 45672 28684 ? Sl 06:06 0:00 docker logs 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6
root 41 0.0 0.0 18136 3188 pts/0 Ss 06:11 0:00 /bin/bash
root 54 0.0 0.0 36636 2844 pts/0 R+ 06:12 0:00 ps aux
If I kill the docker logs process (PID 22), the wait container exits successfully.
root@pod-zdkm8-552438975:/# kill 22
root@pod-zdkm8-552438975:/# command terminated with exit code 137
Environment:
Argo version: 2.5-rc8 to 2.5
Kubernetes version:
clientVersion:
buildDate: "2019-09-18T14:36:53Z"
compiler: gc
gitCommit: 2bd9643cee5b3b3a5ecbd3af49d09018f0773c77
gitTreeState: clean
gitVersion: v1.16.0
goVersion: go1.12.9
major: "1"
minor: "16"
platform: linux/amd64
serverVersion:
buildDate: "2019-11-13T11:11:50Z"
compiler: gc
gitCommit: 7015f71e75f670eb9e7ebd4b5749639d42e20079
gitTreeState: clean
gitVersion: v1.15.6
goVersion: go1.12.12
major: "1"
minor: "15"
platform: linux/
Maybe it was introduced by this PR: https://github.com/argoproj/argo/pull/2136
@tcolgate FYI
@Arkanayan do you have an example of this problem? E.g. in the examples folder?
Sorry @alexec, I am not able to reproduce this issue with the examples.
If we cannot repro an issue, we cannot fix it. Please let me know if you're able to provide YAML that repros the issue.
Do you believe the issue appeared in v2.5.0-rc8, i.e. it was not present in v2.5.0-rc7? If that is the case, we can reach out to the committers of that release and see what they think.
I've just encountered this on v2.5.1, I'll try to provide a minimal example.
In our case, it seems to be potentially related to: https://github.com/argoproj/argo/issues/1493
EDIT: Still investigating. It may be related to our use of tqdm (https://github.com/tqdm/tqdm), which we did not turn off. I believe it spits out quite a lot of logs because it writes to stderr, and there has been a recent change to the handling of stderr.
EDIT 2: Disabling tqdm seems to fix it for us. I now believe the problem is when you write lots of things to stderr.
I'm having this as well.
wait container logs:
time="2020-02-25T01:02:00Z" level=info msg="Waiting on main container"
time="2020-02-25T01:02:05Z" level=info msg="main container started with container ID: e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394"
time="2020-02-25T01:02:05Z" level=info msg="Starting annotations monitor"
time="2020-02-25T01:02:06Z" level=info msg="docker wait e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394"
time="2020-02-25T01:02:06Z" level=info msg="Starting deadline monitor"
time="2020-02-25T01:07:00Z" level=info msg="Alloc=3200 TotalAlloc=9568 Sys=68674 NumGC=5 Goroutines=11"
time="2020-02-25T01:10:51Z" level=info msg="Main container completed"
time="2020-02-25T01:10:51Z" level=info msg="Saving logs"
time="2020-02-25T01:10:51Z" level=info msg="Annotations monitor stopped"
time="2020-02-25T01:10:51Z" level=info msg="[docker logs e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394]"
time="2020-02-25T01:10:52Z" level=info msg="Deadline monitor stopped"
time="2020-02-25T01:12:00Z" level=info msg="Alloc=3309 TotalAlloc=9685 Sys=70080 NumGC=7 Goroutines=6"
time="2020-02-25T01:17:00Z" level=info msg="Alloc=3217 TotalAlloc=9686 Sys=70080 NumGC=10 Goroutines=6"
Running argo 2.6.0-rc2
then the last line just repeats...
EDIT: Disabling the archiveLogs seems to "fix it" for some workflows.
Agree with @MrSaints, writing too much output seems to interfere with the wait container, since the workflows that remained Running were the ones that wrote a lot of logs.
It might be worth manually running the docker logs command on the host, check if it returns and how much data it returns. I don't /think/ the NopCloser use could cause this. We could wrap stderr and stdout in a custom Closer that closes both, to be absolutely sure.
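For anyone following along, the "wrap stderr and stdout in a custom Closer" idea might look roughly like the sketch below. This is only an illustration under that assumption; the multiCloser type and newMultiCloser helper are hypothetical names, not the actual Argo executor code.

```go
package logpipe // hypothetical package, for illustration only

import "io"

// multiCloser exposes two streams (e.g. the stdout and stderr pipes of an
// exec.Cmd) as a single ReadCloser and guarantees Close releases both.
type multiCloser struct {
	io.Reader
	closers []io.Closer
}

// Close closes every wrapped stream and reports the first error seen.
func (m *multiCloser) Close() error {
	var first error
	for _, c := range m.closers {
		if err := c.Close(); err != nil && first == nil {
			first = err
		}
	}
	return first
}

// newMultiCloser reads stdout then stderr in sequence (io.MultiReader),
// while making sure both ends are closed together.
func newMultiCloser(stdout, stderr io.ReadCloser) io.ReadCloser {
	return &multiCloser{
		Reader:  io.MultiReader(stdout, stderr),
		closers: []io.Closer{stdout, stderr},
	}
}
```

As the later comments in this thread work out, though, the closing behaviour turned out not to be the problem; the sequential reading behaviour of io.MultiReader was.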
We are seeing a similar issue as mentioned by @rmgpinto (we're running v2.5.0).
Some of our workflows don't terminate due to a sidecar stuck in the Running state. In our case, the sidecar is TensorBoard. We've been running these workflows for many months without issues, so this behavior is new.
We have just set archiveLogs: false at the workflow-controller and workflow-template level - I'll report back if this mitigates the issue.
Here's the wait container's logs, although they're not much more interesting than the logs @rmgpinto shared:
time="2020-02-28T06:03:23Z" level=info msg="Main container completed"
time="2020-02-28T06:03:23Z" level=info msg="Saving logs"
time="2020-02-28T06:03:23Z" level=info msg="Annotations monitor stopped"
time="2020-02-28T06:03:23Z" level=info msg="[docker logs 6a2ef3074dc0a844737f8ba0da1bedb81d556a51acb9afc65d5a590d67649b03]"
time="2020-02-28T06:03:24Z" level=info msg="Deadline monitor stopped"
time="2020-02-28T06:07:59Z" level=info msg="Alloc=2398 TotalAlloc=5488 Sys=70592 NumGC=49 Goroutines=4"
time="2020-02-28T06:12:59Z" level=info msg="Alloc=2398 TotalAlloc=5492 Sys=70592 NumGC=51 Goroutines=4"
time="2020-02-28T06:17:59Z" level=info msg="Alloc=2398 TotalAlloc=5497 Sys=70592 NumGC=54 Goroutines=4"
...
It could be worthwhile to share our Docker version: 18.09.9-ce.
> It might be worth manually running the docker logs command on the host, check if it returns and how much data it returns. I don't /think/ the NopCloser use could cause this. We could wrap stderr and stdout in a custom Closer that closes both, to be absolutely sure.
Running docker logs in the wait container works as intended, instantly returning the logs and exiting.
Can someone test this?
On Tue, 3 Mar 2020, 18:37 Alex Collins (notifications@github.com) wrote:
> Closed #2261 (https://github.com/argoproj/argo/issues/2261) via #2345 (https://github.com/argoproj/argo/pull/2345).
I can test with my workflows.
I'm using the Docker images; can you make a 2.6.1-rc?
Thanks
Can this bug fix be merged into v2.5 and v2.6? It blocks us from using the latest stable versions.
We probably won't update v2.5, as it is now end of life. It should be back-ported to v2.6.1.
Containers are still stuck in 2.6.1.
Can you do a ps -ef in the running executor container and send us the result? Does it show docker logs still running, or any zombie processes?
$ kubectl -n argo exec -it test-workflow-nrf7h-741316927 -c wait -- ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 09:46 ? 00:00:00 argoexec wait
root 23 1 0 09:55 ? 00:00:00 docker logs 6d57f49eaa0af8ebf36d
root 47 0 0 09:57 pts/0 00:00:00 ps -ef
If I execute kubectl -n argo exec -it test-workflow-nrf7h-741316927 -c wait -- docker logs 6d57f49eaa0af8ebf36d, it shows the logs and exits.
Right, I think I'm getting closer. MultiReader reads all inputs sequentially, so I think once we exceed the buffer amount on stderr, docker logs blocks to write, and doesn't exit, so we never finish reading stdin. (stupidly, I didn't check the multireader docs, it has been a while since I used it).
Need to try and recreate but should have a fix real soon.
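To make that concrete, here is a minimal, self-contained Go sketch of the failure mode described above (an illustration only, not the Argo executor code; the actual patch in #2368 may differ in its details). Reading the two pipes sequentially through io.MultiReader can deadlock once the child fills the stderr pipe buffer, whereas draining them concurrently cannot:

```go
package main

import (
	"io"
	"os"
	"os/exec"
)

func main() {
	// Child that writes a large amount to stderr, mirroring the repro
	// workflow posted below.
	cmd := exec.Command("sh", "-c", "for i in $(seq 100000); do date >&2; done")
	stdout, _ := cmd.StdoutPipe()
	stderr, _ := cmd.StderrPipe()
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Deadlock-prone: io.MultiReader reads stdout to EOF before touching
	// stderr. The child blocks writing to the full stderr pipe and never
	// exits, so stdout never reaches EOF and neither side makes progress.
	//
	//   _, _ = io.Copy(os.Stdout, io.MultiReader(stdout, stderr))

	// Safer: drain both pipes concurrently so neither can back up.
	done := make(chan struct{})
	go func() {
		_, _ = io.Copy(os.Stderr, stderr)
		close(done)
	}()
	_, _ = io.Copy(os.Stdout, stdout)
	<-done

	if err := cmd.Wait(); err != nil {
		panic(err)
	}
}
```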
Tested this with the following workflow:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: wftestthing.3
  namespace: argo
spec:
  entrypoint: build
  templates:
  - container:
      args:
      - "for i in $(seq 100000); do date > /dev/stderr; done "
      command:
      - sh
      - -c
      image: alpine:latest
    name: build
```
I didn't check to see if that hangs on 2.6.1, but the old code hung on the equivalent with some local testing.
I tested this workflow with 2.6.1. The issue still exists.
$ kubectl exec -it issue2261 -c wait -- ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 15:44 ? 00:00:00 argoexec wait
root 21 1 0 15:45 ? 00:00:00 docker logs 3f25da9882f9ffbba5b4
root 33 0 0 15:47 pts/0 00:00:00 ps -ef
@tigerwings yes, I'd hoped that would fail with 2.6.1, but I've tested against #2368, and it works with that patch applied (this ticket needs reopening).
We upgraded yesterday from 2.4.x to 2.6.0 and faced the same issue for some of our workflows. The wait container got stuck and the process output showed docker logs still running...
Looking forward to Argo v2.6.2.
I can confirm that v2.6.2 fixed this issue for us, thanks!
I can confirm as well.
Thanks @tcolgate !
Well, I also did break it in the first place :)
Only people that do nothing won't break anything :)