What happened:
In multi-root workflows in a single pod, the wait container gets stuck. It is not possible to terminate the workflow. The only way to stop it is to delete it.
The logs of the stuck wait container:
$ kubectl logs workflow-zdkm8-552438975 -n argo -c wait
time="2020-02-19T06:06:48Z" level=info msg="Creating a docker executor"
time="2020-02-19T06:06:48Z" level=info msg="Executor (version: v0.0.0+unknown, build_date: 1970-01-01T00:00:00Z) initialized (pod: <redacted> )}}}"
time="2020-02-19T06:06:48Z" level=info msg="Waiting on main container"
time="2020-02-19T06:06:49Z" level=info msg="main container started with container ID: 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6"
time="2020-02-19T06:06:49Z" level=info msg="Starting annotations monitor"
time="2020-02-19T06:06:49Z" level=info msg="Execution control set from API: {2020-02-19 10:06:47 +0000 UTC false}"
time="2020-02-19T06:06:49Z" level=info msg="docker wait 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6"
time="2020-02-19T06:06:49Z" level=info msg="Starting deadline monitor"
time="2020-02-19T06:06:53Z" level=info msg="Main container completed"
time="2020-02-19T06:06:53Z" level=info msg="Saving logs"
time="2020-02-19T06:06:53Z" level=info msg="[docker logs 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6]"
time="2020-02-19T06:06:53Z" level=info msg="Annotations monitor stopped"
time="2020-02-19T06:06:53Z" level=info msg="Deadline monitor stopped"
The processes running in the container are:
root@pod-m8-552438975:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.4 145816 34196 ? Ssl 06:06 0:00 argoexec wait
root 22 0.0 0.3 45672 28684 ? Sl 06:06 0:00 docker logs 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6
root 41 0.0 0.0 18136 3188 pts/0 Ss 06:11 0:00 /bin/bash
root 54 0.0 0.0 36636 2844 pts/0 R+ 06:12 0:00 ps aux
If I kill the docker logs process (PID 22), the wait container exits successfully.
root@pod-zdkm8-552438975:/# kill 22
root@pod-zdkm8-552438975:/# command terminated with exit code 137
Environment:
Argo version: 2.5-rc8 to 2.5
Kubernetes version:
clientVersion:
buildDate: "2019-09-18T14:36:53Z"
compiler: gc
gitCommit: 2bd9643cee5b3b3a5ecbd3af49d09018f0773c77
gitTreeState: clean
gitVersion: v1.16.0
goVersion: go1.12.9
major: "1"
minor: "16"
platform: linux/amd64
serverVersion:
buildDate: "2019-11-13T11:11:50Z"
compiler: gc
gitCommit: 7015f71e75f670eb9e7ebd4b5749639d42e20079
gitTreeState: clean
gitVersion: v1.15.6
goVersion: go1.12.12
major: "1"
minor: "15"
platform: linux/
Maybe it was introduced by this PR: https://github.com/argoproj/argo/pull/2136
@tcolgate FYI
@Arkanayan do you have an example of this problem? E.g. in the examples folder?
Sorry @alexec, I am not able to reproduce this issue with the examples.
If we cannot repro an issue, we cannot fix it. Please let me know if you're able to provide YAML that repros the issue.
Do you believe the issue appeared in v2.5.0-rc8, i.e. it was not present in v2.5.0-rc7? If that is the case, we can reach out to the committers of that release and see what they think.
I've just encountered this on v2.5.1, I'll try to provide a minimal example.
In our case, it seems to be potentially related to: https://github.com/argoproj/argo/issues/1493
EDIT: Still investigating. It may be related to our use of tqdm (https://github.com/tqdm/tqdm), which we did not turn off. I believe it spits out quite a lot of logs because it writes to stderr, and there has been a recent change to the handling of stderr.
EDIT 2: Disabling tqdm seems to fix it for us. I now believe the problem is when you write lots of things to stderr.
I'm having this as well.
wait container logs:
time="2020-02-25T01:02:00Z" level=info msg="Waiting on main container"
time="2020-02-25T01:02:05Z" level=info msg="main container started with container ID: e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394"
time="2020-02-25T01:02:05Z" level=info msg="Starting annotations monitor"
time="2020-02-25T01:02:06Z" level=info msg="docker wait e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394"
time="2020-02-25T01:02:06Z" level=info msg="Starting deadline monitor"
time="2020-02-25T01:07:00Z" level=info msg="Alloc=3200 TotalAlloc=9568 Sys=68674 NumGC=5 Goroutines=11"
time="2020-02-25T01:10:51Z" level=info msg="Main container completed"
time="2020-02-25T01:10:51Z" level=info msg="Saving logs"
time="2020-02-25T01:10:51Z" level=info msg="Annotations monitor stopped"
time="2020-02-25T01:10:51Z" level=info msg="[docker logs e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394]"
time="2020-02-25T01:10:52Z" level=info msg="Deadline monitor stopped"
time="2020-02-25T01:12:00Z" level=info msg="Alloc=3309 TotalAlloc=9685 Sys=70080 NumGC=7 Goroutines=6"
time="2020-02-25T01:17:00Z" level=info msg="Alloc=3217 TotalAlloc=9686 Sys=70080 NumGC=10 Goroutines=6"
Running argo 2.6.0-rc2
then the last line just repeats...
EDIT: Disabling the archiveLogs seems to "fix it" for some workflows.
Agree with @MrSaints, writing too much output seems to interfere with the wait container, since the workflows that remained Running were the ones that wrote a lot of logs.
It might be worth manually running the docker logs command on the host, check if it returns and how much data it returns. I don't /think/ the NopCloser use could cause this. We could wrap stderr and stdout in a custom Closer that closes both, to be absolutely sure.
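For anyone following along, the "wrap stderr and stdout in a custom Closer" idea might look roughly like the sketch below. This is only an illustration under that assumption; the multiCloser type and newMultiCloser helper are hypothetical names, not the actual Argo executor code.

```go
package logpipe // hypothetical package, for illustration only

import "io"

// multiCloser exposes two streams (e.g. the stdout and stderr pipes of an
// exec.Cmd) as a single ReadCloser and guarantees Close releases both.
type multiCloser struct {
	io.Reader
	closers []io.Closer
}

// Close closes every wrapped stream and reports the first error seen.
func (m *multiCloser) Close() error {
	var first error
	for _, c := range m.closers {
		if err := c.Close(); err != nil && first == nil {
			first = err
		}
	}
	return first
}

// newMultiCloser reads stdout then stderr in sequence (io.MultiReader),
// while making sure both ends are closed together.
func newMultiCloser(stdout, stderr io.ReadCloser) io.ReadCloser {
	return &multiCloser{
		Reader:  io.MultiReader(stdout, stderr),
		closers: []io.Closer{stdout, stderr},
	}
}
```

As the later comments in this thread work out, though, the closing behaviour turned out not to be the problem; the sequential reading behaviour of io.MultiReader was.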
We are seeing a similar issue as mentioned by @rmgpinto (we're running v2.5.0).
Some of our workflows don't terminate due to a sidecar stuck in the Running state. In our case, the sidecar is TensorBoard. We've been running these workflows for many months without issues, so this behavior is new.
We have just set archiveLogs: false at the workflow-controller and workflow-template level - I'll report back if this mitigates the issue.
Here's the wait container's logs, although they're not much more interesting than the logs @rmgpinto shared:
time="2020-02-28T06:03:23Z" level=info msg="Main container completed"
time="2020-02-28T06:03:23Z" level=info msg="Saving logs"
time="2020-02-28T06:03:23Z" level=info msg="Annotations monitor stopped"
time="2020-02-28T06:03:23Z" level=info msg="[docker logs 6a2ef3074dc0a844737f8ba0da1bedb81d556a51acb9afc65d5a590d67649b03]"
time="2020-02-28T06:03:24Z" level=info msg="Deadline monitor stopped"
time="2020-02-28T06:07:59Z" level=info msg="Alloc=2398 TotalAlloc=5488 Sys=70592 NumGC=49 Goroutines=4"
time="2020-02-28T06:12:59Z" level=info msg="Alloc=2398 TotalAlloc=5492 Sys=70592 NumGC=51 Goroutines=4"
time="2020-02-28T06:17:59Z" level=info msg="Alloc=2398 TotalAlloc=5497 Sys=70592 NumGC=54 Goroutines=4"
...
It could be worthwhile to share our Docker version: 18.09.9-ce.
> It might be worth manually running the docker logs command on the host, check if it returns and how much data it returns. I don't /think/ the NopCloser use could cause this. We could wrap stderr and stdout in a custom Closer that closes both, to be absolutely sure.
Running docker logs in the wait container works as intended, instantly returning the logs and exiting.
Can someone test this?
On Tue, 3 Mar 2020, 18:37 Alex Collins (notifications@github.com) wrote:
> Closed #2261 (https://github.com/argoproj/argo/issues/2261) via #2345 (https://github.com/argoproj/argo/pull/2345).
I can test with my workflows.
I'm using the Docker images; can you make a 2.6.1-rc?
Thanks
Can this bug fix be merged into v2.5 and v2.6? It blocks us from using the latest stable versions.
We probably won't update v2.5, as it is now end of life. It should be back-ported to v2.6.1.
Containers are still stuck in 2.6.1.
Can you do a ps -ef in the running executor container and send us the result? Does it show docker logs still running, or any zombie processes?
$ kubectl -n argo exec -it test-workflow-nrf7h-741316927 -c wait -- ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 09:46 ? 00:00:00 argoexec wait
root 23 1 0 09:55 ? 00:00:00 docker logs 6d57f49eaa0af8ebf36d
root 47 0 0 09:57 pts/0 00:00:00 ps -ef
If I execute kubectl -n argo exec -it test-workflow-nrf7h-741316927 -c wait -- docker logs 6d57f49eaa0af8ebf36d, it shows the logs and exits.
Right, I think I'm getting closer. MultiReader reads all inputs sequentially, so I think once we exceed the buffer amount on stderr, docker logs blocks to write, and doesn't exit, so we never finish reading stdin. (stupidly, I didn't check the multireader docs, it has been a while since I used it).
Need to try and recreate but should have a fix real soon.
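To make that concrete, here is a minimal, self-contained Go sketch of the failure mode described above (an illustration only, not the Argo executor code; the actual patch in #2368 may differ in its details). Reading the two pipes sequentially through io.MultiReader can deadlock once the child fills the stderr pipe buffer, whereas draining them concurrently cannot:

```go
package main

import (
	"io"
	"os"
	"os/exec"
)

func main() {
	// Child that writes a large amount to stderr, mirroring the repro
	// workflow posted below.
	cmd := exec.Command("sh", "-c", "for i in $(seq 100000); do date >&2; done")
	stdout, _ := cmd.StdoutPipe()
	stderr, _ := cmd.StderrPipe()
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Deadlock-prone: io.MultiReader reads stdout to EOF before touching
	// stderr. The child blocks writing to the full stderr pipe and never
	// exits, so stdout never reaches EOF and neither side makes progress.
	//
	//   _, _ = io.Copy(os.Stdout, io.MultiReader(stdout, stderr))

	// Safer: drain both pipes concurrently so neither can back up.
	done := make(chan struct{})
	go func() {
		_, _ = io.Copy(os.Stderr, stderr)
		close(done)
	}()
	_, _ = io.Copy(os.Stdout, stdout)
	<-done

	if err := cmd.Wait(); err != nil {
		panic(err)
	}
}
```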
Tested this with the following workflow:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: wftestthing.3
  namespace: argo
spec:
  entrypoint: build
  templates:
  - container:
      args:
      - "for i in $(seq 100000); do date > /dev/stderr; done "
      command:
      - sh
      - -c
      image: alpine:latest
    name: build
```
I didn't check to see if that hangs on 2.6.1, but the old code hung on the equivalent with some local testing.
I tested this workflow with 2.6.1. The issue still exists.
$ kubectl exec -it issue2261 -c wait -- ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 15:44 ? 00:00:00 argoexec wait
root 21 1 0 15:45 ? 00:00:00 docker logs 3f25da9882f9ffbba5b4
root 33 0 0 15:47 pts/0 00:00:00 ps -ef
@tigerwings yes, I'd hoped that would fail with 2.6.1, but I've tested against #2368, and it works with that patch applied (this ticket needs reopening).
We upgraded yesterday from 2.4.x to 2.6.0 and faced the same issue for some of our workflows. The wait container got stuck and the process output showed docker logs still running...
Looking forward to Argo v2.6.2.
I can confirm that v2.6.2 fixed this issue for us, thanks!
I can confirm as well.
Thanks @tcolgate !
Well, I also did break it in the first place :)
Only people that do nothing won't break anything :)