Argo: Argo workflow runs forever when container fails to start

Created on 9 Mar 2018  ·  3 Comments  ·  Source: argoproj/argo

Is this a BUG REPORT or FEATURE REQUEST?: Feature?

What happened:

When a container in my workflow failed to start, the Argo UI logs showed:

{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"container \"main\" in pod \"fooworker-98g74-364291675\" is waiting to start: ContainerCreating","reason":"BadRequest","code":400}

The argo workflow seemed to run forever:

⇒  argo list
NAME                        STATUS    AGE    DURATION
fooworker-98g74   Running   12m    12m

There was an event that described the error:

⇒  kubectl describe pod fooworker-98g74-364291675
..snip..
Events:
  Type     Reason                 Age               From               Message
  ----     ------                 ----              ----               -------
..snip..
  Warning  FailedMount            56s (x8 over 2m)  kubelet, minikube  MountVolume.SetUp failed for volume "secretstore" : secrets "foosecret" not found

What you expected to happen:

The workflow to fail with an error/provide more useful information. Possibly for the failing event to be shown in the argo UI logs.

How to reproduce it (as minimally and precisely as possible):

Create an Argo workflow whose pod cannot start (e.g. one that mounts a missing secret).
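
A minimal sketch of a workflow that might reproduce this, assuming the volume and secret names from the FailedMount event (`secretstore`, `foosecret`). The workflow name, template name, image, command, and mount path are hypothetical and were not part of the original report:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: missing-secret-    # hypothetical name
spec:
  entrypoint: main
  volumes:
  - name: secretstore
    secret:
      secretName: foosecret        # must NOT exist in the target namespace
  templates:
  - name: main                     # hypothetical template
    container:
      image: alpine:3.7            # any pullable image should do
      command: [sh, -c]
      args: ["ls /secret"]
      volumeMounts:
      - name: secretstore
        mountPath: /secret         # hypothetical mount path
```

Submitting this with `argo submit` in a namespace where `foosecret` does not exist should leave the pod stuck in ContainerCreating while the workflow keeps reporting Running.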

Anything else we need to know?: N/A

Environment:

  • Argo version:
$ argo version
argo: v2.0.0
  BuildDate: 2018-02-06T21:38:42Z
  GitCommit: 0978b9c61cb7435d31ef8d252b80e03708a70adc
  GitTreeState: clean
  GitTag: v2.0.0
  GoVersion: go1.9.1
  Compiler: gc
  Platform: darwin/amd64
  • Kubernetes version:
$ kubectl version -o yaml
clientVersion:
  buildDate: 2018-02-09T21:51:06Z
  compiler: gc
  gitCommit: d2835416544f298c919e2ead3be3d0864b52323b
  gitTreeState: clean
  gitVersion: v1.9.3
  goVersion: go1.9.4
  major: "1"
  minor: "9"
  platform: darwin/amd64
serverVersion:
  buildDate: 2018-01-26T19:04:38Z
  compiler: gc
  gitCommit: 925c127ec6b946659ad0fd596fa959be43f0cc05
  gitTreeState: clean
  gitVersion: v1.9.0
  goVersion: go1.9.1
  major: ""
  minor: ""
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
$ argo get fooworker-98g74
Name:             fooworker-98g74
Namespace:        default
ServiceAccount:   default
Status:           Running
Created:          Fri Mar 09 16:28:06 +1100 (15 minutes ago)
Started:          Fri Mar 09 16:28:06 +1100 (15 minutes ago)
Duration:         15 minutes 48 seconds
Parameters:
..redacted..

STEP                               PODNAME                               DURATION  MESSAGE
 ● fooworker-98g74
..snip: all other steps worked as expected..
 └---● util-final                  fooworker-98g74-364291675   10m
  • executor logs:
$ kubectl logs <failedpodname> -c init
Error from server (BadRequest): container init is not valid for pod fooworker-98g74-364291675

$ kubectl logs <failedpodname> -c wait
Error from server (BadRequest): container "wait" in pod "fooworker-98g74-364291675" is waiting to start: ContainerCreating

Labels: duplicate

All 3 comments

Marking this as a duplicate of issue #525. If we incorporate Pending as a phase, it would point the user in the right direction by showing that the pod is still Pending rather than Running.

Pending reflects a different phase than a failure though. Perhaps Argo should surface the different k8s status values as is (ImagePullBackOff, Pending, Failed, Running... etc)?

@yebrahim yes, this is what was done. See example output:

$ argo get image-pull-fail-tb57d
Name:                image-pull-fail-tb57d
Namespace:           default
ServiceAccount:      default
Status:              Running
Created:             Tue Aug 21 17:25:07 -0700 (5 seconds ago)
Started:             Tue Aug 21 17:25:07 -0700 (5 seconds ago)
Duration:            5 seconds

STEP                      PODNAME                DURATION  MESSAGE
 ◷ image-pull-fail-tb57d  image-pull-fail-tb57d  5s        ImagePullBackOff: Back-off pulling image "alpine:doesntexist"
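
For anyone who wants the same information straight from Kubernetes, here is a rough sketch of how a waiting reason like the one in the MESSAGE column could be read from a pod's status. It is not Argo's actual implementation. It assumes the pod has been loaded as a dict shaped like `kubectl get pod <name> -o json` output; the function name `waiting_reasons` and the sample data are made up for illustration.

```python
def waiting_reasons(pod):
    """Return (container, reason, message) for every container still waiting."""
    status = pod.get("status") or {}
    results = []
    # Init containers and regular containers report their state separately.
    for key in ("initContainerStatuses", "containerStatuses"):
        for cs in status.get(key) or []:
            waiting = (cs.get("state") or {}).get("waiting")
            if waiting:
                results.append((
                    cs.get("name", ""),
                    waiting.get("reason", ""),
                    waiting.get("message", ""),
                ))
    return results


# Illustrative sample only; values mirror the example output in this thread.
sample = {
    "status": {
        "phase": "Pending",
        "containerStatuses": [
            {
                "name": "main",
                "state": {
                    "waiting": {
                        "reason": "ImagePullBackOff",
                        "message": 'Back-off pulling image "alpine:doesntexist"',
                    }
                },
            },
            {"name": "wait", "state": {"running": {}}},
        ],
    }
}

for name, reason, message in waiting_reasons(sample):
    print(f"{name}: {reason}: {message}")
```

Note that in the missing-secret case from the original report, the container status only says ContainerCreating. The "secrets not found" detail appears in the pod's events, so a check like this would not surface it by itself.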