Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT
What happened:
I see this error message intermittently and seemingly at random. It is not tied to a specific task: a task can fail with it on one run and work fine on the next.
The task itself runs fine and I can see its output, but if another task depends on it, the workflow does not proceed to that next task.
What you expected to happen:
I did not use to see this error. It started appearing at some point after I installed Kubeflow Pipelines and ran a task. I have since removed and redeployed Argo, but I still see the error intermittently.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
$ argo version
argo: v2.2.1
BuildDate: 2018-10-11T16:25:59Z
GitCommit: 3b52b26190163d1f72f3aef1a39f9f291378dafb
GitTreeState: clean
GitTag: v2.2.1
GoVersion: go1.10.3
Compiler: gc
Platform: darwin/amd64
$ kubectl version -o yaml
clientVersion:
buildDate: 2018-07-10T10:13:58Z
compiler: gc
gitCommit: 91e7b4fd31fcd3d5f436da26c980becec37ceefe
gitTreeState: clean
gitVersion: v1.11.0
goVersion: go1.10.3
major: "1"
minor: "11"
platform: darwin/amd64
serverVersion:
buildDate: 2018-12-06T23:13:14Z
compiler: gc
gitCommit: 6bad6d9c768dc0864dab48a11653aa53b5a47043
gitTreeState: clean
gitVersion: v1.11.5-eks-6bad6d
goVersion: go1.10.3
major: "1"
minor: 11+
platform: linux/amd64
Other debugging information (if applicable):
$ argo get <workflowname>
argo get argo-gpu-s3-copy-4qzp8
Name: argo-gpu-s3-copy-4qzp8
Namespace: development
ServiceAccount: argo
Status: Error
Created: Sat Feb 02 19:10:05 +0000 (3 minutes ago)
Started: Sat Feb 02 19:10:05 +0000 (3 minutes ago)
Finished: Sat Feb 02 19:10:21 +0000 (3 minutes ago)
Duration: 16 seconds
Parameters:
s3-path: Shared_data/OULU/small_frames_npy
local-path: test2
bucker-name: onfido-mlplatform-in
node-selector: m4.xlarge
STEP PODNAME DURATION MESSAGE
⚠ argo-gpu-s3-copy-4qzp8
└-⚠ list-chunk argo-gpu-s3-copy-4qzp8-3040831338 16s failed to save outputs: interface conversion: error is *exec.Error, not *exec.ExitError
$ kubectl logs <failedpodname> -c init
$ kubectl logs <failedpodname> -c wait
$ kubectl logs -n kube-system $(kubectl get pods -l app=workflow-controller -n kube-system -o name)
workflow-controller log:
time="2019-02-02T19:25:02Z" level=info msg="Processing workflow" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:02Z" level=info msg="Updated phase -> Running" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:02Z" level=info msg="Steps node argo-gpu-s3-copy-58kdd (argo-gpu-s3-copy-58kdd) initialized Running" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:02Z" level=info msg="StepGroup node argo-gpu-s3-copy-58kdd[0] (argo-gpu-s3-copy-58kdd-2204826621) initialized Running" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:02Z" level=info msg="Created pod: argo-gpu-s3-copy-58kdd[0].check (argo-gpu-s3-copy-58kdd-1253578299)" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:02Z" level=info msg="Pod node argo-gpu-s3-copy-58kdd[0].check (argo-gpu-s3-copy-58kdd-1253578299) initialized Pending" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:02Z" level=info msg="Workflow step group node argo-gpu-s3-copy-58kdd[0] (argo-gpu-s3-copy-58kdd-2204826621) not yet completed" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:02Z" level=info msg="Workflow update successful" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:03Z" level=info msg="Processing workflow" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:03Z" level=info msg="Checking for deleted pods" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:04Z" level=info msg="Updating node argo-gpu-s3-copy-58kdd[0].check (argo-gpu-s3-copy-58kdd-1253578299) message: PodInitializing"
time="2019-02-02T19:25:04Z" level=info msg="Workflow step group node argo-gpu-s3-copy-58kdd[0] (argo-gpu-s3-copy-58kdd-2204826621) not yet completed" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:04Z" level=info msg="Workflow update successful" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:05Z" level=info msg="Processing workflow" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:05Z" level=info msg="Checking for deleted pods" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:05Z" level=info msg="Workflow step group node argo-gpu-s3-copy-58kdd[0] (argo-gpu-s3-copy-58kdd-2204826621) not yet completed" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:06Z" level=info msg="Processing workflow" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:06Z" level=info msg="Checking for deleted pods" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:06Z" level=info msg="Workflow step group node argo-gpu-s3-copy-58kdd[0] (argo-gpu-s3-copy-58kdd-2204826621) not yet completed" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:09Z" level=info msg="Processing workflow" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:09Z" level=info msg="Updating node argo-gpu-s3-copy-58kdd[0].check (argo-gpu-s3-copy-58kdd-1253578299) status Pending -> Running"
time="2019-02-02T19:25:09Z" level=info msg="Workflow step group node argo-gpu-s3-copy-58kdd[0] (argo-gpu-s3-copy-58kdd-2204826621) not yet completed" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:09Z" level=warning msg="Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io \"argo-gpu-s3-copy-58kdd\": the object has been modified; please apply your changes to the latest version and try again" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:09Z" level=info msg="Re-appying updates on latest version and retrying update" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:09Z" level=info msg="Update retry attempt 1 successful" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:09Z" level=info msg="Workflow update successful" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="Processing workflow" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="Updating node argo-gpu-s3-copy-58kdd[0].check (argo-gpu-s3-copy-58kdd-1253578299) status Running -> Error"
time="2019-02-02T19:25:10Z" level=info msg="Updating node argo-gpu-s3-copy-58kdd[0].check (argo-gpu-s3-copy-58kdd-1253578299) message: failed to save outputs: interface conversion: error is *exec.Error, not *exec.ExitError"
time="2019-02-02T19:25:10Z" level=info msg="Step group node argo-gpu-s3-copy-58kdd[0] (argo-gpu-s3-copy-58kdd-2204826621) deemed failed: child 'argo-gpu-s3-copy-58kdd-1253578299' failed" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="node argo-gpu-s3-copy-58kdd[0] (argo-gpu-s3-copy-58kdd-2204826621) phase Running -> Failed" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="node argo-gpu-s3-copy-58kdd[0] (argo-gpu-s3-copy-58kdd-2204826621) message: child 'argo-gpu-s3-copy-58kdd-1253578299' failed" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="node argo-gpu-s3-copy-58kdd[0] (argo-gpu-s3-copy-58kdd-2204826621) finished: 2019-02-02 19:25:10.370256123 +0000 UTC" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="step group argo-gpu-s3-copy-58kdd[0] (argo-gpu-s3-copy-58kdd-2204826621) was unsuccessful: child 'argo-gpu-s3-copy-58kdd-1253578299' failed" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="Outbound nodes of argo-gpu-s3-copy-58kdd-1253578299 is [argo-gpu-s3-copy-58kdd-1253578299]" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="Outbound nodes of argo-gpu-s3-copy-58kdd is [argo-gpu-s3-copy-58kdd-1253578299]" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="node argo-gpu-s3-copy-58kdd (argo-gpu-s3-copy-58kdd) phase Running -> Failed" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="node argo-gpu-s3-copy-58kdd (argo-gpu-s3-copy-58kdd) message: child 'argo-gpu-s3-copy-58kdd-1253578299' failed" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="node argo-gpu-s3-copy-58kdd (argo-gpu-s3-copy-58kdd) finished: 2019-02-02 19:25:10.3703671 +0000 UTC" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="Checking deamoned children of argo-gpu-s3-copy-58kdd" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="Updated phase Running -> Failed" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="Updated message -> child 'argo-gpu-s3-copy-58kdd-1253578299' failed" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="Marking workflow completed" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=warning msg="Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io \"argo-gpu-s3-copy-58kdd\": the object has been modified; please apply your changes to the latest version and try again" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="Re-appying updates on latest version and retrying update" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="Update retry attempt 1 successful" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:10Z" level=info msg="Workflow update successful" namespace=development workflow=argo-gpu-s3-copy-58kdd
time="2019-02-02T19:25:11Z" level=info msg="Labeled pod development/argo-gpu-s3-copy-58kdd-1253578299 completed"
I just built from master (plus an unrelated tweak) and can confirm this behavior, except that I get it on every run.
@wadeholler thanks. It fails about 80% of the time for me. I think we started seeing this after we upgraded the k8s cluster version.
Is there any solution for this?
@wadeholler there was a bug on the master branch, which I fixed in PR#1213. Try deleting argoexec:latest from your cluster and building it again using the new Dockerfile (if argoproj/argoexec:latest has not been updated, or if you made any modifications).
That fixed the problem stated above, but now submodule support is broken:
failed to load artifacts: fatal: No url found for submodule path 'obsfuscated' in .gitmodules
My previous reply was about repos that had a submodule reference but no .gitmodules file. The new argoexec update, which forces a submodule update, caused that issue; it is unrelated to the problem above. All is well now. Cheers.
I'm pretty sure I fixed the "error is *exec.Error, not *exec.ExitError" interface conversion panic as part of the PNS work. I will close this as fixed in v2.3, but please re-open if it is seen again.