Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST
What happened: Argo attempted to store the docker logs to s3 after killing the istio sidecar. Because all pod traffic is iptables-redirected through the (now killed) istio proxy, it was impossible to connect to s3.
What you expected to happen: The "wait" container leaves the sidecars (or perhaps designated ones) alive until artifact (logs, etc.) storage is complete.
How to reproduce it (as minimally and precisely as possible):
1/ Configure a cluster with istio
2/ Configure a workflow with a template that has an istio sidecar. We did this manually rather than injecting since argo doesn't recognize the injected container as a sidecar that needs killing.
3/ Specify the s3 configuration and archiveLogs in the argo configuration (a minimal sketch follows these steps)
4/ Run the workflow
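For reference, a minimal sketch of the step 3 configuration (S3 artifact repository with archiveLogs enabled) in the workflow-controller ConfigMap; the bucket, endpoint, and secret names are placeholders:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  config: |
    artifactRepository:
      archiveLogs: true
      s3:
        bucket: my-artifact-bucket        # placeholder
        endpoint: s3.amazonaws.com
        accessKeySecret:
          name: my-s3-credentials         # placeholder secret
          key: accessKey
        secretKeySecret:
          name: my-s3-credentials         # placeholder secret
          key: secretKey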
@dicarlo2 Were you ever able to figure out a way to get an istio-injected workflow working with artifact support?
Maybe configuring istio to avoid intercepting s3 traffic with traffic.sidecar.istio.io/excludeOutboundIPRanges?
As you suggested, delaying the killing of sidecars in waitContainer() works. We also had to work around https://github.com/istio/istio/issues/9454, but that didn't require changes to argo or istio code.
Unsure if this kind of change is something argo maintainers want as a PR, or if there are downsides to doing this.
func waitContainer() error {
	wfExecutor := initExecutor()
	defer wfExecutor.HandleError()
	defer stats.LogStats()
	stats.StartStatsTicker(5 * time.Minute)

	// Wait for main container to complete
	err := wfExecutor.Wait()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars & save outputs
	}
	// Saving logs
	logArt, err := wfExecutor.SaveLogs()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars
	}
	// Saving output parameters
	err = wfExecutor.SaveParameters()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars
	}
	// Saving output artifacts
	err = wfExecutor.SaveArtifacts()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars
	}
	// Killing sidecar containers
	err = wfExecutor.KillSidecars()
	if err != nil {
		wfExecutor.AddError(err)
		return err
	}
	// Capture output script result
	err = wfExecutor.CaptureScriptResult()
	if err != nil {
		wfExecutor.AddError(err)
		return err
	}
	err = wfExecutor.AnnotateOutputs(logArt)
	if err != nil {
		wfExecutor.AddError(err)
		return err
	}
	return nil
}
Please go ahead and create a PR with the changes. We will review and merge. Thanks for contributing!
@sarabala1979 Is there any way to get an Argo Workflow to recognize an automatically injected istio sidecar?
@nfickas is there an issue you are running into with istio sidecars, or what do you mean by "recognize"? We have been using 2.4.2 with istio sidecars automatically injected without any issues.
For us this issue was resolved here: https://github.com/argoproj/argo/pull/1645
@ddseapy our workflows start up with the automatic sidecars but then the sidecars are stuck in a running state after the workflow completes.
I am seeing some odd behavior on the wait container; it looks like it errored out: Failed to establish pod watch: Get https://10.100.200.1:443/api/v1/namespaces/default/pods?fieldSelector=metadata.name%3Dhello-world-82bm7&watch=true: dial tcp 10.100.200.1:443: connect: connection refused
That's the istio issue i linked above. The underlying issue is: https://github.com/istio/istio/issues/11130
Really it's because the k8s sidecar support enhancement (https://github.com/kubernetes/enhancements/issues/753) hasn't gone through, so there is no container startup ordering, which causes a race condition.
You can exclude the k8s api server from the IP range that envoy intercepts, or do things like change the docker command. It could also be fixed in argo by retrying the request on failure.
I would rather not make any changes to the argo code. Of the other two options, which one would be easier, and do you have any examples of either being done?
Just curious, did you run into this issue when getting argo working with injected sidecars?
The easiest thing to do is to just annotate your Argo workflows so that the auto-injector doesn't add a sidecar to them. Better support is coming in k8s (maybe in 1.18) but it's just not there yet.
Jobs in k8s are the same way... there isn't really anything special about Argo there.
Edit: This would be spec.templates[].metadata.annotations: sidecar.istio.io/inject: "false"
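Applied to a workflow, that annotation would look something like this sketch (the template name and image are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: no-istio-inject-
spec:
  entrypoint: main
  templates:
  - name: main
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"   # skip istio auto-injection for this pod
    container:
      image: alpine:3.12
      command: [sh, -c, "echo hello"]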
If you don't need the sidecar, that is indeed easiest. We need to control traffic for our applications, so that is not an option for us. For the workflows we run, it was sufficient to exclude the k8s master IP range via traffic.sidecar.istio.io/excludeOutboundIPRanges (sketch below).
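A sketch of what that looks like on a workflow template; the CIDR is a placeholder for your cluster's API server / master IP range:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: istio-exclude-master-
spec:
  entrypoint: main
  templates:
  - name: main
    metadata:
      annotations:
        # placeholder CIDR: substitute your cluster's k8s master / API server range
        traffic.sidecar.istio.io/excludeOutboundIPRanges: "10.100.200.0/24"
    container:
      image: alpine:3.12
      command: [sh, -c, "echo hello"]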
We need sidecars as well, I will try that out. Thanks!
That's a good point, too. But just to be clear, I believe that is going to make the pod stay up when the workflow is finished, correct? You may need a separate system to clean those pods up. I'm going off memory and could be mistaken.
To state the obvious, requests to k8s api server will then bypass istio. This was not a security concern for our use-cases, but it could be for someone else.
@jaygorrell that's why I put up the PR. We have been using 2.4.2 for quite a while now without issues, and the pod goes to the Completed state.
For visibility - Prasanna Kumar Krishnamurthy in argo workflow slack suggested patching the pod spec (https://github.com/argoproj/argo/blob/master/examples/pod-spec-patch.yaml) with something like
"containers":[{"name":"wait", "command": ["/bin/bash", "-c"], "args": ["sleep 20; argoexec wait"]}]
@ddseapy So the above suggestion will just make it clean up the istio-proxy container after 20 seconds. I think the real issue is that the wait and main containers are launching before the proxy starts up.
My understanding is the suggestion should delay the wait container's startup by editing its command. I would think that it would delay cleanup (as you said), but also prevent the startup race condition.
If the main container has the same race condition (depends on your main container), you would need to wait there too.
I haven't tried the approach myself yet though, so I may be misunderstanding things.
Hm, yeah, I am not seeing that happen; I am still seeing the main and wait containers start up before istio-proxy. For reference, this is what I did: podSpecPatch: '{"containers": [{"name":"wait", "command": ["/bin/bash", "-c"], "args": ["sleep 10; argoexec wait"]}, {"name":"main", "command": ["/bin/bash", "-c"], "args": ["sleep 10; argoexec main"]}]}'
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'd still want this open. This is valuable for kubeflow pipelines multi-tenant support, because the api server needs to be protected by istio. Having istio sidecars in workflows allows in-cluster api calls from workflows.
The startup problem should be resolved by a workaround added in Istio 1.7.0:
https://istio.io/latest/news/releases/1.7.x/announcing-1.7/#production-operability-improvements
https://github.com/istio/istio/pull/24737
If you are using the IstioOperator, the global setting lives under values.global.proxy:
spec:
  values:
    global:
      proxy:
        holdApplicationUntilProxyStarts: true
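If you would rather not change the mesh-wide default, Istio 1.7+ also accepts the same setting as a per-pod annotation, so it can be applied only to workflow pods; a sketch (template name and image are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hold-proxy-
spec:
  entrypoint: main
  templates:
  - name: main
    metadata:
      annotations:
        # ask istio to start the app containers only after the proxy is ready
        proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
    container:
      image: alpine:3.12
      command: [sh, -c, "echo hello"]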
Anyone found a workaround for Linkerd2?