Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST
What happened: Argo attempted to store the docker logs to s3 after killing the istio sidecar. Because all pod traffic is iptables-redirected through the (now killed) istio proxy, it was impossible to connect to s3.
What you expected to happen: The "wait" container leaves the sidecars (or perhaps designated ones) alive until artifact (logs, etc.) storage is complete.
How to reproduce it (as minimally and precisely as possible):
1/ Configure a cluster with istio
2/ Configure a workflow with a template that has an istio sidecar. We did this manually rather than injecting since argo doesn't recognize the injected container as a sidecar that needs killing.
3/ Specify the s3 configuration and archiveLogs in the argo configuration (a minimal sketch follows these steps)
4/ Run the workflow
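For reference, a minimal sketch of the step 3 configuration (S3 artifact repository with archiveLogs enabled) in the workflow-controller ConfigMap; the bucket, endpoint, and secret names are placeholders:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  config: |
    artifactRepository:
      archiveLogs: true
      s3:
        bucket: my-artifact-bucket        # placeholder
        endpoint: s3.amazonaws.com
        accessKeySecret:
          name: my-s3-credentials         # placeholder secret
          key: accessKey
        secretKeySecret:
          name: my-s3-credentials         # placeholder secret
          key: secretKey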
@dicarlo2 Were you ever able to figure out a way to get an istio-injected workflow working with artifact support?
Maybe configuring istio to avoid intercepting s3 traffic with traffic.sidecar.istio.io/excludeOutboundIPRanges?
As you suggested, delaying the killing of sidecars in waitContainer() works. We also had to work around https://github.com/istio/istio/issues/9454, but that didn't require changes to argo or istio code.
Unsure if this kind of change is something argo maintainers want as a PR, or if there are downsides to doing this.
func waitContainer() error {
	wfExecutor := initExecutor()
	defer wfExecutor.HandleError()
	defer stats.LogStats()
	stats.StartStatsTicker(5 * time.Minute)

	// Wait for main container to complete
	err := wfExecutor.Wait()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars & save outputs
	}
	// Saving logs
	logArt, err := wfExecutor.SaveLogs()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars
	}
	// Saving output parameters
	err = wfExecutor.SaveParameters()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars
	}
	// Saving output artifacts
	err = wfExecutor.SaveArtifacts()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars
	}
	// Killing sidecar containers
	err = wfExecutor.KillSidecars()
	if err != nil {
		wfExecutor.AddError(err)
		return err
	}
	// Capture output script result
	err = wfExecutor.CaptureScriptResult()
	if err != nil {
		wfExecutor.AddError(err)
		return err
	}
	err = wfExecutor.AnnotateOutputs(logArt)
	if err != nil {
		wfExecutor.AddError(err)
		return err
	}
	return nil
}
Please go ahead and create a PR with the changes. We will review and merge. Thanks for contributing!
@sarabala1979 Is there any way to get an Argo Workflow to recognize an automatically injected istio sidecar?
@nfickas is there an issue you are running into with istio sidecars, or what do you mean by "recognize"? We have been using 2.4.2 with istio sidecars automatically injected without any issues.
For us this issue was resolved here: https://github.com/argoproj/argo/pull/1645
@ddseapy our workflows start up with the automatic sidecars but then the sidecars are stuck in a running state after the workflow completes.
I am seeing some odd behavior on the wait container; it looks like it errored out: Failed to establish pod watch: Get https://10.100.200.1:443/api/v1/namespaces/default/pods?fieldSelector=metadata.name%3Dhello-world-82bm7&watch=true: dial tcp 10.100.200.1:443: connect: connection refused
That's the istio issue i linked above. The underlying issue is: https://github.com/istio/istio/issues/11130
Really it's because the k8s sidecar support enhancement (https://github.com/kubernetes/enhancements/issues/753) hasn't gone through, so there is no container startup ordering, which causes a race condition.
You can exclude the k8s api server from the IP range that envoy intercepts, or do things like change the docker command. It could also be fixed in argo by retrying the request on failure.
I would rather not make any changes to the argo code. Of the other two options, which one would be easier, and do you have any examples of either being done?
Just curious, did you run into this issue when getting argo working with injected sidecars?
The easiest thing to do is to just annotate your Argo workflows so that the auto-injector doesn't add a sidecar to them. Better support is coming in k8s (maybe in 1.18) but it's just not there yet.
Jobs in k8s are the same way... there isn't really anything special about Argo there.
Edit: This would be spec.templates[].metadata.annotations: sidecar.istio.io/inject: "false"
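Applied to a workflow, that annotation would look something like this sketch (the template name and image are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: no-istio-inject-
spec:
  entrypoint: main
  templates:
  - name: main
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"   # skip istio auto-injection for this pod
    container:
      image: alpine:3.12
      command: [sh, -c, "echo hello"]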
If you don't need the sidecar, that is indeed easiest. We need to control traffic for our applications, so that is not an option for us. For the workflows we run, it was sufficient to exclude the k8s master IP range via traffic.sidecar.istio.io/excludeOutboundIPRanges (sketch below).
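A sketch of what that looks like on a workflow template; the CIDR is a placeholder for your cluster's API server / master IP range:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: istio-exclude-master-
spec:
  entrypoint: main
  templates:
  - name: main
    metadata:
      annotations:
        # placeholder CIDR: substitute your cluster's k8s master / API server range
        traffic.sidecar.istio.io/excludeOutboundIPRanges: "10.100.200.0/24"
    container:
      image: alpine:3.12
      command: [sh, -c, "echo hello"]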
We need sidecars as well, I will try that out. Thanks!
That's a good point, too. But just to be clear, I believe that is going to make the pod stay up when the workflow is finished, correct? You may need a separate system to clean those pods up. I'm going off memory and could be mistaken.
To state the obvious, requests to k8s api server will then bypass istio. This was not a security concern for our use-cases, but it could be for someone else.
@jaygorrell that's why I put up the PR. We have been using 2.4.2 for quite a while now without issues, and the pod goes to the Completed state.
For visibility - Prasanna Kumar Krishnamurthy in argo workflow slack suggested patching the pod spec (https://github.com/argoproj/argo/blob/master/examples/pod-spec-patch.yaml) with something like
"containers":[{"name":"wait", "command": ["/bin/bash", "-c"], "args": ["sleep 20; argoexec wait"]}]
@ddseapy So the above suggestion will just make it clean up the istio-proxy container after 20 seconds. I think the real issue is that the wait and main containers are launching before the proxy starts up.
My understanding is the suggestion should delay the wait container's startup by editing its command. I would think that it would delay cleanup (as you said), but also prevent the startup race condition.
If the main container has the same race condition (depends on your main container), you would need to wait there too.
I haven't tried the approach myself yet though, so I may be misunderstanding things.
Hm, yeah, I am not seeing that happen; I am still seeing the main and wait containers start up before istio-proxy. For reference, this is what I did: podSpecPatch: '{"containers": [{"name":"wait", "command": ["/bin/bash", "-c"], "args": ["sleep 10; argoexec wait"]}, {"name":"main", "command": ["/bin/bash", "-c"], "args": ["sleep 10; argoexec main"]}]}'
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'd still want this open. This is valuable for kubeflow pipelines multi-tenant support, because the api server needs to be protected by istio. Having istio sidecars in workflows allows in-cluster api calls from workflows.
The startup problem should be resolved by a workaround added in Istio 1.7.0:
https://istio.io/latest/news/releases/1.7.x/announcing-1.7/#production-operability-improvements
https://github.com/istio/istio/pull/24737
If you are using the IstioOperator, the global setting lives under values.global.proxy:
spec:
  values:
    global:
      proxy:
        holdApplicationUntilProxyStarts: true
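If you would rather not change the mesh-wide default, Istio 1.7+ also accepts the same setting as a per-pod annotation, so it can be applied only to workflow pods; a sketch (template name and image are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hold-proxy-
spec:
  entrypoint: main
  templates:
  - name: main
    metadata:
      annotations:
        # ask istio to start the app containers only after the proxy is ready
        proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
    container:
      image: alpine:3.12
      command: [sh, -c, "echo hello"]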
Anyone found a workaround for Linkerd2?