Pipelines: Upgrade Argo to v2.11+

Created on 27 Sep 2020 · 33 Comments · Source: kubeflow/pipelines

area/backend, help wanted

All 33 comments

@Ark-kun This is blocking my current work. Though I can work around it somehow, I would like to get this addressed quickly. Do you have a timeline for the fix, or can I help with the upgrade?

Does the upgrade require upgrading the argo client? If not, I think you can try upgrading the argo installation in your own cluster and see if it fixes the problem.

Upgrading the client is a lot harder because there are some go module dependency issues to fix. There's an ongoing PR working on this, and you may help there: https://github.com/kubeflow/pipelines/pull/4498.

It seems like the KFP images only include additional licenses on top of the original Argo images.
Are there any other changes that I need to be aware of?

If there are only additional licenses, I can switch to the official argo images to see if it works.

Yes, it's just additional licenses. You can just switch to official argo images.
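
As a rough illustration of the image swap described above, the following Python sketch maps KFP's license-compliant Argo image references to upstream ones. The upstream repository prefix `argoproj` and the target tag `v2.11.1` are assumptions made for illustration (the tag matches the version tried later in this thread); check the Argo release notes for the actual image locations. The controller log further down still shows `executorImage` pointing at the `v2.7.5-license-compliance` image, so that config map entry likely needs the same change.

```python
import re

# Assumption: KFP's Argo images live under gcr.io/ml-pipeline/ and carry a
# "-license-compliance" suffix, as seen in this thread.
KFP_IMAGE = re.compile(
    r"^gcr\.io/ml-pipeline/(?P<name>argoexec|workflow-controller):"
    r"v[\d.]+-license-compliance$"
)
UPSTREAM_PREFIX = "argoproj"  # hypothetical upstream repository prefix


def to_upstream_image(image: str, version: str) -> str:
    """Return the upstream image for a KFP Argo image, or the input unchanged."""
    match = KFP_IMAGE.match(image)
    if match is None:
        return image  # not a KFP Argo image, leave it alone
    return f"{UPSTREAM_PREFIX}/{match.group('name')}:{version}"


print(to_upstream_image(
    "gcr.io/ml-pipeline/argoexec:v2.7.5-license-compliance", "v2.11.1"
))
```

Images that do not match the KFP pattern are returned untouched, so the helper can be applied to every image reference in a manifest.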

@Bobgy I tried to update the Argo image to v2.11.1, but now I get this error from the workflow-controller repeatedly, and new pipeline runs seem to get into an infinite loop with unknown status. Any ideas?

time="2020-10-02T16:27:22Z" level=info msg="config map" name=workflow-controller-configmap
time="2020-10-02T16:27:22Z" level=info msg="Configuration:\nartifactRepository:\n  archiveLogs: true\n  s3:\n    accessKeySecret:\n      key: accesskey\n      name: mlpipeline-minio-artifact\n    bucket: parala-kfp-artifacts\n    endpoint: minio-service.kubeflow:9000\n    insecure: true\n    keyPrefix: artifacts\n    secretKeySecret:\n      key: secretkey\n      name: mlpipeline-minio-artifact\nexecutorImage: gcr.io/ml-pipeline/argoexec:v2.7.5-license-compliance\nmetricsConfig: {}\nnamespace: kubeflow\nnodeEvents: {}\npodSpecLogStrategy: {}\nsso:\n  clientId:\n    key: \"\"\n  clientSecret:\n    key: \"\"\n  issuer: \"\"\n  redirectUrl: \"\"\ntelemetryConfig: {}\n"
time="2020-10-02T16:27:22Z" level=info msg="Persistence configuration disabled"
time="2020-10-02T16:27:22Z" level=info msg="Starting Workflow Controller" version=v2.11.1
time="2020-10-02T16:27:22Z" level=info msg="Workers: workflow: 32, pod: 32"
time="2020-10-02T16:27:22Z" level=info msg="Performing periodic GC every 5m0s"
time="2020-10-02T16:27:22Z" level=info msg="Persistence disabled - so archived workflow GC disabled - you must restart the controller if you enable this"
time="2020-10-02T16:27:22Z" level=info msg="Starting workflow TTL controller (resync 20m0s)"
time="2020-10-02T16:27:22Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
time="2020-10-02T16:27:22Z" level=info msg="Starting CronWorkflow controller"
E1002 16:27:22.063415       1 reflector.go:153] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Failed to list *unstructured.Unstructured: workflowtemplates.argoproj.io is forbidden: User "system:serviceaccount:kubeflow:argo" cannot list resource "workflowtemplates" in API group "argoproj.io" at the cluster scope
time="2020-10-02T16:27:22Z" level=info msg="Started workflow TTL worker"
E1002 16:27:23.068520       1 reflector.go:153] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Failed to list *unstructured.Unstructured: workflowtemplates.argoproj.io is forbidden: User "system:serviceaccount:kubeflow:argo" cannot list resource "workflowtemplates" in API group "argoproj.io" at the cluster scope
E1002 16:27:24.073525       1 reflector.go:153] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Failed to list *unstructured.Unstructured: workflowtemplates.argoproj.io is forbidden: User "system:serviceaccount:kubeflow:argo" cannot list resource "workflowtemplates" in API group "argoproj.io" at the cluster scope
E1002 16:27:25.078845       1 reflector.go:153] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Failed to list *unstructured.Unstructured: workflowtemplates.argoproj.io is forbidden: User "system:serviceaccount:kubeflow:argo" cannot list resource "workflowtemplates" in API group "argoproj.io" at the cluster scope

Solved by adding --namespaced to the workflow-controller args.
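
For context, the errors above show the controller trying to list `workflowtemplates` at the cluster scope, which the `argo` service account is not permitted to do; `--namespaced` runs the controller in namespaced mode instead. A minimal sketch of the change, written as a Python helper over a Deployment manifest loaded as a dict. The container name `workflow-controller`, the manifest layout, and the existing args shown are assumptions; adapt them to your deployment.

```python
from copy import deepcopy


def add_namespaced_flag(deployment: dict,
                        container_name: str = "workflow-controller") -> dict:
    """Return a copy of the Deployment with --namespaced in the container args."""
    patched = deepcopy(deployment)
    containers = patched["spec"]["template"]["spec"]["containers"]
    for container in containers:
        if container.get("name") != container_name:
            continue
        args = container.setdefault("args", [])
        if "--namespaced" not in args:  # keep the patch idempotent
            args.append("--namespaced")
    return patched


# Hypothetical, trimmed-down manifest for illustration only.
deployment = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "workflow-controller",
         "args": ["--configmap", "workflow-controller-configmap"]},
    ]}}}
}
patched = add_namespaced_flag(deployment)
print(patched["spec"]["template"]["spec"]["containers"][0]["args"])
```

The helper copies the manifest rather than mutating it, and applying it twice leaves the args unchanged.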

@Bobgy Some changes are needed to use the latest version of the Argo workflow-controller, and I can create a PR for that. Should that PR wait until the CLI client upgrade is merged?

@xinbinhuang You can check whether upgrading argo itself solves your problem. If it passes all of our e2e tests, we can get it merged before the CLI client.

If we would like to update the Argo version to 2.11.x, I can look into it. I guess it will be similar to the update to 2.7, @Bobgy?

/assign

@NikeNano thank you for offering help! That'll be great!

Even better would be to take the chance to document how to upgrade argo, so others can learn from you next time.

Sounds like a good idea, will include it!

FYI, when upgrading to 2.11.6, you should be aware that Google requires all images to contain necessary license information in the docker image.
That's why we built gcr.io/ml-pipeline/argoexec:v2.7.5-license-compliance from https://github.com/kubeflow/pipelines/tree/master/third_party/argo. We might be missing some documentation there, so feel free to ask me about that part when you start.

I think we can split into two PRs, one upgrading the image and one upgrading the go package.

EDIT: I built https://github.com/kubeflow/testing/tree/master/py/kubeflow/testing/go-license-tools to automatically collect go dependency licenses from GitHub.

Cool, I will look into it once I manage to fix the dependencies correctly.

Sorry, the issue drifted away from my focus previously. I managed to switch over to the official argo version 2.11.6 server side for my deployment, and everything has been running smoothly. It seems that server side is more straightforward.

@NikeNano Have you started on this? If not, I can create a PR tonight to summarize what I did, and you can look into it and add the extra dependencies and licenses.

I have done some initial work, but go ahead and make a PR with your solution @xinbinhuang, and I can help out :)

FYI, related work on argo to update the dependencies: https://github.com/argoproj/argo/pull/4426

@Bobgy @NikeNano @xinbinhuang any ETA on this? We need to move to a newer k8s client.

@animeshsingh is the latest argo enough for you, or do we need to wait for @NikeNano 's PR in argo?

/assign @capri-xiyue
Let's work on this then

This was just merged to argo: https://github.com/argoproj/argo/pull/4810#event-4186169881. I will start looking at it as well to see if we can push this forward.

This will be part of argo v3
https://github.com/argoproj/argo/pull/4810#issuecomment-757010305

@NikeNano Is there any ETA on this? And what are the remaining items of upgrading Argo to v2.11+? Do we have a task list for upgrading Argo to v2.11+?

If we want to go for version 3, we have to wait for its release before we can update, as far as I can see. That should hopefully be at the end of January: https://github.com/argoproj/argo/issues/4425#issuecomment-748644024. Waiting might not be necessary, but last time I looked at it there were some dependency issues that I couldn't solve without upgrading argo; maybe you could figure it out, @capri-xiyue? When I did the update to 2.7 there were a lot of collisions between dependencies. I suggest we wait for the release before we try to do the update.

I think it will be fine to wait until the end of Jan for argo version 3 if it makes updating the dependencies easier.
@Bobgy https://kubernetes.io/blog/2020/12/02/dont-panic-kubernetes-and-docker/ Docker support won't be removed until late 2021. I think it should be safe to update argo after version 3 gets released.
@NikeNano How do we usually update dependencies like argo in KFP? Are there any docs for updating dependencies like argo? For example, do we have a script for updating dependencies in KFP like https://github.com/knative/eventing/blob/master/hack/update-deps.sh? I'm just wondering why there would be dependency issues when upgrading argo. I thought go modules should be able to resolve the dependencies automatically.

Thanks, waiting for argo v3 makes sense to me if that's the end of Jan.

We do not update dependencies regularly, so by the time we do, they are already pretty old, and many things can break once some of them are updated.

It's worth discussing the upgrade strategy in a project health issue.

@capri-xiyue I am pretty new to go in general and have always found go dependency management a bit unclear, especially how it sometimes tries to resolve issues automatically; see https://github.com/golang/go/wiki/Modules#can-i-control-when-gomod-gets-updated-and-when-the-go-tools-use-the-network-to-satisfy-dependencies. I think it would be great if we documented this, which I remember you also asked for as part of the argo upgrade, @Bobgy. But let's open a separate issue and continue the discussion there.

@NikeNano @Bobgy I created an issue for discussing the upgrade strategy: https://github.com/kubeflow/pipelines/issues/4999

Asked about upstream updates in https://github.com/argoproj/argo/discussions/4953

EDIT: got a reply from an argo maintainer. The suggestion is to upgrade to v2.12 now. v3 will be backward compatible, but it will still take a while; the first RC hasn't been released yet (but will be soon).

I will give updating to 2.12 another try.

Let us know if you want help!

Resolved by #5232.

Thank you to everyone who helped with this issue!
@NikeNano @alexec @xinbinhuang