Pipelines: UI Inaccessible via Proxy URL

Created on 21 Feb 2020 · 20 Comments · Source: kubeflow/pipelines

What steps did you take:

Attempted to access the KFP UI via the proxy URL given by: kubectl describe configmap inverse-proxy-config -n kubeflow | grep googleusercontent.com

What happened:

Received a Google 504 error in the browser
The proxy-agent Deployment is constantly outputting the following log:

Failed to read pending requests: "Failed to list pending requests: 401, \"\\n<!DOCTYPE html>\...(html for the 504 page)

What did you expect to happen:

Be able to access the UI via the proxy URL

Environment:

How did you deploy Kubeflow Pipelines (KFP)?

Standalone Deployment to GKE

KFP version: 0.2.4 (also occurred with 0.2.2)

KFP SDK version: 0.2.4

Anything else you would like to add:

I'm able to access the UI via the port forwarding method and able to perform all normal actions
Sort of related, but is there any documentation on getting IAP set up with KFP standalone? I think if I wanted to do it now, I would have to write my own Kustomization that does this?

/kind bug
/area deployment/standalone

area/deployment/standalone kind/bug needs investigation priority/p1 status/triaged

parthmishra

Most helpful comment

I deleted the data in the inverse-proxy-config ConfigMap and restarted the proxy-agent pod to see if I could get the proxy agent to register as a new backend rather than reuse the existing one, and it worked, except now I'm at a different proxy URL (since a new backend ID was generated). Perhaps this is an issue with authenticating an existing backend? If so, it might be an upstream issue with the inverting-proxy image.

I'm probably not going to troubleshoot too much further since I'd rather spend time figuring out how to get IAP up and running.

parthmishra on 25 Feb 2020

👍2

All 20 comments

We don't provide instructions for IAP, but if you know how to set it up, that's definitely an option. You can either fork the kustomize manifests or write your own overlay that overrides the existing one: https://www.kubeflow.org/docs/pipelines/installation/standalone-deployment/#best-practices-maintaining-custom-manifests.

A quick try: can you try deleting the proxy-agent pod to restart it? I've seen a transient issue that can be solved by restart, but I'm not sure if that works for you.

Bobgy on 25 Feb 2020

I've killed the proxy-agent pod a few times and the issue is still occurring

parthmishra on 25 Feb 2020

I'm glad you found a solution. Was there anything special about your deployment? Did you deploy to an empty cluster, or did you upgrade ...?

Bobgy on 27 Feb 2020

My initial installation was on an empty cluster with version 0.2.0 and then I've been upgrading the installation using Kustomize overlays as documented. The only other customizations I have been using are the GCP Cloud SQL and GCS patches.

Something interesting that I didn't point out earlier is that I have two KFP clusters in separate environments (dev and prod GCP projects) with the exact same configuration. They both started experiencing this issue at the exact same time.

parthmishra on 27 Feb 2020

Thanks!
I think I read about an outage about inverse proxy at a certain time, but I don't know exact details now. That looks related. I will get back to you with some investigation.

Bobgy on 27 Feb 2020

I don't think the issue I saw was related, it was at a different date.

@IronPan Do you have any idea about this issue?

Bobgy on 28 Feb 2020

I can confirm this issue, it happened to me without warning after a few weeks of normal operations.

I'm running version 0.2.5 on a custom cluster. I attempted similar resolution steps to @parthmishra :
1- Deleting the pod didn't resolve the issue
2- Deleting the config data resolved it, but it meant that the endpoint changed, which means I have to update all my application deployments

omarzouk on 21 Apr 2020
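
The two resolution steps above can be sketched as a small script. This is only an illustration under stated assumptions: the ConfigMap name inverse-proxy-config and the namespace kubeflow come from the issue description, while the Deployment name proxy-agent, the JSON-patch approach to clearing the data, and the helper names are assumptions. As noted in the thread, clearing the data makes the agent register as a new backend, so the proxy URL changes.

```python
import shlex
import subprocess
from typing import List

NAMESPACE = "kubeflow"              # from the issue description
CONFIGMAP = "inverse-proxy-config"  # from the issue description
DEPLOYMENT = "proxy-agent"          # assumption: name of the proxy-agent Deployment


def build_reset_commands(namespace: str = NAMESPACE) -> List[List[str]]:
    """Return kubectl commands that clear the ConfigMap data and restart the agent."""
    clear_data = [
        "kubectl", "patch", "configmap", CONFIGMAP, "-n", namespace,
        "--type=json", "-p", '[{"op": "remove", "path": "/data"}]',
    ]
    restart_agent = [
        "kubectl", "rollout", "restart", f"deployment/{DEPLOYMENT}", "-n", namespace,
    ]
    return [clear_data, restart_agent]


def reset_proxy_registration(dry_run: bool = True) -> List[List[str]]:
    """Print each command; run it only when dry_run is False."""
    commands = build_reset_commands()
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run by default so nothing in the cluster is modified by accident.
    reset_proxy_registration(dry_run=True)
```

The dry-run default is deliberate: the commands can be reviewed before anything is changed, since the reset invalidates the current proxy URL.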

I will try to submit a fix within 3 days.

rmgogogo on 24 Apr 2020

"2- Deleting the config data resolved it, but it meant that the endpoint changed, which means I have to update all my application deployments"

Any more info on "update all application deployments"?
Is it possible for you to fetch the URL dynamically from the ConfigMap? That URL is designed to be dynamic.

I guess the issue you hit is that the "proxy-agent" pod was moved to another VM while the system uses the VM_ID for the proxy. With my fix, the URL is refreshed if the "proxy-agent" pod is moved to another VM.

rmgogogo on 27 Apr 2020

@rmgogogo Yes, you are right. I actually did that, so my deployment code now fetches the URL from the configmap.

However, I think that it is not nice that the deployment code needs to be aware of the underlying configmaps of the Kubeflow deployment. It feels a bit too hacky. What if you guys change how the proxy stores its configmap in a newer version? This means upgrading would be a big pain.

Btw, I noticed that there is some kind of state associated with the URL. After deleting the configmap data and getting a new URL, I tried putting back the old URL into the configmap manually and restarting the pod. The new pod fell right into the infinite error loop behavior. So whenever the "bad" URL is present in the configmap, the pod falls to error loop.

Thanks for your effort in creating a fix!

omarzouk on 27 Apr 2020
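
Fetching the URL from the ConfigMap, as discussed above, might look roughly like the following sketch. It assumes the JSON comes from `kubectl get configmap inverse-proxy-config -n kubeflow -o json` and, like the grep in the issue description, it simply looks for a googleusercontent.com hostname in any value, so it does not depend on the ConfigMap's key names. The function name and the sample data are hypothetical. If more than one value contained such a hostname, a real script would need to pick the right key.

```python
import json
import re
from typing import Optional

# Matches a bare hostname ending in googleusercontent.com, mirroring the issue's grep.
_HOST_RE = re.compile(r"[A-Za-z0-9.-]+\.googleusercontent\.com")


def find_proxy_hostname(configmap_json: str) -> Optional[str]:
    """Return the first googleusercontent.com hostname found in the ConfigMap data."""
    data = json.loads(configmap_json).get("data") or {}
    for value in data.values():
        match = _HOST_RE.search(value)
        if match:
            return match.group(0)
    return None


if __name__ == "__main__":
    # Hypothetical ConfigMap contents; the key names and values are made up.
    sample = json.dumps({
        "kind": "ConfigMap",
        "data": {
            "BackendId": "example-backend",
            "Hostname": "example-dot-us-central1.pipelines.googleusercontent.com",
        },
    })
    print(find_proxy_hostname(sample))
```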

The fix #3663 introduced the case where "the KFP hostname can change after the proxy-agent moves to a new VM".

The new hostname then has to be used in kfp.Client(host=newHostName). Fetching a new hostname each time would make user code more complex.

A root fix is to change proxy-agent/proxy-server so that one K8s cluster can provide only one hostname, even if the proxy-agent moves to a new VM (and possibly a new VM in a new region in a multi-region/multi-master cluster).

rmgogogo on 26 May 2020
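
To illustrate the extra complexity a changing hostname pushes onto user code, here is a minimal sketch of a wrapper that rebuilds the client whenever the resolved hostname differs from the last one. The class RefreshingClient and its parameters are hypothetical and not part of the KFP SDK; the client factory is injected so the sketch runs without kfp installed. With the SDK, the factory could be something like `lambda host: kfp.Client(host=host)`.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class RefreshingClient(Generic[T]):
    """Hypothetical helper: rebuild a client when the resolved hostname changes."""

    def __init__(self, resolve_host: Callable[[], str], make_client: Callable[[str], T]):
        self._resolve_host = resolve_host  # e.g. reads the hostname from the ConfigMap
        self._make_client = make_client    # e.g. lambda host: kfp.Client(host=host)
        self._host: Optional[str] = None
        self._client: Optional[T] = None

    def get(self) -> T:
        """Return a client for the current hostname, reusing it while the host is unchanged."""
        host = self._resolve_host()
        if self._client is None or host != self._host:
            self._client = self._make_client(host)
            self._host = host
        return self._client


if __name__ == "__main__":
    # Simulated hostname lookups: the host changes on the third call.
    hosts = iter(["old.example", "old.example", "new.example"])
    wrapper = RefreshingClient(lambda: next(hosts), lambda host: {"host": host})
    for _ in range(3):
        print(wrapper.get())
```

Every caller would have to go through `get()` instead of holding a client directly, which is the kind of burden the thread argues against.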

/reopen
since #3845 reverted the fix here

Bobgy on 27 May 2020

@Bobgy: Reopened this issue.

k8s-ci-robot on 27 May 2020

Another google3 fix should be applied to the proxy-server.
PR #3845 is being submitted.

rmgogogo on 27 May 2020

Background info for external readers:
The proxy-agent and proxy-server work together as an InverseProxy service so that the KFP API/UI can be exposed via a hostname like xxx.googleusercontent.com.

The previous fix, #3663, addressed this issue by allowing a new hostname to be used whenever the proxy-agent moved to a new VM (GKE node). That solves the issue, but the downside is that the hostname changes.

Changing the hostname may force users to change code such as "kfp.Client(host=xxx)", which is not user-friendly because we don't know when GKE will move the proxy-agent to a new node. So we decided to revert the previous fix (#3663). The revert PR is #3845.

With the revert, the hostname stays fixed after an installation in a GKE cluster. However, we still have the issue that if the proxy-agent moves to a new node, the proxy may stop working (the original problem of this ticket). It cannot always be reproduced. If you hit the issue, we would appreciate it if you could paste some proxy-agent logs here.

A possible fix will be on the proxy-server side (not on GitHub, b/157252786).

FYI: we are treating this issue with high priority.

rmgogogo on 27 May 2020

Detailed logs are welcome. It's possible that, after the move to a new VM and session expiry, the current codebase does not handle the session/token refresh well (just a guess for now).

rmgogogo on 27 May 2020

The client error message is from here,

The proxy-agent cannot get pending requests from the proxy-server; it gets a 401 instead.

rmgogogo on 27 May 2020
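
For anyone collecting proxy-agent logs, a rough way to check for this failure loop is sketched below. The pattern is based only on the single log line quoted in the issue description, so other versions may log differently; the function names and the threshold are hypothetical. The log lines could be gathered with something like `kubectl logs deployment/proxy-agent -n kubeflow`, assuming the Deployment is named proxy-agent.

```python
import re
from typing import Iterable, List

# Based on the one log line quoted in the issue; other versions may differ.
_PENDING_RE = re.compile(r"Failed to list pending requests: (\d{3})")


def pending_request_failures(lines: Iterable[str]) -> List[int]:
    """Return the HTTP status code from every 'Failed to list pending requests' line."""
    codes = []
    for line in lines:
        match = _PENDING_RE.search(line)
        if match:
            codes.append(int(match.group(1)))
    return codes


def looks_like_auth_loop(lines: Iterable[str], threshold: int = 3) -> bool:
    """Heuristic: at least `threshold` failures were logged and all of them are 401."""
    codes = pending_request_failures(lines)
    return len(codes) >= threshold and all(code == 401 for code in codes)


if __name__ == "__main__":
    # The beginning of the log line quoted in the issue, repeated as the agent would log it.
    line = r'Failed to read pending requests: "Failed to list pending requests: 401, \"\\n<!DOCTYPE html>'
    print(looks_like_auth_loop([line] * 5))
```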

@parthmishra I found a fix inside the Google internal repo for the multi-zone/multi-master case.
Quick question: does your case only happen in a multi-master cluster, or is your cluster single-master?

The fix I found is only for the multi-master case. I'm not sure whether it solves your problem.

rmgogogo on 27 May 2020

The fix in the Google internal repo (for proxy-server) has been submitted. It fixes the multi-master case, which shows the same error log.

The ETA for reaching production is ~1 week. Closing this ticket.

#3845 will be in the next release (Yuan is on call for the release; it may be named 1.0.0-rc.0 and later 1.0.0.0).

rmgogogo on 28 May 2020
