Pipelines: UI Inaccessible via Proxy URL

Created on 21 Feb 2020 · 20 Comments · Source: kubeflow/pipelines

What steps did you take:

  • attempted to access the KFP UI via the proxy URL as given by: kubectl describe configmap inverse-proxy-config -n kubeflow | grep googleusercontent.com

What happened:

  • Received a Google 504 error in the browser
  • The proxy-agent Deployment is constantly outputting the following log:
Failed to read pending requests: "Failed to list pending requests: 401, \"\\n<!DOCTYPE html>\...(html for the 504 page)

What did you expect to happen:

  • Be able to access the UI via the proxy URL

Environment:

How did you deploy Kubeflow Pipelines (KFP)?

  • Standalone Deployment to GKE

KFP version: 0.2.4 (also occurred with 0.2.2)

KFP SDK version: 0.2.4

Anything else you would like to add:

  • I'm able to access the UI via the port forwarding method and able to perform all normal actions
  • Sort of related: is there any documentation on getting IAP set up with KFP standalone? I think if I wanted to do it now, I would have to write my own Kustomization that does this?

/kind bug
/area deployment/standalone

area/deployment/standalone kind/bug needs investigation priority/p1 status/triaged

All 20 comments

We don't provide instructions for IAP, but if you know how to set it up, that's definitely an option. You can either fork the kustomize manifests or write your own overlay that overrides the existing one: https://www.kubeflow.org/docs/pipelines/installation/standalone-deployment/#best-practices-maintaining-custom-manifests.

A quick try: can you try deleting the proxy-agent pod to restart it? I've seen a transient issue that can be solved by restart, but I'm not sure if that works for you.

I've killed the proxy-agent pod a few times and the issue is still occurring

I deleted the data in the inverse-proxy-config ConfigMap and restarted the proxy-agent pod to see if I could get the proxy agent to register as a new backend rather than reuse the existing one, and it worked, except now I'm at a different proxy URL (since a new backend ID was generated). Perhaps this is an issue with authenticating an existing backend? If so, it might be an upstream issue with the inverting-proxy image.

I'm probably not going to troubleshoot too much further since I'd rather spend time figuring out how to get IAP up and running.
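
For anyone wanting to repeat the workaround above, here is a minimal sketch of it in Python. It assumes the names that appear in this issue (the kubeflow namespace, the inverse-proxy-config ConfigMap, and a Deployment called proxy-agent). The comment does not say how the data was deleted, so clearing it with a JSON merge patch is an assumption, and the helper names are hypothetical. Note that, as reported, the proxy URL changes afterwards.

```python
import subprocess
from typing import List

NAMESPACE = "kubeflow"              # namespace used in the issue's kubectl command
CONFIGMAP = "inverse-proxy-config"  # ConfigMap named in the issue
DEPLOYMENT = "proxy-agent"          # Deployment named in the issue


def reset_commands(namespace: str = NAMESPACE) -> List[List[str]]:
    """Build the kubectl commands for the workaround without running them."""
    return [
        # Assumption: a JSON merge patch with "data": null removes the data field.
        ["kubectl", "patch", "configmap", CONFIGMAP, "-n", namespace,
         "--type", "merge", "-p", '{"data": null}'],
        # Restart the agent so it can register as a new backend.
        ["kubectl", "rollout", "restart", f"deployment/{DEPLOYMENT}",
         "-n", namespace],
    ]


def reset_proxy_registration(namespace: str = NAMESPACE) -> None:
    """Run the commands in order; raises on the first kubectl failure."""
    for cmd in reset_commands(namespace):
        subprocess.run(cmd, check=True)
```

Calling reset_commands() first and printing the result is a cheap way to review what would run before touching the cluster.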

I'm glad you found a solution. Was there anything special about your deployment? Did you do it in an empty cluster, or did you upgrade ...?

My initial installation was on an empty cluster with version 0.2.0 and then I've been upgrading the installation using Kustomize overlays as documented. The only other customizations I have been using are the GCP Cloud SQL and GCS patches.

Something interesting that I didn't point out earlier is that I have two KFP clusters in separate environments (dev and prod GCP projects) with the exact same configuration. They both started experiencing this issue at the exact same time.

Thanks!
I think I read about an inverse proxy outage at some point, but I don't know the exact details right now. That looks related. I will investigate and get back to you.

I don't think the issue I saw was related, it was at a different date.

@IronPan Do you have any idea about this issue?

I can confirm this issue, it happened to me without warning after a few weeks of normal operations.

I'm running version 0.2.5 on a custom cluster. I attempted similar resolution steps to @parthmishra :
1- Deleting the pod didn't resolve the issue
2- Deleting the config data resolved it, but it meant that the endpoint changed, which means I have to update all my application deployments

I will try to submit a fix within 3 days.

"2- Deleting the config data resolved it, but it meant that the endpoint changed, which means I have to update all my application deployments"

Could you share more about "update all my application deployments"?
Could you fetch the URL dynamically from the configmap? That URL is designed to be dynamic.

I guess the issue you hit is that the "proxy-agent" pod was moved to another VM while the system uses the VM_ID for the proxy. With my fix, the URL is refreshed if the "proxy-agent" pod is moved to another VM.

@rmgogogo Yes, you are right. I actually did that, so my deployment code now fetches the URL from the configmap.
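
A rough sketch of what fetching the URL from the configmap could look like in Python. To avoid depending on how the proxy stores its data, it makes no assumption about key names and simply looks for a value containing googleusercontent.com, the same substring the original report greps for. The function names are hypothetical, and if several values match, only the first is returned.

```python
import json
import subprocess
from typing import Any, Dict, Optional

MARKER = "googleusercontent.com"  # same substring the issue greps for


def find_proxy_hostname(configmap: Dict[str, Any]) -> Optional[str]:
    """Return the first data value containing MARKER, or None if absent."""
    data = configmap.get("data") or {}
    for value in data.values():
        if isinstance(value, str) and MARKER in value:
            return value.strip()
    return None


def read_configmap(name: str = "inverse-proxy-config",
                   namespace: str = "kubeflow") -> Dict[str, Any]:
    """Fetch the ConfigMap as JSON via kubectl (needs cluster access)."""
    out = subprocess.run(
        ["kubectl", "get", "configmap", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Typical use would be find_proxy_hostname(read_configmap()), treating a None result as "the agent has not registered yet".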

However, I think that it is not nice that the deployment code needs to be aware of the underlying configmaps of the Kubeflow deployment. It feels a bit too hacky. What if you guys change how the proxy stores its configmap in a newer version? This means upgrading would be a big pain.

Btw, I noticed that there is some kind of state associated with the URL. After deleting the configmap data and getting a new URL, I tried putting back the old URL into the configmap manually and restarting the pod. The new pod fell right into the infinite error loop behavior. So whenever the "bad" URL is present in the configmap, the pod falls to error loop.

Thanks for your effort in creating a fix!

The fix #3663 introduced a new case: the KFP hostname can change after the proxy-agent is moved to a new VM.

The new hostname then has to be passed to kfp.Client(host=newHostName), which makes user code more complex because it has to look up the hostname each time.
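
To illustrate the extra complexity this pushes onto user code, here is a hedged sketch of a small wrapper that looks up the hostname on every call and rebuilds the client only when it has changed. RefreshingClient and lookup_host are hypothetical names; the only SDK call used is kfp.Client(host=...), as mentioned above, and it is imported lazily so the wrapper can be exercised without the SDK installed.

```python
from typing import Any, Callable, Optional


def _default_factory(host: str) -> Any:
    # Imported lazily so this sketch can be read and tested without the SDK.
    import kfp
    return kfp.Client(host=host)


class RefreshingClient:
    """Caches a client and rebuilds it when the looked-up hostname changes."""

    def __init__(self, lookup_host: Callable[[], str],
                 factory: Callable[[str], Any] = _default_factory) -> None:
        self._lookup_host = lookup_host  # hypothetical: returns current hostname
        self._factory = factory
        self._host: Optional[str] = None
        self._client: Any = None

    def get(self) -> Any:
        """Return a client bound to the current hostname."""
        host = self._lookup_host()
        if host != self._host:
            # Hostname changed (or first call): build a fresh client.
            self._client = self._factory(host)
            self._host = host
        return self._client
```

The lookup runs on every get(), so in practice it would probably need its own caching or rate limiting.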

A root fix is to change proxy-agent/proxy-server so that one K8s cluster provides only one hostname, even when the proxy-agent is moved to a new VM (possibly a VM in a new region in a multi-region/multi-master cluster).

/reopen
since #3845 reverted the fix here

@Bobgy: Reopened this issue.

Another google3 fix should be applied to the proxy-server.
PR #3845 is being submitted.

Background info for external readers:
The proxy-agent and proxy-server work together as an InverseProxy service so that the KFP API/UI can be exposed via a hostname like xxx.googleusercontent.com.

The previous fix, #3663, addressed this issue by switching to a new hostname whenever the proxy-agent is moved to a new VM (GKE node). That solves the issue, but the drawback is that the hostname changes.

A changed hostname may force users to update code such as "kfp.Client(host=xxx)", and we don't know when GKE will move the proxy-agent to a new node, so this is not user friendly. We therefore decided to revert the previous fix (#3663). The revert PR is #3845.

So the hostname is always fixed after an installation in a GKE cluster. However, the original problem of this ticket remains: if the proxy-agent is moved to a new node, the proxy may stop working. It can't always be reproduced. If you hit the issue, we'd appreciate it if you could paste some proxy-agent logs here.

A possible fix will be on the proxy-server side (not on GitHub, b/157252786).

FYI: we are treating this issue with high priority.

Detailed logs are welcome. It's possible that after the proxy-agent is moved to a new VM and the session expires, the current codebase doesn't handle the session/token refresh well (just a guess for now).

client error msg is from here,

The proxy-agent can't get pending requests from the proxy-server and gets a 401 instead.
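
Since proxy-agent logs were requested above, here is a small sketch for pulling the relevant lines out of a log dump. It assumes only the message format quoted in the original report ("Failed to list pending requests: 401, ..."); the function name is hypothetical.

```python
import re
from typing import Iterable, List

# Matches the message quoted in the issue: failure text followed by an HTTP status.
PENDING_FAILURE = re.compile(r"Failed to list pending requests: (\d{3})")


def auth_failures(lines: Iterable[str]) -> List[str]:
    """Return log lines where listing pending requests failed with a 401."""
    hits = []
    for line in lines:
        match = PENDING_FAILURE.search(line)
        if match and match.group(1) == "401":
            hits.append(line.rstrip("\n"))
    return hits
```

The input could be the saved output of something like kubectl logs deployment/proxy-agent -n kubeflow, read line by line.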

@parthmishra I found a fix inside the Google-internal repo for the multi-zone/multi-master case.
Quick question: does your case only happen in a multi-master cluster, or is your cluster single-master?

The fix I found is only for the multi-master case. I'm not sure whether it solves your problem.

The Google-internal repo fix (for the proxy-server) has been submitted. It fixes the multi-master case, which shows the same error log.

ETA for reaching production is ~1 week. Closing this ticket.

#3845 will be in the next release (Yuan is on call for the release; it may be named 1.0.0-rc.0 and later 1.0.0.0).
