Pipelines: Warning message is confusing when pod logs cannot be retrieved

Created on 8 May 2020 · 17 Comments · Source: kubeflow/pipelines

What steps did you take:

After a pod finishes successfully and is later reclaimed, the following warning is displayed. The message often confuses first-time users, and it directs them to the troubleshooting guide.

Warning: failed to retrieve pod logs. Possible reasons include cluster autoscaling or pod preemption

What happened:

In fact, the logs can be viewed in Stackdriver Kubernetes Monitoring.

What did you expect to happen:

Remove the warning message.

/kind bug
/area frontend

Labels: area/frontend, good first issue, help wanted, kind/bug, priority/p1, status/triaged

All 17 comments

Thanks for the suggestion!
Sounds reasonable to me.

We could show the troubleshooting guide only when there is an error, not when there is a warning.

happy to fix this
/assign @jonasdebeukelaer

oh wait is this already done? i.e. just removing 'troubleshooting guide'?

@jonasdebeukelaer Thanks for offering help!
This still needs to be done.

Some helpful information for contribution:

  1. frontend contribution guide: https://github.com/kubeflow/pipelines/tree/master/frontend
  2. Banner component (that shows the troubleshooting link): https://github.com/kubeflow/pipelines/blob/master/frontend/src/components/Banner.tsx
  3. Run Details Page's log viewer tab's banner: https://github.com/kubeflow/pipelines/blob/e52481a164e8b7d9a1352c592f51f47c46e4a576/frontend/src/pages/RunDetails.tsx#L472

My suggested UX would be to hide the troubleshooting link when the given banner is a warning (it still shows it when the banner is error), but you can take a look and decide if that feels reasonable to you. It already supports hiding the link on ad-hoc usage: https://github.com/kubeflow/pipelines/blob/master/frontend/src/components/Banner.tsx#L72, so we can also dynamically configure it on usages.
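The suggested behavior could be sketched roughly as below. This is a minimal illustration, not the actual Banner.tsx code: the `BannerOptions` type, the `showTroubleshootingLink` field, and the `shouldShowTroubleshootingLink` helper are hypothetical names, and treating an unspecified mode as an error is an assumption.

```typescript
// Hypothetical sketch: decide whether a banner shows the troubleshooting link.
// None of these names are taken from the real Banner component.
type BannerMode = 'error' | 'warning' | 'info';

interface BannerOptions {
  message: string;
  // Assumption: a banner without an explicit mode is treated as an error.
  mode?: BannerMode;
  // Optional per-usage override, so a call site can force the link on or off.
  showTroubleshootingLink?: boolean;
}

function shouldShowTroubleshootingLink(opts: BannerOptions): boolean {
  // An explicit per-usage setting always wins.
  if (opts.showTroubleshootingLink !== undefined) {
    return opts.showTroubleshootingLink;
  }
  // Otherwise show the link only for errors, never for warnings or info.
  const mode: BannerMode = opts.mode === undefined ? 'error' : opts.mode;
  return mode === 'error';
}
```

With this shape, the pod-logs warning would hide the link by default, while an individual usage could still opt in or out explicitly.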

@Bobgy @jonasdebeukelaer I wonder if it is okay to remove the message completely, or at least lower the level to informational (without the exclamation mark and "Warning" prefix).

/reopen
I just tested on 1.0.0 and the problem still exists.

Now it shows a failure to retrieve pod logs, with this error message:

Error response: Could not get main container logs: Error: Unable to retrieve workflow status: [object Object].

We didn't account for the case where the workflow is also missing.
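As a side note, the `[object Object]` in that message is what JavaScript produces when a plain object is interpolated into a string. A minimal sketch of the effect and one way to avoid it follows; the `describeError` helper and the `status` object are hypothetical and only stand in for whatever the server actually receives.

```typescript
// Hypothetical helper: turn an unknown error value into readable text.
function describeError(err: unknown): string {
  if (err instanceof Error) {
    return err.message;
  }
  if (typeof err === 'string') {
    return err;
  }
  try {
    // JSON.stringify returns undefined for values such as undefined or functions.
    const json = JSON.stringify(err);
    return json === undefined ? String(err) : json;
  } catch {
    // JSON.stringify throws on circular structures; fall back to String().
    return String(err);
  }
}

// Made-up error payload, for illustration only.
const status = { code: 404, message: 'workflow not found' };

// Interpolating the object directly loses all detail.
const naive = `Unable to retrieve workflow status: ${status}`;

// Formatting it first keeps the detail in the message.
const better = `Unable to retrieve workflow status: ${describeError(status)}`;
```

Here `naive` ends in `[object Object]`, while `better` ends in the serialized payload.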

@Bobgy: Reopened this issue.

@jonasdebeukelaer do you want to revisit this?
Or I can follow up too

hey @Bobgy should be a quick one to fix so happy to do it. In what situations can a workflow be missing?

When a user configures a TTL to garbage-collect workflows. (We have a default TTL of 1 day.)

In fact, workflow status should be persisted into the run details DB rows, so the UI shouldn't need to fetch the workflow.

hmm makes sense 👍

@Bobgy The logs are already persisted to the storage the same way as other artifacts. AFAIK, @eterna2 added support to show these logs in the UX when the pod is not available, but this option is turned off by default. Maybe you can enable this option?

@Ark-kun this bug: https://github.com/kubeflow/pipelines/issues/3711#issuecomment-666877126 must be fixed before logs can be reused from archive.

@Ark-kun How do you enable the option?

@Ark-kun @Bobgy Is there any update to this? How do you enable log persistence?

No update yet, we need someone from community to fix this problem.

For us, we are on GCP, so GCP stackdriver auto persists all Kubernetes pod logs.

@haydnkeung I managed to enable log persistence.

Check the configmap workflow-controller-configmap and see if archiveLogs: true is set. For me it wasn't, even though I'm on KF 1.1, so I had to set it in the config-map.yaml found in the manifest dir under argo/base.
