Tekton Pipelines should persist the logs of old PipelineRuns in the configured bucket.
The logs for old pipeline runs are not available. When I try to view the logs, fetching fails with the following message:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"global-pr-checks-7wxpv-set-final-status-4nr8m-pod-jzv8z\" not found","reason":"NotFound","details":{"name":"global-pr-checks-7wxpv-set-final-status-4nr8m-pod-jzv8z","kind":"pods"},"code":404}
Kubernetes version:
Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-065dce", GitCommit:"065dcecfcd2a91bd68a17ee0b5e895088430bd05", GitTreeState:"clean", BuildDate:"2020-07-16T01:44:47Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
Tekton Pipeline version:
Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}':
v0.16.3
I believe this is by-design and not a bug, but I do wish that it _wasn't_.
Nodes are replaced all the time and Pods will come and go freely over the lifetime of a cluster... it'd be nice if there were still a way to grab the logs for previous PipelineRuns. I'm sure there's a possible issue here with the 1 MB limit etcd places on each K8s API object, so perhaps that's why it's the way it is at the moment.
Question for devs: Is this something that's being investigated?
Logs are not stored in etcd AFAIK; this can be an issue if Tekton does not ship the logs to the configured S3/GCS bucket. Just want to confirm this behaviour. @bobcatfish can you share your inputs?
@daviddyball indeed, this is by design. Aggregating logs for Tekton workflows is similar to aggregating logs for any workload running in Kubernetes.
The main question is: should Tekton provide a component to aggregate logs somewhere, or should we just rely on existing ones in the Kubernetes ecosystem (Loki, …) and document them / give some advice on them? This is something that is being "investigated", but not as part of the core component of Tekton (aka tektoncd/pipeline).
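As an illustration of the "rely on existing ecosystem tools" option: with Loki, pulling a run's logs comes down to a single query_range call against its HTTP API. This is only a sketch; the tekton_dev_pipelineRun stream label is an assumption about how your scrape config relabels the pod label tekton.dev/pipelineRun, not something Tekton or Loki sets up by default.

```python
from urllib.parse import urlencode


def loki_query_url(base_url, pipeline_run, start_ns, end_ns, limit=5000):
    """Build a Loki /loki/api/v1/query_range URL for one PipelineRun.

    Assumes the pod label tekton.dev/pipelineRun was relabeled into a
    Loki stream label named `tekton_dev_pipelineRun` by your scrape
    config; adjust the LogQL selector to match your own setup.
    """
    params = {
        "query": '{tekton_dev_pipelineRun="%s"}' % pipeline_run,
        "start": start_ns,          # nanosecond epoch timestamps
        "end": end_ns,
        "limit": limit,
        "direction": "forward",     # oldest first, like `tkn pipelinerun logs`
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"
```

An operator (or a future tkn fallback) could hit this URL with any HTTP client once the pods are gone, since Loki keeps the streams independently of pod lifetime.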
@vdemeester I totally understand the feature-creep that could be encountered by trying to solve this problem, but the main issue is that the core tkn CLI errors if you try to fetch logs for pods that were removed, which feels broken.
I get that logging is quite a nebulous thing and there are already so many different options for this (Loki, ELK, Datadog, etc.)... but in order for an operator to debug a pipeline whose containers are now missing from the system, they have to manually go and scrape their logging system and try to piece together and order the logs, which is honestly more effort than it's worth.
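For the record, the manual stitching described above boils down to something like the following once the lines have been scraped out of a logging backend. This is a sketch of the operator's chore, not a Tekton feature; it assumes the backend can hand back per-task entries with RFC3339 timestamps.

```python
def interleave_task_logs(task_logs):
    """Merge per-task log entries into one chronological stream.

    task_logs: {task_name: [(rfc3339_timestamp, line), ...]}
    Returns "[task] line" strings ordered by timestamp, roughly
    approximating what `tkn pipelinerun logs` prints.
    """
    merged = []
    for task, entries in task_logs.items():
        for ts, line in entries:
            merged.append((ts, task, line))
    # RFC3339 timestamps with the same UTC offset sort correctly as strings.
    merged.sort(key=lambda entry: entry[0])
    return [f"[{task}] {line}" for _, task, line in merged]
```

Even this simple version glosses over real problems (sub-second ties, clock skew between nodes, retried TaskRuns), which is exactly why doing it by hand for every debug session gets old fast.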
@hprateek43 you mentioned that there is an option to ship logs to object storage... where can I find that option? Are the logs shipped in a nicely readable format (e.g. like what you get from tkn pipelinerun logs)?
I completely agree :wink: There is definitely room for improvement here, and, long-term, tools like tkn should be able to get logs from runs that no longer have related pods. Not sure yet how we would do this, but this is something that, I think, we need to support, yes :+1: