Jx: The jenkins-x-gc-activities pods start to fail and bring down Jenkins

Created on 17 Sep 2018  ·  21 Comments  ·  Source: jenkins-x/jx

Summary

Jenkins stops responding to web requests (e.g. the jx console page fails to load). Inspecting the pods shows hundreds of failed jenkins-x-gc-activities pods:

❯ kubectl get pods --all-namespaces
NAMESPACE      NAME                                                   READY     STATUS              RESTARTS   AGE
jxmt           jenkins-x-gc-activities-1537207200-229pb               0/1       Error               0          3m
jxmt           jenkins-x-gc-activities-1537207200-2484n               0/1       Error               0          39m
jxmt           jenkins-x-gc-activities-1537207200-24fs6               0/1       Error               0          9m
jxmt           jenkins-x-gc-activities-1537207200-262j9               0/1       Error               0          31m
jxmt           jenkins-x-gc-activities-1537207200-268d6               0/1       Error               0          12m

The logs on one of those pods show:

❯ kubectl logs --tail=100 jenkins-x-gc-activities-1537207200-zwb8m -n jxmt
error: deployments.apps "prow-build" is forbidden: User "system:serviceaccount:jxmt:jenkins-x-gc-activities" cannot get deployments.apps in the namespace "jxmt"

When I installed JX, I did not enable prow. The install was basically jx create cluster eks and hooking it up to Bitbucket Cloud. I built the cluster this morning and Jenkins is unresponsive. This has actually happened to 2-3 of my clusters.

Steps to reproduce the behavior

  1. Create a cluster in EKS with default environments.
  2. Create a quickstart (golang-http for example). Pipeline passes/fails (doesn't matter).
  3. Let it run for a few hours and the failing pods start to build up.

All of my repositories are private if that matters at all (--git-private flag).

Jx version

The output of jx version is:

NAME               VERSION
jx                 1.3.275
jenkins x platform 0.0.2447
kubernetes cluster v1.10.3
kubectl            v1.11.3
helm client        v2.10.0+g9ad53aa
helm server        v2.10.0+g9ad53aa
git                git version 2.19.0

Kubernetes cluster

EKS and created with jx create cluster eks but pointing to Bitbucket Cloud.

Operating system / Environment

N/A

Expected behavior

The jenkins-x-gc-activities pods complete successfully.

Actual behavior

The jenkins-x-gc-activities pods fail and end up bringing down the Jenkins site.

All 21 comments

Tried to run jx gc activities in hopes it would clear out the dead pods or something and got:

❯ jx gc activities
error: deployments.apps "prow-build" not found

So I ran this command (replace $NS with the namespace of JX, usually just jx):

kubectl create clusterrolebinding jenkins-jx-role-binding-2 --clusterrole=cluster-admin --user=system:serviceaccount:$NS:jenkins-x-gc-activities --user=system:serviceaccount:$NS:jenkins-x-gc-previews

And now the error on the pods shows as:

❯ kubectl logs --tail=100 -n jx jenkins-x-gc-activities-1537304400-zvx7r
error: deployments.apps "prow-build" not found

Seeing the not found error on a vanilla minikube install. Looks like the problem is in IsProwEnabled around: https://github.com/jenkins-x/jx/blob/677f6c89c150edec35deea492eb38de78ecb541f/pkg/kube/namespaces.go#L97

When there is no prow deployment it returns false but it also returns the error indicating that the deployment was not found. The calling code in GCActivitiesOptions.Run just sees the error and bails out. Either IsProwEnabled needs to ignore any errors or it should be doing a List and searching for the deployment.

fwiw: in the meantime, kubectl delete pod -l app=gc-activities will clear up all the pods

@davidcurrie Yes, I found that this morning - going to verify and then send a PR. It seems that line caused both issues in this ticket: the gc-activities pods need permission to get deployments, and the error needs to be ignored, as you said.

Created https://github.com/jenkins-x/jx/pull/1723 - this only fixes the second problem; the original permissions problem still exists.

Created https://github.com/jenkins-x/jenkins-x-platform/pull/3562 to try to fix original permission problem. It's a super wild guess... hah

Yeah, seeing this too on a plain Minikube install I started today (with jx create cluster minikube). Oddly enough I don't seem to see this error with every install, nor do I see it when running it outside of Minikube.

David's temporary workaround did the trick for me though; it ended up deleting over 500 pods and "unfroze" Jenkins. Within minutes, though, another 20 or so were back.

We desperately need #1705 to be fixed. It brings the EKS cluster down due to the k8s 1.10.0 bug, which causes failed jobs to loop infinitely until the cluster runs out of resources. My dirty workaround for this one is to disable the jenkins-x-gc-activities cronjob.

https://github.com/jenkins-x/jx/pull/1723 is now merged and is in jx 1.3.291 onwards. To pick that up you need to bump CHART_VERSION in your ~/.jx/cloud-environments/Makefile. 0.0.2510 is the latest. When you create the cluster, make sure you don't select the option to recreate the cloud environment; otherwise you'll lose this change.

@davidcurrie @hekonsek @polothy this has now been released. Thanks for the work on this!

Kudos to @polothy for fixing this!

I'm still getting this problem with a newly created cluster, with the latest jx version:

jx                 1.3.315
jenkins x platform 0.0.2544
kubernetes cluster v1.10.3
kubectl            v1.12.0-rc.1
helm client        v2.10.0+g9ad53aa
helm server        v2.10.0+g9ad53aa
git                git version 2.7.4

Any ideas on what could be wrong with my setup? Or how to update jx to get the release that includes this fix?

Many thanks!

I thought the chart would be regenerated and dumped into https://github.com/jenkins-x/jenkins-x-platform but I'm not seeing the change there. I asked here https://github.com/jenkins-x/jx/pull/1735#issuecomment-424036775 if something else needs to be done.

I figured this project generated the values.yaml file in jenkins-x-platform, but maybe not?

Blast - that's what I did originally but thought that file was auto-generated from this repo (https://github.com/jenkins-x/jenkins-x-platform/pull/3562). Very confusing.

No, you have to add those values as I did. :) My PR will do the job, I've just tested it against my cluster.

Do you know how the charts are used inside of this project? (EG, this is what I changed: https://github.com/jenkins-x/jx/pull/1735/files)

I think the chart in your project will be respected as well, but you should specify the API group as apps, not the empty string, because the missing permission is for deployments.apps.
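To illustrate why the API group matters, here is a simplified, hypothetical model of how an RBAC rule is matched against a request. It is only a sketch: `policyRule` and `allows` are made-up names for this example, not the Kubernetes types, and it ignores details such as resource names and subresources. The empty string denotes the core API group, so a rule written with it does not cover deployments, which live in the `apps` group.

```go
package main

import "fmt"

// policyRule is a simplified, hypothetical model of an RBAC rule.
type policyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// contains reports whether list includes want or the "*" wildcard.
func contains(list []string, want string) bool {
	for _, s := range list {
		if s == want || s == "*" {
			return true
		}
	}
	return false
}

// allows reports whether the rule permits verb on resource in apiGroup.
// All three dimensions must match for the request to be allowed.
func (r policyRule) allows(apiGroup, resource, verb string) bool {
	return contains(r.APIGroups, apiGroup) &&
		contains(r.Resources, resource) &&
		contains(r.Verbs, verb)
}

func main() {
	// Rule written against the core group (empty string).
	coreRule := policyRule{
		APIGroups: []string{""},
		Resources: []string{"deployments"},
		Verbs:     []string{"get"},
	}
	// Rule written against the apps group, as suggested in the comment.
	appsRule := policyRule{
		APIGroups: []string{"apps"},
		Resources: []string{"deployments"},
		Verbs:     []string{"get"},
	}

	// The gc-activities job asks to "get" deployments in the "apps" group.
	fmt.Println("core-group rule allows:", coreRule.allows("apps", "deployments", "get"))
	fmt.Println("apps-group rule allows:", appsRule.allows("apps", "deployments", "get"))
}
```

Under this model only the second rule permits the request, which matches the "cannot get deployments.apps" error in the original report.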

BTW I think that platform should be the right place to make this kind of change.

We started with the chart in jenkins-x-platform but are moving to using the charts in the jx repo. So ATM there's some duplication annoyingly. Will try and sort out soon but for now I think this PR should do it https://github.com/jenkins-x/jenkins-x-platform/pull/3675 . I've just merged it so will run the end to end tests and release cloud-environments provided all is green.
