This is a problem that kept me busy for several days. When several environment build pipelines run at roughly the same time (because several application master branches were updated at the same time), the environment can be totally wiped out: the deployments, the services, the ingresses, ... everything managed by the charts.
After adding outputs to jx (to see the output of kubectl apply and kubectl delete), the cause is clear:
1. Build 61's kubectl apply sets everything to version 61.
2. Build 62's kubectl apply sets everything to version 62.
3. Build 61's kubectl delete deletes everything that does not have version 61, which at this point is literally everything.
4. Build 62's kubectl delete runs, but there is nothing left anyway.

Both staging and production can be impacted. In production one can be more careful not to promote applications too fast.
This is a synchronization problem.
The condition jenkins.io/chart-release=jx,jenkins.io/version!=78 seems quite violent; jenkins.io/chart-release=jx,jenkins.io/version<78 would already be much safer.
I think that if jx step helm apply were cut into two steps (jx step helm apply and jx step helm clean), and before each step a check verified that the pipeline is still the latest one, most of the problems could be avoided.
But unlikely orderings of events could still be a problem.
Just run many pipelines:
jx start pipeline my-repo/my-project-1/master
jx start pipeline my-repo/my-project-2/master
jx start pipeline my-repo/my-project-3/master
jx start pipeline my-repo/my-project-4/master
jx start pipeline my-repo/my-project-5/master
Wait and pray to see it:
watch kubectl get all -n jx-staging
Pipelines should happen one by one, and everybody should be happy.
The environment gets totally wiped out, and everybody is sad.
The current head of jx (d2b7a7115fed1c537491ad556fd558a27b621163), customized to display the kubectl apply and kubectl delete output
The output of jx version is:
NAME VERSION
jx 2.0.998-dev+46a9e4e6e
Kubernetes cluster v1.14.8-gke.12
kubectl v1.14.7
helm client Client: v2.14.3+g0e7f3b6
git 2.17.1
Operating System Ubuntu 18.04.3 LTS
Standard jx create cluster gke and jx boot process.
Ubuntu 18.04.3 LTS
Here is what I could see while running 5 builds (from #77 to #81). I've lost the #77 logs, but there was nothing specific to notice. #78 clearly wiped out everything (apparently with the kind help of #80).
xxx/environment-xxx-staging/master #78 promotion
kubectl apply --recursive -f /tmp/helm-template-workdir-948611119/jx/output/namespaces/jx-staging -l jenkins.io/chart-release=jx --namespace jx-staging --wait --validate=false
==========
deployment.extensions/jx-api-poi-v2 configured
release.jenkins.io/api-poi-v2-0.0.20 created
service/api-poi-v2 configured
role.rbac.authorization.k8s.io/cleanup configured
rolebinding.rbac.authorization.k8s.io/cleanup configured
serviceaccount/cleanup configured
configmap/exposecontroller configured
role.rbac.authorization.k8s.io/expose configured
rolebinding.rbac.authorization.k8s.io/expose configured
serviceaccount/expose configured
deployment.extensions/jx-olli-log-management configured
ingress.extensions/jx-olli-log-management configured
issuer.certmanager.k8s.io/letsencrypt-prod configured
release.jenkins.io/olli-log-management-0.0.75 configured
service/olli-log-management configured
configmap/skills-server-redis configured
service/skills-server-redis-headless configured
configmap/skills-server-redis-health configured
statefulset.apps/skills-server-redis-master configured
service/skills-server-redis-master configured
deployment.extensions/jx-skills-server-x configured
release.jenkins.io/skills-server-x-0.0.37 created
service/skills-server-x configured
deployment.extensions/jx-skills-vue-x configured
ingress.extensions/jx-skills-vue-x configured
release.jenkins.io/skills-vue-x-0.0.59 created
service/skills-vue-x configured
deployment.extensions/jx-testx configured
release.jenkins.io/testx-0.0.7 created
service/testx configured
==========
kubectl delete all --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
service "skills-server-redis-headless" deleted
service "skills-server-redis-master" deleted
service "skills-vue-x" deleted
deployment.apps "jx-api-poi-v2" deleted
deployment.apps "jx-olli-log-management" deleted
deployment.apps "jx-skills-server-x" deleted
deployment.apps "jx-skills-vue-x" deleted
deployment.apps "jx-testx" deleted
release.jenkins.io "olli-log-management-0.0.75" deleted
==========
kubectl delete pvc --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete configmap --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
configmap "exposecontroller" deleted
configmap "skills-server-redis" deleted
configmap "skills-server-redis-health" deleted
==========
kubectl delete release --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete sa --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
serviceaccount "cleanup" deleted
serviceaccount "expose" deleted
==========
kubectl delete role --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
role.rbac.authorization.k8s.io "cleanup" deleted
role.rbac.authorization.k8s.io "expose" deleted
==========
kubectl delete rolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
rolebinding.rbac.authorization.k8s.io "cleanup" deleted
rolebinding.rbac.authorization.k8s.io "expose" deleted
==========
kubectl delete secret --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrole --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
error: upgrading helm chart '.': failed to run 'kubectl delete -f /tmp/helm-template-workdir-948611119/jx/helmHooks/env/charts/expose/templates/job.yaml --namespace jx-staging --wait' command in directory '/tmp/jx-helm-apply-142711026/env', output: 'Error from server (NotFound): error when deleting "/tmp/helm-template-workdir-948611119/jx/helmHooks/env/charts/expose/templates/job.yaml": jobs.batch "expose" not found'
xxx/environment-xxx-staging/master #79 promotion
Showing logs for build xxx/environment-xxx-staging/master #79 promotion stage meta-pipeline and container step-create-tekton-crds
? A local Jenkins X versions repository already exists, pulling the latest: Yes
running command: jx step next-version --use-git-tag-only --tag
created new version: 0.0.73 and written to file: ./VERSION
error: Have you set up a git credential helper? See https://help.github.com/articles/caching-your-github-password-in-git/
: git output: To https://github.com/xxx/environment-x-staging.git
! [rejected] v0.0.73 -> v0.0.73 (already exists)
error: failed to push some refs to 'https://github.com/xxx/environment-xxx-staging.git'
hint: Updates were rejected because the tag already exists in the remote.: failed to run 'git push origin v0.0.73' command in directory '', output: 'To https://github.com/xxx/environment-xxx-staging.git
! [rejected] v0.0.73 -> v0.0.73 (already exists)
error: failed to push some refs to 'https://github.com/xxx/environment-xxx-staging.git'
hint: Updates were rejected because the tag already exists in the remote.'
error: failed to set the version on release pipelines: failed to run '/bin/sh -c jx step next-version --use-git-tag-only --tag' command in directory '/workspace/source', output: ''
Pipeline failed on stage 'meta-pipeline' : container 'step-create-tekton-crds'. The execution of the pipeline has stopped.
xxx/environment-xxx-staging/master #80 promotion
kubectl apply --recursive -f /tmp/helm-template-workdir-154755877/jx/output/namespaces/jx-staging -l jenkins.io/chart-release=jx --namespace jx-staging --wait --validate=false
==========
deployment.extensions/jx-api-poi-v2 configured
release.jenkins.io/api-poi-v2-0.0.20 configured
service/api-poi-v2 configured
role.rbac.authorization.k8s.io/cleanup configured
rolebinding.rbac.authorization.k8s.io/cleanup configured
serviceaccount/cleanup configured
configmap/exposecontroller configured
role.rbac.authorization.k8s.io/expose configured
rolebinding.rbac.authorization.k8s.io/expose configured
serviceaccount/expose configured
deployment.extensions/jx-olli-log-management configured
ingress.extensions/jx-olli-log-management configured
issuer.certmanager.k8s.io/letsencrypt-prod configured
release.jenkins.io/olli-log-management-0.0.75 configured
service/olli-log-management configured
configmap/skills-server-redis configured
service/skills-server-redis-headless configured
configmap/skills-server-redis-health configured
statefulset.apps/skills-server-redis-master configured
service/skills-server-redis-master configured
deployment.extensions/jx-skills-server-x configured
release.jenkins.io/skills-server-x-0.0.37 configured
service/skills-server-x configured
deployment.extensions/jx-skills-vue-x configured
ingress.extensions/jx-skills-vue-x configured
release.jenkins.io/skills-vue-x-0.0.59 configured
service/skills-vue-x configured
deployment.extensions/jx-testx configured
release.jenkins.io/testx-0.0.7 configured
service/testx configured
==========
kubectl delete all --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
service "api-poi-v2" deleted
service "olli-log-management" deleted
service "skills-server-x" deleted
service "testx" deleted
statefulset.apps "skills-server-redis-master" deleted
release.jenkins.io "api-poi-v2-0.0.20" deleted
release.jenkins.io "skills-server-x-0.0.37" deleted
release.jenkins.io "skills-vue-x-0.0.59" deleted
release.jenkins.io "testx-0.0.7" deleted
==========
kubectl delete pvc --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete configmap --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete release --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete sa --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete role --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete rolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete secret --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrole --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
xxx/environment-xxx-staging/master #81 promotion
kubectl apply --recursive -f /tmp/helm-template-workdir-707178113/jx/output/namespaces/jx-staging -l jenkins.io/chart-release=jx --namespace jx-staging --wait --validate=false
==========
deployment.extensions/jx-api-poi-v2 configured
release.jenkins.io/api-poi-v2-0.0.20 configured
service/api-poi-v2 configured
role.rbac.authorization.k8s.io/cleanup configured
rolebinding.rbac.authorization.k8s.io/cleanup configured
serviceaccount/cleanup configured
configmap/exposecontroller configured
role.rbac.authorization.k8s.io/expose configured
rolebinding.rbac.authorization.k8s.io/expose configured
serviceaccount/expose configured
deployment.extensions/jx-olli-log-management configured
ingress.extensions/jx-olli-log-management configured
issuer.certmanager.k8s.io/letsencrypt-prod configured
release.jenkins.io/olli-log-management-0.0.75 configured
service/olli-log-management configured
configmap/skills-server-redis configured
service/skills-server-redis-headless configured
configmap/skills-server-redis-health configured
statefulset.apps/skills-server-redis-master configured
service/skills-server-redis-master configured
deployment.extensions/jx-skills-server-x configured
release.jenkins.io/skills-server-x-0.0.37 configured
service/skills-server-x configured
deployment.extensions/jx-skills-vue-x configured
ingress.extensions/jx-skills-vue-x configured
release.jenkins.io/skills-vue-x-0.0.59 configured
service/skills-vue-x configured
deployment.extensions/jx-testx configured
release.jenkins.io/testx-0.0.7 configured
service/testx configured
==========
kubectl delete all --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
release.jenkins.io "api-poi-v2-0.0.19" deleted
release.jenkins.io "skills-server-x-0.0.36" deleted
release.jenkins.io "skills-vue-x-0.0.57" deleted
release.jenkins.io "testx-0.0.6" deleted
==========
kubectl delete pvc --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete configmap --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete release --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete sa --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete role --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete rolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete secret --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrole --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
error: upgrading helm chart '.': failed to run 'kubectl delete -f /tmp/helm-template-workdir-707178113/jx/helmHooks/env/charts/expose/templates/job.yaml --namespace jx-staging --wait' command in directory '/tmp/jx-helm-apply-934058140/env', output: 'Error from server (NotFound): error when deleting "/tmp/helm-template-workdir-707178113/jx/helmHooks/env/charts/expose/templates/job.yaml": jobs.batch "expose" not found'
Oh, and if it happens to you, jx start pipeline xxx/environment-xxx-staging/master (or any action that eventually starts a pipeline) will fix the mess.
The problem was also amplified by a failing post-upgrade job. It seems the pipeline would wait for it to finally fail after several restarts. Together, this greatly increased the likelihood of the wipe-out, which made it happen several times a day.
Thanks for this research. I was already wondering what caused all my services in a namespace to vanish randomly when it was particularly busy. Definitely not an option for production workloads ;)
I think it will never be fully fixable by reordering the build steps of two concurrent pipelines, checking things, or changing conditions, because that would be quite a burden to maintain; these steps might need to change in ways we cannot fully predict.
I think we have two options here:
1. Wait for the previous pipeline to complete before starting the next one. It could take a lot of time though, and it might be unnecessary to wait if a newer build will do the same thing anyway. Still the safest option.
2. Kill the running one first, which should be fine if things are meant to be idempotent. I see this regularly with other build systems, where a build is restarted immediately once new information comes in, rendering the previous build obsolete. A side effect could be missing or inconsistent tags or artifacts, though, when things get killed before creating them. I'm not sure how bad this would be for the environment pipelines; they are meant to replace all outdated services on each run anyway.
I'd say option 1, and then 2 later where it is safe to do so, so you do not wait forever when a pipeline gets stuck.
The proposed solution to fix this issue would be to create a lock as a ConfigMap when the step helm apply begins to upgrade a chart release. The lock would be created per release, in the namespace where the release is deployed, and removed automatically at the end of the upgrade process. This ensures that every helm chart release is applied atomically.
Any concurrent upgrade attempted while the lock is active will fail immediately. This is more robust than retrying or waiting on a timer, because a pipeline that failed due to a concurrency issue can be re-triggered at any time. The same is valid when the step helm apply is executed manually.
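For illustration, a minimal fail-fast sketch of that idea in Go, assuming pre-0.18 client-go signatures (no context argument); the lockRelease helper, the lock name, and the label are hypothetical, not necessarily what the eventual jx implementation would use:

```go
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// lockRelease is a hypothetical helper: it takes a per-release lock as a
// ConfigMap and fails immediately if a concurrent upgrade already holds it.
func lockRelease(client kubernetes.Interface, namespace, release, build string) (func(), error) {
	lock := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "jx-lock-" + release, // hypothetical naming scheme
			Labels: map[string]string{"jenkins.io/kind": "helm-apply-lock"},
		},
		Data: map[string]string{"build": build},
	}
	if _, err := client.CoreV1().ConfigMaps(namespace).Create(lock); err != nil {
		if apierrors.IsAlreadyExists(err) {
			// another upgrade of this release is in flight: fail immediately
			return nil, fmt.Errorf("release %s is locked by a concurrent upgrade", release)
		}
		return nil, err
	}
	// the caller defers this to remove the lock at the end of the upgrade
	unlock := func() { _ = client.CoreV1().ConfigMaps(namespace).Delete(lock.Name, nil) }
	return unlock, nil
}
```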
relates to #5471
It will surely make everything much safer: only one helm apply at a time in each namespace.
In the future, a "wait a few minutes if I'm the latest pipeline" behavior would, however, make things smoother.
@ccojocar, how far did you get with the lock approach? Did you have anything working?
What about a custom admission controller? Based on some config, it would not let requests for specific pipelines proceed while another pipeline is still running. In our case we would use it for the environment pipelines, but we could put some API around it so that basically any pipeline can be configured to run in isolation.
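A minimal sketch of what such a validating webhook handler could look like, purely hypothetical (nothing like this exists in jx today); environmentLocked stands in for whatever check decides another pipeline is still running:

```go
package main

import (
	"encoding/json"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// environmentLocked is a hypothetical stand-in for the real check,
// e.g. looking up a lock ConfigMap or a running PipelineActivity.
func environmentLocked(namespace string) bool { return false }

// handleAdmission rejects admission requests for an environment
// namespace while another pipeline is still running there.
func handleAdmission(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
	if environmentLocked(review.Request.Namespace) {
		resp.Allowed = false
		resp.Result = &metav1.Status{Message: "another environment pipeline is still running"}
	}
	review.Response = resp
	_ = json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate", handleAdmission)
	// admission webhooks must be served over TLS; cert paths are hypothetical
	_ = http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil)
}
```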
Hi @aure-olli, have you experienced any further occurrences of the issue? Given that the effort to address the matter is quite involved, it would be good to help us understand the urgency based on how frequently you experience the problem.
Hi @deanesmith
We have temporarily solved our problem with a custom jx build, changing the selector used to clean the environment, as suggested in my first post: see the diff.
The problem was much amplified by a pre-install hook that failed and restarted over and over, so the whole process would take much longer than it should. The environment would be wiped out several times a day in those conditions! In my opinion this is a critically serious matter because:

- pre-install hooks may also fail due to bad luck (imagine the database you try to initialize is temporarily offline)

I must say that I like your solution of locking the whole kubectl apply process. However, pardon my ignorance, but I'm quite surprised it takes so long to implement. The way I would personally implement it (in step helm apply):
```
while true:
    try:
        create a jx-lock-<namespace> configmap with:
            - an owner reference to the pipeline
            - the pipeline build number
            - an empty "next" field
        break
    except already exists:
        get the configmap jx-lock-<namespace>
        if the pipeline in the owner reference is finished:
            try:
                delete the received version of the configmap
            except: pass
            continue
        if "next" is from a pipeline with a higher build number:
            fail
        try:
            update the "next" field of the configmap with our own pipeline
        except: continue
        watch the configmap and the pipeline:
            if the configmap has changed or was deleted:
                continue
            if the pipeline status has changed or was deleted:
                continue
```
Then delete the configmap once kubectl apply and kubectl delete have finished, successfully or not.
Kubernetes handles object creation and update atomically (as long as you provide the current resource version), so there is no concurrency problem.
I wouldn't mind implementing it myself. Of course there are missing details that will make it less easy, but I think it is globally straightforward.
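For what it's worth, here is a minimal Go sketch of that loop, assuming pre-0.18 client-go signatures (no context argument) and an apimachinery recent enough to support the resourceVersion delete precondition. isFinished and waitForChange are hypothetical stand-ins for the pipeline-status lookup and the watch, and a data field replaces the owner reference for brevity:

```go
import (
	"fmt"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// isFinished and waitForChange are hypothetical helpers: the first would
// look up the status of the pipeline recorded in the lock, the second
// would watch the configmap (and that pipeline) until something changes.
func isFinished(pipeline string) bool                            { return false }
func waitForChange(client kubernetes.Interface, ns, name string) {}

// acquireLock blocks until this build holds the jx-lock-<namespace>
// configmap, or fails if a newer build is already waiting for it.
func acquireLock(client kubernetes.Interface, ns, pipeline string, build int) error {
	name := "jx-lock-" + ns
	for {
		cm := &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: name},
			// the real thing could also set an owner reference to the pipeline
			Data: map[string]string{
				"pipeline": pipeline,
				"build":    strconv.Itoa(build),
				"next":     "", // no waiter yet
			},
		}
		// Try to take the lock; Create is atomic, so only one pipeline wins.
		_, err := client.CoreV1().ConfigMaps(ns).Create(cm)
		if err == nil {
			return nil // we hold the lock
		}
		if !apierrors.IsAlreadyExists(err) {
			return err
		}
		// Someone else holds it: inspect the current holder.
		held, err := client.CoreV1().ConfigMaps(ns).Get(name, metav1.GetOptions{})
		if err != nil {
			continue // lock deleted in between, retry the create
		}
		if isFinished(held.Data["pipeline"]) {
			// Stale lock left behind: delete exactly the version we read;
			// the precondition makes this a no-op if it changed meanwhile.
			rv := held.ResourceVersion
			_ = client.CoreV1().ConfigMaps(ns).Delete(name, &metav1.DeleteOptions{
				Preconditions: &metav1.Preconditions{ResourceVersion: &rv},
			})
			continue
		}
		if next, _ := strconv.Atoi(held.Data["next"]); next > build {
			return fmt.Errorf("build %d is already waiting for the lock, giving up", next)
		}
		// Register as the next waiter; the resourceVersion carried by
		// `held` makes this update an atomic compare-and-swap.
		held.Data["next"] = strconv.Itoa(build)
		if _, err := client.CoreV1().ConfigMaps(ns).Update(held); err != nil {
			continue // conflict: somebody else updated the lock first
		}
		// Wait until the lock or the holding pipeline changes, then retry.
		waitForChange(client, ns, name)
	}
}
```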
@aure-olli, we are handling other priority matters right now. Do you want to create a PR for your proposal? It seems reasonable.
I made a PR to fix the problem, using the algorithm we talked about: #6953. This is not safe for merge yet, but seems to work decently in my first tests.
Can you please let me know what you think about it?