This is a problem that kept me busy for several days. When several environment build pipelines run at roughly the same time (because several application master branches were updated at the same time), the environment can be totally wiped out: the deployments, the services, the ingresses, ... everything managed by the charts.
After adding outputs to jx (to see the output of kubectl apply and kubectl delete), the cause is clear:
1. Build 61's kubectl apply sets everything to version 61.
2. Build 62's kubectl apply sets everything to version 62.
3. Build 61's kubectl delete deletes everything that does not have version 61, which at this point is literally everything.
4. Build 62's kubectl delete runs, but there is nothing left anyway.

Both staging and production can be impacted. In production one can be more careful not to promote applications too fast.
This is a synchronization problem.
The condition jenkins.io/chart-release=jx,jenkins.io/version!=78 seems quite violent; jenkins.io/chart-release=jx,jenkins.io/version<78 would already be much safer.
I think that if jx step helm apply were cut into two steps (jx step helm apply and jx step helm clean), and before each step a check verified that the pipeline is still the latest one, most of the problems could be avoided.
But unlikely orderings of events could still be a problem.
Just run many pipelines:
jx start pipeline my-repo/my-project-1/master
jx start pipeline my-repo/my-project-2/master
jx start pipeline my-repo/my-project-3/master
jx start pipeline my-repo/my-project-4/master
jx start pipeline my-repo/my-project-5/master
Wait and pray to see it:
watch kubectl get all -n jx-staging
Pipelines should happen one by one, and everybody should be happy.
The environment gets totally wiped out, and everybody is sad.
The current head of jx (d2b7a7115fed1c537491ad556fd558a27b621163), customized to display the kubectl apply and kubectl delete output
The output of jx version is:
NAME VERSION
jx 2.0.998-dev+46a9e4e6e
Kubernetes cluster v1.14.8-gke.12
kubectl v1.14.7
helm client Client: v2.14.3+g0e7f3b6
git 2.17.1
Operating System Ubuntu 18.04.3 LTS
Standard jx create cluster gke and jx boot process.
Ubuntu 18.04.3 LTS
Here is what I could see while running 5 builds (from #77 to #81). I've lost the #77 logs, but there was nothing specific to notice. #78 clearly wiped out everything (apparently with the kind help of #80).
xxx/environment-xxx-staging/master #78 promotion
kubectl apply --recursive -f /tmp/helm-template-workdir-948611119/jx/output/namespaces/jx-staging -l jenkins.io/chart-release=jx --namespace jx-staging --wait --validate=false
==========
deployment.extensions/jx-api-poi-v2 configured
release.jenkins.io/api-poi-v2-0.0.20 created
service/api-poi-v2 configured
role.rbac.authorization.k8s.io/cleanup configured
rolebinding.rbac.authorization.k8s.io/cleanup configured
serviceaccount/cleanup configured
configmap/exposecontroller configured
role.rbac.authorization.k8s.io/expose configured
rolebinding.rbac.authorization.k8s.io/expose configured
serviceaccount/expose configured
deployment.extensions/jx-olli-log-management configured
ingress.extensions/jx-olli-log-management configured
issuer.certmanager.k8s.io/letsencrypt-prod configured
release.jenkins.io/olli-log-management-0.0.75 configured
service/olli-log-management configured
configmap/skills-server-redis configured
service/skills-server-redis-headless configured
configmap/skills-server-redis-health configured
statefulset.apps/skills-server-redis-master configured
service/skills-server-redis-master configured
deployment.extensions/jx-skills-server-x configured
release.jenkins.io/skills-server-x-0.0.37 created
service/skills-server-x configured
deployment.extensions/jx-skills-vue-x configured
ingress.extensions/jx-skills-vue-x configured
release.jenkins.io/skills-vue-x-0.0.59 created
service/skills-vue-x configured
deployment.extensions/jx-testx configured
release.jenkins.io/testx-0.0.7 created
service/testx configured
==========
kubectl delete all --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
service "skills-server-redis-headless" deleted
service "skills-server-redis-master" deleted
service "skills-vue-x" deleted
deployment.apps "jx-api-poi-v2" deleted
deployment.apps "jx-olli-log-management" deleted
deployment.apps "jx-skills-server-x" deleted
deployment.apps "jx-skills-vue-x" deleted
deployment.apps "jx-testx" deleted
release.jenkins.io "olli-log-management-0.0.75" deleted
==========
kubectl delete pvc --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete configmap --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
configmap "exposecontroller" deleted
configmap "skills-server-redis" deleted
configmap "skills-server-redis-health" deleted
==========
kubectl delete release --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete sa --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
serviceaccount "cleanup" deleted
serviceaccount "expose" deleted
==========
kubectl delete role --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
role.rbac.authorization.k8s.io "cleanup" deleted
role.rbac.authorization.k8s.io "expose" deleted
==========
kubectl delete rolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
rolebinding.rbac.authorization.k8s.io "cleanup" deleted
rolebinding.rbac.authorization.k8s.io "expose" deleted
==========
kubectl delete secret --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrole --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=78,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
error: upgrading helm chart '.': failed to run 'kubectl delete -f /tmp/helm-template-workdir-948611119/jx/helmHooks/env/charts/expose/templates/job.yaml --namespace jx-staging --wait' command in directory '/tmp/jx-helm-apply-142711026/env', output: 'Error from server (NotFound): error when deleting "/tmp/helm-template-workdir-948611119/jx/helmHooks/env/charts/expose/templates/job.yaml": jobs.batch "expose" not found'
xxx/environment-xxx-staging/master #79 promotion
Showing logs for build xxx/environment-xxx-staging/master #79 promotion stage meta-pipeline and container step-create-tekton-crds
? A local Jenkins X versions repository already exists, pulling the latest: Yes
running command: jx step next-version --use-git-tag-only --tag
created new version: 0.0.73 and written to file: ./VERSION
error: Have you set up a git credential helper? See https://help.github.com/articles/caching-your-github-password-in-git/
: git output: To https://github.com/xxx/environment-x-staging.git
! [rejected] v0.0.73 -> v0.0.73 (already exists)
error: failed to push some refs to 'https://github.com/xxx/environment-xxx-staging.git'
hint: Updates were rejected because the tag already exists in the remote.: failed to run 'git push origin v0.0.73' command in directory '', output: 'To https://github.com/xxx/environment-xxx-staging.git
! [rejected] v0.0.73 -> v0.0.73 (already exists)
error: failed to push some refs to 'https://github.com/xxx/environment-xxx-staging.git'
hint: Updates were rejected because the tag already exists in the remote.'
error: failed to set the version on release pipelines: failed to run '/bin/sh -c jx step next-version --use-git-tag-only --tag' command in directory '/workspace/source', output: ''
Pipeline failed on stage 'meta-pipeline' : container 'step-create-tekton-crds'. The execution of the pipeline has stopped.
xxx/environment-xxx-staging/master #80 promotion
kubectl apply --recursive -f /tmp/helm-template-workdir-154755877/jx/output/namespaces/jx-staging -l jenkins.io/chart-release=jx --namespace jx-staging --wait --validate=false
==========
deployment.extensions/jx-api-poi-v2 configured
release.jenkins.io/api-poi-v2-0.0.20 configured
service/api-poi-v2 configured
role.rbac.authorization.k8s.io/cleanup configured
rolebinding.rbac.authorization.k8s.io/cleanup configured
serviceaccount/cleanup configured
configmap/exposecontroller configured
role.rbac.authorization.k8s.io/expose configured
rolebinding.rbac.authorization.k8s.io/expose configured
serviceaccount/expose configured
deployment.extensions/jx-olli-log-management configured
ingress.extensions/jx-olli-log-management configured
issuer.certmanager.k8s.io/letsencrypt-prod configured
release.jenkins.io/olli-log-management-0.0.75 configured
service/olli-log-management configured
configmap/skills-server-redis configured
service/skills-server-redis-headless configured
configmap/skills-server-redis-health configured
statefulset.apps/skills-server-redis-master configured
service/skills-server-redis-master configured
deployment.extensions/jx-skills-server-x configured
release.jenkins.io/skills-server-x-0.0.37 configured
service/skills-server-x configured
deployment.extensions/jx-skills-vue-x configured
ingress.extensions/jx-skills-vue-x configured
release.jenkins.io/skills-vue-x-0.0.59 configured
service/skills-vue-x configured
deployment.extensions/jx-testx configured
release.jenkins.io/testx-0.0.7 configured
service/testx configured
==========
kubectl delete all --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
service "api-poi-v2" deleted
service "olli-log-management" deleted
service "skills-server-x" deleted
service "testx" deleted
statefulset.apps "skills-server-redis-master" deleted
release.jenkins.io "api-poi-v2-0.0.20" deleted
release.jenkins.io "skills-server-x-0.0.37" deleted
release.jenkins.io "skills-vue-x-0.0.59" deleted
release.jenkins.io "testx-0.0.7" deleted
==========
kubectl delete pvc --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete configmap --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete release --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete sa --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete role --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete rolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete secret --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrole --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=80,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
xxx/environment-xxx-staging/master #81 promotion
kubectl apply --recursive -f /tmp/helm-template-workdir-707178113/jx/output/namespaces/jx-staging -l jenkins.io/chart-release=jx --namespace jx-staging --wait --validate=false
==========
deployment.extensions/jx-api-poi-v2 configured
release.jenkins.io/api-poi-v2-0.0.20 configured
service/api-poi-v2 configured
role.rbac.authorization.k8s.io/cleanup configured
rolebinding.rbac.authorization.k8s.io/cleanup configured
serviceaccount/cleanup configured
configmap/exposecontroller configured
role.rbac.authorization.k8s.io/expose configured
rolebinding.rbac.authorization.k8s.io/expose configured
serviceaccount/expose configured
deployment.extensions/jx-olli-log-management configured
ingress.extensions/jx-olli-log-management configured
issuer.certmanager.k8s.io/letsencrypt-prod configured
release.jenkins.io/olli-log-management-0.0.75 configured
service/olli-log-management configured
configmap/skills-server-redis configured
service/skills-server-redis-headless configured
configmap/skills-server-redis-health configured
statefulset.apps/skills-server-redis-master configured
service/skills-server-redis-master configured
deployment.extensions/jx-skills-server-x configured
release.jenkins.io/skills-server-x-0.0.37 configured
service/skills-server-x configured
deployment.extensions/jx-skills-vue-x configured
ingress.extensions/jx-skills-vue-x configured
release.jenkins.io/skills-vue-x-0.0.59 configured
service/skills-vue-x configured
deployment.extensions/jx-testx configured
release.jenkins.io/testx-0.0.7 configured
service/testx configured
==========
kubectl delete all --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
release.jenkins.io "api-poi-v2-0.0.19" deleted
release.jenkins.io "skills-server-x-0.0.36" deleted
release.jenkins.io "skills-vue-x-0.0.57" deleted
release.jenkins.io "testx-0.0.6" deleted
==========
kubectl delete pvc --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete configmap --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete release --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete sa --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete role --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete rolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete secret --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81 --namespace jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrole --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
kubectl delete clusterrolebinding --ignore-not-found -l jenkins.io/chart-release=jx,jenkins.io/version!=81,jenkins.io/namespace=jx-staging --wait
==========
No resources found
==========
error: upgrading helm chart '.': failed to run 'kubectl delete -f /tmp/helm-template-workdir-707178113/jx/helmHooks/env/charts/expose/templates/job.yaml --namespace jx-staging --wait' command in directory '/tmp/jx-helm-apply-934058140/env', output: 'Error from server (NotFound): error when deleting "/tmp/helm-template-workdir-707178113/jx/helmHooks/env/charts/expose/templates/job.yaml": jobs.batch "expose" not found'
Oh, and if it happens to you, jx start pipeline xxx/environment-xxx-staging/master (or any action that eventually starts a pipeline) will fix the mess.
The problem was also amplified by a failing post-upgrade job. It seems the pipeline would wait for it to finally fail after several restarts. Together, this greatly increased the likelihood of the wipe-out, which made it happen several times a day.
Thanks for this research. I was already wondering what caused all my services in a namespace to vanish randomly when it was particularly busy. Definitely not an option for production workloads ;)
I think it will never be fully fixable by reordering the build steps of two concurrent pipelines, checking things, or changing conditions, because that would be quite a burden to maintain; these steps might need to change in ways we cannot fully predict.
I think we have two options here:
1. Wait for the previous pipeline to complete before starting the next one. It could take a lot of time though, and it might be unnecessary to wait if a newer build will do the same thing anyway. Still the safest option.
2. Kill the running one first, which should be fine if things are meant to be idempotent. I see this regularly with other build systems, where a build is restarted immediately once new information comes in, rendering the previous build obsolete. A side effect could be missing or inconsistent tags or artifacts, though, when things get killed before creating them. I'm not sure how bad this would be for the environment pipelines; they are meant to replace all outdated services on each run anyway.
I'd say option 1, and then 2 later where it is safe to do so, so you do not wait forever when a pipeline gets stuck.
The proposed solution to fix this issue would be to create a lock as a ConfigMap when the step helm apply begins to upgrade a chart release. The lock would be created per release, in the namespace where the release is deployed, and removed automatically at the end of the upgrade process. This ensures that every helm chart release is applied atomically.
Any concurrent upgrade attempted while the lock is active will fail immediately. This is more robust than retrying or waiting on a timer, because a pipeline that failed due to a concurrency issue can be re-triggered at any time. The same is valid when the step helm apply is executed manually.
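For illustration, a minimal fail-fast sketch of that idea in Go, assuming pre-0.18 client-go signatures (no context argument); the lockRelease helper, the lock name, and the label are hypothetical, not necessarily what the eventual jx implementation would use:

```go
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// lockRelease is a hypothetical helper: it takes a per-release lock as a
// ConfigMap and fails immediately if a concurrent upgrade already holds it.
func lockRelease(client kubernetes.Interface, namespace, release, build string) (func(), error) {
	lock := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "jx-lock-" + release, // hypothetical naming scheme
			Labels: map[string]string{"jenkins.io/kind": "helm-apply-lock"},
		},
		Data: map[string]string{"build": build},
	}
	if _, err := client.CoreV1().ConfigMaps(namespace).Create(lock); err != nil {
		if apierrors.IsAlreadyExists(err) {
			// another upgrade of this release is in flight: fail immediately
			return nil, fmt.Errorf("release %s is locked by a concurrent upgrade", release)
		}
		return nil, err
	}
	// the caller defers this to remove the lock at the end of the upgrade
	unlock := func() { _ = client.CoreV1().ConfigMaps(namespace).Delete(lock.Name, nil) }
	return unlock, nil
}
```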
relates to #5471
It will surely make everything much safer: only one helm apply at a time in each namespace.
In the future, a "wait a few minutes if I'm the latest pipeline" behavior would, however, make things smoother.
@ccojocar, how far did you get with the lock approach? Did you have anything working?
What about a custom admission controller? Based on some config, it would not let requests for specific pipelines proceed while another pipeline is still running. In our case we would use it for the environment pipelines, but we could put some API around it so that basically any pipeline can be configured to run in isolation.
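A minimal sketch of what such a validating webhook handler could look like, purely hypothetical (nothing like this exists in jx today); environmentLocked stands in for whatever check decides another pipeline is still running:

```go
package main

import (
	"encoding/json"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// environmentLocked is a hypothetical stand-in for the real check,
// e.g. looking up a lock ConfigMap or a running PipelineActivity.
func environmentLocked(namespace string) bool { return false }

// handleAdmission rejects admission requests for an environment
// namespace while another pipeline is still running there.
func handleAdmission(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
	if environmentLocked(review.Request.Namespace) {
		resp.Allowed = false
		resp.Result = &metav1.Status{Message: "another environment pipeline is still running"}
	}
	review.Response = resp
	_ = json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate", handleAdmission)
	// admission webhooks must be served over TLS; cert paths are hypothetical
	_ = http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil)
}
```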
Hi @aure-olli, have you experienced any further occurrences of the issue? Given that the effort to address the matter is quite involved, it would be good to help us understand the urgency based on how frequently you experience the problem.
Hi @deanesmith
We have temporarily solved our problem with a custom jx build, changing the selector used to clean the environment, as suggested in my first post: see the diff.
The problem was much amplified by a pre-install hook that failed and restarted over and over, so the whole process would take much longer than it should. The environment would be wiped out several times a day in those conditions! In my opinion this is a critically serious matter because:

- pre-install hooks may also fail due to bad luck (imagine the database you try to initialize is temporarily offline)

I must say that I like your solution of locking the whole kubectl apply process. However, pardon my ignorance, but I'm quite surprised it takes so long to implement. The way I would personally implement it (in step helm apply):
```
while true:
    try:
        create a jx-lock-<namespace> configmap with:
            - an owner reference to the pipeline
            - the pipeline build number
            - an empty "next" field
        break
    except already exists:
        get the configmap jx-lock-<namespace>
        if the pipeline in the owner reference is finished:
            try:
                delete the received version of the configmap
            except: pass
            continue
        if "next" is from a pipeline with a higher build number:
            fail
        try:
            update the "next" field of the configmap with our own pipeline
        except: continue
        watch the configmap and the pipeline:
            if the configmap has changed or was deleted:
                continue
            if the pipeline status has changed or was deleted:
                continue
```
Then delete the configmap once kubectl apply and kubectl delete have finished, successfully or not.
Kubernetes handles object creation and update atomically (as long as you provide the current resource version), so there is no concurrency problem.
I wouldn't mind implementing it myself. Of course there are missing details that will make it less easy, but I think it is globally straightforward.
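For what it's worth, here is a minimal Go sketch of that loop, assuming pre-0.18 client-go signatures (no context argument) and an apimachinery recent enough to support the resourceVersion delete precondition. isFinished and waitForChange are hypothetical stand-ins for the pipeline-status lookup and the watch, and a data field replaces the owner reference for brevity:

```go
import (
	"fmt"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// isFinished and waitForChange are hypothetical helpers: the first would
// look up the status of the pipeline recorded in the lock, the second
// would watch the configmap (and that pipeline) until something changes.
func isFinished(pipeline string) bool                            { return false }
func waitForChange(client kubernetes.Interface, ns, name string) {}

// acquireLock blocks until this build holds the jx-lock-<namespace>
// configmap, or fails if a newer build is already waiting for it.
func acquireLock(client kubernetes.Interface, ns, pipeline string, build int) error {
	name := "jx-lock-" + ns
	for {
		cm := &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: name},
			// the real thing could also set an owner reference to the pipeline
			Data: map[string]string{
				"pipeline": pipeline,
				"build":    strconv.Itoa(build),
				"next":     "", // no waiter yet
			},
		}
		// Try to take the lock; Create is atomic, so only one pipeline wins.
		_, err := client.CoreV1().ConfigMaps(ns).Create(cm)
		if err == nil {
			return nil // we hold the lock
		}
		if !apierrors.IsAlreadyExists(err) {
			return err
		}
		// Someone else holds it: inspect the current holder.
		held, err := client.CoreV1().ConfigMaps(ns).Get(name, metav1.GetOptions{})
		if err != nil {
			continue // lock deleted in between, retry the create
		}
		if isFinished(held.Data["pipeline"]) {
			// Stale lock left behind: delete exactly the version we read;
			// the precondition makes this a no-op if it changed meanwhile.
			rv := held.ResourceVersion
			_ = client.CoreV1().ConfigMaps(ns).Delete(name, &metav1.DeleteOptions{
				Preconditions: &metav1.Preconditions{ResourceVersion: &rv},
			})
			continue
		}
		if next, _ := strconv.Atoi(held.Data["next"]); next > build {
			return fmt.Errorf("build %d is already waiting for the lock, giving up", next)
		}
		// Register as the next waiter; the resourceVersion carried by
		// `held` makes this update an atomic compare-and-swap.
		held.Data["next"] = strconv.Itoa(build)
		if _, err := client.CoreV1().ConfigMaps(ns).Update(held); err != nil {
			continue // conflict: somebody else updated the lock first
		}
		// Wait until the lock or the holding pipeline changes, then retry.
		waitForChange(client, ns, name)
	}
}
```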
@aure-olli, we are handling other priority matters right now. Do you want to create a PR for your proposal? It seems reasonable.
I made a PR to fix the problem, using the algorithm we talked about: #6953. This is not safe for merge yet, but seems to work decently in my first tests.
Can you please let me know what you think about it?