Origin: Deleting projects leaves projects in a zombie state

Created on 16 Jan 2018 · 46 comments · Source: openshift/origin

After deleting projects via the OpenShift UI, the project is not actually deleted. Trying via the oc command generates:

Error from server (Conflict): Operation cannot be fulfilled on namespaces "istio-system": The system is ensuring all content is removed from this namespace. Upon completion, this namespace will automatically be purged by the system

Version

oc v3.7.23
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO
openshift v3.7.18
kubernetes v1.7.6+a08f5eeb62

Steps To Reproduce
  1. Create a project
  2. Deploy something on this project
  3. Delete this project via UI
  4. If it doesn't get deleted, try deleting via the oc command (see the CLI sketch below)
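
A minimal CLI sketch of these steps (the project name and image are illustrative, not from the original report):

oc new-project zombie-test
oc new-app openshift/hello-openshift
oc delete project zombie-test
oc get project zombie-test -o jsonpath='{.status.phase}'   # stays "Terminating"
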
Current Result
  • Projects stuck in a zombie (Terminating) state
Expected Result
  • Projects Deleted
Additional Information

Screenshots
(screenshot attached in the original issue)

component/service-catalog kind/bug priority/P2

Most helpful comment

@juanvallejo
Yes, some projects have been in the "Terminating" state for more than 2 days.

Projects can be listed with oc

All 46 comments

Is it only reproducible when you remove the project via the UI? In other words, would it go to a zombie state if you removed it from the CLI (oc delete project <project_name>)?

@php-coder

I checked with the oc command and the result is the same as with the UI.

Do you know what is the cause of this issue?

@gbaufake No, I don't. I hope that @juanvallejo knows or at least could know who knows :)

Meanwhile, did you check the logs? Is there anything there that could be related to the issue?

Was able to reproduce using a 3.9 client against a 3.9 cluster.
Steps I took:

# create a new project 'deleteme'
$ oc new-project deleteme
Now using project "deleteme" on server "https://127.0.0.1:8443".
...

# deploy an application on that project
$ oc new-app <path/to/app>
--> Found image d5b68e7 (3 weeks old) in image stream ...
...
--> Success
...
    Run 'oc status' to view your app.

# immediately delete project after `oc new-app` finishes running
$ oc delete project deleteme
project "deleteme" deleted

# try deleting project once more
$ oc delete project deleteme
Error from server (Conflict): Operation cannot be fulfilled on namespaces "deleteme": The system is ensuring all content is removed from this namespace.  Upon completion, this namespace will automatically be purged by the system.

# check to see if project can still be listed
$ oc projects
You have access to the following projects and can switch between them with 'oc project <projectname>':

    default
  * deleteme

Using project "deleteme" on server "https://127.0.0.1:8443".

The project is finally deleted after a minute or so, and no longer appears in the output of $ oc projects.

@gbaufake I suspect that maybe one or more resources that are created as part of deploying an application in your project are taking a bit longer than normal to be deleted (or maybe there are a lot of resources to delete in the first place). Since all resources belonging to a project must be deleted before the project itself can be deleted, the project will continue to exist until everything in it is gone.

However, since the project has already been marked for deletion (when you deleted it through the web console), attempting to delete it a second time (as in my example above) produces the (Conflict) error that you are seeing.

Can you confirm that you are no longer able to list the deleted project (through oc projects) after deleting it, and waiting a minute or two?
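
A quick way to check (a minimal sketch; the project name comes from my example above):

$ oc get project deleteme -o jsonpath='{.status.phase}{"\n"}'   # prints "Terminating" while cleanup runs
$ oc get project deleteme                                       # returns NotFound once the purge completes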

@juanvallejo
Yes, some projects have been in the "Terminating" state for more than 2 days.

Projects can be listed with oc

cc @soltysh

@gbaufake any chance you could list the resources that remain in the project while it is in the "Terminating" state? After you get the (Conflict) error message when deleting it, run oc get all on the project. (Feel free to redact anything / just post the resource kinds.)

"oc get all" returns "No resources found." for the "terminating" state projects.

@deads2k @soltysh @liggitt could this maybe be failure to delete a resource in the namespace that is not part of "all"?

> @deads2k @soltysh @liggitt could this maybe be failure to delete a resource in the namespace that is not part of "all"?

No. oc get all will not list every resource in the project.

Check the controller logs... the namespace controller will indicate the resources it could not delete
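
As a sketch (assumes a newer oc/kubectl, 3.10+/1.11+, that has api-resources; the namespace name is the one from the original report), you can also enumerate every namespaced type yourself:

for r in $(oc api-resources --verbs=list --namespaced -o name); do
    oc get "$r" -n istio-system --ignore-not-found --no-headers 2>/dev/null | sed "s|^|$r: |"
done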

@liggitt would service atomic-openshift-master-controllers status -l -f do the trick?

@gbaufake yes, that should do. In case there's nothing in the logs, you can also try increasing the log level and grepping for namespace_controller.go or namespaced_resources_deleter.go. These will come from the namespace controller @liggitt mentioned.
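
For example (a hedged sketch, assuming the controllers run under the systemd unit mentioned above):

journalctl -u atomic-openshift-master-controllers -f | grep -E 'namespace_controller.go|namespaced_resources_deleter.go'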

@ironcladlou since you're the GC expert, any ideas what might be stuck when removing a project in a 3.7 version?

The controller logs already requested should help reveal the problem.

Not the original poster, but we are having the same problem. The controller logs obtained via service atomic-openshift-master-controllers status -l -f show:

ene 17 13:46:03 master1.*****.com atomic-openshift-master-controllers[1992]: E0117 13:46:03.396421    1992 glusterfs.go:647] glusterfs: error when deleting the volume :
ene 17 13:46:03 master1.*****.com atomic-openshift-master-controllers[1992]: E0117 13:46:03.396494    1992 goroutinemap.go:166] Operation for "delete-pvc-c7db9d3a-f973-11e7-a8d9-000c29f66ce4[cba7fb1f-f973-11e7-a8d9-000c29f66ce4]" failed. No retries permitted until 2018-01-17 13:48:05.396213661 +0100 CET (durationBeforeRetry 2m2s). Error:

Some logs from service atomic-openshift-master-controllers status -l -f:

https://paste.fedoraproject.org/paste/QgJ3S1QTiGRVvhREEEnoDQ

@gbaufake if you have an API group that is unresponsive (as you do), the namespace controller cannot guarantee it has cleaned up all the resources in the namespace.

It is expected that the namespace will remain in Terminating state until the controller can ensure it has discovered and removed all the resources in that namespace.
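
A quick way to spot an unresponsive API group (a hedged check; the service-catalog API service name below is the one that shows up later in this thread):

oc get apiservices
oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml   # check the "Available" condition in status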

@liggitt Is there a way to restart a specific API group?

It's a problem with the 'Service Catalog' API group under the kube-service-catalog namespace.
Please check the state of the two pods in this namespace.
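
For example (a minimal check):

oc get pods -n kube-service-catalog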

@gbaufake

Jan 17 08:37:21  atomic-openshift-master-controllers[4416]: E0117 08:37:21.347636    4416 namespace_controller.go:148] unable to retrieve the complete list of server APIs: istio.io/v1alpha1: the server could not find the requested resource, servicecatalog.k8s.io/v1beta1: an error on the server ("Error: 'x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"service-catalog-signer\")'\nTrying to reach: 'https://172.30.231.104:443/apis/servicecatalog.k8s.io/v1beta1'") has prevented the request from succeeding

Your log shows there is a certificate problem with the ServiceCatalog API group. Please fix this issue first.

Seems like the cert issue is related to #17952. From https://bugzilla.redhat.com/show_bug.cgi?id=1525014#c14 one possible solution was to re-create the service catalog.

@soltysh Could using the workaround you mentioned lead to https://github.com/openshift/openshift-ansible/issues/6572?

After correcting the certs, I brought a new cluster up

oc version

oc v3.7.27
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://ip:8443
openshift v3.7.27
kubernetes v1.7.6+a08f5eeb62

and still faced the same problem when deleting projects.

I used @soltysh's workaround, oc delete apiservices.apiregistration.k8s.io/v1beta1.servicecatalog.k8s.io -n kube-service-catalog, then ran the service-catalog playbook again.

The only problem is the ServiceBindings, which are staying behind.

oc get servicebinding

NAME                             AGE
jenkins-persistent-7fhmj-7wg7q   1h
jenkins-persistent-dbjdt-ts8g5   21m

I also tried to delete the first ServiceBinding with --force=true:

oc delete servicebindings jenkins-persistent-7fhmj-7wg7q --force=true

servicebinding "jenkins-persistent-7fhmj-7wg7q" deleted

On the controller-manager I saw this log:

> I0128 21:21:58.854041 1 controller_binding.go:190] ServiceBinding "jenkins/jenkins-persistent-7fhmj-7wg7q": Processing
> I0128 21:21:58.854139 1 controller_binding.go:218] ServiceBinding "jenkins/jenkins-persistent-7fhmj-7wg7q": trying to bind to ServiceInstance "jenkins/jenkins-persistent-7fhmj" that has ongoing asynchronous operation
> I0128 21:21:58.854265 1 controller_binding.go:880] ServiceBinding "jenkins/jenkins-persistent-7fhmj-7wg7q": Setting condition "Ready" to False
> I0128 21:21:58.854292 1 controller_binding.go:926] ServiceBinding "jenkins/jenkins-persistent-7fhmj-7wg7q": Updating status
> I0128 21:21:58.854363 1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceBinding", Namespace:"jenkins", Name:"jenkins-persistent-7fhmj-7wg7q", UID:"325f296f-0464-11e8-ba34-0a580a820006", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"89365", FieldPath:""}): type: 'Warning' reason: 'ErrorAsyncOperationInProgress' trying to bind to ServiceInstance "jenkins/jenkins-persistent-7fhmj" that has ongoing asynchronous operation
> I0128 21:21:58.860746 1 controller.go:232] Error syncing ServiceBinding jenkins/jenkins-persistent-7fhmj-7wg7q: Ongoing Asynchronous operation

I also tried to delete the other ServiceBinding (oc delete servicebindings jenkins-persistent-dbjdt-ts8g5 --force=true) and saw a different log on the controller-manager than for the first one:

> I0128 21:24:41.659239 1 controller_binding.go:842] ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5": Deleting Secret "jenkins/jenkins-persistent-dbjdt-credentials-yyqnh"
> I0128 21:24:41.662509 1 controller_binding.go:880] ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5": Setting condition "Ready" to False
> I0128 21:24:41.662546 1 controller_binding.go:926] ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5": Updating status
> E0128 21:24:41.671371 1 controller_binding.go:929] ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5": Error updating status: ServiceBinding.servicecatalog.k8s.io "jenkins-persistent-dbjdt-ts8g5" is invalid: status.currentOperation: Forbidden: currentOperation must not be present when reconciledGeneration and generation are equal
> I0128 21:24:41.671406 1 controller.go:237] Dropping ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5" out of the queue: ServiceBinding.servicecatalog.k8s.io "jenkins-persistent-dbjdt-ts8g5" is invalid: status.currentOperation: Forbidden: currentOperation must not be present when reconciledGeneration and generation are equal

This looks like a problem that @openshift/team-service-catalog should look into

"Forbidden: currentOperation must not be present when reconciledGeneration and generation are equal" looks to be the same issue that is causing https://bugzilla.redhat.com/show_bug.cgi?id=1535902 (try to delete an instance or binding while it is being provisioned async).

I'm seeing the same thing

➜  ~ oc delete project nginx-ingress
Error from server (Conflict): Operation cannot be fulfilled on namespaces "nginx-ingress": The system is ensuring all content is removed from this namespace.  Upon completion, this namespace will automatically be purged by the system.

The project is denoted as "This project marked for deletion" in the web console.

Side question: is there a way to set a TTL on a temporary project so that everything in that project (and the project itself) is deleted after a fixed amount of time?

I'm seeing this in Minishift 3.11.
oc get all returns "No resources found."

This is still an issue on some 3.11 clusters.

It's because of the finalizer 'kubernetes' not being removed from the project:

  finalizers:
  - kubernetes
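
You can confirm this on a stuck project with something like (the project name is a placeholder):

oc get namespace <project_name> -o jsonpath='{.spec.finalizers}{"\n"}'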

I cleared up thousands of projects by following these steps:

  1. Create a list of the projects stuck in Terminating:
    oc get projects |grep Terminating |awk '{print $1}' > mylist

  2. Create and run this script to create a json file for each terminating project (while removing the kubernetes finalizer); a jq-based variant is sketched after these steps:

#!/bin/bash
filename='mylist'
while read p; do
    echo $p
    oc get project $p -o json |grep -v "kubernetes" > $p.json
done < $filename
  3. Run:
    kubectl proxy --port=8080 &

  4. Run this script to remove the finalizer from the running config:

#!/bin/bash
filename='mylist'
while read p; do
    curl -k -H "Content-Type: application/json" -X PUT --data-binary @$p.json localhost:8080/api/v1/namespaces/$p/finalize;
done < $filename
  5. Run:
    oc get projects |grep Terminating

Terminating projects should be gone.
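
A hedged variant of step 2, assuming jq is available: it empties only spec.finalizers instead of grepping out every line that contains "kubernetes", so any other field that happens to mention the word is left intact.

#!/bin/bash
filename='mylist'
while read p; do
    echo $p
    # fetch as a Namespace and blank the finalizers list only
    oc get namespace "$p" -o json | jq '.spec.finalizers = []' > "$p.json"
done < $filename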

We too got hit by this today. We were quite stumped until we found this post. ~~The solution from @pyates86 resolved it for us.~~

oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
features: Basic-Auth SSPI Kerberos SPNEGO

Server ....
openshift v3.11.117
kubernetes v1.11.0+d4cacc0

Spoke too soon... our team tried reusing that project name today and it immediately went back into the same Terminating state after it was created.

FWIW, it is almost exactly the same issue reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1507440#c45

Right down to it being a persistent Jenkins serviceinstance and reporting:

Error polling last operation: Status: 500; ErrorMessage: <nil>; Description: templateinstances.template.openshift.io "{...ID goes here...}" not found; ResponseError: <nil>

I have now read a number of reports that indicate the 'fix' from @pyates86 above will just hide the issue for you, but not resolve it.

The cleanup procedure from @pyates86 works fine with minishift v1.34.1+c2ff9cb (oc v3.11.0+0cbc58b), but you need to be cluster-admin, use oc proxy --port=8080 &, and do the following JSON replacements before running the 2nd script:

  • "kind": "Project" --> "kind": "Namespace"
  • v1 --> project.openshift.io/v1

For OCP 4.1, the following worked:
"kind": "Project" --> "kind": "Namespace"
apiVersion: "project.openshift.io/v1" --> apiVersion: "v1"

@vtlrazin Thanks, your comment helped when the original suggestion was giving me:
"the API version in the data (project.openshift.io/v1) does not match the expected API version (v1)"

Similar issue and workaround described at https://access.redhat.com/solutions/4165791.

> Similar issue and workaround described at https://access.redhat.com/solutions/4165791.

FYI, that issue is not accessible. I only have a redhat developer account. :/
I'd gladly get the solution though since it affects our 3.11 cluster as well.

Thanks!

Someone made a script to help with this, using the solution mentioned by @pyates86 above.
I forked it and modified it to remove the Authorization header since that was causing a problem for me.
https://github.com/apastel/useful-scripts/blob/master/openshift/force-delete-openshift-project

> Similar issue and workaround described at https://access.redhat.com/solutions/4165791.
> FYI, that issue is not accessible. I only have a redhat developer account. :/
> I'd gladly get the solution though since it affects our 3.11 cluster as well.

Can I know how I can get access to this link? I'm facing the same issue with one of my projects in a 3.11 cluster.

Thanks!

I am also facing the same issue.
The project is in the Terminating state:

kind: Project
apiVersion: project.openshift.io/v1
metadata:
  name: icp4iapic2
  uid: 1d33c67d-4e74-11ea-bc04-0a826dbb1b51
  resourceVersion: '7631358'
  creationTimestamp: '2020-02-13T15:18:40Z'
  deletionTimestamp: '2020-02-25T09:32:53Z'
  annotations:
    mcm.ibm.com/accountID: id-mycluster-account
    mcm.ibm.com/type: System
    openshift.io/description: ''
    openshift.io/display-name: ''
    openshift.io/requester: admin
spec:
  finalizers:
    - kubernetes
status:
  phase: Terminating

> I am also facing the same issue.
> The project is in the Terminating state.

A solution is already in this thread.

If anyone is still facing this issue, I have formalized the steps above into a shell script:
https://github.com/sarvjeetrajvansh/publiccode/blob/shell/cleanprojectopenshift.sh

Pass your namespace as an argument to the script.
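
For example (the script name comes from the link above; the namespace is illustrative):

./cleanprojectopenshift.sh my-stuck-project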

Here are the instructions from @pyates86, updated (pay attention to step 5):

This is still an issue on some 3.11 clusters.

It's because of the finalizer 'kubernetes' not being removed from the project:

finalizers:
  - kubernetes

I cleared up thousands of projects by following these steps:

  1. Create a file with the projects in the 'Terminating' state:

    oc get projects |grep Terminating |awk '{print $1}' > mylist_project_terminating

  2. Create and run this script to create a json file for each terminating project (while removing kubernetes finalizer):

    script_create_json.sh:

    #!/bin/bash

    filename='mylist_project_terminating'
    while read p; do
    echo $p
    oc get project $p -o json |grep -v "kubernetes" > $p.json
    done < $filename

  3. Run: proxy to the cluster

    kubectl proxy --port=8080 &

  4. Run this script to remove the finalizer from the running config:

    script_remove_finalizer.sh:

    #!/bin/bash

    filename='mylist_project_terminating'
    while read p; do
    curl -k -H "Content-Type: application/json" -X PUT --data-binary @$p.json localhost:8080/api/v1/namespaces/$p/finalize;
    done < $filename

  5. If it fails, check the generated .json files:
    {
    "apiVersion": "project.openshift.io/v1",
    "kind": "Project",
    ...

    Replace "project.openshift.io/v1" with "v1" in that file:
    "apiVersion": "v1",

    ... and run the script again.

  6. Run validation:
    oc get projects |grep Terminating

Terminating projects should be gone.
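
For reference, a hedged consolidation of the steps above into a single script (assumes cluster-admin rights, jq installed, and that kubectl proxy can bind to localhost:8080; fetching each object as a Namespace avoids the kind/apiVersion replacements from step 5):

#!/bin/bash
# start a local API proxy and remember its PID so it can be stopped afterwards
kubectl proxy --port=8080 &
PROXY_PID=$!
sleep 2

# for every project stuck in Terminating, clear spec.finalizers via the finalize subresource
for p in $(oc get projects | grep Terminating | awk '{print $1}'); do
    oc get namespace "$p" -o json | jq '.spec.finalizers = []' \
      | curl -s -H "Content-Type: application/json" -X PUT --data-binary @- \
          "localhost:8080/api/v1/namespaces/$p/finalize" > /dev/null
    echo "finalize requested for $p"
done

kill $PROXY_PID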

