Origin: Deleting projects leaves projects in a zombie state

Created on 16 Jan 2018 · 46 comments · Source: openshift/origin

After deleting projects via the OpenShift UI, the project is not actually deleted. Trying via the oc command generates:

Error from server (Conflict): Operation cannot be fulfilled on namespaces "istio-system": The system is ensuring all content is removed from this namespace. Upon completion, this namespace will automatically be purged by the system

Version

oc v3.7.23
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO
openshift v3.7.18
kubernetes v1.7.6+a08f5eeb62

Steps To Reproduce
  1. Create a project
  2. Deploy something on this project
  3. Delete this project via UI
  4. If it doesn't get deleted, try deleting via the oc command (see the CLI sketch below)
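
A minimal CLI sketch of these steps (the project name and image are illustrative, not from the original report):

oc new-project zombie-test
oc new-app openshift/hello-openshift
oc delete project zombie-test
oc get project zombie-test -o jsonpath='{.status.phase}'   # stays "Terminating"
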
Current Result
  • Projects stuck in a zombie (Terminating) state
Expected Result
  • Projects Deleted
Additional Information

Screenshots
(screenshot attached in the original issue)

component/service-catalog kind/bug priority/P2

Most helpful comment

@juanvallejo
Yes, some projects have been in the "Terminating" state for more than 2 days.

Projects can be listed with oc

All 46 comments

Is it only reproducible when you remove the project via the UI? In other words, would it go to a zombie state if you removed it from the CLI (oc delete project <project_name>)?

@php-coder

I checked with the oc command and the result is the same as with the UI.

Do you know what is the cause of this issue?

@gbaufake No, I don't. I hope that @juanvallejo knows or at least could know who knows :)

Meanwhile, did you check the logs? Is there anything there that could be related to the issue?

Was able to reproduce using a 3.9 client against a 3.9 cluster.
Steps I took:

# create a new project 'deleteme'
$ oc new-project deleteme
Now using project "deleteme" on server "https://127.0.0.1:8443".
...

# deploy an application on that project
$ oc new-app <path/to/app>
--> Found image d5b68e7 (3 weeks old) in image stream ...
...
--> Success
...
    Run 'oc status' to view your app.

# immediately delete project after `oc new-app` finishes running
$ oc delete project deleteme
project "deleteme" deleted

# try deleting project once more
$ oc delete project deleteme
Error from server (Conflict): Operation cannot be fulfilled on namespaces "deleteme": The system is ensuring all content is removed from this namespace.  Upon completion, this namespace will automatically be purged by the system.

# check to see if project can still be listed
$ oc projects
You have access to the following projects and can switch between them with 'oc project <projectname>':

    default
  * deleteme

Using project "deleteme" on server "https://127.0.0.1:8443".

The project is finally deleted after a minute or so, and no longer appears in the output of $ oc projects.

@gbaufake I suspect that maybe one or more resources that are created as part of deploying an application in your project are taking a bit longer than normal to be deleted (or maybe there are a lot of resources to delete in the first place). Since all resources belonging to a project must be deleted before the project itself can be deleted, the project will continue to exist until everything in it is gone.

However, since the project has already been marked for deletion (when you deleted it through the web console), attempting to delete it a second time (as in my example above) produces the (Conflict) error that you are seeing.

Can you confirm that you are no longer able to list the deleted project (through oc projects) after deleting it, and waiting a minute or two?
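
A quick way to check (a minimal sketch; the project name comes from my example above):

$ oc get project deleteme -o jsonpath='{.status.phase}{"\n"}'   # prints "Terminating" while cleanup runs
$ oc get project deleteme                                       # returns NotFound once the purge completes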

@juanvallejo
Yes, some projects have been in the "Terminating" state for more than 2 days.

Projects can be listed with oc

cc @soltysh

@gbaufake any chance you could list the resources that remain in the project while it is in the "Terminating" state? After you get the (Conflict) error message when deleting it, run oc get all on the project. (Feel free to redact anything / just post the resource kinds.)

"oc get all" returns "No resources found." for the "terminating" state projects.

@deads2k @soltysh @liggitt could this maybe be failure to delete a resource in the namespace that is not part of "all"?

> @deads2k @soltysh @liggitt could this maybe be failure to delete a resource in the namespace that is not part of "all"?

No. oc get all will not list every resource in the project.

Check the controller logs... the namespace controller will indicate the resources it could not delete
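
As a sketch (assumes a newer oc/kubectl, 3.10+/1.11+, that has api-resources; the namespace name is the one from the original report), you can also enumerate every namespaced type yourself:

for r in $(oc api-resources --verbs=list --namespaced -o name); do
    oc get "$r" -n istio-system --ignore-not-found --no-headers 2>/dev/null | sed "s|^|$r: |"
done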

@liggitt would service atomic-openshift-master-controllers status -l -f do the trick?

@gbaufake yes, that should do. In case there's nothing in the logs, you can also try increasing the log level and grepping for namespace_controller.go or namespaced_resources_deleter.go. These will come from the namespace controller @liggitt mentioned.
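
For example (a hedged sketch, assuming the controllers run under the systemd unit mentioned above):

journalctl -u atomic-openshift-master-controllers -f | grep -E 'namespace_controller.go|namespaced_resources_deleter.go'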

@ironcladlou since you're the GC expert, any ideas what might be stuck when removing a project in a 3.7 version?

The controller logs already requested should help reveal the problem.

Not the original poster, but we are having the same problem. The controller logs obtained via service atomic-openshift-master-controllers status -l -f show:

ene 17 13:46:03 master1.*****.com atomic-openshift-master-controllers[1992]: E0117 13:46:03.396421    1992 glusterfs.go:647] glusterfs: error when deleting the volume :
ene 17 13:46:03 master1.*****.com atomic-openshift-master-controllers[1992]: E0117 13:46:03.396494    1992 goroutinemap.go:166] Operation for "delete-pvc-c7db9d3a-f973-11e7-a8d9-000c29f66ce4[cba7fb1f-f973-11e7-a8d9-000c29f66ce4]" failed. No retries permitted until 2018-01-17 13:48:05.396213661 +0100 CET (durationBeforeRetry 2m2s). Error:

Some logs from service atomic-openshift-master-controllers status -l -f:

https://paste.fedoraproject.org/paste/QgJ3S1QTiGRVvhREEEnoDQ

@gbaufake if you have an API group that is unresponsive (as you do), the namespace controller cannot guarantee it has cleaned up all the resources in the namespace.

It is expected that the namespace will remain in Terminating state until the controller can ensure it has discovered and removed all the resources in that namespace.
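
A quick way to spot an unresponsive API group (a hedged check; the service-catalog API service name below is the one that shows up later in this thread):

oc get apiservices
oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml   # check the "Available" condition in status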

@liggitt Is there a way to restart a specific API group?

It's a problem with the 'Service Catalog' API group under the kube-service-catalog namespace.
Please check the state of the two pods in this namespace.
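
For example (a minimal check):

oc get pods -n kube-service-catalog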

@gbaufake

Jan 17 08:37:21  atomic-openshift-master-controllers[4416]: E0117 08:37:21.347636    4416 namespace_controller.go:148] unable to retrieve the complete list of server APIs: istio.io/v1alpha1: the server could not find the requested resource, servicecatalog.k8s.io/v1beta1: an error on the server ("Error: 'x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"service-catalog-signer\")'\nTrying to reach: 'https://172.30.231.104:443/apis/servicecatalog.k8s.io/v1beta1'") has prevented the request from succeeding

Your log shows there is a certificate problem with the ServiceCatalog API group. Please fix this issue first.

Seems like the cert issue is related to #17952. From https://bugzilla.redhat.com/show_bug.cgi?id=1525014#c14 one possible solution was to re-create the service catalog.

@soltysh Could using the workaround you mentioned lead to https://github.com/openshift/openshift-ansible/issues/6572?

After correcting the certs, I brought a new cluster up

oc version

oc v3.7.27
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://ip:8443
openshift v3.7.27
kubernetes v1.7.6+a08f5eeb62

and still faced the same problem when deleting projects.

I used @soltysh's workaround, oc delete apiservices.apiregistration.k8s.io/v1beta1.servicecatalog.k8s.io -n kube-service-catalog, then ran the service-catalog playbook again.

The only problem is the ServiceBindings, which are staying behind.

oc get servicebinding

NAME                             AGE
jenkins-persistent-7fhmj-7wg7q   1h
jenkins-persistent-dbjdt-ts8g5   21m

I also tried to delete the first ServiceBinding with --force=true:

oc delete servicebindings jenkins-persistent-7fhmj-7wg7q --force=true

servicebinding "jenkins-persistent-7fhmj-7wg7q" deleted

On the controller-manager I saw this log:

> I0128 21:21:58.854041 1 controller_binding.go:190] ServiceBinding "jenkins/jenkins-persistent-7fhmj-7wg7q": Processing
> I0128 21:21:58.854139 1 controller_binding.go:218] ServiceBinding "jenkins/jenkins-persistent-7fhmj-7wg7q": trying to bind to ServiceInstance "jenkins/jenkins-persistent-7fhmj" that has ongoing asynchronous operation
> I0128 21:21:58.854265 1 controller_binding.go:880] ServiceBinding "jenkins/jenkins-persistent-7fhmj-7wg7q": Setting condition "Ready" to False
> I0128 21:21:58.854292 1 controller_binding.go:926] ServiceBinding "jenkins/jenkins-persistent-7fhmj-7wg7q": Updating status
> I0128 21:21:58.854363 1 event.go:218] Event(v1.ObjectReference{Kind:"ServiceBinding", Namespace:"jenkins", Name:"jenkins-persistent-7fhmj-7wg7q", UID:"325f296f-0464-11e8-ba34-0a580a820006", APIVersion:"servicecatalog.k8s.io", ResourceVersion:"89365", FieldPath:""}): type: 'Warning' reason: 'ErrorAsyncOperationInProgress' trying to bind to ServiceInstance "jenkins/jenkins-persistent-7fhmj" that has ongoing asynchronous operation
> I0128 21:21:58.860746 1 controller.go:232] Error syncing ServiceBinding jenkins/jenkins-persistent-7fhmj-7wg7q: Ongoing Asynchronous operation

I also tried to delete the other ServiceBinding (oc delete servicebindings jenkins-persistent-dbjdt-ts8g5 --force=true) and saw a different log on the controller-manager than for the first one:

> I0128 21:24:41.659239 1 controller_binding.go:842] ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5": Deleting Secret "jenkins/jenkins-persistent-dbjdt-credentials-yyqnh"
> I0128 21:24:41.662509 1 controller_binding.go:880] ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5": Setting condition "Ready" to False
> I0128 21:24:41.662546 1 controller_binding.go:926] ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5": Updating status
> E0128 21:24:41.671371 1 controller_binding.go:929] ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5": Error updating status: ServiceBinding.servicecatalog.k8s.io "jenkins-persistent-dbjdt-ts8g5" is invalid: status.currentOperation: Forbidden: currentOperation must not be present when reconciledGeneration and generation are equal
> I0128 21:24:41.671406 1 controller.go:237] Dropping ServiceBinding "jenkins/jenkins-persistent-dbjdt-ts8g5" out of the queue: ServiceBinding.servicecatalog.k8s.io "jenkins-persistent-dbjdt-ts8g5" is invalid: status.currentOperation: Forbidden: currentOperation must not be present when reconciledGeneration and generation are equal

This looks like a problem that @openshift/team-service-catalog should look into

"Forbidden: currentOperation must not be present when reconciledGeneration and generation are equal" looks to be the same issue that is causing https://bugzilla.redhat.com/show_bug.cgi?id=1535902 (try to delete an instance or binding while it is being provisioned async).

I'm seeing the same thing

➜  ~ oc delete project nginx-ingress
Error from server (Conflict): Operation cannot be fulfilled on namespaces "nginx-ingress": The system is ensuring all content is removed from this namespace.  Upon completion, this namespace will automatically be purged by the system.

The project is denoted as "This project marked for deletion" in the web console.

Side question: is there a way to set a TTL on a temporary project so that everything in that project (and the project itself) is deleted after a fixed amount of time?

I'm seeing this in Minishift 3.11.
oc get all returns "No resources found."

This is still an issue on some 3.11 clusters.

It's because of the finalizer 'kubernetes' not being removed from the project:

  finalizers:
  - kubernetes
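
You can confirm this on a stuck project with something like (the project name is a placeholder):

oc get namespace <project_name> -o jsonpath='{.spec.finalizers}{"\n"}'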

I cleared up thousands of projects by following these steps:

  1. Create a list of the projects stuck in Terminating:
    oc get projects |grep Terminating |awk '{print $1}' > mylist

  2. Create and run this script to create a json file for each terminating project (while removing the kubernetes finalizer); a jq-based variant is sketched after these steps:

#!/bin/bash
filename='mylist'
while read p; do
    echo $p
    oc get project $p -o json |grep -v "kubernetes" > $p.json
done < $filename
  3. Run:
    kubectl proxy --port=8080 &

  4. Run this script to remove the finalizer from the running config:

#!/bin/bash
filename='mylist'
while read p; do
    curl -k -H "Content-Type: application/json" -X PUT --data-binary @$p.json localhost:8080/api/v1/namespaces/$p/finalize;
done < $filename
  5. Run:
    oc get projects |grep Terminating

Terminating projects should be gone.
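
A hedged variant of step 2, assuming jq is available: it empties only spec.finalizers instead of grepping out every line that contains "kubernetes", so any other field that happens to mention the word is left intact.

#!/bin/bash
filename='mylist'
while read p; do
    echo $p
    # fetch as a Namespace and blank the finalizers list only
    oc get namespace "$p" -o json | jq '.spec.finalizers = []' > "$p.json"
done < $filename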

We too got hit by this today. We were quite stumped until we found this post. ~~The solution from @pyates86 resolved it for us.~~

oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
features: Basic-Auth SSPI Kerberos SPNEGO

Server ....
openshift v3.11.117
kubernetes v1.11.0+d4cacc0

Spoke too soon... our team tried reusing that project name today and it immediately went back into the same Terminating state after it was created.

FWIW, it is almost exactly the same issue reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1507440#c45

Right down to it being a persistent Jenkins serviceinstance and reporting:

Error polling last operation: Status: 500; ErrorMessage: <nil>; Description: templateinstances.template.openshift.io "{...ID goes here...}" not found; ResponseError: <nil>

I have now read a number of reports that indicate the 'fix' from @pyates86 above will just hide the issue for you, but not resolve it.

The cleanup procedure from @pyates86 works fine with minishift v1.34.1+c2ff9cb (oc v3.11.0+0cbc58b), but you need to be cluster-admin, use oc proxy --port=8080 &, and do the following JSON replacements before running the 2nd script:

  • "kind": "Project" --> "kind": "Namespace"
  • v1 --> project.openshift.io/v1

For OCP 4.1, the following worked:
"kind": "Project" --> "kind": "Namespace"
apiVersion: "project.openshift.io/v1" --> apiVersion: "v1"

@vtlrazin Thanks, your comment helped when the original suggestion was giving me:
"the API version in the data (project.openshift.io/v1) does not match the expected API version (v1)"

Similar issue and workaround described at https://access.redhat.com/solutions/4165791.

> Similar issue and workaround described at https://access.redhat.com/solutions/4165791.

FYI, that issue is not accessible. I only have a redhat developer account. :/
I'd gladly get the solution though since it affects our 3.11 cluster as well.

Thanks!

Someone made a script to help with this, using the solution mentioned by @pyates86 above.
I forked it and modified it to remove the Authorization header since that was causing a problem for me.
https://github.com/apastel/useful-scripts/blob/master/openshift/force-delete-openshift-project

> Similar issue and workaround described at https://access.redhat.com/solutions/4165791.
> FYI, that issue is not accessible. I only have a redhat developer account. :/
> I'd gladly get the solution though since it affects our 3.11 cluster as well.

Can I know how I can get access to this link? I'm facing the same issue with one of my projects in a 3.11 cluster.

Thanks!

I am also facing the same issue.
The project is in the Terminating state:

kind: Project
apiVersion: project.openshift.io/v1
metadata:
  name: icp4iapic2
  uid: 1d33c67d-4e74-11ea-bc04-0a826dbb1b51
  resourceVersion: '7631358'
  creationTimestamp: '2020-02-13T15:18:40Z'
  deletionTimestamp: '2020-02-25T09:32:53Z'
  annotations:
    mcm.ibm.com/accountID: id-mycluster-account
    mcm.ibm.com/type: System
    openshift.io/description: ''
    openshift.io/display-name: ''
    openshift.io/requester: admin
spec:
  finalizers:
    - kubernetes
status:
  phase: Terminating

> I am also facing the same issue.
> The project is in the Terminating state.

A solution is already in this thread.

If anyone is still facing this issue, I have formalized the steps above into a shell script:
https://github.com/sarvjeetrajvansh/publiccode/blob/shell/cleanprojectopenshift.sh

Pass your namespace as an argument to the script.
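
For example (the script name comes from the link above; the namespace is illustrative):

./cleanprojectopenshift.sh my-stuck-project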

Here are the instructions from @pyates86, updated (pay attention to step 5):

This is still an issue on some 3.11 clusters.

It's because of the finalizer 'kubernetes' not being removed from the project:

finalizers:
  - kubernetes

I cleared up thousands of projects by following these steps:

  1. Create a file with the projects in the 'Terminating' state:

    oc get projects |grep Terminating |awk '{print $1}' > mylist_project_terminating

  2. Create and run this script to create a json file for each terminating project (while removing kubernetes finalizer):

    script_create_json.sh:

    #!/bin/bash

    filename='mylist_project_terminating'
    while read p; do
    echo $p
    oc get project $p -o json |grep -v "kubernetes" > $p.json
    done < $filename

  3. Run: proxy to the cluster

    kubectl proxy --port=8080 &

  4. Run this script to remove the finalizer from the running config:

    script_remove_finalizer.sh:

    #!/bin/bash

    filename='mylist_project_terminating'
    while read p; do
    curl -k -H "Content-Type: application/json" -X PUT --data-binary @$p.json localhost:8080/api/v1/namespaces/$p/finalize;
    done < $filename

  5. If it fails, check the generated .json files:
    {
    "apiVersion": "project.openshift.io/v1",
    "kind": "Project",
    ...

    Replace "project.openshift.io/v1" with "v1" in that file:
    "apiVersion": "v1",

    ... and run the script again.

  6. Run validation:
    oc get projects |grep Terminating

Terminating projects should be gone.
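
For reference, a hedged consolidation of the steps above into a single script (assumes cluster-admin rights, jq installed, and that kubectl proxy can bind to localhost:8080; fetching each object as a Namespace avoids the kind/apiVersion replacements from step 5):

#!/bin/bash
# start a local API proxy and remember its PID so it can be stopped afterwards
kubectl proxy --port=8080 &
PROXY_PID=$!
sleep 2

# for every project stuck in Terminating, clear spec.finalizers via the finalize subresource
for p in $(oc get projects | grep Terminating | awk '{print $1}'); do
    oc get namespace "$p" -o json | jq '.spec.finalizers = []' \
      | curl -s -H "Content-Type: application/json" -X PUT --data-binary @- \
          "localhost:8080/api/v1/namespaces/$p/finalize" > /dev/null
    echo "finalize requested for $p"
done

kill $PROXY_PID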

