Serving: ClusterIngress objects are sometimes not deleted

Created on 28 Nov 2018  ·  11 Comments  ·  Source: knative/serving

Expected Behavior

ClusterIngress objects are deleted when the associated Route object is.

Actual Behavior

Sometimes the ClusterIngress object is not deleted and is left in an orphaned state.

Steps to Reproduce the Problem

  1. Create 200 Knative Services
  2. Wait a while for them all to become ready and then scale to 0
  3. Delete them all at once: kubectl delete ksvc --all

Additional Info

  • This seems to happen only when the k8s garbage collector falls behind. After deleting 200 Knative Services, it takes a few minutes for k8s to garbage collect everything.
  • Knative relies on some undocumented Kubernetes behavior here: the ownerReference of the ClusterIngress object (cluster scoped) is a Route object (namespace scoped).
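
To make the scoping concern in the last bullet concrete, here is a minimal Go sketch. It assumes the rule that, as I understand it, later Kubernetes documentation spells out: a namespaced owner is only supported for dependents in the same namespace, so a cluster-scoped dependent needs a cluster-scoped owner. The `object` type and `ownerRefSupported` function are hypothetical stand-ins, not Kubernetes or Knative APIs.

```go
package main

import "fmt"

// object is a hypothetical, stripped-down stand-in for Kubernetes object
// metadata. An empty Namespace means the object is cluster scoped.
type object struct {
	Kind      string
	Name      string
	Namespace string
}

// ownerRefSupported reports whether a dependent -> owner reference follows
// the assumed scoping rule: cluster-scoped owners work for any dependent,
// while a namespaced owner only works for dependents in its own namespace.
func ownerRefSupported(dependent, owner object) bool {
	if owner.Namespace == "" {
		return true
	}
	return dependent.Namespace == owner.Namespace
}

func main() {
	route := object{Kind: "Route", Name: "hello", Namespace: "default"}
	ingress := object{Kind: "ClusterIngress", Name: "hello-ingress"}

	// The pairing described in this issue: cluster-scoped dependent,
	// namespaced owner.
	fmt.Println(ownerRefSupported(ingress, route)) // prints "false"
}
```

Under that assumption, the Route to ClusterIngress reference falls outside the supported combinations, which is consistent with the garbage collector not reliably cleaning it up.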
Labels: area/API, area/networking, kind/bug

All 11 comments

/remove-area API
/remove-area autoscale
/remove-area build
/remove-area monitoring
/remove-area test-and-release
/remove-kind question
/remove-kind doc
/remove-kind feature
/remove-kind good-first-issue
/remove-kind process
/remove-kind spec

/remove-kind cleanup

cc @dprotaso

/area api

We could handle this via a finalizer on Route. Anything less feels like it would need a separate controller, which is tantamount to implementing what we're expecting from K8s' GC today.

I think the flow would go something like:

  1. Create the ClusterIngress
  2. If the ClusterIngress exists, then add the Finalizer to our Route's metadata.finalizers list.
  3. When a Route has been marked for deletion with our Finalizer, we will delete the ClusterIngress and remove the Finalizer.
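
The three steps above can be sketched roughly as follows. This is a self-contained Go model, not the actual Knative reconciler: the `route` and `cluster` types, the `Deleting` flag (standing in for a non-nil `metadata.deletionTimestamp`), the finalizer name, and the choice to key each ClusterIngress by Route name are all assumptions made for illustration.

```go
package main

import "fmt"

// ingressFinalizer is a hypothetical finalizer name.
const ingressFinalizer = "routes.example.dev/clusteringress"

// route is a hypothetical stand-in for a Route object.
type route struct {
	Name       string
	Deleting   bool // stands in for a non-nil metadata.deletionTimestamp
	Finalizers []string
}

// cluster is a hypothetical in-memory stand-in for the API server, tracking
// which ClusterIngress objects exist, keyed by the owning Route's name.
type cluster struct {
	ingresses map[string]bool
}

func hasFinalizer(r *route) bool {
	for _, f := range r.Finalizers {
		if f == ingressFinalizer {
			return true
		}
	}
	return false
}

func removeFinalizer(r *route) {
	kept := r.Finalizers[:0]
	for _, f := range r.Finalizers {
		if f != ingressFinalizer {
			kept = append(kept, f)
		}
	}
	r.Finalizers = kept
}

// reconcile applies the proposed flow to a single Route.
func reconcile(c *cluster, r *route) {
	if r.Deleting {
		// Step 3: clean up the ClusterIngress, then release the Route.
		if hasFinalizer(r) {
			delete(c.ingresses, r.Name)
			removeFinalizer(r)
		}
		return
	}
	// Step 1: create the ClusterIngress (idempotent in this model).
	c.ingresses[r.Name] = true
	// Step 2: only once the ClusterIngress exists, add the finalizer.
	if c.ingresses[r.Name] && !hasFinalizer(r) {
		r.Finalizers = append(r.Finalizers, ingressFinalizer)
	}
}

func main() {
	c := &cluster{ingresses: map[string]bool{}}
	r := &route{Name: "hello"}

	reconcile(c, r)
	fmt.Println("ingress exists:", c.ingresses[r.Name], "finalizers:", r.Finalizers)

	r.Deleting = true // the user deleted the Route
	reconcile(c, r)
	fmt.Println("ingress exists:", c.ingresses[r.Name], "finalizers:", r.Finalizers)
}
```

The point of the ordering is that the Route cannot disappear while the finalizer is present, so the controller always gets a chance to delete the ClusterIngress itself rather than depending on the garbage collector.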

Hmm, I have a simple PoC working, which seems to do the right thing on a simple example, but the e2e tests still leave around a buttload of ClusterIngress resources. 🤦‍♂️

I wonder if I'm hitting some strange interaction between finalizers and delete propagation like @vaikas-google hit a while back?

Do you have a pointer to the PoC?

Not pushed. We talked offline and found my problem. Basically, our controllers don't deal well with finalizers in general. My change made Route deal with this, but the reason things aren't going away is that the ClusterIngress controller is racing to recreate resources as the Kubernetes garbage collector is deleting them.
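
The thread does not say how the race was resolved. One common guard against this kind of recreate-while-deleting race, shown here purely as an assumption, is to have the reconciler skip child creation once the parent is marked for deletion. The `parent` type and `reconcileChildren` function below are hypothetical and are not Knative code.

```go
package main

import "fmt"

// parent is a hypothetical stand-in for an object that owns child resources.
type parent struct {
	Name     string
	Deleting bool // stands in for a non-nil metadata.deletionTimestamp
}

// reconcileChildren recreates a missing child unless the parent is being
// deleted, in which case it leaves cleanup to the deletion path.
func reconcileChildren(p parent, children map[string]bool) {
	if p.Deleting {
		return // do not fight the garbage collector
	}
	if !children[p.Name] {
		children[p.Name] = true
	}
}

func main() {
	children := map[string]bool{}

	reconcileChildren(parent{Name: "hello"}, children)
	fmt.Println("child exists:", children["hello"])

	// Simulate the garbage collector removing the child while the parent
	// is marked for deletion; the reconciler must not recreate it.
	delete(children, "hello")
	reconcileChildren(parent{Name: "hello", Deleting: true}, children)
	fmt.Println("child exists:", children["hello"])
}
```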

Will keep experimenting after I get out of meetings.

Success! Now to fix all the unit tests it breaks 🙄

WOOHOO!!!

On Tue, Jan 29, 2019 at 2:24 PM Matt Moore notifications@github.com wrote:

Success! Now to fix all the unit tests it breaks 🙄

/assign
