Serving: ClusterIngress objects are sometimes not deleted

Created on 28 Nov 2018 · 11 Comments · Source: knative/serving

Expected Behavior

ClusterIngress objects are deleted when the associated Route object is.

Actual Behavior

Sometimes the ClusterIngress object is not deleted and is left in an orphaned state.

Steps to Reproduce the Problem

Create 200 Knative Services
Wait a while for them all to become ready and then scale to 0
Delete them all at once: kubectl delete ksvc --all

Additional Info

This seems to happen only when the k8s garbage collector falls behind. After deleting 200 Knative Services, it takes a few minutes for k8s to garbage collect everything.
Knative relies on some undocumented Kubernetes behavior here: the ownerReference of the ClusterIngress object (cluster-scoped) points to a Route object (namespace-scoped).
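
To illustrate why the cluster-scoped to namespace-scoped ownership is fragile, here is a minimal Go sketch. It assumes simplified, hypothetical types (`OwnerRef`, `Object`, `OwnerNamespace`) rather than the real client-go ones, and the names and UID are made up. The one property it borrows from the real `metav1.OwnerReference` is that the reference carries no namespace field, so a cluster-scoped dependent gives no hint about where its namespaced owner lives.

```go
package main

import "fmt"

// OwnerRef is a simplified, hypothetical stand-in for metav1.OwnerReference.
// Like the real type, it has no Namespace field.
type OwnerRef struct {
	APIVersion string
	Kind       string
	Name       string
	UID        string
}

// Object is a hypothetical minimal view of a Kubernetes object's metadata.
type Object struct {
	Kind      string
	Namespace string // empty for cluster-scoped objects
	Name      string
	Owners    []OwnerRef
}

// OwnerNamespace returns the namespace in which an owner would be looked up,
// assuming owners share the dependent's namespace. ok is false when the
// dependent is cluster-scoped, because no namespace can be derived from it.
func OwnerNamespace(dependent Object) (ns string, ok bool) {
	if dependent.Namespace == "" {
		return "", false
	}
	return dependent.Namespace, true
}

func main() {
	// Illustrative values only; not taken from a real cluster.
	ci := Object{
		Kind: "ClusterIngress",
		Name: "route-abc",
		Owners: []OwnerRef{{
			APIVersion: "serving.knative.dev/v1alpha1",
			Kind:       "Route",
			Name:       "hello",
			UID:        "1234",
		}},
	}
	if _, ok := OwnerNamespace(ci); !ok {
		owner := ci.Owners[0]
		fmt.Printf("cannot tell which namespace %s/%s lives in\n", owner.Kind, owner.Name)
	}
}
```

This only models the missing information; it does not reproduce how the Kubernetes garbage collector actually resolves owners.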

area/API area/networking kind/bug

Source

bradhoekstra

All 11 comments

/remove-area API
/remove-area autoscale
/remove-area build
/remove-area monitoring
/remove-area test-and-release
/remove-kind question
/remove-kind doc
/remove-kind feature
/remove-kind good-first-issue
/remove-kind process
/remove-kind spec

bradhoekstra on 28 Nov 2018

/remove-kind cleanup

bradhoekstra on 28 Nov 2018

cc @dprotaso

mattmoor on 4 Dec 2018

/area api

tcnghia on 25 Jan 2019

We could handle this via a finalizer on Route. Anything less feels like it would need a separate controller that's tantamount to implementing what we're expecting from K8s' GC today.

I think the flow would go something like:

Create the ClusterIngress
If the ClusterIngress exists, then add the Finalizer to our Route's metadata list.
When a Route has been marked for deletion with our Finalizer, we will delete the ClusterIngress and remove the Finalizer.

mattmoor on 29 Jan 2019
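
The finalizer flow proposed in the comment above could be sketched roughly as follows. This is an in-memory model, not Knative code: `Route`, `Cluster`, `ReconcileRoute`, the finalizer name, and the `-ingress` naming are all hypothetical, and the `Deleting` flag stands in for a set `metadata.deletionTimestamp`.

```go
package main

import "fmt"

// routeFinalizer is a hypothetical finalizer name used only in this sketch.
const routeFinalizer = "example.dev/clusteringress-cleanup"

// Route is a hypothetical, trimmed-down model of a Route's metadata.
type Route struct {
	Name       string
	Deleting   bool // stands in for a non-nil metadata.deletionTimestamp
	Finalizers []string
}

// Cluster is an in-memory stand-in for the API server. It only tracks
// which ClusterIngress names exist.
type Cluster struct {
	Ingresses map[string]bool
}

// hasFinalizer reports whether the Route carries our finalizer.
func hasFinalizer(r *Route) bool {
	for _, f := range r.Finalizers {
		if f == routeFinalizer {
			return true
		}
	}
	return false
}

// removeFinalizer drops our finalizer and keeps any others.
func removeFinalizer(r *Route) {
	kept := r.Finalizers[:0]
	for _, f := range r.Finalizers {
		if f != routeFinalizer {
			kept = append(kept, f)
		}
	}
	r.Finalizers = kept
}

// ReconcileRoute follows the three steps: create the ClusterIngress, add
// the finalizer once it exists, and on deletion remove the ClusterIngress
// before releasing the finalizer.
func ReconcileRoute(c *Cluster, r *Route) {
	ingress := r.Name + "-ingress" // hypothetical naming scheme
	if r.Deleting {
		if hasFinalizer(r) {
			delete(c.Ingresses, ingress)
			removeFinalizer(r)
		}
		return
	}
	c.Ingresses[ingress] = true
	if !hasFinalizer(r) {
		r.Finalizers = append(r.Finalizers, routeFinalizer)
	}
}

func main() {
	c := &Cluster{Ingresses: map[string]bool{}}
	r := &Route{Name: "hello"}

	ReconcileRoute(c, r)
	fmt.Println(len(c.Ingresses), r.Finalizers)

	r.Deleting = true
	ReconcileRoute(c, r)
	fmt.Println(len(c.Ingresses), r.Finalizers)
}
```

In a real controller the delete and the finalizer removal are separate API calls that can each fail, so the reconcile would need to be retried until both succeed.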

Hmm, I have a simple PoC working, which seems to do the right thing on a simple example, but the e2e tests still leave around a buttload of ClusterIngress resources. 🤦‍♂️

I wonder if I'm hitting some strange interaction between finalizers and delete propagation like @vaikas-google hit a while back?

mattmoor on 29 Jan 2019

Do you have a pointer to the PoC?

vaikas on 29 Jan 2019

Not pushed. We talked offline, and found my problem. Basically, the problem is that our controllers don't deal well with finalizers in general. My change made Route deal with this, but the reason things aren't going away is that the ClusterIngress controller is racing to recreate resources as Kubernetes is GCing them.

Will keep experimenting after I get out of meetings.

mattmoor on 29 Jan 2019
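
The race described in the comment above, where a controller recreates child resources while the garbage collector is removing them, is commonly avoided by skipping creation once the parent is marked for deletion. The thread does not say this is the fix that was applied; the sketch below is only an illustration of that guard, with hypothetical types (`Ingress`, `Children`, `Reconcile`) and a `Deleting` flag standing in for a set `metadata.deletionTimestamp`.

```go
package main

import "fmt"

// Ingress is a hypothetical, minimal model of a parent resource.
type Ingress struct {
	Name     string
	Deleting bool // stands in for metadata.deletionTimestamp being set
}

// Children is an in-memory stand-in for the child resources in the cluster.
type Children struct {
	Names map[string]bool
}

// Reconcile ensures the child exists unless the parent is being deleted.
// It returns true only when it created a child.
func Reconcile(parent Ingress, kids *Children) bool {
	if parent.Deleting {
		// Do not recreate anything while the GC is cleaning up.
		return false
	}
	child := parent.Name + "-child" // hypothetical naming scheme
	if kids.Names[child] {
		return false
	}
	kids.Names[child] = true
	return true
}

func main() {
	kids := &Children{Names: map[string]bool{}}
	parent := Ingress{Name: "route-abc"}

	fmt.Println("created:", Reconcile(parent, kids))

	// Simulate deletion: the parent is marked and the GC removes the child.
	parent.Deleting = true
	delete(kids.Names, "route-abc-child")

	// Without the guard, this call would recreate the child.
	fmt.Println("created:", Reconcile(parent, kids))
}
```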

Success! Now to fix all the unit tests it breaks 🙄

mattmoor on 29 Jan 2019

WOOHOO!!!