We see that the size of PubSubSubscriptionState keeps growing over time as it accumulates consumers with state=1 (SubscriptionStates.Faulted). It looks like once a consumer is marked as faulted, it is never returned by GetAllSubscriptionHandles, so it is not possible to unsubscribe (i.e. remove that consumer from the pubsub state).
What's the best practice here? Can we remove those faulted consumers other than by tampering with the stored PubSubSubscriptionState manually?
Just to clear the waters a little bit: the Faulted state is on a subscription, not a consumer. There is no concept of a Faulted state on a consumer. A consumer can have multiple subscriptions.
I don't think you can currently remove those faulted subscriptions in an elegant way.
You can explicitly remove a healthy subscription from pubsub using IStreamSubscriptionManager.RemoveSubscription. But I don't think you can remove a faulted subscription.
I wonder why faulted subscriptions weren't cleaned up at the time they were marked Faulted. What purpose are they kept around for? Maybe @jason-bragg knows more about the context there.
@xiazen
> Just to clear the water a little bit, Fault state is on a subscription, not a consumer. There's not a concept of Fault state on a consumer. A consumer can have multiple subscriptions.

Yes, thanks for the correction.
The whole thing with faulted subscriptions is unclear to me. It looks like the behavior is different for implicit and explicit subscriptions. Correct me if I am wrong, but explicit subscriptions, once they are marked as faulted, never recover, because they are ignored by most methods (retrieving the subscription list, etc.). Implicit subscriptions that were previously faulted, on the other hand, are used again to deliver messages after a silo restart. If that is correct, does it mean that implicit subscriptions are more fault-tolerant and the preferable way to subscribe to streams?
Reading this, I have a question as well.
What are the exact conditions under which a subscription is considered faulty and the state change is triggered?
IIRC Sergey said that if a subscription cannot receive a message, it is considered faulty after some retries. But that was for clients, where this definition makes sense. For grains that subscribe to streams, it should not go to that state.
Chances are high that my assumptions overlook some facts, since I have no experience building streaming systems.
A faulted subscription occurs when a delivery failure occurs on an explicitly subscribed stream and the stream failure handler's ShouldFaultSubsriptionOnError returns true.
Implicit subscriptions cannot be faulted.
Faulted subscriptions are no longer valid and no more events will be delivered on them. If a grain wishes to consume a stream with a faulted subscription, it will need to re-subscribe to the stream (as opposed to resume processing on the faulted subscription handle).
This behavior was introduced for environments where ordered processing of events was critical, and a failure to process an event required special recovery logic. The faulted subscription prevented the current consumer from continuing processing the stream, so it would not interfere with the recovery logic.
This was a specific recovery pattern that should not be the default, but looking at the code, it looks like faulting the subscription is currently on by default, which is wrong and I'll fix. In the interim, to work around this behavior you can provide your own IStreamFailureHandler with a ShouldFaultSubsriptionOnError that returns false, or re-subscribe to the stream if the current subscription is faulted.
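The faulting rule described above can be sketched as a small toy model. This is Python, not Orleans code: `FailureHandler`, `Subscription`, and `deliver` are hypothetical names invented for illustration, and the model simply skips a failed event instead of reproducing Orleans' real retry behavior. It only shows how the value of the handler's flag decides whether a delivery failure stops all further delivery on that subscription.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailureHandler:
    # Hypothetical stand-in for the ShouldFaultSubsriptionOnError flag
    # on IStreamFailureHandler mentioned in the thread.
    should_fault_subscription_on_error: bool


@dataclass
class Subscription:
    on_next: Callable[[int], None]
    faulted: bool = False
    delivered: List[int] = field(default_factory=list)


def deliver(sub: Subscription, events: List[int], handler: FailureHandler) -> None:
    """Deliver events in order; a faulted subscription receives nothing further."""
    for event in events:
        if sub.faulted:
            break
        try:
            sub.on_next(event)
            sub.delivered.append(event)
        except Exception:
            # Delivery failure: the handler decides whether to fault the subscription.
            if handler.should_fault_subscription_on_error:
                sub.faulted = True


def fails_on_two(event: int) -> None:
    # Toy consumer that cannot process event 2.
    if event == 2:
        raise RuntimeError("consumer could not process event 2")


events = [1, 2, 3]

faulting = Subscription(on_next=fails_on_two)
deliver(faulting, events, FailureHandler(should_fault_subscription_on_error=True))

tolerant = Subscription(on_next=fails_on_two)
deliver(tolerant, events, FailureHandler(should_fault_subscription_on_error=False))

print(faulting.faulted, faulting.delivered)  # True [1]
print(tolerant.faulted, tolerant.delivered)  # False [1, 3]
```

In this model the only way to consume again after `faulting` is faulted is to create a new `Subscription`, which mirrors the re-subscribe workaround; the faulted object itself never resumes.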
@jason-bragg Thanks for the explanation. Still, the original question remains unanswered: why do we store faulted subscriptions in the pubsub grain state, and how can we get rid of them programmatically?
@DixonDs, Yes, I apologize for getting caught up in the thread rather than addressing the original concern. :/
We store them so that the fact that a subscription has been faulted survives shutdowns and/or failures. As stated, it was intended for a specific recovery scenario, which did not involve continued re-subscription, so we didn't anticipate high numbers of faulted subscriptions. I do see how this can be a problem, especially given the relatively limited space allowed for subscription information in the pubsub system.
Unfortunately, we don't yet have the ability to programmatically remove subscriptions. Our 1.5 release will introduce this capability.
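The growth reported at the top of this thread can be illustrated with a toy model of the pubsub state. This is a Python sketch, not the Orleans implementation: `ToyPubSubState` and its methods are hypothetical, `FAULTED = 1` follows the state=1 value quoted in the original report, and `ACTIVE = 0` is an assumption of the sketch. It assumes, as described above, that faulted entries stay in the stored state but are not returned when listing subscription handles.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List

ACTIVE = 0   # assumed value, for illustration only
FAULTED = 1  # state=1 as quoted in the original report


@dataclass
class Entry:
    consumer: str
    state: int = ACTIVE


class ToyPubSubState:
    """Hypothetical in-memory model of persisted subscription entries."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.entries: Dict[int, Entry] = {}

    def subscribe(self, consumer: str) -> int:
        sub_id = next(self._ids)
        self.entries[sub_id] = Entry(consumer)
        return sub_id

    def fault(self, sub_id: int) -> None:
        # The entry is kept; only its state changes.
        self.entries[sub_id].state = FAULTED

    def get_all_subscription_handles(self, consumer: str) -> List[int]:
        # Faulted entries are filtered out, so callers never see them.
        return [
            sub_id
            for sub_id, entry in self.entries.items()
            if entry.consumer == consumer and entry.state == ACTIVE
        ]

    def unsubscribe(self, sub_id: int) -> None:
        del self.entries[sub_id]


state = ToyPubSubState()

# The consumer subscribes, gets faulted, and re-subscribes three times over.
for _ in range(3):
    state.fault(state.subscribe("grain-A"))
state.subscribe("grain-A")  # the one subscription that is still healthy

visible = state.get_all_subscription_handles("grain-A")
print(visible)             # [4]
print(len(state.entries))  # 4

# Unsubscribing every visible handle still leaves the faulted entries behind.
for handle in visible:
    state.unsubscribe(handle)
print(len(state.entries))  # 3
```

Under these assumptions, each fault-and-re-subscribe cycle adds one entry that no caller can reach through the handle list, which is why the stored state only grows.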
Fixed in 1.5