I appreciate the Telemetry/Metrics APIs in Orleans, and I'm excited about the idea of sending this performance data to an APM where our team can analyze it all to diagnose problems and optimize performance.
What I don't see is a way for the system we're building to become aware of these important metrics. Software systems are increasingly becoming self-aware, monitoring various aspects of their own performance and using that data to make adjustments: scaling up by starting new VMs and adding them to the cluster, adjusting throttling limits for some types of operations, or unfolding more space to be filled with additional trading posts and space pirates.
Yes, it's true we could track these metrics ourselves within grain state. But why wouldn't we be able to reuse the existing metrics that we're already reporting through the Orleans Telemetry API?
My proposal is to add a mechanism that tracks metrics and their current values (based on calls to the Logging/Telemetry API) instead of just "passing them through", and that gives grains or Orleans clients efficient access to those current values when they ask. (And beyond current values, perhaps also a stream that grains could subscribe to and react to, Rx style.)
At a minimum, the tracking of these metrics should survive the failure of any single silo, but need not be durable in the event of a complete cluster restart. However, @galvesribeiro mentioned that making it durable across restarts (using the configured Orleans durable storage provider) is one easy way to make it survive silo failures.
One approach is to use a new system grain type, called Metrics or TelemetryMetrics and let Orleans manage it along with all the others. One potential disadvantage of accessing metrics in this way is the asynchronous, grain-method-queued nature of the interaction. (Still, this might be the fastest prototyping opportunity.)
Another approach would be to update the base Grain class to include an indexed Metrics collection (think IDictionary<string, double>), which would provide a safe call into some concurrent dictionary (or whatever makes more sense here; some distributed hash structure?).
Since the APM is custom on my current project, I'll be building a custom TelemetryConsumer to pass the telemetry along. And because Orleans supports multiple telemetry consumers, it seems to make sense to write another, separate TelemetryConsumer to keep track of all the calls for updating metrics. Rather than passing the data along to an external APM as a consumer normally would, this one would track the current metric values: when IncrementMetric, DecrementMetric, etc. is called, it would update its cached value for the specified metric. Calls to TrackEvent could increment a counter metric.
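To make the caching idea concrete, here is a minimal, language-neutral sketch in Python. It is not the Orleans API: the class name `MetricsTrackingConsumer`, the snake_case method names, and the `get_many` helper are all hypothetical and only mirror the calls mentioned above (IncrementMetric, DecrementMetric, TrackEvent). A real consumer would implement the Orleans telemetry consumer interfaces instead.

```python
import threading
from collections import defaultdict


class MetricsTrackingConsumer:
    """Hypothetical consumer that caches current metric values
    instead of forwarding them to an external APM."""

    def __init__(self):
        # One lock guards the cache, since telemetry calls may come from many threads.
        self._lock = threading.Lock()
        self._values = defaultdict(float)

    def track_metric(self, name, value):
        # Overwrite the cached value with the reported one.
        with self._lock:
            self._values[name] = float(value)

    def increment_metric(self, name, amount=1.0):
        with self._lock:
            self._values[name] += amount

    def decrement_metric(self, name, amount=1.0):
        with self._lock:
            self._values[name] -= amount

    def track_event(self, name):
        # Assumption: each event simply bumps a counter metric with the same name.
        self.increment_metric(name)

    def get_many(self, names):
        # Return several current values in one call; unknown metrics are omitted.
        with self._lock:
            return {n: self._values[n] for n in names if n in self._values}
```

Returning several values from one call (as `get_many` does here) is one way to avoid a round trip per metric when a client or another grain is asking.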
The other piece, of which I'm less certain, is how best to expose those metrics to any grain or grain client that asks for them. (Plus it'd be nice to be able to ask for more than one metric at once, especially from the client or if a grain has to call another grain to get metric values.)
Thoughts?
So far, the most promising mechanism I've come up with for exposing those metrics is to use virtual streams. My understanding is that they efficiently short-circuit and do nothing when a virtual stream has no subscribers, rather than always serializing a message, enqueueing the message, invoking a method on another grain, etc.
A virtual stream could be created for every metric defined by the system (at some point), and the set could be extended with user metrics. @jdom makes the excellent point that this needs to be rate-limited, such that (for example) metric value updates are published no more often than once every x seconds, unless the metric value changes by more than some minimum threshold y (specific to each metric). The time resolution of updates, and the impact this will have on total system performance, will vary widely from one implementation to the next, and so this should be configurable. ("Just remember: it's your foot [that you could shoot with this tool].")
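A rough sketch of that rate limiting, in Python for illustration only. The class `ThrottledPublisher` and its parameters are hypothetical, and it assumes one particular reading of the rule: a changed value is published if at least `min_interval` seconds have passed since the last publish, or if it differs from the last published value by more than `min_delta`. The `publish` callback stands in for whatever pushes onto the stream.

```python
import time


class ThrottledPublisher:
    """Hypothetical per-metric throttle: publish on a time interval (x)
    or on a large enough change (y), whichever comes first."""

    def __init__(self, publish, min_interval, min_delta, clock=time.monotonic):
        self._publish = publish            # callback that sends a value to subscribers
        self._min_interval = min_interval  # x: seconds between routine publishes
        self._min_delta = min_delta        # y: change that forces an early publish
        self._clock = clock                # injectable so the throttle can be tested
        self._last_value = None
        self._last_time = None

    def update(self, value):
        """Record a new value; return True if it was published."""
        now = self._clock()
        if self._last_time is None:
            should_publish = True  # always publish the first value
        elif value == self._last_value:
            should_publish = False  # nothing new to report
        else:
            due = (now - self._last_time) >= self._min_interval
            big_change = abs(value - self._last_value) > self._min_delta
            should_publish = due or big_change
        if should_publish:
            self._publish(value)
            self._last_value = value
            self._last_time = now
        return should_publish
```

Both x and y are constructor arguments here so that each metric can be configured separately.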
Temporal aggregation (tumbling or sliding time windows) can be used to find 10-second or 1-minute moving averages (etc.) and to generate additional events (even posting to a second level of derivative metrics) on which larger system reactions can be based.
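For illustration, here is a small Python sketch of both window styles. The names `SlidingWindowAverage` and `tumbling_averages` are hypothetical, timestamps are plain numbers of seconds, and samples are assumed to arrive in time order for the sliding case.

```python
from collections import deque


class SlidingWindowAverage:
    """Moving average over the samples seen in the last `window_seconds`."""

    def __init__(self, window_seconds):
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self._window = window_seconds
        self._samples = deque()  # (timestamp, value) pairs, oldest first
        self._sum = 0.0

    def add(self, timestamp, value):
        """Add a sample and return the average over the current window."""
        self._samples.append((timestamp, value))
        self._sum += value
        cutoff = timestamp - self._window
        # Drop samples that have fallen out of the window.
        while self._samples[0][0] <= cutoff:
            _, old = self._samples.popleft()
            self._sum -= old
        return self._sum / len(self._samples)


def tumbling_averages(samples, window_seconds):
    """Group (timestamp, value) pairs into fixed, non-overlapping windows
    and return {window_start: mean_value}."""
    buckets = {}
    for ts, value in samples:
        start = (ts // window_seconds) * window_seconds
        total, count = buckets.get(start, (0.0, 0))
        buckets[start] = (total + value, count + 1)
    return {start: total / count for start, (total, count) in buckets.items()}
```

The average returned by either function could itself be reported as a derived metric, which is one way to get the second level mentioned above.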
Looks like this has been implemented. Thank you, @danvanderboom!
Completed as a TelemetryConsumer extension to Orleans.
Orleans.TelemetryConsumers.MetricsTracker