@greg-szabo
Edit (2018-07-01):
It would be nice if the sdk and the hub shipped their own metrics to prometheus.
AFAIK this just means upgrading to the new tendermint with prometheus support, though maybe we want to include metrics at the SDK level too.
I'm a bit confused. Are we simply stating that we need to upgrade the version of Tendermint in the SDK, or do we also want to expose additional, separate SDK metrics (e.g. total gets, total puts, request metrics, validator stats, etc.)?
I think the latter couldn't hurt.
I think the request is to update to the latest tendermint.
But I'm also not sure prometheus is the correct tool for tracking info about the state machine. It would have to persist data and stay synced with the blockchain properly. More likely we should keep it focused on information about the running process, rather than getting involved with the SDK state machine. Though it could be used to track reads/writes to the underlying db and maybe latency spent in AVL store access. @xla does that sound right?
Getting metrics on the db/avl access sounds pretty useful, so let's leave this open for that.
Ok cool, so I'll boil this down to:
Correct?
Generally, anything that requires state to be kept is not a good candidate for prometheus. As pointed out, what fits well is any quantitative information, e.g. number of operations, errors, latencies, and beyond that a dimensional breakdown with tags, e.g. operation type, endpoint, error type. There is also some information that fits well into gauges and could provide value beyond operational insight. An interesting exercise would be to actually compile a list of potential metrics and see whether they would work with the prometheus model and whether they are feasible to track.
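To make the db access idea concrete, here is a rough sketch of counting reads/writes and accumulating latency per operation label. It uses only the Go standard library rather than the prometheus client, and `KVStore`, `memStore` and `InstrumentedStore` are hypothetical names made up for illustration, not the SDK's actual store API. The values only ever increase and are not persisted, which is the kind of process-level information that should map onto counters with an operation tag.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// KVStore is a hypothetical minimal store interface (not the SDK's real API).
type KVStore interface {
	Get(key string) ([]byte, bool)
	Set(key string, value []byte)
}

// memStore is an in-memory stand-in for the underlying db.
type memStore struct{ data map[string][]byte }

func newMemStore() *memStore { return &memStore{data: make(map[string][]byte)} }

func (m *memStore) Get(key string) ([]byte, bool) {
	v, ok := m.data[key]
	return v, ok
}

func (m *memStore) Set(key string, value []byte) { m.data[key] = value }

// opStats holds counter-style values that only ever increase.
type opStats struct {
	Count   uint64
	Elapsed time.Duration
}

// InstrumentedStore wraps a KVStore and records, per operation label,
// the number of calls and the cumulative time spent in the inner store.
type InstrumentedStore struct {
	inner KVStore
	mu    sync.Mutex
	stats map[string]*opStats
}

func NewInstrumentedStore(inner KVStore) *InstrumentedStore {
	return &InstrumentedStore{inner: inner, stats: make(map[string]*opStats)}
}

// observe adds one call and its elapsed time under the given label.
func (s *InstrumentedStore) observe(op string, start time.Time) {
	elapsed := time.Since(start)
	s.mu.Lock()
	defer s.mu.Unlock()
	st, ok := s.stats[op]
	if !ok {
		st = &opStats{}
		s.stats[op] = st
	}
	st.Count++
	st.Elapsed += elapsed
}

func (s *InstrumentedStore) Get(key string) ([]byte, bool) {
	// time.Now() is evaluated when the defer statement runs, i.e. at call start.
	defer s.observe("get", time.Now())
	return s.inner.Get(key)
}

func (s *InstrumentedStore) Set(key string, value []byte) {
	defer s.observe("set", time.Now())
	s.inner.Set(key, value)
}

// Count returns how many times the labelled operation has been called.
func (s *InstrumentedStore) Count(op string) uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	if st, ok := s.stats[op]; ok {
		return st.Count
	}
	return 0
}

func main() {
	store := NewInstrumentedStore(newMemStore())
	store.Set("a", []byte("1"))
	store.Get("a")
	store.Get("missing")
	fmt.Printf("get=%d set=%d\n", store.Count("get"), store.Count("set")) // prints: get=2 set=1
}
```

In a real setup the `stats` map would presumably be replaced by a labelled counter and a latency histogram from the prometheus client, with the wrapper placed around whatever store interface the SDK actually exposes.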
Can we close this @ebuchman? Seems like we want to create a ticket for compiling a list of potential metrics (most likely gauges). Doesn't seem super high priority atm.