Telegraf plugin to dynamically traverse Windows services, listing the host along with the service name, display name, startup mode, and current state.
none - new plugin
as per Proposal
Listing and monitoring the state of Windows services is a key requirement for any Windows server estate and a widely used feature in monitoring products.
Support ideally for Windows 2008 R2 (all editions) and onwards (if this makes a difference to implementation).
Absolutely agree. Windows service monitoring is a key feature and a must-have for MS monitoring.
What is it that you'd want to collect about a service with Telegraf? The only thing you can't already find out about a service is the actual state of it (ie is it running, stopped etc), which is reasonably trivial to grab with a piece of PowerShell, and run via the exec plugin.
Here's a snippet of something similar I run:
````
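function Get-ServiceStatus($service) {
    # Cast the service's status enum to its integer value
    $status = [int](Get-Service -Name $service).Status
    if ($status -gt 0 -and $status -lt 5) {
        # $hostname is assumed to be set elsewhere in the script
        Write-Host "win_service,host=$hostname,service_name=$service status=$status"
    }
}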
````
This will output the service status as an integer (better than storing a string!); 1 is Stopped, 2 Starting, 3 Stopping, 4 Started.
I then show this on a single stat panel in Grafana, using an integer -> text mapping, and color thresholds to change the colour of the panel based on the service status.
You use the exec plugin to launch a Powershell script, which would contain that function I pasted above. Telegraf knows how to parse the output if it's in the correct format. Do some reading, it's all on the interwebs - https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec
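As a rough illustration of "the correct format", here is a minimal Python sketch that builds the same kind of line the PowerShell function prints, assuming the exec input is set to parse the influx data format. The function names are hypothetical. The measurement and tag names are copied from the snippet above. Unlike that snippet, this sketch escapes commas, equals signs and spaces in tag values and appends `i` so the field is stored as an integer (a bare number is stored as a float).

```python
def escape_tag(value: str) -> str:
    """Backslash-escape the characters line protocol treats as special in tag values."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def win_service_line(host: str, service_name: str, status: int) -> str:
    """Build one line-protocol line: measurement, tags, then a single integer field."""
    return (
        f"win_service,host={escape_tag(host)},"
        f"service_name={escape_tag(service_name)} status={status}i"
    )


if __name__ == "__main__":
    # Hypothetical host and service, used only to show the output shape
    print(win_service_line("web01", "Print Spooler", 4))
```

If a field has already been written as a float, switching to the `i` suffix later would probably cause a type conflict, so it is worth settling on one form up front.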
Technically everything could be done with exec (disk, cpu, even perfmon and procstat), but it shouldn't be the default answer; otherwise we wouldn't need any other plugin. Ideally a uniform approach is best, and you can't guarantee that every server has the same version of PowerShell.
One requirement here would be to collect the data in the original request. For example, I want to run a check on 5000 servers (so single stat is not manageable). What if you want this:
(service.startup = automatic AND service.state != RUNNING) OR (service.startup = disabled AND service.state != STOPPED)
It's not just about Grafana either; there are other interfaces to InfluxDB, even Kapacitor for the aggregation.
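To make the condition above concrete, here is a small Python sketch that applies it to a list of collected samples. The `ServiceSample` type, its field names and the string values are assumptions made for illustration. They are not the schema of any plugin.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ServiceSample:
    host: str
    service_name: str
    startup: str  # assumed values: "automatic", "manual", "disabled"
    state: str    # assumed values: "running", "stopped", ...


def violates_policy(sample: ServiceSample) -> bool:
    """True when an automatic service is not running or a disabled one is not stopped."""
    startup = sample.startup.lower()
    state = sample.state.lower()
    return (startup == "automatic" and state != "running") or (
        startup == "disabled" and state != "stopped"
    )


def find_violations(samples: Iterable[ServiceSample]) -> List[ServiceSample]:
    """Keep only the samples that break the policy, preserving input order."""
    return [s for s in samples if violates_policy(s)]
```

Run across thousands of hosts, a filter like this returns only the offending services, which is the kind of aggregation a single dashboard panel per service cannot provide.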
I guess the reason I (personally) find the request a little odd is that I use metrics collection for retrospective analysis, not for monitoring things on the fly to see if they are alive. For example, you can't monitor for not receiving stats, which then means your check is useless. You also have a SPOF in whatever you're using to collect the data (I'm assuming InfluxDB). Admittedly I only use the Telegraf/InfluxDB part of the InfluxData stack. For "is it alive" monitoring I use Sensu, which also allows me to check whether host checks have gone stale and alert accordingly.
I hacked the little PowerShell script together because a dude I work with asked for a service state dashboard. That feels like a bit of a misuse of the product, but it serves a purpose: showing the state of our non-prod environments to interested parties.
@biker73 I think you're misunderstanding what I meant by "single stat" - that is a Grafana panel plugin, nothing to do with the way the data is recorded - you can see that from the output of my Powershell function above (which I personally run on a few hundred servers).
Sorry - yes, I've not used single stat. However, on the other points: we're using this for live monitoring and post-incident analysis. For this you can monitor for not receiving stats, 100% you can, and yes, it's essential.
The internal plugin is key for this, coupled with a deadman switch in Kapacitor. It gives you info per plugin and a number of metrics around it, which is extremely useful.
But, appreciate the input :) as a workaround if nothing else.
> you can't monitor for not receiving stats
You can do this with Kapacitor.
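The idea behind a deadman check can be sketched independently of Kapacitor: remember when each host last reported and flag any host that has been silent for too long. The Python below is only a conceptual sketch. The function name, the dictionary of timestamps and the threshold are all hypothetical, and none of it is Kapacitor's API.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_hosts(
    last_seen: Dict[str, datetime], now: datetime, max_age: timedelta
) -> List[str]:
    """Return hosts whose most recent metric is older than max_age, sorted by name."""
    return sorted(host for host, seen in last_seen.items() if now - seen > max_age)
```

A caller would refresh `last_seen` whenever a point arrives and run `stale_hosts` on a timer, alerting on any host it returns.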
Would be happy to include the requested functionality but it will have to be developed and supported by the community.
Hi,
I've started to work on this plugin.
@danielnelson, I will reach out to you later with a plugin design proposal and a few questions.
Regards,
Vlasta
great @vlastahajek
Hi all,
I'm glad to announce that the plugin is almost ready!
Before I create the PR, I have one final hesitation about the schema design and would like to get your opinion.
The current schema is:
Measurements & Fields:
The state tag can have the following values:
The startup_mode tag can have the following values:
Tags
The question is whether this schema suits future needs.
The issue I see is that the most interesting property, __state__, could often be used in queries as a conditional or grouping parameter, and thus it would be more efficient to have it as a tag.
On the other hand, in the alerting use case, it is more convenient to keep it as a field key.
WDYT?
If you would like to try it, you can get it from: https://github.com/bonitoo-io/telegraf/tree/vh-win-services
Thanks,
Vlasta
I think the way you have it structured is correct. Most of the time I think you will want to monitor these values changing. Using GROUP BY "state" does not seem particularly useful to me either, because there are no interesting fields to aggregate.
I'd probably name the states just stopped, start_pending, etc and the same for boot_start, system_start.
Does this seem like it will work for you? @biker73 @urbanb
Looking forward to the pull request.
I will try this next Monday but the proposal looks good to me. Not sure making state a tag would be great for series cardinality and you're unlikely to want to group by this value.
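The cardinality concern can be illustrated with a short Python sketch, assuming a series is identified by its measurement name plus its tag set. The helper names, the host and the service are hypothetical. The state names follow the naming suggested earlier in the thread.

```python
from typing import Dict, Iterable, Tuple


def series_key(measurement: str, tags: Dict[str, str]) -> str:
    """Identify a series by its measurement name and its sorted tag set."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{measurement},{tag_part}"


def count_series(points: Iterable[Tuple[str, Dict[str, str]]]) -> int:
    """Count the distinct series produced by a sequence of (measurement, tags) points."""
    return len({series_key(m, tags) for m, tags in points})


# One service observed in three different states over time
observed_states = ["stopped", "start_pending", "running"]
base_tags = {"host": "web01", "service_name": "Spooler"}

# State kept as a field: the tag set never changes
state_as_field = [("win_services", dict(base_tags)) for _ in observed_states]
# State kept as a tag: every new state value changes the tag set
state_as_tag = [("win_services", {**base_tags, "state": s}) for s in observed_states]
```

Here `count_series(state_as_field)` is 1 while `count_series(state_as_tag)` is 3, so each state a service passes through would add a series per service per host.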
Hi,
@danielnelson: I took the names from MSDN doc of those params, but it definitely makes sense to remove the prefix.
It's done and PR is created.
Perhaps you could report a numeric value for each state, such as 0 for Offline, 1 for Stopped, 2 for Starting, 3 for Stopping and 4 for Started. This would make it possible to build a time series showing when the service was offline.
Looking forward to testing it!
@urbanb Agree re integers rather than strings.
I knocked up a simple PowerShell script that I use with the Telegraf exec runner (sadly can't reproduce it here as it was done at my place of work) that runs Get-Service, which already returns the state of the service as a number.
As you say, this allows graphing the state over time, alerting if not a specific state etc etc. The numbers the cmdlet outputs are:
````
````
If you use Grafana, it's relatively trivial to map the integer to a string for a "single stat" box. Not sure if this functionality exists in Chronograf however.
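For reference, the integer-to-text mapping is small enough to sketch in a few lines of Python. The values are those of the .NET `ServiceControllerStatus` enumeration that `Get-Service` reports, to the best of my knowledge. The dictionary and function names are hypothetical.

```python
# Integer values of the .NET ServiceControllerStatus enumeration
SERVICE_STATUS_NAMES = {
    1: "Stopped",
    2: "StartPending",
    3: "StopPending",
    4: "Running",
    5: "ContinuePending",
    6: "PausePending",
    7: "Paused",
}


def status_name(code: int) -> str:
    """Translate a numeric service status into a readable label."""
    return SERVICE_STATUS_NAMES.get(code, f"Unknown({code})")
```

A dashboard value mapping would carry the same table, with colour thresholds chosen around the values that matter (for example, anything other than 4).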
I've also written a script that reports the service state as an integer:
````
function Get-ServiceStatus($service) {
    # Cast the service's status enum to its integer value
    $status = [int](Get-Service -Name $service).Status
    #Write-Host "Service: $service status is $status"
    if ($status -gt 0 -and $status -lt 5) {
        Write-Host "win_service,host=$($env:ComputerName),service_name=$($service) status=$($status)"
    }
}

# Report the status of every service on this machine
$serviceList = (Get-Service).Name
foreach ($serviceName in $serviceList) {
    Get-ServiceStatus $serviceName
}
````
Indeed the Windows Service API itself returns numbers, not strings, as shown in the PowerShell examples above.
It looks like it will be more beneficial to report state and startup_mode as numbers. The possible values will be:
state
startup mode
I don't want to report the data in multiple representations, but here is a list of advantages and disadvantages of each format:
In favor of using strings:
In favor of using integers:
Alerting using Kapacitor shouldn't be a problem with either format.
If you stored the service status as a string/tag, what would you store as the actual numeric value for each recording of the measurement?
AFAIK you need a number stored as a value for the data to validate correctly, and hence save. From memory this is another reason why I used integers with my script and not strings.
(I could be wrong about this, so please ignore me if so!)
For InfluxDB it is only required to have at least one field of any type, so the string values would be fine. Some of our outputs, such as prometheus or graphite, only support numeric types.
Though I wish it were more readable, I think it looks like encoding the fields as integers has quite a few advantages. If there are no objections, I suggest we switch to using ints for both fields.
Thanks a lot for your feedback!
The state and startup_mode fields are now integer: check it out from https://github.com/bonitoo-io/telegraf/tree/vh-win-services, it is synced with the upstream.
The PR is in progress
The plugin is already merged to the Telegraf repo. Original branch is deleted.
Hey @vlastahajek
What's the plugin name ? I can't seem to find it.
It's _win_services_: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/win_services, available only if you build Telegraf on Windows.
Hi @vlastahajek,
Can we imagine adding wildcard support ? If so it could permit autodiscovery of services like "t*"
What do you think ?
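To make the wildcard idea concrete, here is a minimal Python sketch of glob-style matching against a list of service names using the standard `fnmatch` module. It only illustrates the request and is not how the plugin works. The function name and the sample service names are hypothetical, and matching is made case-insensitive on the assumption that Windows service names are not case-sensitive.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def match_services(names: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Return the names that match at least one glob pattern, ignoring case."""
    lowered = [p.lower() for p in patterns]
    return [n for n in names if any(fnmatchcase(n.lower(), p) for p in lowered)]
```

With a pattern such as `t*`, newly installed services whose names start with "t" would be picked up without a configuration change.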
@glcx This is a good idea, can you open a new feature request issue?
This is fantastic and very useful - started a POC last Friday. Is the same available for Linux service/daemons?
@danielnelson Done: #3673 :+1:
@aings No, nothing available for Linux at this time.