Telegraf: Feature Request - telegraf "Windows Services" plugin

Created on 25 Apr 2017 · 31 Comments · Source: influxdata/telegraf

Proposal:

A Telegraf plugin that dynamically enumerates Windows services, reporting the host along with each service's name, display name, startup mode, and current state.

Current behavior:

none - new plugin

Desired behavior:

as per Proposal

Use case: [Why is this important (helps with prioritizing requests)]

Listing and monitoring the state of Windows services is a key feature of any Windows server estate and a widely used feature in monitoring products.

Support ideally for Windows 2008 R2 (all editions) and onwards (if this makes a difference to implementation).

feature request · help wanted · platform/windows

All 31 comments

Absolutely agree. Windows service monitoring is a key feature and a must-have for MS monitoring.

What is it that you'd want to collect about a service with Telegraf? The only thing you can't already find out about a service is the actual state of it (ie is it running, stopped etc), which is reasonably trivial to grab with a piece of PowerShell, and run via the exec plugin.

Here's a snippet of something similar I run:

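````
function Get-ServiceStatus($service) {
    # Assumes $hostname is set elsewhere in the script, e.g. from $env:COMPUTERNAME
    # Cast the ServiceControllerStatus enum to its integer value
    $status = [int](Get-Service $service).Status

    # Only report the four basic states (1 = Stopped ... 4 = Started)
    if ($status -gt 0 -and $status -lt 5) {
        Write-Host "win_service,host=$hostname,service_name=$service status=$status"
    }
}
````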
This will output the service status as an integer (better than storing a string!); 1 is Stopped, 2 Starting, 3 Stopping, 4 Started.

I then show this on a single stat panel in Grafana, using an integer -> text mapping, and color thresholds to change the colour of the panel based on the service status.

You use the exec plugin to launch a Powershell script, which would contain that function I pasted above. Telegraf knows how to parse the output if it's in the correct format. Do some reading, it's all on the interwebs - https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec
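For readers unfamiliar with the "correct format" mentioned above, here is a minimal Go sketch of the InfluxDB line protocol that the PowerShell function emits. The helper names `escapeTag` and `formatServiceLine` are hypothetical and not part of Telegraf. The sketch escapes commas, equals signs, and spaces in tag values, which matters for service names such as "Print Spooler", and it appends the `i` suffix so the status is stored as an integer; without that suffix, as in the PowerShell snippet, InfluxDB treats the number as a float.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeTag escapes the characters that line protocol treats as
// delimiters inside tag values: commas, equals signs, and spaces.
func escapeTag(s string) string {
	return strings.NewReplacer(`,`, `\,`, `=`, `\=`, ` `, `\ `).Replace(s)
}

// formatServiceLine renders one win_service point in line protocol.
// The trailing "i" marks status as an integer field rather than a float.
func formatServiceLine(host, service string, status int) string {
	return fmt.Sprintf("win_service,host=%s,service_name=%s status=%di",
		escapeTag(host), escapeTag(service), status)
}

func main() {
	// Hypothetical host and service names, for illustration only.
	fmt.Println(formatServiceLine("web01", "Print Spooler", 4))
}
```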

Technically everything could be done with exec (disk, cpu, even perfmon and procstat), but it shouldn't be the default answer; otherwise we wouldn't need any other plugin. Ideally a uniform way is best, and you can't guarantee that every server has the same version of PowerShell.

One requirement here would be to collect the data in the original request. For example, say I want to run a check across 5000 servers (so a single stat is not manageable). What if you want this:

(service.startup = automatic AND service.state != RUNNING) OR (service.startup = disabled AND service.state != STOPPED)

It's not just about Grafana either; there are other interfaces to InfluxDB, even Kapacitor for the aggregation.
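To make the fleet-wide condition quoted above concrete, here is a small Go sketch that applies it to a list of service records. The `Service` type, the `needsAttention` function, and the sample data are hypothetical; the string values simply follow the spelling used in the expression.

```go
package main

import "fmt"

// Service is a hypothetical record holding the properties the check needs.
type Service struct {
	Host    string
	Name    string
	Startup string
	State   string
}

// needsAttention mirrors the quoted condition: an automatic service that
// is not running, or a disabled service that is not stopped.
func needsAttention(s Service) bool {
	return (s.Startup == "automatic" && s.State != "RUNNING") ||
		(s.Startup == "disabled" && s.State != "STOPPED")
}

func main() {
	// Made-up sample data standing in for results from many servers.
	fleet := []Service{
		{"web01", "W3SVC", "automatic", "STOPPED"},
		{"web01", "Spooler", "disabled", "STOPPED"},
		{"db01", "RemoteRegistry", "disabled", "RUNNING"},
	}
	for _, s := range fleet {
		if needsAttention(s) {
			fmt.Printf("%s/%s: startup=%s state=%s\n", s.Host, s.Name, s.Startup, s.State)
		}
	}
}
```

A check like this only works if both startup mode and state are collected for every service, which is the point of the original request.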

I guess the reason I (personally) find the request a little odd is because I use metrics collection for retrospective analysis, not for monitoring things on the fly to see if they are alive. E.g. you can't monitor for not receiving stats, which then means your check is useless. You also have a SPOF in whatever you're using for collecting the data; I'm assuming InfluxDB. Admittedly I only use the Telegraf/InfluxDB part of the InfluxData stack. For "is it alive" monitoring I use Sensu, which also allows me to check if host checks have gone stale, and alert accordingly.

I hacked the little PowerShell script together because a dude I work with asked for a service state dashboard. That feels like a bit of a misuse of the product, but it serves a purpose for showing the state of our non-prod environments to interested parties.

@biker73 I think you're misunderstanding what I meant by "single stat" - that is a Grafana panel plugin, nothing to do with the way the data is recorded - you can see that from the output of my Powershell function above (which I personally run on a few hundred servers).

Sorry, yes, I haven't used single stat. On the other points, though: we're using this for live monitoring and post-incident analysis. For this you can monitor for not receiving stats, 100% you can, and yes, it's essential.

The internal plugin is key for this, coupled with a deadman switch in Kapacitor. It gives you info per plugin and a number of metrics around it, which is extremely useful.

But, appreciate the input :) as a workaround if nothing else.

> you can't monitor for not receiving stats

You can do this with Kapacitor.

Would be happy to include the requested functionality but it will have to be developed and supported by the community.

Hi,
I've started to work on this plugin.

@danielnelson, I will reach out to you later with a plugin design proposal and a few questions.

Regards,
Vlasta

great @vlastahajek

Hi all,
I'm glad to announce that the plugin is almost ready!
Before I create the PR, I have one last hesitation about the schema design and would like to get your opinion.

The current schema is:

Measurements & Fields:

  • win_services

    • state

    • startup_mode

The state field can have the following values:

  • _service_stopped_
  • _service_start_pending_
  • _service_stop_pending_
  • _service_running_
  • _service_continue_pending_
  • _service_pause_pending_
  • _service_paused_

The startup_mode field can have the following values:

  • _service_boot_start_
  • _service_system_start_
  • _service_auto_start_
  • _service_demand_start_
  • _service_disabled_

Tags

  • All measurements have the following tags:

    • service_name

    • display_name

The question is whether this schema suits future needs.

The issue I see is that the most interesting property, __state__, could often be used for querying, as a conditional or grouping parameter, and thus it would be more efficient to have it as a tag.
On the other hand, in the alerting use case, it is more convenient to keep it as a field key.

WDYT?

If you would like to try it, you can get it from: https://github.com/bonitoo-io/telegraf/tree/vh-win-services

Thanks,
Vlasta

I think the way you have it structured is correct. Most of the time I think you will want to monitor these values changing. Using GROUP BY "state" does not seem particularly useful to me either, because there are no interesting fields to aggregate.

I'd probably name the states just stopped, start_pending, etc and the same for boot_start, system_start.

Does this seem like it will work for you? @biker73 @urbanb

Looking forward to the pull request.

I will try this next Monday but the proposal looks good to me. Not sure making state a tag would be great for series cardinality and you're unlikely to want to group by this value.

Hi,
@danielnelson: I took the names from MSDN doc of those params, but it definitely makes sense to remove the prefix.
It's done and PR is created.

Perhaps you could report a numeric value for each state, such as
0 is Offline, 1 is Stopped, 2 Starting, 3 Stopping, 4 Started.
This will make it possible to create a time series showing when the service was offline.
Looking forward to testing it!

@urbanb Agree re integers rather than strings.

I knocked up a simple PowerShell script that I use with the Telegraf exec runner (sadly can't reproduce it here as it was done at my place of work) that runs Get-Service, which already returns the state of the service as a number.

As you say, this allows graphing the state over time, alerting if not a specific state etc etc. The numbers the cmdlet outputs are:
````
Possible windows service states

4 - Started
3 - Stopping
2 - Starting
1 - Stopped
````

If you use Grafana, it's relatively trivial to map the integer to a string for a "single stat" box. Not sure if this functionality exists in Chronograf however.

I've also written a script which provides integers as service state:

    function Get-ServiceStatus($service) {
        # Cast the ServiceControllerStatus enum to its integer value
        $status = [int](Get-Service $service).Status
        #Write-Host "Service: $service status is $status"
        if ($status -gt 0 -and $status -lt 5) {
            Write-Host "win_service,host=$($env:ComputerName),service_name=$($service) status=$($status)"
        }
    }
    # Report the status of every service on this machine
    $serviceList = (Get-Service).Name
    foreach ($serviceName in $serviceList) {
        Get-ServiceStatus $serviceName
    }

Indeed the Windows Service API itself returns numbers, not strings, as shown in the PowerShell examples above.

It looks like it will be more beneficial to report state and startup_mode as numbers. The possible values will be:

state

  • 1: stopped
  • 2: start_pending
  • 3: stop_pending
  • 4: running
  • 5: continue_pending
  • 6: pause_pending
  • 7: paused

startup mode

  • 0: boot_start
  • 1: system_start
  • 2: auto_start
  • 3: demand_start
  • 4: disabled
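Since the integers are less readable than the names, a consumer of the data may want to translate them back. The following Go sketch shows one way to do that using lookup tables built from the two lists above. The names `stateNames`, `startupModeNames`, and `describe` are hypothetical and are not taken from the plugin.

```go
package main

import "fmt"

// stateNames maps the numeric state to the names listed above.
var stateNames = map[int]string{
	1: "stopped",
	2: "start_pending",
	3: "stop_pending",
	4: "running",
	5: "continue_pending",
	6: "pause_pending",
	7: "paused",
}

// startupModeNames maps the numeric startup mode to the names listed above.
var startupModeNames = map[int]string{
	0: "boot_start",
	1: "system_start",
	2: "auto_start",
	3: "demand_start",
	4: "disabled",
}

// describe turns the two integers into a readable summary, falling back
// to "unknown" for values outside the documented ranges.
func describe(state, startupMode int) string {
	s, ok := stateNames[state]
	if !ok {
		s = "unknown"
	}
	m, ok := startupModeNames[startupMode]
	if !ok {
		m = "unknown"
	}
	return fmt.Sprintf("state=%s startup_mode=%s", s, m)
}

func main() {
	fmt.Println(describe(4, 2))
}
```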

I don't want to report the data in multiple representations, but here is a list of advantages and disadvantages of each format:

In favor of using strings:

  • more readable
  • easier to make a single stat type chart with latest value

In favor of using integers:

  • will likely store more efficiently
  • maps directly to a y-axis for graphs over time (though no control of mapping order, since we will want to use Microsoft's numbering).
  • could use InfluxQL function such as derivative to find state changes.
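The last point can be illustrated outside the database. This Go sketch is not InfluxQL and does not reproduce the exact semantics of its derivative function; it only shows the underlying idea that, with integer states, any non-zero difference between consecutive samples marks a state change. The function name `stateChanges` and the sample series are made up for illustration.

```go
package main

import "fmt"

// stateChanges returns the indexes at which the sampled state differs
// from the previous sample, i.e. where the discrete difference is non-zero.
func stateChanges(samples []int) []int {
	var changes []int
	for i := 1; i < len(samples); i++ {
		if samples[i]-samples[i-1] != 0 {
			changes = append(changes, i)
		}
	}
	return changes
}

func main() {
	// running, running, stop_pending, stopped, stopped, running
	samples := []int{4, 4, 3, 1, 1, 4}
	fmt.Println(stateChanges(samples)) // prints [2 3 5]
}
```

With string values the same comparison is still possible, but the subtraction is not, which is why numeric fields fit functions of this kind more naturally.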

Alerting using Kapacitor shouldn't be a problem with either format.

If you stored the service status as a string/tag, what would you store as the actual numeric value for each recording of the measurement?

AFAIK you need a number stored as a value for the data to validate correctly, and hence save. From memory this is another reason why I used integers with my script and not strings.

(I could be wrong about this, so please ignore me if so!)

For InfluxDB it is only required to have at least one field of any type, so the string values would be fine. However, some of our outputs, such as prometheus or graphite, only support numeric types.

Though I wish it were more readable, I think it looks like encoding the fields as integers has quite a few advantages. If there are no objections, I suggest we switch to using ints for both fields.

Thanks a lot for your feedback!
The state and startup_mode fields are now integer: check it out from https://github.com/bonitoo-io/telegraf/tree/vh-win-services, it is synced with the upstream.
The PR is in progress

The plugin is already merged to the Telegraf repo. Original branch is deleted.

Hey @vlastahajek

What's the plugin name? I can't seem to find it.

It's _win_services_: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/win_services, available only if you build Telegraf on Windows.

Hi @vlastahajek,

Could we consider adding wildcard support? It would allow autodiscovery of services with a pattern like "t*".

What do you think ?
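As a rough illustration of what such wildcard matching could look like, here is a Go sketch that filters a list of service names with a shell-style pattern. It uses the standard library's `path.Match` and lowercases both sides on the assumption that service names should match case-insensitively. This is only a sketch of the idea, not how the plugin does or would implement it; the function `matchServices` and the sample names are hypothetical.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchServices returns the names that match a shell-style pattern such
// as "t*". Both sides are lowercased so the match ignores case.
func matchServices(pattern string, names []string) ([]string, error) {
	var matched []string
	p := strings.ToLower(pattern)
	for _, n := range names {
		ok, err := path.Match(p, strings.ToLower(n))
		if err != nil {
			return nil, err // malformed pattern
		}
		if ok {
			matched = append(matched, n)
		}
	}
	return matched, nil
}

func main() {
	// Sample names for illustration only.
	names := []string{"TermService", "Themes", "Spooler", "tapisrv"}
	matched, err := matchServices("t*", names)
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	fmt.Println(matched)
}
```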

@glcx This is a good idea, can you open a new feature request issue?

This is fantastic and very useful - started a POC last Friday. Is the same available for Linux service/daemons?

@danielnelson Done: #3673 :+1:

@aings No, nothing available for Linux at this time.
