Beats: [Metricbeat] Grouping Windows Perfmon Metrics in Events

Created on 16 Mar 2018 · 16 Comments · Source: elastic/beats

The perfmon metricset generates one event per counter instance. It would be nice to offer more flexibility in grouping related metrics into a single event.

Having a single event per metric was the simplest implementation that would allow similar metrics to be grouped and visualized (e.g. visualize disk write times for each disk instance on the same graph).

It would be nice to be able to group all metrics related to an instance of an object (e.g. all metrics for C:\ or all metrics for processor 0). Here are some examples that show my idea.

metricbeat.modules:
- module: windows
  metricsets: [perfmon]
  perfmon.queries:
  - object: '\Processor'
    namespace: processor
    instance: '*'
    counters:
    - name: '% User Time'
      label: time.user.pct
    - name: '% Processor Time'
      label: time.processor.pct
    - name: '% Interrupt Time'
      label: time.interrupt.pct

  - object: '\UDPv4'
    namespace: udpv4
    counters:
    - name: 'Datagrams Received/sec'
      label: packets.received.per_sec
    - name: 'Datagrams Received Errors'
      label: packets.received.errors
    - name: 'Datagrams No Port/sec'
      label: packets.received.no_port_per_sec
    - name: 'Datagrams Sent/sec'
      label: packets.sent.per_sec
    - name: 'Datagrams/sec'
      label: packets.per_sec

The first query uses a Performance Data Helper (PDH) path of \Processor(*)\<counter name> and would produce an event like

{
  "windows": {
    "perfmon": {
      "processor": {
        "instance": "_Total",
        "time": {
          "user": {
            "pct": 1.2
          },
          "processor": {
            "pct": 10.1
          },
          "interrupt": {
            "pct": 0.5
          }
        }
      }
    }
  }
}
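
A minimal sketch of the grouping idea, written in Python rather than the Go used by Beats. It assumes counter samples arrive as (instance, label, value) tuples; the function names and the sample values are hypothetical, and only the windows.perfmon.<namespace> layout and the dotted labels come from the config above.

```python
from collections import defaultdict


def set_dotted(target, dotted_key, value):
    """Store value in a nested dict, splitting the key on dots."""
    parts = dotted_key.split(".")
    for part in parts[:-1]:
        target = target.setdefault(part, {})
    target[parts[-1]] = value


def group_by_instance(namespace, samples):
    """Build one event per instance from (instance, label, value) samples."""
    per_instance = defaultdict(dict)
    for instance, label, value in samples:
        set_dotted(per_instance[instance], label, value)

    events = []
    for instance, fields in per_instance.items():
        body = {"instance": instance}
        body.update(fields)
        events.append({"windows": {"perfmon": {namespace: body}}})
    return events


# Hypothetical samples for two processor instances.
samples = [
    ("_Total", "time.user.pct", 1.2),
    ("_Total", "time.processor.pct", 10.1),
    ("_Total", "time.interrupt.pct", 0.5),
    ("0", "time.user.pct", 2.0),
]
events = group_by_instance("processor", samples)
```

With these samples, all three counters for _Total land in one event and instance 0 gets its own event, instead of four separate events.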

The second query uses a PDH path of \UDPv4\<counter name> (it has no instance) and produces an event like

{
  "windows": {
    "perfmon": {
      "udpv4": {
        "packets": {
          "received": {
            "per_sec": 18,
            "errors": 1,
            "no_port_per_sec": 11
          },
          "sent": {
            "per_sec": 12
          },
          "per_sec": 32
        }
      }
    }
  }
}
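
The two PDH path forms mentioned above (with and without an instance) could be assembled from the query fields roughly as follows. This is an illustrative Python sketch; pdh_path is a hypothetical helper, not a function in Metricbeat.

```python
def pdh_path(obj, counter, instance=None):
    """Return a PDH counter path for an object, optional instance, and counter."""
    if instance is None:
        # Objects such as \UDPv4 have no instances.
        return obj + "\\" + counter
    return obj + "(" + str(instance) + ")\\" + counter


processor_path = pdh_path("\\Processor", "% User Time", instance="*")
udp_path = pdh_path("\\UDPv4", "Datagrams/sec")
print(processor_path)
print(udp_path)
```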

This would also resolve #6528.

Related Info

Working with Performance Counters: https://technet.microsoft.com/en-us/library/bb734903.aspx?f=255&MSPPError=-2147217396

:Windows Metricbeat Integrations enhancement module

Source

andrewkroh

👍8

All 16 comments

I really like the idea of letting users build their own events. It also reminds me of https://github.com/elastic/beats/pull/6462, where this could be a potential solution (@jsoriano).

ruflin on 19 Mar 2018

Related #4944.

andrewkroh on 11 Apr 2018

We'd love to see this functionality. The way the perfmon module currently works is not very efficient: it generates a huge number of documents, each carrying very little information. For example, try gathering some diskio-related counters on 1k+ Windows servers every 10s:

    - instance_label: "diskio.name"
      measurement_label: "diskio.reads"
      query: '\LogicalDisk(*)\Disk Reads/sec'
      format: "long"
    - instance_label: "diskio.name"
      measurement_label: "diskio.writes"
      query: '\LogicalDisk(*)\Disk Writes/sec'
      format: "long"
    - instance_label: "diskio.name"
      measurement_label: "diskio.read.queue_length"
      query: '\LogicalDisk(*)\Avg. Disk Read Queue Length'
    - instance_label: "diskio.name"
      measurement_label: "diskio.write.queue_length"
      query: '\LogicalDisk(*)\Avg. Disk Write Queue Length'
    - instance_label: "diskio.name"
      measurement_label: "diskio.read.time.pct"
      query: '\LogicalDisk(*)\% Disk Read Time'
    - instance_label: "diskio.name"
      measurement_label: "diskio.write.time.pct"
      query: '\LogicalDisk(*)\% Disk Write Time'
    - instance_label: "diskio.name"
      measurement_label: "diskio.bytes_per_read.avg"
      query: '\LogicalDisk(*)\Avg. Disk Bytes/Read'
      format: "long"
    - instance_label: "diskio.name"
      measurement_label: "diskio.bytes_per_write.avg"
      query: '\LogicalDisk(*)\Avg. Disk Bytes/Write'
      format: "long"
    - instance_label: "diskio.name"
      measurement_label: "diskio.read.bytes_per_sec"
      query: '\LogicalDisk(*)\Disk Read Bytes/sec'
      format: "long"
    - instance_label: "diskio.name"
      measurement_label: "diskio.write.bytes_per_sec"
      query: '\LogicalDisk(*)\Disk Write Bytes/sec'
      format: "long"

Check this graph: [graph image not preserved]

The graph covers only the above perfmon counters on 6 Windows servers.
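
A rough back-of-the-envelope estimate of the document volume, as a Python sketch. The 10 counters, 1000 servers and 10 s period come from the comment above; the figure of 3 LogicalDisk instances per server is an assumption for illustration only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_day(servers, instances, counters, period_s, grouped):
    """Estimate the daily event count for the perfmon metricset."""
    collections = SECONDS_PER_DAY // period_s
    # Grouped: one event per instance. Ungrouped: one per instance and counter.
    per_collection = servers * instances * (1 if grouped else counters)
    return collections * per_collection


# 3 instances per server (e.g. C:, D:, _Total) is an assumed figure.
ungrouped = events_per_day(1000, 3, 10, 10, grouped=False)
grouped = events_per_day(1000, 3, 10, 10, grouped=True)
print(ungrouped, grouped, ungrouped // grouped)
```

Under these assumptions, grouping per instance cuts the event count by a factor equal to the number of counters in the query.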

willemdh on 1 Jun 2018

@ruflin @andrewkroh

Just wanted to add that I have a feeling that the millions of documents the perfmon module generates result in very slow recovery of metricbeat indices. I added the diskio metrics from my previous post on roughly 600 servers and have seen a significant deterioration of performance during recovery. (It could of course be related to other things in my cluster, but I still wanted to mention it; maybe other customers are seeing the same.)

willemdh on 26 Jul 2018

Hey @andrewkroh, is someone working on this topic?

martinscholz83 on 9 Aug 2018

AFAIK nobody is working on it at the moment.

ruflin on 9 Aug 2018

OK. I like the idea of grouping events in a namespace. If that's OK, I would open a PR.

martinscholz83 on 9 Aug 2018

❤2 🎉1

Of course, that would be great.

ruflin on 9 Aug 2018

@andrewkroh, if you want to collect for multiple instances you want to do it this way?

metricbeat.modules:
- module: windows
  metricsets: [perfmon]
  perfmon.queries:
  - object: '\Processor'
    namespace: processor
    - instance: 0
      counters:
      - name: '% User Time'
        label: time.user.pct      
      - name: '% Interrupt Time'
        label: time.interrupt.pct
    - instance: 1
      counters:
      - name: '% Processor Time'
        label: time.processor.pct

  - object: '\UDPv4'
    namespace: udpv4
    counters:
    - name: 'Datagrams Received/sec'
      label: packets.received.per_sec
    - name: 'Datagrams Received Errors'
      label: packets.received.errors
    - name: 'Datagrams No Port/sec'
      label: packets.received.no_port_per_sec
    - name: 'Datagrams Sent/sec'
      label: packets.sent.per_sec
    - name: 'Datagrams/sec'
      label: packets.per_sec

martinscholz83 on 10 Aug 2018

While this is being rewritten, maybe we should consider a perfmon ECS object? It would be nice to have some sort of convention for perfmon data, so that everyone uses the same field names.

willemdh on 10 Aug 2018

@willemdh Not sure if ECS should have something specific to perfmon. Perfmon can use ECS fields, but it exposes lots of metrics that I would not expect to be in ECS. This is not specific to perfmon; it applies to metrics in general.

ruflin on 13 Aug 2018

In reply to @martinscholz83: "@andrewkroh, if you want to collect for multiple instances you want to do it this way?"

To avoid duplication of the counters configuration I think instance should be able to accept a single string or a list.

metricbeat.modules:
- module: windows
  metricsets: [perfmon]
  perfmon.queries:
  - object: '\Processor'
    namespace: processor
    instance: [0, 1]       # Allow either a string or a []string.
    counters:
    - name: '% User Time'
      label: time.user.pct
    - name: '% Interrupt Time'
      label: time.interrupt.pct
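
Accepting either a single value or a list for instance could be handled by normalizing the setting when the config is read. A minimal Python sketch follows; normalize_instances is a hypothetical helper, and treating a missing instance as an empty list (for objects such as \UDPv4) is an assumption.

```python
def normalize_instances(value):
    """Return the configured instance value as a list of strings."""
    if value is None:
        # Assumed convention: no instance configured, e.g. for \UDPv4.
        return []
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    return [str(value)]


print(normalize_instances("*"))
print(normalize_instances([0, 1]))
```

The rest of the code then only has to deal with a list, so the counters block does not need to be repeated per instance.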

andrewkroh on 13 Aug 2018

👍3

Hello folks,

+1 on the requirement, but ...

From what we have seen, system.filesystem does not return all mount points on Windows.
For this reason, we used the perf counters instead.
The drawback is that the events are independent, meaning that we need 2 events to get both the free-space percentage and the free megabytes.

Three options here:

1. Merge both events using Logstash, but that is a really dirty solution.
2. Update system.filesystem to be compatible with all mounts on Windows.
3. Implement this proposal to merge events as in system.filesystem, and keep using perfmon.

What would be best?
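
For illustration, the downstream merge from the first option amounts to something like the Python sketch below. The flat event shape and the field names (timestamp, host, instance, free.pct, free.mb) are hypothetical, and the sketch assumes that events from one collection cycle share an identical timestamp, which may not hold in practice.

```python
def merge_events(events, key_fields=("timestamp", "host", "instance")):
    """Merge flat events that share the same values for key_fields."""
    merged = {}
    for event in events:
        key = tuple(event[field] for field in key_fields)
        merged.setdefault(key, {}).update(event)
    return list(merged.values())


# Field names below are hypothetical, not the metricset's actual output.
events = [
    {"timestamp": "t0", "host": "srv1", "instance": "C:", "free.pct": 41.5},
    {"timestamp": "t0", "host": "srv1", "instance": "C:", "free.mb": 20480},
    {"timestamp": "t0", "host": "srv1", "instance": "D:", "free.pct": 80.0},
]
merged = merge_events(events)
```

The dependence on matching timestamps is one reason this kind of after-the-fact merge is fragile compared with grouping at collection time.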

Sialagio on 5 Oct 2018

Pinging @elastic/infrastructure

elasticmachine on 29 Nov 2018

I think the best approach would be to generate a single event per performance counter value and use predefined field names, so the result could look like this:

{
  "windows": {
    "perfmon": {
      "category": ".NET CLR Exceptions",
      "instance": "??APP_CLR_PROC??",
      "name": "my_counter_name",
      "value": 0.0
    }
  }
}

There are many advantages:

Allows terms aggregations per category, name, ...
Allows easy filtering of results by known category, name, ... with no need to search through the available fields.
Avoids a potentially huge number of dynamic fields.
With a little modification, it could become part of ECS.
In environments/companies with a mix of metricbeat, 3rd-party and custom log shippers, it makes it easy to search the combined data via Kibana.
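
A short Python sketch of this fixed-shape format, to show that the set of fields does not grow with the number of counters. The flat_events helper and the sample values are hypothetical; the four field names are the ones proposed above.

```python
def flat_events(category, samples):
    """Build one fixed-shape event per (instance, name, value) sample."""
    return [
        {
            "windows": {
                "perfmon": {
                    "category": category,
                    "instance": instance,
                    "name": name,
                    "value": float(value),
                }
            }
        }
        for instance, name, value in samples
    ]


samples = [
    ("_Total", "% User Time", 1.2),
    ("_Total", "% Interrupt Time", 0.5),
    ("0", "% User Time", 2),
]
events = flat_events("Processor", samples)

# Every event has the same four fields, however many counters are collected.
field_names = {key for e in events for key in e["windows"]["perfmon"]}
```

The trade-off is that this keeps one document per counter value, which is the volume concern raised earlier in the thread.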

vbohata on 24 May 2019

Based on multiple requests, we have worked on a new config format and event output that should satisfy most of the options proposed here. I will close the issue for now; if there are any questions, please reopen and resume the conversation. (Referenced PR: https://github.com/elastic/beats/pull/17596)