Beats: [Metricbeat] Grouping Windows Perfmon Metrics in Events

Created on 16 Mar 2018  ·  16 Comments  ·  Source: elastic/beats

The perfmon metricset generates one event per counter instance. It would be nice to offer more flexibility in grouping related metrics into a single event.

Having a single event per metric was the simplest implementation that would allow similar metrics to be grouped and visualized (e.g. visualize disk write times for each disk instance on the same graph).

It would be nice to be able to group all metrics related to an instance of an object (e.g. all metrics for C:\ or all metrics for processor 0). Here are some examples that show my idea.

metricbeat.modules:
- module: windows
  metricsets: [perfmon]
  perfmon.queries:
  - object: '\Processor'
    namespace: processor
    instance: '*'
    counters:
    - name: '% User Time'
      label: time.user.pct
    - name: '% Processor Time'
      label: time.processor.pct
    - name: '% Interrupt Time'
      label: time.interrupt.pct

  - object: '\UDPv4'
    namespace: udpv4
    counters:
    - name: 'Datagrams Received/sec'
      label: packets.received.per_sec
    - name: 'Datagrams Received Errors'
      label: packets.received.errors
    - name: 'Datagrams No Port/sec'
      label: packets.received.no_port_per_sec
    - name: 'Datagrams Sent/sec'
      label: packets.sent.per_sec
    - name: 'Datagrams/sec'
      label: packets.per_sec

The first query uses a Performance Data Helper (PDH) path of \Processor(*)\<counter name> and would produce an event like

{
  "windows": {
    "perfmon": {
      "processor": {
        "instance": "_Total",
        "time": {
          "user": {
            "pct": 1.2
          },
          "processor": {
            "pct": 10.1
          },
          "interrupt": {
            "pct": 0.5
          }
        }
      }
    }
  }
}
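The grouping behind an event like the one above can be sketched in a few lines. This is only an illustration of the idea, not Metricbeat code: the function names `set_dotted` and `group_samples`, the `(instance, label, value)` sample tuples, and the sample values for instance `0` are all assumptions made for the example.

```python
from collections import defaultdict


def set_dotted(target, label, value):
    """Store value in a nested dict, splitting the label on dots."""
    *parents, leaf = label.split(".")
    node = target
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def group_samples(namespace, samples):
    """Build one event per instance from (instance, label, value) samples."""
    per_instance = defaultdict(dict)
    for instance, label, value in samples:
        set_dotted(per_instance[instance], label, value)

    events = []
    for instance, fields in per_instance.items():
        # Objects without instances (instance is None) get no "instance" field.
        body = dict(fields) if instance is None else {"instance": instance, **fields}
        events.append({"windows": {"perfmon": {namespace: body}}})
    return events


# Hypothetical samples: three counters for _Total, one for processor 0.
samples = [
    ("_Total", "time.user.pct", 1.2),
    ("_Total", "time.processor.pct", 10.1),
    ("_Total", "time.interrupt.pct", 0.5),
    ("0", "time.user.pct", 2.0),
]
events = group_samples("processor", samples)
```

With these samples, four per-counter values collapse into two events, one per processor instance.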

The second query uses a PDH path of \UDPv4\<counter name> (it has no instance) and produces an event like

{
  "windows": {
    "perfmon": {
      "udpv4": {
        "packets": {
          "received": {
            "per_sec": 18,
            "errors": 1,
            "no_port_per_sec": 11
          },
          "sent": {
            "per_sec": 12
          },
          "per_sec": 32
        }
      }
    }
  }
}
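As a rough sketch of how the two PDH paths mentioned above could be derived from a query entry, the helper below joins the object, the optional instance, and the counter name. The names `pdh_path` and `query_paths` are hypothetical, and the dict layout simply mirrors the proposed config; none of this is taken from the Metricbeat source.

```python
def pdh_path(obj, counter, instance=None):
    r"""Return a PDH counter path such as \Processor(*)\% User Time."""
    if instance is None:
        # Objects like \UDPv4 have no instances, so the path has no "(...)" part.
        return f"{obj}\\{counter}"
    return f"{obj}({instance})\\{counter}"


def query_paths(query):
    """Expand one query entry (as in the proposed config) into PDH paths."""
    return [
        pdh_path(query["object"], counter["name"], query.get("instance"))
        for counter in query["counters"]
    ]


processor_query = {
    "object": "\\Processor",
    "instance": "*",
    "counters": [{"name": "% User Time"}, {"name": "% Interrupt Time"}],
}
udp_query = {
    "object": "\\UDPv4",
    "counters": [{"name": "Datagrams/sec"}],
}

for path in query_paths(processor_query) + query_paths(udp_query):
    print(path)
```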

This would also resolve #6528.

Related Info

:Windows Metricbeat Integrations enhancement module

All 16 comments

I really like the idea of allowing users to build their own events. It also reminds me of https://github.com/elastic/beats/pull/6462, where this could be a potential solution (@jsoriano).

Related #4944.

We'd love to see this functionality. The way the perfmon module works at the moment is not very efficient and generates a huge number of documents with very little information in each. For example, try gathering some diskio-related counters on 1k+ Windows servers every 10s:

    - instance_label: "diskio.name"
      measurement_label: "diskio.reads"
      query: '\LogicalDisk(*)\Disk Reads/sec'
      format: "long"
    - instance_label: "diskio.name"
      measurement_label: "diskio.writes"
      query: '\LogicalDisk(*)\Disk Writes/sec'
      format: "long"
    - instance_label: "diskio.name"
      measurement_label: "diskio.read.queue_length"
      query: '\LogicalDisk(*)\Avg. Disk Read Queue Length'
    - instance_label: "diskio.name"
      measurement_label: "diskio.write.queue_length"
      query: '\LogicalDisk(*)\Avg. Disk Write Queue Length'
    - instance_label: "diskio.name"
      measurement_label: "diskio.read.time.pct"
      query: '\LogicalDisk(*)\% Disk Read Time'
    - instance_label: "diskio.name"
      measurement_label: "diskio.write.time.pct"
      query: '\LogicalDisk(*)\% Disk Write Time'
    - instance_label: "diskio.name"
      measurement_label: "diskio.bytes_per_read.avg"
      query: '\LogicalDisk(*)\Avg. Disk Bytes/Read'
      format: "long"
    - instance_label: "diskio.name"
      measurement_label: "diskio.bytes_per_write.avg"
      query: '\LogicalDisk(*)\Avg. Disk Bytes/Write'
      format: "long"
    - instance_label: "diskio.name"
      measurement_label: "diskio.read.bytes_per_sec"
      query: '\LogicalDisk(*)\Disk Read Bytes/sec'
      format: "long"
    - instance_label: "diskio.name"
      measurement_label: "diskio.write.bytes_per_sec"
      query: '\LogicalDisk(*)\Disk Write Bytes/sec'
      format: "long"

Check this graph:

[image]

This is for only the above perfmon counters on 6 Windows servers.
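To put rough numbers on the document volume described above, here is a back-of-the-envelope calculation. The 10 counters, 1,000 servers, and 10-second period come from the comment; the figure of 3 LogicalDisk instances per server is purely an assumption for illustration, and `events_per_day` is a hypothetical helper.

```python
def events_per_day(servers, instances, counters, period_s, grouped):
    """Estimate daily event count for one-event-per-counter vs grouped output."""
    samples_per_day = 86_400 // period_s
    # Ungrouped: one event per counter per instance. Grouped: one per instance.
    events_per_sample = instances if grouped else instances * counters
    return servers * events_per_sample * samples_per_day


# 3 instances per server is an assumed value, not a measurement.
ungrouped = events_per_day(servers=1000, instances=3, counters=10, period_s=10, grouped=False)
grouped = events_per_day(servers=1000, instances=3, counters=10, period_s=10, grouped=True)
print(ungrouped, grouped, ungrouped // grouped)
```

Under these assumptions, grouping by instance cuts the event count by a factor equal to the number of counters per object.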

@ruflin @andrewkroh

Just wanted to add that I have a feeling that the millions of documents the perfmon module generates result in very slow recovery of metricbeat indices. I added the diskio metrics from my previous post on roughly 600 servers and have seen a significant deterioration of performance during recovery. (It could of course be related to other things in my cluster, but I still wanted to mention this; maybe other customers are seeing the same.)

Hey @andrewkroh, is someone working on this topic?

AFAIK nobody is working on it at the moment.

OK. I like the idea of grouping events in a namespace. If that's OK, I will open a PR.

Of course, that would be great.

@andrewkroh, if you want to collect for multiple instances, would you do it this way?

metricbeat.modules:
- module: windows
  metricsets: [perfmon]
  perfmon.queries:
  - object: '\Processor'
    namespace: processor
    - instance: 0
      counters:
      - name: '% User Time'
        label: time.user.pct      
      - name: '% Interrupt Time'
        label: time.interrupt.pct
    - instance: 1
      counters:
      - name: '% Processor Time'
        label: time.processor.pct

  - object: '\UDPv4'
    namespace: udpv4
    counters:
    - name: 'Datagrams Received/sec'
      label: packets.received.per_sec
    - name: 'Datagrams Received Errors'
      label: packets.received.errors
    - name: 'Datagrams No Port/sec'
      label: packets.received.no_port_per_sec
    - name: 'Datagrams Sent/sec'
      label: packets.sent.per_sec
    - name: 'Datagrams/sec'
      label: packets.per_sec

Maybe while this is being rewritten, we should consider a perfmon ECS object? It would be nice to have some sort of convention for perfmon data, so that everyone uses the same field names.

@willemdh I'm not sure ECS should have something specific to perfmon. Perfmon can use ECS fields, but there are lots of perfmon metrics that I would not expect to be in ECS. This applies not only to perfmon but to metrics in general.

@andrewkroh, if you want to collect for multiple instances, would you do it this way?

To avoid duplication of the counters configuration I think instance should be able to accept a single string or a list.

metricbeat.modules:
- module: windows
  metricsets: [perfmon]
  perfmon.queries:
  - object: '\Processor'
    namespace: processor
    instance: [0, 1]       # Allow either a string or a []string.
    counters:
    - name: '% User Time'
      label: time.user.pct
    - name: '% Interrupt Time'
      label: time.interrupt.pct
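Accepting either a single value or a list for instance usually comes down to normalizing the setting into a list before building queries. The snippet below is a minimal sketch of that idea; `normalize_instances` is a hypothetical helper and does not reflect how Metricbeat unpacks its config.

```python
def normalize_instances(value):
    """Turn a missing, scalar, or list-valued instance setting into a list of strings."""
    if value is None:
        # No instance configured, e.g. for objects such as \UDPv4.
        return []
    if isinstance(value, (str, int)):
        return [str(value)]
    return [str(item) for item in value]


print(normalize_instances("*"))
print(normalize_instances([0, 1]))
```

Downstream code can then loop over the result without caring which form the user wrote.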

Hello folks,

+1 on the requirement, but ...

When checking around, we found that system.filesystem does not return all mount points on Windows.
For this reason, we used the perf counters.
The drawback is that the events are independent, meaning that we need 2 events to get both the percentage and the free megabytes.

Three options here:

  • Merge both events using Logstash, but that is a really dirty solution.
  • Update system.filesystem to be compatible with all mounts on Windows
  • Do this implementation to merge events as in system.filesystem and keep using perfmon

What would be the best?
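For reference, the first option (merging the separate events downstream) amounts to something like the sketch below. The field names `host`, `instance`, `free.pct`, and `free.mb` are invented for the example, and a real pipeline would also need to match on timestamp or collection period.

```python
def merge_events(events, key_fields=("host", "instance")):
    """Merge events that share the same key fields into a single event."""
    merged = {}
    for event in events:
        key = tuple(event[field] for field in key_fields)
        merged.setdefault(key, {}).update(event)
    return list(merged.values())


# Hypothetical events: percentage and free space arrive separately per disk.
separate = [
    {"host": "srv1", "instance": "C:", "free.pct": 41.5},
    {"host": "srv1", "instance": "C:", "free.mb": 20480},
    {"host": "srv1", "instance": "D:", "free.pct": 80.0},
]
combined = merge_events(separate)
```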

Pinging @elastic/infrastructure

I think the best approach would be to generate a single event per performance counter value and use predefined field names, so the result could look like this:
{
  "windows": {
    "perfmon": {
      "category": ".NET CLR Exceptions",
      "instance": "??APP_CLR_PROC??",
      "name": "my_counter_name",
      "value": 0.0
    }
  }
}

There are many advantages:

  1. Allows term aggregations per category, name ...
  2. Allows easy filtering results for known category, name, ... No need for searching available fields.
  3. Avoids possible huge number of dynamic fields.
  4. With a little modification, it can be part of ECS.
  5. In environments/companies with a mix of Metricbeat, third-party, and custom log shippers, it makes it easy to search the combined data in Kibana.
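To illustrate the first two advantages, here is a small sketch that builds events in the flat shape proposed above and counts them per field value, which is roughly what a terms aggregation would return. The helpers `flat_event` and `count_by` and the sample counter names are assumptions for the example only.

```python
from collections import Counter


def flat_event(category, instance, name, value):
    """Build one event per counter value using fixed field names."""
    return {
        "windows": {
            "perfmon": {
                "category": category,
                "instance": instance,
                "name": name,
                "value": value,
            }
        }
    }


def count_by(events, field):
    """Count events per value of a perfmon field, like a terms aggregation."""
    return Counter(event["windows"]["perfmon"][field] for event in events)


# Hypothetical sample events.
flat_events = [
    flat_event("Processor", "_Total", "% User Time", 1.2),
    flat_event("Processor", "_Total", "% Interrupt Time", 0.5),
    flat_event("UDPv4", "", "Datagrams/sec", 32),
]
by_category = count_by(flat_events, "category")
```

Because the field names never change, the same `count_by` call works for any counter without looking up which dynamic fields exist.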

Based on multiple requests, we have worked on a new config format and event output that should satisfy most of the options proposed here. Closing the issue for now; if there are any questions, please reopen it and resume the conversation (referenced PR: https://github.com/elastic/beats/pull/17596).
