Telegraf: cluster metrics limited

Created on 9 Feb 2019 · 20 Comments · Source: influxdata/telegraf

Relevant telegraf.conf:

  interval = "300s"
  vm_metric_exclude = [ "*" ]
  host_metric_exclude = [ "*" ]
  datastore_metric_exclude = [ "*" ]
  datacenter_metric_exclude = [ "*" ]
  cluster_metric_include = [ "*" ]
  collect_concurrency = 3
  force_discover_on_init = true
  insecure_skip_verify = true

System info:

Telegraf 1.9.4 (git: HEAD 4da8d0a4)
CentOS Linux release 7.6.1810 (Core)

Steps to reproduce:

Use telegraf.conf as provided.

Expected behavior:

Full set of metrics returned as noted here:
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/vsphere/METRICS.md

Actual behavior:

Only a limited set of cluster metrics are returned. For example:
vsphere_cluster_mem: overhead_average, totalmb_average, usage_average.

Additional info:

Logging set to debug. No relevant details appear in log.

All 20 comments

Can you try these items and let us know the results:

  • Are there any additional values output when using the default config?
  • Enable the internal plugin and check internal_gather,input=vsphere gather_time_ns.
  • Even though you didn't see anything, can you attach the log when running in --debug mode?
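For anyone following along, here is a minimal sketch of the second item. It writes a drop-in file that enables the internal plugin. The function name and the idea of passing the config directory as an argument are assumptions made for illustration; /etc/telegraf/telegraf.d is the usual location on package installs, but check your own setup.

```shell
# Hypothetical helper: write a drop-in config that enables Telegraf's
# internal plugin, so that internal_gather,input=vsphere is reported.
# The target directory is passed in by the caller (an assumption; the
# thread does not say where the reporter keeps config files).
enable_internal_plugin() {
  conf_dir="$1"
  cat > "$conf_dir/internal.conf" <<'EOF'
[[inputs.internal]]
  collect_memstats = false
EOF
}
```

After restarting telegraf, something like `influx --database=telegraf --execute "SELECT gather_time_ns FROM internal_gather WHERE input = 'vsphere' AND time > now() - 1h"` should show how long each vsphere gather took.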

I'm attaching a zip file containing vsphere.realtime.conf, vsphere.historical.conf, a log running over the past several hours since a restart of telegraf (2019-02-19T21:00:49Z), and internal_gather_vsphere_time.png. Please see attached
cluster_metrics_limited_5397.zip

The most recent item of interest regarding clusters specifically is the vmop.* series. I should be getting them via 'cluster_metric_include = [ "*" ]' but they never arrive in Influx and there's nothing in the log.

I am able to do this:

govc metric.sample /region003/host/rg003CL100 vmop.numPoweroff.latest
rg003CL100 -  vmop.numPoweroff.latest    5032,5032,5032,5032,5032,5032  num

Do you see any improvements when using the nightly builds?

I tried telegraf-1.10.0~431c58d8-0.x86_64. I've attached a logfile and image of gather_time here.

A couple of things stand out:
1) Quite a few 'field type conflict' errors even though I cleared measurements before starting this run.
2) Errors with vpxd.stats.maxQueryMetrics, but I can't tell for which vCenter. Can you?

With the nightly build, it seems the metrics I was previously collecting, e.g. vsphere_host_cpu (readiness.average, ready_summation) and vsphere_host_mem (vmmemctl_average), are no longer being written due to the field type issues.

cluster_metrics_limited_5397-20190223.zip

This is due to a vCenter issue. When vCenter estimates the query complexity, it assumes all hosts and VMs in a cluster need to be queried and bails out because the query would be too complex. In theory, this is the correct behavior, but it has unwanted ramifications, as you have just experienced.

There are three possible workarounds:
1) Increase vpxd.stats.maxQueryMetrics or even better, set it to unlimited (-1) in your vCenter.
2) Reduce the number of metrics collected to a very small set, such as power metrics.
3) Skip cluster collection altogether and synthesize the data using queries in InfluxDB or whatever you use for analytics/visualization.
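As a rough sketch of workaround 3, the helper below builds an InfluxQL query that averages a host-level field per cluster. The function name is hypothetical, and it assumes host measurements carry a clustername tag; verify that against your own data before relying on it. Note that a plain mean over hosts is unweighted, so it will not match vCenter's own cluster figures exactly.

```shell
# Hypothetical helper: print an InfluxQL query that rolls a host-level
# field up to one series per cluster.
#   $1 = measurement (e.g. vsphere_host_cpu)
#   $2 = field       (e.g. usage_average)
#   $3 = time bucket (optional, defaults to 5m)
# Assumes a "clustername" tag exists on the host measurement.
cluster_rollup_query() {
  measurement="$1"
  field="$2"
  window="${3:-5m}"
  printf 'SELECT mean("%s") AS "%s" FROM "%s" WHERE time > now() - 1h GROUP BY time(%s), "clustername"\n' \
    "$field" "$field" "$measurement" "$window"
}
```

It could be run with `influx --database=telegraf --execute "$(cluster_rollup_query vsphere_host_cpu usage_average)"`.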

@kkruzich You might be running into https://github.com/influxdata/influxdb/issues/10052, where dropping the measurement doesn't seem to totally remove it. I have experienced this myself but it always clears up after a few minutes.

@prydin Should these values be changed back to integers?

Now running telegraf-1.10.0-1.x86_64.

  • I'm now able to see vmops metrics.
  • I've increased maxQueryMetrics on a couple of vCenters and I'm able to do 'govc metric.sample' to see results from items (e.g. mem.vmmemctl.average) which were previously restricted.
  • However after following the steps below, I still see these field type conflicts:

2019-03-07T23:35:22Z E! [outputs.influxdb]: when writing to [http://localhost:8086]: received error partial write: field type conflict: input field "vmmemctl_average" on measurement "vsphere_host_mem" is type float, already exists as type integer dropped=1000; discarding points

Remove vsphere measurements:
1) Stop telegraf and run:
influx --execute 'show measurements' --database=telegraf | grep "^vsphere" | xargs -I{} influx --database=telegraf --execute 'drop measurement "{}"'
2) Restart influxd
3) The following will return no results:
influx --execute 'show measurements' --database=telegraf | grep "^vsphere"
4) Restart telegraf.
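Before running the drop in step 1, it can help to preview exactly which statements will be issued. The filter below is a small sketch for that: it reads measurement names on stdin and prints one DROP MEASUREMENT statement per vsphere measurement without executing anything. The function name is made up for this example.

```shell
# Hypothetical dry-run helper: read measurement names (one per line) on
# stdin and print the DROP MEASUREMENT statement for each name starting
# with "vsphere". Nothing is sent to InfluxDB.
vsphere_drop_statements() {
  grep '^vsphere' | while IFS= read -r measurement; do
    printf 'DROP MEASUREMENT "%s"\n' "$measurement"
  done
}
```

Usage: `influx --execute 'show measurements' --database=telegraf | vsphere_drop_statements`, then review the output before running the real command.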

Another possibility is that the type is changing. Could you run the experiment again, but also add a file output like:

[[outputs.file]]
  files = ["/tmp/metrics.out"]

Run it until the error occurs, we can then inspect the file to see if the types are consistent.
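One way to do that inspection is sketched below. It assumes the file holds InfluxDB line protocol with numeric fields only (true for the vsphere metrics discussed here, but string fields containing spaces or commas would confuse it). The function name is hypothetical. A value with an `i` suffix is treated as an integer and any other number as a float.

```shell
# Hypothetical checker: read line protocol on stdin and report every
# measurement/field pair that appears with more than one type.
# Output format: <measurement> <field> <first type seen> <conflicting type>
field_type_conflicts() {
  awk '
  {
    line = $0
    gsub(/\\ /, "_", line)   # neutralise escaped spaces in tag values
    gsub(/\\,/, "_", line)   # neutralise escaped commas in tag values
    n = split(line, part, " ")
    if (n < 2) next          # not a metric line
    split(part[1], head, ",")
    measurement = head[1]
    nf = split(part[2], kv, ",")
    for (i = 1; i <= nf; i++) {
      eq = index(kv[i], "=")
      if (eq == 0) continue
      name = substr(kv[i], 1, eq - 1)
      val = substr(kv[i], eq + 1)
      if (val ~ /^-?[0-9]+i$/) type = "integer"
      else if (val ~ /^-?[0-9.]+([eE][-+]?[0-9]+)?$/) type = "float"
      else type = "other"
      key = measurement " " name
      if (!(key in seen)) {
        seen[key] = type
      } else if (seen[key] != type && !(key in reported)) {
        reported[key] = 1
        print measurement, name, seen[key], type
      }
    }
  }'
}
```

Usage: `field_type_conflicts < /tmp/metrics.out`. Empty output means every field kept a single type for the whole run.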

Another possibility is that the type is changing

This was not the case, and I can import your dataset into my InfluxDB without issue. Instead the type has changed from 1.9 -> 1.10:

- blah active_average=9197547i,totalCapacity_average=74317i,usage_average=74.13 1552089138000000000
+ blah active_average=6847617,totalCapacity_average=67324,usage_average=57.95 1552088760000000000

This seems to be caused by the alignSamples code, but I haven't dug in any deeper than that. @prydin What were we doing in 1.9, were we sending the latest value only?

We probably need to rename these fields for 1.10.1, or it will be a big disruption as more people upgrade. In the meantime, and this will also work around the issue in InfluxDB, I suggest adding a static tag at the bottom of the input configuration (edit: doesn't work, use name_suffix = "_foo" instead).

How about this: A flag called force_int_values that's set to true by default? That way it's 100% backwards compatible.

It is still a little problematic because it doesn't provide an easy way to move forward without stopping all Telegraf and dropping all data, but I'm not sure we can think of a new name that isn't an eyesore.

Let's try to come up with a more descriptive name though, maybe something like use_raw_samples. Maybe you can come up with a more accurate one.

My workaround above was also not working, something will have to be added to the measurement name:

[[inputs.vsphere]]
  name_suffix = "_v1.10"

use_raw_samples works for me.

I'm not sure there is an ideal solution, but I think keeping the type the same with an option as you proposed is our best choice.

I'm assuming we would like it if these could be floats, but the only way to make this transition is to rename the measurement or the field, and both of those are breaking changes for dashboards/alerts unless you keep both the new and old versions.

The option helps quite a bit, and will be sufficient for most users I think, but to do a zero-downtime upgrade you would need to do something like what is described here in the mysql plugin.

Ok. So a configurable option it is. I'll try to get it done over the weekend.

Just filed PR #5563

Introduced a use_int_samples flag ("raw" is a misnomer in this case). It's currently on by default, resulting in true backwards compatibility.

For a full discussion, please refer to the PR!

@kkruzich I added some builds with the fix in #5563 here: https://github.com/influxdata/telegraf/issues/5565#issuecomment-471684508. You should be able to select either integer (the default) or float (use_int_samples = false) type depending on what works best for you now. With these builds you should be able to properly test the maxQueryMetrics option.

I've installed telegraf-1.10.0~5970053b-0.x86_64.rpm and I'm seeing some interesting results.

Prior to setting up each of these cases, I've removed all measurements from Influx as described earlier.

  • With use_int_samples undefined (not set anywhere in the configuration files, so the default of true applies), I see field type conflicts of int -> float. Many of these are metrics I've not seen defined in the govmomi documentation (often involving a name 'resource*'). Measurements of vsphere_cluster_vmop are also getting this field type conflict. Please see attached file ft.error.use_int_samples_is_default for details.

  • With use_int_samples defined (use_int_samples = false) I see field type conflicts of float -> int and the metrics noted are entirely different from those listed when using the default for use_int_samples.
    Please see attached ft.error.use_int_samples_is_false.

I'm going to look into where these resource* metrics may be coming from and also work through each case described above to be certain the results are consistent.

ft.error.use_int_samples_is_false.gz
ft.error.use_int_samples_is_default.gz

When you deleted the data earlier and ran the previous version, you probably created some fields with float type. When you send samples as int, it's going to conflict with that. You need to drop those metrics.

As I noted earlier, for each case the telegraf version was consistent and I removed all measurements from Influx as previously described. However, it seems that method may not be good enough. What I did this time was use name_suffix = "_v1_10_5970053b" and run with use_int_samples undefined (the default). I am not seeing any field type conflicts now.

I'm going to turn some attention to https://github.com/influxdata/influxdb/issues/10052 and hopefully increase maxQueryMetrics on all vcenters by end of week.
