CentOS Linux release 7.6.1810 (Core)
Telegraf 1.9.2 (git: HEAD dda80799)
influxdb-1.7.2-1
The graph should be constant.

2019-01-16T13:41:10Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-16T13:41:20Z D! [outputs.influxdb] wrote batch of 28 metrics in 7.071246ms
2019-01-16T13:41:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics.
2019-01-16T13:41:20Z W! [agent] input "inputs.vsphere" did not complete within its interval
If I restart Telegraf, it works fine for a few minutes, then the same error starts again and no graph is produced.
Could you check if this still occurs in the nightly builds?
@danielnelson Thank you! I moved to the version below, as per your suggestion.
Telegraf unknown (git: master e95b88e0)
All graphs are now fairly constant, but there is an issue with CPU collection only, which is missing for all ESXi hosts.
The CPU graph has gaps:

It should look like this:

It may be that you need to increase your collection interval, depending on how many values you are collecting and the amount of time it takes. One way to see how much time the plugin is taking is to enable the internal plugin and look at the internal_gather,plugin=vsphere and internal_vsphere metrics.
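To illustrate, enabling the internal plugin could look like the minimal sketch below. This is an assumption about your setup, not your exact config; `collect_memstats` is the plugin's only notable option.

```toml
# Collect Telegraf's own metrics, including per-plugin gather timings.
[[inputs.internal]]
  ## Also collect Telegraf memory stats (optional).
  collect_memstats = true
```

With this enabled, the `gather_time_ns` field on `internal_gather` (tagged with the plugin name) shows how long each vsphere collection cycle actually takes.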
Can you also attach your configuration for this plugin?
Please share your configuration file with us! Shorter collection interval than 60s is generally not recommended for vsphere. You may also want to increase collection concurrency.
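For example, the concurrency settings live in the plugin section itself. This is a hedged sketch with a hypothetical vCenter URL and placeholder credentials; the right values depend on your inventory size.

```toml
[[inputs.vsphere]]
  vcenters = [ "https://vc.example.com/sdk" ]  # hypothetical URL
  username = "user"
  password = "pass"
  ## Goroutines used for metric collection and object discovery;
  ## raising these can help larger inventories finish within the interval.
  collect_concurrency = 3
  discover_concurrency = 3
```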
FYI: 4 vCenters are configured, with 500+ hosts altogether.
I have also observed this error many times: [inputs.vsphere]: Error in plugin: While collecting host: Post https://VC/sdk: context deadline exceeded
FYI: It fails to collect metrics where a larger number of hosts is present. (The 2 vCenters with only a few hosts each show a constant graph.) The issue is only with the larger inventory.
Okay! Here is the config file:
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
debug = true
quiet = false
logfile = "/var/log/telegraf/telegraf.log"
hostname = ""
omit_hostname = false
[[outputs.influxdb]]
urls = ["http://127.0.0.1:8086"]
database = "telegraf"
timeout = "5s"
username = "user"
password = "pass"
[[inputs.vsphere]]
vcenters = [ "VC1", "VC2", "VC3", "VC4" ]
username = "user"
password = "pass"
vm_metric_exclude = [ "*" ]
host_metric_include = [
"cpu.coreUtilization.average",
"cpu.costop.summation",
"cpu.demand.average",
"cpu.idle.summation",
"cpu.latency.average",
"cpu.readiness.average",
"cpu.ready.summation",
"cpu.swapwait.summation",
"cpu.usage.average",
"cpu.usagemhz.average",
"cpu.used.summation",
"cpu.utilization.average",
"cpu.wait.summation",
"mem.active.average",
"mem.latency.average",
"mem.state.latest",
"mem.swapin.average",
"mem.swapinRate.average",
"mem.swapout.average",
"mem.swapoutRate.average",
"mem.totalCapacity.average",
"mem.usage.average",
"mem.vmmemctl.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.droppedRx.summation",
"net.droppedTx.summation",
"net.errorsRx.summation",
"net.errorsTx.summation",
"net.usage.average",
"power.power.average",
"storageAdapter.numberReadAveraged.average",
"storageAdapter.numberWriteAveraged.average",
"storageAdapter.read.average",
"storageAdapter.write.average",
"sys.uptime.latest",
]
cluster_metric_include = []
datastore_metric_exclude = [ "*" ]
insecure_skip_verify = true
I have the same issue with versions 1.9.0-1 and 1.9.1-1; there is no problem with version 1.8.3-1.
FYI: 2 vCenters are configured, with fewer than 10 hosts altogether.
Here is the config file:
[global_tags]
[agent]
interval = "15s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
precision = ""
debug = false
quiet = false
logfile = "/var/log/telegraf/telegraf.log"
hostname = ""
omit_hostname = false
[[outputs.prometheus_client]]
listen = ":9273"
[[inputs.vsphere]]
vcenters = [ "VC1","VC2" ]
username = "*"
password = ""
insecure_skip_verify = true
Here is the log:
2019-01-21T03:24:17Z I! [agent] Hang on, flushing any cached metrics before shutdown
2019-01-21T03:24:18Z I! Loaded inputs: inputs.mem inputs.system inputs.cpu inputs.diskio inputs.kernel inputs.processes inputs.swap inputs.vsphere inputs.disk
2019-01-21T03:24:18Z I! Loaded aggregators:
2019-01-21T03:24:18Z I! Loaded processors:
2019-01-21T03:24:18Z I! Loaded outputs: prometheus_client
2019-01-21T03:24:18Z I! Tags enabled: host=bocloud
2019-01-21T03:24:18Z I! [agent] Config: Interval:15s, Quiet:false, Hostname:"localhost", Flush Interval:10s
2019-01-21T03:24:18Z W! [input.vsphere] Configured max_query_metrics is 256, but server limits it to 64. Reducing.
2019-01-21T03:24:45Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:25:00Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:25:15Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:25:30Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:25:45Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:27:30Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:27:45Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:28:00Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:28:15Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:28:30Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:28:50Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:29:05Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:29:20Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:29:35Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:29:50Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:30:05Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:30:15Z E! [inputs.vsphere]: Error in plugin: While collecting host: ServerFaultCode: A specified parameter was not correct: querySpec.startTime, querySpec.endTime
2019-01-21T03:30:15Z E! [inputs.vsphere]: Error in plugin: While collecting vm: ServerFaultCode: A specified parameter was not correct: querySpec.startTime, querySpec.endTime
2019-01-21T03:32:30Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:32:45Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:33:00Z W! [agent] input "inputs.vsphere" did not complete within its interval
2019-01-21T03:33:15Z W! [agent] input "inputs.vsphere" did not complete within its interval
It looks like your interval is set to 15s. You need to increase that to at least 20s!
Keep in mind that vSphere data is only available every 20s, so you should never specify an interval lower than that. We generally recommend you keep the collection interval at 60s due to the load shorter intervals may put on the vCenter server. However, we have been able to successfully use a 20s interval on a vCenter managing 7000 VMs, so it is possible, albeit not recommended.
If you truly need 20s granularity on your data, I recommend you do two things:
1) Move collection of clusters, datacenters and datastores to a separate instance of the plugin. These metrics are only available at a 300s interval, so it's not useful to collect them more often than that. Also, since they're stored on disk and not in memory, they take considerably longer to fetch. Here's a writeup I did on this that you might find helpful: http://docs-dev.wavefront.com/integrations_vsphere.html (Note to self: Add something similar to the README)
2) Once you've made the changes above, you should increase both discover_concurrency and collect_concurrency to 3. This should give you an extra performance boost!
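Concretely, the two-instance split described in steps 1 and 2 could look like the sketch below. The vCenter URL and credentials are hypothetical; the per-plugin `interval` override is a standard Telegraf feature.

```toml
## Instance 1: realtime resources (VMs and hosts), available every 20s
[[inputs.vsphere]]
  interval = "20s"
  vcenters = [ "https://vc.example.com/sdk" ]  # hypothetical URL
  username = "user"
  password = "pass"
  datastore_metric_exclude = [ "*" ]
  cluster_metric_exclude = [ "*" ]
  datacenter_metric_exclude = [ "*" ]
  collect_concurrency = 3
  discover_concurrency = 3

## Instance 2: non-realtime resources (datastores, clusters, datacenters),
## only available at 300s resolution, so collected less often
[[inputs.vsphere]]
  interval = "300s"
  vcenters = [ "https://vc.example.com/sdk" ]
  username = "user"
  password = "pass"
  vm_metric_exclude = [ "*" ]
  host_metric_exclude = [ "*" ]
```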
It works, thanks a lot
No, it only fails when datastore metrics are included.
If datastore metrics are excluded, no missing values are observed (i.e., the graph is stable).
As stated above, you need to declare a separate instance of the plugin and specify data stores, clusters and data centers in that instance with a 300s interval.
Following your suggestion, the real-time metrics are good, but the non-real-time data is not continuous.
How can I reliably get the non-real-time data? Is this a bug? See #5322.

What's the collection interval on the non-real-time metrics? It should be 300s or higher. Also, can you run with the -debug flag and send us the logs?
And yes, it might be because of a bug that's scheduled to be fixed in 1.10. You may want to try the latest nightly build from master and see if it fixes the issue.
Yes!
object_discovery_interval = "300s"
collect_concurrency = 4
discover_concurrency = 4
## Data-collection interval
interval = "60s"
But the error below is observed in the logs:
2019-01-29T06:36:03Z E! [inputs.vsphere]: Error in plugin: While collecting host: Post https://example.com/sdk: context deadline exceeded
2019-01-29T06:37:03Z E! [inputs.vsphere]: Error in plugin: While collecting host: Post https://example.com/sdk: context deadline exceeded
and during this time, the data shows no values.
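A "context deadline exceeded" error means the vCenter API request timed out before completing. One thing worth trying (an assumption on my part, not a guaranteed fix) is raising the plugin's request timeout, which is a documented option:

```toml
[[inputs.vsphere]]
  # ... existing vcenters, credentials, and metric filters ...
  ## Timeout applied to any API request made to vCenter
  timeout = "180s"
```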
Same problem here!
I've used a release that @prydin made for another issue and it worked like a charm.
Functional release:
Telegraf unknown (git: prydin-scale-improvement 646c5960)
Non-functional release:
telegraf-1.9.4-1.x86_64
I only get "inputs.vsphere did not complete within its interval".
PROBLEM
No graph is generated!
CONFIG
#[global_tags]
[agent]
interval = '300s'
round_interval = true
metric_batch_size = 10000
metric_buffer_limit = 100000
collection_jitter = '0s'
flush_interval = '10s'
flush_jitter = '0s'
precision = ''
debug = true
quiet = false
logfile = '/var/log/telegraf/telegraf.log'
hostname = ''
omit_hostname = false
[[outputs.influxdb]]
urls = ['http://foo.bar.com:8086']
database = 'telegraf'
[[inputs.vsphere]]
vcenters = [ 25 x vcenter ]
username = 'foobar'
password = 'barfoo'
vm_metric_include = [
'sys.osUptime.latest',
'cpu.usage.average',
'disk.read.average',
'cpu.usage.average',
'cpu.demand.average',
'cpu.idle.summation',
'cpu.latency.average',
'cpu.readiness.average',
'cpu.ready.summation',
'cpu.run.summation',
'cpu.usagemhz.average',
'cpu.used.summation',
'cpu.wait.summation',
'mem.active.average',
'mem.granted.average',
'mem.latency.average',
'mem.swapin.average',
'mem.swapinRate.average',
'mem.swapout.average',
'mem.swapoutRate.average',
'mem.usage.average',
'mem.vmmemctl.average',
'net.bytesRx.average',
'net.bytesTx.average',
'net.droppedRx.summation',
'net.droppedTx.summation',
'net.usage.average',
'power.power.average',
'virtualDisk.numberReadAveraged.average',
'virtualDisk.numberWriteAveraged.average',
'virtualDisk.read.average',
'virtualDisk.readOIO.latest',
'virtualDisk.throughput.usage.average',
'virtualDisk.totalReadLatency.average',
'virtualDisk.totalWriteLatency.average',
'virtualDisk.write.average',
'virtualDisk.writeOIO.latest',
'sys.uptime.latest',
]
host_metric_include = [
'cpu.coreUtilization.average',
'cpu.costop.summation',
'cpu.demand.average',
'cpu.idle.summation',
'cpu.latency.average',
'cpu.readiness.average',
'cpu.ready.summation',
'cpu.swapwait.summation',
'cpu.usage.average',
'cpu.usagemhz.average',
'cpu.used.summation',
'cpu.utilization.average',
'cpu.wait.summation',
'disk.deviceReadLatency.average',
'disk.deviceWriteLatency.average',
'disk.kernelReadLatency.average',
'disk.kernelWriteLatency.average',
'disk.numberReadAveraged.average',
'disk.numberWriteAveraged.average',
'disk.read.average',
'disk.totalReadLatency.average',
'disk.totalWriteLatency.average',
'disk.write.average',
'mem.active.average',
'mem.latency.average',
'mem.state.latest',
'mem.swapin.average',
'mem.swapinRate.average',
'mem.swapout.average',
'mem.swapoutRate.average',
'mem.totalCapacity.average',
'mem.usage.average',
'mem.vmmemctl.average',
'net.bytesRx.average',
'net.bytesTx.average',
'net.droppedRx.summation',
'net.droppedTx.summation',
'net.errorsRx.summation',
'net.errorsTx.summation',
'net.usage.average',
'power.power.average',
'storageAdapter.numberReadAveraged.average',
'storageAdapter.numberWriteAveraged.average',
'storageAdapter.read.average',
'storageAdapter.write.average',
'sys.uptime.latest',
]
datastore_metric_include = [
'datastore.numberReadAveraged.average',
'datastore.throughput.contention.average',
'datastore.throughput.usage.average',
'datastore.write.average',
'datastore.read.average',
'datastore.numberWriteAveraged.average',
'disk.used.latest',
'disk.provisioned.latest',
'disk.capacity.latest',
'disk.capacity.contention.average',
'disk.capacity.provisioned.average',
'disk.capacity.usage.average'
]
cluster_metric_include = []
datacenter_metric_exclude = [ '*' ]
collect_concurrency = 2
discover_concurrency = 1
object_discovery_interval = '1200s'
insecure_skip_verify = true
I am having the same issue with this build: 1.9.4-1.x86_64.
Please help me fix this or recommend the right Telegraf build. I have 200+ hosts with 2000+ VMs.
[agent] input "inputs.vsphere" did not complete within its interval
# Read metrics from one or many vCenters
[[inputs.vsphere]]
## List of vCenter URLs to be monitored. These three lines must be uncommented
## and edited for the plugin to work.
vcenters = [ "https://xhd/sdk" ]
username = "sdf"
password = "password"
## VMs
## Typical VM metrics (if omitted or empty, all metrics are collected)
vm_metric_include = [
"mem.usage.average",
"net.usage.average",
]
vm_metric_exclude = [
"power.power.average",
"virtualDisk.numberReadAveraged.average",
"virtualDisk.numberWriteAveraged.average",
"virtualDisk.read.average",
"virtualDisk.readOIO.latest",
"virtualDisk.throughput.usage.average",
"virtualDisk.totalReadLatency.average",
"virtualDisk.totalWriteLatency.average",
"virtualDisk.write.average",
"virtualDisk.writeOIO.latest",
"sys.uptime.latest",
"mem.vmmemctl.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.droppedRx.summation",
"net.droppedTx.summation",
"cpu.wait.summation",
"mem.active.average",
"mem.granted.average",
"mem.latency.average",
"mem.swapin.average",
"mem.swapinRate.average",
"mem.swapout.average",
"mem.swapoutRate.average",
"cpu.run.summation",
"cpu.demand.average",
"cpu.idle.summation",
"cpu.latency.average",
"cpu.readiness.average",
"cpu.ready.summation",
"cpu.usagemhz.average",
"cpu.used.summation",
]
## Nothing is excluded by default
# vm_instances = true ## true by default
## Hosts
## Typical host metrics (if omitted or empty, all metrics are collected)
host_metric_include = [
"cpu.usage.average",
"cpu.usagemhz.average",
"cpu.used.summation",
"cpu.utilization.average",
"cpu.wait.summation",
"mem.usage.average",
"net.usage.average",
]
host_metric_exclude = [
"power.power.average",
"storageAdapter.numberReadAveraged.average",
"storageAdapter.numberWriteAveraged.average",
"storageAdapter.read.average",
"storageAdapter.write.average",
"sys.uptime.latest",
"mem.vmmemctl.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.droppedRx.summation",
"net.droppedTx.summation",
"net.errorsRx.summation",
"net.errorsTx.summation",
"disk.deviceReadLatency.average",
"disk.deviceWriteLatency.average",
"disk.kernelReadLatency.average",
"disk.kernelWriteLatency.average",
"disk.numberReadAveraged.average",
"disk.numberWriteAveraged.average",
"disk.read.average",
"disk.totalReadLatency.average",
"disk.totalWriteLatency.average",
"disk.write.average",
"mem.active.average",
"mem.latency.average",
"mem.state.latest",
"mem.swapin.average",
"mem.swapinRate.average",
"mem.swapout.average",
"mem.swapoutRate.average",
"mem.totalCapacity.average",
"cpu.coreUtilization.average",
"cpu.costop.summation",
"cpu.demand.average",
"cpu.idle.summation",
"cpu.latency.average",
"cpu.readiness.average",
"cpu.ready.summation",
"cpu.swapwait.summation",
]
## Nothing excluded by default
# host_instances = true ## true by default
## Clusters
# cluster_metric_include = [] ## if omitted or empty, all metrics are collected
# cluster_metric_exclude = [] ## Nothing excluded by default
# cluster_instances = true ## true by default
## Datastores
# datastore_metric_include = [] ## if omitted or empty, all metrics are collected
# datastore_metric_exclude = [] ## Nothing excluded by default
# datastore_instances = false ## false by default for Datastores only
## Datacenters
datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
# datacenter_instances = false ## false by default for Datastores only
## Plugin Settings
## separator character to use for measurement and field names (default: "_")
# separator = "_"
## number of objects to retrieve per query for realtime resources (vms and hosts)
## set to 64 for vCenter 5.5 and 6.0 (default: 256)
# max_query_objects = 256
## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
## set to 64 for vCenter 5.5 and 6.0 (default: 256)
# max_query_metrics = 256
## number of go routines to use for collection and discovery of objects and metrics
collect_concurrency = 5
discover_concurrency = 3
## whether or not to force discovery of new objects on initial gather call before collecting metrics
## when true for large environments this may cause errors for time elapsed while collecting metrics
## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
# force_discover_on_init = false
## the interval before (re)discovering objects subject to metrics collection (default: 300s)
# object_discovery_interval = "300s"
## timeout applies to any of the api request made to vcenter
timeout = "180s"
## Optional SSL Config
# ssl_ca = "/path/to/cafile"
# ssl_cert = "/path/to/certfile"
# ssl_key = "/path/to/keyfile"
## Use SSL but skip chain & host verification
insecure_skip_verify = true
@sunnybhatnagar Can you try with the latest nightly build?
@danielnelson I have the same issue; just tried the nightly build and that fixes the issue!
But now I got weird failed events in my vcenter, see:


@MartVisser Can you open a new issue for that side effect?
I'm going to close this issue, if anyone is having issues with the plugin not completing by the end of the interval please read the hints above and try again with the nightly builds. If you still have problems after that, please open a new issue.