Telegraf: vsphere Client.Timeout exceeded while awaiting headers

Created on 11 Dec 2018 · 7 Comments · Source: influxdata/telegraf

Relevant telegraf.conf:

file telegraf.d/manu010.vm.conf

[[inputs.vsphere]]
   ## List of vCenter URLs to be monitored. These three lines must be uncommented
   ## and edited for the plugin to work.
   vcenters = [ "https://manu010.vm/sdk"]
   username = "[email protected]"
   password = "XXXXX"

   ## VMs
   ## Typical VM metrics (if omitted or empty, all metrics are collected)
   vm_metric_include = []
   vm_metric_exclude = ["*"] ## Nothing is excluded by default
   vm_instances = false ## true by default

   ## Hosts
   ## Typical host metrics (if omitted or empty, all metrics are collected)
   host_metric_include = []
   host_metric_exclude = ["*"] ## Nothing excluded by default
   host_instances = false ## true by default

   ## Clusters
   cluster_metric_include = [] ## if omitted or empty, all metrics are collected
   cluster_metric_exclude = [] ## Nothing excluded by default
   cluster_instances = true ## true by default

   ## Datastores
   datastore_metric_include = [] ## if omitted or empty, all metrics are collected
   datastore_metric_exclude = ["*"] ## Nothing excluded by default
   datastore_instances = false ## false by default for Datastores only

   ## Datacenters
   datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
   datacenter_metric_exclude = [] ## Datacenters are not collected by default.
   datacenter_instances = true ## false by default for Datastores only

   max_query_objects = 100
   max_query_metrics = 100
   collect_concurrency = 30
   discover_concurrency = 10
   force_discover_on_init = false
   object_discovery_interval = "1600s"
   insecure_skip_verify = true

file telegraf.d/manu001.vm.conf

[[inputs.vsphere]]
   vcenters = [ "https://manu001.vm/sdk"]
   username = "[email protected]"
   password = "XXXXX"

   ## VMs
   ## Typical VM metrics (if omitted or empty, all metrics are collected)
   vm_metric_include = []
   vm_metric_exclude = ["*"] ## Nothing is excluded by default
   vm_instances = false ## true by default

   ## Hosts
   ## Typical host metrics (if omitted or empty, all metrics are collected)
   host_metric_include = []
   host_metric_exclude = ["*"] ## Nothing excluded by default
   host_instances = false ## true by default

   ## Clusters
   cluster_metric_include = [] ## if omitted or empty, all metrics are collected
   cluster_metric_exclude = [] ## Nothing excluded by default
   cluster_instances = true ## true by default

   ## Datastores
   datastore_metric_include = [] ## if omitted or empty, all metrics are collected
   datastore_metric_exclude = ["*"] ## Nothing excluded by default
   datastore_instances = false ## false by default for Datastores only

   ## Datacenters
   datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
   datacenter_metric_exclude = [] ## Datacenters are not collected by default.
   datacenter_instances = true ## false by default for Datastores only

   max_query_objects = 100
   max_query_metrics = 1000
   collect_concurrency = 30
   discover_concurrency = 10
   force_discover_on_init = false
   object_discovery_interval = "1600s"
   insecure_skip_verify = true

### System info:
OS: Debian 8.11
ii  telegraf                                              1.9.0-1                         amd64                           Plugin-driven server agent for reporting metrics into InfluxDB.

Immediately after Telegraf starts, manu010.vm outputs this error:

2018-12-11T11:25:14Z D! [input.vsphere]: Starting plugin
2018-12-11T11:25:14Z D! [input.vsphere]: Starting plugin
2018-12-11T11:25:14Z D! [input.vsphere]: Creating client: manu010.vm
2018-12-11T11:25:14Z D! [input.vsphere]: Creating client: manu001.vm
2018-12-11T11:25:14Z E! [input.vsphere]: Error in discovery for manu010.vm: ServerFaultCode: Permission to perform this operation was denied.
2018-12-11T11:25:14Z D! [input.vsphere] vCenter maxQueryMetrics is defined: -1
2018-12-11T11:25:14Z D! [input.vsphere] vCenter says max_query_metrics should be 10000
2018-12-11T11:25:14Z D! [input.vsphere]: Discover new objects for manu001.vm
2018-12-11T11:25:14Z D! [input.vsphere] Discovering resources for datacenter
2018-12-11T11:25:14Z D! [input.vsphere]: No parent found for Folder:group-d1 (ascending from Folder:group-d1)
2018-12-11T11:25:15Z D! [input.vsphere] Discovering resources for cluster
2018-12-11T11:25:15Z D! [input.vsphere] Discovering resources for host
2018-12-11T11:25:16Z D! [input.vsphere] Discovering resources for vm
2018-12-11T11:25:16Z D! [input.vsphere] Discovering resources for datastore

No metrics are stored in InfluxDB for manu010.vm.

From time to time, errors also appear for the other vCenter, manu001.vm:

2018-12-11T11:31:00Z E! [inputs.vsphere]: Error in plugin: Post https://manu001.vm/sdk: context deadline exceeded
2018-12-11T11:31:00Z E! [inputs.vsphere]: Error in plugin: Post https://manu001.vm/sdk: context deadline exceeded

There are metrics for manu001.vm, but not continuously: collection fails for an hour and then resumes.

area/vsphere bug

ragonlan

All 7 comments

Add timeout = "60s" to the config!
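As a minimal sketch of where this setting would go, assuming the manu001.vm file from the report (all other options stay as they are and are omitted here), the `timeout` key sits directly inside the `[[inputs.vsphere]]` table:

```toml
[[inputs.vsphere]]
  vcenters = [ "https://manu001.vm/sdk" ]

  ## Timeout applied to requests made to vCenter, as suggested in this comment.
  ## Remaining options from the original file are omitted for brevity.
  timeout = "60s"
```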

prydin on 11 Dec 2018

@prydin Can you help me understand this recommendation better? Isn't the default timeout already 60s in 1.9.0?

danielnelson on 11 Dec 2018

I was assuming this was 1.8.x, but maybe not.

prydin on 12 Dec 2018

Yes, I run Telegraf 1.9:
telegraf 1.9.0-1 amd64
on Debian 9.6

I set the value to 60s, but the timeouts keep coming. I then raised it to 120s and nothing changed. It is a pity that debug mode does not show what the plugin is doing when the timeouts or the ServerFaultCode occur.

ragonlan on 12 Dec 2018

Try this with version 1.10 of telegraf. There's been some improvement of the timeout handling.

prydin on 9 Mar 2019

Hi,
For info, I had the same issue, and the problem came from a configuration I copied and pasted from a website. In the timeout setting, the author had forgotten the "s" at the end of timeout = "1800s".
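To illustrate the difference, here is a sketch of the two forms. The unit-less line is only an assumption about what the copied configuration looked like, since the comment does not quote it exactly, and how Telegraf interprets such a value may depend on the version:

```toml
[[inputs.vsphere]]
  ## Assumed form of the copied value, with no unit suffix (kept commented out):
  # timeout = "1800"

  ## Duration with an explicit seconds suffix:
  timeout = "1800s"
```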

Good evening!

Venopsis on 5 May 2019

👍5

Tried with 1.10.3 and timeout = "1800s" ... the messages are still coming ...