Telegraf: vsphere plugin does not connect locally

Created on 4 Oct 2018 · 17 Comments · Source: influxdata/telegraf

Relevant telegraf.conf:

# Telegraf configuration

# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.

# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.

# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.

# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will cache metric_buffer_limit metrics for each output, and will
  ## flush this buffer on a successful write.
  metric_buffer_limit = 1000
  ## Flush the buffer whenever full, regardless of flush_interval.
  flush_buffer_when_full = true

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## Logging configuration:
  ## Run telegraf in debug mode
  debug = false
  ## Run telegraf in quiet mode
  quiet = false
  ## Specify the log file name. The empty string means to log to stdout.
  logfile = "/Program Files/Telegraf/telegraf.log"

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""


###############################################################################
#                                  OUTPUTS                                    #
###############################################################################

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  # The full HTTP or UDP endpoint URL for your InfluxDB instance.
  # Multiple urls can be specified but it is assumed that they are part of the same
  # cluster, this means that only ONE of the urls will be written to each interval.
  # urls = ["udp://127.0.0.1:8089"] # UDP endpoint example
  urls = ["http://172.27.200.17:8086"] # required
  # The target database for metrics (telegraf will create it if not exists)
  database = "telegraf" # required
  # Precision of writes, valid values are "ns", "us" (or "µs"), "ms", "s", "m", "h".
  # note: using second precision greatly helps InfluxDB compression
  precision = "s"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "5s"
  # username = "telegraf"
  # password = "metricsmetricsmetricsmetrics"
  # Set the user agent for HTTP POSTs (can be useful for log differentiation)
  # user_agent = "telegraf"
  # Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes)
  # udp_payload = 512


###############################################################################
#                                  INPUTS                                     #
###############################################################################

# Windows Performance Counters plugin.
# These are the recommended method of monitoring system metrics on windows,
# as the regular system plugins (inputs.cpu, inputs.mem, etc.) rely on WMI,
# which utilizes more system resources.
#
# See more configuration examples at:
#   https://github.com/influxdata/telegraf/tree/master/plugins/inputs/win_perf_counters

[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    # Processor usage, alternative to native, reports on a per core.
    ObjectName = "Processor"
    Instances = ["*"]
    Counters = [
      "% Idle Time",
      "% Interrupt Time",
      "% Privileged Time",
      "% User Time",
      "% Processor Time",
      "% DPC Time",
    ]
    Measurement = "win_cpu"
    # Set to true to include _Total instance when querying for all (*).
    IncludeTotal=true

  [[inputs.win_perf_counters.object]]
    # Disk times and queues
    ObjectName = "LogicalDisk"
    Instances = ["*"]
    Counters = [
      "% Idle Time",
      "% Disk Time",
      "% Disk Read Time",
      "% Disk Write Time",
      "Current Disk Queue Length",
      "% Free Space",
      "Free Megabytes",
    ]
    Measurement = "win_disk"
    # Set to true to include _Total instance when querying for all (*).
    #IncludeTotal=false

  [[inputs.win_perf_counters.object]]
    ObjectName = "PhysicalDisk"
    Instances = ["*"]
    Counters = [
      "Disk Read Bytes/sec",
      "Disk Write Bytes/sec",
      "Current Disk Queue Length",
      "Disk Reads/sec",
      "Disk Writes/sec",
      "% Disk Time",
      "% Disk Read Time",
      "% Disk Write Time",
    ]
    Measurement = "win_diskio"

  [[inputs.win_perf_counters.object]]
    ObjectName = "Network Interface"
    Instances = ["*"]
    Counters = [
      "Bytes Received/sec",
      "Bytes Sent/sec",
      "Packets Received/sec",
      "Packets Sent/sec",
      "Packets Received Discarded",
      "Packets Outbound Discarded",
      "Packets Received Errors",
      "Packets Outbound Errors",
    ]
    Measurement = "win_net"

  [[inputs.win_perf_counters.object]]
    ObjectName = "System"
    Counters = [
      "Context Switches/sec",
      "System Calls/sec",
      "Processor Queue Length",
      "System Up Time",
    ]
    Instances = ["------"]
    Measurement = "win_system"
    # Set to true to include _Total instance when querying for all (*).
    #IncludeTotal=false

  [[inputs.win_perf_counters.object]]
    # Example query where the Instance portion must be removed to get data back,
    # such as from the Memory object.
    ObjectName = "Memory"
    Counters = [
      "Available Bytes",
      "Cache Faults/sec",
      "Demand Zero Faults/sec",
      "Page Faults/sec",
      "Pages/sec",
      "Transition Faults/sec",
      "Pool Nonpaged Bytes",
      "Pool Paged Bytes",
      "Standby Cache Reserve Bytes",
      "Standby Cache Normal Priority Bytes",
      "Standby Cache Core Bytes",

    ]
    # Use 6 x - to remove the Instance bit from the query.
    Instances = ["------"]
    Measurement = "win_mem"
    # Set to true to include _Total instance when querying for all (*).
    #IncludeTotal=false

  [[inputs.win_perf_counters.object]]
    # Example query where the Instance portion must be removed to get data back,
    # such as from the Paging File object.
    ObjectName = "Paging File"
    Counters = [
      "% Usage",
    ]
    Instances = ["_Total"]
    Measurement = "win_swap"



# Windows system plugins using WMI (disabled by default, using
# win_perf_counters over WMI is recommended)

# # Read metrics about cpu usage
# [[inputs.cpu]]
#   ## Whether to report per-cpu stats or not
#   percpu = true
#   ## Whether to report total system cpu stats or not
#   totalcpu = true
#   ## Comment this line if you want the raw CPU time metrics
#   fielddrop = ["time_*"]


# # Read metrics about disk usage by mount point
# [[inputs.disk]]
#   ## By default, telegraf gather stats for all mountpoints.
#   ## Setting mountpoints will restrict the stats to the specified mountpoints.
#   ## mount_points=["/"]
#
#   ## Ignore some mountpoints by filesystem type. For example (dev)tmpfs (usually
#   ## present on /run, /var/run, /dev/shm or /dev).
#   # ignore_fs = ["tmpfs", "devtmpfs", "devfs", "overlay", "aufs", "squashfs"]


# # Read metrics about disk IO by device
# [[inputs.diskio]]
#   ## By default, telegraf will gather stats for all devices including
#   ## disk partitions.
#   ## Setting devices will restrict the stats to the specified devices.
#   ## devices = ["sda", "sdb"]
#   ## Uncomment the following line if you do not need disk serial numbers.
#   ## skip_serial_number = true


# # Read metrics about memory usage
# [[inputs.mem]]
#   # no configuration


# # Read metrics about swap memory usage
# [[inputs.swap]]
#   # no configuration

# Read metrics from one or many vCenters
[[inputs.vsphere]]
  ## List of vCenter URLs to be monitored. These three lines must be uncommented
  ## and edited for the plugin to work.
  vcenters = [ "https://localhost:9443/" ]
  username = "username"
  password = "password"

  ## VMs
  ## Typical VM metrics (if omitted or empty, all metrics are collected)
  vm_metric_include = [
    "cpu.demand.average",
    "cpu.idle.summation",
    "cpu.latency.average",
    "cpu.readiness.average",
    "cpu.ready.summation",
    "cpu.run.summation",
    "cpu.usagemhz.average",
    "cpu.used.summation",
    "cpu.wait.summation",
    "mem.active.average",
    "mem.granted.average",
    "mem.latency.average",
    "mem.swapin.average",
    "mem.swapinRate.average",
    "mem.swapout.average",
    "mem.swapoutRate.average",
    "mem.usage.average",
    "mem.vmmemctl.average",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "net.droppedRx.summation",
    "net.droppedTx.summation",
    "net.usage.average",
    "power.power.average",    
    "virtualDisk.numberReadAveraged.average",
    "virtualDisk.numberWriteAveraged.average",
    "virtualDisk.read.average",
    "virtualDisk.readOIO.latest",
    "virtualDisk.throughput.usage.average",
    "virtualDisk.totalReadLatency.average",
    "virtualDisk.totalWriteLatency.average",
    "virtualDisk.write.average",
    "virtualDisk.writeOIO.latest",
    "sys.uptime.latest",
  ]
  # vm_metric_exclude = [] ## Nothing is excluded by default
  # vm_instances = true ## true by default

  ## Hosts 
  ## Typical host metrics (if omitted or empty, all metrics are collected)
  host_metric_include = [
    "cpu.coreUtilization.average",
    "cpu.costop.summation",
    "cpu.demand.average",
    "cpu.idle.summation",
    "cpu.latency.average",
    "cpu.readiness.average",
    "cpu.ready.summation",
    "cpu.swapwait.summation",
    "cpu.usage.average",
    "cpu.usagemhz.average",
    "cpu.used.summation",
    "cpu.utilization.average",
    "cpu.wait.summation",
    "disk.deviceReadLatency.average",
    "disk.deviceWriteLatency.average",
    "disk.kernelReadLatency.average",
    "disk.kernelWriteLatency.average",
    "disk.numberReadAveraged.average",
    "disk.numberWriteAveraged.average",
    "disk.read.average",
    "disk.totalReadLatency.average",
    "disk.totalWriteLatency.average",
    "disk.write.average",
    "mem.active.average",
    "mem.latency.average",
    "mem.state.latest",
    "mem.swapin.average",
    "mem.swapinRate.average",
    "mem.swapout.average",
    "mem.swapoutRate.average",
    "mem.totalCapacity.average",
    "mem.usage.average",
    "mem.vmmemctl.average",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "net.droppedRx.summation",
    "net.droppedTx.summation",
    "net.errorsRx.summation",
    "net.errorsTx.summation",
    "net.usage.average",
    "power.power.average",
    "storageAdapter.numberReadAveraged.average",
    "storageAdapter.numberWriteAveraged.average",
    "storageAdapter.read.average",
    "storageAdapter.write.average",
    "sys.uptime.latest",
  ]
  # host_metric_exclude = [] ## Nothing excluded by default
   host_instances = true ## true by default

  ## Clusters 
  # cluster_metric_include = [] ## if omitted or empty, all metrics are collected
  # cluster_metric_exclude = [] ## Nothing excluded by default
   cluster_instances = true ## true by default

  ## Datastores 
  # datastore_metric_include = [] ## if omitted or empty, all metrics are collected
  # datastore_metric_exclude = [] ## Nothing excluded by default
   datastore_instances = false ## false by default for Datastores only

  ## Datacenters
  datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
  datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
   datacenter_instances = false ## false by default for Datastores only

  ## Plugin Settings  
  ## separator character to use for measurement and field names (default: "_")
  # separator = "_"

  ## number of objects to retrieve per query for realtime resources (vms and hosts)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_objects = 256

  ## number of metrics to retrieve per query for non-realtime resources (clusters and datastores)
  ## set to 64 for vCenter 5.5 and 6.0 (default: 256)
  # max_query_metrics = 256

  ## number of go routines to use for collection and discovery of objects and metrics
  # collect_concurrency = 1
  # discover_concurrency = 1

  ## whether or not to force discovery of new objects on initial gather call before collecting metrics
  ## when true for large environments this may cause errors for time elapsed while collecting metrics
  ## when false (default) the first collection cycle may result in no or limited metrics while objects are discovered
  # force_discover_on_init = false

  ## the interval before (re)discovering objects subject to metrics collection (default: 300s)
  # object_discovery_interval = "300s"

  ## timeout applies to any of the api request made to vcenter
  # timeout = "20s"

  ## Optional SSL Config
  # ssl_ca = "/path/to/cafile"
  # ssl_cert = "/path/to/certfile"
  # ssl_key = "/path/to/keyfile"
  ## Use SSL but skip chain & host verification
   insecure_skip_verify = true

System info:

Windows 2012 R2 Server
Telegraf 1.8.0 RC2

Steps to reproduce:

I tried several different ways of configuring the vsphere plugin.
The log file output is always the same.

Expected behavior:

Data should be collected without error.

Actual behavior:

2018-10-04T17:11:43Z I! Loaded processors: 
2018-10-04T17:11:43Z I! Loaded outputs: influxdb
2018-10-04T17:11:43Z I! Tags enabled: host=TMCFPWVC01
2018-10-04T17:11:43Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"TMCFPWVC01", Flush Interval:10s 
2018-10-04T17:11:50Z E! [input.vsphere]: Error in discovery for localhost:9443: expected element type <Envelope> but have <html>
2018-10-04T17:11:50Z E! Error in plugin [inputs.vsphere]: expected element type <Envelope> but have <html>
2018-10-04T17:12:00Z E! Error in plugin [inputs.vsphere]: expected element type <Envelope> but have <html>
2018-10-04T17:12:10Z E! Error in plugin [inputs.vsphere]: expected element type <Envelope> but have <html>
2018-10-04T17:12:20Z E! Error in plugin

unnamed3788

All 17 comments

Can you confirm that you can make a request to localhost:9443 using the username and password in the config file, and that you get valid data back?

glinton on 4 Oct 2018

> Can you confirm that you can make a request to localhost:9443 using the username and password in the config file, and that you get valid data back?

Sure. When I open that URL, I can log in locally to my vSphere Web Client.

unnamed3788 on 4 Oct 2018

The URL needs to have /sdk at the end. You're hitting the UI, not the API.
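
The difference between the UI and the API can be checked mechanically. The following is a minimal Python sketch, not part of Telegraf; `with_sdk_path` and `root_element` are hypothetical helper names. The first sets the path to `/sdk` when a URL has none, and the second reports the first element found in a response body, which is what the `expected element type <Envelope> but have <html>` error refers to. Whether the port in a given URL actually serves the API is an open question that this sketch does not answer.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def with_sdk_path(url: str) -> str:
    """Return url with its path set to /sdk when the path is empty or '/'."""
    parts = urlsplit(url)
    if parts.path in ("", "/"):
        parts = parts._replace(path="/sdk")
    return urlunsplit(parts)


# Matches the first start tag. '<?xml ...?>' and '<!DOCTYPE ...>' are skipped
# because they begin with '?' or '!' rather than a name character.
_FIRST_TAG = re.compile(r"<(?:[A-Za-z_][\w.-]*:)?([A-Za-z_][\w.-]*)")


def root_element(body: str) -> str:
    """Return the local name of the first element in body, or '' if none."""
    match = _FIRST_TAG.search(body)
    return match.group(1) if match else ""


if __name__ == "__main__":
    print(with_sdk_path("https://localhost:9443/"))
    print(root_element("<!DOCTYPE html><html><body></body></html>"))
```

A body whose first element is `html` means a web page answered the request; a SOAP endpoint would answer with an `Envelope` element instead.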

prydin on 4 Oct 2018

@unnamed3788 can you validate against the /sdk endpoint?

glinton on 5 Oct 2018

As a side note: The fact that you're accessing localhost indicates that you're running Telegraf on the vCenter virtual appliance. Running third-party software on the vCenter appliance is not supported by VMware. You're much better off spinning up a separate VM running Telegraf.

prydin on 9 Oct 2018

I am facing a similar issue. However, I am not running Telegraf on the vCenter machine itself but in a separate VM.
Even so, I am not able to fetch data with the vsphere input plugin.

[[inputs.vsphere]]
  ## List of vCenter URLs to be monitored. These three lines must be uncommented
  ## and edited for the plugin to work.
  vcenters = [ "https://serverip/sdk" ]
  username = "username"
  password = "password"

Also, I am not able to access /sdk from a browser; it says "page not found".
Is there any way I can enable access to /sdk on vSphere?

karnamonkster on 16 Oct 2018

@karnamonkster check out this KB article: https://kb.vmware.com/s/article/1003218

Also, can you tell me which error message you're getting when telegraf tries to connect?

prydin on 16 Oct 2018

The error I receive is:

Error in plugin [inputs.vsphere]: took longer to collect than collection interval (20s)
Error in plugin [inputs.vsphere]: took longer to collect than collection interval (20s)
Error in plugin [inputs.vsphere]: Post https://vcenterip/sdk: Service Unavailable
Error in plugin [inputs.vsphere]: took longer to collect than collection interval (20s)
Error in plugin [inputs.vsphere]: took longer to collect than collection interval (20s)
Error in plugin [inputs.vsphere]: Error in discovery for vcenterip : Post https://vcenterip/sdk: Service Unavailable

However, following the article you mentioned, I also tried
https://vcenteripaddress/sdk/vimService.wsdl
but I still get the same result.

karnamonkster on 17 Oct 2018

It looks like your vCenter is having some issues. Which version? Appliance or Windows install? Does the HTML5-based UI work? Can you access it with PowerCLI? Here are some similar issues and a few solutions: https://communities.vmware.com/thread/547345

If you can't access your vCenter using PowerCLI or HTML5, I suggest you open a case with VMware support.

prydin on 17 Oct 2018

I am running the vCenter appliance, version 6.0, on Linux.
I am able to access the URL https://vcenteripaddress/sdk/vimService.wsdl, but I am not able to access
https://vcenteripaddress/sdk
Maybe some services need to be enabled on vCenter or vSphere? I'm not sure.
Totally lost. :(

karnamonkster on 17 Oct 2018

@karnamonkster It looks like you have a problem with your vCenter, not the Telegraf agent. The error message you get back is an HTTP 503 "Service Unavailable" and it means exactly that: some service in vCenter is unavailable. Most likely, some service has failed on your vCenter virtual appliance.

A simple reboot of the vCenter appliance may solve the problem, but if it recurs, you should contact VMware support.

To check the status of vCenter services, please refer to this article:
https://kb.vmware.com/s/article/2109887

You could also download the govc tool and try a govc ls. If that fails, you definitely have an issue with your vCenter and it may be easier for support to pinpoint it. You may also try connecting to vCenter using PowerCLI. Both govc and PowerCLI are free tools.

Govc is available here: https://github.com/vmware/govmomi/releases
PowerCLI is available here: https://my.vmware.com/web/vmware/details?downloadGroup=PCLI650R1&productId=614

@danielnelson We may want to consider closing this ticket, as it refers to vCenter issues and not issues with Telegraf.

prydin on 17 Oct 2018

Finally everything works. Just like magic; maybe it's because I'm wearing my lucky shirt. :)
Beautiful dashboards are in front of me. Thanks for all the help @prydin

karnamonkster on 18 Oct 2018

How can I add several vCenters with different usernames?

kuzma00 on 15 Nov 2018

@kuzma00 There are two ways of doing this. One is documented, supported and encouraged. The other one is undocumented and... well... discouraged.

Method 1

Simply declare multiple instances of the vsphere plugin:

[[inputs.vsphere]]
vcenters = [ "https://vcenter1.my.domain/sdk" ]
username = "donald"
password = "melania"
...

[[inputs.vsphere]]
vcenters = [ "https://vcenter2.my.domain/sdk" ]
username = "barack"
password = "michelle"
...

Method 2 (undocumented)

vCenter allows user credentials to be passed in the URL. We made this an undocumented/unsupported "feature" since it can get pretty messy when you need to escape special characters.

[[inputs.vsphere]]
vcenters = [ 
    "https://donald:[email protected]",
    "https://barack:[email protected]" ]
...

Let me stress that this is not the recommended way of doing this, but it will work, since vCenter will pick up the credentials. I hope this helps!

prydin on 15 Nov 2018

And since most usernames will contain the "@" sign, it starts to get messy. You'd have to write something like this with a real username and a password containing a "!".

https://donald%40whitehouse.gov:Melania1%[email protected]/sdk

(This does not in any way reflect my political views. Just picked US presidents as an example)
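
The escaping described here is ordinary URL percent-encoding. As a minimal sketch (not part of Telegraf; `vcenter_url` is a hypothetical helper and `vcenter.example.com` is a placeholder host), Python's standard `urllib.parse.quote` with `safe=""` encodes every reserved character in the username and password:

```python
from urllib.parse import quote


def vcenter_url(user: str, password: str, host: str, path: str = "/sdk") -> str:
    """Build an https URL with percent-encoded credentials embedded in it."""
    # safe="" makes quote() encode '@', ':', '/', '!' and other reserved
    # characters, so they cannot be mistaken for URL delimiters.
    return f"https://{quote(user, safe='')}:{quote(password, safe='')}@{host}{path}"


if __name__ == "__main__":
    # Placeholder credentials and host, for illustration only.
    print(vcenter_url("donald@whitehouse.gov", "Melania1!", "vcenter.example.com"))
```

Embedding credentials in the URL remains the discouraged route; separate `[[inputs.vsphere]]` blocks with `username` and `password` avoid the escaping entirely.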

prydin on 15 Nov 2018

Thanks. It works.

kuzma00 on 15 Nov 2018

Hello, thanks guys. @prydin, your Method 2 works!

I changed my configuration from:
[[inputs.vsphere]]
vcenters = [ "https://vcenter1.my.domain/sdk" ]
username = "[email protected]"
password = "MyPassword"
...metrics...
...metrics...

[[inputs.vsphere]]
vcenters = [ "https://vcenter2.my.domain/sdk" ]
username = "[email protected]"
password = "MyPassword"
...metrics...
...metrics...

To:

[[inputs.vsphere]]
vcenters = [
"https://administrator%40vsphere.local :[email protected]/sdk",
"https://administrator%40vsphere.local :[email protected]/sdk" ]
...metrics...
...metrics...

Kind regards