Telegraf: inputs.snmp fails all agents if one agent does not respond in time

Created on 23 Feb 2018 · 11 comments · Source: influxdata/telegraf

# Bug report
When running the inputs.snmp plugin and polling several devices (agents) within the same instance, if at least one device does not send back all requested information, all devices in that inputs.snmp instance fail. Other instances running at the same time are not impacted.
My example case: I am polling several hundred network devices with several inputs.snmp instances every 1 minute. Each instance is dedicated to a single metric.
When polling interface metrics on a large, remotely located network device (a few hundred interfaces) with high latency (~100 ms), there is simply not enough time for all requests and responses. That is my problem to solve. But due to this condition, all the other few hundred devices in the same instance fail even though they send back all their data in time.
The following log message is generated:
E! Error in plugin [inputs.snmp]: took longer to collect than collection interval (1m0s)
From the log it is impossible to tell which device is failing.

# Relevant telegraf.conf:

[[inputs.snmp]]
  agents = ["test1:161", "test2:161", "test5", "test3:161"]
  interval = "60s"
  timeout = "5s"
  retries = 3
  version = 2
  community = "SNMPcommunity"
  max_repetitions = 100

  [[inputs.snmp.table]]
    name = "interface"
    oid = "IF-MIB::ifXTable"

    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "ifOperStatus"
      oid = "IF-MIB::ifOperStatus"
      is_tag = true

# System info:
Telegraf: 1.5.1-1
OS: RHEL 7.4

# Expected behavior:
Telegraf should fail only the device that does not respond in time, and should log which device failed.
# Actual behavior:
The whole instance fails and only a generic message is generated.


All 11 comments

You can move to snmpcollector to gather SNMP metrics. The polling time or state of one device does not affect the other devices, and you have a nice web interface to check device runtime statistics (state, polling time, number of metrics, errors, etc.).

https://github.com/toni-moreno/snmpcollector

Check the wiki for configuration examples

https://github.com/toni-moreno/snmpcollector/wiki

toni-moreno, thanks for the advice, I will check it out, but I don't believe a bug report is the place to advertise your product.

+1

However, I think the issue is more serious than implied by your report. I believe the collection is failing because the input is exceeding the _collection_ time for the entire input. That is, if you've got 100 routers and it takes a couple of seconds to poll each one, then completing the collection run exceeds the interval even if each box responds quickly.

I have proven this in my environment, as I have the same problem even with no devices timing out, doing basically just ifTable queries, with as few as about 150 devices.

I've tried to resolve the issue by breaking my configuration into multiple [[inputs.snmp]] chunks, but they all seem to be executed as if they were one, so you run into the same problem:

Mar 21 15:38:00 act-collector01 telegraf[6933]: 2018-03-21T04:38:00Z E! Error in plugin [inputs.snmp]: took longer to collect than collection interval (2m0s)

Unfortunately, this makes Telegraf with the snmp plugin unusable for the purpose of collecting network metrics via SNMP on a reasonably sized network without a considerable amount of scripting to stand up additional telegraf instances and delegate appropriately sized configurations to each.

My environment details:

  • Telegraf 1.5.3
  • 6 core xeon E5-2697 (virtualized under ESX)
  • 16GB Memory
  • SSD storage

Example file (replace the agent string with 100+ hostnames...):

[[inputs.snmp]]
    interval = "120s"
    agents = [ "spaghetti.csiro.au" ]
    version = 2
    community = "fake"
    name = "switch_snmp"
    timeout = "2s"
    retries = 1

  [[inputs.snmp.field]]
    name = "hostname"
    oid = "RFC1213-MIB::sysName.0"
    is_tag = true

  [[inputs.snmp.table]]
    name = "cisco_physical_cpu"
    inherit_tags = [ "hostname" ]
    oid = "CISCO-PROCESS-MIB::cpmCPUTotalTable"
    index_as_tag = true

  # Poll an entire table for all of its fields
  [[inputs.snmp.table]]
    name = "interface"
    inherit_tags = [ "hostname" ]
    oid = "IF-MIB::ifTable"

    # Interface tag - used to identify interface in metrics database
    # Mark the OID IF-MIB::ifDescr as "ifDescr" in the snmp table
    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"
      is_tag = true

    #
    [[inputs.snmp.table.field]]
      name = "ifName"
      oid = "IF-MIB::ifName"
      is_tag = true

  # IF-MIB::ifXTable contains newer High Capacity (HC) counters that do not overflow as fast for a few of the ifTable counters
  [[inputs.snmp.table]]
    name = "interface"
    inherit_tags = [ "hostname" ]
    oid = "IF-MIB::ifXTable"

    # Interface tag - used to identify interface in metrics database
    [[inputs.snmp.table.field]]
      name = "ifName"
      oid = "IF-MIB::ifName"
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"
      is_tag = true

Plugins don't fail if they exceed the interval; this is really just a warning that the plugin is scheduled to run again while it still has not completed. The plugin continues to run, and it will not run again until the first collection completes.

E! Error in plugin [inputs.snmp]: took longer to collect than collection interval (2m0s)

The timeout in the configuration is per network operation, so if you have a 5s timeout with multiple retries and multiple fields, it can add up pretty fast, and it is difficult to know how long it can take. This definitely needs to be improved; I think we should have one timeout that applies to an entire agent's work.
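As a rough back-of-the-envelope illustration of how the per-operation timeout compounds (the request count is a hypothetical number, not Telegraf's exact behavior, which depends on the configured tables and fields):

```python
# Worst-case stall for one unresponsive agent.
# All numbers below are hypothetical, for illustration only.
timeout_s = 5        # per network operation, from the config
retries = 3          # each timed-out operation is retried this many times
operations = 10      # assumed number of SNMP requests needed for one agent

# Each operation can block for timeout * (retries + 1) seconds in the worst case.
worst_case_s = timeout_s * (retries + 1) * operations
print(worst_case_s)  # 200 seconds, far beyond a 60s interval
```

Even these modest numbers show how a single dead agent can dominate the collection interval.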

However, if you split the agents into separate plugins, each one does run independently. The workaround of splitting the agents across multiple smaller plugins should work, though it is obviously hard to do until we improve the log messages.

@danielnelson I agree a per-agent timeout value would be much better. Basically a hard limit on how long a collection can take on a single host. You could even make it automatic, maybe by dividing the interval by the number of hosts divided by the number of parallel SNMP sessions.
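As a sketch of that automatic-budget idea (the concurrency figure is an assumption; Telegraf does not expose such a setting):

```python
# Hypothetical automatic per-agent timeout: each concurrent SNMP session
# must serve hosts / parallel_sessions agents within one interval.
interval_s = 60
hosts = 150
parallel_sessions = 10   # assumed number of concurrent SNMP sessions

per_agent_budget_s = interval_s / (hosts / parallel_sessions)
print(per_agent_budget_s)  # 4.0 seconds available per agent
```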

I tried splitting my snmp inputs into groups of 50 hosts each and it _almost_ works but I still get some groups that take too long to collect so you end up with weird looking graphs for certain devices. I wrote some automation to do this which, while nice, is probably too much additional overhead for the relatively simple task of collecting network metrics.

The issue actually exists, and the plugin fails if you use lots of agents in one instance.

But I have solved my problem by creating a separate configuration file per device and putting them all in the /etc/telegraf/telegraf.d/ directory. Each file contains several snmp plugin instances, usually one per SNMP table. I created several templates per device type, came up with a naming convention for the config files, and wrote a script that takes a list of devices and creates the config files. I made it really easy for myself to provision devices monitored by Telegraf.
So I am running two 8-core VMs and monitoring more than 2700 devices, polling them every 1 minute quite smoothly :)
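A minimal sketch of that kind of provisioning script (the template text, file names, and device list here are hypothetical, not the author's actual setup):

```python
# Hypothetical sketch: render one config file per device into telegraf.d/,
# so each device gets its own [[inputs.snmp]] instance and a slow device
# only delays itself. Template text and paths are assumptions.
from pathlib import Path

TEMPLATE = """[[inputs.snmp]]
  agents = ["{host}:161"]
  interval = "60s"
  timeout = "5s"
  retries = 3
  version = 2
  community = "SNMPcommunity"
"""

def render_configs(devices, out_dir):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for host in devices:
        (out / f"snmp-{host}.conf").write_text(TEMPLATE.format(host=host))

# e.g. render_configs(["sw1", "sw2"], "/etc/telegraf/telegraf.d")
```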

This way I isolated all "took longer to collect" problems to a per-device basis, but the logging improvement is still needed, because if some device takes longer to collect I don't know which one.

Now I only run into problems with very large devices. For example, I have several switches with more than 1200 interfaces, and Telegraf is not able to poll the whole ifTable and ifXTable within one minute. But that is a different topic.

Have there been any updates on this? I'm seeing the same thing when one of three routers is down.

Any update on this?

I'm seeing the same behaviour. When multiple hosts are configured and one is down, the SNMP queries appear to be sent to the hosts sequentially, so a single offline host (or several) can cause the plugin to spend its time timing out on the offline hosts before actually completing the list of online hosts.

Is it not possible to run the snmp queries to all configured hosts / agents in parallel so that one host being down wouldn't affect gathering metrics for the others?

Thanks!

In a nutshell, the workaround is to talk to a single remote agent per plugin:

[[inputs.snmp]]
  agents = ["host1"]
  # other options
[[inputs.snmp]]
  agents = ["host2"]
  # other options

Ok, thanks @danielnelson. I understand the workaround, but if "# other options" is an extensive list of OIDs, managing any changes to those options across a number of hosts would be complicated.

Are there plans to make the snmp calls in parallel to the listed agents or, if a response is not received from AgentA, continue with AgentB before retrying AgentA?

Thanks!

Yes, the other options would be whatever tables/fields you want to collect, and they would need to be repeated. If you are reading this issue you probably have lots of agents, and for managing that I recommend using a templating program to generate your configuration.

To be clear, the plugin does make SNMP calls in parallel to the agents, but all agents must complete before the next collection will begin. This means one agent can theoretically hold up all the other agents for as long as its total timeout allows. This behavior probably won't change anytime soon. Placing agents in separate plugin definitions will allow them to be fully independent.
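A minimal illustration of that scheduling behavior (not Telegraf's code; Telegraf is written in Go, and this Python sketch only models the "wait for the slowest agent" pattern):

```python
# Agents are polled concurrently, but the collection cycle only finishes
# once every poll in the batch has finished, so one slow agent sets the
# pace for the whole batch. Agent names and delays are made up.
import asyncio
import time

async def poll(agent, delay_s):
    await asyncio.sleep(delay_s)   # stand-in for an SNMP round trip
    return agent

async def collect(agents):
    # gather() returns only after the slowest agent completes
    return await asyncio.gather(*(poll(a, d) for a, d in agents))

start = time.monotonic()
done = asyncio.run(collect([("fast1", 0.01), ("fast2", 0.01), ("slow", 0.2)]))
elapsed = time.monotonic() - start
print(done)     # all three agents were polled concurrently
print(elapsed)  # bounded by the slowest agent, not the sum of all delays
```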
