Telegraf: SNMP input slow warning with just 18 snmp clients

Created on 9 Apr 2020 · 6 Comments · Source: influxdata/telegraf

Relevant telegraf.conf:

[[inputs.snmp]]
  agents = [ "8.8.8.8" ]
  version = 2
  community = "public"
  interval = "60s"
  timeout = "10s"
  retries = 3
  [inputs.snmp.tags]
    name = "Home Router"

  [[inputs.snmp.field]]
    name = "hostname"
    oid = "RFC1213-MIB::sysName.0"
    is_tag = true

  [[inputs.snmp.field]]
    name = "uptime"
    oid = "DISMAN-EXPRESSION-MIB::sysUpTimeInstance"

  # IF-MIB::ifTable contains counters on input and output traffic as well as errors and discards.
  [[inputs.snmp.table]]
    name = "interface"
    inherit_tags = [ "hostname" ]
    oid = "IF-MIB::ifTable"

    # Interface tag - used to identify interface in metrics database
    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"
      is_tag = true

  # IF-MIB::ifXTable contains newer High Capacity (HC) counters that do not overflow as fast for a few of the ifTable counters
  [[inputs.snmp.table]]
    name = "interface"
    inherit_tags = [ "hostname" ]
    oid = "IF-MIB::ifXTable"

    # Interface tag - used to identify interface in metrics database
    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"
      is_tag = true

  # EtherLike-MIB::dot3StatsTable contains detailed ethernet-level information about what kind of errors have been logged on an interface (such as FCS error, frame too long, etc)
  [[inputs.snmp.table]]
    name = "interface"
    inherit_tags = [ "hostname" ]
    oid = "EtherLike-MIB::dot3StatsTable"

    # Interface tag - used to identify interface in metrics database
    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = "IF-MIB::ifDescr"

System info:


Ubuntu Server 18.04
Fiber connection
Telegraf 1.14.0

Docker

Steps to reproduce:

  1. Add multiple configs with SNMP clients (remote routers in my case), each with an interval of 60 seconds.
  2. Eventually, at around 18 clients in my case, Telegraf can no longer retrieve the data.
  3. Telegraf stops working entirely.

Expected behavior:

Telegraf collects snmp data.

Actual behavior:

SNMP error in the log and telegraf stops working entirely.

Additional info:

Log is filled with:
2020-04-08T18:29:00Z W! [agent] [inputs.snmp] did not complete within its interval
2020-04-08T18:30:00Z W! [agent] [inputs.snmp] did not complete within its interval

All 6 comments

That message can sometimes be misleading: it could be one of the other plugins snmp has handed the message off to. One thing you could do here is enable the inputs.internal metric collection, which should give you insight into how long each plugin is taking.
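
A minimal sketch of that suggestion, added next to the existing inputs in telegraf.conf. The `collect_memstats` line is optional and is included only to keep the extra output small. The measurement and field names in the note below (`internal_gather`, `gather_time_ns`) are stated from memory of the plugin's documentation and should be checked against the installed Telegraf version.

```toml
# Report Telegraf's own runtime statistics, including per-plugin gather times.
[[inputs.internal]]
  # Skip Go memory statistics; only the gather timings matter for this issue.
  collect_memstats = false
```

If those names hold, comparing `gather_time_ns` in the `internal_gather` measurement for the snmp input against the 60s interval should show whether snmp itself is the slow plugin.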

This sounds like it might be a single slow SNMP device that's not responding in time. It looks like this plugin waits for all responses before exiting (which is maybe an opportunity for improvement, or better log messaging). You could discover if this is the case with a packet capture during the Telegraf SNMP request cycle on the SNMP port(s) (161,162?).

Today I added some clients while watching it collect the data with tcpdump.

Nothing strange has happened so far. I'll keep you updated.

I think I found the issue using tcpdump.

Telegraf tried to fetch data from a NAS device that was offline. At that point, it kept retrying until I restarted Telegraf.

I don't know why it kept doing that, because it should give up after 3 retries. After I removed the NAS, Telegraf worked as intended.

That's strange. I'll look into that. Maybe you've discovered something.

OK, I think I see the issue here. The plugin collects from each agent independently and has two layers of retries. It always retries connections once (but that retry has 3 internal retries), and if there's a large number of tables, it tries twice for each table. With a large config and a long connection timeout, this could easily lead to the "did not complete within its interval" message. The total number of connection tries here is 2 x retries x table_count, and the time required is roughly 2 x retries x table_count x timeout, so about 9 minutes to fully time out an unresponsive agent. I don't think that was ever expected or intended.
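
As a rough worked example of that estimate, here is the formula with the values from the config posted at the top of the issue: retries = 3, timeout = 10s, and three tables (ifTable, ifXTable, dot3StatsTable). The symbol T_worst is introduced here only for illustration, and the block assumes the amsmath package.

```latex
\begin{align*}
T_{\text{worst}} &\approx 2 \times \text{retries} \times \text{table\_count} \times \text{timeout} \\
                 &= 2 \times 3 \times 3 \times 10\,\text{s} \\
                 &= 180\,\text{s}
\end{align*}
```

Under this reading, a single unresponsive agent blocks the input for about three full 60s intervals. That is less than the roughly 9 minutes quoted, so the actual code path may count more attempts or tables than this simple substitution does. Either way the total is well beyond one interval.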

It would be better behavior to stop trying tables once you get the first error.

Going to reopen because I think there's definitely something that can be improved here.
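
Until the retry behavior changes, one possible workaround that follows from the estimate above is to shrink the two factors the user controls. This is only a sketch: the values are illustrative assumptions, not recommendations from the maintainers, and the rest of the `[[inputs.snmp]]` block (fields and tables) is assumed to stay as posted.

```toml
[[inputs.snmp]]
  agents = [ "8.8.8.8" ]
  version = 2
  community = "public"
  interval = "60s"
  # Illustrative values: with 3 tables the estimate becomes
  # 2 x 1 x 3 x 2s = 12s, which fits inside the 60s interval.
  timeout = "2s"
  retries = 1
```

The trade-off is that a short timeout with a single retry may drop polls from agents on slow or lossy links, so the values would need tuning per site.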
