Icinga2: Icinga 2.7 InfluxdbWriter fails to write metrics to InfluxDB over HTTPS

Created on 4 Aug 2017 · 14 Comments · Source: Icinga/icinga2

Since upgrading to Icinga 2.7, the InfluxdbWriter feature seems unable to connect to InfluxDB and write metrics over HTTPS. We encountered this problem on multiple (staging and production) Icinga installations where HTTPS was enabled in the InfluxdbWriter feature.
InfluxDB is configured properly, and writing metrics worked fine with InfluxdbWriter in v2.6.
The problem arises when enabling HTTPS mode in InfluxdbWriter 2.7. In plain HTTP mode, writing works fine under 2.7.
Downgrading to version 2.6 makes the problem disappear.

Expected Behavior

Icinga InfluxdbWriter feature should connect to InfluxDB over HTTPS and write metrics as was the case when using Icinga version 2.6.

Current Behavior

Icinga InfluxdbWriter feature fails to write any data to InfluxDB using HTTPS and shows TCP Socket timeout errors.
Judging by the errors below from the debug log, InfluxdbWriter makes a few attempts right after Icinga is restarted. After that, the work queue grows and no data points appear in InfluxDB. Icinga's CPU usage also increases.
Interestingly, some data does get written to InfluxDB right after a restart of Icinga. Querying a metric through the InfluxDB CLI shows a few data points whose timestamps correlate with the moment of the restart. This happens only at that one instant; after that, nothing more arrives on the InfluxDB side.
In our setup InfluxDB uses a self-signed certificate for HTTPS. I couldn't find any signs of connection errors logged by InfluxDB.

Excerpt from the icinga debuglog:

[2017-08-04 11:45:27 +0200] debug/InfluxdbWriter: Timer expired writing 184 data points
[2017-08-04 11:45:27 +0200] notice/InfluxdbWriter: Reconnecting to InfluxDB on host 'localhost' port '8086'.
[2017-08-04 11:45:28 +0200] warning/InfluxdbWriter: Response timeout of TCP socket from host 'localhost' port '8086'.
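
To see how often this pattern repeats over a longer log, the lines can be tallied by message type. The following is a minimal sketch, assuming log lines keep the format shown in the excerpt above; the function name and the three category labels are made up for illustration.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt: [timestamp] severity/Facility: message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<severity>\w+)/(?P<facility>\w+): (?P<msg>.*)$"
)


def count_influxdb_events(lines):
    """Count flush attempts, reconnects and timeouts logged by InfluxdbWriter."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match or match.group("facility") != "InfluxdbWriter":
            continue  # skip unrelated or malformed lines
        msg = match.group("msg")
        if msg.startswith("Timer expired writing"):
            counts["flush"] += 1
        elif msg.startswith("Reconnecting to InfluxDB"):
            counts["reconnect"] += 1
        elif msg.startswith("Response timeout of TCP socket"):
            counts["timeout"] += 1
    return counts


# The three lines from the excerpt above.
sample = [
    "[2017-08-04 11:45:27 +0200] debug/InfluxdbWriter: Timer expired writing 184 data points",
    "[2017-08-04 11:45:27 +0200] notice/InfluxdbWriter: Reconnecting to InfluxDB on host 'localhost' port '8086'.",
    "[2017-08-04 11:45:28 +0200] warning/InfluxdbWriter: Response timeout of TCP socket from host 'localhost' port '8086'.",
]
counts = count_influxdb_events(sample)
```

A timeout count that tracks the flush count one-to-one would suggest that every write attempt is failing, not just occasional ones.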

from InfluxDB SHOW DATABASES:

> show databases
name: databases
name
----
icinga
_internal

from InfluxDB SHOW STATS:

authFail clientError pingReq pointsWrittenDropped pointsWrittenFail pointsWrittenOK queryReq queryReqDurationNs queryRespBytes req reqActive reqDurationNs serverError statusReq writeReq writeReqActive writeReqBytes writeReqDurationNs
-------- ----------- ------- -------------------- ----------------- --------------- -------- ------------------ -------------- --- --------- ------------- ----------- --------- -------- -------------- ------------- ------------------
0        0           6       0                    0                 13964           351      14541557291        5995252        403 1         16592672959   0           0         46       0              1129905       1562058648
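
For reading the counters above more easily, here is a small sketch that pairs each column name with its value and derives two averages. The helper names are illustrative only. With the numbers shown, it works out to roughly 304 points and about 34 ms per write request, with no failed points and no server errors, which matches the observation that InfluxDB itself logs no errors.

```python
# Header and value rows copied from the SHOW STATS output above.
header = (
    "authFail clientError pingReq pointsWrittenDropped pointsWrittenFail "
    "pointsWrittenOK queryReq queryReqDurationNs queryRespBytes req reqActive "
    "reqDurationNs serverError statusReq writeReq writeReqActive writeReqBytes "
    "writeReqDurationNs"
)
values = (
    "0 0 6 0 0 13964 351 14541557291 5995252 403 1 16592672959 0 0 46 0 "
    "1129905 1562058648"
)


def parse_stats(header_line, value_line):
    """Pair whitespace-separated column names with their integer values."""
    names = header_line.split()
    numbers = [int(v) for v in value_line.split()]
    if len(names) != len(numbers):
        raise ValueError("header and value rows have different lengths")
    return dict(zip(names, numbers))


def write_averages(stats):
    """Return (points per write request, milliseconds per write request)."""
    writes = stats["writeReq"]
    if writes == 0:
        return 0.0, 0.0
    points_per_write = stats["pointsWrittenOK"] / writes
    ms_per_write = stats["writeReqDurationNs"] / writes / 1e6
    return points_per_write, ms_per_write


stats = parse_stats(header, values)
points_per_write, ms_per_write = write_averages(stats)
```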

Steps to Reproduce

  1. yum upgrade or clean yum install Icinga v2.7.0 from icinga-stable-release
  2. Enable and configure InfluxdbWriter as shown below (HTTPS is also enabled in InfluxDB, of course).
    Note: the example connects to InfluxDB on localhost, but we encounter the same problem with InfluxDB running in HTTPS mode on a separate host.
library "perfdata"

object InfluxdbWriter "influxdb" {
  host = "localhost"
  port = 8086
  database = "icinga"
  username = "*****"
  password = "*****"
  flush_threshold = 1024
  flush_interval = 10s
  ssl_enable = true
  host_template = {
    measurement = "$host.check_command$"
    tags = {
      hostname = "$host.name$"
    }
  }

  service_template = {
    measurement = "$service.check_command$"
    tags = {
      hostname = "$host.name$"
      service = "$service.name$"
    }
  }
}
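
To check the InfluxDB HTTPS endpoint independently of Icinga, a single point can be posted to the InfluxDB 1.x /write endpoint with the same host, port and database as in the object above. This is a rough sketch using only the Python standard library: the measurement name "probe" and both function names are placeholders, and certificate verification is switched off on the assumption that the server uses a self-signed certificate, as described in this report.

```python
import ssl
import urllib.parse
import urllib.request


def build_write_request(host, port, database, line, username=None, password=None):
    """Build a POST to InfluxDB 1.x /write carrying one line-protocol record."""
    params = {"db": database}
    if username is not None:
        # InfluxDB 1.x accepts credentials as the u and p query parameters.
        params["u"] = username
        params["p"] = password or ""
    url = "https://%s:%d/write?%s" % (host, port, urllib.parse.urlencode(params))
    return urllib.request.Request(url, data=line.encode("utf-8"), method="POST")


def send_probe(request, timeout=5.0):
    """Send the request and return the HTTP status code.

    Verification is disabled because a self-signed certificate is assumed;
    do not do this against a server whose identity matters.
    """
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, timeout=timeout, context=context) as resp:
        return resp.status


# Hypothetical probe point: measurement "probe", one tag, one field.
request = build_write_request("localhost", 8086, "icinga", "probe,hostname=test value=1")
```

Calling send_probe(request) against a running instance should return 204 for an accepted write. If this works while InfluxdbWriter times out, the problem is more likely on the Icinga side than in InfluxDB.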

Your Environment

  • Version used (icinga2 --version): r2.7.0-1
  • Operating System and version: CentOS 7 3.10.0-514.26.2.el7.x86_64
  • Enabled features (icinga2 feature list): checker command ido-mysql influxdb mainlog notification
  • Icinga Web 2 version and modules (System - About): not relevant
  • Config validation (icinga2 daemon -C): Config validates fine.
Labels: area/influxdb, bug

All 14 comments

I was having the same issue on 1 of 2 setups pushing to a central Influxdb server via https. However after increasing the default flush_* values to

flush_threshold = 4096
flush_interval = 60s

things seem to be working again. Or at least it's been working for the last hour or so, whereas before I would get maybe 5 minutes after restarting the Icinga service.
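
For anyone wondering how these two settings interact: the general idea is that buffered points are sent when either the count reaches the threshold or the interval has elapsed, so larger values mean fewer, bigger requests. The sketch below illustrates that generic pattern only. It is not Icinga's implementation, and the class and method names are invented.

```python
import time


class BufferedWriter:
    """Generic batch buffer: flush on size threshold or on elapsed interval."""

    def __init__(self, flush_threshold, flush_interval, send, clock=time.monotonic):
        self.flush_threshold = flush_threshold  # max buffered points before a flush
        self.flush_interval = flush_interval    # max seconds between flushes
        self.send = send                        # callable that receives one batch (list)
        self.clock = clock                      # injectable clock, handy for testing
        self.buffer = []
        self.last_flush = clock()

    def add(self, point):
        """Buffer one point and flush immediately if the threshold is reached."""
        self.buffer.append(point)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def tick(self):
        """Call periodically; flushes pending points once the interval has passed."""
        if self.buffer and self.clock() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        """Hand the buffered points to send() and reset the buffer and timer."""
        batch, self.buffer = self.buffer, []
        self.last_flush = self.clock()
        if batch:
            self.send(batch)
```

With a threshold of 4096 and an interval of 60 seconds, a setup producing a few hundred points per interval would send one request per minute instead of several. That reduces how often a broken connection is exercised, but it would not fix one.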

Sadly my fix broke several hours later (around 9PM EST last night).

You've only specified ssl_enable = true ... wouldn't SSL require ssl_ca_cert, ssl_cert and ssl_key being set according to https://www.icinga.com/docs/icinga2/latest/doc/09-object-types/#influxdbwriter ?

I believe those settings are only needed if you are working with client certificates.
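
To illustrate the distinction in general TLS terms (not Icinga's code): a CA file is used to verify the server, while a certificate and key pair is only loaded when the client has to authenticate itself. Below is a minimal sketch with Python's ssl module. The parameter names loosely mirror ssl_ca_cert, ssl_cert and ssl_key, and how InfluxdbWriter behaves when ssl_ca_cert is not set is not something this thread establishes.

```python
import ssl


def make_tls_context(ca_file=None, cert_file=None, key_file=None, verify=True):
    """Build a client-side TLS context.

    ca_file:   CA bundle used to verify the server (system defaults if None).
    cert_file: client certificate, only needed when the server demands one.
    key_file:  private key belonging to cert_file.
    verify:    set to False to accept any server certificate (e.g. self-signed).
    """
    context = ssl.create_default_context(cafile=ca_file)
    if not verify:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    if cert_file:
        # Mutual TLS: present a client certificate during the handshake.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


# Encrypted connection without client certificates and without verification.
encrypt_only = make_tls_context(verify=False)
```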

(I am seeing the same issue by the way!)

I downgraded to 2.6.3 last night after removing the flush_* options from InfluxdbWriter and, with no other changes, everything's been running as expected for the last 12 hours.

I don't see any code changes which could influence SSL here.

$ git diff v2.6.3 HEAD lib/perfdata/influxdbwriter.cpp

The only exception is that hanging TCP connections are now properly killed after a defined socket_timeout. Versions prior to 2.7.0 silently hid that problem, so it seemed that everything was OK (while it was not).
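
As a generic illustration of that behaviour (not the Icinga code itself): a client that waits for a reply from a peer that keeps the connection open but never answers would block forever unless it sets a read timeout and closes the socket when it expires. The sketch below uses a local socket pair to stand in for the silent server; the function names and the 0.2 second timeout are arbitrary.

```python
import socket


def read_with_timeout(sock, timeout_seconds):
    """Wait for a reply; on timeout, close the socket and return None."""
    sock.settimeout(timeout_seconds)
    try:
        return sock.recv(4096)
    except socket.timeout:
        sock.close()  # hard-close the hanging connection
        return None


def demo_silent_peer(timeout_seconds=0.2):
    """Send a request to a peer that never replies and report the outcome."""
    client, silent_peer = socket.socketpair()
    try:
        client.sendall(b"POST /write?db=icinga HTTP/1.1\r\n\r\n")
        # The peer stays open but sends nothing, so recv() would hang
        # indefinitely without the timeout.
        return read_with_timeout(client, timeout_seconds)
    finally:
        silent_peer.close()
        client.close()  # closing an already closed socket is harmless
```

Without the timeout, the recv() call would never return, which is the kind of hang that can go unnoticed when nothing reports it.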

How do you tell that everything's been running as expected @briansumma ?

4927 and related.

@dnsmichi I am not certain that the issue is with ssl at all. I kind of glommed onto @basg's issue when I realized I was having similar issues considering our setups were almost identical. As far as telling whether things are running as expected or not, my approach is incredibly untechnical — I can either visualize my InfluxDB data source in Grafana or I can't.

Just to add further context, in my setup I have a local Icinga 2 master using InfluxdbWriter with SSL enabled to write data to an AWS EC2 instance hosting InfluxDB and Grafana. Without changing anything on the EC2 instance, I updated Icinga on my local machine and several hours later noticed that Grafana wasn't displaying any new data from that day. When I grepped the Icinga logs for InfluxdbWriter I saw a few messages similar to @basg's Response timeout of TCP socket from host 'localhost' port '8086' (the exception being that it was my EC2 instance's FQDN, not localhost).

After reading the extended blog concerning 2.7 and futzing with the flush_threshold, flush_interval and socket_timeout options in /etc/icinga2/features-available/influxdb.conf I thought I found a fix when I increased these values. However, the next day when I checked Grafana my graphs were empty indicating that the InfluxdbWriter had choked about 3 hours after I applied my changes.

With my boss breathing down my neck I ran yum downgrade icinga2 icinga2-bin icinga2-common ... to see if the problem was also in 2.6.3 so that I could figure out if I needed to troubleshoot Icinga or InfluxDB/Grafana.

I have been running 2.6.3 since 8/6 without dropped data or any Response timeout of TCP socket from host 'localhost' port '8086' in my Icinga logs.

I guess I should mention that when I downgraded to 2.6.3, I did modify /etc/icinga2/features-enabled/influxdb.conf by commenting out the new options

[...]
   enable_send_metadata = true
   //flush_threshold = 4096
   //flush_interval = 60s
   //socket_timeout = 10s
   host_template = {
[...]

Please note that flush_threshold and flush_interval existed in 2.6.x and had low default values if not explicitly specified. Only the socket_timeout attribute was added to prevent a hanging InfluxDB API connection. @spjmurray looked into it and found out that the API tends to just "keep the connection open until infinite time", unless the client closes it hard.

I haven't seen any problems with my dev instance before 2.7 (but without ssl_enable, that's why I am going that route). Either @spjmurray is faster, or I find the time to investigate your issue.

I also understand the downgrade, no worries. I'd just be glad if a fix is there, if you're available for tests (or do have a test lab somewhere, where we could test additional logging and so on).

Please test the snapshot packages, or the referenced patch with your setups. Thanks.

I've installed the latest snapshot package. It looks like this solved the issue;

  • Logs seem normal (no signs of InfluxDBWriter workerqueue stacking up)
  • Datapoints show up in influxDB
  • No more high cpu load

Will let it run a while longer and report any anomalies if they occur.

Thanks for that @spjmurray. Couldn't understand why my master instance all of a sudden had 10x the load after the 2.7 upgrade, which basically killed it.

Works again after Update to 2.7.1.
