Since upgrading to Icinga version 2.7, the InfluxdbWriter feature appears unable to connect and write metrics to InfluxDB over HTTPS. We encountered this problem on multiple (staging and production) Icinga installations where HTTPS was enabled in the InfluxdbWriter feature.
InfluxDB is configured properly, and writing metrics worked fine with InfluxdbWriter in v2.6.
The problem arises when enabling HTTPS-mode in InfluxdbWriter 2.7. In plain HTTP mode this works fine under 2.7.
Downgrading to version 2.6 makes the problem disappear.
Icinga InfluxdbWriter feature should connect to InfluxDB over HTTPS and write metrics as was the case when using Icinga version 2.6.
Icinga InfluxdbWriter feature fails to write any data to InfluxDB using HTTPS and shows TCP Socket timeout errors.
Judging by the below errors written to debuglog, it looks like a few attempts are made by InfluxdbWriter right after restarting Icinga. After that we see the workerqueue growing and no datapoints are appearing in Influxdb. Also Icinga's CPU-usage increases.
Interestingly, it looks like some data is written to Influx, right after issuing a restart of Icinga. Querying a certain metric through InfluxDB CLI shows that a few datapoints are written, right after restart. The timestamps of these datapoints seem to correlate to the moment of restarting Icinga. However, this is just during one instant, after that it remains silent on the InfluxDB side.
In our setup InfluxDB uses a self-signed certificate for HTTPS. I couldn't find any signs of connection errors logged by InfluxDB.
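One way to rule out the TLS endpoint itself is to probe InfluxDB's `/ping` endpoint over HTTPS from the Icinga host, with a client-side timeout. The following is a minimal sketch, not part of Icinga: the function names are made up for illustration, and certificate verification is deliberately disabled only because the setup described here uses a self-signed certificate.

```python
import http.client
import ssl


def make_insecure_context():
    """Build a TLS context that skips verification (self-signed cert assumed)."""
    context = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def ping_influxdb(host="localhost", port=8086, timeout=5.0):
    """Return the HTTP status of GET /ping; raises OSError on timeout or refusal."""
    conn = http.client.HTTPSConnection(
        host, port, timeout=timeout, context=make_insecure_context()
    )
    try:
        conn.request("GET", "/ping")
        return conn.getresponse().status
    finally:
        conn.close()


# Example call (needs a reachable InfluxDB, so it is left commented out):
# print(ping_influxdb("localhost", 8086))
```

A healthy InfluxDB 1.x instance answers `/ping` with status 204. If this probe answers promptly while InfluxdbWriter still logs timeouts, the problem is more likely on the Icinga side than in the InfluxDB HTTPS setup.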
Excerpt from the icinga debuglog:
[2017-08-04 11:45:27 +0200] debug/InfluxdbWriter: Timer expired writing 184 data points
[2017-08-04 11:45:27 +0200] notice/InfluxdbWriter: Reconnecting to InfluxDB on host 'localhost' port '8086'.
[2017-08-04 11:45:28 +0200] warning/InfluxdbWriter: Response timeout of TCP socket from host 'localhost' port '8086'.
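To see how often the timeouts occur relative to flush attempts, the debug log can be summarized with a short script. This is only a sketch: the regular expression assumes the line format shown in the excerpt above, and the function name and category labels are made up for illustration.

```python
import re
from collections import Counter

# Matches lines shaped like: [timestamp] severity/Component: message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<severity>\w+)/(?P<component>\w+): (?P<message>.*)$"
)


def summarize_influxdb_log(lines):
    """Count InfluxdbWriter messages by kind: flush, reconnect, timeout."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match or match.group("component") != "InfluxdbWriter":
            continue
        message = match.group("message")
        if message.startswith("Response timeout of TCP socket"):
            counts["timeout"] += 1
        elif message.startswith("Reconnecting to InfluxDB"):
            counts["reconnect"] += 1
        elif message.startswith("Timer expired writing"):
            counts["flush"] += 1
    return counts


sample = [
    "[2017-08-04 11:45:27 +0200] debug/InfluxdbWriter: Timer expired writing 184 data points",
    "[2017-08-04 11:45:27 +0200] notice/InfluxdbWriter: Reconnecting to InfluxDB on host 'localhost' port '8086'.",
    "[2017-08-04 11:45:28 +0200] warning/InfluxdbWriter: Response timeout of TCP socket from host 'localhost' port '8086'.",
]
print(summarize_influxdb_log(sample))
```

Feeding it the lines of the real debug log instead of `sample` gives a rough picture of whether every flush ends in a timeout or only some of them.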
from InfluxDB SHOW DATABASES:
> show databases
name: databases
name
----
icinga
_internal
from InfluxDB SHOW STATS:
authFail clientError pingReq pointsWrittenDropped pointsWrittenFail pointsWrittenOK queryReq queryReqDurationNs queryRespBytes req reqActive reqDurationNs serverError statusReq writeReq writeReqActive writeReqBytes writeReqDurationNs
-------- ----------- ------- -------------------- ----------------- --------------- -------- ------------------ -------------- --- --------- ------------- ----------- --------- -------- -------------- ------------- ------------------
0 0 6 0 0 13964 351 14541557291 5995252 403 1 16592672959 0 0 46 0 1129905 1562058648
yum upgrade or clean yum install Icinga v2.7.0 from icinga-stable-release
library "perfdata"
object InfluxdbWriter "influxdb" {
  host = "localhost"
  port = 8086
  database = "icinga"
  username = "*****"
  password = "*****"
  flush_threshold = 1024
  flush_interval = 10s
  ssl_enable = true
  host_template = {
    measurement = "$host.check_command$"
    tags = {
      hostname = "$host.name$"
    }
  }
  service_template = {
    measurement = "$service.check_command$"
    tags = {
      hostname = "$host.name$"
      service = "$service.name$"
    }
  }
}
icinga2 --version: r2.7.0-1
icinga2 feature list: checker command ido-mysql influxdb mainlog notification
icinga2 daemon -C: Config validates fine.
I was having the same issue on 1 of 2 setups pushing to a central Influxdb server via https. However after increasing the default flush_* values to
flush_threshold = 4096
flush_interval = 60s
things seem to be working again. Or at least it's been working for the last hour or so, whereas before I would maybe get 5 minutes after restarting the Icinga service.
Sadly my fix broke several hours later (around 9PM EST last night).
You've only specified ssl_enable = true ... wouldn't SSL require ssl_ca_cert, ssl_cert and ssl_key being set according to https://www.icinga.com/docs/icinga2/latest/doc/09-object-types/#influxdbwriter ?
I believe those settings are only needed if you are working with client certificates.
(I am seeing the same issue by the way!)
I downgraded to 2.6.3 last night after removing the flush_* options from InfluxdbWriter and, with no other changes, everything's been running as expected for the last 12 hours.
I don't see any code related changes which could influence SSL here.
$ git diff v2.6.3 HEAD lib/perfdata/influxdbwriter.cpp
The only exception is that hanging TCP connections are now properly killed after a defined socket_timeout. Versions prior to 2.7.0 silently hid that problem, so it seemed that everything was OK (while it was not).
How do you tell that everything's been running as expected @briansumma ?
@dnsmichi I am not certain that the issue is with ssl at all. I kind of glommed onto @basg's issue when I realized I was having similar issues considering our setups were almost identical. As far as telling whether things are running as expected or not, my approach is incredibly untechnical — I can either visualize my InfluxDB data source in Grafana or I can't.
Just to add further context, in my setup I have a local Icinga 2 master using InfluxdbWriter w/ssl enabled to write data to an AWS EC2 instance hosting InfluxDB and Grafana. Without changing anything on the EC2 instance, I updated Icinga on my local machine and several hours later noticed that Grafana wasn't displaying any new data from that day. When I grepped the Icinga logs for InfluxdbWriter I saw a few messages similar to @basg's Response timeout of TCP socket from host 'localhost' port '8086' (the exception being that it was my EC2 instance's FQDN, not localhost).
After reading the extended blog concerning 2.7 and futzing with the flush_threshold, flush_interval and socket_timeout options in /etc/icinga2/features-available/influxdb.conf I thought I found a fix when I increased these values. However, the next day when I checked Grafana my graphs were empty indicating that the InfluxdbWriter had choked about 3 hours after I applied my changes.
With my boss breathing down my neck I ran yum downgrade icinga2 icinga2-bin icinga2-common ... to see if the problem was also in 2.6.3 so that I could figure out if I needed to troubleshoot Icinga or InfluxDB/Grafana.
I have been running 2.6.3 since 8/6 without dropped data or any Response timeout of TCP socket from host 'localhost' port '8086' in my Icinga logs.
I guess I should mention that when I downgraded to 2.6.3, I did modify /etc/icinga2/features-enabled/influxdb.conf by commenting out the new options
[...]
enable_send_metadata = true
//flush_threshold = 4096
//flush_interval = 60s
//socket_timeout = 10s
host_template = {
[...]
Please note that flush_threshold and flush_interval existed in 2.6.x and had low default values if not explicitly specified. Only the socket_timeout attribute was added to prevent a hanging InfluxDB API connection. @spjmurray looked into it and found out that the API tends to just "keep the connection open until infinite time", unless the client closes it hard.
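The effect of such a timeout can be illustrated with a small self-contained sketch. It is not Icinga's implementation: it starts a local server that accepts a connection and never answers, standing in for an API that keeps the connection open indefinitely, and shows a client giving up after a fixed read timeout. All function names are made up for illustration.

```python
import socket
import threading


def start_silent_server():
    """Listen on a free local port, accept one connection, and never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    release = threading.Event()

    def serve():
        conn, _ = server.accept()
        release.wait()  # hold the connection open without sending anything
        conn.close()
        server.close()

    threading.Thread(target=serve, daemon=True).start()
    return server.getsockname()[1], release


def read_with_timeout(port, timeout_seconds):
    """Send a request and return the reply, or None if none arrives in time."""
    with socket.create_connection(
        ("127.0.0.1", port), timeout=timeout_seconds
    ) as client:
        client.sendall(
            b"POST /write?db=icinga HTTP/1.1\r\n"
            b"Host: localhost\r\nContent-Length: 0\r\n\r\n"
        )
        try:
            return client.recv(4096)
        except socket.timeout:
            return None  # the client closes the socket instead of waiting forever


port, release = start_silent_server()
result = read_with_timeout(port, 0.5)
release.set()  # let the server thread clean up
print("timed out" if result is None else result)
```

Without the timeout, the `recv` call would block for as long as the server keeps the connection open, which matches the hanging behaviour that was hidden before 2.7.0.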
I haven't seen any problems with my dev instance before 2.7 (but without ssl_enable, which is why I am going down that route). Either @spjmurray is faster, or I find the time to investigate your issue.
I also understand the downgrade, no worries. Once a fix is there, I'd be glad if you're available for tests (or have a test lab somewhere where we could test additional logging and so on).
Please test the snapshot packages, or the referenced patch with your setups. Thanks.
I've installed the latest snapshot package. It looks like this solved the issue.
I will let it run a while longer and report any anomalies if they occur.
Thanks for that @spjmurray. I couldn't understand why my master instance all of a sudden had 10x the load after the 2.7 upgrade, which basically killed it.
Works again after updating to 2.7.1.