Apologies if I got something wrong, which might be the case.
Setting drop_original = true on some filtered aggregators causes Telegraf to discard metrics that are not matched by their filters. This question on community.influxdata.com seems to indicate that this is not the intended behavior.
I've reproduced the issue with the basicstats and the minmax aggregators.
I'm currently using this test scenario:
.
├── log.txt
└── telegraf.conf
The telegraf.conf is a simple configuration file with two inputs, a debug output and an aggregator:
[global_tags]
[agent]
interval = "5s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "5s"
flush_jitter = "0s"
precision = ""
debug = false
quiet = false
logfile = ""
hostname = ""
omit_hostname = false
[[outputs.file]]
files = ["stdout"]
[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false
[[inputs.logparser]]
files = ["/etc/telegraf/telegraf.debug_aggregator/log.txt"]
from_beginning = true
[inputs.logparser.grok]
patterns = [
"^id %{NUMBER:id:tag} rev %{NUMBER:rev:int}"
]
measurement = "modsec_rules_hits"
timezone = 'Local'
[[aggregators.basicstats]]
namepass = "modsec_rules_hits"
drop_original = true
period = "10s"
grace = "86400s"
fieldpass = ["rev"]
stats = ["count"]
The log.txt has been stripped down for simplicity and its contents are as follows:
id 111 rev 2
id 111 rev 2
id 111 rev 2
id 111 rev 2
To reproduce the issue do these three tests:
1. Comment out the aggregator in telegraf.conf and get a batch of metrics with telegraf --debug --config telegraf.conf test. Save the results.
2. Uncomment the aggregator, set drop_original = false in the config file and get another batch of metrics. Save the results.
3. Set drop_original = true in the config file and get another batch of metrics. Compare to the previously saved results.

At the end of the issue I'm attaching my results.
Unless I've misunderstood this post, the final output should be something like this, as the only dropped metrics are those that are being filtered by the aggregator (in this case, those that have the "modsec_rules_hits" measurement):
modsec_rules_hits,host=localhost,modsec_rule_id=111,path=/etc/telegraf/telegraf.debug_aggregator/log.txt rev_count=3 1591094940000000000
cpu,cpu=cpu-total,host=localhost usage_steal=0,usage_guest=0,usage_user=0.2009040682919928,usage_system=0.5022601707573902,usage_idle=99.04570567582773,usage_nice=0.20090406830569688,usage_irq=0,usage_iowait=0.05022601707642422,usage_softirq=0,usage_guest_nice=0 1591094895000000000
cpu,cpu=cpu3,host=localhost usage_system=0.8048289738468508,usage_irq=0,usage_steal=0,usage_guest=0,usage_guest_nice=0,usage_user=0.20120724346628763,usage_idle=98.79275654044665,usage_nice=0.2012072434617127,usage_iowait=0,usage_softirq=0 1591094895000000000
cpu,cpu=cpu2,host=localhost usage_idle=99.79797979778975,usage_softirq=0,usage_steal=0,usage_guest_nice=0,usage_user=0,usage_system=0.20202020202651216,usage_nice=0,usage_iowait=0,usage_irq=0,usage_guest=0 1591094895000000000
cpu,cpu=cpu1,host=localhost usage_system=0.40080160319768315,usage_nice=0.40080160320679636,usage_iowait=0.20040080160453733,usage_softirq=0,usage_guest_nice=0,usage_user=0.20040080158972842,usage_idle=98.79759519159167,usage_irq=0,usage_steal=0,usage_guest=0 1591094895000000000
cpu,cpu=cpu0,host=localhost usage_iowait=0,usage_irq=0,usage_softirq=0,usage_guest=0,usage_user=0.20161290323172149,usage_system=0.2016129032133849,usage_idle=99.39516129210182,usage_nice=0.20161290322713735,usage_steal=0,usage_guest_nice=0 1591094895000000000
drop_original = true seems to be discarding all of the metrics, not only those that are being processed by the aggregator. The end result of a batch of metrics is something like this:
modsec_rules_hits,host=localhost,modsec_rule_id=111,path=/etc/telegraf/telegraf.debug_aggregator/log.txt rev_count=3 1591094940000000000
Metrics obtained by the test setup with the aggregator commented out on the config file:
modsec_rules_hits,host=localhost,modsec_rule_id=111,path=/etc/telegraf/telegraf.debug_aggregator/log.txt rev=2i 1591094530000000000
modsec_rules_hits,host=localhost,modsec_rule_id=111,path=/etc/telegraf/telegraf.debug_aggregator/log.txt rev=2i 1591094525000000000
modsec_rules_hits,host=localhost,modsec_rule_id=111,path=/etc/telegraf/telegraf.debug_aggregator/log.txt rev=2i 1591094530000000000
cpu,cpu=cpu-total,host=localhost usage_guest_nice=0,usage_idle=99.2466097443031,usage_steal=0,usage_guest=0,usage_iowait=0,usage_irq=0,usage_softirq=0,usage_user=0.2009040682966916,usage_system=0.35158211954205043,usage_nice=0.2009040683103957 1591094830000000000
cpu,cpu=cpu3,host=localhost usage_nice=0.40160642571333627,usage_steal=0,usage_guest=0,usage_guest_nice=0,usage_user=0.20080321284297092,usage_idle=98.79518072407448,usage_irq=0,usage_softirq=0,usage_system=0.6024096385837017,usage_iowait=0 1591094830000000000
cpu,cpu=cpu2,host=localhost usage_user=0.1999999999908323,usage_system=0.1999999999908323,usage_nice=0.2000000000044747,usage_softirq=0,usage_idle=99.20000000156462,usage_iowait=0.2000000000044747,usage_irq=0,usage_steal=0,usage_guest=0,usage_guest_nice=0 1591094830000000000
cpu,cpu=cpu1,host=localhost usage_nice=0.20040080160339818,usage_irq=0,usage_steal=0,usage_guest=0,usage_guest_nice=0,usage_system=0.4008016032159095,usage_idle=98.99799599104891,usage_softirq=0,usage_user=0.4008016032159095,usage_iowait=0 1591094830000000000
cpu,cpu=cpu0,host=localhost usage_user=0,usage_idle=99.59758551270143,usage_irq=0,usage_steal=0,usage_guest_nice=0,usage_system=0.40241448691427556,usage_nice=0,usage_iowait=0,usage_softirq=0,usage_guest=0 1591094830000000000
Metrics obtained with the aggregator uncommented but using drop_original = false. Note that the first line is the one being aggregated:
modsec_rules_hits,host=localhost,modsec_rule_id=111,path=/etc/telegraf/telegraf.debug_aggregator/log.txt rev_count=3 1591094890000000000
modsec_rules_hits,host=localhost,modsec_rule_id=111,path=/etc/telegraf/telegraf.debug_aggregator/log.txt rev=2i 1591094530000000000
modsec_rules_hits,host=localhost,modsec_rule_id=111,path=/etc/telegraf/telegraf.debug_aggregator/log.txt rev=2i 1591094525000000000
modsec_rules_hits,host=localhost,modsec_rule_id=111,path=/etc/telegraf/telegraf.debug_aggregator/log.txt rev=2i 1591094530000000000
cpu,cpu=cpu-total,host=localhost usage_steal=0,usage_guest=0,usage_user=0.2009040682919928,usage_system=0.5022601707573902,usage_idle=99.04570567582773,usage_nice=0.20090406830569688,usage_irq=0,usage_iowait=0.05022601707642422,usage_softirq=0,usage_guest_nice=0 1591094895000000000
cpu,cpu=cpu3,host=localhost usage_system=0.8048289738468508,usage_irq=0,usage_steal=0,usage_guest=0,usage_guest_nice=0,usage_user=0.20120724346628763,usage_idle=98.79275654044665,usage_nice=0.2012072434617127,usage_iowait=0,usage_softirq=0 1591094895000000000
cpu,cpu=cpu2,host=localhost usage_idle=99.79797979778975,usage_softirq=0,usage_steal=0,usage_guest_nice=0,usage_user=0,usage_system=0.20202020202651216,usage_nice=0,usage_iowait=0,usage_irq=0,usage_guest=0 1591094895000000000
cpu,cpu=cpu1,host=localhost usage_system=0.40080160319768315,usage_nice=0.40080160320679636,usage_iowait=0.20040080160453733,usage_softirq=0,usage_guest_nice=0,usage_user=0.20040080158972842,usage_idle=98.79759519159167,usage_irq=0,usage_steal=0,usage_guest=0 1591094895000000000
cpu,cpu=cpu0,host=localhost usage_iowait=0,usage_irq=0,usage_softirq=0,usage_guest=0,usage_user=0.20161290323172149,usage_system=0.2016129032133849,usage_idle=99.39516129210182,usage_nice=0.20161290322713735,usage_steal=0,usage_guest_nice=0 1591094895000000000
Metrics obtained with the aggregator uncommented and drop_original = true. Note that the metrics from the cpu input are missing, along with the original ones processed by the aggregator:
modsec_rules_hits,host=localhost,modsec_rule_id=111,path=/etc/telegraf/telegraf.debug_aggregator/log.txt rev_count=3 1591094940000000000
Thanks for writing up the case. It took me a bit longer than I'd like to admit to spot the issue, but this is related to a known issue in our TOML parser where it does not properly warn if you pass a string instead of an array of strings.
This causes the namepass option not to be set and so all metrics are matched.
This change to your configuration should take care of it:
[[aggregators.basicstats]]
- namepass = "modsec_rules_hits"
+ namepass = ["modsec_rules_hits"]
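To illustrate why an unset namepass ends up dropping everything, here is a minimal Python model of the behavior described above. It is only a sketch, not Telegraf's actual code: the function names are made up for illustration, fieldpass is ignored, and glob matching is approximated with fnmatch.

```python
from fnmatch import fnmatchcase


def aggregator_selects(name, namepass):
    """Return True if the (modelled) aggregator takes this metric.

    An empty or unset namepass means "no name filter", so every metric matches.
    """
    if not namepass:
        return True
    return any(fnmatchcase(name, pattern) for pattern in namepass)


def run(metric_names, namepass, drop_original):
    """Split metrics into those passed on to outputs and those aggregated."""
    passed_on, aggregated = [], []
    for name in metric_names:
        if aggregator_selects(name, namepass):
            aggregated.append(name)
            if drop_original:
                continue  # the original is dropped once the aggregator takes it
        passed_on.append(name)
    return passed_on, aggregated


metrics = ["cpu", "cpu", "modsec_rules_hits"]

# namepass silently left unset (string given instead of an array of strings)
print(run(metrics, [], drop_original=True))

# namepass set correctly as an array of strings
print(run(metrics, ["modsec_rules_hits"], drop_original=True))
```

In the first call nothing is passed on, which matches the cpu metrics going missing; in the second only the modsec_rules_hits originals are dropped.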
Sorry about the confusion. The related issues are #3444 and #6474, but I'm going to keep this issue open as well since it manifests in a slightly different way.
That indeed seems to be the issue :woman_facepalming:
Thank you very much for your time and work, Daniel.
Is there any way to detect those kinds of syntax errors from userland?
This issue really needs to be fixed in Telegraf. There isn't any good workaround to detect it, as it's not a TOML syntax error.
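As a stopgap until then, a rough line-based check can at least flag the specific mistake from this issue. The following is a minimal sketch and not part of Telegraf: the function name find_string_filters and the list of options checked are assumptions for illustration, and it scans lines with a regular expression rather than parsing TOML, so it can miss unusual layouts.

```python
import re

# Filter options assumed here to expect an array of strings.
LIST_OPTIONS = (
    "namepass", "namedrop",
    "fieldpass", "fielddrop",
    "taginclude", "tagexclude",
)

# Matches e.g.  namepass = "foo"  but not  namepass = ["foo"]
_STRING_VALUE = re.compile(r"^\s*(%s)\s*=\s*[\"']" % "|".join(LIST_OPTIONS))


def find_string_filters(config_text):
    """Return (line_number, option) pairs where a filter option is a bare string."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        match = _STRING_VALUE.match(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits


sample = '''
[[aggregators.basicstats]]
namepass = "modsec_rules_hits"
drop_original = true
fieldpass = ["rev"]
'''

for lineno, option in find_string_filters(sample):
    print(f"line {lineno}: {option} should be an array of strings")
```

Running it over the text of telegraf.conf before starting Telegraf would report the namepass line from the configuration above.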