Telegraf 1.9.1 (git: HEAD 20636091)
CentOS 7
InfluxDB shell version: 1.7.2
sudo systemctl stop puppet.service
SELECT last("pid_count") FROM "autogen"."procstat_lookup" WHERE ("systemd_unit" = 'puppet.service') AND time > now()-2m GROUP BY "myTag"
name: procstat_lookup
tags: myTag1
time last
---- ----
1547670360000000000 0
name: procstat_lookup
tags: myTag1
time last
---- ----
1547670360000000000 1
I only learned a bit of Go to read the source code, but from what I understood, the problem is that checking len(pids) is not enough: running systemctl show puppet.service when the service is not running gives MainPID=0 (see also #3612), and systemdUnitPIDs() does not check for zero values.
[[inputs.procstat]]
systemd_unit = "puppet.service"
There are cases where MainPID=0 even though the service is running, which seems to be caused by using the wrong systemd unit or an incorrectly constructed unit file. Maybe systemdUnitPIDs() should check for zero values to encourage proper use of the units/files. Or maybe it should look up the PIDs in a different way, since many of the units/files are created by default when a service is installed.
A different way to check if the service is running could be by checking the SubState or the ActiveState in the same way that the MainPID is being currently obtained.
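To make the idea concrete, here is a minimal sketch of what such a check could look like. It assumes the plugin would run `systemctl show -p ActiveState -p SubState <unit>` and parse the KEY=VALUE lines it prints; `parseProperties` and `unitIsRunning` are hypothetical names for illustration, not functions that exist in Telegraf.

```go
package main

import (
	"fmt"
	"strings"
)

// parseProperties is a hypothetical helper: it turns the KEY=VALUE lines
// printed by `systemctl show` into a map.
func parseProperties(output string) map[string]string {
	props := make(map[string]string)
	for _, line := range strings.Split(output, "\n") {
		if key, value, found := strings.Cut(strings.TrimSpace(line), "="); found {
			props[key] = value
		}
	}
	return props
}

// unitIsRunning is a hypothetical helper: it reports whether the parsed
// output describes a unit that is active with SubState "running".
func unitIsRunning(output string) bool {
	props := parseProperties(output)
	return props["ActiveState"] == "active" && props["SubState"] == "running"
}

func main() {
	// Sample strings stand in for real `systemctl show` output.
	fmt.Println(unitIsRunning("ActiveState=active\nSubState=running\n")) // true
	fmt.Println(unitIsRunning("ActiveState=inactive\nSubState=dead\n"))  // false
}
```

Whether both properties or only SubState should be consulted is a design choice this sketch does not settle.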
In any case it would be good if someone could confirm if this is a bug or if it has something to do with my setup, so I can know how to proceed :)
That's a great suggestion. pid_count seems like the wrong field to populate when MainPID=0, as the service may have many child processes. There would also need to be a new path to determine "running" in that case, likely using the method you linked to: systemctl show -p SubState something.service. I suppose for compatibility's sake we could set pid_count to 1 if the SubState is "running" and 0 otherwise.
Since 0 is not a valid PID, we should just check for it and skip it when reading the MainPID. The pid_count isn't meant to indicate if the service is running or not, just how many pids were found. We will likely have a solution for monitoring systemd service status in #4823.
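As a rough illustration of that check, the sketch below parses `systemctl show` style output and skips a zero MainPID. `parseMainPID` is a hypothetical standalone helper written for this comment; it is not the actual systemdUnitPIDs() implementation, and the sample input strings are made up.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMainPID is a hypothetical helper, not Telegraf's actual code.
// It scans KEY=VALUE output for the MainPID property and returns the PIDs
// found, skipping 0 because systemd reports MainPID=0 when the unit has no
// main process.
func parseMainPID(output string) ([]int32, error) {
	var pids []int32
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		key, value, found := strings.Cut(scanner.Text(), "=")
		if !found || key != "MainPID" {
			continue
		}
		pid, err := strconv.ParseInt(strings.TrimSpace(value), 10, 32)
		if err != nil {
			return nil, fmt.Errorf("invalid MainPID %q: %w", value, err)
		}
		if pid == 0 {
			continue // 0 is not a valid PID, so it is not counted
		}
		pids = append(pids, int32(pid))
	}
	return pids, scanner.Err()
}

func main() {
	stopped := "Type=simple\nMainPID=0\nActiveState=inactive\n"
	running := "Type=simple\nMainPID=1234\nActiveState=active\n"
	for _, out := range []string{stopped, running} {
		pids, err := parseMainPID(out)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("pid_count:", len(pids))
	}
}
```

With the sample input this prints pid_count: 0 for the stopped unit and pid_count: 1 for the running one.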
Agreed, though I feel we should drop the pid_count metric entirely if systemd_unit is specified, because it doesn't give an accurate count: there will only ever be a single PID in MainPID.
edit:
If we do keep pid_count for systemd, I do think we should set it to 1 if the service is running (even if MainPID is 0) for more consistent behavior with properly configured/named services.
@danielnelson I thought that was what it was meant to do as per #4237
This plugin doesn't report whether systemd thinks a service is running or not; it searches for PIDs and then reads the process information for them. This field should contain the number of PIDs that have been found. If MainPID=0, then it hasn't found any valid PIDs, so it should be zero.
This is why, if you are using the pidfile method and the file exists and contains a PID, pid_count will report 1 whether or not the process is running.
It may be that we are not finding the PIDs correctly for systemd, which would be another issue, but I don't think we need to special case systemd in any way.
Closed in #5972