When Telegraf has been running for more than about an hour (possibly less), restarting the service times out.
Stop is OK.
Start fails with error 1053 (ERROR_SERVICE_REQUEST_TIMEOUT).
(Timeout is 30000 ms for total time for Stop and Start)
win_services
win_ping
standard windows system monitoring (disk, diskio, cpu, mem, net, processes)
Windows 2008 R2
If Telegraf has been recently restarted, the service will restart successfully.
Expected: the restart does not time out.
Actual: Telegraf does not start.
Happens with Telegraf 1.4.x, 1.5 latest nightly.
The event log contains only two error messages, stating that the start did not complete due to a timeout.
Telegraf service resiliency.
I would expect I am not alone, as this happens on most if not all servers.
I'm seeing this too on a couple of machines. It fails more often than not. Especially annoying when Telegraf fails to start on system reboot. #2754 would bypass the problem until the root cause is found.
I also can confirm this bug on Windows Server 2008R2/2012R2
On Windows Server 2016 it seems to work as expected
Telegraf Versions 1.4.5/1.5.0
Our Windows 7 machines encounter the same bug. Telegraf 1.4.5 and 1.5.0
I see something similar, but mostly notice when trying to restart the service remotely via PowerShell Get-Service | Restart-Service. For those having issues with it not starting after a reboot, I found setting the service to "Delayed Start" fixed the issue.
Remotely: sc.exe \\<servername> config telegraf start= delayed-auto seems to do the trick.
Telegraf built from latest source, Windows Server 2012 R2
@m82labs Tried the delayed start myself, but no dice. Not only did Telegraf fail to start on all machines after reboot, but it took 2 manual tries to do the trick.
Needless to say, this is a MASSIVE issue, because automatic service recovery in Windows seems to kick in only when a service exits with a non-zero code, which is not the case for startup timeouts.
@danielnelson Is there a backlog of work or tasks that Telegraf needs to go through on startup if it hasn't been stopped gracefully?
No, there is no special processing that is done after an unclean shutdown.
I can also confirm the issue on Windows Server 2012 R2, Telegraf 1.5.1, “Loaded inputs: inputs.influxdb inputs.win_perf_counters inputs.mysql”, and “Loaded outputs: influxdb”.
After a reboot, the Telegraf service will not start and the only way I found to get it going again is “As Administrator” (e.g. “Command Prompt”, “Run as administrator”, and “net start telegraf”).
I tried a few workarounds without success (starting Telegraf via nssm, giving “Everyone” permission to the service, configuring the service with the “Administrator” rather than “Local System” log on, having the Telegraf service restart on failures, create a service dependency to winmgmt, etc.).
I can confirm this issue as well.
After a restart the service cannot start. The first time the service runs (on Automatic, Delayed Start) it always times out; if I start it manually for the first time, it times out as well. Subsequent starts succeed.
Happening on Windows Server 2016 and 2008R2 at the moment.
EDIT: Increasing the timeout for service starting has worked for me. The timeout change affects all services.
I have done some testing with different versions and here are the results:
Tested versions: 1.4.0, 1.4.1, 1.4.2, 1.4.3, 1.4.4, 1.4.5, 1.5.0, 1.5.1
OK versions: 1.4.0, 1.4.1, 1.4.2, 1.4.3
FAIL versions: 1.4.4, 1.4.5, 1.5.0, 1.5.1
The latest working version is 1.4.3. On this version the Windows service starts as expected without any workarounds (service timeout adjustment, restarts, etc.).
So something broke between 1.4.3 and 1.4.4, but I can't see any suspicious changes in the diff, so it is probably an issue in the build process.
Current workaround for me (and probably for others): roll back to 1.4.3 if possible.
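For anyone repeating this kind of test on their own machines, here is a minimal Python sketch of how the boundary can be read off a results table. The `RESULTS` dictionary simply restates the OK/FAIL lists above; the function names are made up for illustration and are not part of Telegraf.

```python
# Sketch only: RESULTS restates the OK/FAIL lists reported above.
# parse_version and last_good_first_bad are hypothetical helper names.

RESULTS = {
    "1.4.0": True, "1.4.1": True, "1.4.2": True, "1.4.3": True,
    "1.4.4": False, "1.4.5": False, "1.5.0": False, "1.5.1": False,
}


def parse_version(text):
    """Turn '1.4.3' into (1, 4, 3) so versions sort numerically."""
    return tuple(int(part) for part in text.split("."))


def last_good_first_bad(results):
    """Return (last passing version, first failing version) in version order."""
    last_good = None
    for version in sorted(results, key=parse_version):
        if results[version]:
            last_good = version
        else:
            return last_good, version
    return last_good, None


print(last_good_first_bad(RESULTS))
```

The pair it returns is the range worth diffing or bisecting.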
@Fiery-Fenix That is very interesting. We did update gopsutil between these updates, it's possible the cause is part of that.
Reporting the same issue.
Haven't extensively tested telegraf versions, but I did notice the same described behavior.
We were seeing exactly the same issue, rolling back to 1.4.3 fixed the issue.
Seeing the same issue as well on 1.4.4 - 1.5.1
I changed the gopsutil version in GODEPS to what it was in 1.4.3 and ran a build of the (otherwise) latest master. This seems to be working fine on the few Server 2008r2, 2012r2, and Ubuntu machines I've tested so far.
Same issue here, after a reboot it fails to start in a "timely fashion" and needs to be manually restarted a couple of times to get up and running before windows complains.
Would be great if someone could git bisect the commit that caused this in gopsutil.
updated
hmm.. never mind. I went thru some commits, but I am not the right person for this.
I am running into this issue on Win7 but not on Win10.
Based on some procmon traces and additional service stop/start actions, I think this issue has something to do with the way gopsutil interacts with the winmgmt service. If I stop telegraf, stop winmgmt, start winmgmt, and then attempt to start telegraf the problem is reproducible without a reboot.
Steps to reproduce without a reboot...
net start telegraf (this may fail if this is the first time since boot)
net start telegraf (successful, telegraf is now running)
net stop telegraf
net stop winmgmt
net start winmgmt
net start telegraf (always fails after stop/start of winmgmt)
net start telegraf (successful, telegraf is now running)
I cannot reproduce this issue on Win10, only Win7.
I have not attempted to test on Windows Server.
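The steps above could be scripted roughly as follows. This is only a sketch: it assumes Windows, an elevated prompt, and a service registered under the name `telegraf`. The helper names `run_net` and `run_repro` are made up for illustration, and `net stop winmgmt` may ask for confirmation if dependent services are running.

```python
import subprocess

# The same sequence as the manual steps above, as (action, service) pairs.
REPRO_STEPS = [
    ("start", "telegraf"),   # may fail if this is the first start since boot
    ("start", "telegraf"),
    ("stop", "telegraf"),
    ("stop", "winmgmt"),
    ("start", "winmgmt"),
    ("start", "telegraf"),   # reported to fail after winmgmt was restarted
    ("start", "telegraf"),
]


def run_net(action, service):
    """Run 'net <action> <service>' and return its exit code (Windows only)."""
    return subprocess.run(["net", action, service]).returncode


def run_repro(runner=run_net):
    """Run every step in order and collect (command, exit code) pairs."""
    results = []
    for action, service in REPRO_STEPS:
        results.append((f"net {action} {service}", runner(action, service)))
    return results
```

Calling `run_repro()` and printing the returned pairs gives a quick record of which step failed on a given machine. A non-zero exit code from `net` is treated here as a failed step.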
I tried increasing the Windows service startup timeout all the way to ten minutes with HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ServicesPipeTimeout = 600000 but that did not help. It just takes ten minutes for the 'net start' command to time out but it still fails.
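For reference, the registry value mentioned above is a DWORD in milliseconds. Below is a small Python sketch that renders a .reg fragment for it; the function name is hypothetical, and as noted above the change did not fix the failure on this machine. It also affects every service, not only Telegraf.

```python
def services_pipe_timeout_reg(timeout_ms):
    """Render a .reg file that sets ServicesPipeTimeout to timeout_ms (a DWORD)."""
    if not 0 <= timeout_ms <= 0xFFFFFFFF:
        raise ValueError("timeout_ms must fit in a 32-bit DWORD")
    return "\n".join([
        "Windows Registry Editor Version 5.00",
        "",
        r"[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control]",
        # .reg files store DWORD values as eight hexadecimal digits.
        f'"ServicesPipeTimeout"=dword:{timeout_ms:08x}',
        "",
    ])


print(services_pipe_timeout_reg(600000))
```

The ten-minute value used above, 600000 ms, becomes `dword:000927c0` in the generated file.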
I doubt this is limited to that. It sometimes takes way too long for Telegraf to start from the command line as well as wrapped in nssm with the console flag.
I would be willing to put up a bounty to get this fixed ASAP. Please send me an email, just click on my profile. This would be through a company, as we need official invoices.
I put together this PowerShell script that sets up a scheduled task to make sure Telegraf starts when Windows starts up, in case it helps anyone: http://tracyboggiano.com/archive/2018/03/setting-up-telegraf-to-startup-on-windows/
Same issue here on 1.4.4 - 1.5.3 with multiple Windows Server 2012, rolling back to 1.4.3
@xkilian are you comfortable compiling telegraf yourself? We rolled back the gopsutil version in GODEPS to what it was pre-1.4.4 and recompiled with the latest commit at the time (1.6~). So far no issues across all environments.
Will this be in the next release? Having some issues with a lot of machines.
We've prioritized this as a bug and will work on it as soon as we finish the current release. If anyone would like to take a stab at fixing it and submit a PR, that would be awesome!
If I take a stab at it, could someone maybe just critique me. Never been a contrib 👍
I'm having the same issue on Server 2012r2 and 2016. After a reboot the service fails to start; manually starting the service works on the second try like clockwork.
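Since a second start attempt tends to succeed, a stopgap is to retry the start a few times, for example from a startup task. Here is a minimal Python sketch, assuming Windows, an elevated context, and a service named `telegraf`. The function names, attempt count, and delay are arbitrary choices for illustration, and this does nothing about the root cause.

```python
import subprocess
import time


def start_service(name):
    """Return True if 'net start <name>' exits with code 0 (Windows only)."""
    return subprocess.run(["net", "start", name]).returncode == 0


def start_with_retry(name, attempts=3, delay_s=5.0,
                     start=start_service, sleep=time.sleep):
    """Try to start the service up to `attempts` times.

    Returns the 1-based attempt number that succeeded, or None if all failed.
    """
    for attempt in range(1, attempts + 1):
        if start(name):
            return attempt
        if attempt < attempts:
            sleep(delay_s)  # give the service control manager time to settle
    return None
```

`start_with_retry("telegraf")` returns `2` in the "works on the second try" case described above. The `start` and `sleep` parameters exist only so the logic can be exercised without touching a real service.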
@russorat any idea on timing for the fix? are we talking days? weeks? months?
I am trying it out. Getting stuck as I am not 100% on the process for the deps. Will let you know when I have it.
Guys, just download 1.6.0-rc2 or newer and use nssm for the time being. Telegraf will still boot slowly, but never fail. I use the following command to install it as a service:
nssm.exe install Telegraf "c:\Program Files\Telegraf\telegraf.exe" "-console -config ""c:\Program Files\Telegraf\telegraf.conf"" --config-directory ""C:\Program Files\Telegraf\sub"""
sc failure "Telegraf" actions= restart/60000/restart/60000/restart/60000 reset= 86400
nssm.exe start Telegraf
I can no longer reproduce with 1.6.0-rc2 even without nssm. With 1.5.3 I could always reproduce by installing the service, starting it, and then rebooting.
Can everyone try out 1.6.0-rc2 and let us know if it helps, and if you are still experiencing problems can you include reproduction steps.
I will try this tomorrow when I get back into the office @danielnelson
@danielnelson I just tested it out in kitchen-vagrant, and it looks like the problem is gone. I rebooted the VM 4 times to make sure, and all 4 times telegraf was running. Just for good measure I did the same test with version 1.5.1 and telegraf was stuck in the starting phase and then died. I think we should be good with this version as far as this bug is concerned.
@joldham1023 That's great news, I'm going to close this issue as resolved for 1.6.0 then.
This issue is still occurring for us on low powered systems. I am currently running the 1.15 build due to all other builds having a memory leak and eventually crashing in Windows 2016.
Check the Windows event log for clues as well as the Telegraf log file. If that doesn't help, it's probably best to open a new issue with details of your system and if possible a way to reproduce.