[global_tags]
sc = "daf"
p = "32"
custom_version = "1.x"
os_type = "win2016s"
[agent]
interval = "1s"
round_interval = false
metric_batch_size = 1000
metric_buffer_limit = 5000
collection_jitter = "1s"
flush_interval = "1s"
flush_jitter = "1s"
precision = "s"
debug = true
quiet = false
logfile = "/Program Files/Telegraf/telegraf.log"
hostname = "win2016s"
omit_hostname = false
[[outputs.influxdb]]
urls = [ "http://1.1.1.1:8086" ]
database = "telegraf"
retention_policy = "24hours"
precision = "m"
[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false
report_active = false
[[inputs.mem]]
[[inputs.procstat]]
interval = "1s"
exe = ".*"
pid_finder = "native"
[[inputs.internal]]


No special steps; the memory leak appears to be related to the procstat input plugin:


It looks like there are nearly as many frees (16 MB) as allocs (16.1 MB). What is the query behind procstat.mean (process_name: telegraf.exe)?
The drops in this graph are due to restarting Telegraf for troubleshooting:

Here I'm just trying to show how much memory is accumulated over time:

Eventually, Telegraf consumes enough memory that it gets stopped:

It's interesting to see the change in slope of memory growth before and after the restart at around 10 AM. I changed the config from exe = ".*" to 7 instances of procstat, each matching a unique process name. This also decreased the gather time from 17 seconds to 3 seconds.
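The split configuration looked roughly like the following sketch; the process names other than telegraf are placeholders for illustration, not the actual seven from my setup:

```toml
# Instead of one catch-all matcher:
#   [[inputs.procstat]]
#     exe = ".*"
# use one block per process of interest (names below are examples):

[[inputs.procstat]]
  exe = "telegraf"
  pid_finder = "native"

[[inputs.procstat]]
  exe = "sqlservr"
  pid_finder = "native"

[[inputs.procstat]]
  exe = "w3wp"
  pid_finder = "native"
```

Each block matches a single executable name, so procstat no longer has to enumerate and regex-match every process on the system on each gather.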

Before/After:

Before:

After:

I've been running this configuration for about 4 hours and I'm not seeing this pattern; however, I don't have any real load on the system.
Could you add the --pprof-addr=:6060 option when starting Telegraf and, after the process RSS doubles from startup, go to http://localhost:6060/debug/pprof/heap and attach the resulting file?
@danielnelson can do. Can I send it directly to you?
Yes, email address is on my profile page.
Can you show a couple hours of internal_memstats sys_bytes?
It holds mostly steady at 19.81 MB.
24 hours:

3 hours:

The RSS is rising, but Go doesn't seem to know about it, and I didn't see any interesting objects in the memory profiles either. Just thinking aloud, but could the memory be lost to Go, or perhaps leaked in a DLL call? I'm not sure. I'm also unable to reproduce it, even when spawning new processes on the system.
Do you know if this is a new issue in Telegraf 1.13? Could you compare against 1.12.6 and 1.11.5?


For the record, here are my numbers with Telegraf 1.13.0 (Windows 7):

At first it trends up, but it levels off around 55 MB RSS. It seems to take about 3 hours to reach that maximum.
So far, looking across our different Windows instances, I've only seen this occur on Windows Server 2016.
2016:

2012:

What should I be looking at here?
Just trying to find a way to capture the difference in the DLLs in use between 2016 and non-2016 systems.
I'm going to try installing WMF 5.1 to see if it triggers the error, as suggested in https://github.com/go-ole/go-ole/issues/135#issuecomment-283440299.
Quick update, I installed WMF 5.1 on a Windows 2012 (non-R2) box, and it didn't cause the memory leak.
Same on my Windows 7 system, WMF 5.1 had no effect.
Still not reproducing the leak with a Windows 10 Pro VM:

I have been able to reproduce this on a Windows 2016 VM running in Azure. Will update if I can find a way to reduce or eliminate the leaked memory.
This continues to be an issue on the latest Windows Server 2016: WMI-based metrics cause a memory leak. We use WMI queries ourselves and were able to work around it by calling CoInitializeEx only once per thread. It seems that Telegraf leaks in proportion to the number of WMI metrics, probably because CoInitializeEx is called for every query. Has anyone filed a bug with Microsoft about this?

In the attached screenshot, the win_proc measurement for Telegraf (Working Set Private metric) increases to 100 MB within 7 days.
@danielnelson how does win_proc access WMI metrics? We fixed the issue in our own code by calling CoInitializeEx once per thread instead of once per WMI query, which stops the leak when using WMI metrics.
@danielnelson Looks like Datadog had the same issue with their WMI sampler (https://github.com/DataDog/integrations-core/pull/3987), which clearly indicates this is a Windows 2016 memory leak triggered by calling CoInitialize for each WMI query.
After reviewing the Telegraf code, it looks like you rely on the win_pdh library for the actual Win32 calls, and I couldn't find a call to CoInitialize, so I'm not sure how to help.
Any update, @danielnelson?
Can you retest with this build telegraf-1.15.0~d78dfac1_windows_amd64.zip?
Will do!
Looks like this has been resolved; the WMI leak is gone on Windows Server 2016!
48 hours running with 1.12.6:

24 hours running with 1.15.0:

thank you @danielnelson
What's the ETA on 1.15.0?
Great news, thanks for testing.
I expect 1.15.0 to be released sometime in the first half of July.