Telegraf: Memory Leak with procstat

Created on 17 Dec 2019 · 31 comments · Source: influxdata/telegraf

Relevant telegraf.conf:

[global_tags]
  sc = "daf"
  p = "32"
  custom_version = "1.x"
  os_type = "win2016s"

[agent]
  interval = "1s"
  round_interval = false
  metric_batch_size = 1000
  metric_buffer_limit = 5000
  collection_jitter = "1s"
  flush_interval = "1s"
  flush_jitter = "1s"
  precision = "s"
  debug = true
  quiet = false
  logfile = "/Program Files/Telegraf/telegraf.log"
  hostname = "win2016s"
  omit_hostname = false

[[outputs.influxdb]]
  urls = [ "http://1.1.1.1:8086" ]
  database = "telegraf"
  retention_policy = "24hours"
  precision = "m"

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.mem]]  

[[inputs.procstat]]
  interval = "1s"
  exe = ".*"
  pid_finder = "native"

[[inputs.internal]]


System info:

[screenshots]

Steps to reproduce:

No special steps; the memory leak appears to be related to the procstat input plugin:

[screenshots]

Labels: area/procstat, bug, ready

All 31 comments

It looks like there are about as many frees (16 MB) as allocs (16.1 MB). What is the query behind procstat.mean (process_name: telegraf.exe)?

The drops in this graph are due to restarting telegraf for troubleshooting:
[screenshot]

Here I'm just trying to show how much memory is accumulated over time:
[screenshot]

Eventually, telegraf will consume enough memory that it will be stopped:
[screenshot]

It's interesting to see the change in slope of memory growth before and after the restart at ~10am. I modified the config from exe = ".*" to 7 instances of procstat, each matching a single process name (roughly as sketched below). This also decreased the gather time from 17 seconds to 3 seconds.
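For illustration, the split looks roughly like this (the process names below are placeholders, not the seven I actually used):

[[inputs.procstat]]
  interval = "1s"
  exe = "processA"
  pid_finder = "native"

[[inputs.procstat]]
  interval = "1s"
  exe = "processB"
  pid_finder = "native"

# ...five more [[inputs.procstat]] blocks, one per process name,
# replacing the single block that had exe = ".*"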

[screenshot]

Before/After:
[screenshot]

Before:
[screenshot]

After:
[screenshot]

I've been running this configuration for about 4 hours and I'm not seeing this pattern; however, I don't have any real load on the system.

Could you add the --pprof-addr=:6060 option when starting Telegraf, and after the process RSS doubles from startup, go to http://localhost:6060/debug/pprof/heap and attach the file?
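For example, start it along these lines (the config path here is only an assumption based on the logfile setting above):

telegraf.exe --config "C:\Program Files\Telegraf\telegraf.conf" --pprof-addr=:6060

Then, once the RSS has doubled, open http://localhost:6060/debug/pprof/heap in a browser and attach the downloaded file.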

@danielnelson can do. Can I send it directly to you?

Yes, email address is on my profile page.

Can you show a couple hours of internal_memstats sys_bytes?
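For reference, a query along these lines against the telegraf database should pull that series (assuming the internal plugin's default measurement and field names):

SELECT mean("sys_bytes") FROM "internal_memstats" WHERE time > now() - 3h GROUP BY time(1m)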

Mostly fixed at 19.81 MB

24 hours:
[screenshot]

3 hours:
[screenshot]

The RSS is rising but Go doesn't seem to know about it, and I didn't see any interesting objects in the memory profiles either. Just thinking aloud, but it could be memory Go has lost track of, or perhaps it is leaked in a DLL call; I'm not sure. I'm also unable to replicate it, even when spawning new processes on the system.

Do you know if this is a new issue with Telegraf 1.13? Could you compare against 1.12.6 and 1.11.5?

[screenshots]

For the record, here are my numbers with Telegraf 1.13.0 (Windows 7):

[screenshot: 2019-12-19-144203_1370x918_scrot]

At first it trends up, but it levels off around 55 MB RSS. It seems to take about 3 hours to reach that plateau.

So far, looking across our different Windows instances, I've only seen this occur on Windows Server 2016.

2016:
[screenshot]

2012:
[screenshot]

What should I be looking at here?

Just trying to find a way to capture the difference in DLLs in use between 2016 and non-2016 systems.

I'm going to try installing WMF 5.1 to see if it causes the error as suggested in https://github.com/go-ole/go-ole/issues/135#issuecomment-283440299.

Quick update: I installed WMF 5.1 on a Windows 2012 (non-R2) box, and it didn't cause the memory leak.

Same on my Windows 7 system: WMF 5.1 had no effect.

Still not reproducing the leak with a Windows 10 Pro VM:

[screenshot: 2020-01-08-220402_1332x874_scrot]

I have been able to reproduce this on a Windows 2016 VM running in Azure. Will update if I can find a way to reduce or eliminate the leaked memory.

This continues to be an issue with the latest Windows Server 2016: WMI-based metrics cause a memory leak. We use WMI queries ourselves and are able to bypass the leak by calling CoInitializeEx only once per thread, but it seems that telegraf's leak scales with the number of WMI metrics, probably because CoInitializeEx is called for every query. Has anyone submitted a bug to Microsoft about this?

[screenshot]
In the attached screenshot, win_proc for telegraf (the Working Set Private metric) increases to 100 MB within 7 days.

@danielnelson how does win_proc access WMI metrics? In our own code we fixed the issue by calling CoInitializeEx once per thread instead of once per WMI query, which prevents the leak when using WMI metrics (sketched below).
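For what it's worth, here is a minimal sketch of the once-per-thread pattern we use. This is illustrative code built on github.com/go-ole/go-ole, not Telegraf's own code, and the WMI query in main is just an example:

package main

import (
	"log"
	"runtime"

	ole "github.com/go-ole/go-ole"
	"github.com/go-ole/go-ole/oleutil"
)

// runWMIWorker pins a goroutine to a single OS thread and calls
// CoInitializeEx exactly once for that thread, instead of once per query.
// The queries channel carries WQL strings.
func runWMIWorker(queries <-chan string) {
	// COM initialization is per-thread, so keep this goroutine on one thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	if err := ole.CoInitializeEx(0, ole.COINIT_MULTITHREADED); err != nil {
		log.Fatal(err)
	}
	defer ole.CoUninitialize() // uninitialize once, when the worker exits

	unknown, err := oleutil.CreateObject("WbemScripting.SWbemLocator")
	if err != nil {
		log.Fatal(err)
	}
	defer unknown.Release()

	locator, err := unknown.QueryInterface(ole.IID_IDispatch)
	if err != nil {
		log.Fatal(err)
	}
	defer locator.Release()

	serviceRaw, err := oleutil.CallMethod(locator, "ConnectServer")
	if err != nil {
		log.Fatal(err)
	}
	service := serviceRaw.ToIDispatch()
	defer service.Release()

	// Reuse the same initialized thread and WMI connection for every query.
	for wql := range queries {
		result, err := oleutil.CallMethod(service, "ExecQuery", wql)
		if err != nil {
			log.Printf("query failed: %v", err)
			continue
		}
		result.ToIDispatch().Release()
	}
}

func main() {
	queries := make(chan string, 1)
	queries <- "SELECT ProcessId, Name FROM Win32_Process" // example query
	close(queries)
	runWMIWorker(queries)
}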

@danielnelson Looks like Datadog had the same issue with their WMI sampler (https://github.com/DataDog/integrations-core/pull/3987), which points clearly to a Windows Server 2016 memory leak when CoInitialize is called for each WMI query.

After reviewing the telegraf code, it seems you rely on the win_pdh library for the actual Win32 calls, and I couldn't find a call to CoInitialize, so I'm not sure how to help further.

Any update, @danielnelson?

Can you retest with this build telegraf-1.15.0~d78dfac1_windows_amd64.zip?

Will do!

Looks like this has been resolved; the WMI leak is gone on Windows Server 2016!

48 hours running with 1.12.6:

[screenshot]

24 hours running with 1.15.0:

[screenshot]

thank you @danielnelson

What's the ETA on 1.15.0?

Great news, thanks for testing.

I expect 1.15.0 to be released sometime in the first half of July.
