Telegraf: Memory Leak with procstat

Created on 17 Dec 2019 · 31 comments · Source: influxdata/telegraf

Relevant telegraf.conf:

[global_tags]
  sc = "daf"
  p = "32"
  custom_version = "1.x"
  os_type = "win2016s"

[agent]
  interval = "1s"
  round_interval = false
  metric_batch_size = 1000
  metric_buffer_limit = 5000
  collection_jitter = "1s"
  flush_interval = "1s"
  flush_jitter = "1s"
  precision = "s"
  debug = true
  quiet = false
  logfile = "/Program Files/Telegraf/telegraf.log"
  hostname = "win2016s"
  omit_hostname = false

[[outputs.influxdb]]
  urls = [ "http://1.1.1.1:8086" ]
  database = "telegraf"
  retention_policy = "24hours"
  precision = "m"

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.mem]]  

[[inputs.procstat]]
  interval = "1s"
  exe = ".*"
  pid_finder = "native"

[[inputs.internal]]


System info:

[screenshots]

Steps to reproduce:

No special steps; the memory leak appears to be related to the procstat input plugin:

[screenshots]

Labels: area/procstat, bug, ready

All 31 comments

It looks like there are about as many frees (16 MB) as allocs (16.1 MB). What is the query behind procstat.mean (process_name: telegraf.exe)?

The drops in this graph are due to restarting telegraf for troubleshooting:
[screenshot]

Here I'm just trying to show how much memory is accumulated over time:
[screenshot]

Eventually, telegraf will consume enough memory that it will be stopped:
[screenshot]

It's interesting to see the change in slope of memory growth before and after the restart at ~10am. I modified the config from exe = ".*" to 7 instances of procstat, each matching a single process name (roughly as sketched below). This also decreased the gather time from 17 seconds to 3 seconds.
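For illustration, the split looks roughly like this (the process names below are placeholders, not the seven I actually used):

[[inputs.procstat]]
  interval = "1s"
  exe = "processA"
  pid_finder = "native"

[[inputs.procstat]]
  interval = "1s"
  exe = "processB"
  pid_finder = "native"

# ...five more [[inputs.procstat]] blocks, one per process name,
# replacing the single block that had exe = ".*"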

[screenshot]

Before/After:
[screenshot]

Before:
[screenshot]

After:
[screenshot]

I've been running this configuration for about 4 hours and I'm not seeing this pattern; however, I don't have any real load on the system.

Could you add the --pprof-addr=:6060 option when starting Telegraf, and after the process RSS doubles from startup, go to http://localhost:6060/debug/pprof/heap and attach the file?
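For example, start it along these lines (the config path here is only an assumption based on the logfile setting above):

telegraf.exe --config "C:\Program Files\Telegraf\telegraf.conf" --pprof-addr=:6060

Then, once the RSS has doubled, open http://localhost:6060/debug/pprof/heap in a browser and attach the downloaded file.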

@danielnelson can do. Can I send it directly to you?

Yes, email address is on my profile page.

Can you show a couple hours of internal_memstats sys_bytes?
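For reference, a query along these lines against the telegraf database should pull that series (assuming the internal plugin's default measurement and field names):

SELECT mean("sys_bytes") FROM "internal_memstats" WHERE time > now() - 3h GROUP BY time(1m)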

Mostly fixed at 19.81 MB

24 hours:
[screenshot]

3 hours:
[screenshot]

The RSS is rising but Go doesn't seem to know about it, and I didn't see any interesting objects in the memory profiles either. Just thinking aloud, but it could be memory Go has lost track of, or perhaps it is leaked in a DLL call; I'm not sure. I'm also unable to replicate it, even when spawning new processes on the system.

Do you know if this is a new issue with Telegraf 1.13? Could you compare against 1.12.6 and 1.11.5?

[screenshots]

For the record, here are my numbers with Telegraf 1.13.0 (Windows 7):

[screenshot: 2019-12-19-144203_1370x918_scrot]

At first it trends up, but it levels off around 55 MB RSS. It seems to take about 3 hours to reach that plateau.

So far, looking across our different Windows instances, I've only seen this occur on Windows Server 2016.

2016:
[screenshot]

2012:
[screenshot]

What should I be looking at here?

Just trying to find a way to capture the difference in DLLs in use between 2016 and non-2016 systems.

I'm going to try installing WMF 5.1 to see if it causes the error as suggested in https://github.com/go-ole/go-ole/issues/135#issuecomment-283440299.

Quick update: I installed WMF 5.1 on a Windows 2012 (non-R2) box, and it didn't cause the memory leak.

Same on my Windows 7 system: WMF 5.1 had no effect.

Still not reproducing the leak with a Windows 10 Pro VM:

[screenshot: 2020-01-08-220402_1332x874_scrot]

I have been able to reproduce this on a Windows 2016 VM running in Azure. Will update if I can find a way to reduce or eliminate the leaked memory.

This continues to be an issue with the latest Windows Server 2016: WMI-based metrics cause a memory leak. We use WMI queries ourselves and are able to bypass the leak by calling CoInitializeEx only once per thread, but it seems that telegraf's leak scales with the number of WMI metrics, probably because CoInitializeEx is called for every query. Has anyone submitted a bug to Microsoft about this?

[screenshot]
In the attached screenshot, win_proc for telegraf (the Working Set Private metric) increases to 100 MB within 7 days.

@danielnelson how does win_proc access WMI metrics? In our own code we fixed the issue by calling CoInitializeEx once per thread instead of once per WMI query, which prevents the leak when using WMI metrics (sketched below).
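For what it's worth, here is a minimal sketch of the once-per-thread pattern we use. This is illustrative code built on github.com/go-ole/go-ole, not Telegraf's own code, and the WMI query in main is just an example:

package main

import (
	"log"
	"runtime"

	ole "github.com/go-ole/go-ole"
	"github.com/go-ole/go-ole/oleutil"
)

// runWMIWorker pins a goroutine to a single OS thread and calls
// CoInitializeEx exactly once for that thread, instead of once per query.
// The queries channel carries WQL strings.
func runWMIWorker(queries <-chan string) {
	// COM initialization is per-thread, so keep this goroutine on one thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	if err := ole.CoInitializeEx(0, ole.COINIT_MULTITHREADED); err != nil {
		log.Fatal(err)
	}
	defer ole.CoUninitialize() // uninitialize once, when the worker exits

	unknown, err := oleutil.CreateObject("WbemScripting.SWbemLocator")
	if err != nil {
		log.Fatal(err)
	}
	defer unknown.Release()

	locator, err := unknown.QueryInterface(ole.IID_IDispatch)
	if err != nil {
		log.Fatal(err)
	}
	defer locator.Release()

	serviceRaw, err := oleutil.CallMethod(locator, "ConnectServer")
	if err != nil {
		log.Fatal(err)
	}
	service := serviceRaw.ToIDispatch()
	defer service.Release()

	// Reuse the same initialized thread and WMI connection for every query.
	for wql := range queries {
		result, err := oleutil.CallMethod(service, "ExecQuery", wql)
		if err != nil {
			log.Printf("query failed: %v", err)
			continue
		}
		result.ToIDispatch().Release()
	}
}

func main() {
	queries := make(chan string, 1)
	queries <- "SELECT ProcessId, Name FROM Win32_Process" // example query
	close(queries)
	runWMIWorker(queries)
}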

@danielnelson Looks like Datadog had the same issue with their WMI sampler (https://github.com/DataDog/integrations-core/pull/3987), which points clearly to a Windows Server 2016 memory leak when CoInitialize is called for each WMI query.

After reviewing the telegraf code, it seems you rely on the win_pdh library for the actual Win32 calls, and I couldn't find a call to CoInitialize, so I'm not sure how to help further.

Any update, @danielnelson?

Can you retest with this build telegraf-1.15.0~d78dfac1_windows_amd64.zip?

Will do!

Looks like this has been resolved; the WMI leak is gone on Windows Server 2016!

48 hours running with 1.12.6:

[screenshot]

24 hours running with 1.15.0:

[screenshot]

thank you @danielnelson

What's the ETA on 1.15.0?

Great news, thanks for testing.

I expect 1.15.0 to be released sometime in the first half of July.
