Telegraf: Potential memory leak in Windows (v1.4.3)

Created on 17 Nov 2017 · 32 comments · Source: influxdata/telegraf


This week, we had an issue where the Telegraf process on at least 6 customer servers was constantly consuming 5 GB of RAM. Looking at the metrics Telegraf had collected, usage appears to have crept up gradually from the 1st of the month until memory used was sitting at 99%. Restarting the Telegraf service released the memory pressure, but I'm observing it at the moment in anticipation of it happening again.

Basically, I want to know what information I can provide to help tackle this if it reoccurs. Can I enable some verbose debugging and catch it in the act, or is this related to the metrics I'm collecting? There should be no change in the series being collected, though, so I would have expected memory usage to be steady.

Bug report

Relevant telegraf.conf:

Just collecting a config dump now. Will attach shortly...

System info:

Mostly Windows Server 2012 R2, with possibly one 2008 R2.

Steps to reproduce:

Run telegraf for a few weeks.


All 32 comments

Is this a new issue after an upgrade? If so, what version did you upgrade from?

It may be useful to start Telegraf with the --pprof-addr localhost:6060 flag; you can then browse to http://localhost:6060/debug/pprof/. It may be helpful to compare the files there when you first start Telegraf with how they look after the memory has increased.
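For example, assuming a config file at telegraf.conf, profiling can be enabled by starting Telegraf as:

telegraf --config telegraf.conf --pprof-addr localhost:6060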

You may want to enable the internal input and watch the internal_memstats measurement; this won't tell us the cause, but it may be useful for tracking.
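For reference, a minimal sketch of that config, assuming the stock inputs.internal plugin:

[[inputs.internal]]
  ## gather the Go runtime memory statistics that feed internal_memstats
  collect_memstats = true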

These servers never had Telegraf before; it's a new deployment.

I can see that the issue is happening again, so I'm going to do as you suggested and report back in a few days to see if I can find anything. Thanks.


Sorry for the delay. I've re-enabled Telegraf and added the debug flags on 0.0.0.0, and I'll try to scrape that info periodically. I've also enabled the internal stats.

Is there a specific value I should keep an eye out for?

I'm going to set up some alerts for high memory usage and get back to you in a couple of weeks.

I would focus on the heap page. Save the output of the heap page ~10 minutes after you first start up, and then again after a day once the memory has gone up.

If you have Go installed somewhere, you can try creating an image of the current memory use:

go tool pprof -png http://localhost:6060/debug/pprof/heap > out.png

Alright, I didn't manage to set up anything to scrape before I left on Friday, but I checked on it today.

Checking across some of my servers with Salt, some of them are up to 2 GB of memory used by Telegraf, and they all seem to have one thing in common: SQL. My SQL servers automatically get a .conf file which contains:

[[inputs.sqlserver]]
  servers = [ "Server=localhost;Port=1433;User Id=telegraf;Password=redacted;app name=telegraf;log=1;" ]

and as far as I know, these servers haven't had a user set up for Telegraf yet. Would the log=1 be storing SQL connection failures in memory?

I can send some of the /debug/pprof data across from one of the servers where it's up to 2GB RAM used if needed.

Here is the documentation for the log=1 bit of the connection string.

I'm not sure if this would affect memory usage; I would think it would only increase the amount of logging.

Please send me the pprof data when you can; hopefully it will contain a clue. It is probably not necessary to wait for it to get to 2 GB, so long as it is approximately double the normal memory usage or more.

Looks like the issue wasn't resolved after all. I'm running a single instance on a server with debugging on, and removing some configs, to see what's happening.

This is a bit old, but I'm experiencing a similar memory leak on Windows Server 2016. After a few weeks, Telegraf is using gigabytes of RAM.

Was there any resolution?

In my case I am running 1.6.1 with the MSSQL input and a few Windows OS metrics.

@etlweather Can you try running Telegraf with the additional --pprof-addr localhost:6060 command line option, and after it has run for long enough to see an increase in memory, collect http://localhost:6060/debug/pprof/goroutine?debug=1 and attach it here along with your Telegraf config file?

Telegraf started at around 15 MB of RAM; it's now up to 241 MB.

Attached is the output of goroutine and the config.

goroutine.txt

mssql.txt
telegraf.txt

Can you get http://localhost:6060/debug/pprof/goroutine?debug=2 as well? It turned out the first link I had you get doesn't contain the full list of goroutines.

I thought it was a bit small, but hey, what do I know about Go!

Here is the full dump. The count on the main page for goroutines has increased by about 3,000 since.

goroutine2.txt

Looks like there are lots of leaking goroutines for connections waiting to receive:

goroutine 18237 [chan receive, 879 minutes]:
database/sql.(*DB).connectionOpener(0xc042f8a460)
    /usr/local/go/src/database/sql/sql.go:871 +0x5a
created by database/sql.Open
    /usr/local/go/src/database/sql/sql.go:609 +0x1f5

Looking through the docs for github.com/denisenkom/go-mssqldb, I think we may need to use a NewConnector with a context when we open connections. However, a possible workaround might be to add a connection timeout to your DSN; could you try that out?

Server=192.168.1.10;Port=1433;User Id=<user>;Password=<pw>;app name=telegraf;log=1;connection timeout=15
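For reference, a minimal sketch of what using NewConnector with a context could look like; this is an assumed shape based on the go-mssqldb docs (the DSN and query are illustrative), not Telegraf's actual code:

package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	mssql "github.com/denisenkom/go-mssqldb"
)

func main() {
	dsn := "Server=192.168.1.10;Port=1433;User Id=user;Password=pw;app name=telegraf"

	// NewConnector parses the DSN once; sql.OpenDB builds a pool from it.
	connector, err := mssql.NewConnector(dsn)
	if err != nil {
		log.Fatal(err)
	}
	db := sql.OpenDB(connector)
	defer db.Close() // closing the pool stops its background goroutines

	// Bound the whole operation so a dead server cannot hang the collector.
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	var one int
	if err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
		log.Fatal(err)
	}
}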

Trying... after running for a few seconds, the goroutine count is 34. Will come back in a couple of hours with an update.

Still growing, from 13 MB to 80+ MB of RAM. The goroutine count is now 7054. Attached is the output of the debug logs.

goroutine.txt

Too bad. I think the next thing to check is whether this is fixed in the latest Telegraf release. In 1.8.3 we moved from a fork of the driver back to the main version, which has a fair number of bug fixes.

I should have tried that first... trying now. Started 1.8.3 with 14 MB of RAM or so. After a few seconds the goroutine count is at 55. Will come back in a couple of hours.

Already at 70 MB :disappointed:

Maybe I had some configuration error which was preventing the plugin from working (when I tried today, I found I didn't have a user in SQL Server for Telegraf; I don't know if I lost the user as I have been messing around...).

Anyway, right now with the current 1.9.3 version, I don't see this problem after a few hours of it running; it's been stable at between 20 and 23 MB of memory.

@absolutejam Could you take another look with 1.9.3?

We experienced the same with version 1.9.1.

As long as the sqlserver input is configured correctly and it can get the data, it seems to work OK and there is no memory pressure. When there is a problem with the SQL connection, we get the logs below (IPs redacted) and the memory keeps rising.

I have now run it for a while with 1.10.4 and that gives the same result: when misconfigured, the errors occur and the memory keeps rising.

2019-05-21T18:58:50Z I! ERROR: BeginRead failed read tcp [<IPV6 ip address>%Ethernet 2]:59491->[<IPV6 ip address>]:1433: wsarecv: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z I! ERROR: BeginRead failed read tcp <Local IPV4 address >:59487-><Local IPV4 address >:1433: wsarecv: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: read tcp <Local IPV4 address >:59487-><Local IPV4 address >:1433: wsarecv: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: read tcp [<IPV6 ip address>%Ethernet 2]:59491->[<IPV6 ip address>]:1433: wsarecv: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z I! Failed to send SqlBatch with write tcp [<IPV6 ip address>%Ethernet 2]:59490->[<IPV6 ip address>]:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: failed to send SQL Batch: write tcp [<IPV6 ip address>%Ethernet 2]:59490->[<IPV6 ip address>]:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z I! Failed to send SqlBatch with write tcp [<IPV6 ip address>%Ethernet 2]:59494->[<IPV6 ip address>]:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: failed to send SQL Batch: write tcp [<IPV6 ip address>%Ethernet 2]:59494->[<IPV6 ip address>]:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z I! Failed to send SqlBatch with write tcp <Local IPV4 address >:59486-><Local IPV4 address >:1433: wsasend: An existing connection was forcibly closed by the remote host.
2019-05-21T18:58:50Z E! [inputs.sqlserver]: Error in plugin: failed to send SQL Batch: write tcp <Local IPV4 address >:59486-><Local IPV4 address >:1433: wsasend: An existing connection was forcibly closed by the remote host.

I ran it with the --pprof-addr setting. This gist shows a "full goroutine stack dump" after 10 minutes of running. I have no experience in Go but I think it could help.
https://gist.github.com/arnodenuijl/7e0d31be5da6ac6f8e816b0fd8f9600e

Thanks, that should indeed be helpful. It looks like we will need to dig into Go's sql package to understand the creation of these goroutines and why they are not completing:

goroutine 5896 [select]:
database/sql.(*DB).connectionOpener(0xc003c22480, 0x26dbf40, 0xc000139b00)
    /usr/local/go/src/database/sql/sql.go:1001 +0xef
created by database/sql.OpenDB
    /usr/local/go/src/database/sql/sql.go:671 +0x164

goroutine 5897 [select]:
database/sql.(*DB).connectionResetter(0xc003c22480, 0x26dbf40, 0xc000139b00)
    /usr/local/go/src/database/sql/sql.go:1014 +0x102
created by database/sql.OpenDB
    /usr/local/go/src/database/sql/sql.go:672 +0x19a

Great. If I need to try or run something, I'd be happy to.

Could it be that the defer conn.Close() must come on line 154, before the ping, so that we also clean up if the ping fails?

https://github.com/influxdata/telegraf/blob/master/plugins/inputs/sqlserver/sqlserver.go#L160
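For illustration, the suspected leak and the proposed fix look roughly like this; this is a sketch of the pattern with hypothetical function names, not the verbatim plugin code:

import (
	"database/sql"

	_ "github.com/denisenkom/go-mssqldb" // registers the "mssql" driver
)

// Before: if Ping fails, the function returns before conn.Close() is
// deferred, so the *sql.DB and its background connectionOpener goroutine
// leak on every failed collection interval.
func gatherLeaky(connString string) error {
	conn, err := sql.Open("mssql", connString)
	if err != nil {
		return err
	}
	if err := conn.Ping(); err != nil {
		return err // leak: conn is never closed on this path
	}
	defer conn.Close()
	// ... run queries and collect metrics ...
	return nil
}

// After: deferring Close before the Ping guarantees cleanup on every
// return path, including a failed ping.
func gatherFixed(connString string) error {
	conn, err := sql.Open("mssql", connString)
	if err != nil {
		return err
	}
	defer conn.Close()
	if err := conn.Ping(); err != nil {
		return err
	}
	// ... run queries and collect metrics ...
	return nil
}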

That could be it; can you test it? I can make you a build if it would help, or it's pretty easy to set up Go as well.

I will test it today. It was indeed easy to set up Go

Putting the defer conn.Close() before the ping seems to resolve the issue. The number of goroutines stays at ca. 16 (without the change it quickly goes into the 1000s).
On the library's GitHub I asked whether the authors see any problems with this. I saw the same pattern in the sample (open -> ping -> defer) and I wonder if there is a specific reason for it.

How can we proceed with this?

Yes, looking at the code in question and the goroutine list, this is indeed the issue. Putting defer db.Close above the ping will fix it.

However: why is Telegraf apparently opening a new connection on every stat collection? Why are you creating a new connection pool, opening a connection from it, then closing it all down every time (every 30 seconds?)? This is a complete waste of resources. Also, don't use ping like this; it is a waste of time and additional resources. Just try the query, and if it works, it works. All Ping does is add yet another round trip doing a pointless select 1;.

You want three phases to working with this. Init: create the connection pool; don't use ping, as that doesn't help you in this context. Then periodically use this connection pool (sql.DB) and call Query. Also, naming the variable conn in conn, err := sql.Open(...) is misleading at best. This is NOT A CONNECTION. It is a connection pool, and at open time no network connection has been established yet.

This will be an issue for any database stats, not just SQL Server. This will be an issue for any OS, not just Windows.
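A minimal sketch of that three-phase pattern, with illustrative names (this is not Telegraf's actual plugin code):

package main

import (
	"database/sql"

	_ "github.com/denisenkom/go-mssqldb" // registers the "mssql" driver
)

// sqlServerInput is a hypothetical plugin-like type holding the pool.
type sqlServerInput struct {
	dsn string
	db  *sql.DB // a pool, created once and reused across intervals
}

// Init phase: create the pool once. sql.Open does not dial the server,
// it only prepares the pool, so no Ping is needed here.
func (s *sqlServerInput) open() error {
	db, err := sql.Open("mssql", s.dsn)
	if err != nil {
		return err
	}
	s.db = db
	return nil
}

// Gather phase: reuse the pool on every interval. If the server is
// unreachable, Query itself returns the error; no Ping round trip.
func (s *sqlServerInput) gather(query string) error {
	rows, err := s.db.Query(query)
	if err != nil {
		return err
	}
	defer rows.Close()
	// ... scan rows into metrics ...
	return rows.Err()
}

// Shutdown phase: close the pool once when the plugin stops.
func (s *sqlServerInput) close() error {
	return s.db.Close()
}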

@kardianos All good points; it would be good to make all of those changes. We could look at the postgresql plugins as an example for using the connection pool.

For this issue, though, let's only put the defer in the right spot and optionally remove the unneeded ping. @arnodenuijl Would you be able to open a PR for that change?

I created a pull request. It's the second of my life, so if I need to do anything differently, just say so.

Looks perfect, thanks!

I opened issue #5898 for the connection pool work. It isn't something that I am going to work on right now, but it is good to have it tracked.
