InfluxDB: File descriptor leak

Created on 8 Aug 2017 · 10 comments · Source: influxdata/influxdb

Bug report

__System info:__ [Include InfluxDB version, operating system name, and other relevant details]
InfluxDB 1.2.4
Linux

__Steps to reproduce:__

  1. run influxdb
  2. ???

I am working with a large number of retention policies (approximately 250), but only 6 retention policies are being written to at any given time, and only 1 is read from at any given time.

__Expected behavior:__ [What you expected to happen]
Work normally

__Actual behavior:__ [What actually happened]
Runs out of file descriptors and starts throwing errors.
Such as

open /mnt/influxdb/influxdb2/meta/meta.dbtmp: too many open files

and

Aug 07 20:06:47 whistler influxd[25058]: [I] 2017-08-08T00:06:47Z Snapshot for path /mnt/influxdb/influxdb2/data/market/s0069/1382 written in 6.906µs engine=tsm1
Aug 07 20:06:47 whistler influxd[25058]: [I] 2017-08-08T00:06:47Z error writing snapshot: error opening new segment file for wal (1): invalid argument engine=tsm1

__Additional info:__ [Include gist of relevant config, logs, etc.]
lsof output can be found here: https://gist.github.com/phemmer/4e7767fa33ff3470322a44f0cbebf964

This seems very likely related to having so many retention policies, but it doesn't seem like InfluxDB should be opening every single one of them, or holding so many files open because of it.

All 10 comments

What does ulimit -n report?

It should be 1024 (I don't have access to the machine right now, but that was the FD count when the errors started occurring).

That's really low. Our packages default the open file limit to 65536. See https://github.com/influxdata/influxdb/blob/f7c686dbefbf116f79493739f3a9f6e26ef351b5/scripts/init.sh#L53
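For anyone running influxd outside the packaged scripts, a minimal sketch of raising the limit before starting the daemon (the 65536 value mirrors the package default; the config path is just the conventional package location and may differ on your system):

```sh
# Raise the soft open-file limit for this shell (and its children) to the package default,
# then start influxd; adjust the config path to your installation.
ulimit -n 65536
influxd -config /etc/influxdb/influxdb.conf
```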

ulimit -n will only give you the per-process soft limit, which a process is allowed to raise up to the hard limit (ulimit -Hn) if it needs to. Using lsof to count actual open files is hard because it also shows a lot of other things, which are difficult to filter out. It's better to use "ls -la /proc/$pid/fd | wc -l", which also makes it easy to see which files are open.
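For example, a quick way to do that (assuming the process is named influxd and pgrep is available):

```sh
# Find the influxd PID and count its open file descriptors.
pid=$(pgrep -x influxd)
ls -la /proc/$pid/fd | wc -l
# Listing the entries shows which files (TSM files, WAL segments, sockets, ...) are held open.
ls -l /proc/$pid/fd
```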

@jwilder So are you saying it's expected to open so many files even when the RPs are not in use? What if I wanted to have several thousand retention policies / databases?

@phemmer I'm saying your system's open file limit is not configured correctly to run the database. I'm assuming you are not running one of our packages, which increase the limit. The database uses file handles for many things, and the defaults are too low for typical use. If you have thousands of RPs/DBs, you will likely need to adjust other settings as well, since all DBs/RPs consume resources whether they are in use or not.

I can raise the limit, and other limits; that's not my concern here. My concern is that this doesn't scale. From this, it seems I can expect 4 open files per retention policy. If I wanted a hundred thousand retention policies (and yes, I do), that's 400,000 files open at the same time.

You have to set the limit to what is appropriate for your system. The limit is just a kernel setting to prevent errant user-land programs from consuming too many resources on the system. 65k is a better default for most users, so that is what we use. That could still be too low in some cases, so bumping it up to 512k or higher may be necessary.
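As an illustration only (assuming a non-systemd setup where influxd runs as an "influxdb" user; your user name and limit value may differ), the bump could look like this via /etc/security/limits.conf, with the running process checked afterwards:

```sh
# Entries to add to /etc/security/limits.conf (524288 is the 512k suggestion above):
#   influxdb  soft  nofile  524288
#   influxdb  hard  nofile  524288
# After restarting influxd, confirm the limit the kernel actually applied:
grep 'Max open files' /proc/$(pgrep -x influxd)/limits
```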

100k retention policies or databases on a single system is likely not going to work well. I would not recommend that setup personally. While there isn't a fixed limit on the number of DBs and RPs in the system, there is some overhead incurred for each one. After a certain point, that overhead can affect the system. If you really need that many DBs/RPs, you may need to look into running a cluster or partitioning them over multiple servers.

@phemmer, did you find a workaround for this issue, instead of bumping the ulimit?

No, I switched to a different database that scales better.
