What happened
My filebeat ran into "too many open files" after some log rotations. It happens. But once it hit this condition, it also started reporting errors like
2019-05-06T16:57:24.902+0200 ERROR pipeline/output.go:100 Failed to connect to backoff(async(tcp://logstash.test.local:5044)): lookup logstash.test.local: too many open files
2019-05-06T16:57:24.903+0200 INFO pipeline/output.go:93 Attempting to reconnect to backoff(async(tcp://logstash.test.local:5044)) with 6351 reconnect attempt(s)
2019-05-06T16:57:24.903+0200 WARN transport/tcp.go:53 DNS lookup failure "logstash.test.local": lookup logstash.test.local: too many open files
and stopped pushing any data upstream. So old logs stayed unpushed, new logs kept appearing, and there were more and more files, and so on.
What could be done better
In my opinion filebeat could handle this situation better: after hitting this specific error (too many open files) on an "upstream-related" operation (DNS query, connection to the upstream), it should
close some log files to make room for those operations. This way the backlog would be drained over time instead of piling up.
Hi @Mekk,
Thanks for your proposal.
It'd be good to rule out a file descriptor leak here: do you actually have that many files being harvested by filebeat?
What do you do when this happens? Do you increase the open files limit or restart the filebeat process?
Oh, I see more info in https://github.com/elastic/beats/issues/12069 :slightly_smiling_face:
Indeed (I hit the implicit 1024 file limit, while my harvester limit was bigger). I do indeed have that many files (apps which use a separate log per process, with PIDs and dates in the log names).
BTW: restarting doesn't help. After a restart filebeat opens 1024 (or whatever the limit is) files and starts reporting DNS lookup failures again. Tested.
Regarding workarounds: I didn't test it, but I bet that an alternative (to raising the file limit) is to lower the harvester_limit values, so that their sum stays noticeably below the descriptor limit.
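Something along these lines (a rough sketch, assuming the default 1024-descriptor limit; the paths and the particular numbers are purely illustrative, only `paths` and `harvester_limit` are real options):

```yaml
filebeat.inputs:
  # Hypothetical inputs; the point is that the harvester_limit values sum
  # to well below the 1024 descriptor limit, leaving headroom for the
  # registry, filebeat's own logging and the upstream sockets.
  - type: log
    paths: ["/var/log/app-a/*.log"]
    harvester_limit: 400
  - type: log
    paths: ["/var/log/app-b/*.log"]
    harvester_limit: 400
  - type: log
    paths: ["/var/log/app-c/*.log"]
    harvester_limit: 100
```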
And to put things in context: in general it is my configuration error (I didn't guarantee that the sum of my harvester_limits stays noticeably below the system-imposed descriptor limit). In my case I simply didn't know that my descriptor limit was 1024 (I expected 65536, as limits.conf and ulimit declared).
Still, even without such a mistake, keeping those values in the proper relation is unobvious at best (who remembers, while adding the 47th input, to sum up the harvester_limits and reduce them all a bit if the total is too big; who is aware that systemd imposes 1024 even if limits.conf says much more, etc.). So I suggest that filebeat could be much friendlier in such a case, and keep running (and gradually self-heal) instead of breaking completely.
(To a degree this issue is about self-regulation instead of manual harvester_limit regulation; if it were implemented, harvester_limit might turn out to be less necessary and function mostly as a cap for less important log groups.)
Ok, I see, thanks for the details.
I am not sure that keeping a hard limit on the pool of open files would be so good: keeping the files open lets filebeat collect logs as soon as they are written, and prevents it from continuously closing and reopening files to check for new contents. But I agree that this would be better than reaching a point of continuous failures.
There are some options to control how long files are kept open. For example, if your applications are not logging continuously, one thing you can try is to reduce the close_inactive time, so files are closed earlier if no new logs are being harvested from them. This setting defaults to 5 minutes.
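For example, something like this (a minimal sketch with a hypothetical path; close_inactive is the only setting being changed):

```yaml
filebeat.inputs:
  - type: log
    paths: ["/var/log/myapp/*.log"]   # hypothetical path
    # Close the harvester if the file has produced no new lines for
    # 2 minutes (default is 5m); it is picked up again once it grows.
    close_inactive: 2m
```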
I don't suggest a hard limit. I suggest handling this very specific error: once DNS resolution, opening a socket to the upstream, or a similar operation returns errno=24 (or whatever the Go wrapper for this condition is), we know that we have exceeded the limit, so we must close something to get the most important descriptors working.
Another interesting failure case:
2019-05-09T09:52:10.945+0200 ERROR registrar/registrar.go:416 Failed to create tempfile (/var/lib/filebeat/registry/filebeat/data.json.new) for writing: open /var/lib/filebeat/registry/filebeat/data.json.new: too many open files
this descriptor seems crucial too…
Yes, lots of things can fail if the open files limit is reached; this is why I wrote about the possibility of setting a limit on the number of open harvesters. I find it easier to limit that than to handle all the cases where this can be a problem.
@Mekk did you try to reduce close_inactive time to see if it mitigates your problems a little bit?
@Mekk there is also a harvester_limit option that allows limiting the number of simultaneously open files per log input; it is disabled by default.
I know about harvester_limit and use it (see my comment above: https://github.com/elastic/beats/issues/12068#issuecomment-490036185). I can't set close_inactive too low, as I have some "do sth twice a day" components. In general I know that I can configure this better, and I do.
This bug is about improving the behaviour when somebody misconfigures.
And as I already said: it is fairly difficult to keep the harvester_limit-to-descriptor-limit relation under control when there are a lot of inputs. Since filebeat knows that it reached the limit (this is a specific, distinguishable error), it could try to manage the problem when it happens (by prioritizing "push" sockets over descriptors that "pull" data; in practice by closing something like 10-20 watching descriptors and keeping that global cap afterwards). After all, it doesn't make sense to keep watching files if every attempt to forward data upstream fails.
Just faced the problem again, on filebeat 7.5.1
If I understood your request correctly, you want Filebeat to close files if it cannot open any more files, e.g. if it encounters backpressure. But by adding this functionality we would risk users losing data without letting them opt out. We cannot add heuristics to Filebeat to select which file to close. What if the selected file is deleted before Filebeat is able to process more events? Unfortunately, we cannot retrieve that lost file for the user.
Right now we are trying to get all data from all configured paths and let users explicitly configure close_ options. This way, they are aware of potential data loss. For closing opened files we provide the following options:
- close_removed
- close_renamed
- close_inactive
- close_eof
- close_timeout

For your case, I suggest you set close_timeout, so each harvester has a predefined lifetime. But keep in mind that it might lead to data loss. (See more: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-log.html#filebeat-input-log-close-timeout)
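For example, something like this (a minimal sketch; the path and the 10m value are only placeholders):

```yaml
filebeat.inputs:
  - type: log
    paths: ["/var/log/myapp/*.log"]   # hypothetical path
    # Close every harvester after at most 10 minutes, even if the file
    # is still growing, so its descriptor is freed for other work.
    close_timeout: 10m
```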
Could you please read my description with some care?
In the state I describe you are losing all the data. Absolutely everything. Filebeat is unable to push data upstream and is unable to save its own state.
Is it possible to reopen the issue somehow? It is clear that zube, who closed it, misunderstood it.
It's not about heuristically choosing what to push, and what not.
It is about filebeat being unable to (a) resolve the DNS name of the upstream, (b) open a socket to the upstream, and (c) save its own registry (due to the latter it can't even be reasonably restarted), because it has starved itself of the ability to open a file descriptor.
It's not about heuristically choosing what to push, and what not.
I am sorry if I did not express myself clearly. That's not what I meant. I meant choosing what files to close.
If you set the close_timeout option, Filebeat closes input files after the configured interval, which decreases the number of open files. This lets the publisher pipeline do its thing: connect to the output and send events.
If you want to be more aggressive you can set scan_frequency to a higher value than close_timeout. In this case, there will be intervals when no harvesters are running. For example, you set close_timeout to 3 minutes and scan_frequency to 5 minutes. This gives Filebeat 2 minutes to open output sockets and forward events without reading any input files. But these settings lead to potential data loss as well.
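Sketched as configuration, with the values from the example above (the path is hypothetical):

```yaml
filebeat.inputs:
  - type: log
    paths: ["/var/log/myapp/*.log"]   # hypothetical path
    # Each harvester is force-closed after 3 minutes...
    close_timeout: 3m
    # ...and the input only looks for files to (re)open every 5 minutes,
    # leaving roughly a 2-minute window with no input files open.
    scan_frequency: 5m
```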
You need to adjust these options to fit your use case and infrastructure and to minimize potential data loss.
Let me know if these options are working for you.
I think that the decision „which files to close” is exactly the same as the decision „which files not to open yet” when harvester_limit is reached. It is exactly the same problem: due to limits we can't handle all the logs, so we handle as many as we can. In fact, the whole issue could probably be resolved by artificially lowering the harvester_limits when filebeat faces the descriptor problem.
The proper solution is of course to always set harvester_limit so that the descriptor limit is not reached. But, as I said a few times, this is error-prone, so some sensible behaviour when it happens to be configured too loosely would really be helpful.
Once more, here is the situation filebeat can enter as it stands now. Imagine we have the default ulimit of 1024 descriptors (the systemd default, which filebeat packages preserve) and it happens that we have 1030 files in the log directory:
- filebeat opens log files until it exhausts the descriptor limit (so a few files stay unharvested anyway),
- DNS lookups and connections to the upstream start failing with "too many open files",
- writing the registry (data.json.new) fails with the same error.
So filebeat keeps watching all those files, reporting errors about the upstream connection and errors about the registry. And it stays like that forever, not doing anything except reporting errors and pumping some data from logs into internal memory. Moreover, even a restart is troublesome, as it can't save the registry to preserve its state.
What I suggest is that if we detect this condition (an attempt to save the registry or to connect to the upstream fails with the "too many open files" error), filebeat should close a few logs to be able to start working again. Keeping them open won't help with anything, after all. And those files are not abandoned; they will be handled later, when something else is closed (due to inactivity, file removal, etc.) – again, just as it is with logs left unmonitored due to harvester_limit.
Another, less efficient but entirely feasible, option would be to close _all_ logs in such a case, internally reduce all harvester_limits a bit, and reinitialize, so logs are opened again as usual.
@Mekk thanks for your thoughtful explanations, but the more I think about this the more I believe that implementing a solution like this one would add quite a bit of complexity and provide little benefit. Let me try to explain.
Complexity added:
- We would need to identify what parts of filebeat are "critical" to its operation.
- We would need to add additional error handling to detect a very specific error.
- harvester_limit: this setting is per input, so in Filebeats configured with many inputs we would need to decide which input to throttle, or whether to throttle all of them.

Little benefit:
Please notice that the current situation is not so bad. In the worst case a user would see that Filebeat is not working, they would check the logs, and they would quickly see many "too many open files" errors. Then they could decide what to do, as they would with any other software that hits some limit. Maybe increasing the limits is enough (I actually agree that increasing the default limits can be a good thing to do: https://github.com/elastic/beats/issues/12069). If increasing the limits is not possible, they can try which of the many options Filebeat offers works better for their use case, such as the options @kvch and I have proposed.
I may be completely wrong, or missing something :slightly_smiling_face: but I really think that this is a problem that, when it happens, can be easily solved by configuration, and that it is really hard (if even possible) to solve well by changing the default Filebeat behaviour.
a user would see that Filebeat is not working, they will check the logs and they will quickly see many "too many open files" errors
In a setup with 74 filebeats working on various machines it may not be immediately obvious that 3 of them have stopped working. Especially when ELK is used mostly to search for problems and alert about them (the query didn't find errors, the dashboard is green…). And the logs need not be that accessible if they are usually looked at via Kibana and filebeat has stopped shipping its own logs too. With normal setups the situation is usually not that bad, but think also about filebeats running as sidecar containers in kubernetes pods.
If I may make a somewhat wider remark: it seems to me that counting on admins to spot problems in log files works worse and worse every year. In fact, this is exactly what makes ELK so needed ;-)
We would need to identify what parts of filebeat are "critical" to its operation
I'd say simply anything except reading. From the outside user's point of view filebeat seems to have three main areas of operation: reading (handling incoming data, whatever the source), writing (pushing data out, whatever the destination) and self-management (reading configuration, writing state, writing logs, etc.). Prioritizing writing and self-management over reading seems a rather natural back-pressure technique; after all, it doesn't make much sense to grab data if it can't be handled.
Well, I understand that it is too hard to implement directly, but maybe think about this as a general problem for the future.
we would need to add additional error handling to detect a very specific error
…
Another approach to the whole story could be to do something very big: if there is a persistent problem with state management (whatever the problem is), perform a big reset (close everything and start from scratch, and while starting, open the more important things first). From my registry-related discussions I understand that having up-to-date state is considered very important by some users (there is a reason that by default the state is flushed every time something happens, even though that brings a huge CPU cost). So… filebeat finds itself unable to save state. And again. And again. And again. And again. And nothing happens?
The state problem is realistic; in my cases the failed attempt to write registry/data.json.new was most often the first failing operation. It could also happen that filebeat was more or less working (at least for some time) apart from that (if an upstream connection was already active and could be reused for a while). So it is possible that filebeat will advance various logs noticeably (using the already open upstream connection) while it is unable to save the state to data.json. After the unavoidable restart, this progress is likely to be lost. Is there at least some guarantee that the _final_ state save will be done after the inputs are closed?
As a side note: https://github.com/elastic/beats/issues/12069 is a trap. You upgrade from an init.d-based system to a systemd-based one. You have evidence of earlier tuning, so you remember to migrate limits.conf, sysctls and such. You keep using the packaged filebeat. All of a sudden you get hit by the 1024-descriptor limit. That's not filebeat's fault, but…
I'm also facing this issue and, like @Mekk, have had trouble noticing that one log source has stopped transmitting when the rest of our cluster is working well. While the concerns raised by @jsoriano are quite valid, and more intelligent configuration may certainly be required in some circumstances, holding back a small pool of file handles (5-10?) so that filebeat can always record the registry and open sockets for upstream transmission would seem to be a sensible default.
As it stands, for me right now, after restarting filebeat I'm immediately back in the same state because of the large number of unread files. I'm going to need to adjust my configuration and restart to apply some of these recommendations, which will then, because I can't save the registry, result in a round of retransmission of the same data. A small pool of reserved file handles would have prevented this, much as the harvester_limit setting of about 1000 I'm about to apply should.