Filebeat currently lacks an option to clean up the registry file. The goal of this issue is to answer three questions:
I suggest introducing two new config options, clean_older and clean_removed, with the following behaviour and defaults. This is option 4 below.
| Variable | Description | Default |
| --- | --- | --- |
| close_older | Closes the file handler for files which are older than x | 1h |
| clean_removed | Cleans the registry of files which are not found anymore, after x | 24h |
| ignore_older | Does not crawl files which are older than ignore_older. Still keeps the file state and moves the file offset to the end of the file. | infinity |
| clean_older | Removes files from the registry which are older than x | infinity |
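For illustration, a prospector configuration using all four options could look like this (the values are made up for the example and are not the proposed defaults):

close_older: 1h       # close the file handler after 1h without changes
clean_removed: 24h    # remove the state of deleted files after 24h
ignore_older: 48h     # stop crawling files older than 48h, state is kept
clean_older: 72h      # remove the state of files older than 72h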
Better name suggestions for the new variables are welcome.
Below is the current behaviour and the different options described in more detail.
The current behaviour is as follows:
File states for files which fall under ignore_older are set to the end of the file and persisted. This brings the problem that the registry is never cleaned up.
One option is to use ignore_older to also clean up the state. This would have the following consequences:
The introduction of a clean_older variable would allow setting a time after which the registry should be cleaned up. In case ignore_older keeps the same behaviour as now, it requires that clean_older is >= ignore_older, as otherwise the two would get into a race condition. As ignore_older is set by default to infinity, it would also mean that clean_older is disabled by default. If clean_older is enabled with a time, ignore_older would also have to be set.
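As a sketch of that constraint (values are made up):

ignore_older: 24h   # files older than 24h are ignored, state is kept
clean_older: 48h    # must be >= ignore_older to avoid the race condition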
Introducing a variable clean_removed would allow enabling an option so that only files which disappear from disk are cleaned up. clean_removed could either be a bool, so files that disappear are removed directly, or it could be a duration after which the files are removed. This means all files which still exist will be kept in the registry.
In case clean_removed is a duration, it would require clean_removed >= close_older, or force_close_files to be enabled, to make sure the files are closed.
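For example, assuming clean_removed takes a duration (hypothetical values):

close_older: 1h      # the file handler is closed after 1h without new content
clean_removed: 2h    # must be >= close_older so the file is closed before its state is removed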
Option 4 would combine all these options. This is my preferred option, as it gives full flexibility and keeps the behaviour of the existing variables.
@guyboertje Please have a look
Considering ignore_older virtually removes a file from the prospector (scanner), isn't close_older = ignore_older + clean_removed? On the other hand, having close_older allows for better fine-tuning.
Discussed this briefly with @ruflin. force_close_files closes the file handle as soon as the file name changes (or the file is rotated). It can be helpful to have something like "force_close_files (for rotated files) after N hours" to give filebeat some time to catch up on processing bursts in traffic after a file name change.
Based on some internal discussions and community feedback, it has become clear that force_close_files is not fully clear in what it does and how it behaves. Contributing to this is that our docs in config.full.yml are outdated, and the documentation itself is wrong.
For a better understanding, I will first try to elaborate on how filebeat works, which should make it easier to discuss the different options.
On startup, filebeat creates one prospector for each prospector configuration. These prospectors exist for the full runtime of filebeat, and their task is to make sure harvesters are started for each file in paths. The prospector stores the state for each file it started a harvester for. As soon as a harvester is started, the only communication channel between harvester and prospector is that the harvester reports back when it finished its "task". This normally means reaching close_older or hitting an error. As long as a harvester didn't update its state to "Finished", no harvester will be started for the same file. A harvester itself does not see a change if a file is renamed and stays open, as the identifier stays the same.
The force_close_files option is completely implemented on the harvester side. Normally a harvester just keeps reading the file until the end of the file is reached. Then the harvester keeps backing off until close_older is reached and the file is closed. In the case that force_close_files is enabled, two things happen: the file is closed as soon as it is removed, and it is closed as soon as it is renamed.
force_close_files could potentially be split up into two parts:

- close_file_on_removal
- close_file_on_rename

Be aware that enabling force_close_files can also have a performance hit, as the file stats are read and compared for every line.
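If it were split up, the resulting config might look like this (the option names are the proposed ones from above, not existing settings):

close_file_on_removal: true   # close the file handler as soon as the file is removed
close_file_on_rename: true    # close the file handler as soon as the file is renamed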
Taking a second look at all the configuration options, they seem to apply on 3 layers with the following purposes:
The options below do not necessarily represent the current behaviour, but what I would suggest it to be.
That means for the harvester we have the following config options:
- close_renamed: Closes the file handler if the file is renamed (rotated?)
- close_removed: Closes the file handler if the file is removed
- close_older: Closes the file handler if the harvester finished reading and the file mod time is older than the given duration
- force_close_older / drop_older: Closes the file if ModTime() is older than force_close_older, even if reading didn't finish. force_close_older == ignore_older, which means the file is ignored afterwards.
- close_eof: Closes the file if EOF is reached

By default, backoff applies when EOF is reached: the file handler is kept open, but the harvester sleeps for the backoff time. If a file is closed, the harvester is stopped, and a new harvester for the same file will be started again by the prospector after scan_frequency. Data is only lost in case a rotated file is not in the path pattern to be found again by the prospector, or if the file is removed.
Currently close_renamed + close_removed = force_close_files
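Expressed as config, that equivalence would look like this (a sketch using the option names suggested above):

# force_close_files: true would be equivalent to:
close_renamed: true
close_removed: true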
Closing files very often can have a performance impact. One reason is that additional file metadata has to be checked all the time, and it diverges from "near real time": if a file is closed, it will only be picked up again after scan_frequency.
- ignore_older: Do not start new harvesters for files older than this duration
- scan_frequency: How often files are checked for added content or newly appeared files, to start a new harvester

As long as a harvester is open for a file, ignore_older will not have any effect. It will only be taken into account when a harvester for a file has finished.
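As config, the prospector level could look like this (hypothetical values):

ignore_older: 24h     # do not start harvesters for files not modified for 24h
scan_frequency: 10s   # check every 10s for new files or newly added content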
The registry persists the state of the files. The registry is only used when filebeat is (re)started.
It would be possible to move these config options into the prospector, meaning they could be set per prospector but would still apply on the registrar level.
- clean_older: Removes files from the registry which were not modified for the given duration. This means the file state is lost.
- clean_removed: Removes the state for files for which a state exists in the registry but which currently cannot be found anymore. In case the file shows up again at a later stage, its state is lost. clean_removed could also be a duration, but for simplicity we should just enable / disable it.

The above options are especially important in 2 cases: when the registry file grows too big, and when inodes are reused.
Filebeat has the principle to send each log line at least once. Some of the configuration changes, especially in combination, can lead to data loss. Here are some more details on when data could be lost.
Harvester options: In general, closing a file handler normally means no data loss, as the file is picked up and scanned again after scan_frequency. A few exceptions apply here that can lead to data loss / a file not being completely read. Which config option is used for closing the file does not really matter.
- The file is removed or rotated out of the path pattern before the prospector picks it up again after scan_frequency. The lines which were not read so far will be lost.

Combinations: There are several combinations which can lead to data loss. Some of these combinations can be intended, but it is important to understand the consequences:
- clean_older == ignore_older: the states are removed as soon as files start to be ignored.
- clean_older < close_*: File states will be removed before a state is persisted.
- clean_removed: If a file system disappears during a scan, all states will be removed.

The suggested defaults are:

close_renamed: false
close_removed: false
close_older: 1h
force_close_older / drop_older: 0 (disabled)
close_eof: false
ignore_older: 0 (disabled)
scan_frequency: 10s
clean_older: 0 (disabled)
clean_removed: false

I will now start implementing the options mentioned above. This comment is to track the implementation. Renaming of options is still possible at a later stage:
Config options:
- close_renamed: false (https://github.com/elastic/beats/pull/1909)
- close_removed: false (https://github.com/elastic/beats/pull/1909)
- close_older: 1h
- close_ttl / force_close_older / drop_older: 0 (disabled) (https://github.com/elastic/beats/pull/1926)
- close_eof: false (https://github.com/elastic/beats/pull/1914)
- ignore_older: 0 (disabled)
- scan_frequency: 10s
- clean_older: 0 (disabled) (https://github.com/elastic/beats/pull/1915)
- clean_removed: false (https://github.com/elastic/beats/pull/1922)

Add docs:
During the implementation of the clean_* options, I realised that potentially one additional variable is needed: clean_frequency. Currently, on the prospector level, scan_frequency is used for these checks, as all states are checked every time a scan is run. But on the registrar level it is not clear when this cleanup happens. We could just reuse scan_frequency here, but that would again introduce 2 meanings for one variable. I added clean_frequency above for the implementation.
After discussing clean_frequency in more detail, we decided that the cleanup will happen when the registry is written.
This makes an additional configuration option obsolete. It has one side effect: as long as no new events are harvested by filebeat, the registry will not be cleaned up, as the registry file is never written.
Starting to implement force_close_older / drop_older, I realised the naming should also have the close_ prefix to be consistent with all the other harvester closing options. As this is essentially a time to live for a harvester, I plan to name it close_ttl.
The clean_frequency makes sense to me.
I want to suggest a small tweak to the configuration options and see if the logic makes sense as a result. Instead of having two different settings with relative times in them (close_older, drop_older), what if you had just one time-based setting and then a boolean option which can determine whether the harvester comes to a hard stop even if the file isn't finished...?
I also want to offer a suggestion to rename these as promised. After rereading this all a few times I realized that we should probably structure the documentation with these three big concepts as headers and pieces of a large diagram (prospector, harvester, registrar). And further to that end, the options which apply to those logical pieces should be named as such. So, here's a set of names which should be easily mapped to yours. Note that I have taken the liberty of incorporating the above suggestion in this list of names.
Yes, these names are long but I believe that they assist in self documenting the behavior which users are unlikely to be reading about on a routine basis. They will go to the docs, understand it once and come back later - hopefully these names in their config files will be clear to them without reading the docs every time.
harvester_close_on_renamed
harvester_close_on_removed
harvester_close_on_eof
harvester_timeout
harvester_close_on_timeout
prospector_scan_frequency
prospector_ignore_older
registrar_clean_frequency
registrar_clean_older
registrar_clean_removed
[edit - I dropped 'force' from the harvester_close_on_timeout based on your last comment, which I missed]
WDYT about putting the configs under a namespace?
harvester:
  close_on_renamed
  ...
About your boolean option for close_older and close_on_timeout:
- close_older must be <= close_on_timeout. In your case the two would always be identical.
- One could set close_older to 10s to make sure file handlers are always closed very fast, but close_on_timeout to 1min to close files which couldn't finish fast enough.

Hmm, after seeing all these options, I wonder if it's a good idea to expose them all to the user. I mean, the number of possible combinations is huge, and it sounds like many possible combinations can result in subtle data loss.
I'm not quite sure I get the motivation to make all of these configurable, is it for completeness or for handling corner cases?
It is for handling corner cases, and to not have config options with multiple meanings. I would keep the same defaults as we have now, so there is no data loss by default. For the harvester, the corner cases are that people want the file handler to be closed faster, but for different reasons. For the registrar, it solves the problems of the registry growing too big and of inode reuse.
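As an example (hypothetical values), the registry of a system with many short-lived log files could be kept small with:

clean_removed: true   # drop the state of files that disappeared from disk
clean_older: 72h      # drop the state of files not modified for 3 days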
Ok, just worried that we might be going to the other extreme now :-). Perhaps we can go through the options in today's meeting; I'm not sure I get the difference between close_older and close_ttl, for example.
Based on the inputs from @brandonmensing, a suggestion for the config file:
Before:
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/*.log
  exclude_lines: ["^DBG"]
  include_lines: ["^ERR", "^WARN"]
  exclude_files: [".gz$"]
  ignore_older: 0
  close_older: 1h
  scan_frequency: 10s
  harvester_buffer_size: 16384
  max_bytes: 10485760
  tail_files: false
  backoff: 1s
  max_backoff: 10s
  backoff_factor: 2
  close_renamed: false
  close_removed: false
  close_eof: false
  clean_older: 0
  clean_removed: false

filebeat.spool_size: 2048
filebeat.idle_timeout: 5s
filebeat.registry_file: registry
After:
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/*.log
  exclude_files: [".gz$"]
  ignore_older: 0
  scan_frequency: 10s
  max_bytes: 10485760
  state:
    clean_older: 0
    clean_removed: false
  harvester:
    tail_files: false
    close_renamed: false
    close_removed: false
    close_eof: false
    close_older: 1h
    backoff: 1s
    max_backoff: 10s
    backoff_factor: 2
    buffer_size: 16384
    exclude_lines: ["^DBG"]
    include_lines: ["^ERR", "^WARN"]

filebeat.registry_file: registry
filebeat.spooler:
  size: 2048
  idle_timeout: 5s
I removed the options that would stay the same.
For the harvester it could also be:
filebeat.prospectors:
- input_type: log
  harvester:
    tail_files: false
    close:
      renamed: false
      removed: false
      eof: false
      older: 1h
    backoff:
      duration: 1s
      max: 10s
      factor: 2
@brandonmensing I left out the _on_ part in the config options, as it felt more natural to me: the closing, for example, happens after the file was renamed, not during the rename. Happy to discuss further.
I left out close_timeout above, but I think timeout is definitely less abstract than ttl. I'm curious if there is perhaps something even more descriptive, like close_after or other alternatives?
I just realised the above config examples had a mistake in them: the state clean variables are part of each prospector, as each prospector can have different values. I changed it above.
@brandonmensing I was wondering if people understand state better than registry? See the example above.
One more: more accurate than close_older would probably be close_inactive?
As a note: in the general discussion, the idea came up to change prospectors to inputs and harvesters to readers.
Quick summary of the conversations yesterday:
Close configurations:
close_renamed: false
close_removed: false
close_eof: false
close_idle: 1h
close_timeout: 0
close_older will be renamed to close_idle, as this describes better that it does not depend on the file mod time but on an idle harvester.
Clean configurations:
clean_idle
clean_removed
Same argument for clean_idle as for close_idle.
Instead of nesting the options under state and harvester, we will use comments to visualise the nesting. This keeps the variable names short, requires no additional indentation, and keeps a new name "harvester" out of the config file.
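For illustration, the comment-based grouping could look like this (a sketch using the names and defaults from the summary above, not the final config file):

# Close configurations
close_renamed: false
close_removed: false
close_eof: false
close_idle: 1h
close_timeout: 0

# Clean configurations
clean_idle: 0
clean_removed: false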
max_bytes and buffer_size were also discussed, as these could be improved, but this can be done at a later stage. Cleaning up the spooler options would be nice, but is not critical.
Closing as all changes were implemented. Small follow up changes are tracked in https://github.com/elastic/beats/issues/2012
I would like to have a word about the change of the terminology "prospectors" and "harvesters" to "inputs" and "readers".
For some time now, I have been under the impression that there is a general trend in software that strives to impoverish the vocabulary users are exposed to, down to the most "Simple English" possible.
Perhaps it comes from the scientist's quest to find perfectly abstract entities: the deepest, most atomic, purest ones, the most irreducible concepts, which would have no meaning left if you tried to narrow down their meaning just a lil' bit.
But that doesn't justify sticking to a terminology that stays far, far away and disconnected from any real thing that exists in the physical world, even when it's obvious that an analogy is being used and there's no possible confusion with the real-world meaning.
Also, who believes "the people-as-a-whole" are asking for dumbing-down?
The words "prospectors" and "harvesters" seemed to me to be perfectly describing, in an imaginative way, the role of each actor in the system, and as a nonnative English speaker, I don't mind having to fetch a word definition if needed -- it takes 10 seconds from within a web browser.
Moreover, reducing the overall vocabulary in use in a system has a nasty side effect in the long term: you end up with a lot of confusion around every word, because they have all become ambiguous, and nobody can guess what anything is about without a lot of context. Here, the word "input" is the most problematic; it has a lousy definition and could very well mean "one particular file" to me, and it does not convey any activity of discovering new data: input is a passive thing. The word "reader" is okayish, albeit generic and blunt compared to harvester.
-- I have a mouse on my desktop, and I have a mouse in the attic.