Filebeat currently lacks an option to clean up the registry file. The goal of this issue is to answer three questions:
I suggest introducing two new config options, clean_older and clean_removed, with the following behaviour and defaults. This is option 4 below.
| Variable | Description | Default |
| --- | --- | --- |
| close_older | Closes the file handler for files which are older than x | 1h |
| clean_removed | Cleans the registry of files which are not found anymore, after x | 24h |
| ignore_older | Does not crawl files which are older than ignore_older. Still keeps the file state and moves the file offset to the end of the file. | infinity |
| clean_older | Removes files from the registry which are older than x | infinity |
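For illustration, a prospector configuration using all four options could look like this (the values are made up for the example and are not the proposed defaults):

close_older: 1h       # close the file handler after 1h without changes
clean_removed: 24h    # remove the state of deleted files after 24h
ignore_older: 48h     # stop crawling files older than 48h, state is kept
clean_older: 72h      # remove the state of files older than 72h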
Better name suggestions for the new variables are welcome.
Below is the current behaviour and the different options described in more detail.
The current behaviour is as follows:
File states for files which fall under ignore_older are set to the end of the file and persisted. This brings the problem that the registry is never cleaned up.
One option is to use ignore_older to also clean up the state. This would have the following consequences:
The introduction of a clean_older variable would allow setting a time after which the registry should be cleaned up. In case ignore_older keeps the same behaviour as now, it requires that clean_older is >= ignore_older, as otherwise the two would get into a race condition. As ignore_older is set by default to infinity, it would also mean that clean_older is disabled by default. If clean_older is enabled with a time, ignore_older would also have to be set.
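As a sketch of that constraint (values are made up):

ignore_older: 24h   # files older than 24h are ignored, state is kept
clean_older: 48h    # must be >= ignore_older to avoid the race condition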
Introducing a variable clean_removed would allow enabling an option so that only files which disappear from disk are cleaned up. clean_removed could either be a bool, so files that disappear are removed directly, or it could be a duration after which the files are removed. This means all files which still exist will be kept in the registry.
In case clean_removed is a duration, it would require clean_removed >= close_older, or force_close_files to be enabled, to make sure the files are closed.
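For example, assuming clean_removed takes a duration (hypothetical values):

close_older: 1h      # the file handler is closed after 1h without new content
clean_removed: 2h    # must be >= close_older so the file is closed before its state is removed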
Option 4 would combine all these options. This is my preferred option, as it gives full flexibility and keeps the behaviour of the existing variables.
@guyboertje Please have a look
Considering ignore_older virtually removes a file from the prospector (scanner), isn't close_older = ignore_older + clean_removed? On the other hand, having close_older allows for better fine-tuning.
Discussed this briefly with @ruflin. force_close_files closes the file handle as soon as the file name changes (or the file is rotated). It can be helpful to have something like "force_close_files (for rotated files) after N hours" to give filebeat some time to catch up on processing bursts in traffic after a file name change.
Based on some internal discussions and community feedback, it has become clear that force_close_files is not fully clear in what it does and how it behaves. Contributing to this is that our docs in config.full.yml are outdated, and the documentation itself is wrong.
For a better understanding, I will first try to elaborate on how filebeat works, which should make it easier to discuss the different options.
On startup, filebeat creates one prospector for each prospector configuration. These prospectors exist for the full runtime of filebeat, and their task is to make sure harvesters are started for each file in paths. The prospector stores the state for each file it started a harvester for. As soon as a harvester is started, the only communication channel between harvester and prospector is that the harvester reports back when it finished its "task". This normally means reaching close_older or hitting an error. As long as a harvester didn't update its state to "Finished", no harvester will be started for the same file. A harvester itself does not see a change if a file is renamed and stays open, as the identifier stays the same.
The force_close_files option is completely implemented on the harvester side. Normally a harvester just keeps reading the file until the end of the file is reached. Then the harvester keeps backing off until close_older is reached and the file is closed. In the case that force_close_files is enabled, two things happen: the file is closed as soon as it is removed, and it is closed as soon as it is renamed.
force_close_files could potentially be split up into two parts:

- close_file_on_removal
- close_file_on_rename

Be aware that enabling force_close_files can also have a performance hit, as the file stats are read and compared for every line.
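If it were split up, the resulting config might look like this (the option names are the proposed ones from above, not existing settings):

close_file_on_removal: true   # close the file handler as soon as the file is removed
close_file_on_rename: true    # close the file handler as soon as the file is renamed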
Taking a second look at all the configuration options, they seem to apply on 3 layers with the following purposes:
The options below do not necessarily represent the current behaviour, but what I would suggest it to be.
That means for the harvester we have the following config options:
- close_renamed: Closes the file handler if the file is renamed (rotated?)
- close_removed: Closes the file handler if the file is removed
- close_older: Closes the file handler if the harvester finished reading and the file mod time is older than the given duration
- force_close_older / drop_older: Closes the file if ModTime() is older than force_close_older, even if reading didn't finish. force_close_older == ignore_older, which means the file is ignored afterwards.
- close_eof: Closes the file if EOF is reached

By default, backoff applies when EOF is reached: the file handler is kept open, but the harvester sleeps for the backoff time. If a file is closed, the harvester is stopped, and a new harvester for the same file will be started again by the prospector after scan_frequency. Data is only lost in case a rotated file is not in the path pattern to be found again by the prospector, or if the file is removed.
Currently close_renamed + close_removed = force_close_files
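Expressed as config, that equivalence would look like this (a sketch using the option names suggested above):

# force_close_files: true would be equivalent to:
close_renamed: true
close_removed: true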
Closing files very often can have a performance impact. One reason is that additional file metadata has to be checked all the time, and it diverges from "near real time": if a file is closed, it will only be picked up again after scan_frequency.
- ignore_older: Do not start new harvesters for files older than this duration
- scan_frequency: How often files are checked for added content or newly appeared files, to start a new harvester

As long as a harvester is open for a file, ignore_older will not have any effect. It will only be taken into account when a harvester for a file has finished.
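As config, the prospector level could look like this (hypothetical values):

ignore_older: 24h     # do not start harvesters for files not modified for 24h
scan_frequency: 10s   # check every 10s for new files or newly added content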
The registry persists the state of the files. The registry is only used when filebeat is (re)started.
It would be possible to move these config options into the prospector, meaning they could be set per prospector but would still apply on the registrar level.
- clean_older: Removes files from the registry which were not modified for the given duration. This means the file state is lost.
- clean_removed: Removes the state for files for which a state exists in the registry but which currently cannot be found anymore. In case the file shows up again at a later stage, its state is lost. clean_removed could also be a duration, but for simplicity we should just enable / disable it.

The above options are especially important in 2 cases: when the registry file grows too big, and when inodes are reused.
Filebeat has the principle to send each log line at least once. Some of the configuration changes, especially in combination, can lead to data loss. Here are some more details on when data could be lost.
Harvester options: In general, closing a file handler normally means no data loss, as the file is picked up and scanned again after scan_frequency. A few exceptions apply here that can lead to data loss / a file not being completely read. Which config option is used for closing the file does not really matter.
- The file is removed or rotated out of the path pattern before the prospector picks it up again after scan_frequency. The lines which were not read so far will be lost.

Combinations: There are several combinations which can lead to data loss. Some of these combinations can be intended, but it is important to understand the consequences:
- clean_older == ignore_older: the states are removed as soon as files start to be ignored.
- clean_older < close_*: File states will be removed before a state is persisted.
- clean_removed: If a file system disappears during a scan, all states will be removed.

The suggested defaults are:

close_renamed: false
close_removed: false
close_older: 1h
force_close_older / drop_older: 0 (disabled)
close_eof: false
ignore_older: 0 (disabled)
scan_frequency: 10s
clean_older: 0 (disabled)
clean_removed: false

I will now start implementing the options mentioned above. This comment is to track the implementation. Renaming of options is still possible at a later stage:
Config options:
- close_renamed: false (https://github.com/elastic/beats/pull/1909)
- close_removed: false (https://github.com/elastic/beats/pull/1909)
- close_older: 1h
- close_ttl / force_close_older / drop_older: 0 (disabled) (https://github.com/elastic/beats/pull/1926)
- close_eof: false (https://github.com/elastic/beats/pull/1914)
- ignore_older: 0 (disabled)
- scan_frequency: 10s
- clean_older: 0 (disabled) (https://github.com/elastic/beats/pull/1915)
- clean_removed: false (https://github.com/elastic/beats/pull/1922)

Add docs:
During the implementation of the clean_* options, I realised that potentially one additional variable is needed: clean_frequency. Currently, on the prospector level, scan_frequency is used for these checks, as all states are checked every time a scan is run. But on the registrar level it is not clear when this cleanup happens. We could just reuse scan_frequency here, but that would again introduce 2 meanings for one variable. I added clean_frequency above for the implementation.
After discussing clean_frequency in more detail, we decided that the cleanup will happen when the registry is written.
This makes an additional configuration option obsolete. It has one side effect: as long as no new events are harvested by filebeat, the registry will not be cleaned up, as the registry file is never written.
Starting to implement force_close_older / drop_older, I realised the naming should also have the close_ prefix to be consistent with all the other harvester closing options. As this is essentially a time to live for a harvester, I plan to name it close_ttl.
The clean_frequency makes sense to me.
I want to suggest a small tweak to the configuration options and see if the logic makes sense as a result. Instead of having two different settings with relative times in them (close_older, drop_older), what if you had just one time-based setting and then a boolean option which can determine whether the harvester comes to a hard stop even if the file isn't finished...?
I also want to offer a suggestion to rename these as promised. After rereading this all a few times I realized that we should probably structure the documentation with these three big concepts as headers and pieces of a large diagram (prospector, harvester, registrar). And further to that end, the options which apply to those logical pieces should be named as such. So, here's a set of names which should be easily mapped to yours. Note that I have taken the liberty of incorporating the above suggestion in this list of names.
Yes, these names are long but I believe that they assist in self documenting the behavior which users are unlikely to be reading about on a routine basis. They will go to the docs, understand it once and come back later - hopefully these names in their config files will be clear to them without reading the docs every time.
harvester_close_on_renamed
harvester_close_on_removed
harvester_close_on_eof
harvester_timeout
harvester_close_on_timeout
prospector_scan_frequency
prospector_ignore_older
registrar_clean_frequency
registrar_clean_older
registrar_clean_removed
[edit - I dropped 'force' from the harvester_close_on_timeout based on your last comment, which I missed]
WDYT about putting the configs under a namespace?
harvester:
  close_on_renamed
  ...
About your boolean option for close_older and close_on_timeout:
- close_older must be <= close_on_timeout. In your case the two would always be identical.
- One could set close_older to 10s to make sure file handlers are always closed very fast, but close_on_timeout to 1min to close files which couldn't finish fast enough.

Hmm, after seeing all these options, I wonder if it's a good idea to expose them all to the user. I mean, the number of possible combinations is huge, and it sounds like many possible combinations can result in subtle data loss.
I'm not quite sure I get the motivation to make all of these configurable, is it for completeness or for handling corner cases?
It is for handling corner cases, and to not have config options with multiple meanings. I would keep the same defaults as we have now, so there is no data loss by default. For the harvester, the corner cases are that people want the file handler to be closed faster, but for different reasons. For the registrar, it solves the problems of the registry growing too big and of inode reuse.
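As an example (hypothetical values), the registry of a system with many short-lived log files could be kept small with:

clean_removed: true   # drop the state of files that disappeared from disk
clean_older: 72h      # drop the state of files not modified for 3 days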
Ok, just worried that we might be going to the other extreme now :-). Perhaps we can go through the options in today's meeting; I'm not sure I get the difference between close_older and close_ttl, for example.
Based on the inputs from @brandonmensing, a suggestion for the config file:
Before:
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/*.log
  exclude_lines: ["^DBG"]
  include_lines: ["^ERR", "^WARN"]
  exclude_files: [".gz$"]
  ignore_older: 0
  close_older: 1h
  scan_frequency: 10s
  harvester_buffer_size: 16384
  max_bytes: 10485760
  tail_files: false
  backoff: 1s
  max_backoff: 10s
  backoff_factor: 2
  close_renamed: false
  close_removed: false
  close_eof: false
  clean_older: 0
  clean_removed: false

filebeat.spool_size: 2048
filebeat.idle_timeout: 5s
filebeat.registry_file: registry
After:
filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/*.log
  exclude_files: [".gz$"]
  ignore_older: 0
  scan_frequency: 10s
  max_bytes: 10485760
  state:
    clean_older: 0
    clean_removed: false
  harvester:
    tail_files: false
    close_renamed: false
    close_removed: false
    close_eof: false
    close_older: 1h
    backoff: 1s
    max_backoff: 10s
    backoff_factor: 2
    buffer_size: 16384
    exclude_lines: ["^DBG"]
    include_lines: ["^ERR", "^WARN"]

filebeat.registry_file: registry
filebeat.spooler:
  size: 2048
  idle_timeout: 5s
I removed the options that would stay the same.
For the harvester it could also be:
filebeat.prospectors:
- input_type: log
  harvester:
    tail_files: false
    close:
      renamed: false
      removed: false
      eof: false
      older: 1h
    backoff:
      duration: 1s
      max: 10s
      factor: 2
@brandonmensing I left out the _on_ part in the config options, as it felt more natural to me: the closing, for example, happens after the file was renamed, not during the rename. Happy to discuss further.
I left out close_timeout above, but I think timeout is definitely less abstract than ttl. I'm curious if there is perhaps something even more descriptive, like close_after or other alternatives?
I just realised the above config examples had a mistake in them: the state clean variables are part of each prospector, as each prospector can have different values. I changed it above.
@brandonmensing I was wondering if people understand state better than registry? See the example above.
One more: more accurate than close_older would probably be close_inactive?
As a note: in the general discussion, the idea came up to change prospectors to inputs and harvesters to readers.
Quick summary of the conversations yesterday:
Close configurations:
close_renamed: false
close_removed: false
close_eof: false
close_idle: 1h
close_timeout: 0
close_older will be renamed to close_idle, as this describes better that it does not depend on the file mod time but on an idle harvester.
Clean configurations:
clean_idle
clean_removed
Same argument for clean_idle as for close_idle.
Instead of nesting the options under state and harvester, we will use comments to visualise the nesting. This keeps the variable names short, requires no additional indentation, and keeps a new name "harvester" out of the config file.
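For illustration, the comment-based grouping could look like this (a sketch using the names and defaults from the summary above, not the final config file):

# Close configurations
close_renamed: false
close_removed: false
close_eof: false
close_idle: 1h
close_timeout: 0

# Clean configurations
clean_idle: 0
clean_removed: false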
max_bytes and buffer_size were also discussed, as these could be improved, but this can be done at a later stage. Cleaning up the spooler options would be nice, but is not critical.
Closing as all changes were implemented. Small follow up changes are tracked in https://github.com/elastic/beats/issues/2012
I would like to have a word about the change of the terminology "prospectors" and "harvesters" to "inputs" and "readers".
For some time now, I have been under the impression that there is a general trend in software that strives to impoverish the vocabulary users are exposed to, down to the most "Simple English" possible.
Perhaps it comes from the scientist's quest to find perfectly abstract entities: the deepest, most atomic, purest ones, the most irreducible concepts, which would have no meaning left if you tried to narrow down their meaning just a lil' bit.
But that doesn't justify sticking to a terminology that stays far, far away and disconnected from any real thing that exists in the physical world, even when it's obvious that an analogy is being used and there's no possible confusion with the real-world meaning.
Also, who believes "the people-as-a-whole" are asking for dumbing-down?
The words "prospectors" and "harvesters" seemed to me to be perfectly describing, in an imaginative way, the role of each actor in the system, and as a nonnative English speaker, I don't mind having to fetch a word definition if needed -- it takes 10 seconds from within a web browser.
Moreover, reducing the overall vocabulary in use in a system has a nasty side effect in the long term: you end up with a lot of confusion around every word, because they have all become ambiguous, and nobody can guess what anything is about without a lot of context. Here, the word "input" is the most problematic; it has a lousy definition and could very well mean "one particular file" to me, and it does not convey any activity of discovering new data: input is a passive thing. The word "reader" is okayish, albeit generic and blunt compared to harvester.
-- I have a mouse on my desktop, and I have a mouse in the attic.