Beats: Filebeat Fails After Power Failure

Created on 6 Feb 2017 · 25 Comments · Source: elastic/beats

Issue: Filebeat running on CentOS Linux release 7.3.1611 (Core), Filebeat 5.1. The server suffers a power failure due to the host failing (ungraceful shutdown); the VM runs off a remote QNAP via iSCSI. The VM recovers fine but Filebeat doesn't start.

Host OS: Windows Server 2016 Datacenter Evaluation.
HyperVisor: Hyper-V 10.0.14393.0.
VM OS Affected: CentOS Linux release 7.3.1611 (Core), x64 Server Edition.
Filebeat: Version 5.1.1
Filebeat Log: Default logging verbosity.

File before failure:

  • input_type: log

    • /var/log/audit/audit.log

This is the only file that was in use. /var/log/filebeat/* yields just the standard EOF error, with no details about the timestamp or what action the Filebeat agent was taking at the time of the power failure. Is there an alternate location that might tell me this information?

2017-02-05T19:21:25+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]
2017-02-05T19:21:25+13:00 INFO Setup Beat: filebeat; Version: 5.1.1
2017-02-05T19:21:25+13:00 INFO Max Retries set to: 3
2017-02-05T19:21:25+13:00 INFO Activated logstash as output plugin.
2017-02-05T19:21:25+13:00 INFO Publisher name: mon.testdomain.com
2017-02-05T19:21:25+13:00 INFO Flush Interval set to: 1s
2017-02-05T19:21:25+13:00 INFO Max Bulk Size set to: 2048
2017-02-05T19:21:25+13:00 INFO filebeat start running.
2017-02-05T19:21:25+13:00 INFO Registry file set to: /var/lib/filebeat/registry
2017-02-05T19:21:25+13:00 INFO Loading registrar data from /var/lib/filebeat/registry
2017-02-05T19:21:25+13:00 INFO Total non-zero values:
2017-02-05T19:21:25+13:00 INFO Uptime: 9.9703ms
2017-02-05T19:21:25+13:00 INFO filebeat stopped.
2017-02-05T19:21:25+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF

* This issue has been moved to Github from https://discuss.elastic.co/t/filebeat-fails-after-power-failure/73967 *

Happy to provide any additional detail; please just provide steps to acquire it if it's not in the obvious log files.

Filebeat

Most helpful comment

Hi folks,

On a hunch, I'm going to suggest the problem is a lack of fsync in here:
https://github.com/elastic/beats/blob/0af759a991b8cdc5e130ac50613bcc078c49727c/filebeat/registrar/registrar.go#L296-L328

Event timeline:

  1. New registry file is opened.
  2. File is written, but not synced. This means the file on disk is zero bytes, and the changes are cached as dirty pages in memory waiting for the next file system sync. On Linux (at least CentOS) this happens every 30s.
  3. The new registry file is swapped in for the real one. Remember: it's still zero bytes on disk, so now the only valid copy of the state is cached in memory, still waiting for the dirty pages to be flushed to disk.
  4. Server crashes, changes are lost. File on disk stays zero bytes.

The fix is probably to open the file with O_SYNC on L301.

All 25 comments

Is there another log file, like /var/log/filebeat/filebeat.1, containing data from before the ungraceful shutdown? It may not contain anything useful since the logs may not have been synced to disk.

Filebeat writes the registry data to a temporary file, calls fsync, then uses a move to achieve something that's close to an atomic update. So we are a bit surprised to see a zero-byte file, since fsync is supposed to block until the bytes are written to disk.
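For illustration, a minimal sketch of that temp-file + fsync + rename pattern might look like the following; the function name and the state shape here are hypothetical, not the actual registrar code:

```go
// Sketch only: temp-file + fsync + rename, as described above.
// The function name and the state shape are illustrative, not Filebeat's code.
package main

import (
	"encoding/json"
	"log"
	"os"
)

func writeRegistry(path string, states interface{}) error {
	tmp := path + ".new"

	f, err := os.OpenFile(tmp, os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0600)
	if err != nil {
		return err
	}

	// Serialize the full state to the temporary file.
	if err := json.NewEncoder(f).Encode(states); err != nil {
		f.Close()
		return err
	}

	// fsync: block until the bytes are actually on disk, so the rename below
	// never publishes a file whose contents only exist in the page cache.
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}

	// Rename is atomic on the same file system: readers see either the old
	// registry or the complete new one, never a partially written file.
	return os.Rename(tmp, path)
}

func main() {
	states := []map[string]interface{}{
		{"source": "/var/log/audit/audit.log", "offset": 1024},
	}
	if err := writeRegistry("registry", states); err != nil {
		log.Fatal(err)
	}
}
```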

Sure is. Added the .1 -> .6 log files for completeness.

Reviewed the other server that experienced the same blank registry file, and it is pretty much the same as the logs provided below. Nothing additional. I could possibly try to reproduce the issue by enabling a higher level of verbose logging within Filebeat (if that's possible) and causing a power failure? Would that possibly yield better information?

[--REDACT--@--REDACT-- filebeat]# cat filebeat.1

2017-02-05T21:39:16+13:00 INFO Metrics logging every 30s

[--REDACT--@--REDACT-- filebeat]# cat filebeat.2

2017-02-05T21:39:16+13:00 INFO Metrics logging every 30s

2017-02-05T21:39:16+13:00 INFO Max Retries set to: 3

2017-02-05T21:39:16+13:00 INFO Activated logstash as output plugin.

2017-02-05T21:39:16+13:00 INFO Publisher name: [email protected]

2017-02-05T21:39:16+13:00 INFO Flush Interval set to: 1s

2017-02-05T21:39:16+13:00 INFO Max Bulk Size set to: 2048

2017-02-05T21:39:16+13:00 INFO filebeat start running.

2017-02-05T21:39:16+13:00 INFO Registry file set to: /var/lib/filebeat/registry

2017-02-05T21:39:16+13:00 INFO Loading registrar data from /var/lib/filebeat/registry

2017-02-05T21:39:16+13:00 INFO Total non-zero values:

2017-02-05T21:39:16+13:00 INFO Uptime: 15.5132ms

2017-02-05T21:39:16+13:00 INFO filebeat stopped.

2017-02-05T21:39:16+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF

[--REDACT--@--REDACT-- filebeat]# cat filebeat.3

2017-02-05T21:39:16+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]

2017-02-05T21:39:16+13:00 INFO Setup Beat: filebeat; Version: 5.1.1

[--REDACT--@--REDACT-- filebeat]# cat filebeat.4

2017-02-05T21:39:16+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]

2017-02-05T21:39:16+13:00 INFO Setup Beat: filebeat; Version: 5.1.1

2017-02-05T21:39:16+13:00 INFO Max Retries set to: 3

2017-02-05T21:39:16+13:00 INFO Activated logstash as output plugin.

2017-02-05T21:39:16+13:00 INFO Publisher name: [email protected]

2017-02-05T21:39:16+13:00 INFO Flush Interval set to: 1s

2017-02-05T21:39:16+13:00 INFO Max Bulk Size set to: 2048

2017-02-05T21:39:16+13:00 INFO filebeat start running.

2017-02-05T21:39:16+13:00 INFO Registry file set to: /var/lib/filebeat/registry

2017-02-05T21:39:16+13:00 INFO Loading registrar data from /var/lib/filebeat/registry

2017-02-05T21:39:16+13:00 INFO Total non-zero values:

2017-02-05T21:39:16+13:00 INFO Uptime: 11.1636ms

2017-02-05T21:39:16+13:00 INFO filebeat stopped.

2017-02-05T21:39:16+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF

[--REDACT--@--REDACT-- filebeat]# cat filebeat.5

2017-02-05T21:39:16+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]

2017-02-05T21:39:16+13:00 INFO Setup Beat: filebeat; Version: 5.1.1

2017-02-05T21:39:16+13:00 INFO Max Retries set to: 3

2017-02-05T21:39:16+13:00 INFO Activated logstash as output plugin.

2017-02-05T21:39:16+13:00 INFO Publisher name: [email protected]

2017-02-05T21:39:16+13:00 INFO Flush Interval set to: 1s

2017-02-05T21:39:16+13:00 INFO Max Bulk Size set to: 2048

2017-02-05T21:39:16+13:00 INFO filebeat start running.

2017-02-05T21:39:16+13:00 INFO Registry file set to: /var/lib/filebeat/registry

2017-02-05T21:39:16+13:00 INFO Loading registrar data from /var/lib/filebeat/registry

2017-02-05T21:39:16+13:00 INFO Total non-zero values:

2017-02-05T21:39:16+13:00 INFO Uptime: 14.0138ms

2017-02-05T21:39:16+13:00 INFO filebeat stopped.

2017-02-05T21:39:16+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF

[--REDACT--@--REDACT-- filebeat]# cat filebeat.6

2017-02-05T21:39:15+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]

2017-02-05T21:39:15+13:00 INFO Setup Beat: filebeat; Version: 5.1.1

2017-02-05T21:39:15+13:00 INFO Max Retries set to: 3

2017-02-05T21:39:15+13:00 INFO Activated logstash as output plugin.

2017-02-05T21:39:15+13:00 INFO Publisher name: [email protected]

2017-02-05T21:39:15+13:00 INFO Flush Interval set to: 1s

2017-02-05T21:39:15+13:00 INFO Max Bulk Size set to: 2048

2017-02-05T21:39:15+13:00 INFO filebeat start running.

2017-02-05T21:39:15+13:00 INFO Registry file set to: /var/lib/filebeat/registry

2017-02-05T21:39:15+13:00 INFO Loading registrar data from /var/lib/filebeat/registry

2017-02-05T21:39:15+13:00 INFO Total non-zero values:

2017-02-05T21:39:15+13:00 INFO Uptime: 39.7352ms

2017-02-05T21:39:15+13:00 INFO filebeat stopped.

2017-02-05T21:39:15+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF


I'm experiencing (or rather, experienced) a similar issue. After updating Filebeat to the latest (5.0.2) version, the filebeat service wouldn't start anymore; however, after removing _/var/lib/filebeat/registry_ the filebeat service starts without any issues.

ref: Exiting: Could not start registrar: Error loading state: Error decoding states: EOF - Beats / Filebeat - Discuss the Elastic Stack

I experienced something like this in a VMware virtual machine running CentOS 7 64-bit. When I power off the VM or revert to a snapshot, Filebeat will not start again. The fix for me was to delete the registry file. I made a forum post about it with more information, such as the log files/syslogs.

Thanks for the response; removal of the registry file has resolved the issue for me as well.

Is a possible fix to perform an error check that determines whether the EOF error is present in the log file and, if so, checks the integrity of the registry file and, if it is blank, removes it and reattempts starting the service, otherwise continuing as normal?
Identifying the reason for the agent failing would be the optimal solution in my mind, but maybe error checking is an appropriate way to handle this specific issue if the root cause cannot be determined? I'm not a developer by day, so this is just random thinking on possible solutions.
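For illustration, a rough sketch of what such a startup check could look like (purely hypothetical, not Filebeat's actual behaviour; note that, as pointed out further down, silently resetting the registry would mean previously shipped events get sent again):

```go
// Sketch only: a hypothetical startup check, not Filebeat's actual behaviour.
// If the registry exists but is empty or undecodable, move it aside and
// continue with an empty state instead of exiting.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

func loadOrResetRegistry(path string) ([]map[string]interface{}, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return nil, nil // no registry yet: start with an empty state
	}
	if err != nil {
		return nil, err
	}

	var states []map[string]interface{}
	if len(data) == 0 || json.Unmarshal(data, &states) != nil {
		// Zero-byte or corrupt registry: keep a copy for inspection, start clean.
		if err := os.Rename(path, path+".corrupt"); err != nil {
			return nil, err
		}
		return nil, nil
	}
	return states, nil
}

func main() {
	states, err := loadOrResetRegistry("/var/lib/filebeat/registry")
	if err != nil {
		fmt.Println("error:", err)
		os.Exit(1)
	}
	fmt.Printf("loaded %d states\n", len(states))
}
```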

The provided workaround was successful, thanks for that, guys. I haven't seen the issue occur with Ubuntu VMs, which also suffered the same critical failure (also running Filebeat) but had no issues starting back up. Are there differences in how CentOS and Ubuntu handle ungraceful failure that may be contributing to the issue?


Since version 5.0, Filebeat does some checks on whether the registry file is in a valid state, and with 5.1 additional checks were introduced. The problem before was that the above issues were not even detected. Now, if there is an invalid registry file, Filebeat will not start. If we just reset the registry file, this would lead to duplicated events.

@andynz2017 About the log files you posted: it seems you tried to start Filebeat several times, which meant a new log file was written each time, and the one we are interested in does not exist anymore.

So far it seems that all the people reporting an issue with the registry file on restart are using virtual machines. Could you guys share what file systems you are using? Did anyone find a "reliable" way to reproduce this?

Nicolas - that would make a lot of sense; we must have lost the key log files you guys were looking for. Damn.

The file system on both systems that experienced the failure was XFS for the OS and ext3 for the drive the ES indexes are stored on. OS - sda1, ES storage - sdb1 (ELK server). The other server was a CentOS 7 server running Open Monitoring Distribution; its OS drive was also using XFS.

The OS partition on both servers is using LVM2 format, as per the vgdisplay output.

I haven't been able to reliably reproduce the issue to date, but as this is a lab environment I can try to do so over the next couple of days to help us gain some traction.

I have a couple of ideas on where to start with the testing but am hoping for some input on the order... whether the suspicion is with the file system type, partition format, the way the OS handles ungraceful failure, or somewhere else. It may be worthwhile coming up with a plan of attack to save time.

Andy


Just an update here.

The host failed last night. The following information was written over a 13-minute period prior to the power failure on one of the servers. Comparing the output to another log such as /var/log/messages, it looks like nothing at all is written at the time the server dies.

It appears to indicate that whenever there is a major catastrophic power event, the VMs on the host do not have time to write anything to disk. This only affects CentOS servers. Ubuntu servers running Filebeat do not experience the same issues, and Filebeat starts fine when those VMs boot back up. The Ubuntu 16.04 VM is running the default file system, ext4 - didn't have my 2FA token handy to elevate to root to verify this, but will do that tomorrow.

Looking through the actual host logs, nothing was written to disk there either - no kernel power alerts, no dump file, no logs in Application or System prior to the event. I can't explain this one. Any ideas on what other logs/information may be useful in helping track down the root of this issue? I may need to use a process of elimination and build some test servers with different file systems etc. to see whether I see similar behaviour.

//Last extract from /var/log/messages
Feb 13 00:32:01 --Redact-- systemd[1]: Removed slice user-997.slice.

Feb 13 00:32:01 --Redact-- systemd[1]: Stopping user-997.slice.

Feb 13 00:32:01 --Redact-- systemd[1]: Removed slice user-996.slice.

Feb 13 00:32:01 --Redact-- systemd[1]: Stopping user-996.slice.

Feb 13 06:44:56 --Redact-- rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="930" x-info="http://www.rsyslog.com"] start

//Last 13 or so minutes from /var/log/filebeat/filebeat
2017-02-13T00:19:07+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=14 publish.events=14 registrar.writes=1 libbeat.publisher.published_events=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1244 libbeat.logstash.published_and_acked_events=14 libbeat.logstash.publish.read_bytes=35

2017-02-13T00:19:37+13:00 INFO Non-zero metrics in the last 30s: publish.events=12 registrar.states.update=12 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.publish.write_bytes=1320 libbeat.logstash.published_and_acked_events=12 libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 registrar.writes=1

2017-02-13T00:20:07+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=14 registrar.writes=1 libbeat.logstash.published_and_acked_events=14 libbeat.logstash.publish.read_bytes=35 publish.events=14 libbeat.logstash.publish.write_bytes=1312 libbeat.logstash.call_count.PublishEvents=1 libbeat.publisher.published_events=14

2017-02-13T00:20:37+13:00 INFO Non-zero metrics in the last 30s: publish.events=8 libbeat.logstash.publish.write_bytes=938 libbeat.publisher.published_events=8 libbeat.logstash.published_and_acked_events=8 registrar.writes=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=35 registrar.states.update=8

2017-02-13T00:21:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.published_and_acked_events=14 libbeat.logstash.publish.read_bytes=35 registrar.states.update=14 registrar.writes=1 libbeat.logstash.publish.write_bytes=1231 publish.events=14 libbeat.publisher.published_events=14 libbeat.logstash.call_count.PublishEvents=1

2017-02-13T00:21:37+13:00 INFO Non-zero metrics in the last 30s: registrar.writes=1 libbeat.logstash.publish.write_bytes=1253 libbeat.logstash.publish.read_bytes=35 publish.events=12 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=12 libbeat.publisher.published_events=12 registrar.states.update=12

2017-02-13T00:22:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1297 libbeat.publisher.published_events=14 registrar.writes=1 publish.events=14 registrar.states.update=14 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=14

2017-02-13T00:22:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1103 registrar.states.update=8 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=8 libbeat.logstash.publish.read_bytes=35 registrar.writes=1 libbeat.publisher.published_events=8 publish.events=8

2017-02-13T00:23:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.publisher.published_events=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=14 registrar.states.update=14 libbeat.logstash.publish.write_bytes=1312 registrar.writes=1 libbeat.logstash.publish.read_bytes=35 publish.events=14

2017-02-13T00:23:37+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=12 libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 registrar.writes=1 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.publish.write_bytes=1221 publish.events=12 libbeat.logstash.published_and_acked_events=12

2017-02-13T00:24:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=35 publish.events=14 registrar.states.update=14 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14 libbeat.logstash.publish.write_bytes=1244 registrar.writes=1

2017-02-13T00:24:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.published_and_acked_events=8 libbeat.publisher.published_events=8 libbeat.logstash.call_count.PublishEvents=1 registrar.states.update=8 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.publish.write_bytes=927 publish.events=8 registrar.writes=1

2017-02-13T00:25:07+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=14 publish.events=14 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14 registrar.writes=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1253

2017-02-13T00:25:37+13:00 INFO Non-zero metrics in the last 30s: publish.events=12 libbeat.logstash.published_and_acked_events=12 registrar.states.update=12 libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 registrar.writes=1 libbeat.logstash.publish.write_bytes=1264 libbeat.logstash.publish.read_bytes=35

2017-02-13T00:26:07+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=14 registrar.writes=1 publish.events=14 libbeat.logstash.publish.read_bytes=35 libbeat.publisher.published_events=14 libbeat.logstash.publish.write_bytes=1301 libbeat.logstash.published_and_acked_events=14 libbeat.logstash.call_count.PublishEvents=1

2017-02-13T00:26:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 registrar.states.update=8 registrar.writes=1 libbeat.logstash.published_and_acked_events=8 publish.events=8 libbeat.publisher.published_events=8 libbeat.logstash.publish.write_bytes=948 libbeat.logstash.publish.read_bytes=35

2017-02-13T00:27:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1228 registrar.states.update=14 libbeat.logstash.call_count.PublishEvents=1 publish.events=14 libbeat.logstash.publish.read_bytes=35 registrar.writes=1 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14

2017-02-13T00:27:37+13:00 INFO Non-zero metrics in the last 30s: registrar.writes=1 libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 publish.events=12 libbeat.logstash.published_and_acked_events=12 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.publish.write_bytes=1229 registrar.states.update=12

2017-02-13T00:28:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14 registrar.writes=1 libbeat.logstash.publish.read_bytes=35 publish.events=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1263 registrar.states.update=14

2017-02-13T00:28:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=955 registrar.states.update=8 libbeat.publisher.published_events=8 libbeat.logstash.publish.read_bytes=35 registrar.writes=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=8 publish.events=8

2017-02-13T00:29:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1304 publish.events=14 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.call_count.PublishEvents=1 libbeat.publisher.published_events=14 registrar.writes=1 libbeat.logstash.published_and_acked_events=14 registrar.states.update=14

2017-02-13T00:29:37+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=12 libbeat.logstash.publish.write_bytes=1225 libbeat.publisher.published_events=12 libbeat.logstash.publish.read_bytes=35 publish.events=12 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=12 registrar.writes=1

2017-02-13T00:30:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1322 publish.events=14 registrar.writes=1 libbeat.logstash.publish.read_bytes=35 registrar.states.update=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14

2017-02-13T00:30:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=949 libbeat.publisher.published_events=8 registrar.writes=1 publish.events=8 libbeat.logstash.published_and_acked_events=8 registrar.states.update=8 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=35

2017-02-13T00:31:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1249 libbeat.logstash.published_and_acked_events=14 publish.events=14 registrar.writes=1 registrar.states.update=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=35 libbeat.publisher.published_events=14

2017-02-13T00:31:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1345 libbeat.logstash.published_and_acked_events=12 libbeat.logstash.publish.read_bytes=35 registrar.states.update=12 registrar.writes=1 publish.events=12

2017-02-13T00:32:07+13:00 INFO Non-zero metrics in the last 30s: registrar.writes=1 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14 publish.events=14 registrar.states.update=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1266 libbeat.logstash.publish.read_bytes=35

2017-02-13T00:32:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=8 registrar.writes=1 libbeat.logstash.publish.write_bytes=960 libbeat.publisher.published_events=8 libbeat.logstash.publish.read_bytes=35 registrar.states.update=8 publish.events=8


Andrew


@andynz2017 Thanks a lot for digging deeper into this one. I went through all the logs above; unfortunately there is nothing inside that would even indicate that Filebeat was stopped. But that is to be expected, as the machine was not shut down properly.

Here are my notes from going through the issue:

  • For testing, the file output could be used to eliminate the dependency on the ES setup.
  • Only the local file system where the registry file is stored should matter.
  • As far as I understand, this file system is XFS?
  • It is very strange that there is a difference between CentOS and the others. Do these systems have different file systems?

Could it be that something "strange" happens when a process has an open file handle on a file when the OS crashes?

Hi folks,

On a hunch, I'm going to suggest the problem is a lack of fsync in here:
https://github.com/elastic/beats/blob/0af759a991b8cdc5e130ac50613bcc078c49727c/filebeat/registrar/registrar.go#L296-L328

Event timeline:

  1. New registry file is opened.
  2. File is written, but not synced. This means the file on disk is zero bytes, and the changes are cached as dirty pages in memory waiting for the next file system sync. On Linux (at least CentOS) this happens every 30s.
  3. The new registry file is swapped in for the real one. Remember: it's still zero bytes on disk, so now the only valid copy of the state is cached in memory, still waiting for the dirty pages to be flushed to disk.
  4. Server crashes, changes are lost. File on disk stays zero bytes.

The fix is probably to open the file with O_SYNC on L301.
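For reference, a minimal sketch of the O_SYNC suggestion (the path and helper name are illustrative assumptions; an explicit f.Sync() before the rename would achieve the same effect):

```go
// Sketch only: opening the temporary registry file with O_SYNC so every write
// blocks until the data reaches stable storage; the later rename then cannot
// publish a zero-byte file.
package main

import "os"

func openRegistryTmp(tmpPath string) (*os.File, error) {
	return os.OpenFile(tmpPath, os.O_RDWR|os.O_CREATE|os.O_TRUNC|os.O_SYNC, 0600)
}

func main() {
	f, err := openRegistryTmp("/var/lib/filebeat/registry.new")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	// Placeholder payload; with O_SYNC the write returns only after it is on disk.
	if _, err := f.WriteString("[]"); err != nil {
		panic(err)
	}
}
```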

Apologies, I haven't had time to read the whole thread, but I think I have experienced this issue as well. I got an ELK dev stack up and running on my laptop, installed Filebeat and had it working yesterday. Then when I booted this morning and checked Kibana - no logs since last night. My laptop didn't have an ungraceful shutdown, so I'm not sure what caused the issue.

2017-02-24T10:10:27Z INFO filebeat start running.
2017-02-24T10:10:27Z INFO Registry file set to: C:\ProgramData\filebeat\registry
2017-02-24T10:10:27Z INFO Loading registrar data from C:\ProgramData\filebeat\registry
2017-02-24T10:10:27Z ERR Error decoding old state: invalid character '\x00' looking for beginning of value
2017-02-24T10:10:27Z INFO Total non-zero values:
2017-02-24T10:10:27Z INFO Uptime: 35.9984ms
2017-02-24T10:10:27Z INFO filebeat stopped.
2017-02-24T10:10:27Z CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: invalid character '\x00' looking for beginning of value

EDIT: Sorry, was in a bit of a rush earlier - I'm running Windows 10, build 14393.726, JDK 8u121.

@leinad13 That seems to be a different issue. Can you open a topic on discuss? https://discuss.elastic.co/c/beats/filebeat

@avleen Great find. Based on your input I opened https://github.com/elastic/beats/pull/3668 so we can directly discuss further based on code.

Hey,

Out of interest, was the registry file blank in your case? I noticed you're running Filebeat on Windows whereas the initial thread was based on CentOS 64-bit... this issue is possibly OS-agnostic, and the common information I have derived to date seems to indicate that the issue is triggered during shutdown/power-off. I'm not sure if it's just a coincidence or if there is actually something going on with how Filebeat writes to the registry during the shutdown/power-off scenario.

I'm guessing removal of the registry file in your case and starting filebeat service resolves the issue?

On 24/02/2017, at 11:17 PM, leinad13 notifications@github.com wrote:

Apologies I havn't had time to read the whole thread, but I think I have experienced this issue as well. I got an ELK dev stack up and running on my laptop, installed filebeat and had it working yesterday. Then when i booted this morning, checked Kibana - no logs since last night. My laptop didn't have a un-graceful shutdown so not sure what caused the issue.

2017-02-24T10:10:27Z INFO filebeat start running. 2017-02-24T10:10:27Z INFO Registry file set to: C:ProgramDatafilebeatregistry 2017-02-24T10:10:27Z INFO Loading registrar data from C:ProgramDatafilebeatregistry 2017-02-24T10:10:27Z ERR Error decoding old state: invalid character 'x00' looking for beginning of value 2017-02-24T10:10:27Z INFO Total non-zero values: 2017-02-24T10:10:27Z INFO Uptime: 35.9984ms 2017-02-24T10:10:27Z INFO filebeat stopped. 2017-02-24T10:10:27Z CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: invalid character 'x00' looking for beginning of value

β€”
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@avleen The PR just got merged. I really appreciate the work you put into this one to find the issue and point it out.

Thanks @ruflin !

I was having the same error as @leinad13, after a BSOD, namely "2017-09-07T10:51:51+03:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: invalid character '\x00' looking for beginning of value"

I am using filebeat 5.5.2.

The fix was to delete the files in "C:\ProgramData\filebeat", if it helps someone reading this thread.

I was having the same error as @leinad13, after a BSOD, namely "2017-09-07T10:51:51+03:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: invalid character '\x00' looking for beginning of value"

We had the same issue; the registry file was full of NUL characters.
I deleted the registry and renamed registry.old to registry, and things seem to be working now.

This was on 5.2.2

I just had the same issue after a host failure, filebeat v 5.6.3

@cosminpopescu @JoshSchreuder @trondhindenes There must be something more there, as the above fix is in all 5.x branches. Any chance you guys could provide some more information, like:

  • OS
  • File system used
  • Filebeat configs
  • Potentially content of registry file (if not empty)

I am still able to reproduce it with filebeat 6.1.1 and 6.2.1 on Linux and FreeBSD.

cat /dev/null > /var/db/beats/filebeat/data/registry
service filebeat start

Start fails with empty registry

beat.go:635: Exiting: Could not start registrar: Error loading state: Error decoding states: EOF

I suggest not catting /dev/null into your registry file ;-)

@ruflin I've also had the same issue with filebeat not starting after a hard shutdown on multiple systems. Here are some details:

  • Filebeat 6.2.2 installed via chocolatey on Windows Server 2016 on AWS EC2
  • Filesystem is NTFS
  • Server is stopped and started on a schedule, using the AWS API (StopInstance, StartInstance)
  • When system comes back up, the filebeat service is Stopped and the log shows the following:
2018-03-05T06:30:05.118Z    INFO    instance/beat.go:468    Home path: [C:\ProgramData\chocolatey\lib\filebeat\tools\filebeat-6.2.2-windows-x86_64] Config path: [C:\ProgramData\chocolatey\lib\filebeat\tools\filebeat-6.2.2-windows-x86_64] Data path: [C:\\ProgramData\\filebeat] Logs path: [C:\\ProgramData\\filebeat\logs]
2018-03-05T06:30:05.161Z    INFO    instance/beat.go:475    Beat UUID: 36b5d0e5-691c-4343-bb3b-b64ca4ac2ad2
2018-03-05T06:30:05.161Z    INFO    instance/beat.go:213    Setup Beat: filebeat; Version: 6.2.2
2018-03-05T06:30:05.162Z    INFO    pipeline/module.go:76   Beat name: STAGE-HOSTNAME
2018-03-05T06:30:05.391Z    INFO    instance/beat.go:301    filebeat start running.
2018-03-05T06:30:05.392Z    INFO    [monitoring]    log/log.go:97   Starting metrics logging every 30s
2018-03-05T06:30:05.404Z    INFO    registrar/registrar.go:108  Loading registrar data from C:\ProgramData\filebeat\registry
2018-03-05T06:30:05.404Z    INFO    instance/beat.go:308    filebeat stopped.
2018-03-05T06:30:21.991Z    INFO    [monitoring]    log/log.go:132  Total non-zero metrics  {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":78,"time":78},"total":{"ticks":78,"time":93,"value":0},"user":{"ticks":0,"time":15}},"info":{"ephemeral_id":"5772a22c-bf28-4ae4-8e54-97ee4b5f703d","uptime":{"ms":16483}},"memstats":{"gc_next":4473924,"memory_alloc":2891360,"memory_total":2891360,"rss":19214336}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"output":{"type":"logstash"},"pipeline":{"clients":0,"events":{"active":0}}},"registrar":{"states":{"current":0}},"system":{"cpu":{"cores":2}}}}}
2018-03-05T06:30:21.991Z    INFO    [monitoring]    log/log.go:133  Uptime: 16.4832551s
2018-03-05T06:30:21.991Z    INFO    [monitoring]    log/log.go:110  Stopping metrics logging.
2018-03-05T06:30:21.994Z    ERROR   instance/beat.go:667    Exiting: Could not start registrar: Error loading state: Error decoding states: invalid character '\x00' looking for beginning of value
  • Deleting the registry file and restarting the service solves the problem, but it's not a nice solution from an automation point of view
  • Filebeat config is the following:
filebeat.prospectors:
- input_type: log
  paths:
  - C:\AppLogs\Service1\Service1.log
  multiline.pattern: '(INFO|ERROR|DEBUG|TRACE|FATAL)'
  multiline.negate: true
  multiline.match: after
- input_type: log
  paths:
    - C:\AppLogs\Service2\Service2.log
    - C:\AppLogs\Service3\Service3.log
    - C:\AppLogs\Service4\Service4.log
    - C:\AppLogs\Service5\Service5.log
    - C:\AppLogs\Service6\Service6.log
    - C:\AppLogs\Service7\Service7.log
    - C:\AppLogs\Service8\Service8.log
  #exclude_lines: ["^DBG"]
  #include_lines: ["^ERR", "^WARN"]
  #exclude_files: [".gz$"]
  #fields:
  #  level: debug
  #  review: 1
  ignore_older: 15m
  multiline.pattern: ^\d{1,2}\/\d{1,2}\/\d{4}
  multiline.negate: true
  multiline.match: after
#name:
#tags: ["service-X", "web-tier"]
fields:
  env: STAGE
#output.elasticsearch:
  #protocol: "https"
  #username: "elastic"
  #password: "changeme"
output.logstash:
  hosts: ["logstash.example.local:5044"]
  #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]
  #ssl.certificate: "/etc/pki/client/cert.pem"
  #ssl.key: "/etc/pki/client/cert.key"
#logging.level: debug
#logging.selectors: ["*"]
  • The registry file is empty when the service has failed to start, but deleting it and restarting the service produces a new file with the following content:
[{"source":"C:\\AppLogs\\Service4\\Service4.log","offset":323454,"timestamp":"2018-03-05T09:06:20.3922249Z","ttl":-1,"type":"log","FileStateOS":{"idxhi":2228224,"idxlo":36817,"vol":3841148702}}]

Let me know if there's anything else I can get from these systems to get to the bottom of this.

@polmolea We potentially have a fix for this, which we applied in Winlogbeat but missed in Filebeat: https://github.com/elastic/beats/pull/2434 We will also apply it to Filebeat.

Here is the Filebeat issue for it: https://github.com/elastic/beats/issues/6792
