Issue: Filebeat 5.1 running on CentOS Linux release 7.3.1611 (Core). The server suffered a power failure due to the host failing: an ungraceful shutdown, with the VM running off a remote QNAP via iSCSI. The VM recovered fine but Filebeat doesn't start.
Host OS: Windows Server 2016 Datacenter Evaluation.
HyperVisor: Hyper-V 10.0.14393.0.
VM OS Affected: CentOS Linux release 7.3.1611 (Core), x64 Server Edition.
Filebeat: Version 5.1.1
Filebeat Log: Default logging verbosity.
File before failure:
This is the only file that was in use. /var/log/filebeat/* yields just the standard EOF error, with no details around the timestamp or what action the Filebeat agent was taking at the time of the power failure. Is there an alternate location that might tell me this information?
2017-02-05T19:21:25+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]
2017-02-05T19:21:25+13:00 INFO Setup Beat: filebeat; Version: 5.1.1
2017-02-05T19:21:25+13:00 INFO Max Retries set to: 3
2017-02-05T19:21:25+13:00 INFO Activated logstash as output plugin.
2017-02-05T19:21:25+13:00 INFO Publisher name: mon.testdomain.com
2017-02-05T19:21:25+13:00 INFO Flush Interval set to: 1s
2017-02-05T19:21:25+13:00 INFO Max Bulk Size set to: 2048
2017-02-05T19:21:25+13:00 INFO filebeat start running.
2017-02-05T19:21:25+13:00 INFO Registry file set to: /var/lib/filebeat/registry
2017-02-05T19:21:25+13:00 INFO Loading registrar data from /var/lib/filebeat/registry
2017-02-05T19:21:25+13:00 INFO Total non-zero values:
2017-02-05T19:21:25+13:00 INFO Uptime: 9.9703ms
2017-02-05T19:21:25+13:00 INFO filebeat stopped.
2017-02-05T19:21:25+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF
* This issue has been moved to Github from https://discuss.elastic.co/t/filebeat-fails-after-power-failure/73967 *
Happy to provide any additional detail; please just provide steps to acquire it if it's not in the obvious log files.
Is there another log file, like /var/log/filebeat/filebeat.1, containing data from before the ungraceful shutdown? It may not contain anything useful since the logs may not have been synced to disk.
Filebeat writes registry data to a temporary file, calls fsync, then uses a move to achieve something that's close to an atomic update. So we are a bit surprised to see a zero-byte file, since fsync is supposed to block until the bytes are written to disk.
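For illustration, here is a minimal Go sketch of the write-temp, fsync, rename pattern described above. The paths and payload are placeholders, not Filebeat's actual registrar code; also note that, strictly speaking, persisting the rename itself across a power loss can additionally require an fsync on the containing directory.

```go
// atomic_write.go - a minimal sketch of the write-temp/fsync/rename
// pattern described above. Paths and payload are placeholders, not
// Filebeat's actual registrar implementation.
package main

import (
	"os"
	"path/filepath"
)

func writeAtomic(path string, data []byte) error {
	tmp := path + ".new"

	f, err := os.OpenFile(tmp, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0600)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	// fsync blocks until the kernel reports the bytes are on stable
	// storage, which is why a zero-byte registry after a crash is surprising.
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// rename() is atomic on POSIX file systems: readers see either the old
	// complete file or the new complete file, never a partial write.
	return os.Rename(tmp, path)
}

func main() {
	path := filepath.Join(os.TempDir(), "registry")
	states := []byte(`[{"source":"/var/log/messages","offset":1234,"ttl":-1}]`)
	if err := writeAtomic(path, states); err != nil {
		panic(err)
	}
}
```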
Sure is. Added the .1 -> .6 log files for completeness.
Reviewed the other server that experienced the same blank registry file and its logs are pretty much the same as the ones provided below; nothing additional. I could possibly try to reproduce the issue by enabling a higher level of verbose logging within Filebeat (if that's possible) and causing a power failure. Would that possibly yield better information?
[--REDACT--@--REDACT-- filebeat]# cat filebeat.1
2017-02-05T21:39:16+13:00 INFO Metrics logging every 30s
[--REDACT--@--REDACT-- filebeat]# cat filebeat.2
2017-02-05T21:39:16+13:00 INFO Metrics logging every 30s
2017-02-05T21:39:16+13:00 INFO Max Retries set to: 3
2017-02-05T21:39:16+13:00 INFO Activated logstash as output plugin.
2017-02-05T21:39:16+13:00 INFO Publisher name: [email protected]
2017-02-05T21:39:16+13:00 INFO Flush Interval set to: 1s
2017-02-05T21:39:16+13:00 INFO Max Bulk Size set to: 2048
2017-02-05T21:39:16+13:00 INFO filebeat start running.
2017-02-05T21:39:16+13:00 INFO Registry file set to: /var/lib/filebeat/registry
2017-02-05T21:39:16+13:00 INFO Loading registrar data from /var/lib/filebeat/registry
2017-02-05T21:39:16+13:00 INFO Total non-zero values:
2017-02-05T21:39:16+13:00 INFO Uptime: 15.5132ms
2017-02-05T21:39:16+13:00 INFO filebeat stopped.
2017-02-05T21:39:16+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF
[--REDACT--@--REDACT-- filebeat]# cat filebeat.3
2017-02-05T21:39:16+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]
2017-02-05T21:39:16+13:00 INFO Setup Beat: filebeat; Version: 5.1.1
[--REDACT--@--REDACT-- filebeat]# cat filebeat.4
2017-02-05T21:39:16+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]
2017-02-05T21:39:16+13:00 INFO Setup Beat: filebeat; Version: 5.1.1
2017-02-05T21:39:16+13:00 INFO Max Retries set to: 3
2017-02-05T21:39:16+13:00 INFO Activated logstash as output plugin.
2017-02-05T21:39:16+13:00 INFO Publisher name: [email protected]
2017-02-05T21:39:16+13:00 INFO Flush Interval set to: 1s
2017-02-05T21:39:16+13:00 INFO Max Bulk Size set to: 2048
2017-02-05T21:39:16+13:00 INFO filebeat start running.
2017-02-05T21:39:16+13:00 INFO Registry file set to: /var/lib/filebeat/registry
2017-02-05T21:39:16+13:00 INFO Loading registrar data from /var/lib/filebeat/registry
2017-02-05T21:39:16+13:00 INFO Total non-zero values:
2017-02-05T21:39:16+13:00 INFO Uptime: 11.1636ms
2017-02-05T21:39:16+13:00 INFO filebeat stopped.
2017-02-05T21:39:16+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF
[--REDACT--@--REDACT-- filebeat]# cat filebeat.5
2017-02-05T21:39:16+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]
2017-02-05T21:39:16+13:00 INFO Setup Beat: filebeat; Version: 5.1.1
2017-02-05T21:39:16+13:00 INFO Max Retries set to: 3
2017-02-05T21:39:16+13:00 INFO Activated logstash as output plugin.
2017-02-05T21:39:16+13:00 INFO Publisher name: [email protected]
2017-02-05T21:39:16+13:00 INFO Flush Interval set to: 1s
2017-02-05T21:39:16+13:00 INFO Max Bulk Size set to: 2048
2017-02-05T21:39:16+13:00 INFO filebeat start running.
2017-02-05T21:39:16+13:00 INFO Registry file set to: /var/lib/filebeat/registry
2017-02-05T21:39:16+13:00 INFO Loading registrar data from /var/lib/filebeat/registry
2017-02-05T21:39:16+13:00 INFO Total non-zero values:
2017-02-05T21:39:16+13:00 INFO Uptime: 14.0138ms
2017-02-05T21:39:16+13:00 INFO filebeat stopped.
2017-02-05T21:39:16+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF
[--REDACT--@--REDACT-- filebeat]# cat filebeat.6
2017-02-05T21:39:15+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]
2017-02-05T21:39:15+13:00 INFO Setup Beat: filebeat; Version: 5.1.1
2017-02-05T21:39:15+13:00 INFO Max Retries set to: 3
2017-02-05T21:39:15+13:00 INFO Activated logstash as output plugin.
2017-02-05T21:39:15+13:00 INFO Publisher name: [email protected]
2017-02-05T21:39:15+13:00 INFO Flush Interval set to: 1s
2017-02-05T21:39:15+13:00 INFO Max Bulk Size set to: 2048
2017-02-05T21:39:15+13:00 INFO filebeat start running.
2017-02-05T21:39:15+13:00 INFO Registry file set to: /var/lib/filebeat/registry
2017-02-05T21:39:15+13:00 INFO Loading registrar data from /var/lib/filebeat/registry
2017-02-05T21:39:15+13:00 INFO Total non-zero values:
2017-02-05T21:39:15+13:00 INFO Uptime: 39.7352ms
2017-02-05T21:39:15+13:00 INFO filebeat stopped.
2017-02-05T21:39:15+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF
I'm experiencing (or rather experienced) a similar issue: after updating Filebeat to the latest (5.0.2) version, the filebeat service wouldn't start anymore; however, after removing _/var/lib/filebeat/registry_, the service starts without any issues.
I experienced something like this in a VMWare Virtual Machine of CentOS 7 64-Bit. When I power off the VM or revert to a snapshot, filebeat will not start again. The fix for me was to delete the registry file. I made a forum post about it with more information, such as the log files/syslogs.
Thanks for the response; removal of the registry file has resolved the issue for me as well.
Is a possible fix to perform an error check: determine whether the EOF error is present in the log file and, if so, check the integrity of the registry file; if it is blank, remove it and reattempt starting the service, otherwise continue as normal?
Identifying the reason for the agent failing would be the optimal solution in my mind, but maybe error checking is an appropriate way to handle this specific issue if the root cause cannot be determined. I'm not a developer by day, so this is just some thinking out loud on possible solutions.
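For what it's worth, a rough Go sketch of that kind of pre-start check might look like the following. This is only the suggested workaround, not anything Filebeat does itself; the registry path and the "move aside" behaviour are assumptions, and discarding the registry means previously shipped files would be read again from the start (duplicated events, as noted later in this thread).

```go
// registry_check.go - a rough sketch of the suggested pre-start check:
// if the registry exists but is empty or not valid JSON, move it aside
// so the service can start. Hypothetical workaround, not built-in
// Filebeat behaviour; it sacrifices the saved offsets, so already-shipped
// log lines will be sent again.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

func checkRegistry(path string) error {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return nil // no registry yet, nothing to repair
	}
	if err != nil {
		return err
	}

	// The 5.x registry is a JSON array of per-file states.
	var states []map[string]interface{}
	if len(data) == 0 || json.Unmarshal(data, &states) != nil {
		// Empty or corrupt: move it out of the way so Filebeat can start.
		return os.Rename(path, path+".corrupt")
	}
	return nil
}

func main() {
	if err := checkRegistry("/var/lib/filebeat/registry"); err != nil {
		fmt.Fprintln(os.Stderr, "registry check failed:", err)
		os.Exit(1)
	}
}
```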
The provided workaround was successful, thanks for that, guys. I haven't seen the issue occur with Ubuntu VMs which also suffered the same critical failure (also running Filebeat) but had no issues starting back up. Are there differences in how CentOS and Ubuntu handle ungraceful failure that may be contributing to the issue?
Since version 5.0, Filebeat performs some checks on whether the registry file is in a valid state, and 5.1 introduced additional checks. The problem before was that the above issues were not even detected. Now, if there is an invalid registry file, Filebeat will not start. If we just reset the registry file, this would lead to duplicated events.
@andynz2017 About the log files you posted: it seems you tried to start Filebeat several times, so each attempt wrote a new log file and the one we are interested in no longer exists.
So far it seems that all the people reporting an issue with restarts and the registry file are using virtual machines. Could you share which file systems you are using? Did anyone find a "reliable" way to reproduce this?
Nicolas - That would make a lot of sense; I must have lost the key log files you guys were looking for. Damn.
The file system on both systems that experienced the failure was XFS for the OS and ext3 for the drive where the ES indices are stored. OS: sda1, ES storage: sdb1 (the ELK server). The other server was a CentOS 7 server running Open Monitoring Distribution; its OS drive was also using XFS as the file system type.
The OS partition on both servers uses the LVM2 format, as per vgdisplay output.
I haven't been able to reliably reproduce the issue to date, but as this is a lab environment, I can try to do so over the next couple of days to help us gain some traction.
I have a couple of ideas on where to start with the testing but am hoping for some input on the order... whether the suspicion is with the file system type, the partition format, the way the OS handles ungraceful failure, or somewhere else. It may be worthwhile coming up with a plan of attack to save time.
Andy
Just an update here.
The host failed last night. The following information was being written over a 13-minute period prior to the power failure on one of the servers. Comparing this output to another log such as /var/log/messages, it looks like nothing at all is written at the time the server dies.
It appears to indicate that whenever there is a major catastrophic power event, the VMs on the host do not have time to write anything to disk. This only affects the CentOS servers. Ubuntu servers running Filebeat do not experience the same issues, and Filebeat starts fine when those VMs boot back up. The Ubuntu 16.04 VM is running the default file system, ext4 - I didn't have my 2FA token handy to elevate to root and verify this, but will do that tomorrow.
Looking through the actual host logs, nothing was written to disk there either: no kernel power alerts, no dump file, no logs in Application or System prior to the event. I can't explain this one. Any ideas on what other logs/information may be useful in helping track down the root of this issue? I may need to use a process of elimination and build some test servers with different file systems etc. to see if I observe similar behaviour.
//Last extract from /var/log/messages
Feb 13 00:32:01 --Redact-- systemd[1]: Removed slice user-997.slice.
Feb 13 00:32:01 --Redact-- systemd[1]: Stopping user-997.slice.
Feb 13 00:32:01 --Redact-- systemd[1]: Removed slice user-996.slice.
Feb 13 00:32:01 --Redact-- systemd[1]: Stopping user-996.slice.
Feb 13 06:44:56 --Redact-- rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="930" x-info="http://www.rsyslog.com"] start
//Last 13 or so minutes from /var/log/filebeat/filebeat
2017-02-13T00:19:07+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=14 publish.events=14 registrar.writes=1 libbeat.publisher.published_events=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1244 libbeat.logstash.published_and_acked_events=14 libbeat.logstash.publish.read_bytes=35
2017-02-13T00:19:37+13:00 INFO Non-zero metrics in the last 30s: publish.events=12 registrar.states.update=12 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.publish.write_bytes=1320 libbeat.logstash.published_and_acked_events=12 libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 registrar.writes=1
2017-02-13T00:20:07+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=14 registrar.writes=1 libbeat.logstash.published_and_acked_events=14 libbeat.logstash.publish.read_bytes=35 publish.events=14 libbeat.logstash.publish.write_bytes=1312 libbeat.logstash.call_count.PublishEvents=1 libbeat.publisher.published_events=14
2017-02-13T00:20:37+13:00 INFO Non-zero metrics in the last 30s: publish.events=8 libbeat.logstash.publish.write_bytes=938 libbeat.publisher.published_events=8 libbeat.logstash.published_and_acked_events=8 registrar.writes=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=35 registrar.states.update=8
2017-02-13T00:21:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.published_and_acked_events=14 libbeat.logstash.publish.read_bytes=35 registrar.states.update=14 registrar.writes=1 libbeat.logstash.publish.write_bytes=1231 publish.events=14 libbeat.publisher.published_events=14 libbeat.logstash.call_count.PublishEvents=1
2017-02-13T00:21:37+13:00 INFO Non-zero metrics in the last 30s: registrar.writes=1 libbeat.logstash.publish.write_bytes=1253 libbeat.logstash.publish.read_bytes=35 publish.events=12 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=12 libbeat.publisher.published_events=12 registrar.states.update=12
2017-02-13T00:22:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1297 libbeat.publisher.published_events=14 registrar.writes=1 publish.events=14 registrar.states.update=14 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=14
2017-02-13T00:22:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1103 registrar.states.update=8 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=8 libbeat.logstash.publish.read_bytes=35 registrar.writes=1 libbeat.publisher.published_events=8 publish.events=8
2017-02-13T00:23:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.publisher.published_events=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=14 registrar.states.update=14 libbeat.logstash.publish.write_bytes=1312 registrar.writes=1 libbeat.logstash.publish.read_bytes=35 publish.events=14
2017-02-13T00:23:37+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=12 libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 registrar.writes=1 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.publish.write_bytes=1221 publish.events=12 libbeat.logstash.published_and_acked_events=12
2017-02-13T00:24:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=35 publish.events=14 registrar.states.update=14 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14 libbeat.logstash.publish.write_bytes=1244 registrar.writes=1
2017-02-13T00:24:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.published_and_acked_events=8 libbeat.publisher.published_events=8 libbeat.logstash.call_count.PublishEvents=1 registrar.states.update=8 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.publish.write_bytes=927 publish.events=8 registrar.writes=1
2017-02-13T00:25:07+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=14 publish.events=14 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14 registrar.writes=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1253
2017-02-13T00:25:37+13:00 INFO Non-zero metrics in the last 30s: publish.events=12 libbeat.logstash.published_and_acked_events=12 registrar.states.update=12 libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 registrar.writes=1 libbeat.logstash.publish.write_bytes=1264 libbeat.logstash.publish.read_bytes=35
2017-02-13T00:26:07+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=14 registrar.writes=1 publish.events=14 libbeat.logstash.publish.read_bytes=35 libbeat.publisher.published_events=14 libbeat.logstash.publish.write_bytes=1301 libbeat.logstash.published_and_acked_events=14 libbeat.logstash.call_count.PublishEvents=1
2017-02-13T00:26:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 registrar.states.update=8 registrar.writes=1 libbeat.logstash.published_and_acked_events=8 publish.events=8 libbeat.publisher.published_events=8 libbeat.logstash.publish.write_bytes=948 libbeat.logstash.publish.read_bytes=35
2017-02-13T00:27:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1228 registrar.states.update=14 libbeat.logstash.call_count.PublishEvents=1 publish.events=14 libbeat.logstash.publish.read_bytes=35 registrar.writes=1 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14
2017-02-13T00:27:37+13:00 INFO Non-zero metrics in the last 30s: registrar.writes=1 libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 publish.events=12 libbeat.logstash.published_and_acked_events=12 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.publish.write_bytes=1229 registrar.states.update=12
2017-02-13T00:28:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14 registrar.writes=1 libbeat.logstash.publish.read_bytes=35 publish.events=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1263 registrar.states.update=14
2017-02-13T00:28:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=955 registrar.states.update=8 libbeat.publisher.published_events=8 libbeat.logstash.publish.read_bytes=35 registrar.writes=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=8 publish.events=8
2017-02-13T00:29:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1304 publish.events=14 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.call_count.PublishEvents=1 libbeat.publisher.published_events=14 registrar.writes=1 libbeat.logstash.published_and_acked_events=14 registrar.states.update=14
2017-02-13T00:29:37+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=12 libbeat.logstash.publish.write_bytes=1225 libbeat.publisher.published_events=12 libbeat.logstash.publish.read_bytes=35 publish.events=12 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=12 registrar.writes=1
2017-02-13T00:30:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1322 publish.events=14 registrar.writes=1 libbeat.logstash.publish.read_bytes=35 registrar.states.update=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14
2017-02-13T00:30:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=949 libbeat.publisher.published_events=8 registrar.writes=1 publish.events=8 libbeat.logstash.published_and_acked_events=8 registrar.states.update=8 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=35
2017-02-13T00:31:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1249 libbeat.logstash.published_and_acked_events=14 publish.events=14 registrar.writes=1 registrar.states.update=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=35 libbeat.publisher.published_events=14
2017-02-13T00:31:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1345 libbeat.logstash.published_and_acked_events=12 libbeat.logstash.publish.read_bytes=35 registrar.states.update=12 registrar.writes=1 publish.events=12
2017-02-13T00:32:07+13:00 INFO Non-zero metrics in the last 30s: registrar.writes=1 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14 publish.events=14 registrar.states.update=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1266 libbeat.logstash.publish.read_bytes=35
2017-02-13T00:32:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=8 registrar.writes=1 libbeat.logstash.publish.write_bytes=960 libbeat.publisher.published_events=8 libbeat.logstash.publish.read_bytes=35 registrar.states.update=8 publish.events=8
@andynz2017 Thanks a lot for digging deeper into this one. I went through all the logs above; unfortunately there isn't anything in them that would even indicate that Filebeat got stopped. But this is kind of expected, as the machine was not shut down properly.
Here are the notes I made when going through the issue:
Could it be that something "strange" happens when a process has an open file handle on a file at the moment the OS crashes?
Hi folks,
On a hunch, I'm going to suggest the problem is a lack of fsync in here:
https://github.com/elastic/beats/blob/0af759a991b8cdc5e130ac50613bcc078c49727c/filebeat/registrar/registrar.go#L296-L328
Event timeline:
The fix is probably to open the file with O_SYNC on L301.
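A sketch of what that change might look like in Go is below: opening the temporary registry file with O_SYNC so every write reaches stable storage before the subsequent rename. This only illustrates the suggestion; it is not necessarily the exact patch that was later merged, and the path and payload are placeholders.

```go
// o_sync_write.go - a sketch of the suggestion above: open the temporary
// file with O_SYNC so each Write blocks until the data is on disk,
// removing the window where the rename can land before the contents do.
// Path and payload are placeholders, not the merged Filebeat patch.
package main

import "os"

func writeRegistrySynced(tmpPath string, data []byte) error {
	f, err := os.OpenFile(tmpPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC|os.O_SYNC, 0600)
	if err != nil {
		return err
	}
	defer f.Close()

	// With O_SYNC, Write does not return until the data (and the metadata
	// needed to retrieve it) has been written through to stable storage.
	_, err = f.Write(data)
	return err
}

func main() {
	if err := writeRegistrySynced("/tmp/registry.new", []byte("[]")); err != nil {
		panic(err)
	}
}
```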
Apologies, I haven't had time to read the whole thread, but I think I have experienced this issue as well. I got an ELK dev stack up and running on my laptop, installed Filebeat, and had it working yesterday. Then when I booted this morning and checked Kibana: no logs since last night. My laptop didn't have an ungraceful shutdown, so I'm not sure what caused the issue.
2017-02-24T10:10:27Z INFO filebeat start running.
2017-02-24T10:10:27Z INFO Registry file set to: C:\ProgramData\filebeat\registry
2017-02-24T10:10:27Z INFO Loading registrar data from C:\ProgramData\filebeat\registry
2017-02-24T10:10:27Z ERR Error decoding old state: invalid character '\x00' looking for beginning of value
2017-02-24T10:10:27Z INFO Total non-zero values:
2017-02-24T10:10:27Z INFO Uptime: 35.9984ms
2017-02-24T10:10:27Z INFO filebeat stopped.
2017-02-24T10:10:27Z CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: invalid character '\x00' looking for beginning of value
EDIT: Sorry, I was in a bit of a rush earlier - I'm running Windows 10, build 14393.726, JDK 8u121.
@leinad13 That seems to be a different issue. Can you open a topic on discuss? https://discuss.elastic.co/c/beats/filebeat
@avleen Great find. Based on your input I opened https://github.com/elastic/beats/pull/3668 so we can discuss further directly, based on the code.
Hey,
Out of interest, was the registry file blank in your case? I noticed you're running Filebeat on Windows, whereas the initial thread was based on CentOS 7 64-bit... this issue is possibly OS-agnostic, and the common thread in the information I have gathered to date seems to be that the issue is triggered during shutdown/power off. I'm not sure if it's just a coincidence or if there is actually something going on with how Filebeat writes to the registry during the shutdown/power-off scenario.
I'm guessing that removing the registry file in your case and starting the filebeat service resolves the issue?
@avleen PR just got merged. I really appreciate the work you put into this one to find the issue and point it out.
Thanks @ruflin !
I was having the same error as @leinad13, after a BSOD, namely "2017-09-07T10:51:51+03:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: invalid character '\x00' looking for beginning of value".
I am using filebeat 5.5.2.
The fix was to delete the files in "C:\ProgramData\filebeat", if it helps someone reading this thread.
We had the same issue; the registry file was full of NUL characters.
I deleted the registry and renamed registry.old to registry, and things seem to be working now.
This was on 5.2.2.
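If anyone wants to script that recovery, here is a small Go sketch of the same idea: when the registry is empty or full of NUL bytes, fall back to the registry.old copy. The paths are the Linux defaults and the check is deliberately crude; this mirrors the manual fix above rather than anything Filebeat does on its own.

```go
// restore_registry.go - a crude sketch of the manual recovery described
// above: if the registry is empty or contains NUL bytes, restore the
// previous copy from registry.old. Linux default paths assumed; this is
// not automatic Filebeat behaviour.
package main

import (
	"bytes"
	"fmt"
	"os"
)

func restoreFromOld(registry string) error {
	data, err := os.ReadFile(registry)
	if err == nil && len(data) > 0 && !bytes.Contains(data, []byte{0}) {
		return nil // registry looks sane, leave it alone
	}

	old := registry + ".old"
	if _, statErr := os.Stat(old); statErr != nil {
		return fmt.Errorf("registry unreadable and no %s to restore from", old)
	}
	// Replace the corrupt registry with the last known-good copy.
	return os.Rename(old, registry)
}

func main() {
	if err := restoreFromOld("/var/lib/filebeat/registry"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```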
I just had the same issue after a host failure, filebeat v 5.6.3
@cosminpopescu @JoshSchreuder @trondhindenes There must be something more going on there, as the above fix is in all 5.x branches. Any chance you guys could provide some more information?
I am still able to reproduce it with filebeat 6.1.1 and 6.2.1 on Linux and FreeBSD.
cat /dev/null > /var/db/beats/filebeat/data/registry
service filebeat start
Startup then fails with the empty registry:
beat.go:635: Exiting: Could not start registrar: Error loading state: Error decoding states: EOF
I suggest not catting /dev/null into your registry file ;-)
@ruflin I've also had the same issue with filebeat not starting after a hard shutdown on multiple systems. Here are some details:
2018-03-05T06:30:05.118Z INFO instance/beat.go:468 Home path: [C:\ProgramData\chocolatey\lib\filebeat\tools\filebeat-6.2.2-windows-x86_64] Config path: [C:\ProgramData\chocolatey\lib\filebeat\tools\filebeat-6.2.2-windows-x86_64] Data path: [C:\\ProgramData\\filebeat] Logs path: [C:\\ProgramData\\filebeat\logs]
2018-03-05T06:30:05.161Z INFO instance/beat.go:475 Beat UUID: 36b5d0e5-691c-4343-bb3b-b64ca4ac2ad2
2018-03-05T06:30:05.161Z INFO instance/beat.go:213 Setup Beat: filebeat; Version: 6.2.2
2018-03-05T06:30:05.162Z INFO pipeline/module.go:76 Beat name: STAGE-HOSTNAME
2018-03-05T06:30:05.391Z INFO instance/beat.go:301 filebeat start running.
2018-03-05T06:30:05.392Z INFO [monitoring] log/log.go:97 Starting metrics logging every 30s
2018-03-05T06:30:05.404Z INFO registrar/registrar.go:108 Loading registrar data from C:\ProgramData\filebeat\registry
2018-03-05T06:30:05.404Z INFO instance/beat.go:308 filebeat stopped.
2018-03-05T06:30:21.991Z INFO [monitoring] log/log.go:132 Total non-zero metrics {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":78,"time":78},"total":{"ticks":78,"time":93,"value":0},"user":{"ticks":0,"time":15}},"info":{"ephemeral_id":"5772a22c-bf28-4ae4-8e54-97ee4b5f703d","uptime":{"ms":16483}},"memstats":{"gc_next":4473924,"memory_alloc":2891360,"memory_total":2891360,"rss":19214336}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"output":{"type":"logstash"},"pipeline":{"clients":0,"events":{"active":0}}},"registrar":{"states":{"current":0}},"system":{"cpu":{"cores":2}}}}}
2018-03-05T06:30:21.991Z INFO [monitoring] log/log.go:133 Uptime: 16.4832551s
2018-03-05T06:30:21.991Z INFO [monitoring] log/log.go:110 Stopping metrics logging.
2018-03-05T06:30:21.994Z ERROR instance/beat.go:667 Exiting: Could not start registrar: Error loading state: Error decoding states: invalid character '\x00' looking for beginning of value
filebeat.prospectors:
- input_type: log
paths:
- C:\AppLogs\Service1\Service1.log
multiline.pattern: '(INFO|ERROR|DEBUG|TRACE|FATAL)'
multiline.negate: true
multiline.match: after
- input_type: log
paths:
- C:\AppLogs\Service2\Service2.log
- C:\AppLogs\Service3\Service3.log
- C:\AppLogs\Service4\Service4.log
- C:\AppLogs\Service5\Service5.log
- C:\AppLogs\Service6\Service6.log
- C:\AppLogs\Service7\Service7.log
- C:\AppLogs\Service8\Service8.log
#exclude_lines: ["^DBG"]
#include_lines: ["^ERR", "^WARN"]
#exclude_files: [".gz$"]
#fields:
# level: debug
# review: 1
ignore_older: 15m
multiline.pattern: ^\d{1,2}\/\d{1,2}\/\d{4}
multiline.negate: true
multiline.match: after
#name:
#tags: ["service-X", "web-tier"]
fields:
env: STAGE
#output.elasticsearch:
#protocol: "https"
#username: "elastic"
#password: "changeme"
output.logstash:
hosts: ["logstash.example.local:5044"]
#ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]
#ssl.certificate: "/etc/pki/client/cert.pem"
#ssl.key: "/etc/pki/client/cert.key"
#logging.level: debug
#logging.selectors: ["*"]
[{"source":"C:\\AppLogs\\Service4\\Service4.log","offset":323454,"timestamp":"2018-03-05T09:06:20.3922249Z","ttl":-1,"type":"log","FileStateOS":{"idxhi":2228224,"idxlo":36817,"vol":3841148702}}]
Let me know if there's anything else I can get from these systems to get to the bottom of this.
@polmolea We potentially have a fix for this which we applied in Winlogbeat but missed in Filebeat: https://github.com/elastic/beats/pull/2434 We will also apply it to Filebeat.
Here is the Filebeat issue for it: https://github.com/elastic/beats/issues/6792