Graylog2-server: Disk journal stopped working after disk filled

Created on 4 Aug 2016 · 11 Comments · Source: Graylog2/graylog2-server

Expected Behavior

Disk journal should resume processing queued messages

Current Behavior

Processing was paused and messages kept queuing in disk journal.

Possible Solution

unknown

Steps to Reproduce (for bugs)

  1. Let the disk fill to 100%
  2. Clear some space manually via SSH terminal
  3. Restart server
  4. Check Processing Status and Disk Journal under System -> Node -> Details and check the server logs.

Context

After the disk filled to 100% (a misconfiguration on my part) and I freed up space, the server was restarted, but processing of the disk journal messages did not resume.
I should add that the web and API interfaces stopped responding while the disk was full because MongoDB could not launch.

By looking at the server logs, I get this error:

2016-08-04_14:20:54.49828 2016-08-04 10:20:54,494 ERROR: com.google.common.util.concurrent.ServiceManager - Service JournalReader [FAILED] has failed in the RUNNING state.
2016-08-04_14:20:54.49830 java.lang.IllegalStateException: Invalid message size: -897035486
2016-08-04_14:20:54.49830       at kafka.log.FileMessageSet.searchFor(FileMessageSet.scala:127) ~[graylog.jar:?]
2016-08-04_14:20:54.49830       at kafka.log.LogSegment.translateOffset(LogSegment.scala:105) ~[graylog.jar:?]
2016-08-04_14:20:54.49830       at kafka.log.LogSegment.read(LogSegment.scala:147) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at kafka.log.Log.read(Log.scala:443) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at org.graylog2.shared.journal.KafkaJournal.read(KafkaJournal.java:462) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at org.graylog2.shared.journal.KafkaJournal.read(KafkaJournal.java:435) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at org.graylog2.shared.journal.JournalReader.run(JournalReader.java:136) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:60) [graylog.jar:?]
2016-08-04_14:20:54.49831       at com.google.common.util.concurrent.Callables$3.run(Callables.java:100) [graylog.jar:?]
2016-08-04_14:20:54.49832       at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]
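
One plausible reading of the negative number in "Invalid message size", not confirmed anywhere in this thread: the reader expects a 4-byte signed size field at that position in the segment, and after the disk-full event it is looking at bytes that are not a real size field. A minimal Python sketch of how four arbitrary bytes turn into a negative size when decoded as a big-endian signed 32-bit integer (the byte order and field width are assumptions here):

```python
import struct


def as_signed_size(raw: bytes) -> int:
    """Decode 4 bytes as a big-endian signed 32-bit integer.

    Assumption: the journal's size field is a 4-byte big-endian signed int.
    """
    (size,) = struct.unpack(">i", raw)
    return size


# The same four bytes, viewed as unsigned and then as signed.
raw = struct.pack(">I", 3397931810)
print(as_signed_size(raw))
```

Any byte pattern with the top bit set decodes to a negative value this way, which would explain why the reader gives up instead of skipping ahead.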

Your Environment

  • Graylog Version: 2.0.3 (OVA image)
  • Elasticsearch Version:
  • MongoDB Version:
  • Operating System:
  • Browser version:
improvement

All 11 comments

The error looks like the journal has been corrupted when it ran out of disk space.
It might be possible to delete the latest journal segment file while Graylog is stopped, but I'm afraid the messages in that segment cannot be recovered.

From the code side, I don't think we can sensibly recover from this. It is _really_ important not to run out of disk for journalling, like it is with databases.
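
Since the advice is to never let the journal volume fill up, an external check can complement Graylog's own warning. A minimal sketch in Python, assuming you run it from cron or a monitoring agent; the function name `journal_disk_low` and the 10% threshold are hypothetical choices, not anything Graylog ships:

```python
import shutil


def journal_disk_low(path, min_free_fraction=0.10):
    """Return True if the filesystem holding `path` has less free space
    than `min_free_fraction` of its total size."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total < min_free_fraction


if __name__ == "__main__":
    # "." is a placeholder; point this at the journal directory of your install.
    if journal_disk_low("."):
        print("WARNING: journal volume is running low on disk space")
```

The check looks at the whole filesystem that contains the path, so it also catches other data on the same volume eating the space the journal needs.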

Yes, I can agree with you and I've also noticed the warning by Graylog when the disk reached near max capacity.
However, a mechanism to recover from such events would be very helpful to facilitate the handling of those events (although clearing "/var/opt/graylog/data/journal/*" and restarting graylog isn't that hard either).

I have just run into the same issue.

Sorry if the following question is silly, but:

I had noticed Graylog wrote a bunch of messages into the journal after I cleaned up some space, so I don't know which segment was faulty.

I have moved the journal files out of the way instead of deleting them. Now my question is:

Can I stop Graylog and put the files back in place one by one to see which one is the culprit?

From my understanding of the journal, it won't allow this, since the order in which messages arrive is important (see slide 5 here: http://www.slideshare.net/Graylog/graylog-engineering-design-your-architecture).

You basically have to flush it all out and restart it.

Try stopping the Graylog services, deleting just the .index files (keep the .log files), and restarting Graylog.
Worked for me.

Same situation. In my case, I found success by:

  1. stop graylog-server
  2. backup the journal/ directory
  3. delete all the .index files
  4. delete the single oldest .log file, since this would have been present when the corruption occurred
  5. start server and have fun watching your cluster churn through a bajillion queued messages
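
Steps 2 to 4 above can be sketched as a small Python script, to be run only while graylog-server is stopped. This is an illustration under assumptions, not a tested recovery tool: the function name `repair_journal` is hypothetical, the journal path depends on your install, and "oldest" is taken to mean the .log segment with the smallest file name, assuming segments are named after zero-padded offsets.

```python
import shutil
from pathlib import Path


def repair_journal(journal_dir, backup_dir, dry_run=True):
    """Back up the journal, then drop all .index files and the oldest .log segment.

    Returns the paths that were (or, in a dry run, would be) deleted.
    `backup_dir` must not exist yet and must live outside `journal_dir`.
    """
    journal = Path(journal_dir)

    # Step 2: copy the whole journal directory before touching anything.
    if not dry_run:
        shutil.copytree(journal, Path(backup_dir))

    # Step 3: every .index file anywhere under the journal directory.
    to_delete = sorted(journal.rglob("*.index"))

    # Step 4: the single oldest .log segment. Assumption: segment names are
    # zero-padded offsets, so the smallest name is the oldest segment.
    segments = sorted(journal.rglob("*.log"), key=lambda p: p.name)
    if segments:
        to_delete.append(segments[0])

    if not dry_run:
        for path in to_delete:
            path.unlink()
    return to_delete
```

With the default `dry_run=True` nothing is copied or deleted and the function only returns what it would remove, which makes it easy to review the list before running it for real.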

@jimbocoder 's solution did it for me. Thank you.

@jimbocoder should we also delete the files graylog2-committed-read-offset and recovery-point-offset-checkpoint, and should we delete all the .log files?

@kieulam141 I can't say for sure. Whatever you do, make sure you do the backup step and it should be okay in the end.

@kieulam141 did you have to delete them to get it working?

@BrijToSuccess I don't recall at this point but I'm pretty sure the least destructive strategy is in step 4:

4. **delete the single oldest .log file**, since this would have been present when the corruption occurred

(instead of deleting all the .log files.)
