Graylog2-server: Disk journal stopped working after disk filled

Created on 4 Aug 2016 · 11 Comments · Source: Graylog2/graylog2-server

Expected Behavior

The disk journal should resume processing queued messages once disk space is freed and the server is restarted.

Current Behavior

Processing was paused and messages kept queuing in the disk journal.

Possible Solution

unknown

Steps to Reproduce (for bugs)

1. Let the disk fill to 100%
2. Clear some space manually via an SSH terminal
3. Restart the server
4. Check Processing Status and Disk Journal under System -> Node -> Details, and check the server logs

Context

After the disk filled to 100% (a misconfiguration on my part) and I fixed the issue, I restarted the server, but the disk journal did not resume processing messages.
I should add that the web interface and API stopped responding while the disk was full because MongoDB could not start.

The server logs show this error:

2016-08-04_14:20:54.49828 2016-08-04 10:20:54,494 ERROR: com.google.common.util.concurrent.ServiceManager - Service JournalReader [FAILED] has failed in the RUNNING state.
2016-08-04_14:20:54.49830 java.lang.IllegalStateException: Invalid message size: -897035486
2016-08-04_14:20:54.49830       at kafka.log.FileMessageSet.searchFor(FileMessageSet.scala:127) ~[graylog.jar:?]
2016-08-04_14:20:54.49830       at kafka.log.LogSegment.translateOffset(LogSegment.scala:105) ~[graylog.jar:?]
2016-08-04_14:20:54.49830       at kafka.log.LogSegment.read(LogSegment.scala:147) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at kafka.log.Log.read(Log.scala:443) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at org.graylog2.shared.journal.KafkaJournal.read(KafkaJournal.java:462) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at org.graylog2.shared.journal.KafkaJournal.read(KafkaJournal.java:435) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at org.graylog2.shared.journal.JournalReader.run(JournalReader.java:136) ~[graylog.jar:?]
2016-08-04_14:20:54.49831       at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:60) [graylog.jar:?]
2016-08-04_14:20:54.49831       at com.google.common.util.concurrent.Callables$3.run(Callables.java:100) [graylog.jar:?]
2016-08-04_14:20:54.49832       at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]

Your Environment

Graylog Version: 2.0.3 (OVA image)
Elasticsearch Version:
MongoDB Version:
Operating System:
Browser version:

improvement

JulioQc

👀1 👍1

All 11 comments

The error suggests the journal was corrupted when the disk ran out of space.
It might be possible to delete the latest journal segment file while Graylog is stopped, but I'm afraid the messages in that segment cannot be recovered.

From the code side, I don't think we can sensibly recover from this. It is _really_ important not to run out of disk space for journalling, just as it is with databases.

kroepke on 5 Aug 2016
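
As a small illustration of guarding against a full disk, the sketch below checks how much free space is left on the volume that holds the journal. The function name `has_free_space` and the 10% threshold are illustrative assumptions, not Graylog settings.

```python
import shutil


def has_free_space(path, min_free_fraction=0.10):
    """Return True if the volume containing `path` has at least
    `min_free_fraction` of its capacity free (threshold is an assumption)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction
```

A cron job could, for example, call `has_free_space("/var/opt/graylog/data/journal")` and raise an alert when it returns `False`.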

Yes, I agree, and I did notice the warning from Graylog when the disk got close to maximum capacity.
However, a mechanism to recover from such events would be very helpful (although clearing "/var/opt/graylog/data/journal/*" and restarting Graylog isn't that hard either).

JulioQc on 5 Aug 2016

I have just run into the same issue.

Sorry if the following question is silly, but:

I noticed that Graylog wrote a bunch of messages into the journal after I cleaned up some space, so I don't know which segment was faulty.

I have moved the journal files out of the way instead of deleting them. Now my question is:

Can I stop Graylog and put the files back in place one by one to see which one is the culprit?

AVGP on 30 Aug 2016
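
One way to look for a faulty segment without putting files back into a live journal is to walk each .log file offline and check the size field of every entry. The sketch below assumes a Kafka 0.8-style on-disk layout (an 8-byte offset and a 4-byte big-endian size in front of each message, with a minimum message size of 14 bytes). Those constants and the function names `first_bad_entry` and `scan_segment` are assumptions for illustration, not part of Graylog. The code only reads files and is best run against a copy of the journal.

```python
import struct
from pathlib import Path

LOG_OVERHEAD = 12      # assumed: 8-byte offset + 4-byte size before each message
MIN_MESSAGE_SIZE = 14  # assumed: smallest valid message (header only)


def first_bad_entry(data: bytes):
    """Return (byte_position, size) of the first implausible entry, or None."""
    position = 0
    while position + LOG_OVERHEAD <= len(data):
        # ">qi" = big-endian signed 64-bit offset, then signed 32-bit size
        _offset, size = struct.unpack_from(">qi", data, position)
        too_small = size < MIN_MESSAGE_SIZE
        overruns_file = position + LOG_OVERHEAD + size > len(data)
        if too_small or overruns_file:
            return position, size
        position += LOG_OVERHEAD + size
    return None


def scan_segment(path):
    """Read one segment file fully into memory and scan it."""
    return first_bad_entry(Path(path).read_bytes())
```

A segment for which `scan_segment` returns a position and a negative or oversized value is a candidate for the "Invalid message size" error in the stack trace above; a result of `None` only means this simple check found nothing.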

From my understanding of the journal, it won't allow this, since the order in which messages arrive is important (see slide 5 here: http://www.slideshare.net/Graylog/graylog-engineering-design-your-architecture).

You basically have to flush it all out and restart it.

JulioQc on 31 Aug 2016

Try stopping the Graylog services, deleting just the .index files (keep the .log files), and restarting Graylog.
That worked for me.

giz83 on 14 Jul 2017

Same situation. In my case, I found success by:

1. stop graylog-server
2. backup the journal/ directory
3. delete all the .index files
4. delete the single oldest .log file, since this would have been present when the corruption occurred
5. start server and have fun watching your cluster churn through a bajillion queued messages

jimbocoder on 9 Aug 2017

👍12
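
The steps above can be sketched as a small script. This is a minimal sketch under stated assumptions: the function name `clean_journal` and both paths are hypothetical, `segment_dir` is assumed to be the directory that directly contains the .index and .log files, and the "oldest" .log file is assumed to be the one whose name sorts first (Kafka names segments after their zero-padded base offset). Stop graylog-server before running anything like this and start it again afterwards; the script does neither.

```python
import shutil
from pathlib import Path


def clean_journal(segment_dir, backup_dir):
    """Back up segment_dir, then delete all .index files and the oldest .log file.

    Returns the names of the deleted files.
    """
    segment_dir = Path(segment_dir)
    # Backup first; copytree raises if backup_dir already exists,
    # so an earlier backup is never overwritten.
    shutil.copytree(segment_dir, backup_dir)

    removed = []
    for index_file in sorted(segment_dir.glob("*.index")):
        index_file.unlink()
        removed.append(index_file.name)

    # Assumption: the lowest-named segment is the oldest one.
    log_files = sorted(segment_dir.glob("*.log"))
    if log_files:
        log_files[0].unlink()
        removed.append(log_files[0].name)
    return removed
```

Files other than .index and .log (such as graylog2-committed-read-offset) are left untouched, which matches the steps as written.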

@jimbocoder's solution did it for me. Thank you.

raphaelsalomao3 on 11 Aug 2017

@jimbocoder should we also delete the files graylog2-committed-read-offset and recovery-point-offset-checkpoint, and should we delete all the .log files?

kieulam141 on 2 Oct 2017

@kieulam141 I can't say for sure. Whatever you do, make sure you do the backup step, and it should be okay in the end.

jimbocoder on 4 Oct 2017

@kieulam141 did you have to delete them to get it working?

BrijToSuccess on 20 Dec 2017

@BrijToSuccess I don't recall at this point but I'm pretty sure the least destructive strategy is in step 4:

4. **delete the single oldest .log file**, since this would have been present when the corruption occurred

(instead of deleting all the .log files.)

jimbocoder on 27 Dec 2017
