Graylog2-server: Output buffer not being filled with new messages

Created on 8 Nov 2017  Â·  41Comments  Â·  Source: Graylog2/graylog2-server

Expected Behavior

I expect messages to be moved from the processor buffer to output buffer constantly

Current Behavior

Processor buffer is being filled and output buffer stays empty, nothing is being written to database

Possible Solution

Possibly a thread lock or something like that

Steps to Reproduce (for bugs)



I reproduce this putting a lot of messages into graylog, it works for some time and then stops to send messages to ES cluster. I have no idea how can You reproduce this

Context

Pushing a large amount cca 10k/s high volume messages to graylog

Thread dump of affected node:
Thread dump of node 6e52bb4b / graylog2 Taken at Wed Nov 08 2017 11:32:36 GMT+0100

"outputbufferprocessor-0" id=32 state=WAITING
    - waiting on <0x1d302449> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x1d302449> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:45)
    at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
    at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:148)
    at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66)
    at java.lang.Thread.run(Thread.java:748)
"outputbuffer-processor-executor-0" id=219 state=WAITING
    - waiting on <0x2e3c0fa7> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x2e3c0fa7> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
"outputbufferprocessor-1" id=33 state=WAITING
    - waiting on <0x1d302449> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x1d302449> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:45)
    at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
    at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:148)
    at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66)
    at java.lang.Thread.run(Thread.java:748)

"outputbufferprocessor-2" id=34 state=WAITING
    - waiting on <0x1d302449> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x1d302449> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:45)
    at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
    at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:148)
    at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66)
    at java.lang.Thread.run(Thread.java:748)


"outputbuffer-processor-executor-0" id=220 state=WAITING
    - waiting on <0x59428d5b> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x59428d5b> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-1" id=221 state=WAITING
    - waiting on <0x59428d5b> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x59428d5b> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-2" id=222 state=WAITING
    - waiting on <0x59428d5b> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x59428d5b> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-0" id=223 state=WAITING
    - waiting on <0x11eb6f6a> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x11eb6f6a> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-1" id=224 state=WAITING
    - waiting on <0x2e3c0fa7> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x2e3c0fa7> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-2" id=225 state=WAITING
    - waiting on <0x2e3c0fa7> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x2e3c0fa7> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-1" id=226 state=WAITING
    - waiting on <0x11eb6f6a> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x11eb6f6a> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"outputbuffer-processor-executor-2" id=227 state=WAITING
    - waiting on <0x11eb6f6a> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x11eb6f6a> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Your Environment

  • Graylog Version: v2.3.2+3df951e
  • Elasticsearch Version: 5.6.3
  • MongoDB Version: v3.4.10
  • Operating System: RHEL 7.3
  • Browser version: irrelevant

Most helpful comment

This happens a multiple times a day for us. The only thing that helps is to delete the journal. We are losing a couple million messages a day to this. Not mentioning the countless hours of dev and ops time wasted on this.

This renders our graylog cluster all but unusable. If no solution is found we probably have to move our org away from graylog.

All 41 comments

This seems to be the same problem as in https://github.com/Graylog2/graylog2-server/issues/3553

Some thread data from JMX:
graylog_jmx

Seeing the same issue here, lowered from 10 outputprocessors to 5 and raised the batch size to 1500 but no luck.

Its Processing for a few seconds, then dying again.

Graylog 2.3.2+3df951e on syslog.xyz.local (Oracle Corporation 1.8.0_144 on Linux 3.10.0-693.2.2.el7.x86_64)

We need the complete Graylog and Elasticsearch logs to be able to start debugging this.

Experiencing exactly the same issue here. Above a certain threshold, seeing exact same results as madiTG. Threshold varies depending on various graylog config options related to output buffers, but in the end, graylog hangs and has to be restarted, no problems with ES, no errors in log.

As the https://community.graylog.org/t/graylog-stops-processing-writing-messages/2719 comments indicate, logs are empty.

image

@bernd its a little bit late for logs, but both logs were empty. Otherwise i would have posted it all along the issue.
As @davemcphee said, it simply stops processing messages and begins writing it to all buffers and journal, but they will never be processed.

@davemcphee @gruselglatz Okay, thanks for the update. After reading the topic in the community forum again, it looks like some message processing is blocking. Since the output buffers are basically empty but the process buffer is full, Graylog cannot process any more messages.

To have a chance to debug this further we need the following information:

  1. A list of installed plugins

    • curl localhost:9000/api/system/plugins

  2. All inputs + all extractors

    • curl localhost:9000/api/system/inputs (collect all input ids)

    • curl localhost:9000/api/system/inputs/<input-id>/extractors (for every input id)

  3. All streams + streamrules

    • curl localhost:9000/api/streams

  4. All drools rules (if you use any)
  5. Message processors config

    • curl localhost:9000/api/system/messageprocessors/config

  6. All lookup tables, caches and adapters

    • curl localhost:9000/api/system/lookup/tables

    • curl localhost:9000/api/system/lookup/caches

    • curl localhost:9000/api/system/lookup/adapters

  7. All pipelines, pipeline connections and pipeline rules

    • curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/pipeline

    • curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/rule

    • curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/connections

  8. All grok patterns

    • curl localhost:9000/api/system/grok

Feel free to send me this information as a private message in the community forum if you don't want to share all of this in public.

Hi all. This problem torments me for a month. I had to add new nodes to reduce the load. I'm using version 2.2.3. I hope this bug will fix soon. Tell me if you need more information about my instance.

Moin @bernd ,

The GL community doesn't trust me enough to PM you random links yet, so here you are: https://www.dropbox.com/s/pu5r9yfz8x0oyyt/graylog_debug.tar.gz?dl=0

Very little of interest there I'm afraid. It's a uat system so only one "real" input: the Syslog UDP listener, which is listening for nginx access logs. These logs are in json format, but wrapped in an rsyslog header, so I use a regex to cut out the json string, then key/value extract that json. The UAT system can generate 20k msg/s and bursts up to 40k, but we can (and ofc do) throttle this.

I have elasticsearch and graylog prometheus exporters running, and would be happy to test various config changes, tweak message throttling, and show results.

@davemcphee Thank you for the information! As you said, there isn't that much in there, unfortunately.

Can you send me a complete thread dump of the Graylog JVM process when this happens again? If processing is blocked, wait a few minutes and then take a thread dump. Also please do a curl localhost:9000/api/system/metrics after you took the dump and include it as well.

How is the CPU load when the processing is blocked? If you have some graphs for that as well, it would be much appreciated.

Could you show us an example raw message with a typical JSON payload?

Thank you!

@bernd please check the community forum (referenced here) for the CPU charts. There are some thread dumps out there too.

You may have worked through your situation already, but what I found when I encountered a similar scenario earlier this year is that under certain message conditions and ingest loads my regex extractors were taking an exorbitantly long time to run (relatively speaking). Took awhile to narrow that down but once I was on that track I was able to use the metrics for individual extractors to gauge my success in tuning. I went through each of my extractors and tuned the regex, dropping execution times from like 50,000μs to 50μs. I had to move away from using "any character" matches to specific matches, and also be much more intentional with my extractor condition matches. At worst last Spring my Graylog nodes were backing up by > 1 million events each, several times per week. Today my ingest/search rates are higher than back then but I almost never have backups. The only time I have backups now (small ones at that) is when a given system or device goes haywire and floods with 10x normal message rate for a few minutes.

I only use json extractors, tried with pipelines turned off - same effect. Already went through regex optimalisation. Execution times trend to 0 as nothing is being processed.

@madiTG I read the topic in the community forum and saw your graphs. The thread dumps were not complete and contained only the part for the outputbuffer threads. Also, I guess your setup is not the same as @davemcphee's so his CPU graphs might be different. :wink:

I had never used jstack before so after capturing one or two thread dumps I began playing with some config settings and seeing if I could observe a difference in the jstack output, and ummmm ... I think I've fixed my issue :grimacing: Isn't that just typical ....

I can't confirm anything and need to test further (and probably not before end of year) but I believe it had something to do with size of my ring_size and inputbuffer_ring_size - I reduced them massively.

I'm currently running with ring_size=8192 and inputbuffer_ring_size=32768, and 12 output processors, batch size 12k, and am able to process 30K msg/s, bursting to over 40K, on a single machine.

These numbers did require tuning heap of both ES and Graylog extensively (less is more), along with number of shards per ES index (more is more). I'll keep an eye on it and report back if I have anything to add.

@davemcphee

hey dave, got any update on that?

Still happening:

@berndleinfelder

Did the cancel thing, nothing happened

thread dump here:

jstack_dump.txt

org.graylog2.buffers.process.size
Gauge

Value:
8,192

org.graylog2.buffers.process.usage
Gauge

Value:
8,192

org.graylog2.shared.buffers.ProcessBuffer.decodeTime
Timer

95th percentile:
33μs
98th percentile:
47μs
99th percentile:
55μs
Standard deviation:
7μs
Mean:
21μs
Minimum:
15μs
Maximum:
97μs

org.graylog2.shared.buffers.ProcessBuffer.incomingMessages
Meter

Total:
189,234,468 events
Mean:
1,036.23 events/second
1 minute avg:
0 events/second
5 minute avg:
62.37 events/second
15 minute avg:
634.78 events/second

org.graylog2.shared.buffers.ProcessBuffer.parseTime
Timer

95th percentile:
11μs
98th percentile:
23μs
99th percentile:
34μs
Standard deviation:
4μs
Mean:
7μs
Minimum:
3μs
Maximum:
56μs

org.graylog2.shared.buffers.ProcessBuffer.thread-factory.created
Meter

Total:
4 events
Mean:
0 events/second
1 minute avg:
0 events/second
5 minute avg:
0 events/second
15 minute avg:
0 events/second

org.graylog2.shared.buffers.ProcessBuffer.thread-factory.running
Counter

Value:
4

org.graylog2.shared.buffers.ProcessBuffer.thread-factory.terminated
Meter

Total:
0 events
Mean:
0 events/second
1 minute avg:
0 events/second
5 minute avg:
0 events/second
15 minute avg:
0 events/second

org.graylog2.shared.buffers.processors.ProcessBufferProcessor.filteredOutMessages
Meter

Total:
0 events
Mean:
0 events/second
1 minute avg:
0 events/second
5 minute avg:
0 events/second
15 minute avg:
0 events/second

org.graylog2.shared.buffers.processors.ProcessBufferProcessor.incomingMessages
Meter

Total:
189,234,468 events
Mean:
1,036.25 events/second
1 minute avg:
0 events/second
5 minute avg:
62.84 events/second
15 minute avg:
636.41 events/second

org.graylog2.shared.buffers.processors.ProcessBufferProcessor.outgoingMessages
Meter

Total:
189,234,467 events
Mean:
1,036.25 events/second
1 minute avg:
0 events/second
5 minute avg:
62.84 events/second
15 minute avg:
636.41 events/second

org.graylog2.shared.buffers.processors.ProcessBufferProcessor.processTime
Timer

95th percentile:
730μs
98th percentile:
1,298μs
99th percentile:
1,993μs
Standard deviation:
821μs
Mean:
447μs
Minimum:
225μs
Maximum:
17,226μs

Changed my config values to those provided by dave

przechwytywanie
przechwytywanie2

Restarting node solved the issue

Anybody found a solution?

In my logs I only have tons of:

2018-05-23T10:25:47.339Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #70).
2018-05-23T10:25:47.860Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #30).
2018-05-23T10:26:06.049Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #70).
2018-05-23T10:26:10.231Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #71).
2018-05-23T10:26:18.574Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #71).
2018-05-23T10:26:19.230Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #31).
2018-05-23T10:26:37.210Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #71).
2018-05-23T10:26:41.180Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #72).
2018-05-23T10:26:49.858Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #72).
2018-05-23T10:26:50.438Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #32).
2018-05-23T10:27:08.339Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #72).
2018-05-23T10:27:12.147Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #73).
2018-05-23T10:27:21.060Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #73).
2018-05-23T10:27:21.591Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #33).
2018-05-23T10:27:39.468Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #73).
2018-05-23T10:27:43.102Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #74).
2018-05-23T10:27:52.301Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #74).
2018-05-23T10:27:52.737Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #34).
2018-05-23T10:28:10.706Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #74).
2018-05-23T10:28:14.151Z INFO  [Messages] Bulk indexing finally successful (attempt #75).

@drAlberT Your Elasticsearch cluster is down or unreachable.

Please post this issue to our discussion forum or join the #graylog channel on freenode IRC.

Thank you!

Anyway, I'm not so sure:

root@graylog-a:~# curl -L https://vpc-graylog-22XXXXXXXXXXXjsuie4.eu-west-1.es.amazonaws.com
{
  "name" : "jS4Ja1Q",
  "cluster_name" : "903437758265:graylog",
  "cluster_uuid" : "nWkEX1wmRcaCuh5MDePpTA",
  "version" : {
    "number" : "5.5.2",
    "build_hash" : "Unknown",
    "build_date" : "2017-10-18T04:35:01.381Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.0"
  },
  "tagline" : "You Know, for Search"
}

So the ES cluster is alive, the graylog node can reach it, I double-checked GL conf and the clustername is right ..
after a bit of changes I got the following trace .. it should be saying I need to increase the Java heap of GL process .. right?

2018-05-24T16:58:33.364Z ERROR [Messages] Caught exception during bulk indexing: java.lang.OutOfMemoryError: Java heap space, retrying (attempt #1).
2018-05-24T16:58:33.386Z ERROR [Messages] Couldn't bulk index 500 messages.
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at com.github.rholder.retry.Retryer$ExceptionAttempt.<init>(Retryer.java:254) ~[graylog.jar:?]
        at com.github.rholder.retry.Retryer.call(Retryer.java:163) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.runBulkRequest(Messages.java:169) [graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.bulkIndex(Messages.java:152) [graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.bulkIndex(Messages.java:128) [graylog.jar:?]
        at org.graylog2.outputs.ElasticSearchOutput.writeMessageEntries(ElasticSearchOutput.java:111) [graylog.jar:?]
        at org.graylog2.outputs.BlockingBatchedESOutput.flush(BlockingBatchedESOutput.java:129) [graylog.jar:?]
        at org.graylog2.outputs.BlockingBatchedESOutput.writeMessageEntry(BlockingBatchedESOutput.java:110) [graylog.jar:?]
        at org.graylog2.outputs.BlockingBatchedESOutput.write(BlockingBatchedESOutput.java:92) [graylog.jar:?]
        at org.graylog2.buffers.processors.OutputBufferProcessor$1.run(OutputBufferProcessor.java:191) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) [graylog.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_171]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332) ~[?:1.8.0_171]
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~[?:1.8.0_171]
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[?:1.8.0_171]
        at java.lang.StringBuilder.append(StringBuilder.java:136) ~[?:1.8.0_171]
        at java.lang.StringBuilder.append(StringBuilder.java:131) ~[?:1.8.0_171]
        at io.searchbox.core.Bulk.getData(Bulk.java:108) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.prepareRequest(JestHttpClient.java:134) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:75) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:64) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.lambda$runBulkRequest$2(Messages.java:169) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages$$Lambda$231/1508904764.call(Unknown Source) ~[?:?]
        at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) ~[graylog.jar:?]
        at com.github.rholder.retry.Retryer.call(Retryer.java:160) ~[graylog.jar:?]
        ... 14 more
2018-05-24T16:58:33.394Z ERROR [BlockingBatchedESOutput] Unable to flush message buffer
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at org.graylog2.indexer.messages.Messages.runBulkRequest(Messages.java:176) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.bulkIndex(Messages.java:152) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.bulkIndex(Messages.java:128) ~[graylog.jar:?]
        at org.graylog2.outputs.ElasticSearchOutput.writeMessageEntries(ElasticSearchOutput.java:111) ~[graylog.jar:?]
        at org.graylog2.outputs.BlockingBatchedESOutput.flush(BlockingBatchedESOutput.java:129) [graylog.jar:?]
        at org.graylog2.outputs.BlockingBatchedESOutput.writeMessageEntry(BlockingBatchedESOutput.java:110) [graylog.jar:?]
        at org.graylog2.outputs.BlockingBatchedESOutput.write(BlockingBatchedESOutput.java:92) [graylog.jar:?]
        at org.graylog2.buffers.processors.OutputBufferProcessor$1.run(OutputBufferProcessor.java:191) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) [graylog.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_171]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at com.github.rholder.retry.Retryer$ExceptionAttempt.<init>(Retryer.java:254) ~[graylog.jar:?]
        at com.github.rholder.retry.Retryer.call(Retryer.java:163) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.runBulkRequest(Messages.java:169) ~[graylog.jar:?]
        ... 13 more
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332) ~[?:1.8.0_171]
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~[?:1.8.0_171]
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[?:1.8.0_171]
        at java.lang.StringBuilder.append(StringBuilder.java:136) ~[?:1.8.0_171]
        at java.lang.StringBuilder.append(StringBuilder.java:131) ~[?:1.8.0_171]
        at io.searchbox.core.Bulk.getData(Bulk.java:108) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.prepareRequest(JestHttpClient.java:134) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:75) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:64) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.lambda$runBulkRequest$2(Messages.java:169) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages$$Lambda$231/1508904764.call(Unknown Source) ~[?:?]
        at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) ~[graylog.jar:?]
        at com.github.rholder.retry.Retryer.call(Retryer.java:160) ~[graylog.jar:?]
        at org.graylog2.indexer.messages.Messages.runBulkRequest(Messages.java:169) ~[graylog.jar:?]
        ... 13 more
2018-05-24T16:58:34.350Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #1).

Having increased it to 3G removed the Heap error .. but not the original issue:

2018-05-24T17:05:13.607Z INFO  [InputSetupService] Triggering launching persisted inputs, node transitioned from Uninitialized [LB:DEAD] to Running [LB:ALIVE]
2018-05-24T17:05:13.632Z INFO  [InputStateListener] Input [Syslog UDP/5a76da165c6f490910ffbc15] is now STARTING
2018-05-24T17:05:13.639Z INFO  [InputStateListener] Input [Syslog TCP/5a76e2b7c228180627417070] is now STARTING
2018-05-24T17:05:13.642Z INFO  [InputStateListener] Input [Beats/5a859f85d0cf6e4b7317d648] is now STARTING
2018-05-24T17:05:13.678Z INFO  [InputStateListener] Input [Beats/5a859f85d0cf6e4b7317d648] is now RUNNING
2018-05-24T17:05:13.681Z INFO  [InputStateListener] Input [Syslog TCP/5a76e2b7c228180627417070] is now RUNNING
2018-05-24T17:05:13.709Z INFO  [InputStateListener] Input [Syslog UDP/5a76da165c6f490910ffbc15] is now RUNNING
2018-05-24T17:05:23.928Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #1).
2018-05-24T17:05:26.594Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #2).
2018-05-24T17:05:28.522Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #3).
2018-05-24T17:05:30.235Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #4).
2018-05-24T17:05:31.623Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #5).
2018-05-24T17:05:33.120Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #6).
2018-05-24T17:05:34.664Z ERROR [Messages] Caught exception during bulk indexing: java.net.SocketException: Broken pipe (Write failed), retrying (attempt #7).

Any changes in this, we are getting about 6 messages a second but nothing is being processed there are no errors in the logs and everything is healthy same context as this. This seems to be a bug as we are not running much configuration differences from the default.

@Geramy you should move over to the community. Please post this issue to our discussion forum or join the #graylog channel on freenode IRC.

  1. you did not add any valuable information that might help to debug
  2. you did not include what version of Graylog you are having the trouble
  3. you did not describe properly how this happens.

The same issue. ES is green. Mongo without any error. All messages stuck in journal.

Graylog 2.4
Mongo 3.6
ES 5.6.12

I've solved the problem by removing graylog journal data. Messages started being indexed immediately.

I am closing this issue now. We didn't manage to find the root cause unfortunately, but it also doesn't make sense to keep it open. Please visit the discussion forum for help if you run into this again or create a new issue with updated information. (logs, metrics, etc.) Also please make sure to run the latest Graylog stable release.

Thank you!

We just faced with the same issue today.

Graylog node have stopped processing messages.
Meanwhile there were no issues with the ES, no issued with the Disk, space, Memory amount or the CPU load on the affected node.
It just got stuck. Processing buffer was full, but output buffer was empty.

So, I've disabled the affected node at the LB, in order to prevent messages queue groving and started googling.

And I've arrived here from the https://community.graylog.org/t/graylog-stops-processing-writing-messages/2719/10

And at the end found that bug closed without any resolution but with the advice to go back to community.

So looks like nobody's intrested in the solution, that's sad.

One good thing in our case, after 40-60 minutes of being in the stuck state, node started processing messages again.

This happens a multiple times a day for us. The only thing that helps is to delete the journal. We are losing a couple million messages a day to this. Not mentioning the countless hours of dev and ops time wasted on this.

This renders our graylog cluster all but unusable. If no solution is found we probably have to move our org away from graylog.

@EpiqSty @mzhaase I will re-open this. We need the following information to have a chance to debug this:

  1. Graylog version
  2. Elasticsearch version
  3. Full Graylog server logs (from the time when the issue appears in your setup)
  4. Full Elasticsearch server logs (from the time when the issue appears in your setup)
  5. Cluster overview, JVM info and threaddumps (from the time when the issue appears in your setup)

    • curl localhost:9000/api/cluster

    • curl localhost:9000/api/cluster/<node-id>/jvm (for every node in the cluster)

    • curl localhost:9000/api/cluster/<node-id>/threaddump (for every node in the cluster)

  6. Graylog server metrics output

    • curl localhost:9000/api/system/metrics (for every node in the cluster)



      1. When the Graylog servers work as expected


      2. When the Graylog servers run into the issue



  7. A list of installed plugins

    • curl localhost:9000/api/system/plugins

  8. All inputs + all extractors

    • curl localhost:9000/api/system/inputs (collect all input ids)

    • curl localhost:9000/api/system/inputs/<input-id>/extractors (for every input id)

  9. All streams + streamrules

    • curl localhost:9000/api/streams

  10. All drools rules (if you use any - only pre-3.0 setups)
  11. Message processors config

    • curl localhost:9000/api/system/messageprocessors/config

  12. All lookup tables, caches and adapters

    • curl localhost:9000/api/system/lookup/tables

    • curl localhost:9000/api/system/lookup/caches

    • curl localhost:9000/api/system/lookup/adapters

  13. All pipelines, pipeline connections and pipeline rules

    • curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/pipeline

    • curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/rule

    • curl localhost:9000/api/plugins/org.graylog.plugins.pipelineprocessor/system/pipelines/connections

  14. All grok patterns

    • curl localhost:9000/api/system/grok

Thank you!

Graylog 2.4.7

ES 5.6.14

I collected all other information however there is a lot of business logic and personally identifiable information in the multiple 100ks of lines. Can I send it somewhere per email instead of posting it here?

@mzhaase Feel free to send it to my-github-name at graylog.com.

Send

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.

Hi @bernd. Is there any update on investigation?
I have the same issue with Graylog 2.4.6+ceaa7e4 on graylog (Oracle Corporation 1.8.0_172 on Linux 4.4.0-1044-aws)

@sturman Nope, unfortunately not. I wasn't able to find anything suspicious in the provided information. :confused: This issue also fell through the cracks on my side because it was closed by our "no-response" bot for some reason. Sorry about that, @mzhaase :disappointed:

My only guess is that some broken message is stuck in the front of the journal and that's why message processing doesn't continue. (two people reported that deleting the journal helps) But without anything related to that in the logs I am unable to follow up on this to find the cause, unfortunately.

I think that @bernd is correct with something getting stuck in the journal and/or some extractor rule being overly intensive with a specific log message. I have observed this same problem on our graylog cluster since adding the first GELF input. I have been able to get things rolling again by the following commands:

rm -rf /var/lib/graylog-server/journal/.lock
systemctl restart graylog-server
systemctl status graylog-server

Then it begins processing the journal logs and works for some time, and eventually hangs up again randomly. I am still researching the cause of this.

@on3iropolos @sturman @mzhaase @EpiqSty @mahnunchik @madiTG

We implemented a fix to handle problems with bulk requests that are too big. (This is the "broken pipe" error) Previously, if Graylog tried a bulk index request against Elasticsearch which was too big (http.max_content_length setting - default 100MB), we would retry the request over and over again which results in processing being stopped. This could happen if the batch size is high and the batch contained lots of big, oder fewer really big messages. Also, the http.max_content_length setting is lower in some cloud environments like AWS.

This could be an explanation for the issues you folks saw.

The fix has been implemented in https://github.com/Graylog2/graylog2-server/pull/7071 and will be part of the upcoming 3.2.0 feature and 3.1.4 bug fix releases.

@bernd - That's great news! I will implement 3.1.4 as soon as I am able and report back if the issue still occurs. Thanks!

@on3iropolos Awesome, thank you! :smiley:

@bernd - I have installed the 3.1.4 update. I will update if I experience an issue like the one before.

@bernd GJ

Was this page helpful?
0 / 5 - 0 ratings

Related issues

bpbp-boop picture bpbp-boop  Â·  4Comments

jalogisch picture jalogisch  Â·  4Comments

jozefbarcin picture jozefbarcin  Â·  3Comments

mhaasEFD picture mhaasEFD  Â·  4Comments

jalogisch picture jalogisch  Â·  3Comments