Hello
I am receiving some GELF packets that take the Graylog input down. The last message in graylog-server.log is:
2019-02-20T16:07:56.099+01:00 ERROR [NettyTransport] Error in Input [GELF UDP/58ebb5b9ba0e204a705a894c] (channel [id: 0xd79eba88, L:/0.0.0.0:12201]) (cause java.lang.IllegalStateException: GELF message is too short. Not even the type header would fit.)
The web interface reports that the input is OK, but when I check whether the port is open on the machine, nothing is listening:
[root@machine1 ~]# ss -u -a |grep 12201
[root@machine1 ~]#
With Graylog 2.5 this works fine.
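For anyone who wants to script the port check above instead of grepping ss output, here is a minimal sketch. It assumes a Linux host and parses text in the format of /proc/net/udp; the function name bound_udp_ports is made up for this example. Sockets bound over IPv6 appear in /proc/net/udp6 instead, and inside Docker the file only reflects the container's own network namespace.

```python
def bound_udp_ports(proc_net_udp_text: str) -> set:
    """Return the local UDP ports listed in /proc/net/udp style text."""
    ports = set()
    # The first line is the column header, so skip it.
    for line in proc_net_udp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[1] is the local address as HEXIP:HEXPORT.
        _, _, port_hex = fields[1].rpartition(":")
        ports.add(int(port_hex, 16))
    return ports


# Usage on a Linux host (not executed here):
#   with open("/proc/net/udp") as fh:
#       print(12201 in bound_udp_ports(fh.read()))
```

If 12201 is missing from the result while the web interface still reports the input as running, that matches the symptom described in this post.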
Can you reproduce this? Is this happening repeatedly? If yes, is there even a chance to capture a pcap during the time when it happens?
I found the "bug": I was sending the log with netcat, and that is what caused the error. I switched to nxlog and now the input no longer stops. Thanks.
This is happening for us as well. We are getting this error message in the log:
ERROR: org.graylog2.plugin.inputs.transports.NettyTransport - Error in Input [GELF UDP/59a7d14166282b0001d46ea6] (channel [id: 0xa5cdc923, L:/0.0.0.0:22201]) (cause java.lang.IllegalStateException: GELF message is too short. Not even the type header would fit.)
Afterwards the input on the node is reported as "running", but in fact it accepts no new messages. After a while all GELF UDP inputs on all nodes have received a bad packet and stopped, so effectively no messages are accepted through UDP anymore.
We do not know which system is producing these messages, so it would be difficult to capture a pcap at the right time.
It would be good if the input continued to work despite the bad packet.
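To illustrate the behaviour being asked for, here is a rough sketch of a handler that drops an undersized datagram and carries on. This is not Graylog's code (Graylog is Java on Netty) and not the eventual fix. The names detect_gelf_type and handle_datagrams are made up, and the magic bytes are my reading of the GELF payload format: 0x1e 0x0f for chunked, 0x1f 0x8b for gzip, a leading 0x78 for zlib. Treat those as assumptions.

```python
GELF_CHUNKED_MAGIC = b"\x1e\x0f"  # assumed chunked GELF magic bytes
GZIP_MAGIC = b"\x1f\x8b"          # standard gzip magic bytes
ZLIB_FIRST_BYTE = 0x78            # common first byte of a zlib stream
HEADER_LEN = 2                    # bytes needed to decide the type


def detect_gelf_type(payload: bytes) -> str:
    """Classify a datagram by its first two bytes."""
    if len(payload) < HEADER_LEN:
        # Mirrors the condition behind the error in the log above.
        raise ValueError("GELF message is too short")
    head = payload[:HEADER_LEN]
    if head == GELF_CHUNKED_MAGIC:
        return "chunked"
    if head == GZIP_MAGIC:
        return "gzip"
    if head[0] == ZLIB_FIRST_BYTE:
        return "zlib"
    return "uncompressed"


def handle_datagrams(datagrams):
    """Process every datagram; count bad ones instead of stopping."""
    accepted, dropped = [], 0
    for payload in datagrams:
        try:
            accepted.append((detect_gelf_type(payload), payload))
        except ValueError:
            dropped += 1  # drop the bad packet and keep going
    return accepted, dropped
```

The point of the sketch is only that the length error is handled per packet, so one malformed datagram does not affect the ones that follow.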
Hello,
we were able to reproduce this issue by sending a single byte to the UDP input. This directly triggers the "GELF message is too short." error message, and the input is no longer reachable afterwards.
Steps to reproduce
Verify that the UDP port is open:
$ nc -vzu 10.10.55.54 22201
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.10.55.54:22201.
Ncat: UDP packet sent successfully
Ncat: 1 bytes sent, 0 bytes received in 2.01 seconds.
Send a single byte over UDP:
$ echo -en '\x1e' | nc -vu 10.10.55.54 22201
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.10.55.54:22201.
Ncat: 1 bytes sent, 0 bytes received in 0.01 seconds.
Somehow the input does not stop immediately:
$ nc -vzu 10.10.55.54 22201
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.10.55.54:22201.
Ncat: UDP packet sent successfully
Ncat: 1 bytes sent, 0 bytes received in 2.01 seconds.
A couple of seconds later the port is closed:
$ nc -vzu 10.10.55.54 22201
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.10.55.54:22201.
Ncat: Connection refused.
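If Ncat is not available, the single-byte send from the steps above can be approximated with the standard library. A minimal sketch; the function name send_udp_payload is made up for this example, and the address in the usage comment is the one from the transcript.

```python
import socket


def send_udp_payload(host: str, port: int, payload: bytes = b"\x1e") -> int:
    """Send one UDP datagram and return the number of bytes sent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


# Usage (not executed here), only against a test instance you own:
#   send_udp_payload("10.10.55.54", 22201)
```

Since UDP is connectionless, a successful send says nothing about whether the input survived. The port still has to be checked separately afterwards, as in the last step above.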
@bjoernhaeuser @svenwltr Thank you for the updated information. We will reopen the ticket so we can try to reproduce and fix it.
@bernd thank you! I am not familiar enough with the codebase to open a PR myself, but let us know if there is anything else we can do.
It is important to keep in mind what we highlight in the docs: http://docs.graylog.org/en/3.0/pages/gelf.html#gelf-via-udp
Please note, that the UDP-Inputs of Graylog use the SO_REUSEPORT socket option, which was introduced in Linux kernel version 3.9. So be aware, that UDP inputs will NOT work on Linux kernel versions prior to 3.9.
@jalogisch Our cluster environment is based on CoreOS and therefore kernel 4.19.x, so this shouldn't be an issue.
This happens to us as well, with exactly the same symptoms.
However, in our case the packet that crashes the listening port is the netcat port test command itself:
nc -vzu graylog-node 22201
After that the listener (all of its threads) closes and never comes back, while the master web GUI still shows the listener as active on that node (and on every node where it has crashed).
And since UDP has no retransmission mechanism, the data is lost. A load balancer cannot really health-check a UDP port, and if it tries, with this bug a simple port check might itself kill the server.
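One possible workaround for the health check problem, until a fix is released, is to probe with a well-formed GELF message instead of a bare port test. A sketch, assuming the GELF 1.1 required fields (version, host, short_message); the function name build_gelf_probe and the default host and message values are made up. Note that the probe will show up as a regular log message in Graylog.

```python
import json
import time


def build_gelf_probe(host: str = "healthcheck",
                     message: str = "udp input probe") -> bytes:
    """Build an uncompressed GELF 1.1 payload to use as a UDP probe."""
    record = {
        "version": "1.1",
        "host": host,
        "short_message": message,
        "timestamp": time.time(),  # optional field, seconds since epoch
    }
    return json.dumps(record).encode("utf-8")


# Usage (not executed here): send the result as a single UDP datagram
# to the GELF input, for example with socket.sendto().
```

This still cannot tell a load balancer whether the input is alive, because nothing comes back over UDP. It only avoids the probe being the packet that takes the input down.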
System - Docker:
Graylog v3.0.2+1686930
Uname -a: Linux 4.14.128-112.105.amzn2.x86_64 #1 SMP x86_64 GNU/Linux
(Using Official Docker Image)
I am somewhat glad this has already been noticed by other Graylog users, and that I'm not the only one experiencing it.
Thanks @ismaelpuerto, @svenwltr, @bjoernhaeuser, @asaf400, @der-eismann for reporting this and helping us to reproduce it. I was able to reproduce it and identify the root cause. A fix is prepared and will hopefully become part of 3.1.0.
@ismaelpuerto @bjoernhaeuser @svenwltr @der-eismann @asaf400 We just merged the PR that is fixing this issue. It will be part of the upcoming Graylog 3.1.0 and we will also consider a backport to 3.0.x.
Thanks for the input everyone! :+1: