User reports a panic ("sync: negative WaitGroup counter") after a week of running filebeat. filebeat stays running but no longer accepts connections.
Stacktrace:
github.com/elastic/beats/libbeat/logp.Recover
/home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88
runtime.call32
/usr/lib/go-1.10/src/runtime/asm_amd64.s:573
runtime.gopanic
/usr/lib/go-1.10/src/runtime/panic.go:502
sync.(*WaitGroup).Add
/usr/lib/go-1.10/src/sync/waitgroup.go:73
github.com/elastic/beats/filebeat/beater.(*eventCounter).Add
/home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61
github.com/elastic/beats/filebeat/channel.(*outlet).OnEvent
/home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43
github.com/elastic/beats/filebeat/harvester.(*Forwarder).Send
/home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33
github.com/elastic/beats/filebeat/input/tcp.NewInput.func1
/home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59
github.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle
/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71
github.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1
/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99
Log (two consecutive panics, same stack trace):
2018-05-29T10:55:39.577-0600 ERROR sync/waitgroup.go:73 recovering from a tcp client crash. Recovering, but please report this. {"panic": "sync: negative WaitGroup counter", "stack": "github.com/elastic/beats/libbeat/logp.Recover\n\t/home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88\nruntime.call32\n\t/usr/lib/go-1.10/src/runtime/asm_amd64.s:573\nruntime.gopanic\n\t/usr/lib/go-1.10/src/runtime/panic.go:502\nsync.(*WaitGroup).Add\n\t/usr/lib/go-1.10/src/sync/waitgroup.go:73\ngithub.com/elastic/beats/filebeat/beater.(*eventCounter).Add\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61\ngithub.com/elastic/beats/filebeat/channel.(*outlet).OnEvent\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43\ngithub.com/elastic/beats/filebeat/harvester.(*Forwarder).Send\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33\ngithub.com/elastic/beats/filebeat/input/tcp.NewInput.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99"}
2018-05-29T10:55:39.637-0600 ERROR sync/waitgroup.go:73 recovering from a tcp client crash. Recovering, but please report this. {"panic": "sync: negative WaitGroup counter", "stack": "github.com/elastic/beats/libbeat/logp.Recover\n\t/home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88\nruntime.call32\n\t/usr/lib/go-1.10/src/runtime/asm_amd64.s:573\nruntime.gopanic\n\t/usr/lib/go-1.10/src/runtime/panic.go:502\nsync.(*WaitGroup).Add\n\t/usr/lib/go-1.10/src/sync/waitgroup.go:73\ngithub.com/elastic/beats/filebeat/beater.(*eventCounter).Add\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61\ngithub.com/elastic/beats/filebeat/channel.(*outlet).OnEvent\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43\ngithub.com/elastic/beats/filebeat/harvester.(*Forwarder).Send\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33\ngithub.com/elastic/beats/filebeat/input/tcp.NewInput.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99"}
After having a look at the code, the only plausible explanation is that the WaitGroup's internal counter has overflowed (it uses an int32).
Can be reproduced with this example:
wg := sync.WaitGroup{}
wg.Add(math.MaxInt32)
wg.Add(1) // panic: sync: negative WaitGroup counter
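For completeness, a self-contained version of that reproduction (this relies only on the standard library; the WaitGroup counter is stored as an int32 internally, so pushing it past math.MaxInt32 makes the next Add observe a negative value):

```go
package main

import (
	"math"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	// The counter lives in an int32, so MaxInt32 is the largest legal value.
	wg.Add(math.MaxInt32)
	// One more increment wraps the counter negative and Add panics.
	wg.Add(1) // panic: sync: negative WaitGroup counter
}
```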
That means that wg.Done() is not being called for the Redis output.
Edit: it is not being called at all for any kind of output when the TCP input is used, because the logic that calls finishedLogger.Done is currently embedded in the Registrar.
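To make the imbalance concrete, here is a minimal sketch of what that means for the counter. The types and names below are simplified stand-ins, not the actual filebeat code:

```go
package main

import "sync"

// Stand-in for filebeat's beater.eventCounter, which wraps a sync.WaitGroup.
type eventCounter struct{ wg sync.WaitGroup }

func (c *eventCounter) Add(n int) { c.wg.Add(n) }
func (c *eventCounter) Done()     { c.wg.Done() }

func main() {
	c := &eventCounter{}

	// Publish path used by the TCP input: every forwarded event ends up
	// calling Add(1) via the outlet's OnEvent ...
	for i := 0; i < 1000; i++ {
		c.Add(1)
	}

	// ... but the matching Done() only happens on the Registrar path, which
	// events coming from the TCP/UDP/Redis inputs never take. The counter
	// therefore only grows, and after roughly 2^31 events the int32 inside
	// the WaitGroup goes negative and Add panics.
}
```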
Thanks for the report, will verify that.
I believe TCP/UDP and Redis are affected by this issue as well.
Just to leave some feedback on this issue.
This is the current flow of events for the Redis, TCP, and UDP inputs:
the Private field on the event is empty; that field is what drives the registry update. The problem is that the current implementation of the registry is global, even when we don't need it.
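As a rough illustration of that point, a hedged sketch (the event type and handling below are hypothetical simplifications, not libbeat's or filebeat's actual code): only events that carry something in their Private field give the registry anything to record, yet the registry component itself is global and sits in the path regardless.

```go
package main

import "fmt"

// Hypothetical, simplified event shape: file-based inputs put their state
// into Private so the registry can persist it; TCP/UDP/Redis events leave
// it empty.
type event struct {
	Fields  map[string]interface{}
	Private interface{}
}

// Sketch of the registry-update decision described above.
func updateRegistry(e event) {
	if e.Private == nil {
		fmt.Println("no file state attached: nothing for the registry to persist")
		return
	}
	fmt.Println("persist state:", e.Private)
}

func main() {
	updateRegistry(event{Fields: map[string]interface{}{"message": "line via tcp"}})
	updateRegistry(event{Private: "harvester file state"})
}
```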
I have started to take a look at the refactoring, but it will take more time; I think we should have a short-term fix.
I will check whether I can use the same strategy as the stdin input.