Beats: Filebeat TCP input panics after 2^31 events received

Created on 30 May 2018 · 4 Comments · Source: elastic/beats

A user reports a panic ("sync: negative WaitGroup counter") after about a week of running Filebeat. Filebeat stays running but no longer accepts connections.

Stacktrace:

github.com/elastic/beats/libbeat/logp.Recover
    /home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88
runtime.call32
    /usr/lib/go-1.10/src/runtime/asm_amd64.s:573
runtime.gopanic
    /usr/lib/go-1.10/src/runtime/panic.go:502
sync.(*WaitGroup).Add
    /usr/lib/go-1.10/src/sync/waitgroup.go:73
github.com/elastic/beats/filebeat/beater.(*eventCounter).Add
    /home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61
github.com/elastic/beats/filebeat/channel.(*outlet).OnEvent
    /home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43
github.com/elastic/beats/filebeat/harvester.(*Forwarder).Send
    /home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33
github.com/elastic/beats/filebeat/input/tcp.NewInput.func1
    /home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59
github.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle
    /home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71
github.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1
    /home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99

Log (two consecutive panics, same stack trace):

2018-05-29T10:55:39.577-0600 ERROR sync/waitgroup.go:73 recovering from a tcp client crash. Recovering, but please report this. {"panic": "sync: negative WaitGroup counter", "stack": "github.com/elastic/beats/libbeat/logp.Recover\n\t/home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88\nruntime.call32\n\t/usr/lib/go-1.10/src/runtime/asm_amd64.s:573\nruntime.gopanic\n\t/usr/lib/go-1.10/src/runtime/panic.go:502\nsync.(*WaitGroup).Add\n\t/usr/lib/go-1.10/src/sync/waitgroup.go:73\ngithub.com/elastic/beats/filebeat/beater.(*eventCounter).Add\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61\ngithub.com/elastic/beats/filebeat/channel.(*outlet).OnEvent\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43\ngithub.com/elastic/beats/filebeat/harvester.(*Forwarder).Send\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33\ngithub.com/elastic/beats/filebeat/input/tcp.NewInput.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99"}
2018-05-29T10:55:39.637-0600 ERROR sync/waitgroup.go:73 recovering from a tcp client crash. Recovering, but please report this. {"panic": "sync: negative WaitGroup counter", "stack": "github.com/elastic/beats/libbeat/logp.Recover\n\t/home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88\nruntime.call32\n\t/usr/lib/go-1.10/src/runtime/asm_amd64.s:573\nruntime.gopanic\n\t/usr/lib/go-1.10/src/runtime/panic.go:502\nsync.(*WaitGroup).Add\n\t/usr/lib/go-1.10/src/sync/waitgroup.go:73\ngithub.com/elastic/beats/filebeat/beater.(*eventCounter).Add\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61\ngithub.com/elastic/beats/filebeat/channel.(*outlet).OnEvent\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43\ngithub.com/elastic/beats/filebeat/harvester.(*Forwarder).Send\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33\ngithub.com/elastic/beats/filebeat/input/tcp.NewInput.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99"}

All 4 comments

After having a look at the code, the only plausible explanation is that the internal counter of the WaitGroup has overflowed: it is an int32, so roughly 2^31 Add(1) calls without matching Done() calls wrap it negative.

Can be reproduced with this example:

wg := sync.WaitGroup{} // needs the "sync" and "math" imports
wg.Add(math.MaxInt32)  // counter is now at the int32 maximum
wg.Add(1)              // panic: sync: negative WaitGroup counter

That means that wg.Done() is not being called for the Redis output.
Edit: in fact it is not being called for any output when the TCP input is used, because the logic that calls finishedLogger.Done is currently embedded in the Registrar.
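
For illustration, here is a minimal, self-contained sketch of the same overflow phrased in terms of an event counter: Add(1) is called once per published event and the matching Done() never happens. The type and method names are made up for the example and are not the actual filebeat implementation.

package main

import (
	"math"
	"sync"
)

// eventCounter mimics a counter that tracks in-flight events with a
// sync.WaitGroup (illustrative names only, not the filebeat types).
type eventCounter struct {
	wg sync.WaitGroup
}

func (c *eventCounter) Add(n int) { c.wg.Add(n) }
func (c *eventCounter) Done()     { c.wg.Done() }

func main() {
	c := &eventCounter{}
	// Every published event calls Add(1). If the matching Done() never
	// happens, the internal int32 counter only grows.
	for i := 0; i < math.MaxInt32; i++ {
		c.Add(1) // one Add per "event", no matching Done
	}
	// The 2^31-th event wraps the counter negative and panics with
	// "sync: negative WaitGroup counter". The three-line snippet above
	// triggers the same panic without the long loop.
	c.Add(1)
}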

Thanks for the report, will verify that.

I believe TCP/UDP and Redis are affected by this issue as well.

Just to leave some feedback on this issue.

This is the current flow of events for Redis, TCP, and UDP:

  1. Send events to the pipeline.
  2. ES receives the event.
  3. An ACK is generated.
  4. The global ACK handler receives the ACK.
  5. The Private field on the event is empty; that field is what the registry update relies on.
  6. Done() is never called on the WaitGroup (see the sketch below).
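
To make steps 4-6 concrete, here is a rough, self-contained sketch of that ACK path. All names (Event, fileState, onGlobalACK) are illustrative and do not correspond to the actual filebeat code.

package main

import "fmt"

// Event is a stripped-down stand-in for a published event. Private carries
// the registry state for file-based inputs and is nil for TCP/UDP/Redis.
type Event struct {
	Private interface{}
}

// fileState stands in for the registry state attached by file-based inputs.
type fileState struct{ path string }

// onGlobalACK mimics steps 4-6: only events that carry registry state are
// forwarded to the registrar, and Done() on the counter is only ever called
// downstream for those forwarded states. Stateless events fall through with
// neither a registry update nor a Done() call.
func onGlobalACK(acked []Event, registrarCh chan<- []fileState) {
	states := make([]fileState, 0, len(acked))
	for _, evt := range acked {
		st, ok := evt.Private.(fileState)
		if !ok {
			continue // TCP/UDP/Redis event: dropped here, its Add(1) is never balanced
		}
		states = append(states, st)
	}
	registrarCh <- states
}

func main() {
	ch := make(chan []fileState, 1)
	onGlobalACK([]Event{{Private: nil}, {Private: fileState{path: "/var/log/syslog"}}}, ch)
	fmt.Println("states forwarded:", len(<-ch)) // 1 - the stateless event was dropped
}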

The problem is that the current implementation of the registry is global, even when we don't need it.
I have started to look at the refactoring, but it will take more time; I think we should have a short-term fix.

I will check if I can use the same strategy as the stdin input.
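
For context, one possible shape of such a short-term fix, building on the sketch above (purely illustrative, not the actual patch): acknowledge stateless events directly in the ACK handler, so every Add(1) is eventually balanced by a Done().

// A possible short-term fix: call done() for every ACKed event that carries
// no registry state instead of silently dropping it (illustrative only).
func onGlobalACKFixed(acked []Event, registrarCh chan<- []fileState, done func()) {
	states := make([]fileState, 0, len(acked))
	for _, evt := range acked {
		if st, ok := evt.Private.(fileState); ok {
			states = append(states, st) // the registrar still handles these
		} else {
			done() // stateless event (TCP/UDP/Redis): mark it finished right away
		}
	}
	registrarCh <- states
}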
