Beats: Filebeat TCP input panics after 2^31 events received

Created on 30 May 2018 · 4 Comments · Source: elastic/beats

A user reports a panic ("sync: negative WaitGroup counter") after about a week of running Filebeat. Filebeat stays running but no longer accepts connections.

Stacktrace:

github.com/elastic/beats/libbeat/logp.Recover
    /home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88
runtime.call32
    /usr/lib/go-1.10/src/runtime/asm_amd64.s:573
runtime.gopanic
    /usr/lib/go-1.10/src/runtime/panic.go:502
sync.(*WaitGroup).Add
    /usr/lib/go-1.10/src/sync/waitgroup.go:73
github.com/elastic/beats/filebeat/beater.(*eventCounter).Add
    /home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61
github.com/elastic/beats/filebeat/channel.(*outlet).OnEvent
    /home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43
github.com/elastic/beats/filebeat/harvester.(*Forwarder).Send
    /home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33
github.com/elastic/beats/filebeat/input/tcp.NewInput.func1
    /home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59
github.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle
    /home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71
github.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1
    /home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99

Log (two consecutive panics, same stack trace):

2018-05-29T10:55:39.577-0600 ERROR sync/waitgroup.go:73 recovering from a tcp client crash. Recovering, but please report this. {"panic": "sync: negative WaitGroup counter", "stack": "github.com/elastic/beats/libbeat/logp.Recover\n\t/home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88\nruntime.call32\n\t/usr/lib/go-1.10/src/runtime/asm_amd64.s:573\nruntime.gopanic\n\t/usr/lib/go-1.10/src/runtime/panic.go:502\nsync.(*WaitGroup).Add\n\t/usr/lib/go-1.10/src/sync/waitgroup.go:73\ngithub.com/elastic/beats/filebeat/beater.(*eventCounter).Add\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61\ngithub.com/elastic/beats/filebeat/channel.(*outlet).OnEvent\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43\ngithub.com/elastic/beats/filebeat/harvester.(*Forwarder).Send\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33\ngithub.com/elastic/beats/filebeat/input/tcp.NewInput.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99"}
2018-05-29T10:55:39.637-0600 ERROR sync/waitgroup.go:73 recovering from a tcp client crash. Recovering, but please report this. {"panic": "sync: negative WaitGroup counter", "stack": "github.com/elastic/beats/libbeat/logp.Recover\n\t/home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88\nruntime.call32\n\t/usr/lib/go-1.10/src/runtime/asm_amd64.s:573\nruntime.gopanic\n\t/usr/lib/go-1.10/src/runtime/panic.go:502\nsync.(*WaitGroup).Add\n\t/usr/lib/go-1.10/src/sync/waitgroup.go:73\ngithub.com/elastic/beats/filebeat/beater.(*eventCounter).Add\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61\ngithub.com/elastic/beats/filebeat/channel.(*outlet).OnEvent\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43\ngithub.com/elastic/beats/filebeat/harvester.(*Forwarder).Send\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33\ngithub.com/elastic/beats/filebeat/input/tcp.NewInput.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99"}

All 4 comments

After having a look at the code, the only plausible explanation is that the internal counter of the WaitGroup has overflowed: it is an int32, so roughly 2^31 Add(1) calls without matching Done() calls wrap it negative.

Can be reproduced with this example:

wg := sync.WaitGroup{} // needs the "sync" and "math" imports
wg.Add(math.MaxInt32)  // counter is now at the int32 maximum
wg.Add(1)              // panic: sync: negative WaitGroup counter

That means that wg.Done() is not being called for the Redis output.
Edit: in fact it is not being called for any output when the TCP input is used, because the logic that calls finishedLogger.Done is currently embedded in the Registrar.
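
For illustration, here is a minimal, self-contained sketch of the same overflow phrased in terms of an event counter: Add(1) is called once per published event and the matching Done() never happens. The type and method names are made up for the example and are not the actual filebeat implementation.

package main

import (
	"math"
	"sync"
)

// eventCounter mimics a counter that tracks in-flight events with a
// sync.WaitGroup (illustrative names only, not the filebeat types).
type eventCounter struct {
	wg sync.WaitGroup
}

func (c *eventCounter) Add(n int) { c.wg.Add(n) }
func (c *eventCounter) Done()     { c.wg.Done() }

func main() {
	c := &eventCounter{}
	// Every published event calls Add(1). If the matching Done() never
	// happens, the internal int32 counter only grows.
	for i := 0; i < math.MaxInt32; i++ {
		c.Add(1) // one Add per "event", no matching Done
	}
	// The 2^31-th event wraps the counter negative and panics with
	// "sync: negative WaitGroup counter". The three-line snippet above
	// triggers the same panic without the long loop.
	c.Add(1)
}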

Thanks for the report, will verify that.

I believe TCP/UDP and Redis are affected by this issue as well.

Just to leave some feedback on this issue.

This is the current flow of events for Redis, TCP, and UDP:

  1. Send events to the pipeline.
  2. ES receives the event.
  3. An ACK is generated.
  4. The global ACK handler receives the ACK.
  5. The Private field on the event is empty; that field is what the registry update relies on.
  6. Done() is never called on the WaitGroup (see the sketch below).
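
To make steps 4-6 concrete, here is a rough, self-contained sketch of that ACK path. All names (Event, fileState, onGlobalACK) are illustrative and do not correspond to the actual filebeat code.

package main

import "fmt"

// Event is a stripped-down stand-in for a published event. Private carries
// the registry state for file-based inputs and is nil for TCP/UDP/Redis.
type Event struct {
	Private interface{}
}

// fileState stands in for the registry state attached by file-based inputs.
type fileState struct{ path string }

// onGlobalACK mimics steps 4-6: only events that carry registry state are
// forwarded to the registrar, and Done() on the counter is only ever called
// downstream for those forwarded states. Stateless events fall through with
// neither a registry update nor a Done() call.
func onGlobalACK(acked []Event, registrarCh chan<- []fileState) {
	states := make([]fileState, 0, len(acked))
	for _, evt := range acked {
		st, ok := evt.Private.(fileState)
		if !ok {
			continue // TCP/UDP/Redis event: dropped here, its Add(1) is never balanced
		}
		states = append(states, st)
	}
	registrarCh <- states
}

func main() {
	ch := make(chan []fileState, 1)
	onGlobalACK([]Event{{Private: nil}, {Private: fileState{path: "/var/log/syslog"}}}, ch)
	fmt.Println("states forwarded:", len(<-ch)) // 1 - the stateless event was dropped
}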

The problem is that the current implementation of the registry is global, even when we don't need it.
I have started to look at the refactoring, but it will take more time; I think we should have a short-term fix.

I will check if I can use the same strategy as the stdin input.
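
For context, one possible shape of such a short-term fix, building on the sketch above (purely illustrative, not the actual patch): acknowledge stateless events directly in the ACK handler, so every Add(1) is eventually balanced by a Done().

// A possible short-term fix: call done() for every ACKed event that carries
// no registry state instead of silently dropping it (illustrative only).
func onGlobalACKFixed(acked []Event, registrarCh chan<- []fileState, done func()) {
	states := make([]fileState, 0, len(acked))
	for _, evt := range acked {
		if st, ok := evt.Private.(fileState); ok {
			states = append(states, st) // the registrar still handles these
		} else {
			done() // stateless event (TCP/UDP/Redis): mark it finished right away
		}
	}
	registrarCh <- states
}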
