Logstash: [8.0.0-SNAPSHOT] Beats: spotted i/o timeouts during bulk publish

Created on 20 Jan 2020  ·  17 Comments  ·  Source: elastic/logstash

Hi Team,

For the past few days we have been observing i/o timeouts in the Beats integration tests (master branch).

Sample output log: https://api.travis-ci.org/v3/job/638647730/log.txt (ctrl+f: i/o timeout):

=== RUN   TestSendMultipleBigBatchesViaLogstashTCP
2020-01-17T23:04:10.680Z    DEBUG   [logstash]  logstash/async.go:111   connect
2020-01-17T23:04:10.686Z    INFO    elasticsearch/client.go:174 Elasticsearch url: http://elasticsearch:9200
2020-01-17T23:04:10.686Z    DEBUG   [elasticsearch] elasticsearch/client.go:774 DELETE http://elasticsearch:9200/beat-logstash-int-multiple-big-tcp-7668-2020.01.17  <nil>
2020-01-17T23:04:10.696Z    DEBUG   [elasticsearch] elasticsearch/client.go:774 PUT http://elasticsearch:9200/beat-logstash-int-multiple-big-tcp-7668-2020.01.17  {"settings":{"number_of_replicas":0,"number_of_shards":1}}
2020-01-17T23:04:11.050Z    DEBUG   [logstash]  logstash/async.go:159   128 events out of 128 events sent to logstash host logstash:5044. Continue sending
2020-01-17T23:04:32.153Z    DEBUG   [transport] transport/client.go:218 handle error: read tcp 172.18.0.9:35788->172.18.0.8:5044: i/o timeout
2020-01-17T23:04:32.153Z    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 172.18.0.9:35788->172.18.0.8:5044: i/o timeout
2020-01-17T23:04:32.153Z    DEBUG   [transport] transport/client.go:131 closing
2020-01-17T23:04:36.342Z    DEBUG   [transport] transport/client.go:218 handle error: read tcp 172.18.0.9:51662->172.18.0.8:5055: i/o timeout
2020-01-17T23:04:36.342Z    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 172.18.0.9:51662->172.18.0.8:5055: i/o timeout
2020-01-17T23:04:36.342Z    DEBUG   [transport] transport/client.go:131 closing
2020-01-17T23:04:37.632Z    DEBUG   [transport] transport/client.go:218 handle error: read tcp 172.18.0.9:35838->172.18.0.8:5044: i/o timeout
2020-01-17T23:04:37.632Z    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 172.18.0.9:35838->172.18.0.8:5044: i/o timeout
2020-01-17T23:04:37.632Z    DEBUG   [transport] transport/client.go:131 closing
2020-01-17T23:04:39.803Z    DEBUG   [transport] transport/client.go:218 handle error: read tcp 172.18.0.9:51716->172.18.0.8:5055: i/o timeout
2020-01-17T23:04:39.803Z    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 172.18.0.9:51716->172.18.0.8:5055: i/o timeout
2020-01-17T23:04:39.803Z    DEBUG   [transport] transport/client.go:131 closing
2020-01-17T23:04:41.050Z    DEBUG   [transport] transport/client.go:218 handle error: read tcp 172.18.0.9:35882->172.18.0.8:5044: i/o timeout
2020-01-17T23:04:41.050Z    ERROR   logstash/async.go:256   Failed to publish events caused by: read tcp 172.18.0.9:35882->172.18.0.8:5044: i/o timeout
2020-01-17T23:04:41.050Z    DEBUG   [transport] transport/client.go:131 closing
2020-01-17T23:04:41.064Z    DEBUG   [logstash]  logstash/async.go:159   128 events out of 128 events sent to logstash host logstash:5044. Continue sending
2020-01-17T23:04:41.064Z    DEBUG   [logstash]  logstash/async.go:116   close connection
2020-01-17T23:04:41.064Z    DEBUG   [logstash]  logstash/async.go:116   close connection
2020-01-17T23:04:41.064Z    ERROR   logstash/async.go:256   Failed to publish events caused by: client is not connected

In the PR (https://github.com/elastic/beats/pull/15568) I verified that the timeouts don't occur if the Logstash container image is downgraded to docker.elastic.co/logstash/logstash@sha256:e01cf165142edf8d67485115b938c94deeda66153e9516aa2ce69ee417c5fc33.

I also did a version-control comparison between the mentioned image and its newer release:

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   logstash-core/lib/jars/logstash-core.jar
    modified:   logstash-core/lib/logstash/build.rb

Untracked files:
  (use "git add <file>..." to include in what will be committed)

    vendor/bundle/jruby/2.5.0/gems/aws-sdk-2.11.428/
    vendor/bundle/jruby/2.5.0/gems/aws-sdk-core-2.11.428/
    vendor/bundle/jruby/2.5.0/gems/aws-sdk-resources-2.11.428/
    vendor/bundle/jruby/2.5.0/gems/logstash-input-beats-6.0.5-java/
    vendor/bundle/jruby/2.5.0/specifications/aws-sdk-2.11.428.gemspec
    vendor/bundle/jruby/2.5.0/specifications/aws-sdk-core-2.11.428.gemspec
    vendor/bundle/jruby/2.5.0/specifications/aws-sdk-resources-2.11.428.gemspec
    vendor/bundle/jruby/2.5.0/specifications/logstash-input-beats-6.0.5-java.gemspec

The changes in logstash don't seem to be relevant:

diff --git a/logstash-core/lib/logstash/build.rb b/logstash-core/lib/logstash/build.rb
index 1bebfdd..a7fb6af 100644
--- a/logstash-core/lib/logstash/build.rb
+++ b/logstash-core/lib/logstash/build.rb
@@ -1,2 +1,2 @@
 # encoding: utf-8
-BUILD_INFO = {"build_date"=>"2020-01-13T17:50:11+00:00", "build_sha"=>"1bc0aeab3424bd77c6c391994cbebf087198db7c", "build_snapshot"=>true}
\ No newline at end of file
+BUILD_INFO = {"build_date"=>"2020-01-14T11:21:37+00:00", "build_sha"=>"4b29112efd21278117dad0921c25d2a93e2b44bb", "build_snapshot"=>true}
\ No newline at end of file

... but I'm not convinced about the Netty update that ships with logstash-input-beats 6.0.5.

Thanks for any help here!

All 17 comments

You may try upgrading netty to the next released version, 4.1.45.Final (the first one released this year). Looking at the reported issues, we might have hit this bug (https://github.com/netty/netty/pull/9931).

We switched from Netty 4.1.30 to 4.1.44 to fix https://github.com/logstash-plugins/logstash-input-beats/pull/376 and probably introduced this regression.

@andsel do you have an ETA for this issue? When can we expect a bugfix release?

@jsvd is looking into this in depth, first trying to exercise the bug locally so the fix can be proven before pushing a new version of logstash-input-beats.

hi @mtojek, we'll be updating the netty dependency on the beats input and releasing a new plugin soon.
I'm currently looking at the test failure and trying to replicate it locally, so we can verify that the update is enough to fix it. How do you recommend replicating TestSendMultipleBigBatchesViaLogstashTCP locally in my environment?

_Note: I didn't manage to replicate this behavior locally (only in the Beats CI)._

I believe the fastest way to replicate this locally would be:

  1. Clone the https://github.com/elastic/beats repository.
  2. cd testing/environments
  3. Modify snapshot.yml to publish the Logstash and Elasticsearch ports (5044, 5055, 9200) locally; see the sketch after this list.
  4. Run docker-compose -f snapshot.yml up logstash elasticsearch to boot up the services.
  5. Go back to the project root.
  6. cd libbeat/outputs/logstash
  7. Run go test -count=1 -v -run TestSendMultipleBigBatchesViaLogstashTCP -tags integration .
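
For step 3, the only change needed is publishing the container ports. A minimal sketch of the kind of edit to make in snapshot.yml (the service names match step 4; the rest of each service definition is unchanged and may differ in detail):

# Sketch only: publish the ports the integration test dials from the host.
services:
  logstash:
    ports:
      - "5044:5044"
      - "5055:5055"
  elasticsearch:
    ports:
      - "9200:9200"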

To replicate in the Beats CI, just open a Draft PR and let Travis CI run the tests for you. You can take a look at this PR: https://github.com/elastic/beats/pull/15568

ok, running locally I can make the test produce different results depending on whether the latest beats input is being used. I don't get exactly the errors that you see, but it's enough to use as a baseline for debugging.

For reference, here's the libbeat test suite crashing when connecting to the offending beats input:

❯ go test -count=1 -v -run TestSendMultipleBigBatchesViaLogstashTCP -tags integration .
=== RUN   TestSendMultipleBigBatchesViaLogstashTCP
--- FAIL: TestSendMultipleBigBatchesViaLogstashTCP (31.60s)
    logstash_integration_test.go:408: 
            Error Trace:    logstash_integration_test.go:408
                                        logstash_integration_test.go:363
                                        logstash_integration_test.go:355
            Error:          Not equal: 
                            expected: true
                            actual  : false
            Test:           TestSendMultipleBigBatchesViaLogstashTCP
    logstash_integration_test.go:408: 
            Error Trace:    logstash_integration_test.go:408
                                        logstash_integration_test.go:363
                                        logstash_integration_test.go:355
            Error:          Not equal: 
                            expected: true
                            actual  : false
            Test:           TestSendMultipleBigBatchesViaLogstashTCP
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x43f5026]

goroutine 20 [running]:
testing.tRunner.func1(0xc000128100)
    /usr/local/opt/go/libexec/src/testing/testing.go:874 +0x3a3
panic(0x4566d20, 0x4adb240)
    /usr/local/opt/go/libexec/src/runtime/panic.go:679 +0x1b2
github.com/elastic/beats/vendor/github.com/elastic/go-lumber/client/v2.(*AsyncClient).Send(0x0, 0xc0001be090, 0xc000454800, 0x80, 0x80, 0x403019a, 0xc000128101)

If you apply the changes in PR https://github.com/elastic/beats/pull/15568 you will get rid of the panic (you can leave the SNAPSHOT reference in the docker-compose file).

So this issue still happens with the latest netty, 4.1.45.
The problem started when we bumped the dependency from 4.1.30 to 4.1.44, and after some manual bisecting the failure appears between .34 and .35.
Release notes for 4.1.35 can be found here; the list of changes in GitHub is tracked here.

I'm spending a few more hours trying to identify the actual problem, but we may have to pin netty back to a version between 4.1.30 and 4.1.34.

Ok, found the change that seems to be causing this: https://github.com/netty/netty/commit/51112e2b36 (PR). It is in the IdleStateHandler and was introduced in 4.1.35, which fits the problem we're seeing (tests with large messages).
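
For context, here is a minimal sketch of where an IdleStateHandler typically sits in a Netty server pipeline; the class name, handler name, and timeout values are illustrative assumptions, not the actual beats-input configuration:

import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.timeout.IdleStateHandler;

// Illustrative sketch only. IdleStateHandler(reader, writer, all) fires an
// IdleStateEvent when no read/write happens within the given window; the
// 4.1.35 change linked above altered how the handler observes outbound
// progress, which is where the stuck-write behavior originates.
public class BeatsPipelineSketch extends ChannelInitializer<SocketChannel> {
    @Override
    protected void initChannel(SocketChannel ch) {
        // 60s reader idle, 30s writer idle, no all-idle check (values assumed)
        ch.pipeline().addLast("idle-detector", new IdleStateHandler(60, 30, 0));
        // ... the protocol decoders/handlers for the beats framing would follow
    }
}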

Also (recently) spotted a buffer-related bug in 4.1.44 that was resolved later in 4.1.45
... but it seems we might need to go :arrow_backward: for now.
Just for context, the update did two things: remove the CBC ciphers and update the netty dependencies.
The CBC removal fixed an issue where cipher negotiation started failing with Beats in 7.5.1.

> Ok, found the change that seems to be causing this: netty/netty@51112e2 (PR). It is in the IdleStateHandler and was introduced in 4.1.35, which fits the problem we're seeing (tests with large messages).

Nice! There is also a comment at the bottom of issue https://github.com/netty/netty/issues/8912 claiming that the issue hasn't been fully fixed.

OK. While we get this sorted out, I'll browse the recent changes to make sure there's no regression that prevents us from downgrading netty to .34, which allows us to stick with the newer .28 version of tcnative.

It looks like the Netty version bump made it into Logstash 7.5.2; does this issue mean that it's better to avoid that release for now?

Some more details: this issue should go unnoticed in most production scenarios because Beats has pipelining enabled. By allowing two (by default) batches of data in flight, the second write to the socket allows the ACK from the first to get unstuck from the IdleStateHandler.

So as long as the flow of data is fairly consistent, there should be no noticeable effects. And to correct my comment above: this is not related to large messages. Any single message traveling the outbound path of the netty pipeline will get stuck in the IdleStateHandler until the handler either fires a WRITE_IDLE event or something else comes inbound (like the second batch from filebeat's pipelining feature).
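
For example, disabling pipelining on the Beats side removes the second in-flight batch that would otherwise unstick the ACK. A sketch of a minimal filebeat.yml output section (the hosts value is an assumption matching the test environment above):

output.logstash:
  hosts: ["logstash:5044"]
  # pipelining defaults to 2 (two batches in flight); 0 disables it, forcing
  # each batch to wait for its ACK and making the hang reproducible on the
  # affected netty versions (4.1.35+)
  pipelining: 0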

Our test suites were not catching this issue because they either send a single batch of data or use filebeat with pipelining enabled (since it's the default). Either forcing multiple batches or disabling pipelining causes our current spec suite to hang and fail: https://travis-ci.org/logstash-plugins/logstash-input-beats/builds/640782215

In this scenario I think we should move netty back to .34, which still allows us to use the latest tcnative and avoids this issue that was introduced in .35.

Version 4.1.47 of netty has fixed this bug in https://github.com/netty/netty/pull/10069/files#diff-5d5eb2c10e26184226a4f493481bcf87R930-R935, so we can upgrade to a more recent version of netty again.
