Hi Team,
For the past few days we have been observing i/o timeouts in the Beats integration tests (master branch).
Sample output log: https://api.travis-ci.org/v3/job/638647730/log.txt (ctrl+f: i/o timeout):
=== RUN TestSendMultipleBigBatchesViaLogstashTCP
2020-01-17T23:04:10.680Z DEBUG [logstash] logstash/async.go:111 connect
2020-01-17T23:04:10.686Z INFO elasticsearch/client.go:174 Elasticsearch url: http://elasticsearch:9200
2020-01-17T23:04:10.686Z DEBUG [elasticsearch] elasticsearch/client.go:774 DELETE http://elasticsearch:9200/beat-logstash-int-multiple-big-tcp-7668-2020.01.17 <nil>
2020-01-17T23:04:10.696Z DEBUG [elasticsearch] elasticsearch/client.go:774 PUT http://elasticsearch:9200/beat-logstash-int-multiple-big-tcp-7668-2020.01.17 {"settings":{"number_of_replicas":0,"number_of_shards":1}}
2020-01-17T23:04:11.050Z DEBUG [logstash] logstash/async.go:159 128 events out of 128 events sent to logstash host logstash:5044. Continue sending
2020-01-17T23:04:32.153Z DEBUG [transport] transport/client.go:218 handle error: read tcp 172.18.0.9:35788->172.18.0.8:5044: i/o timeout
2020-01-17T23:04:32.153Z ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 172.18.0.9:35788->172.18.0.8:5044: i/o timeout
2020-01-17T23:04:32.153Z DEBUG [transport] transport/client.go:131 closing
2020-01-17T23:04:36.342Z DEBUG [transport] transport/client.go:218 handle error: read tcp 172.18.0.9:51662->172.18.0.8:5055: i/o timeout
2020-01-17T23:04:36.342Z ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 172.18.0.9:51662->172.18.0.8:5055: i/o timeout
2020-01-17T23:04:36.342Z DEBUG [transport] transport/client.go:131 closing
2020-01-17T23:04:37.632Z DEBUG [transport] transport/client.go:218 handle error: read tcp 172.18.0.9:35838->172.18.0.8:5044: i/o timeout
2020-01-17T23:04:37.632Z ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 172.18.0.9:35838->172.18.0.8:5044: i/o timeout
2020-01-17T23:04:37.632Z DEBUG [transport] transport/client.go:131 closing
2020-01-17T23:04:39.803Z DEBUG [transport] transport/client.go:218 handle error: read tcp 172.18.0.9:51716->172.18.0.8:5055: i/o timeout
2020-01-17T23:04:39.803Z ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 172.18.0.9:51716->172.18.0.8:5055: i/o timeout
2020-01-17T23:04:39.803Z DEBUG [transport] transport/client.go:131 closing
2020-01-17T23:04:41.050Z DEBUG [transport] transport/client.go:218 handle error: read tcp 172.18.0.9:35882->172.18.0.8:5044: i/o timeout
2020-01-17T23:04:41.050Z ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 172.18.0.9:35882->172.18.0.8:5044: i/o timeout
2020-01-17T23:04:41.050Z DEBUG [transport] transport/client.go:131 closing
2020-01-17T23:04:41.064Z DEBUG [logstash] logstash/async.go:159 128 events out of 128 events sent to logstash host logstash:5044. Continue sending
2020-01-17T23:04:41.064Z DEBUG [logstash] logstash/async.go:116 close connection
2020-01-17T23:04:41.064Z DEBUG [logstash] logstash/async.go:116 close connection
2020-01-17T23:04:41.064Z ERROR logstash/async.go:256 Failed to publish events caused by: client is not connected
In the PR (https://github.com/elastic/beats/pull/15568) I verified that the timeouts don't occur if the logstash container image is downgraded to: docker.elastic.co/logstash/logstash@sha256:e01cf165142edf8d67485115b938c94deeda66153e9516aa2ce69ee417c5fc33.
I also ran a comparison (using version control) between the mentioned image and its newer release:
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: logstash-core/lib/jars/logstash-core.jar
modified: logstash-core/lib/logstash/build.rb
Untracked files:
(use "git add <file>..." to include in what will be committed)
vendor/bundle/jruby/2.5.0/gems/aws-sdk-2.11.428/
vendor/bundle/jruby/2.5.0/gems/aws-sdk-core-2.11.428/
vendor/bundle/jruby/2.5.0/gems/aws-sdk-resources-2.11.428/
vendor/bundle/jruby/2.5.0/gems/logstash-input-beats-6.0.5-java/
vendor/bundle/jruby/2.5.0/specifications/aws-sdk-2.11.428.gemspec
vendor/bundle/jruby/2.5.0/specifications/aws-sdk-core-2.11.428.gemspec
vendor/bundle/jruby/2.5.0/specifications/aws-sdk-resources-2.11.428.gemspec
vendor/bundle/jruby/2.5.0/specifications/logstash-input-beats-6.0.5-java.gemspec
The changes in logstash don't seem to be relevant:
diff --git a/logstash-core/lib/logstash/build.rb b/logstash-core/lib/logstash/build.rb
index 1bebfdd..a7fb6af 100644
--- a/logstash-core/lib/logstash/build.rb
+++ b/logstash-core/lib/logstash/build.rb
@@ -1,2 +1,2 @@
# encoding: utf-8
-BUILD_INFO = {"build_date"=>"2020-01-13T17:50:11+00:00", "build_sha"=>"1bc0aeab3424bd77c6c391994cbebf087198db7c", "build_snapshot"=>true}
\ No newline at end of file
+BUILD_INFO = {"build_date"=>"2020-01-14T11:21:37+00:00", "build_sha"=>"4b29112efd21278117dad0921c25d2a93e2b44bb", "build_snapshot"=>true}
\ No newline at end of file
... but I'm not convinced about the netty update that shipped with logstash-input-beats 6.0.5.
Thanks for any help here!
You may try upgrading netty to the next released version, 4.1.45.Final (the first one this year). Looking at the reported issues, we might have hit this bug: https://github.com/netty/netty/pull/9931.
We switched from Netty 4.1.30 -> 4.1.44 to fix https://github.com/logstash-plugins/logstash-input-beats/pull/376 and probably introduced it.
@andsel do you have any ETA for such issues? When could we expect a bugfix update?
@jsvd is looking deeply into this, trying first to exercise the bug locally so we can prove it's fixed before pushing a new version of the logstash-input-beats.
hi @mtojek, we'll be updating the netty dependency on the beats input and releasing a new plugin soon.
I'm currently looking at the test failure and trying to replicate it locally, so we can verify that the update is enough to fix it. How do you recommend replicating TestSendMultipleBigBatchesViaLogstashTCP locally in my environment?
_Notice: I didn't manage to replicate this behavior locally (only in Beats CI)._
I believe the fastest way to replicate this locally would be:
1. `cd testing/environments`
2. `docker-compose -f snapshot.yml up logstash elasticsearch` to boot up the services.
3. `cd libbeat/outputs/logstash`
4. `go test -count=1 -v -run TestSendMultipleBigBatchesViaLogstashTCP -tags integration .`

To replicate in the Beats CI, just open a draft PR and let Travis CI spin up the tests for you. You can take a look at this PR: https://github.com/elastic/beats/pull/15568
ok, running locally I can make the test produce different results depending on whether the latest beats input is being used. I don't get exactly the errors that you see, but it's enough to use as a baseline for debugging.
For reference, here's the libbeat test suite crashing when connecting to the offending beats input:
❯ go test -count=1 -v -run TestSendMultipleBigBatchesViaLogstashTCP -tags integration .
=== RUN TestSendMultipleBigBatchesViaLogstashTCP
--- FAIL: TestSendMultipleBigBatchesViaLogstashTCP (31.60s)
logstash_integration_test.go:408:
Error Trace: logstash_integration_test.go:408
logstash_integration_test.go:363
logstash_integration_test.go:355
Error: Not equal:
expected: true
actual : false
Test: TestSendMultipleBigBatchesViaLogstashTCP
logstash_integration_test.go:408:
Error Trace: logstash_integration_test.go:408
logstash_integration_test.go:363
logstash_integration_test.go:355
Error: Not equal:
expected: true
actual : false
Test: TestSendMultipleBigBatchesViaLogstashTCP
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x43f5026]
goroutine 20 [running]:
testing.tRunner.func1(0xc000128100)
/usr/local/opt/go/libexec/src/testing/testing.go:874 +0x3a3
panic(0x4566d20, 0x4adb240)
/usr/local/opt/go/libexec/src/runtime/panic.go:679 +0x1b2
github.com/elastic/beats/vendor/github.com/elastic/go-lumber/client/v2.(*AsyncClient).Send(0x0, 0xc0001be090, 0xc000454800, 0x80, 0x80, 0x403019a, 0xc000128101)
If you apply the changes in this PR https://github.com/elastic/beats/pull/15568 you will get rid of the panic (you can leave the SNAPSHOT reference in the docker-compose file).
So this issue still happens with the latest netty, 4.1.45.
The problem started when we bumped the dependency from 4.1.30 to 4.1.44, and after some manual bisecting the failure appears between .34 and .35.
Release notes for 4.1.35 can be found here; the list of changes is tracked in GitHub here.
I'll spend a few more hours trying to identify the actual problem, but we may have to pin netty back to a version between 4.1.30 and 4.1.34.
Ok, found the change that seems to be causing this: https://github.com/netty/netty/commit/51112e2b36 (PR). It is in the IdleStateHandler and was introduced in 4.1.35, which fits the problem we're seeing (tests with large messages); a sketch of where that handler sits in a beats-style pipeline follows below.
Also, I (lately) spotted a buffer-related bug in 4.1.44 that was resolved later in 4.1.45
... but it seems we might need to go :arrow_backward: for now.
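For orientation, here is a minimal, hedged sketch of a beats-style Netty server pipeline showing where an IdleStateHandler typically sits; every outbound ACK write has to traverse it on its way to the socket, which is why a regression there can stall acknowledgements. The port, idle timeouts, and handler logic are illustrative assumptions, not the actual logstash-input-beats setup.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.timeout.IdleStateHandler;

public class BeatsLikeServer {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);
        EventLoopGroup workers = new NioEventLoopGroup();
        try {
            ServerBootstrap bootstrap = new ServerBootstrap()
                .group(boss, workers)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // readerIdle=60s, writerIdle=30s, allIdle disabled;
                        // the values here are made up for the sketch.
                        ch.pipeline().addLast("idle", new IdleStateHandler(60, 30, 0));
                        ch.pipeline().addLast("ack", new ChannelInboundHandlerAdapter() {
                            @Override
                            public void channelRead(ChannelHandlerContext ctx, Object msg) {
                                // Whatever is written here travels back out through
                                // the IdleStateHandler above -- the handler touched
                                // by the 4.1.35 change (netty commit 51112e2b36).
                                ctx.writeAndFlush(msg);
                            }
                        });
                    }
                });
            bootstrap.bind(5044).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```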
Just for context, the update did 2 things: remove CBC ciphers and update netty dependencies.
The CBC removal fixed an issue where cipher negotiation started failing with beats in 7.5.1.
Nice! There is also a comment at the bottom of https://github.com/netty/netty/issues/8912 claiming that the issue hasn't been fully fixed.
ok. While we get this sorted, I'll browse the recent changes to check that there's no regression preventing us from downgrading netty to .34, which allows us to stick with the newer .28 version of tcnative.
Draft PR created at https://github.com/logstash-plugins/logstash-input-beats/pull/378
It looks like the Netty version bump made it into Logstash 7.5.2; does this issue mean that it's better to avoid that release for now?
Some more details: this issue should go unnoticed in most production scenarios because beats has pipelining enabled. By allowing two (by default) batches of data in flight, the second write to the socket allows the ack from the first to get unstuck from the IdleStateHandler.
So as long as the flow of data is reasonably consistent there should be no noticeable effects. And to correct my comment above, this is not related to large messages: any single message traveling the outbound path of the netty pipeline will get stuck in the IdleStateHandler until it either fires a WRITE_IDLE or something else comes inbound (like the second batch from filebeat's pipelining feature).
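As a hedged illustration (not code from logstash-input-beats), the observer below reacts to the two escape hatches described above: the writer-idle event that IdleStateHandler eventually fires, and inbound traffic. The class name and log messages are made up; only the netty event types are real.

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;

// Hypothetical observer: surfaces the two events that can unstick a pending
// ACK write in the affected netty versions (4.1.35 through 4.1.45).
public class IdleObserver extends ChannelInboundHandlerAdapter {
    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
        if (evt instanceof IdleStateEvent
                && ((IdleStateEvent) evt).state() == IdleState.WRITER_IDLE) {
            // Escape hatch 1: the WRITE_IDLE case -- the channel has been
            // write-idle long enough for IdleStateHandler to fire.
            System.out.println("writer idle fired; stalled write can proceed");
        }
        super.userEventTriggered(ctx, evt);
    }

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        // Escape hatch 2: something arrives inbound, e.g. the second in-flight
        // batch sent by filebeat's pipelining (two batches by default).
        System.out.println("inbound traffic; stalled write can proceed");
        super.channelRead(ctx, msg);
    }
}
```

Installed after the IdleStateHandler in the pipeline, this would log whenever one of the two conditions occurs, which is a cheap way to confirm the stall pattern while debugging.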
Our test suites were not catching this issue because they either send a single batch of data or use filebeat with pipelining enabled (since it's the default). Either forcing multiple batches or disabling pipelining causes our current spec suite to hang and fail: https://travis-ci.org/logstash-plugins/logstash-input-beats/builds/640782215
In this scenario I think we should move netty back to .34, which still allows us to use the latest tcnative and skips this issue that was introduced in .35.
Version 4.1.47 of netty has solved this bug (see https://github.com/netty/netty/pull/10069/files#diff-5d5eb2c10e26184226a4f493481bcf87R930-R935), so we can again upgrade to a more recent version of netty.