Logstash: Stalled Shutdown in Concurrent Situations

Created on 12 Mar 2018 · 20 comments · Source: elastic/logstash

CC @andrewvc @danhermann

@yaauie logged into my build server and grabbed some data during a stalled build. This is with 6.2.2. His theory is that there's some kind of thread shutdown issue specifically related to using the stdin input. I've seen it in 3 or 4 of our different demo builds. For example, one where we actually write custom ruby code:

[2018-03-05T02:27:16,328][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{["LogStash::Filters::Ruby", {"code"=>"\n      if event.get(\"type\") == \"business\"\n        category_city = []\n        event.get(\"categories\").each { |category|\n            category_city.push(category + \" \" + event.get(\"city\"))\n        }\n        event.set(\"category_city\", category_city)\n        event.set(\"businessid_name\", event.get(\"business_id\") + \" \" + event.get(\"name\"))\n        event.set(\"[@metadata][document_id]\", \"business \" + event.get(\"business_id\"))\n      elsif event.get(\"type\") == \"user\"\n        if event.get(\"name\")\n            event.set(\"userid_name\", event.get(\"user_id\") + \" \" + event.get(\"name\"))\n        else\n            event.set(\"userid_name\", event.get(\"user_id\"))\n        end\n        if event.get(\"friends\")\n            event.set(\"friends_count\", event.get(\"friends\").length)\n        else\n            event.set(\"friends_count\", 0)\n            event.set(\"friends\", [])\n        end\n        event.set(\"[@metadata][document_id]\", \"user \" + event.get(\"user_id\"))\n      elsif event.get(\"type\") == \"review\"\n        event.set(\"[@metadata][document_id]\", \"review \" + event.get(\"review_id\"))\n      end\n    ", "id"=>"4a339ab35bee8e587299b2374d5a59c5eb08c31b1b94861e882c3ed7dca568be"}]=>[{"thread_id"=>34, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}, {"thread_id"=>37, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}, {"thread_id"=>42, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}, {"thread_id"=>45, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}]}}

But today I actually saw it in another build that doesn't itself have custom ruby code, but which I believe uses filters that are written in ruby:

sa-demo-singaporelta-ls-build        | [2018-03-12T14:22:18,293][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{["LogStash::Filters::Mutate", {"add_field"=>{"[geo][lat]"=>"%{Latitude}", "[geo][lon]"=>"%{Longitude}"}, "rename"=>{"EquipmentID"=>"equipment_id", "Message"=>"message"}, "remove_field"=>["CreateDate", "Date", "Attribute", "Summary", "Distance", "host", "__metadata", "Latitude", "Longitude", "VMSID"], "id"=>"2e2d6c0960928015af288b2853997830e51683148e7c8539c5d07f0de6bfad4f"}]=>[{"thread_id"=>45, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}]}}
Labels: bug

All 20 comments

@ctindel, I'm trying to reproduce this locally, but I need a few more details:

  • Could you provide one of the configs on which you experienced this problem in case there's something specific to the ruby filter that contributes to the problem?
  • Did the stall occur while Logstash was running or when you shut down Logstash?
  • If the stall occurred while running, could I get a sample of the data you were using?

FYI, I am able to reproduce this problem on @ctindel's build server with the command below, using only the stdin input and elasticsearch output, so I don't think the ruby filter has anything to do with it.

cat /tmp/yelp/yelp_scraper_data/yelp_scraper_review.json | jq -c '._source + {type: ._type}' | logstash -f /tmp/src/yelp_logstash.conf
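For context, the referenced yelp_logstash.conf is not shown in this thread. A minimal sketch of a stdin-to-Elasticsearch pipeline along those lines might look like the following; the codec, hosts, and index values are assumptions for illustration, not the actual config:

input {
  stdin {
    # jq -c emits one JSON document per line
    codec => json_lines
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "yelp"
  }
}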

Interesting. I assumed it was the ruby filter because it always indicated a ruby thread in the error message.

Yeah, my guess is that the stall reporter doesn't account for the possibility that the pipeline itself has a bug. WDYT, Dan?


If it were a pipeline bug, I'd expect to see the problem with outputs other than ES, but so far, I have been unable to reproduce it with any output other than ES.

Oh, interesting. I was going by the stack trace, which is stuck on /logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90.

Maybe that's incorrect somehow?

@andrewvc @danhermann is there some kind of debugging we can turn on? Maybe the stdin input never receives the EOF or something? Or maybe some thread isn't receiving the shutdown signal?

@ctindel, I don't think any of the debug logging that we have will be helpful here because I think it's a multi-threading issue. It's difficult to isolate because it doesn't happen deterministically. I was wondering if running with a single worker thread (-w 1) would be an interim workaround, but I haven't had a chance to try that, yet.
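One generic way to see exactly where the worker threads are stuck, independent of Logstash's own log levels, is a JVM thread dump. This is a general JVM technique rather than something suggested in the thread, and the pgrep pattern and output path below are only illustrative:

# jstack ships with the JDK; dump all thread stacks of the Logstash JVM
jstack $(pgrep -f logstash) > /tmp/logstash-threads.txt

# alternatively, SIGQUIT makes the JVM print a thread dump to its stdout
kill -QUIT $(pgrep -f logstash)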

@danhermann
Is there a way for me to try that in the docker image? I guess it just goes on the cmd line? I'll give it a shot.

@ctindel, I think you could either modify the line below in the load_data.sh script or jump into the docker container as you described to me and run it from the command line directly.

cat /tmp/yelp/yelp_scraper_data/yelp_scraper_$t.json | jq -c '._source + {type: ._type}' | logstash -w 1 -f /tmp/src/yelp_logstash.conf
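If editing the command in load_data.sh is inconvenient, the same workaround can also be set persistently in logstash.yml rather than on the command line; the path below is where the official Docker image keeps its config, which may or may not match this setup:

# e.g. /usr/share/logstash/config/logstash.yml
pipeline.workers: 1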

Well I've done a few builds now with -w 1 and haven't run into the problem.

Ship it! ;) I'm going to keep working on it because it's definitely a problem, but hopefully you're at least unblocked on those demos. Do let me know if you run into it with a single worker, though.

I've seen this error a lot, since at least 5.6. Since setting --pipeline.workers 1 it has not recurred. Thank you!
Example errors were like:
[2018-03-21T15:05:49,861][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{["LogStash::Filters::CSV", {"columns"=>["id", "portal", "seller", "event", "venue", "start_datetime", "end_datetime", "timezone", "capacity", "oversell_capacity", "sold_quantity", "redeemed_count", "used_capacity", "sold_out", "oversold_out", "pct_sold", "day_of_week", "hour_of_day"], "separator"=>"\t", "id"=>"d91e7dae433f89d60b985407bc03172f8a87d9a1451f85404761f36a22fb4bb1"}]=>[{"thread_id"=>31, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}]}}

We believe this PR will resolve this problem for the Ruby execution engine: https://github.com/elastic/logstash/pull/9285. As of 6.3, the Java execution engine should not exhibit this problem, either.

@danhermann what release will that fix ship in? Happy to test it with my stuff when it does just to verify.

@ctindel, it's scheduled to ship in 6.2.4 and 6.3.0+ right now.

I'm using Logstash 6.3.1 and I'm still seeing this error, for example:

[WARN ] 2018-08-17 18:21:25.926 [Ruby-0-Thread-52: /usr/share/logstash/logstash-core/lib/logstash/shutdown_watcher.rb:35] shutdownwatcher - {"inflight_count"=>0, "stalling_thread_info"=>{"other"=>[{"thread_id"=>30, "name"=>"[main]<sqs", "current_call"=>"uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/net/protocol.rb:158:in `wait'"}], ["LogStash::Filters::Mutate", {"add_field"=>{"version"=>"1.1"}, "id"=>"9c42ab17107a9c675abaced288ad676fb9a01aec597e1af05aafbaf2a573918d"}]=>[{"thread_id"=>24, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}, {"thread_id"=>25, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}, {"thread_id"=>26, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}, {"thread_id"=>27, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}]}}

For what it's worth, I'm using Logstash 6.4.0 with the udp input and elasticsearch output, and I'm also seeing this error:

[2018-09-19T16:38:33,442][WARN ][org.logstash.execution.ShutdownWatcherExt] {"inflight_count"=>0, "stalling_threads_info"=>{["LogStash::Filters::Mutate", {"add_field"=>{"typeogram"=>"log_message"}, "id"=>"2c6110795d82d46bc3efab19e33aadd1367100914e23c3d7b97d697180e3319b"}]=>[{"thread_id"=>22, "name"=>nil, "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/stud-0.0.23/lib/stud/interval.rb:89:in `sleep'"}]}}

Also seeing this issue in 6.4.0, with http+kafka inputs, and elasticsearch output:

[2018-10-18T06:41:54,157][WARN ][org.logstash.execution.ShutdownWatcherExt] {"inflight_count"=>0, "stalling_threads_info"=>{"other"=>[{"thread_id"=>26, "name"=>nil, "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/stud-0.0.23/lib/stud/interval.rb:89:in `sleep'"}]}}

It's troublesome because Logstash can only be stopped with a SIGKILL, and since we don't have persistence enabled in the pipeline, events are lost.
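As an aside (this is not something suggested by anyone in the thread): if events are being lost to the SIGKILL anyway, Logstash's unsafe-shutdown option at least lets the process exit on its own despite stalled plugins. It carries the same risk of dropping in-flight events, so it is a trade-off rather than a fix:

logstash --pipeline.unsafe_shutdown -f /path/to/pipeline.conf

# or equivalently in logstash.yml:
# pipeline.unsafe_shutdown: true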

I've come across the same issue in 6.5.4. The data processing still succeeds despite the stdout error, meaning the data is still being imported into Elasticsearch, but the piped data is also being written to journald, which makes journald very busy.

