Logstash: Stalled Shutdown in Concurrent Situations

Created on 12 Mar 2018 · 20 comments · Source: elastic/logstash

CC @andrewvc @danhermann

@yaauie logged into my build server and grabbed some data during a stalled build. This is with 6.2.2. His theory is that there's some kind of thread shutdown issue specifically related to using the stdin input. I've seen it in 3 or 4 of our different demo builds. For example, one where we actually write custom ruby code:

[2018-03-05T02:27:16,328][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{["LogStash::Filters::Ruby", {"code"=>"\n      if event.get(\"type\") == \"business\"\n        category_city = []\n        event.get(\"categories\").each { |category|\n            category_city.push(category + \" \" + event.get(\"city\"))\n        }\n        event.set(\"category_city\", category_city)\n        event.set(\"businessid_name\", event.get(\"business_id\") + \" \" + event.get(\"name\"))\n        event.set(\"[@metadata][document_id]\", \"business \" + event.get(\"business_id\"))\n      elsif event.get(\"type\") == \"user\"\n        if event.get(\"name\")\n            event.set(\"userid_name\", event.get(\"user_id\") + \" \" + event.get(\"name\"))\n        else\n            event.set(\"userid_name\", event.get(\"user_id\"))\n        end\n        if event.get(\"friends\")\n            event.set(\"friends_count\", event.get(\"friends\").length)\n        else\n            event.set(\"friends_count\", 0)\n            event.set(\"friends\", [])\n        end\n        event.set(\"[@metadata][document_id]\", \"user \" + event.get(\"user_id\"))\n      elsif event.get(\"type\") == \"review\"\n        event.set(\"[@metadata][document_id]\", \"review \" + event.get(\"review_id\"))\n      end\n    ", "id"=>"4a339ab35bee8e587299b2374d5a59c5eb08c31b1b94861e882c3ed7dca568be"}]=>[{"thread_id"=>34, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}, {"thread_id"=>37, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}, {"thread_id"=>42, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}, {"thread_id"=>45, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}]}}

But today I actually saw it in another build that doesn't itself have custom ruby code, but which I believe uses filters that are written in ruby:

sa-demo-singaporelta-ls-build        | [2018-03-12T14:22:18,293][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{["LogStash::Filters::Mutate", {"add_field"=>{"[geo][lat]"=>"%{Latitude}", "[geo][lon]"=>"%{Longitude}"}, "rename"=>{"EquipmentID"=>"equipment_id", "Message"=>"message"}, "remove_field"=>["CreateDate", "Date", "Attribute", "Summary", "Distance", "host", "__metadata", "Latitude", "Longitude", "VMSID"], "id"=>"2e2d6c0960928015af288b2853997830e51683148e7c8539c5d07f0de6bfad4f"}]=>[{"thread_id"=>45, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}]}}
Labels: bug

All 20 comments

@ctindel, I'm trying to reproduce this locally, but I need a few more details:

  • Could you provide one of the configs on which you experienced this problem in case there's something specific to the ruby filter that contributes to the problem?
  • Did the stall occur while Logstash was running or when you shut down Logstash?
  • If the stall occurred while running, could I get a sample of the data you were using?

FYI, I am able to reproduce this problem on @ctindel's build server with the command below, using only the stdin input and elasticsearch output, so I don't think the ruby filter has anything to do with it.

cat /tmp/yelp/yelp_scraper_data/yelp_scraper_review.json | jq -c '._source + {type: ._type}' | logstash -f /tmp/src/yelp_logstash.conf
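For context, the referenced yelp_logstash.conf is not shown in this thread. A minimal sketch of a stdin-to-Elasticsearch pipeline along those lines might look like the following; the codec, hosts, and index values are assumptions for illustration, not the actual config:

input {
  stdin {
    # jq -c emits one JSON document per line
    codec => json_lines
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "yelp"
  }
}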

Interesting. I assumed it was the ruby filter because it always indicated a ruby thread in the error message.

Yeah, my guess is that the stall reporter doesn't account for the possibility that the pipeline itself has a bug. WDYT, Dan?


If it were a pipeline bug, I'd expect to see the problem with outputs other than ES, but so far, I have been unable to reproduce it with any output other than ES.

Oh, interesting. I was going by the stack trace, which is stuck on /logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90.

Maybe that's incorrect somehow?

@andrewvc @danhermann is there some kind of debugging we can turn on? Maybe the stdin input never receives the EOF or something? Or maybe some thread isn't receiving the shutdown signal?

@ctindel, I don't think any of the debug logging that we have will be helpful here because I think it's a multi-threading issue. It's difficult to isolate because it doesn't happen deterministically. I was wondering if running with a single worker thread (-w 1) would be an interim workaround, but I haven't had a chance to try that, yet.
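One generic way to see exactly where the worker threads are stuck, independent of Logstash's own log levels, is a JVM thread dump. This is a general JVM technique rather than something suggested in the thread, and the pgrep pattern and output path below are only illustrative:

# jstack ships with the JDK; dump all thread stacks of the Logstash JVM
jstack $(pgrep -f logstash) > /tmp/logstash-threads.txt

# alternatively, SIGQUIT makes the JVM print a thread dump to its stdout
kill -QUIT $(pgrep -f logstash)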

@danhermann
Is there a way for me to try that in the docker image? I guess it just goes on the cmd line? I'll give it a shot.

@ctindel, I think you could either modify the line below in the load_data.sh script or jump into the docker container as you described to me and run it from the command line directly.

cat /tmp/yelp/yelp_scraper_data/yelp_scraper_$t.json | jq -c '._source + {type: ._type}' | logstash -w 1 -f /tmp/src/yelp_logstash.conf
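If editing the command in load_data.sh is inconvenient, the same workaround can also be set persistently in logstash.yml rather than on the command line; the path below is where the official Docker image keeps its config, which may or may not match this setup:

# e.g. /usr/share/logstash/config/logstash.yml
pipeline.workers: 1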

Well I've done a few builds now with -w 1 and haven't run into the problem.

Ship it! ;) I'm going to keep working on it because it's definitely a problem, but hopefully you're at least unblocked on those demos. Do let me know if you run into it with a single worker, though.

I've seen this error a lot, since at least 5.6. Since setting --pipeline.workers 1 it has not recurred. Thank you!
Example errors were like:
[2018-03-21T15:05:49,861][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{["LogStash::Filters::CSV", {"columns"=>["id", "portal", "seller", "event", "venue", "start_datetime", "end_datetime", "timezone", "capacity", "oversell_capacity", "sold_quantity", "redeemed_count", "used_capacity", "sold_out", "oversold_out", "pct_sold", "day_of_week", "hour_of_day"], "separator"=>"\t", "id"=>"d91e7dae433f89d60b985407bc03172f8a87d9a1451f85404761f36a22fb4bb1"}]=>[{"thread_id"=>31, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}]}}

We believe this PR will resolve this problem for the Ruby execution engine: https://github.com/elastic/logstash/pull/9285. As of 6.3, the Java execution engine should not exhibit this problem, either.

@danhermann what release will that fix ship in? Happy to test it with my stuff when it does just to verify.

@ctindel, it's scheduled to ship in 6.2.4 and 6.3.0+ right now.

I'm using Logstash 6.3.1 and I'm still seeing this error, for example:

[WARN ] 2018-08-17 18:21:25.926 [Ruby-0-Thread-52: /usr/share/logstash/logstash-core/lib/logstash/shutdown_watcher.rb:35] shutdownwatcher - {"inflight_count"=>0, "stalling_thread_info"=>{"other"=>[{"thread_id"=>30, "name"=>"[main]<sqs", "current_call"=>"uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/net/protocol.rb:158:in `wait'"}], ["LogStash::Filters::Mutate", {"add_field"=>{"version"=>"1.1"}, "id"=>"9c42ab17107a9c675abaced288ad676fb9a01aec597e1af05aafbaf2a573918d"}]=>[{"thread_id"=>24, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}, {"thread_id"=>25, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}, {"thread_id"=>26, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}, {"thread_id"=>27, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}]}}

For what it's worth, I'm using Logstash 6.4.0 with the udp input and elasticsearch output, and I'm also seeing this error:

[2018-09-19T16:38:33,442][WARN ][org.logstash.execution.ShutdownWatcherExt] {"inflight_count"=>0, "stalling_threads_info"=>{["LogStash::Filters::Mutate", {"add_field"=>{"typeogram"=>"log_message"}, "id"=>"2c6110795d82d46bc3efab19e33aadd1367100914e23c3d7b97d697180e3319b"}]=>[{"thread_id"=>22, "name"=>nil, "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/stud-0.0.23/lib/stud/interval.rb:89:in `sleep'"}]}}

Also seeing this issue in 6.4.0, with http+kafka inputs, and elasticsearch output:

[2018-10-18T06:41:54,157][WARN ][org.logstash.execution.ShutdownWatcherExt] {"inflight_count"=>0, "stalling_threads_info"=>{"other"=>[{"thread_id"=>26, "name"=>nil, "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/stud-0.0.23/lib/stud/interval.rb:89:in `sleep'"}]}}

It's troublesome because Logstash can only be stopped with a SIGKILL, and since we don't have persistence enabled in the pipeline, events are lost.
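As an aside (this is not something suggested by anyone in the thread): if events are being lost to the SIGKILL anyway, Logstash's unsafe-shutdown option at least lets the process exit on its own despite stalled plugins. It carries the same risk of dropping in-flight events, so it is a trade-off rather than a fix:

logstash --pipeline.unsafe_shutdown -f /path/to/pipeline.conf

# or equivalently in logstash.yml:
# pipeline.unsafe_shutdown: true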

I've come across the same issue in 6.5.4. The data processing still succeeds despite the stdout error, meaning the data is still being imported into Elasticsearch, but the piped data is also being written to journald, which makes journald very busy.

