CC @andrewvc @danhermann
@yaauie logged into my build server and grabbed some data during a stalled build. This is with 6.2.2. His theory is that there's some kind of thread shutdown issue specifically related to using the stdin input. I've seen it in 3 or 4 of our different demo builds. For example, one where we actually write custom ruby code:
[2018-03-05T02:27:16,328][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{["LogStash::Filters::Ruby", {"code"=>"\n if event.get(\"type\") == \"business\"\n category_city = []\n event.get(\"categories\").each { |category|\n category_city.push(category + \" \" + event.get(\"city\"))\n }\n event.set(\"category_city\", category_city)\n event.set(\"businessid_name\", event.get(\"business_id\") + \" \" + event.get(\"name\"))\n event.set(\"[@metadata][document_id]\", \"business \" + event.get(\"business_id\"))\n elsif event.get(\"type\") == \"user\"\n if event.get(\"name\")\n event.set(\"userid_name\", event.get(\"user_id\") + \" \" + event.get(\"name\"))\n else\n event.set(\"userid_name\", event.get(\"user_id\"))\n end\n if event.get(\"friends\")\n event.set(\"friends_count\", event.get(\"friends\").length)\n else\n event.set(\"friends_count\", 0)\n event.set(\"friends\", [])\n end\n event.set(\"[@metadata][document_id]\", \"user \" + event.get(\"user_id\"))\n elsif event.get(\"type\") == \"review\"\n event.set(\"[@metadata][document_id]\", \"review \" + event.get(\"review_id\"))\n end\n ", "id"=>"4a339ab35bee8e587299b2374d5a59c5eb08c31b1b94861e882c3ed7dca568be"}]=>[{"thread_id"=>34, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}, {"thread_id"=>37, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}, {"thread_id"=>42, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}, {"thread_id"=>45, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}]}}
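For readability, here is the Ruby filter code that appears escaped inside the log entry above. The `Event` class below is a tiny stand-in, not the real Logstash Event API — just enough to run the snippet outside Logstash:

```ruby
# Minimal stand-in for the Logstash Event API (get/set on flat string keys).
# NOT the real API: real events resolve field references like
# "[@metadata][document_id]" as nested paths; here they are literal keys.
class Event
  def initialize(data); @data = data; end
  def get(key); @data[key]; end
  def set(key, value); @data[key] = value; end
end

# The filter body from the log entry, unescaped.
def filter(event)
  if event.get("type") == "business"
    category_city = []
    event.get("categories").each { |category|
      category_city.push(category + " " + event.get("city"))
    }
    event.set("category_city", category_city)
    event.set("businessid_name", event.get("business_id") + " " + event.get("name"))
    event.set("[@metadata][document_id]", "business " + event.get("business_id"))
  elsif event.get("type") == "user"
    if event.get("name")
      event.set("userid_name", event.get("user_id") + " " + event.get("name"))
    else
      event.set("userid_name", event.get("user_id"))
    end
    if event.get("friends")
      event.set("friends_count", event.get("friends").length)
    else
      event.set("friends_count", 0)
      event.set("friends", [])
    end
    event.set("[@metadata][document_id]", "user " + event.get("user_id"))
  elsif event.get("type") == "review"
    event.set("[@metadata][document_id]", "review " + event.get("review_id"))
  end
end
```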
But today I actually saw it in another build that doesn't itself have custom Ruby code, but which I assume uses filters that are written in Ruby:
sa-demo-singaporelta-ls-build | [2018-03-12T14:22:18,293][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{["LogStash::Filters::Mutate", {"add_field"=>{"[geo][lat]"=>"%{Latitude}", "[geo][lon]"=>"%{Longitude}"}, "rename"=>{"EquipmentID"=>"equipment_id", "Message"=>"message"}, "remove_field"=>["CreateDate", "Date", "Attribute", "Summary", "Distance", "host", "__metadata", "Latitude", "Longitude", "VMSID"], "id"=>"2e2d6c0960928015af288b2853997830e51683148e7c8539c5d07f0de6bfad4f"}]=>[{"thread_id"=>45, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}]}}
@ctindel, I'm trying to reproduce this locally, but I need a few more details:
FYI, I am able to reproduce this problem on @ctindel's build server with the command below using only the stdin input and elasticsearch output, so I don't think the ruby filter has anything to do with it.
cat /tmp/yelp/yelp_scraper_data/yelp_scraper_review.json | jq -c '._source + {type: ._type}' | logstash -f /tmp/src/yelp_logstash.conf
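The `yelp_logstash.conf` file itself isn't shown in the thread; a minimal stdin-to-Elasticsearch config along these lines (the `hosts` value is an assumption) would exercise the same input/output path, with `document_id` reading the `[@metadata][document_id]` field that the filter sets:

```
input {
  stdin {
    codec => json
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]                   # assumed endpoint
    document_id => "%{[@metadata][document_id]}"
  }
}
```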
Interesting. I assumed it was the ruby filter because it always indicated a ruby thread in the error message.
Yeah, my guess is that the stall reporter doesn't account for the possibility that the pipeline itself has a bug. WDYT, Dan?
If it were a pipeline bug, I'd expect to see the problem with outputs other than ES, but so far, I have been unable to reproduce it with any output other than ES.
Oh, interesting, I was going by the stack trace which is stuck on /logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90.
Maybe that's incorrect somehow?
@andrewvc @danhermann is there some kind of debugging we can turn on? Maybe the stdin input never receives the EOF or something? Or maybe some thread isn't receiving the shutdown signal?
@ctindel, I don't think any of the debug logging that we have will be helpful here because I think it's a multi-threading issue. It's difficult to isolate because it doesn't happen deterministically. I was wondering if running with a single worker thread (-w 1) would be an interim workaround, but I haven't had a chance to try that, yet.
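To illustrate the kind of multi-threading hang being discussed — this is a hypothetical sketch, not Logstash's actual shutdown code — consider several workers blocked on a shared queue, like the `read_batch` calls in the stack traces above. If only one shutdown sentinel is delivered, one worker consumes it and exits while the rest stay blocked forever:

```ruby
# Hypothetical sketch of a stranded-worker shutdown hang; NOT Logstash code.
queue = Queue.new

# Four workers block on the queue, like the read_batch threads in the logs.
workers = 4.times.map do
  Thread.new do
    loop do
      item = queue.pop          # blocks until something arrives
      break if item == :shutdown
    end
  end
end

queue << :shutdown              # only ONE worker receives this and exits
sleep 0.5                       # give that worker time to finish
still_blocked = workers.count(&:alive?)
puts still_blocked              # the other workers remain stalled
```

This is also consistent with `-w 1` being an effective workaround: with a single worker there is only one consumer to wake at shutdown.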
@danhermann
Is there a way for me to try that in the docker image? I guess it just goes on the cmd line? I'll give it a shot.
@ctindel, I think you could either modify the line below in the load_data.sh script or jump into the docker container as you described to me and run it from the command line directly.
cat /tmp/yelp/yelp_scraper_data/yelp_scraper_$t.json | jq -c '._source + {type: ._type}' | logstash -w 1 -f /tmp/src/yelp_logstash.conf
Well I've done a few builds now with -w 1 and haven't run into the problem.
Ship it! ;) I'm going to keep working on it because it's definitely a problem, but hopefully you're at least unblocked on those demos. Do let me know if you run into it with a single worker, though.
I've seen this error a lot, since at least 5.6. Since setting --pipeline.workers 1, it has not recurred. Thank you!
Example errors were like
[2018-03-21T15:05:49,861][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{["LogStash::Filters::CSV", {"columns"=>["id", "portal", "seller", "event", "venue", "start_datetime", "end_datetime", "timezone", "capacity", "oversell_capacity", "sold_quantity", "redeemed_count", "used_capacity", "sold_out", "oversold_out", "pct_sold", "day_of_week", "hour_of_day"], "separator"=>"\t", "id"=>"d91e7dae433f89d60b985407bc03172f8a87d9a1451f85404761f36a22fb4bb1"}]=>[{"thread_id"=>31, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:90:in `read_batch'"}]}}
We believe this PR will resolve this problem for the Ruby execution engine: https://github.com/elastic/logstash/pull/9285. As of 6.3, the Java execution engine should not exhibit this problem, either.
@danhermann what release will that fix ship in? Happy to test it with my stuff when it does just to verify.
@ctindel, it's scheduled to ship in 6.2.4 and 6.3.0+ right now.
I'm using Logstash 6.3.1 and I'm still seeing this error, for example:
[WARN ] 2018-08-17 18:21:25.926 [Ruby-0-Thread-52: /usr/share/logstash/logstash-core/lib/logstash/shutdown_watcher.rb:35] shutdownwatcher - {"inflight_count"=>0, "stalling_thread_info"=>{"other"=>[{"thread_id"=>30, "name"=>"[main]<sqs", "current_call"=>"uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/net/protocol.rb:158:in `wait'"}], ["LogStash::Filters::Mutate", {"add_field"=>{"version"=>"1.1"}, "id"=>"9c42ab17107a9c675abaced288ad676fb9a01aec597e1af05aafbaf2a573918d"}]=>[{"thread_id"=>24, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}, {"thread_id"=>25, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}, {"thread_id"=>26, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}, {"thread_id"=>27, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}]}}
For what it's worth, I'm using Logstash 6.4.0 with the udp input and elasticsearch output, and I'm also seeing this error:
[2018-09-19T16:38:33,442][WARN ][org.logstash.execution.ShutdownWatcherExt] {"inflight_count"=>0, "stalling_threads_info"=>{["LogStash::Filters::Mutate", {"add_field"=>{"typeogram"=>"log_message"}, "id"=>"2c6110795d82d46bc3efab19e33aadd1367100914e23c3d7b97d697180e3319b"}]=>[{"thread_id"=>22, "name"=>nil, "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/stud-0.0.23/lib/stud/interval.rb:89:in `sleep'"}]}}
Also seeing this issue in 6.4.0, with http and kafka inputs and the elasticsearch output:
[2018-10-18T06:41:54,157][WARN ][org.logstash.execution.ShutdownWatcherExt] {"inflight_count"=>0, "stalling_threads_info"=>{"other"=>[{"thread_id"=>26, "name"=>nil, "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/stud-0.0.23/lib/stud/interval.rb:89:in `sleep'"}]}}
It's troublesome because Logstash can then only be stopped with a SIGKILL, and where we don't have persistence in the pipeline, events are lost.
I've come across the same issue in 6.5.4.
The data processing still succeeds despite the stdout error, meaning the data is still being imported into Elasticsearch, but the piped data is being recorded to journald, which makes journald very busy.