Vector: thread 'tokio-runtime-worker' panic

Created on 12 Sep 2020 · 19 comments · Source: timberio/vector

Vector Version

vector 0.11.0 (g50a9c3a x86_64-unknown-linux-musl 2020-09-10)

Vector Configuration File

data_dir = "/vector-data-dir"
[log_schema]
  host_key = "host"
  message_key = "message"
  source_type_key = "source_type"
  timestamp_key = "timestamp"
[sources.internal_metrics]
  type = "internal_metrics"
[sources.kafka]
  type = "kafka"
  bootstrap_servers = "logging-kafka-bootstrap.logging-kafka:9092"
  group_id = "test-vector"
  topics = ["test-vector"]

[transforms.message_parser]
  type = "json_parser"
  inputs = ["kafka"]
  field = "message"
  drop_invalid = true

[sinks.elasticsearch]
  type = "elasticsearch"
  inputs = ["message_parser"]
  host = "http://logging-es-http.logging:9200"
  index = "logstash-vector-test"
  auth.strategy = "basic"
  auth.password = "${ELASTICSEARCH_PASSWORD}"
  auth.user = "${ELASTICSEARCH_USER}"
  # https://github.com/timberio/vector/pull/3716
  #encoding.except_fields = ["index"]

[sinks.prometheus]
  type = "prometheus"
  inputs = ["internal_metrics"]
  address = "0.0.0.0:9598"
  namespace = "vector"

Debug Output


https://gist.github.com/spencergilbert/52f59f015769e19f4723a08da9c29f2f

Expected Behavior


No panic!

Actual Behavior


Panic!

Example Data

Additional Context

References

json_parser bug

All 19 comments

The build might have been a few days old, I'm trying to reproduce with the nightly from 9/11

EDIT: happened with today's as well https://gist.github.com/spencergilbert/088a64260234c9f7e522d00a87a3c73a

I suspect this has something to do with the HTTP stuff, might've been fixed already

@MOZGIII I'll try the latest again today, and also verify this wasn't happening with 0.10.0

I think I originally ran the stable release for long enough without seeing any issues, but it would be good to confirm there's no regression.

I checked the history and didn't find any relevant mentions of a fix either. I had just seen a mention of a fix in Discord, so take it with a grain of salt. This definitely needs to be properly verified.

Noting this was also reported as https://github.com/timberio/vector/issues/3863

I'll be running some tests this morning - if there is anything helpful I can do to help debug, let me know and I'll try to make time for it.

Would be great to run a build with debuginfo present, so stacks are properly annotated - but that would require a custom vector build and docker image. We can help with that.

Sure @MOZGIII, either hit me up in the Discord or share details here and I'd be happy to. I'm currently unable to reproduce this with the 0.10.0-debian image.

@spencergilbert We think we've just made a fix, can you try it out? It's on our main branch now, sadly not in a nightly yet. (Tomorrow!)

Still having an issue with "timberio/vector:nightly-2020-09-17-debian". Not sure if it's the same or not, but it looks similar:
https://gist.github.com/afoninsky/cb6b2fcc8196cbfa19a072f19ce561e3

I've started testing on the following release:

Sep 17 15:25:18.599  INFO vector: Vector has started. version="0.11.0" git_version="v0.9.0-734-g2ad948f" released="Thu, 17 Sep 2020 06:25:48 +0000" arch="x86_64"

The service has lived longer than before without panicking; I'm going to keep it running on nightly and see if I can reproduce.

EDIT: Scratch that - got another panic, pulling the backtrace...

EDIT2: It's panicking much less regularly; it hasn't panicked again since I set the stacktrace env var.

I've just merged #4024, which has two parts:

  1. It should catch and avoid the immediate cause of this panic, allowing vector to continue working normally.
  2. It will collect data about the circumstances leading up to this panic and log it out when the panic would have occurred, which will hopefully help us track down the root cause.

Once this goes out in the next nightly, it'd be great if either of you were able to deploy it to an environment where you've been seeing this issue, see if it continues to occur, and send us the output of any log lines containing Linger polled after ready.

I'll roll it out first thing tomorrow morning, thanks @lukesteensen

Been running the build from today for ~6 hours at this point, no sign of panic or that linger log

Been running without problems - looks like it might be solved?

I've jinxed myself of course:

Sep 22 16:21:34.385  WARN sink{name=elasticsearch type=elasticsearch}: vector::sinks::util::sink: Linger polled after ready. clues=[Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: false }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, 
service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }]
Sep 22 16:21:34.385  WARN sink{name=elasticsearch type=elasticsearch}: vector::sinks::util::sink: Linger polled after ready. clues=[Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: false }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, 
service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }]
Sep 22 16:21:34.385  WARN sink{name=elasticsearch type=elasticsearch}: vector::sinks::util::sink: Linger polled after ready. clues=[Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: false }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, 
service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }]
Sep 22 16:21:34.385  WARN sink{name=elasticsearch type=elasticsearch}: vector::sinks::util::sink: Linger polled after ready. clues=[Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: false }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, 
service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }]
Sep 22 16:21:34.385  WARN sink{name=elasticsearch type=elasticsearch}: vector::sinks::util::sink: Linger polled after ready. clues=[Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: false }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, 
service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }]
Sep 22 16:21:34.389  WARN sink{name=elasticsearch type=elasticsearch}: vector::sinks::util::sink: Linger polled after ready. clues=[Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: false }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, 
service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }, Clue { caller: "linger_elapsed", closing: false, batch_was_full: false, batch_is_empty: false, service_was_not_ready: true }]

This happened today as well, from Discord chat:

Sep 24 11:02:46.395 ERROR transform{name=t1_src_file type=add_fields}: vector::internal_events::add_fields: Field overwritten. self.field=host rate_limit_secs=30
Sep 24 11:02:46.395 ERROR transform{name=t1_src_file type=add_fields}: vector::internal_events::add_fields: "Field overwritten." is being rate limited. rate_limit_secs=5
Sep 24 11:02:46.423  INFO vector::topology::builder: Healthcheck: Passed.
Sep 24 11:02:46.944  INFO vector::topology::builder: Healthcheck: Passed.
thread 'tokio-runtime-worker' panicked at 'Map must not be polled after it returned `Poll::Ready`', /rustc/49cae55760da0a43428eba73abcb659bb70cf2e4/src/libstd/macros.rs:13:23
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Sep 24 11:03:01.653 ERROR sink{name=sink_hmo type=humio_logs}: vector::topology: An error occurred that vector couldn't handle.
Sep 24 11:03:01.653 ERROR transform{name=t2_src_file type=regex}: vector::buffers: sender error: SendError("...")
Sep 24 11:03:01.653  INFO vector: Vector has stopped.
Sep 24 11:03:01.663  INFO vector::topology: Shutting down... Waiting on: src_file, t1_src_file, t2_src_file, sink_splunk_hec. 59 seconds left

Just to summarize the current state here, the change in #4024 seems to have fixed the problem. The diagnostic info it spits out also seems to have confirmed the suspected source of the error, so we should be good to clean things up and close this issue.

The root of the issue was the following code in BatchSink:
https://github.com/timberio/vector/blob/bb5993d349a14d50b8e7d47b9f609e1e30a89400/src/sinks/util/sink.rs#L260-L264
https://github.com/timberio/vector/blob/bb5993d349a14d50b8e7d47b9f609e1e30a89400/src/sinks/util/sink.rs#L189-L198

The first snippet is part of the Sink::poll_complete implementation, where we check whether we should send a new batch and then attempt to do so. The error happens when should_send calls delay.poll() via linger_elapsed, which returns Ready, and then service.poll_ready() returns NotReady, causing poll_complete to return NotReady without sending a request or clearing the linger timeout. This leaves us in a state where the delay future has returned Ready but we haven't done anything with the value. When poll_complete is called again, we follow the same path as before, resulting in another delay.poll() call. This breaks the futures contract that you must never re-poll a future that has returned Ready, causing the panic.

The fix essentially just makes it safe to poll delay again, since we do want to remember that the timer has expired even if the inner service isn't quite ready to make a request yet.

