Vector: thread 'main' panicked during restart

Created on 12 May 2020 · 7 comments · Source: timberio/vector

Hi,
We use Vector 0.9, and during machine re-provisioning we sometimes need to restart Vector to load a new configuration. Occasionally the process panics during shutdown:

-- Logs begin at Thu 2020-05-07 05:44:49 UTC. --
May 11 10:06:46 vector[2718]: May 11 10:06:46.440  INFO vector::topology: Removing transform "haproxy_add_fields"
May 11 10:06:46 vector[2718]: May 11 10:06:46.440  INFO vector::topology: Removing transform "ec2_metadata_haproxy"
May 11 10:06:46 vector[2718]: May 11 10:06:46.440  INFO vector::topology: Removing transform "haproxy_socket_message"
May 11 10:06:46 vector[2718]: May 11 10:06:46.440  INFO vector::topology: Removing transform "haproxy_blacklist_message"
May 11 10:06:46 vector[2718]: May 11 10:06:46.440  INFO vector::topology: Removing transform "haproxy_available_message"
May 11 10:06:46 vector[2718]: May 11 10:06:46.440  INFO vector::topology: Removing transform "haproxy_nobackend_message"
May 11 10:06:46 vector[2718]: thread 'main' panicked at 'no entry found for key', /rustc/b8cedc00407a4c56a3bda1ed605c6fc166655447/src/libstd/collections/hash/map.rs:1025:9
May 11 10:06:46 vector[2718]: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
May 11 10:06:46 systemd[1]: vector.service: Main process exited, code=exited, status=101/n/a
May 11 10:06:46 systemd[1]: vector.service: Failed with result 'exit-code'.
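The panic message comes from Rust's `HashMap` `Index` implementation, which panics when the requested key is absent. A minimal standalone illustration (not Vector's code, just the panicking standard-library behavior):

```rust
use std::collections::HashMap;

fn main() {
    let mut components: HashMap<String, u32> = HashMap::new();
    components.insert("trans1".to_string(), 1);

    // Indexing with a present key works:
    assert_eq!(components["trans1"], 1);

    // Indexing with an absent key panics with "no entry found for key",
    // the same message seen in the log above:
    let result = std::panic::catch_unwind(|| components["trans2"]);
    assert!(result.is_err());
}
```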

We got the same message on multiple nodes.

topology bug

All 7 comments

@ktff would you mind taking a look here? Thanks.

Hi @kaarolch, could you share your config file with us so we can reproduce the issue? We only need the topology information ([sources.*], [transforms.*], [sinks.*]) and their type=* and inputs=* settings, so feel free to delete any sensitive information. It would really help.

Sure:


data_dir = "{{ vector.datadir }}"

[sources.internal_metrics]
  type = "internal_metrics"

#
# Add EC2 metadata into Nginx access logs
#
[transforms.ec2_metadata_internal_metrics]
  type = "aws_ec2_metadata" # required
  inputs = ["internal_metrics"] # required

[transforms.tags_internal_metrics]
  # General
  type = "add_tags" # required
  inputs = ["ec2_metadata_internal_metrics"] # required
#
# Nginx
#
[sources.syslog_nginx_access]
  # General
  type = "syslog" # required

[sources.syslog_nginx_error]
  # General
  type = "syslog" # required

#
# Haproxy
#
[sources.syslog_haproxy_traffic]
  # General
  type = "syslog" # required

[sources.syslog_haproxy_internal]
  # General
  type = "syslog" # required

[sources.syslog_haproxy_healthchecks]
  # General
  type = "syslog" # required
#
#  Nginx
#
[transforms.nginx_json_parse]
  type = "json_parser" # required
  inputs = ["syslog_nginx_access"] # required

[transforms.nginx_json_parse_message]
  type = "json_parser" # required
  inputs = ["nginx_json_parse"] # required


#
# Add EC2 metadata into Nginx access logs
#
[transforms.ec2_metadata_nginx_access]
  type = "aws_ec2_metadata" # required
  inputs = ["nginx_json_parse_message"] # required

[transforms.nginx_error_json_parse]
  type = "json_parser" # required
  inputs = ["syslog_nginx_error"] # required

[transforms.nginx_error_json_parse_message]
  type = "json_parser" # required
  inputs = ["nginx_error_json_parse"] # required

[transforms.ec2_metadata_nginx_error]
  type = "aws_ec2_metadata" # required
  inputs = ["nginx_error_json_parse_message"] # required

[transforms.nginx_error_add_fields]
  # General
  type = "add_fields" # required
  inputs = ["ec2_metadata_nginx_error"] # required

# Nginx access split status string codes if any

[transforms.nginx_access_split_statuss_1]
  type = "split" # required
  inputs = ["ec2_metadata_nginx_access"] # required

[transforms.nginx_access_split_status]
  type = "split" # required
  inputs = ["nginx_access_split_statuss_1"] # required

# Nginx access: add fields

[transforms.nginx_add_fields]
  # General
  type = "add_fields" # required
  inputs = ["nginx_access_split_status"] # required

[transforms.nginx_rename_fields]
  # General
  type = "rename_fields" # required
  inputs = ["nginx_add_fields"] # required

#
# Haproxy
#

#
# Parse traffic logs
#

[transforms.haproxy_traffic_parse_message]
  type = "json_parser" # required
  inputs = ["syslog_haproxy_traffic"] # required

[transforms.haproxy_traffic_parse]
  type = "json_parser" # required
  inputs = ["haproxy_traffic_parse_message"] # required

#
# Parse healthchecks logs
#
[transforms.haproxy_healthchecks_parse_routing]
  type = "swimlanes"
  inputs = ["syslog_haproxy_healthchecks"]
  [transforms.haproxy_healthchecks_parse_routing.lanes.backend_status]
    type="check_fields"
    "message.starts_with" = "Server"
  [transforms.haproxy_healthchecks_parse_routing.lanes.health_checks]
    type="check_fields"
    "message.starts_with" = "Health"
  [transforms.haproxy_healthchecks_parse_routing.lanes.no_backend]
    type="check_fields"
    "message.ends_with" = "available!"

[transforms.haproxy_healthchecks_parse_message_1]
  type = "regex_parser"
  inputs = ["haproxy_healthchecks_parse_routing.backend_status"]

[transforms.haproxy_healthchecks_parse_message_2]
  type = "regex_parser"
  inputs = ["haproxy_healthchecks_parse_routing.health_checks"]

[transforms.haproxy_healthchecks_parse_message_3]
  type = "regex_parser" # required
  inputs = ["haproxy_healthchecks_parse_message_1", "haproxy_healthchecks_parse_message_2"] # required


[transforms.haproxy_healthchecks_parse_message_4]
  type = "regex_parser"
  inputs = ["haproxy_healthchecks_parse_routing.no_backend"]

[transforms.haproxy_healthchecks_route_serverity]
  type = "swimlanes"
  inputs = ["haproxy_healthchecks_parse_message_3"]
  [transforms.haproxy_healthchecks_route_serverity.lanes.ok]
    type="check_fields"
    "status.regex" = "succeeded|UP"
  [transforms.haproxy_healthchecks_route_serverity.lanes.warning]
    type="check_fields"
    "status.regex" = "failed|DOWN"

[transforms.haproxy_healthchecks_modify_serverity_1]
  type = "add_fields"
  inputs = ["haproxy_healthchecks_route_serverity.warning"]

[transforms.haproxy_healthchecks_modify_serverity_2]
  type = "add_fields"
  inputs = ["haproxy_healthchecks_parse_message_4"]

[transforms.haproxy_healthchecks_parse]
  type = "json_parser" # required
  inputs = ["haproxy_healthchecks_modify_serverity_1","haproxy_healthchecks_modify_serverity_2","haproxy_healthchecks_route_serverity.ok"] # required

#
# Parse internal logs
#
[transforms.haproxy_internal_parse_routing]
  type = "swimlanes"
  inputs = ["syslog_haproxy_internal"]
  [transforms.haproxy_internal_parse_routing.lanes.parsing_problem]
    type="check_fields"
    "message.starts_with" = "parsing"
  [transforms.haproxy_internal_parse_routing.lanes.stopping]
    type="check_fields"
    "message.starts_with" = "Stopping"
  [transforms.haproxy_internal_parse_routing.lanes.proxy_stop]
    type="check_fields"
    "message.starts_with" = "Proxy"
  [transforms.haproxy_internal_parse_routing.lanes.missing_backend]
    type="check_fields"
    "message.contains" = "use_backend"

[transforms.haproxy_internal_parse_message_1]
  type = "regex_parser"
  inputs = ["haproxy_internal_parse_routing.parsing_problem"]

[transforms.haproxy_internal_parse_message_2]
  type = "regex_parser"
  inputs = ["haproxy_internal_parse_routing.stopping"]

[transforms.haproxy_internal_parse_message_3]
  type = "regex_parser"
  inputs = ["haproxy_internal_parse_routing.proxy_stop"]

[transforms.haproxy_internal_parse_message_4]
  type = "regex_parser"
  inputs = ["haproxy_internal_parse_routing.missing_backend"]

[transforms.haproxy_internal_modify_serverity]
  type = "add_fields"
  inputs = ["haproxy_internal_parse_message_1","haproxy_internal_parse_message_4"]

[transforms.haproxy_internal_parse]
  type = "json_parser" # required
  inputs = ["haproxy_internal_parse_message_2","haproxy_internal_parse_message_3","haproxy_internal_modify_serverity"] # required

#
# Add EC2 metadata into Haproxy traffic logs
#

[transforms.haproxy_traffic_ec2_metadata]
  type = "aws_ec2_metadata" # required
  inputs = ["haproxy_traffic_parse"] # required

#
# Add EC2 metadata into Haproxy healthchecks logs
#

[transforms.haproxy_healthchecks_ec2_metadata]
  type = "aws_ec2_metadata" # required
  inputs = ["haproxy_healthchecks_parse"] # required


#
# Add EC2 metadata into Haproxy internal logs
#

[transforms.haproxy_internal_ec2_metadata]
  type = "aws_ec2_metadata" # required
  inputs = ["haproxy_internal_parse"] # required

#
# Split fields for traffic logs
#

[transforms.haproxy_traffic_split_fields]
  type = "split" # required
  inputs = ["haproxy_traffic_ec2_metadata"] # required

#
# Split & parse healthcheck fields
#

[transforms.haproxy_healthchecks_split_fields]
  type = "split" # required
  inputs = ["haproxy_healthchecks_ec2_metadata"]

#
# Split fields for internal haproxy logs
#

[transforms.haproxy_internal_split_fields]
  type = "split" # required
  inputs = ["haproxy_internal_ec2_metadata"] # required

#
# Convert time to seconds
#

[transforms.haproxy_traffic_time_conversion]
  type = "lua" # required
  inputs = ["haproxy_traffic_split_fields"] # required
  version = "2" # required

[transforms.haproxy_healthchecks_time_conversion]
  type = "lua" # required
  inputs = ["haproxy_healthchecks_split_fields"] # required
  version = "2" # required

[transforms.haproxy_internal_time_conversion]
  type = "lua" # required
  inputs = ["haproxy_internal_split_fields"] # required
  version = "2" # required
#
# Add required fields
#

[transforms.haproxy_traffic_add_fields]
  # General
  type = "add_fields" # required
  inputs = ["haproxy_traffic_time_conversion"] # required

[transforms.haproxy_healthchecks_add_fields]
  # General
  type = "add_fields" # required
  inputs = ["haproxy_healthchecks_time_conversion"] # required


[transforms.haproxy_internal_add_fields]
  # General
  type = "add_fields" # required
  inputs = ["haproxy_internal_time_conversion"] # required

#
# Rename fields
#

[transforms.haproxy_rename_fields]
  # General
  type = "rename_fields" # required
  inputs = ["haproxy_internal_add_fields", "haproxy_healthchecks_add_fields"] # required_names

[transforms.haproxy_traffic_rename_fields]
  # General
  type = "rename_fields" # required
  inputs = ["haproxy_traffic_add_fields"] # required

[sinks.s3_archives_nginx]
  inputs       = ["syslog_nginx_access","syslog_nginx_error"] # don't sample
  type         = "aws_s3"

[sinks.s3_archives_haproxy]
  inputs       = ["syslog_haproxy_traffic","syslog_haproxy_healthchecks","syslog_haproxy_internal"] # don't sample
  type         = "aws_s3"

[sinks.datadog_logs_nginx]
  # General
  type = "datadog_logs" # required
  inputs = ["nginx_rename_fields","nginx_error_add_fields"] # required

[sinks.datadog_logs_haproxy]
  # General
  type = "datadog_logs" # required
  inputs = ["haproxy_rename_fields","haproxy_traffic_rename_fields"] # required

Let me know if you need anything more.

Well, this was a head spinner. Here is a minimal config for reproduction:

[sources.input]
  type = "stdin"

[transforms.trans1]
  type = "json_parser"
  inputs = ["input"]
# Delete this
  drop_field = false 

# Delete this
[transforms.trans2]
  type = "json_parser"
  inputs = ["trans1"] 

[sinks.output1]
  inputs       = ["trans1"]
  type         = "blackhole"
  print_amount = 1000 

# Delete this
[sinks.output2]
  inputs       = ["trans2"]
  type         = "blackhole"
  print_amount = 1000 

Steps to reproduce:

  1. Run Vector with the above config and with the -w watch flag.
  2. Delete the sections marked # Delete this, then save.
  3. Watch the panic.

It happens while wiring the outputs of a changed transform to the inputs of other transforms. There, if we find a transform in the old config listed as an input of the changed transform, we assume it is still present, which this example shows is not always true.
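The simpler fix discussed below boils down to replacing a panicking map index with a fallible lookup. A minimal sketch of the pattern, with illustrative names rather than Vector's actual internals:

```rust
use std::collections::HashMap;

// Hypothetical rewiring step: look up the inputs of a component in the
// old config. The component may already have been removed by the reload.
fn inputs_of(old_config: &HashMap<String, Vec<String>>, name: &str) -> Vec<String> {
    // Panics with "no entry found for key" if `name` was deleted:
    //     old_config[name].clone()
    //
    // Fallible lookup: treat a missing entry as "no longer running".
    old_config.get(name).cloned().unwrap_or_default()
}

fn main() {
    let mut old_config = HashMap::new();
    old_config.insert("trans1".to_string(), vec!["input".to_string()]);

    assert_eq!(inputs_of(&old_config, "trans1"), vec!["input".to_string()]);
    // "trans2" was removed from the config: no panic, just empty inputs.
    assert!(inputs_of(&old_config, "trans2").is_empty());
}
```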

Before v0.9.1 this was only triggered by a (transform, transform) pair, but with v0.9.1 (#2449) it also became a problem for a (transform, sink) pair, so the following config panics as well:

[sources.input]
  type = "stdin"

[transforms.trans]
  type = "json_parser"
  inputs = ["input"]
# Delete this
  drop_field = false 

# Delete this
[sinks.output2]
  inputs       = ["trans"]
  type         = "blackhole"
  print_amount = 1000 

[sinks.output1]
  inputs       = ["trans"]
  type         = "blackhole"
  print_amount = 1000 

Fix

There are two ways to fix it:

  • Make the config reflect the real state of the topology. That is, model the transitional state in which part of the topology has been shut down and new/changed components are being prepared to start.

    • More complex.

    • Is more correct.

    • Opens a can of worms of other transitional states.

  • Don't assume, in select functions, that everything in the topology config is running.

    • Quite simple.

    • It's not that it's incorrect, but the logical connection between the config and the running topology is relaxed.

I am for the simpler version, plus documenting that the running topology doesn't correspond to the config during this period.

I am also in favor of the simple solution for right now. Ideally, we can address this case, add a test to make sure it doesn't happen again, and then think about more "correct" management of this complexity in the future.

Also, awesome job tracking this down @ktff! Topology changes can be very difficult to wrap your head around :smile:

Yeah, I put it too strongly. It isn't more correct, just plain more complex and not needed at the moment.
