We have seen this crash a few times. Somehow we are trying to delete a host that the health checker doesn't know about. I would imagine this somehow has to do with us specifying the same DNS host multiple times but I'm not sure.
cc @dio @snowp for recent work in this area. cc @tonya11en who can explain the config we use that might be causing this. I haven't looked into this in any detail yet.
Backtrace:
#0 operator() (this=0x6ba7428, __den=23, __num=<optimized out>) at /usr/include/c++/5/bits/hashtable_policy.h:446
#1 _M_bucket_index (this=0x6ba7428, __n=23, __p=0x0) at /usr/include/c++/5/bits/hashtable_policy.h:1180
#2 _M_bucket_index (this=0x6ba7428, __n=0x0) at /usr/include/c++/5/bits/hashtable.h:617
#3 erase (__it=..., this=0x6ba7428) at /usr/include/c++/5/bits/hashtable.h:1732
#4 erase (__it=..., this=0x6ba7428) at /usr/include/c++/5/bits/hashtable.h:745
#5 erase (__position=..., this=0x6ba7428) at /usr/include/c++/5/bits/unordered_map.h:523
#6 Envoy::Upstream::HealthCheckerImplBase::onClusterMemberUpdate (this=0x6ba7350, hosts_added=..., hosts_removed=...) at external/envoy/source/common/upstream/health_checker_base_impl.cc:126
#7 0x000000000067448f in operator() (__args#2=..., __args#1=..., __args#0=1, this=<optimized out>) at /usr/include/c++/5/functional:2267
#8 runCallbacks (args#2=..., args#1=..., args#0=1, this=<optimized out>) at bazel-out/k8-opt/bin/external/envoy/source/common/common/_virtual_includes/callback_impl_lib/common/common/callback_impl.h:40
#9 runUpdateCallbacks (hosts_removed=..., hosts_added=..., priority=1, this=<optimized out>) at bazel-out/k8-opt/bin/external/envoy/source/common/upstream/_virtual_includes/upstream_includes/common/upstream/upstream_impl.h:348
#10 operator() (hosts_removed=..., hosts_added=..., priority=1, __closure=<optimized out>) at external/envoy/source/common/upstream/upstream_impl.cc:295
#11 std::_Function_handler<void(unsigned int, const std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > >&, const std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > >&), Envoy::Upstream::PrioritySetImpl::getOrCreateHostSet(uint32_t, absl::optional<unsigned int>)::<lambda(uint32_t, const HostVector&, const HostVector&)> >::_M_invoke(const std::_Any_data &, <unknown type in /usr/sbin/envoy, CU 0x51da897, DIE 0x52faca0>, const std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > > &, const std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > > &) (__functor=..., __args#0=<optimized out>, __args#1=..., __args#2=...) at /usr/include/c++/5/functional:1871
#12 0x000000000067b25c in operator() (__args#2=..., __args#1=..., __args#0=1, this=<optimized out>) at /usr/include/c++/5/functional:2267
#13 runCallbacks (args#2=..., args#1=..., args#0=1, this=0x9ad6a70) at bazel-out/k8-opt/bin/external/envoy/source/common/common/_virtual_includes/callback_impl_lib/common/common/callback_impl.h:40
#14 runUpdateCallbacks (hosts_removed=..., hosts_added=..., this=0x9ad6a20) at bazel-out/k8-opt/bin/external/envoy/source/common/upstream/_virtual_includes/upstream_includes/common/upstream/upstream_impl.h:289
#15 Envoy::Upstream::HostSetImpl::updateHosts (this=this@entry=0x9ad6a20, hosts=..., healthy_hosts=..., hosts_per_locality=..., healthy_hosts_per_locality=..., locality_weights=..., hosts_added=..., hosts_removed=..., overprovisioning_factor=...) at external/envoy/source/common/upstream/upstream_impl.cc:255
#16 0x0000000000685b1e in Envoy::Upstream::PriorityStateManager::updateClusterPrioritySet(unsigned int, std::shared_ptr<std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > > >&&, absl::optional<std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > > > const&, absl::optional<std::vector<std::shared_ptr<Envoy::Upstream::Host>, std::allocator<std::shared_ptr<Envoy::Upstream::Host> > > > const&, absl::optional<Envoy::Upstream::Host::HealthFlag>, absl::optional<unsigned int>) (this=this@entry=0x7ffd1a372bf0, priority=priority@entry=1, current_hosts=current_hosts@entry=<unknown type in /usr/sbin/envoy, CU 0x51da897, DIE 0x5378365>, hosts_added=..., hosts_removed=..., health_checker_flag=..., health_checker_flag@entry=..., overprovisioning_factor=...) at external/envoy/source/common/upstream/upstream_impl.cc:862
#17 0x00000000006864d1 in Envoy::Upstream::StrictDnsClusterImpl::updateAllHosts (this=0x6ba6dc0, hosts_added=..., hosts_removed=..., current_priority=1) at external/envoy/source/common/upstream/upstream_impl.cc:1158
#18 0x0000000000686bc5 in operator() (address_list=<optimized out>, __closure=0x141f81e8) at external/envoy/source/common/upstream/upstream_impl.cc:1210
#19 std::_Function_handler<void(const std::list<std::shared_ptr<const Envoy::Network::Address::Instance>, std::allocator<std::shared_ptr<const Envoy::Network::Address::Instance> > >&&), Envoy::Upstream::StrictDnsClusterImpl::ResolveTarget::startResolve()::<lambda(const std::list<std::shared_ptr<const Envoy::Network::Address::Instance>, std::allocator<std::shared_ptr<const Envoy::Network::Address::Instance> > >&&)> >::_M_invoke(const std::_Any_data &, <unknown type in /usr/sbin/envoy, CU 0x51da897, DIE 0x537de5c>) (__functor=..., __args#0=<unknown type in /usr/sbin/envoy, CU 0x51da897, DIE 0x537de5c>) at /usr/include/c++/5/functional:1871
#20 0x00000000005fb62e in operator() (__args#0=<optimized out>, this=0x141f81e8) at /usr/include/c++/5/functional:2267
#21 Envoy::Network::DnsResolverImpl::PendingResolution::onAresHostCallback (this=0x141f81e0, status=<optimized out>, timeouts=2, hostent=0x0) at external/envoy/source/common/network/dns_impl.cc:124
#22 0x00000000005fcb7a in end_hquery (hquery=0x16a5ccc0, status=<optimized out>, host=0x0) at ../ares_gethostbyname.c:234
#23 0x00000000005fcdfc in next_lookup (hquery=hquery@entry=0x16a5ccc0, status_code=status_code@entry=12) at ../ares_gethostbyname.c:176
#24 0x00000000005fcf72 in host_callback (arg=0x16a5ccc0, status=12, timeouts=<optimized out>, abuf=<optimized out>, alen=<optimized out>) at ../ares_gethostbyname.c:228
#25 0x0000000000603c2c in end_squery (squery=0xcf4a100, status=<optimized out>, abuf=<optimized out>, alen=<optimized out>) at ../ares_search.c:210
#26 0x0000000000603d12 in search_callback (arg=0xcf4a100, status=<optimized out>, timeouts=<optimized out>, abuf=<optimized out>, alen=<optimized out>) at ../ares_search.c:157
#27 0x00000000006039bb in qcallback (arg=0x17c87a60, status=<optimized out>, timeouts=<optimized out>, abuf=<optimized out>, alen=<optimized out>) at ../ares_query.c:183
#28 0x0000000000602084 in end_query (channel=0x2194000, query=0x167acc30, status=12, abuf=abuf@entry=0x0, alen=alen@entry=0) at ../ares_process.c:1437
#29 0x00000000006029a7 in next_server (channel=channel@entry=0x2194000, query=<optimized out>, now=now@entry=0x7ffd1a3730c0) at ../ares_process.c:802
#30 0x00000000006036f4 in process_timeouts (now=0x7ffd1a3730c0, channel=0x2194000) at ../ares_process.c:561
#31 processfds (channel=0x2194000, read_fds=read_fds@entry=0x0, read_fd=read_fd@entry=-1, write_fds=write_fds@entry=0x0, write_fd=write_fd@entry=-1) at ../ares_process.c:132
#32 0x0000000000603973 in ares_process_fd (channel=<optimized out>, read_fd=read_fd@entry=-1, write_fd=write_fd@entry=-1) at ../ares_process.c:152
#33 0x00000000005fbd9c in onEventCallback (events=0, fd=-1, this=0x20d4720) at external/envoy/source/common/network/dns_impl.cc:168
#34 operator() (__closure=<optimized out>) at external/envoy/source/common/network/dns_impl.cc:27
#35 std::_Function_handler<void(), Envoy::Network::DnsResolverImpl::DnsResolverImpl(Envoy::Event::Dispatcher&, const std::vector<std::shared_ptr<const Envoy::Network::Address::Instance> >&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/5/functional:1871
#36 0x00000000008aef80 in event_process_active_single_queue (base=base@entry=0x21922c0, max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0, activeq=<optimized out>) at ../event.c:1646
#37 0x00000000008af65f in event_process_active (base=base@entry=0x21922c0) at ../event.c:1738
#38 0x00000000008b23b8 in event_base_loop (base=0x21922c0, flags=<optimized out>) at ../event.c:1961
#39 0x00000000005b75cc in Envoy::Server::InstanceImpl::run (this=0x2192000) at external/envoy/source/server/server.cc:460
#40 0x0000000000452671 in Envoy::MainCommonBase::run (this=this@entry=0x20f0870) at external/envoy/source/exe/main_common.cc:102
#41 0x0000000000409a53 in run (this=<optimized out>) at bazel-out/k8-opt/bin/external/envoy/source/exe/_virtual_includes/envoy_main_common_lib/exe/main_common.h:81
#42 main (argc=17, argv=0x7ffd1a3734b8) at external/envoy/source/exe/main.cc:37
@tonya11en when you have time please share the config with us. Thanks!
It's hard to say without looking at the config and knowing what the DNS names resolve to, but I suspect this is due to two resolve targets sharing the same host object when they resolve to the same IP (by way of all_hosts_).
This fails: https://github.com/envoyproxy/envoy/compare/master...snowp:strict-dns?expand=1, showing that the host is reused between resolve targets. I imagine this can lead to the same host object being passed as hosts_removed multiple times, causing this kind of failure.
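The suspected failure mode can be illustrated with a toy model. This is a minimal sketch, not Envoy's code: `Host`, `ToyHealthChecker`, and `simulateSharedHostRemoval` are hypothetical names, and the map stands in for whatever per-host state the health checker keeps. It assumes two resolve targets hold the same host object and each reports it as removed. The second removal finds no entry, and calling `erase` on the resulting `end()` iterator would be undefined behavior, which is consistent with the null node pointer in frame #1 of the backtrace.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration only; not Envoy's real types.
struct Host {
  std::string address;
};
using HostPtr = std::shared_ptr<Host>;

class ToyHealthChecker {
public:
  void onHostsAdded(const std::vector<HostPtr>& hosts) {
    for (const auto& host : hosts) {
      sessions_.emplace(host, 0); // no-op if the host is already tracked
    }
  }

  // Returns how many removals referred to a host that is not tracked.
  std::size_t onHostsRemoved(const std::vector<HostPtr>& hosts) {
    std::size_t unknown = 0;
    for (const auto& host : hosts) {
      auto it = sessions_.find(host);
      if (it == sessions_.end()) {
        // Erasing end() here would be undefined behavior, so skip it.
        ++unknown;
        continue;
      }
      sessions_.erase(it);
    }
    return unknown;
  }

  std::size_t size() const { return sessions_.size(); }

private:
  std::unordered_map<HostPtr, int> sessions_;
};

// Simulates two resolve targets that share one Host object and each
// report it as removed.
std::size_t simulateSharedHostRemoval() {
  ToyHealthChecker checker;
  auto shared = std::make_shared<Host>(Host{"10.0.0.1:9201"});
  checker.onHostsAdded({shared});
  std::size_t unknown = checker.onHostsRemoved({shared}); // tracked, erased
  unknown += checker.onHostsRemoved({shared});            // already gone
  assert(checker.size() == 0);
  return unknown;
}
```

The guard in `onHostsRemoved` only hides the symptom in this model. Whether the right fix is a guard or removing the sharing between resolve targets is a separate design question.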
@snowp yes, I think that is exactly what is happening. We specify the same DNS name multiple times, I'm assuming to help with load spread. @tonya11en can provide more info.
The failure seen in @snowp's unit test looks plausible for us given what we've configured. Our config looks something like this:
- common_http_protocol_options: {idle_timeout: 60s}
  connect_timeout: 1s
  dns_lookup_family: V4_ONLY
  http2_protocol_options: {}
  lb_policy: ROUND_ROBIN
  load_assignment:
    cluster_name: egress_interconnect_googleapis_www
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: endpoint_A.lyft.net, port_value: 9201,
              protocol: TCP}
      priority: 0
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: endpoint_A.lyft.net, port_value: 9201,
              protocol: TCP}
      priority: 0
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: endpoint_A.lyft.net, port_value: 9201,
              protocol: TCP}
      priority: 0
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: endpoint_A.lyft.net, port_value: 9201,
              protocol: TCP}
      priority: 0
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: endpoint_A.lyft.net, port_value: 9201,
              protocol: TCP}
      priority: 0
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: endpoint_B.lyft.net, port_value: 9201,
              protocol: TCP}
      priority: 1
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: endpoint_B.lyft.net, port_value: 9201,
              protocol: TCP}
      priority: 1
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: endpoint_B.lyft.net, port_value: 9201,
              protocol: TCP}
      priority: 1
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: endpoint_B.lyft.net, port_value: 9201,
              protocol: TCP}
      priority: 1
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: endpoint_B.lyft.net, port_value: 9201,
              protocol: TCP}
      priority: 1
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: www.googleapis.com, port_value: 443, protocol: TCP}
      priority: 2
I can give more specifics if needed tomorrow and help out.
So we should definitely fix this, and the quickest fix I can think of is to scope all_hosts_ to each resolve target for STRICT_DNS instead of for the entire cluster (which is what EDS does).
This makes the behavior of STRICT_DNS significantly different from EDS wrt host duplicates, but given the difference between the update mechanism in EDS and STRICT_DNS it might not be reasonable to expect them to match. See #4590 which is very much related.
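To make the proposal concrete, here is a minimal sketch of per-target scoping. It is an illustration under assumptions, not the actual changeset: `ToyResolveTarget`, `update`, and `targetsOwnDistinctHosts` are hypothetical names, and addresses are modeled as plain strings. Each target owns its own address-to-host map, so two targets that resolve to the same address create distinct host objects, and a removal reported by one target never refers to a host owned by another.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; not Envoy's real types.
struct Host {
  std::string address;
};
using HostPtr = std::shared_ptr<Host>;

// A resolve target that owns its own address -> host map instead of
// sharing one map across the whole cluster.
class ToyResolveTarget {
public:
  // Applies a DNS result and appends this target's own additions and
  // removals to the output vectors.
  void update(const std::set<std::string>& resolved,
              std::vector<HostPtr>& added, std::vector<HostPtr>& removed) {
    for (const auto& address : resolved) {
      if (hosts_.count(address) == 0) {
        auto host = std::make_shared<Host>(Host{address});
        hosts_.emplace(address, host);
        added.push_back(host);
      }
    }
    for (auto it = hosts_.begin(); it != hosts_.end();) {
      if (resolved.count(it->first) == 0) {
        removed.push_back(it->second);
        it = hosts_.erase(it); // erase returns the next valid iterator
      } else {
        ++it;
      }
    }
  }

private:
  std::map<std::string, HostPtr> hosts_;
};

// Two targets resolving to the same address end up with distinct hosts.
bool targetsOwnDistinctHosts() {
  ToyResolveTarget a;
  ToyResolveTarget b;
  std::vector<HostPtr> added_a, added_b, removed;
  a.update(std::set<std::string>{"10.0.0.1:9201"}, added_a, removed);
  b.update(std::set<std::string>{"10.0.0.1:9201"}, added_b, removed);
  assert(added_a.size() == 1 && added_b.size() == 1);
  return added_a[0] != added_b[0];
}
```

The trade-off in this model is the one noted above: duplicate addresses yield duplicate host objects in the cluster, which differs from how EDS treats duplicates.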
/assign @snowp
@tonya11en could you please try to apply the changeset at #5075 and test against your env when you have time? Thanks!