Envoy: Envoy dns_filter causes OOM

Created on 6 Aug 2020 · 13 comments · Source: envoyproxy/envoy

2020-08-05T23:12:06.994966Z     critical        envoy assert    panic: out of memory
2020-08-05T23:12:06.995014Z     critical        envoy backtrace Caught Aborted, suspect faulting address 0x53900000010
2020-08-05T23:12:06.995024Z     critical        envoy backtrace Backtrace (use tools/stack_decode.py to get line numbers):
2020-08-05T23:12:06.995026Z     critical        envoy backtrace Envoy version: 601cb2e4053746c53bad5811bd9b5f380228278a/1.16.0-dev/Clean/RELEASE/BoringSSL
2020-08-05T23:12:07.002342Z     critical        envoy backtrace #0: __restore_rt [0x7f351b8a68a0]
2020-08-05T23:12:07.022274Z     critical        envoy backtrace #1: [0x5575cc212089]
2020-08-05T23:12:07.029666Z     critical        envoy backtrace #2: (anonymous namespace)::handle_oom() [0x5575ce2aaaa5]
2020-08-05T23:12:07.038984Z     critical        envoy backtrace #3: tcmalloc::allocate_full_cpp_throw_oom() [0x5575ce315c90]
2020-08-05T23:12:07.046672Z     critical        envoy backtrace #4: fmt::v6::basic_memory_buffer<>::grow() [0x5575cc19895d]
2020-08-05T23:12:07.053475Z     critical        envoy backtrace #5: fmt::v6::visit_format_arg<>() [0x5575cc189c92]
2020-08-05T23:12:07.060924Z     critical        envoy backtrace #6: fmt::v6::internal::parse_format_string<>() [0x5575cc1892b5]
2020-08-05T23:12:07.067446Z     critical        envoy backtrace #7: fmt::v6::internal::vformat<>() [0x5575cc1a4032]
2020-08-05T23:12:07.073824Z     critical        envoy backtrace #8: Envoy::Server::ListenSocketFactoryImpl::createListenSocketAndApplyOptions() [0x5575cd7b9d6e]
2020-08-05T23:12:07.081319Z     critical        envoy backtrace #9: Envoy::Server::ListenSocketFactoryImpl::getListenSocket() [0x5575cd7ba208]
2020-08-05T23:12:07.088832Z     critical        envoy backtrace #10: Envoy::Server::ActiveRawUdpListener::ActiveRawUdpListener() [0x5575cd821236]
2020-08-05T23:12:07.095647Z     critical        envoy backtrace #11: Envoy::Server::ActiveRawUdpListenerFactory::createActiveUdpListener() [0x5575cd7b3ec8]
2020-08-05T23:12:07.102233Z     critical        envoy backtrace #12: Envoy::Server::ConnectionHandlerImpl::addListener() [0x5575cd81d3c7]
2020-08-05T23:12:07.109722Z     critical        envoy backtrace #13: std::__1::__function::__func<>::operator()() [0x5575cd81c72e]
2020-08-05T23:12:07.117505Z     critical        envoy backtrace #14: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x5575cd8271f6]
2020-08-05T23:12:07.125794Z     critical        envoy backtrace #15: event_process_active_single_queue [0x5575cdc93cf7]
2020-08-05T23:12:07.136046Z     critical        envoy backtrace #16: event_base_loop [0x5575cdc9286e]
2020-08-05T23:12:07.155180Z     critical        envoy backtrace #17: Envoy::Server::WorkerImpl::threadRoutine() [0x5575cd81be74]
2020-08-05T23:12:07.165376Z     critical        envoy backtrace #18: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x5575cdd44463]
2020-08-05T23:12:07.165516Z     critical        envoy backtrace #19: start_thread [0x7f351b89b6db]
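
As an aside, the backtrace header points at tools/stack_decode.py. A minimal sketch of using it, based on the usage described in Envoy's bazel docs (the exact invocation may differ; check the script's header). It needs a binary with debug symbols built from the same SHA (601cb2e4...), so the stripped release build above won't symbolize as-is:

# Run a debug-built target under the decoder so "#N [0x...]" frames
# are rewritten to file:line wherever symbols are available.
bazel test -c dbg //test/server:backtrace_test --run_under=@envoy//tools/stack_decode.py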

We have had this happen repeatedly. Thanks to @howardjohn, there is some correlation:

2020-08-06T00:00:03.197497Z     debug   envoy filter    Loading DNS table from external file: Success
2020-08-06T00:00:03.197608Z     debug   envoy config    new fc_contexts has 0 filter chains, including 0 newly built
2020-08-06T00:00:03.197616Z     debug   envoy config    add warming listener: name=dns, hash=2887861284968934043, address=0.0.0.0:15013
2020-08-06T00:00:03.197622Z     debug   envoy misc      Initialize listener dns local-init-manager.
2020-08-06T00:00:03.197625Z     debug   envoy init      init manager Listener-local-init-manager dns 2887861284968934043 contains no targets
2020-08-06T00:00:03.197628Z     debug   envoy init      init manager Listener-local-init-manager dns 2887861284968934043 initialized, notifying Listener-local-init-watcher dns
2020-08-06T00:00:03.197661Z     debug   envoy config    warm complete. updating active listener: name=dns, hash=2887861284968934043, address=0.0.0.0:15013
2020-08-06T00:00:03.197666Z     debug   envoy config    draining listener: name=dns, hash=16912794089181669957, address=0.0.0.0:15013
2020-08-06T00:00:03.197677Z     info    envoy upstream  lds: add/update listener 'dns'
2020-08-06T00:00:03.197688Z     debug   envoy config    Resuming discovery requests for type.googleapis.com/envoy.config.route.v3.RouteConfiguration
2020-08-06T00:00:03.197692Z     debug   envoy config    Resuming discovery requests for type.googleapis.com/envoy.api.v2.RouteConfiguration
2020-08-06T00:00:03.197697Z     debug   envoy config    gRPC config for type.googleapis.com/envoy.config.listener.v3.Listener accepted with 62 resources with version 2020-08-06T00:00:03Z/1671
2020-08-06T00:00:03.197730Z     critical        envoy assert    panic: out of memory

The crash happens right after an xDS update of the DNS (UDP) listener.
https://github.com/istio/istio/issues/26171

Labels: area/dns, bug, help wanted

All 13 comments

2020-08-05T21:38:25.757174Z     critical        envoy backtrace Caught Segmentation fault, suspect faulting address 0x0
2020-08-05T21:38:25.761073Z     critical        envoy backtrace Backtrace (use tools/stack_decode.py to get line numbers):
2020-08-05T21:38:25.761089Z     critical        envoy backtrace Envoy version: 601cb2e4053746c53bad5811bd9b5f380228278a/1.16.0-dev/Clean/RELEASE/BoringSSL
2020-08-05T21:38:25.761274Z     critical        envoy backtrace #0: __restore_rt [0x7f77968928a0]
2020-08-05T21:38:25.790495Z     critical        envoy backtrace #1: Envoy::Network::UdpListenerImpl::send() [0x557c794edb15]
2020-08-05T21:38:25.798024Z     critical        envoy backtrace #2: Envoy::Extensions::UdpFilters::DnsFilter::DnsFilter::sendDnsResponse() [0x557c79218341]
2020-08-05T21:38:25.805647Z     critical        envoy backtrace #3: std::__1::__function::__func<>::operator()() [0x557c7921c318]
2020-08-05T21:38:25.813076Z     critical        envoy backtrace #4: Envoy::Extensions::UdpFilters::DnsFilter::DnsFilterResolver::invokeCallback() [0x557c7921d6ca]
2020-08-05T21:38:25.822320Z     critical        envoy backtrace #5: std::__1::__function::__func<>::operator()() [0x557c7921f11f]
2020-08-05T21:38:25.852765Z     critical        envoy backtrace #6: Envoy::Network::DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback() [0x557c794e8efd]
2020-08-05T21:38:25.875241Z     critical        envoy backtrace #7: end_hquery [0x557c79586f12]
2020-08-05T21:38:25.883623Z     critical        envoy backtrace #8: qcallback [0x557c7958eae3]
2020-08-05T21:38:25.892917Z     critical        envoy backtrace #9: end_query [0x557c7958de55]
2020-08-05T21:38:25.900867Z     critical        envoy backtrace #10: processfds [0x557c7958d21c]
2020-08-05T21:38:25.908194Z     critical        envoy backtrace #11: std::__1::__function::__func<>::operator()() [0x557c794ea3cd]
2020-08-05T21:38:25.915726Z     critical        envoy backtrace #12: event_process_active_single_queue [0x557c79948cf7]
2020-08-05T21:38:25.922806Z     critical        envoy backtrace #13: event_base_loop [0x557c7994786e]
2020-08-05T21:38:25.930302Z     critical        envoy backtrace #14: Envoy::Server::WorkerImpl::threadRoutine() [0x557c794d0e74]
2020-08-05T21:38:25.937601Z     critical        envoy backtrace #15: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x557c799f9463]
2020-08-05T21:38:25.937656Z     critical        envoy backtrace #16: start_thread [0x7f77968876db]

Also a crash (^) on a request. I don't have more info on that one; I was doing a lot of things at the time.

[2020-08-06 00:19:28.315][18][critical][backtrace] Caught Segmentation fault, suspect faulting address 0x347ade30
[2020-08-06 00:19:28.315][18][critical][backtrace] Backtrace (use tools/stack_decode.py to get line numbers):
[2020-08-06 00:19:28.315][18][critical][backtrace] Envoy version: 601cb2e4053746c53bad5811bd9b5f380228278a/1.16.0-dev/Clean/RELEASE/BoringSSL
[2020-08-06 00:19:28.315][18][critical][backtrace] #0: __restore_rt [0x7f6e8aaea8a0]
[2020-08-06 00:19:28.325][18][critical][backtrace] #1: std::__1::__function::__func<>::operator()() [0x5629338c9318]
[2020-08-06 00:19:28.337][18][critical][backtrace] #2: Envoy::Extensions::UdpFilters::DnsFilter::DnsFilterResolver::invokeCallback() [0x5629338ca6ca]
[2020-08-06 00:19:28.349][18][critical][backtrace] #3: std::__1::__function::__func<>::operator()() [0x5629338cc11f]
[2020-08-06 00:19:28.360][18][critical][backtrace] #4: Envoy::Network::DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback() [0x562933b95efd]
[2020-08-06 00:19:28.369][18][critical][backtrace] #5: end_hquery [0x562933c33f12]
[2020-08-06 00:19:28.382][18][critical][backtrace] #6: next_lookup [0x562933c33df6]
[2020-08-06 00:19:28.396][18][critical][backtrace] #7: qcallback [0x562933c3bae3]
[2020-08-06 00:19:28.409][18][critical][backtrace] #8: end_query [0x562933c3ae55]
[2020-08-06 00:19:28.422][18][critical][backtrace] #9: process_answer [0x562933c3b6af]
[2020-08-06 00:19:28.433][18][critical][backtrace] #10: processfds [0x562933c3a0a0]
[2020-08-06 00:19:28.443][18][critical][backtrace] #11: std::__1::__function::__func<>::operator()() [0x562933b983a5]
[2020-08-06 00:19:28.452][18][critical][backtrace] #12: Envoy::Event::FileEventImpl::assignEvents()::$_1::__invoke() [0x562933b8a606]
[2020-08-06 00:19:28.461][18][critical][backtrace] #13: event_process_active_single_queue [0x562933ff5feb]
[2020-08-06 00:19:28.470][18][critical][backtrace] #14: event_base_loop [0x562933ff486e]
[2020-08-06 00:19:28.479][18][critical][backtrace] #15: Envoy::Server::WorkerImpl::threadRoutine() [0x562933b7de74]
[2020-08-06 00:19:28.489][18][critical][backtrace] #16: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x5629340a6463]
[2020-08-06 00:19:28.490][18][critical][backtrace] #17: start_thread [0x7f6e8aadf6db]

This is reproducible in plain Envoy with just a single DNS listener changing address rapidly. I will post a full reproducer soon.

Thanks, @howardjohn! Yes, a reproducer will be very helpful!

It is possible that it is related to the race condition discussed here (https://github.com/envoyproxy/envoy/pull/11914#discussion_r465885114).

$ docker run -p 15013:15013/udp -p 9901:9901 -v $HOME/kube/local:/config -w / --entrypoint envoy envoyproxy/envoy-alpine-debug-dev -c /config/envoy-dns.yaml -l warning --log-format-prefix-with-location 0 --log-format "%v"
there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
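
(Aside: that warning is unrelated to the crash. A hypothetical way to silence it, assuming the default admin runtime layer is active on the admin port from the bootstrap below, is to set the suggested runtime key through the admin API:)

# Set the runtime key via the admin endpoint on 9901; the limit value
# here is purely illustrative.
curl -X POST 'http://localhost:9901/runtime_modify?overload.global_downstream_max_connections=50000'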


Caught Segmentation fault, suspect faulting address 0x130
Backtrace (use tools/stack_decode.py to get line numbers):
Envoy version: 904cc6b36614e0a5c443d11465b408d3a1c1a342/1.16.0-dev/Clean/RELEASE/BoringSSL
#0: [0x7f747650f3d0]
#1: std::__1::__function::__func<>::operator()() [0x55ac720a6648]
#2: Envoy::Extensions::UdpFilters::DnsFilter::DnsFilterResolver::invokeCallback() [0x55ac720a792a]
#3: std::__1::__function::__func<>::operator()() [0x55ac720a9193]
#4: Envoy::Network::DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback() [0x55ac7235acb1]
#5: end_hquery [0x55ac7220c465]
#6: next_lookup [0x55ac7220c346]
#7: qcallback [0x55ac7220fa03]
#8: end_query [0x55ac7220eda5]
#9: process_answer [0x55ac7220f5db]
#10: processfds [0x55ac7220dfe0]
#11: std::__1::__function::__func<>::operator()() [0x55ac7235d185]
#12: Envoy::Event::FileEventImpl::assignEvents()::$_1::__invoke() [0x55ac7234f0f6]
#13: event_process_active_single_queue [0x55ac727e23d8]
#14: event_base_loop [0x55ac727e0cee]
#15: Envoy::Server::WorkerImpl::threadRoutine() [0x55ac72342b84]
#16: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x55ac728781c3]
#17: [0x7f7476503f00]

envoy-dns.yaml:

admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address:
      protocol: TCP
      address: 0.0.0.0
      port_value: 9901
node:
  id: id
  cluster: envoy
dynamic_resources:
  lds_config:
    path: /config/lds-dns.yaml

lds-dns.yaml:

resources: [{
  "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
  "name": "dns",
  "address": {
   "socket_address": {
    "protocol": "UDP",
    "address": "0.0.0.0",
    "port_value": 15013
   }
  },
  "listener_filters": [
   {
    "name": "envoy.filters.udp.dns_filter",
    "typed_config": {
     "@type": "type.googleapis.com/envoy.extensions.filters.udp.dns_filter.v3alpha.DnsFilterConfig",
     "stat_prefix": "dns",
     "server_config": {
      "inline_dns_table": {
       "virtual_domains": [
        {
         "name": "fake.local",
         "endpoint": {
          "address_list": {
           "address": [
            "1.2.3.4"
           ]
          }
         }
        },
        {
         "name": "other.fake.local",
         "endpoint": {
          "address_list": {
           "address": [
            "2.3.4.5"
           ]
          }
         }
        }
       ],
       "known_suffixes": [
        {
         "safe_regex": {
          "google_re2": {},
          "regex": ".*"
         }
        }
       ]
      }
     },
     "client_config": {
      "resolver_timeout": "10s",
      "max_pending_lookups": "256"
     }
    }
   }
  ],
  "traffic_direction": "OUTBOUND",
  "reuse_port": true
}]

Switch lds-dns.yaml back and forth repeatedly between what is posted above and a copy with other.fake.local removed (a sketch of that flip loop follows the nslookup loop below). At the same time, run:

for i in {0..1000}; do
    nslookup -port=15013 fake.local. localhost | grep Address
done
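
A rough sketch of the flip loop, where lds-a.yaml and lds-b.yaml are hypothetical copies of the listener config above with and without other.fake.local. Envoy's filesystem config watcher fires on a file being moved into the watched path, so the mv (rather than a plain cp) is what triggers each LDS update:

while true; do
    for f in lds-a.yaml lds-b.yaml; do
        # Stage a temp copy, then atomically move it over the watched path
        # ($HOME/kube/local is the directory mounted as /config above).
        cp "$f" "$HOME/kube/local/lds-dns.yaml.tmp"
        mv "$HOME/kube/local/lds-dns.yaml.tmp" "$HOME/kube/local/lds-dns.yaml"
        sleep 0.5
    done
done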

While I have the reproducer above (^), it was artificial. The OOM and the other panic are seen in a real environment.

I can trivially reproduce the OOM, but only in an Istio environment; I haven't gotten it in a pure Envoy environment.

Thanks for the reproducer @howardjohn, I can reproduce the issue. It seems the PR https://github.com/envoyproxy/envoy/pull/11914 linked by @rshriram fixes it. I baked that into this image: dio123/envoy:11914-dns-fix and did not see the crash.

@dio thanks! With that image, I couldn't reproduce this issue anymore: https://github.com/envoyproxy/envoy/issues/12508#issuecomment-669618203

But I still see the main OOM issue:

2020-08-06T04:28:28.400765Z     critical        envoy assert    panic: out of memory
2020-08-06T04:28:28.400824Z     critical        envoy backtrace Caught Aborted, suspect faulting address 0x53900000011
2020-08-06T04:28:28.400830Z     critical        envoy backtrace Backtrace (use tools/stack_decode.py to get line numbers):
2020-08-06T04:28:28.400835Z     critical        envoy backtrace Envoy version: 601cb2e4053746c53bad5811bd9b5f380228278a/1.16.0-dev/Clean/RELEASE/BoringSSL
2020-08-06T04:28:28.401187Z     critical        envoy backtrace #0: __restore_rt [0x7f77867048a0]
2020-08-06T04:28:28.422105Z     critical        envoy backtrace #1: [0x5556d0548089]
2020-08-06T04:28:28.430418Z     critical        envoy backtrace #2: (anonymous namespace)::handle_oom() [0x5556d25e0aa5]
2020-08-06T04:28:28.437321Z     critical        envoy backtrace #3: tcmalloc::allocate_full_cpp_throw_oom() [0x5556d264bc90]
2020-08-06T04:28:28.443532Z     critical        envoy backtrace #4: fmt::v6::basic_memory_buffer<>::grow() [0x5556d04ce95d]
2020-08-06T04:28:28.449246Z     critical        envoy backtrace #5: fmt::v6::visit_format_arg<>() [0x5556d04bfc92]
2020-08-06T04:28:28.455528Z     critical        envoy backtrace #6: fmt::v6::internal::parse_format_string<>() [0x5556d04bf2b5]
2020-08-06T04:28:28.460893Z     critical        envoy backtrace #7: fmt::v6::internal::vformat<>() [0x5556d04da032]
2020-08-06T04:28:28.466432Z     critical        envoy backtrace #8: Envoy::Server::ListenSocketFactoryImpl::createListenSocketAndApplyOptions() [0x5556d1aefd6e]
2020-08-06T04:28:28.471869Z     critical        envoy backtrace #9: Envoy::Server::ListenSocketFactoryImpl::getListenSocket() [0x5556d1af0208]
2020-08-06T04:28:28.477304Z     critical        envoy backtrace #10: Envoy::Server::ActiveRawUdpListener::ActiveRawUdpListener() [0x5556d1b57236]
2020-08-06T04:28:28.482779Z     critical        envoy backtrace #11: Envoy::Server::ActiveRawUdpListenerFactory::createActiveUdpListener() [0x5556d1ae9ec8]
2020-08-06T04:28:28.488218Z     critical        envoy backtrace #12: Envoy::Server::ConnectionHandlerImpl::addListener() [0x5556d1b533c7]
2020-08-06T04:28:28.493671Z     critical        envoy backtrace #13: std::__1::__function::__func<>::operator()() [0x5556d1b5272e]
2020-08-06T04:28:28.499123Z     critical        envoy backtrace #14: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x5556d1b5d1f6]
2020-08-06T04:28:28.504590Z     critical        envoy backtrace #15: event_process_active_single_queue [0x5556d1fc9cf7]
2020-08-06T04:28:28.510030Z     critical        envoy backtrace #16: event_base_loop [0x5556d1fc886e]
2020-08-06T04:28:28.515453Z     critical        envoy backtrace #17: Envoy::Server::WorkerImpl::threadRoutine() [0x5556d1b51e74]
2020-08-06T04:28:28.520881Z     critical        envoy backtrace #18: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x5556d207a463]
2020-08-06T04:28:28.520910Z     critical        envoy backtrace #19: start_thread [0x7f77866f96db]

I also disabled our iptables capturing of DNS and just directly forwarded the DNS requests; same issue. I believe this proves it's not some infinite-iptables-loop type of issue.

@howardjohn I'm interested in your Istio setup for reproducing this. Could you share it in that Istio ticket? Thanks!

@dio if you check out istio/istio (I ran at bc1ea95017bd7ef99994998de2c2aeeccfda1736), then run:

kind create cluster # Happens on any cluster, kind is just simple
go test -v -p 1 ./tests/integration/pilot/ -run TestLocality -count 100 --istio.test.kube.loadbalancer=false --istio.test.nocleanup

This will deploy a bunch of pods to the echo-******** namespace.

It may take a few tries, but eventually you should see all the pods crash. kubectl logs c-v1-7bb495cfbf-p22dz -c istio-proxy -p will show the crash.

One interesting thing I found: when I just ran it, it took a few minutes to trigger, but when it did, 4 out of 5 pods all hit this at the same time.

@mattklein123 I have tried with #11914; it's still broken. See https://github.com/envoyproxy/envoy/issues/12508#issuecomment-669677645. It fixed one crash, but not the main OOM issue we see.

cc @abaptiste
