Title: "/tap admin endpoint is taken" error on hot reloads
Description:
After enabling the tap filter via the config snippet below, we get errors when attempting to hot-reload Envoy. Shutting down Envoy and doing a full restart works flawlessly.
Envoy version 1.12.2.p0.g55af249-1p49.g8b1f2e3, using https://www.getenvoy.io/ packages.
```yaml
- name: envoy.filters.http.tap
  typed_config:
    "@type": type.googleapis.com/envoy.config.filter.http.tap.v2alpha.Tap
    common_config:
      admin_config:
        config_id: some_config_id
```
We are using the hot reload script provided by Envoy.
Here is the backtrace we get when we try to perform a reload while the configuration above is active (log edited for readability, since the original lines are very wide):
```
[info][main] [external/envoy/source/server/server.cc:455] runtime: layers:
- name: base
  static_layer:
    {}
- name: admin
  admin_layer:
    {}
[info][config] [external/envoy/source/server/configuration_impl.cc:62] loading 0 static secret(s)
[info][config] [external/envoy/source/server/configuration_impl.cc:68] loading 2 cluster(s)
[warning][runtime] [external/envoy/source/common/runtime/runtime_impl.cc:63] Unable to use runtime singleton for feature envoy.reloadable_features.strict_header_validation
[warning][runtime] [external/envoy/source/common/runtime/runtime_impl.cc:63] Unable to use runtime singleton for feature envoy.reloadable_features.connection_header_sanitization
[info][config] [external/envoy/source/server/configuration_impl.cc:72] loading 3 listener(s)
[critical][assert] [external/envoy/source/extensions/common/tap/admin.cc:33] assert failure: rc. Details: /tap admin endpoint is taken
[critical][backtrace] Caught Aborted, suspect faulting address 0xb42d
[critical][backtrace] Backtrace (use tools/stack_decode.py to get line numbers):
[critical][backtrace] Envoy version: bb7ceff4c3c5bd4555dff28b6e56d27f2f8be0a7/1.13.0/clean-getenvoy-4bd7718-envoy/RELEASE/BoringSSL
[critical][backtrace] #0: __restore_rt [0x7f6635337890]
[critical][backtrace] #1: std::__1::__function::__func<>::operator()() [0x55f53760faa0]
[critical][backtrace] #2: Envoy::Singleton::ManagerImpl::get() [0x55f5376eb019]
[critical][backtrace] #3: Envoy::Singleton::Manager::getTyped<>() [0x55f53760e80e]
[critical][backtrace] #4: Envoy::Extensions::Common::Tap::AdminHandler::getSingleton() [0x55f53760e6f7]
[critical][backtrace] #5: Envoy::Extensions::Common::Tap::ExtensionConfigBase::ExtensionConfigBase() [0x55f53760da0a]
[critical][backtrace] #6: Envoy::Extensions::HttpFilters::TapFilter::FilterConfigImpl::FilterConfigImpl() [0x55f53700864e]
[critical][backtrace] #7: Envoy::Extensions::HttpFilters::TapFilter::TapFilterFactory::createFilterFactoryFromProtoTyped() [0x55f537005865]
[critical][backtrace] #8: Envoy::Extensions::HttpFilters::Common::FactoryBase<>::createFilterFactoryFromProto() [0x55f537005c2f]
[critical][backtrace] #9: Envoy::Extensions::NetworkFilters::HttpConnectionManager::HttpConnectionManagerConfig::processFilter() [0x55f5376a57d3]
[critical][backtrace] #10: Envoy::Extensions::NetworkFilters::HttpConnectionManager::HttpConnectionManagerConfig::HttpConnectionManagerConfig() [0x55f5376a4373]
[critical][backtrace] #11: Envoy::Extensions::NetworkFilters::HttpConnectionManager::HttpConnectionManagerFilterConfigFactory::createFilterFactoryFromProtoTyped() [0x55f5376a270b]
[critical][backtrace] #12: Envoy::Extensions::NetworkFilters::Common::FactoryBase<>::createFilterFactoryFromProto() [0x55f5376a69b6]
[critical][backtrace] #13: Envoy::Server::ProdListenerComponentFactory::createNetworkFilterFactoryList_() [0x55f537673aeb]
[critical][backtrace] #14: Envoy::Server::ValidationInstance::createNetworkFilterFactoryList() [0x55f53764f384]
[critical][backtrace] #15: Envoy::Server::ListenerFilterChainFactoryBuilder::buildFilterChainInternal() [0x55f53767b0ca]
[critical][backtrace] #16: Envoy::Server::ListenerFilterChainFactoryBuilder::buildFilterChain() [0x55f53767af50]
[critical][backtrace] #17: Envoy::Server::FilterChainManagerImpl::addFilterChain() [0x55f5376819b0]
[critical][backtrace] #18: Envoy::Server::ListenerImpl::ListenerImpl() [0x55f53766faa4]
[critical][backtrace] #19: Envoy::Server::ListenerManagerImpl::addOrUpdateListenerInternal() [0x55f537677c34]
[critical][backtrace] #20: Envoy::Server::ListenerManagerImpl::addOrUpdateListener() [0x55f537677438]
[critical][backtrace] #21: Envoy::Server::Configuration::MainImpl::initialize() [0x55f53769cfeb]
[critical][backtrace] #22: Envoy::Server::ValidationInstance::initialize() [0x55f53764e5e9]
[critical][backtrace] #23: Envoy::Server::ValidationInstance::ValidationInstance() [0x55f53764d2d5]
[critical][backtrace] #24: Envoy::Server::validateConfig() [0x55f53764c4ad]
[critical][backtrace] #25: Envoy::MainCommonBase::run() [0x55f536d3e52a]
[critical][backtrace] #26: main [0x55f536d3d152]
[critical][backtrace] #27: __libc_start_main [0x7f6634f55b97]
```
cc @mattklein123
@perlun I tried the following scenario several times and cannot get the crash you describe (I am running the latest master).
1. Start Envoy via the hot-restarter:
   ```shell
   python hot-restarter.py start_envoy.sh
   ```
2. Connect to `/tap`:
   ```shell
   curl -X POST --data-binary @./body.txt 127.0.0.1:9901/tap
   ```
3. Hot-reload Envoy:
   ```shell
   kill -SIGHUP <pid>
   ```
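For anyone else reproducing this: the `body.txt` posted to `/tap` is a tap admin request. The original file isn't shown in this thread, so the following is only an illustration of what a minimal one might look like, based on the v2alpha tap API (the `config_id` must match the `admin_config` in the filter snippet above; the match/output fields are my own example values):

```yaml
# Hypothetical body.txt for POSTing to the /tap admin endpoint.
config_id: some_config_id       # must match admin_config.config_id in the filter
tap_config:
  match_config:
    any_match: true             # tap every request
  output_config:
    sinks:
      - streaming_admin: {}     # stream tapped traffic back over the /tap connection
```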
I tried various scenarios: hot-reloading while curl is connected, and connecting curl just after the hot-reload, and never got the crash.
I am running Ubuntu 18.04 and run Envoy directly, not in a container.
Is your sequence different?
Thanks @cpakulski. I just happened to run into this again on Ubuntu 18.04, Envoy version as follows:
```
envoy version: 1a0363c885c2dbb1e48b03847dbd706d1ba43eba/1.14.2/clean-getenvoy-fbeeb15-envoy/RELEASE/BoringSSL
```
The way I restarted Envoy was with `sudo systemctl reload envoy.service` or similar (triggered by Ansible). Running it manually gives me this error:
```shell
$ sudo systemctl reload envoy.service
Job for envoy.service failed because a fatal signal was delivered causing the control process to dump core.
See "systemctl status envoy.service" and "journalctl -xe" for details.
```
Envoy is launched via a systemd unit that looks like this:
```
$ cat /etc/systemd/system/envoy.service
[Unit]
Description=Envoy edge and service proxy
Documentation=https://www.envoyproxy.io
After=network.target

[Service]
EnvironmentFile=-/etc/default/envoy
ExecStart=/usr/local/bin/envoy-hot-restarter.py /usr/local/bin/envoy-hot-restarter.sh
ExecReload=/usr/bin/envoy --mode validate --config-path /etc/envoy/envoy.yaml
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
RestartPreventExitStatus=255
Type=simple
User=root
Group=root

[Install]
WantedBy=multi-user.target
```
Let me know if this helps you to reproduce the problem. If not, I can try to extract parts of my envoy.yaml that might be triggering it.
@perlun Thanks for this info. I will try to reproduce the crash using your scripts.
@perlun Do you have a connection open to tap when you do hot-reload?
Nope, this happens without any active tap connections (or rather: any _known_ tap connections. :wink:)
@perlun I managed to repro it. I believe that what happens is that when you reload `envoy.service`, the configuration check (the first `ExecReload` command) is executed while the "old" instance of Envoy is still running. The proper solution is not to attach to shared memory or check tap sockets when running in `--mode validate`, and I will try to code it like that. In the meantime, you can move the config validation into `envoy-hot-restarter.sh`, before the real Envoy is invoked; that script runs after the "old" Envoy has shut down.
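My reading of that interim suggestion, applied to the unit file posted above (a sketch only, not a tested unit): drop the validation `ExecReload` line so the reload merely signals the restarter, and let `envoy-hot-restarter.sh` run the validation itself once the old instance is gone:

```
[Service]
EnvironmentFile=-/etc/default/envoy
ExecStart=/usr/local/bin/envoy-hot-restarter.py /usr/local/bin/envoy-hot-restarter.sh
# Validation moved into envoy-hot-restarter.sh; reload now only signals the restarter.
ExecReload=/bin/kill -HUP $MAINPID
```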
@cpakulski thanks for that, highly appreciated!
I guess the reason for doing the `--mode validate` run in the `ExecReload` stage is that we wanted to avoid triggering a reload if there are syntax errors. Maybe Envoy handles that case gracefully in all situations anyway, and it's safe to trigger a reload with a broken envoy.yaml? Correct me if I misunderstood something, but my understanding of your suggestion is that the validation would only be done before starting Envoy the first time.
@slovdahl What you are trying to do makes perfect sense: verify the syntax of the config file before restarting Envoy. The goal is not to start a "new" instance of Envoy if the new config is wrong, but to keep the "old" instance running.
However, as I explained above, there is a bug in Envoy that prevents running in config-validation mode while another instance of Envoy is running.
> Maybe Envoy handles that case gracefully in all situations anyway, and it's safe to trigger a reload with a broken envoy.yaml?
No, if the new config is wrong, Envoy will not start. So in your situation, if you provide a new (broken) config and do a hot-restart, you will end up with no Envoy running.
> Correct me if I misunderstood something, but my understanding of your suggestion is that the validation would only be done before starting Envoy the first time.
The `envoy-hot-restarter.sh` script is run each time you do a hot reload, but it runs after the "old" instance has released its resources, so you can run config validation there. However, if the new config is broken, the "new" instance will not start and you will end up with no service. Until this issue is fixed, I think you should put some logic into `envoy-hot-restarter.sh` that keeps the old config: run validation on the new config and, if it is good, rename it to the current config and start Envoy. If the new config is wrong, start Envoy with the previous config. That way you will always have Envoy running.
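A minimal sketch of that keep-the-old-config logic (paths and the validate invocation are assumptions based on the unit file above; `ENVOY_BIN` and `CONFIG_DIR` are illustrative variables, not part of the stock scripts):

```shell
#!/bin/sh
# Sketch of the config-rotation logic suggested for envoy-hot-restarter.sh.
ENVOY_BIN="${ENVOY_BIN:-/usr/bin/envoy}"
CONFIG_DIR="${CONFIG_DIR:-/etc/envoy}"

# If a staged config exists, validate it; promote it on success,
# discard it on failure so the known-good config keeps being used.
select_config() {
    new="$CONFIG_DIR/envoy.yaml.new"
    current="$CONFIG_DIR/envoy.yaml"
    if [ -f "$new" ]; then
        if "$ENVOY_BIN" --mode validate --config-path "$new" >/dev/null 2>&1; then
            mv "$new" "$current"
        else
            rm -f "$new"   # broken staged config: keep the current one
        fi
    fi
    echo "$current"
}

# The restarter script would then launch Envoy with whatever select_config prints:
#   exec "$ENVOY_BIN" --config-path "$(select_config)" --restart-epoch "$RESTART_EPOCH"
```

Deployments would write the candidate config to `envoy.yaml.new` instead of overwriting `envoy.yaml` directly, so a bad push can never take the proxy down.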
@cpakulski thanks for the clarifications, this was very helpful!
Thanks for the hints about a workaround; it does sound like it would make our lives easier. I'll see if I can manage to get it implemented in the `envoy-hot-restarter.sh` script. Our current workaround is to manually drain traffic, then deploy the configuration change and stop and start Envoy. As long as there are redundant Envoys running this works fine, although it's a bit tedious.
@slovdahl I debugged the code and found the root cause. Unfortunately, what I described above is not true: running in config-validation mode will always crash when the `envoy.filters.http.tap` filter is specified in the config file, regardless of whether another Envoy instance is running or not. Moving the config check into the `envoy-hot-restarter.sh` script just masked the problem; Envoy still crashed, inside the script. I just opened a PR with the fix.
Thanks a lot for this, @cpakulski! Looking forward to ~1.15.0~ 1.16.0 where this fix will be available.