Using the aws_kinesis logger results in an increasing handle_count as well as increasing memory usage.
version = 10.0.18362
build = 18362
platform = windows
version = 4.0.2
Run osquery using the aws_kinesis logger. Use the following scheduled query to check the health of osquery:
"schedule": {
"osquery_process_info": {
"query": "SELECT * FROM processes WHERE name = 'osqueryd.exe';",
"interval": 60
}
},
In order to collect trends about handle_count and memory usage, let the scheduled query execute a few times. An hour should be sufficient.
Check the scheduled query results using your favorite log viewer.
Notice that the handle_count steadily increases on the osquery worker process (not the watcher). Also, the memory usage (resident_size and total_size) increases.
Expected that both the handle_count as well as memory usage would remain stable.
Both handle_count as well as memory usage steadily increases.
@chrisbroome have you isolated this to the aws_kinesis logger? If you instead use the filesystem logger the handle_count stays the same?
If you add another query to the schedule does the handle_count increase at the same rate?
The filesystem logger does not have the increased handle count nor, as far as I could tell, the increased memory usage.
Also note that the exact same osquery code and logger config on Linux does not have these issues. I haven't tested MacOS yet.
You could try to update the AWS submodule to their latest version and see if the handle leak was in their library code (assuming it鈥檚 somehow fixed recently)?
Yeah I'm thinking the bug is probably in the AWS SDK code. I didn't see the issue reported in their repository. I'm trying to get a minimal reproducible example working so that I can file a bug there and link it here.
Honestly, I'm at a bit of a loss as to where exactly this leak is occurring.
There are a few places to check:
1) The BufferedLogForwarder base class
2) The AwsLogForwarder class that derives from BufferedLogForwarder
3) The KinesisLogForwarder class which derives from AwsLogForwarder
4) The custom osquery::OsqueryHttpClient that we're using with the AWS SDK
Nothing jumps out to me as obvious while looking through any of that code.
FWIW, v4.1.2 also has this issue.
I've narrowed the handle leak to somewhere in the osquery::http::Client code. There are definitely windows-specific codepaths that are taken (specifically in the constructor https://github.com/osquery/osquery/blob/master/osquery/remote/http_client.h#L217) . I've added some logger statements to various methods and here's the output which directly correlates with the ever-increasing handle_count:
I1220 16:57:48.772842 9136 aws_log_forwarder.h:187] sendBatch() begin
I1220 16:57:48.772842 9136 aws_kinesis.cpp:100] internalSend() begin: 2 records, 567 handles
I1220 16:57:48.776860 9136 aws_kinesis.cpp:103] before request.WithStreamName().SetRecords(): 567
I1220 16:57:48.776860 9136 aws_kinesis.cpp:106] after request.WithStreamName().SetRecords(): 567
I1220 16:57:48.786000 9136 http_client.h:221] Client()
I1220 16:57:48.804879 9136 http_client.cpp:85] Client::closeSocket() 573 handles
I1220 16:57:48.805855 9136 http_client.cpp:157] Client::createConnection() begin: 573 handles
I1220 16:57:48.806881 9136 http_client.cpp:172] Client::createConnection() callNetworkOperation, begin r.async_resolve: 573 handles
I1220 16:57:48.809834 9136 http_client.cpp:33] Client::callNetworkOperation() begin: 573 handles
I1220 16:57:48.809834 9136 http_client.cpp:175] Client::createConnection() callNetworkOperation, begincb r.async_resolve: 573 handles
I1220 16:57:48.811856 9136 http_client.cpp:184] Client::createConnection() callNetworkOperation, endcb r.async_resolve: 575 handles
I1220 16:57:48.814877 9136 http_client.cpp:117] Client::resolveHandler() begin: 579 handles
I1220 16:57:48.815915 9136 http_client.cpp:119] Client::resolveHandler() begin async_connect: 579 handles
I1220 16:57:48.818822 9136 http_client.cpp:127] Client::resolveHandler() end async_connect: 580 handles
I1220 16:57:48.821835 9136 http_client.cpp:137] Client::resolveHandler() end: 580 handles
I1220 16:57:48.822834 9136 http_client.cpp:107] Client::connectHandler() begin: 580 handles
I1220 16:57:48.822834 9136 http_client.cpp:110] Client::connectHandler() end: 580 handles
I1220 16:57:48.823828 9136 http_client.cpp:51] Client::callNetworkOperation() end: 573 handles
I1220 16:57:48.824828 9136 http_client.cpp:53] Client::callNetworkOperation() leak: 7 handles
I1220 16:57:48.824828 9136 http_client.cpp:189] Client::createConnection() callNetworkOperation, end r.async_resolve: 580 handles
I1220 16:57:48.825852 9136 http_client.cpp:253] Client::createConnection() end: 580 handles
I1220 16:57:49.310981 9136 http_client.cpp:33] Client::callNetworkOperation() begin: 580 handles
I1220 16:57:49.352906 9136 http_client.cpp:51] Client::callNetworkOperation() end: 580 handles
I1220 16:57:49.352906 9136 http_client.cpp:53] Client::callNetworkOperation() leak: 0 handles
I1220 16:57:49.355964 9136 http_client.cpp:316] Client::sendRequest() 580 handles
I1220 16:57:49.355964 9136 http_client.cpp:33] Client::callNetworkOperation() begin: 580 handles
I1220 16:57:49.357832 9136 http_client.cpp:51] Client::callNetworkOperation() end: 580 handles
I1220 16:57:49.357832 9136 http_client.cpp:53] Client::callNetworkOperation() leak: 0 handles
I1220 16:57:49.358917 9136 http_client.cpp:33] Client::callNetworkOperation() begin: 580 handles
I1220 16:57:49.372885 9136 http_client.cpp:51] Client::callNetworkOperation() end: 580 handles
I1220 16:57:49.372885 9136 http_client.cpp:53] Client::callNetworkOperation() leak: 0 handles
I1220 16:57:49.374909 9136 http_client.h:266] ~Client
I1220 16:57:49.376991 9136 http_client.cpp:85] Client::closeSocket() 580 handles
I1220 16:57:49.377861 9136 http_client.cpp:87] Client::closeSocket():isSocketOpen=true
I1220 16:57:49.377861 9136 http_client.cpp:90] sock_.shutdown(): rc: 0 The operation completed successfully 580 handles
I1220 16:57:49.378911 9136 http_client.cpp:93] sock_.close(): rc: 0 The operation completed successfully 579 handles
I1220 16:57:49.686817 9136 aws_kinesis.cpp:111] internalSend() end: 571 handles
I1220 16:57:49.687887 9136 aws_kinesis.cpp:112] internalSend() leaked: 4 handles
I1220 16:57:49.690928 9136 aws_log_forwarder.h:220] aws_kinesis: Successfully sent 2 out of 2 log records
I1220 16:57:49.690928 9136 aws_log_forwarder.h:262] sendBatch() end
I pushed the changes to my fork in the logs-for-handle-leak branch. Here are the diffs:
https://github.com/osquery/osquery/compare/master...chrisbroome:logs-for-handle-leak
It's not the prettiest code, and I made no attempt to guard for windows-specific issues, but I think it does illustrate the leak pretty well.
I noticed that many of the classes declared in aws_util.h are virtual but lack virtual destructors. This is a common source of leaks.
https://stackoverflow.com/questions/461203/when-to-use-virtual-destructors
This is probably a separate issue with larger scope, but I'm not sure what template to use for a code quality task.
I inspected the loggers and forwarders and found that none of these are present (sorted):
+ virtual ~AwsLogForwarder() = default;
+ virtual ~BufferedLogForwarder() = default;
+ virtual ~FilesystemLoggerPlugin() = default;
+ virtual ~FirehoseLogForwarder() = default;
+ virtual ~FirehoseLoggerPlugin() = default;
+ virtual ~Forwarder() = default;
+ virtual ~KinesisForwarder() = default;
+ virtual ~KinesisLoggerPlugin() = default;
+ virtual ~LoggerPlugin() = default;
+ virtual ~OsqueryAWSCredentialsProviderChain() = default;
+ virtual ~OsqueryFlagsAWSCredentialsProvider() = default;
+ virtual ~OsqueryHttpClient() = default;
+ virtual ~OsqueryHttpClientFactory() = default;
+ virtual ~OsquerySTSAWSCredentialsProvider() = default;
+ virtual ~TLSLogForwarder() = default;
+ virtual ~TLSLoggerPlugin() = default;
After reading through some of the boost source code, I noticed this comment in their documentation for win_thread:
https://github.com/boostorg/asio/blob/develop/include/boost/asio/detail/impl/win_thread.ipp#L35-L41
win_thread::~win_thread()
{
::CloseHandle(thread_);
// The exit_event_ handle is deliberately allowed to leak here since it
// is an error for the owner of an internal thread not to join() it.
}
Not sure if that's related to this issue, but I don't immediately see any places in the osquery code where the http client is calling join on the thread.
@verntx I think @alessandrogario or @Smjert were exploring adding clang's static analyzer errors to one of the build's statuses (the build fails if there are any significant issues). And I imagine the static analyzer can find missing virtual dtors?
I have tracked the leak using htrace (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/-htrace) in windbg. It's a tool to snapshot your open handles, execute code and trace the stack of each new handle since your last snapshot. From this I can trace the call creating the handle:
Handle = 0x00000000000006cc - OPEN
Thread ID = 0x00000000000010d8, Process ID = 0x00000000000024b0
0x00007ffc747c5aa4: ntdll!NtDuplicateObject+0x0000000000000014
0x00007ffc70eb736b: KERNELBASE!DuplicateHandle+0x000000000000004b
0x00007ffc7460d52e: WS2_32!WahOpenCurrentThread+0x000000000000005e
0x00007ffc7460d42a: WS2_32!DTHREAD::CreateDThreadForCurrentThread+0x00000000000000ca
0x00007ffc74609016: WS2_32!GetProtocolStateForFamily+0x0000000000000236
0x00007ffc7460a9ce: WS2_32!LookupAddressForName+0x00000000000000ae
0x00007ffc7460b6da: WS2_32!GetAddrInfoW+0x000000000000028a
0x00007ffc7460da9d: WS2_32!getaddrinfo+0x00000000000000dd
0x00007ff72e7e4382: osqueryd!boost::asio::detail::socket_ops::getaddrinfo+0x0000000000000112
0x00007ff72e7d0205: osqueryd!boost::asio::detail::socket_ops::background_getaddrinfo+0x0000000000000075
0x00007ff72e7daf13: +0x0000000000000193
0x00007ff72e5d1335: osqueryd!boost::asio::detail::scheduler_operation::complete+0x0000000000000045
0x00007ff72e5d26cb: osqueryd!boost::asio::detail::scheduler::do_run_one+0x000000000000023b
0x00007ff72e5d8279: osqueryd!boost::asio::detail::scheduler::run+0x0000000000000109
This is being called from a thread in boost_asio in the Http client. From here I've got the offending commit: https://github.com/osquery/osquery/commit/e5eb30ea5fad07c4577e394fd8cd4cba7f74341a, I can see this happening with this change, if I revert the leak no longer happens.
I have then narrowed this down to the call to the async_resolve call in Client::createConnection (https://github.com/osquery/osquery/blob/master/osquery/remote/http_client.cpp#L134). Looks like previously this was a call to connect. Once async_resolve has been called I can see the handle from the trace is left open.
I have also noticed osquery is going through some code paths that support much older windows versions defining BOOST_ASIO_HAS_GETADDRINFO https://github.com/boostorg/asio/blob/develop/include/boost/asio/detail/impl/socket_ops.ipp#L3321 means getaddrinfo can be called directly.
I have created a test program that demonstrates what happens:
#include <iostream>
#include <boost/asio.hpp>
namespace boost_system = boost::system;
namespace boost_asio = boost::asio;
class Client
{
public:
Client() :
r_(ioc_), sock_(ioc_)
{
boost::asio::detail::win_thread::set_terminate_threads(true);
}
~Client()
{
}
void connectHandler(const boost::system::error_code& ec,
const boost_asio::ip::tcp::endpoint&) {
}
void resolveHandler(
const boost::system::error_code& ec,
boost_asio::ip::tcp::resolver::results_type results) {
if (!ec) {
boost::asio::async_connect(sock_,
results,
std::bind(&Client::connectHandler,
this,
std::placeholders::_1,
std::placeholders::_2));
}
else
{
std::cout << ec.message();
}
}
void Connect()
{
r_.async_resolve(boost_asio::ip::tcp::resolver::query{ "kinesis.us-west-2.amazonaws.com", "443" },
std::bind(&Client::resolveHandler,
this,
std::placeholders::_1,
std::placeholders::_2));
{
boost_system::error_code rc;
ioc_.run(rc);
ioc_.reset();
if (rc) {
std::cout << "FAILURE: " << rc;
}
}
boost_asio::socket_base::keep_alive option(true);
sock_.set_option(option);
}
private:
boost_asio::io_context ioc_;
boost_asio::ip::tcp::resolver r_;
boost_asio::ip::tcp::socket sock_;
};
int main()
{
while (true)
{
Client client;
client.Connect();
Sleep(1 * 1000);
}
}
If you run this for a few seconds you can see the handle count increasing. As mentioned in the previous post this is due to a call to getaddrinfo in the resolver object. The call duplicates the handle of the calling thread these handles are then released when the thread completes. Because boost::asio::detail::win_thread::set_terminate_threads is set the thread is suddenly terminated: https://github.com/boostorg/asio/blob/develop/include/boost/asio/detail/impl/win_thread.ipp#L50 and these handles are not released. Removing the call to boost::asio::detail::win_thread::set_terminate_threads will then mean the thread closes down properly and the handles no longer leak.
I have also created a small program that shows what boost is doing:
#include <iostream>
#include <WinSock2.h>
#include <WS2tcpip.h>
#include <windows.h>
#include <process.h>
#include <Thread>
#pragma comment (lib, "Ws2_32.lib")
unsigned __stdcall GetAddrInfoThread(void* pArguments)
{
std::string host("kinesis.us-west-2.amazonaws.com");
std::string protocol("443");
addrinfo hints;
memset(&hints, 0, sizeof(hints));
hints.ai_flags = 1024;
hints.ai_socktype = 1;
hints.ai_protocol = 6;
addrinfo* result;
auto res = ::getaddrinfo(host.c_str(), protocol.c_str(), &hints, &result);
if (res)
{
std::cout << "Failed with result " << res;
}
::SleepEx(INFINITE, TRUE);
return 0;
}
int main()
{
WSADATA wsa_data;
auto startUp = ::WSAStartup(MAKEWORD('2', '0'), &wsa_data);
if (startUp)
{
std::cout << "Start up failed with result " << startUp;
}
unsigned threadID;
while (true)
{
HANDLE hThread = (HANDLE)::_beginthreadex(0, 0, &GetAddrInfoThread, NULL, 0, &threadID);
Sleep(1 * 1000);
::TerminateThread(hThread, 0);
Sleep(1 * 1000);
::CloseHandle( hThread );
}
}
This will also leak the handles created in getaddrinfo.
I started investigating why it doesn't appear in the TLS client and I suspect it does, it's just the TLS client has a longer living HttpClient. The HttpClient is recreated hourly with the default configuration: https://github.com/osquery/osquery/blob/master/osquery/remote/transports/tls.cpp#L180 I've tested with the flag tls_session_reuse=false and I am seeing similar leak on each call to getClient.
From here I can think of a few options:
Nice sleuthing @Breakwell !
To fix the leak for now in my repo I've done this change:
https://github.com/osquery/osquery/compare/master...Breakwell:remove_http_client_async_connect
Removing the async connect call this fixes the leak. To really fix this we need to not set set_terminate_threads. This "fixes" a deadlock issue by using the terminate_thread behavior to kill every thread instead of shutting it down. This will inevitably cause other problems like the one seen here.
Most helpful comment
I have created a test program that demonstrates what happens:
If you run this for a few seconds you can see the handle count increasing. As mentioned in the previous post this is due to a call to getaddrinfo in the resolver object. The call duplicates the handle of the calling thread these handles are then released when the thread completes. Because boost::asio::detail::win_thread::set_terminate_threads is set the thread is suddenly terminated: https://github.com/boostorg/asio/blob/develop/include/boost/asio/detail/impl/win_thread.ipp#L50 and these handles are not released. Removing the call to boost::asio::detail::win_thread::set_terminate_threads will then mean the thread closes down properly and the handles no longer leak.
I have also created a small program that shows what boost is doing:
This will also leak the handles created in getaddrinfo.
I started investigating why it doesn't appear in the TLS client and I suspect it does, it's just the TLS client has a longer living HttpClient. The HttpClient is recreated hourly with the default configuration: https://github.com/osquery/osquery/blob/master/osquery/remote/transports/tls.cpp#L180 I've tested with the flag tls_session_reuse=false and I am seeing similar leak on each call to getClient.
From here I can think of a few options: