Osquery: AWS Kinesis Logger Leak on Windows

Created on 13 Dec 2019 · 16Comments · Source: osquery/osquery

Bug report

Using the aws_kinesis logger results in an increasing handle_count as well as increasing memory usage.

What operating system and version are you using?

 version = 10.0.18362
   build = 18362
platform = windows

What version of osquery are you using?

version = 4.0.2

What steps did you take to reproduce the issue?

Run osquery using the aws_kinesis logger. Use the following scheduled query to check the health of osquery:

  "schedule": {
    "osquery_process_info": {
      "query": "SELECT * FROM processes WHERE name = 'osqueryd.exe';",
      "interval": 60
    }
  },

In order to collect trends about handle_count and memory usage, let the scheduled query execute a few times. An hour should be sufficient.

Check the scheduled query results using your favorite log viewer.

Notice that the handle_count steadily increases on the osquery worker process (not the watcher). Also, the memory usage (resident_size and total_size) increases.

What did you expect to see?

Expected that both the handle_count as well as memory usage would remain stable.

What did you see instead?

Both handle_count as well as memory usage steadily increases.

logging triage

Source

chrisbroome

👍4

Most helpful comment

I have created a test program that demonstrates what happens:

#include <iostream>
#include <boost/asio.hpp>

namespace boost_system = boost::system;
namespace boost_asio = boost::asio;

class Client
{
public:

    Client() :
        r_(ioc_), sock_(ioc_)
    {
        boost::asio::detail::win_thread::set_terminate_threads(true);
    }

    ~Client()
    {
    }

    void connectHandler(const boost::system::error_code& ec,
        const boost_asio::ip::tcp::endpoint&) {
    }

    void resolveHandler(
        const boost::system::error_code& ec,
        boost_asio::ip::tcp::resolver::results_type results) {
        if (!ec) {
            boost::asio::async_connect(sock_,
                results,
                std::bind(&Client::connectHandler,
                    this,
                    std::placeholders::_1,
                    std::placeholders::_2));
        }
        else
        {
            std::cout << ec.message();
        }
    }

    void Connect()
    {
        r_.async_resolve(boost_asio::ip::tcp::resolver::query{ "kinesis.us-west-2.amazonaws.com", "443" },
            std::bind(&Client::resolveHandler,
                this,
                std::placeholders::_1,
                std::placeholders::_2));

        {
            boost_system::error_code rc;
            ioc_.run(rc);
            ioc_.reset();
            if (rc) {
                std::cout << "FAILURE: " << rc;
            }
        }

        boost_asio::socket_base::keep_alive option(true);
        sock_.set_option(option);
    }

private:
    boost_asio::io_context ioc_;
    boost_asio::ip::tcp::resolver r_;
    boost_asio::ip::tcp::socket sock_;
};

int main()
{
    while (true)
    {
        Client client;
        client.Connect();
        Sleep(1 * 1000);
    }
}

If you run this for a few seconds you can see the handle count increasing. As mentioned in the previous post this is due to a call to getaddrinfo in the resolver object. The call duplicates the handle of the calling thread these handles are then released when the thread completes. Because boost::asio::detail::win_thread::set_terminate_threads is set the thread is suddenly terminated: https://github.com/boostorg/asio/blob/develop/include/boost/asio/detail/impl/win_thread.ipp#L50 and these handles are not released. Removing the call to boost::asio::detail::win_thread::set_terminate_threads will then mean the thread closes down properly and the handles no longer leak.

I have also created a small program that shows what boost is doing:

#include <iostream>

#include <WinSock2.h>
#include <WS2tcpip.h>

#include <windows.h>
#include <process.h>

#include <Thread>

#pragma comment (lib, "Ws2_32.lib")


unsigned __stdcall GetAddrInfoThread(void* pArguments)
{
    std::string host("kinesis.us-west-2.amazonaws.com");
    std::string protocol("443");

    addrinfo hints;
    memset(&hints, 0, sizeof(hints));

    hints.ai_flags = 1024;
    hints.ai_socktype = 1;
    hints.ai_protocol = 6;


    addrinfo* result;
    auto res = ::getaddrinfo(host.c_str(), protocol.c_str(), &hints, &result);
    if (res)
    {
        std::cout << "Failed with result " << res;
    }
    ::SleepEx(INFINITE, TRUE);
    return 0;
}

int main()
{
    WSADATA wsa_data;
    auto startUp = ::WSAStartup(MAKEWORD('2', '0'), &wsa_data);
    if (startUp)
    {
        std::cout << "Start up failed with result " << startUp;
    }

    unsigned threadID;

    while (true)
    {
        HANDLE hThread = (HANDLE)::_beginthreadex(0, 0, &GetAddrInfoThread, NULL, 0, &threadID);
        Sleep(1 * 1000);
        ::TerminateThread(hThread, 0);
        Sleep(1 * 1000);
        ::CloseHandle( hThread );
    }
}

This will also leak the handles created in getaddrinfo.

I started investigating why it doesn't appear in the TLS client and I suspect it does, it's just the TLS client has a longer living HttpClient. The HttpClient is recreated hourly with the default configuration: https://github.com/osquery/osquery/blob/master/osquery/remote/transports/tls.cpp#L180 I've tested with the flag tls_session_reuse=false and I am seeing similar leak on each call to getClient.

From here I can think of a few options:

Revisit #4235 and #5341 and investigate if we actually need to set terminate
Revert the change to use async_resolve
Change how we resolve in the client - I am not that familiar with boost maybe someone has some better ideas on how we do this?

Breakwell on 26 Feb 2020

👍2

All 16 comments

@chrisbroome have you isolated this to the aws_kinesis logger? If you instead use the filesystem logger the handle_count stays the same?

theopolis on 13 Dec 2019

If you add another query to the schedule does the handle_count increase at the same rate?

theopolis on 13 Dec 2019

The filesystem logger does not have the increased handle count nor, as far as I could tell, the increased memory usage.

chrisbroome on 14 Dec 2019

Also note that the exact same osquery code and logger config on Linux does not have these issues. I haven't tested MacOS yet.

chrisbroome on 14 Dec 2019

You could try to update the AWS submodule to their latest version and see if the handle leak was in their library code (assuming it’s somehow fixed recently)?

theopolis on 14 Dec 2019

Yeah I'm thinking the bug is probably in the AWS SDK code. I didn't see the issue reported in their repository. I'm trying to get a minimal reproducible example working so that I can file a bug there and link it here.

chrisbroome on 16 Dec 2019

Honestly, I'm at a bit of a loss as to where exactly this leak is occurring.

There are a few places to check:

1) The BufferedLogForwarder base class
2) The AwsLogForwarder class that derives from BufferedLogForwarder
3) The KinesisLogForwarder class which derives from AwsLogForwarder
4) The custom osquery::OsqueryHttpClient that we're using with the AWS SDK

Nothing jumps out to me as obvious while looking through any of that code.

chrisbroome on 18 Dec 2019

FWIW, v4.1.2 also has this issue.

chrisbroome on 19 Dec 2019

I've narrowed the handle leak to somewhere in the osquery::http::Client code. There are definitely windows-specific codepaths that are taken (specifically in the constructor https://github.com/osquery/osquery/blob/master/osquery/remote/http_client.h#L217) . I've added some logger statements to various methods and here's the output which directly correlates with the ever-increasing handle_count:

I1220 16:57:48.772842  9136 aws_log_forwarder.h:187] sendBatch() begin
I1220 16:57:48.772842  9136 aws_kinesis.cpp:100] internalSend()  begin: 2 records, 567 handles
I1220 16:57:48.776860  9136 aws_kinesis.cpp:103] before request.WithStreamName().SetRecords(): 567
I1220 16:57:48.776860  9136 aws_kinesis.cpp:106] after  request.WithStreamName().SetRecords(): 567
I1220 16:57:48.786000  9136 http_client.h:221] Client()
I1220 16:57:48.804879  9136 http_client.cpp:85] Client::closeSocket() 573 handles
I1220 16:57:48.805855  9136 http_client.cpp:157] Client::createConnection() begin: 573 handles
I1220 16:57:48.806881  9136 http_client.cpp:172] Client::createConnection() callNetworkOperation, begin r.async_resolve: 573 handles
I1220 16:57:48.809834  9136 http_client.cpp:33] Client::callNetworkOperation() begin: 573 handles
I1220 16:57:48.809834  9136 http_client.cpp:175] Client::createConnection() callNetworkOperation,  begincb r.async_resolve: 573 handles
I1220 16:57:48.811856  9136 http_client.cpp:184] Client::createConnection() callNetworkOperation,    endcb r.async_resolve: 575 handles
I1220 16:57:48.814877  9136 http_client.cpp:117] Client::resolveHandler() begin: 579 handles
I1220 16:57:48.815915  9136 http_client.cpp:119] Client::resolveHandler() begin async_connect: 579 handles
I1220 16:57:48.818822  9136 http_client.cpp:127] Client::resolveHandler()   end async_connect: 580 handles
I1220 16:57:48.821835  9136 http_client.cpp:137] Client::resolveHandler()   end: 580 handles
I1220 16:57:48.822834  9136 http_client.cpp:107] Client::connectHandler() begin: 580 handles
I1220 16:57:48.822834  9136 http_client.cpp:110] Client::connectHandler()   end: 580 handles
I1220 16:57:48.823828  9136 http_client.cpp:51] Client::callNetworkOperation()   end: 573 handles
I1220 16:57:48.824828  9136 http_client.cpp:53] Client::callNetworkOperation()  leak: 7 handles
I1220 16:57:48.824828  9136 http_client.cpp:189] Client::createConnection() callNetworkOperation,   end r.async_resolve: 580 handles
I1220 16:57:48.825852  9136 http_client.cpp:253] Client::createConnection()   end: 580 handles
I1220 16:57:49.310981  9136 http_client.cpp:33] Client::callNetworkOperation() begin: 580 handles
I1220 16:57:49.352906  9136 http_client.cpp:51] Client::callNetworkOperation()   end: 580 handles
I1220 16:57:49.352906  9136 http_client.cpp:53] Client::callNetworkOperation()  leak: 0 handles
I1220 16:57:49.355964  9136 http_client.cpp:316] Client::sendRequest() 580 handles
I1220 16:57:49.355964  9136 http_client.cpp:33] Client::callNetworkOperation() begin: 580 handles
I1220 16:57:49.357832  9136 http_client.cpp:51] Client::callNetworkOperation()   end: 580 handles
I1220 16:57:49.357832  9136 http_client.cpp:53] Client::callNetworkOperation()  leak: 0 handles
I1220 16:57:49.358917  9136 http_client.cpp:33] Client::callNetworkOperation() begin: 580 handles
I1220 16:57:49.372885  9136 http_client.cpp:51] Client::callNetworkOperation()   end: 580 handles
I1220 16:57:49.372885  9136 http_client.cpp:53] Client::callNetworkOperation()  leak: 0 handles
I1220 16:57:49.374909  9136 http_client.h:266] ~Client
I1220 16:57:49.376991  9136 http_client.cpp:85] Client::closeSocket() 580 handles
I1220 16:57:49.377861  9136 http_client.cpp:87] Client::closeSocket():isSocketOpen=true
I1220 16:57:49.377861  9136 http_client.cpp:90] sock_.shutdown(): rc: 0 The operation completed successfully 580 handles
I1220 16:57:49.378911  9136 http_client.cpp:93] sock_.close(): rc: 0 The operation completed successfully 579 handles
I1220 16:57:49.686817  9136 aws_kinesis.cpp:111] internalSend()    end: 571 handles
I1220 16:57:49.687887  9136 aws_kinesis.cpp:112] internalSend() leaked: 4 handles
I1220 16:57:49.690928  9136 aws_log_forwarder.h:220] aws_kinesis: Successfully sent 2 out of 2 log records
I1220 16:57:49.690928  9136 aws_log_forwarder.h:262] sendBatch() end

I pushed the changes to my fork in the logs-for-handle-leak branch. Here are the diffs:

https://github.com/osquery/osquery/compare/master...chrisbroome:logs-for-handle-leak

It's not the prettiest code, and I made no attempt to guard for windows-specific issues, but I think it does illustrate the leak pretty well.

chrisbroome on 20 Dec 2019

I noticed that many of the classes declared in aws_util.h are virtual but lack virtual destructors. This is a common source of leaks.
https://stackoverflow.com/questions/461203/when-to-use-virtual-destructors

This is probably a separate issue with larger scope, but I'm not sure what template to use for a code quality task.

I inspected the loggers and forwarders and found that none of these are present (sorted):

+  virtual ~AwsLogForwarder() = default;
+  virtual ~BufferedLogForwarder() = default;
+  virtual ~FilesystemLoggerPlugin() = default;
+  virtual ~FirehoseLogForwarder() = default;
+  virtual ~FirehoseLoggerPlugin() = default;
+  virtual ~Forwarder() = default;
+  virtual ~KinesisForwarder() = default;
+  virtual ~KinesisLoggerPlugin() = default;
+  virtual ~LoggerPlugin() = default;
+  virtual ~OsqueryAWSCredentialsProviderChain() = default;
+  virtual ~OsqueryFlagsAWSCredentialsProvider() = default;
+  virtual ~OsqueryHttpClient() = default;
+  virtual ~OsqueryHttpClientFactory() = default;
+  virtual ~OsquerySTSAWSCredentialsProvider() = default;
+  virtual ~TLSLogForwarder() = default;
+  virtual ~TLSLoggerPlugin() = default;

verntx on 9 Jan 2020

After reading through some of the boost source code, I noticed this comment in their documentation for win_thread:

https://github.com/boostorg/asio/blob/develop/include/boost/asio/detail/impl/win_thread.ipp#L35-L41

win_thread::~win_thread()
{
  ::CloseHandle(thread_);

  // The exit_event_ handle is deliberately allowed to leak here since it
  // is an error for the owner of an internal thread not to join() it.
}

Not sure if that's related to this issue, but I don't immediately see any places in the osquery code where the http client is calling join on the thread.

chrisbroome on 10 Jan 2020

@verntx I think @alessandrogario or @Smjert were exploring adding clang's static analyzer errors to one of the build's statuses (the build fails if there are any significant issues). And I imagine the static analyzer can find missing virtual dtors?

theopolis on 14 Jan 2020

I have tracked the leak using htrace (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/-htrace) in windbg. It's a tool to snapshot your open handles, execute code and trace the stack of each new handle since your last snapshot. From this I can trace the call creating the handle:

Handle = 0x00000000000006cc - OPEN
Thread ID = 0x00000000000010d8, Process ID = 0x00000000000024b0

0x00007ffc747c5aa4: ntdll!NtDuplicateObject+0x0000000000000014
0x00007ffc70eb736b: KERNELBASE!DuplicateHandle+0x000000000000004b
0x00007ffc7460d52e: WS2_32!WahOpenCurrentThread+0x000000000000005e
0x00007ffc7460d42a: WS2_32!DTHREAD::CreateDThreadForCurrentThread+0x00000000000000ca
0x00007ffc74609016: WS2_32!GetProtocolStateForFamily+0x0000000000000236
0x00007ffc7460a9ce: WS2_32!LookupAddressForName+0x00000000000000ae
0x00007ffc7460b6da: WS2_32!GetAddrInfoW+0x000000000000028a
0x00007ffc7460da9d: WS2_32!getaddrinfo+0x00000000000000dd
0x00007ff72e7e4382: osqueryd!boost::asio::detail::socket_ops::getaddrinfo+0x0000000000000112
0x00007ff72e7d0205: osqueryd!boost::asio::detail::socket_ops::background_getaddrinfo+0x0000000000000075
0x00007ff72e7daf13: +0x0000000000000193
0x00007ff72e5d1335: osqueryd!boost::asio::detail::scheduler_operation::complete+0x0000000000000045
0x00007ff72e5d26cb: osqueryd!boost::asio::detail::scheduler::do_run_one+0x000000000000023b
0x00007ff72e5d8279: osqueryd!boost::asio::detail::scheduler::run+0x0000000000000109

This is being called from a thread in boost_asio in the Http client. From here I've got the offending commit: https://github.com/osquery/osquery/commit/e5eb30ea5fad07c4577e394fd8cd4cba7f74341a, I can see this happening with this change, if I revert the leak no longer happens.

I have then narrowed this down to the call to the async_resolve call in Client::createConnection (https://github.com/osquery/osquery/blob/master/osquery/remote/http_client.cpp#L134). Looks like previously this was a call to connect. Once async_resolve has been called I can see the handle from the trace is left open.

I have also noticed osquery is going through some code paths that support much older windows versions defining BOOST_ASIO_HAS_GETADDRINFO https://github.com/boostorg/asio/blob/develop/include/boost/asio/detail/impl/socket_ops.ipp#L3321 means getaddrinfo can be called directly.

Breakwell on 3 Feb 2020

🎉1

I have created a test program that demonstrates what happens:

#include <iostream>
#include <boost/asio.hpp>

namespace boost_system = boost::system;
namespace boost_asio = boost::asio;

class Client
{
public:

    Client() :
        r_(ioc_), sock_(ioc_)
    {
        boost::asio::detail::win_thread::set_terminate_threads(true);
    }

    ~Client()
    {
    }

    void connectHandler(const boost::system::error_code& ec,
        const boost_asio::ip::tcp::endpoint&) {
    }

    void resolveHandler(
        const boost::system::error_code& ec,
        boost_asio::ip::tcp::resolver::results_type results) {
        if (!ec) {
            boost::asio::async_connect(sock_,
                results,
                std::bind(&Client::connectHandler,
                    this,
                    std::placeholders::_1,
                    std::placeholders::_2));
        }
        else
        {
            std::cout << ec.message();
        }
    }

    void Connect()
    {
        r_.async_resolve(boost_asio::ip::tcp::resolver::query{ "kinesis.us-west-2.amazonaws.com", "443" },
            std::bind(&Client::resolveHandler,
                this,
                std::placeholders::_1,
                std::placeholders::_2));

        {
            boost_system::error_code rc;
            ioc_.run(rc);
            ioc_.reset();
            if (rc) {
                std::cout << "FAILURE: " << rc;
            }
        }

        boost_asio::socket_base::keep_alive option(true);
        sock_.set_option(option);
    }

private:
    boost_asio::io_context ioc_;
    boost_asio::ip::tcp::resolver r_;
    boost_asio::ip::tcp::socket sock_;
};

int main()
{
    while (true)
    {
        Client client;
        client.Connect();
        Sleep(1 * 1000);
    }
}

I have also created a small program that shows what boost is doing:

#include <iostream>

#include <WinSock2.h>
#include <WS2tcpip.h>

#include <windows.h>
#include <process.h>

#include <Thread>

#pragma comment (lib, "Ws2_32.lib")


unsigned __stdcall GetAddrInfoThread(void* pArguments)
{
    std::string host("kinesis.us-west-2.amazonaws.com");
    std::string protocol("443");

    addrinfo hints;
    memset(&hints, 0, sizeof(hints));

    hints.ai_flags = 1024;
    hints.ai_socktype = 1;
    hints.ai_protocol = 6;


    addrinfo* result;
    auto res = ::getaddrinfo(host.c_str(), protocol.c_str(), &hints, &result);
    if (res)
    {
        std::cout << "Failed with result " << res;
    }
    ::SleepEx(INFINITE, TRUE);
    return 0;
}

int main()
{
    WSADATA wsa_data;
    auto startUp = ::WSAStartup(MAKEWORD('2', '0'), &wsa_data);
    if (startUp)
    {
        std::cout << "Start up failed with result " << startUp;
    }

    unsigned threadID;

    while (true)
    {
        HANDLE hThread = (HANDLE)::_beginthreadex(0, 0, &GetAddrInfoThread, NULL, 0, &threadID);
        Sleep(1 * 1000);
        ::TerminateThread(hThread, 0);
        Sleep(1 * 1000);
        ::CloseHandle( hThread );
    }
}

This will also leak the handles created in getaddrinfo.

From here I can think of a few options:

Revisit #4235 and #5341 and investigate if we actually need to set terminate
Revert the change to use async_resolve
Change how we resolve in the client - I am not that familiar with boost maybe someone has some better ideas on how we do this?

Breakwell on 26 Feb 2020

👍2

Nice sleuthing @Breakwell !

packetzero on 26 Feb 2020

To fix the leak for now in my repo I've done this change:
https://github.com/osquery/osquery/compare/master...Breakwell:remove_http_client_async_connect

Removing the async connect call this fixes the leak. To really fix this we need to not set set_terminate_threads. This "fixes" a deadlock issue by using the terminate_thread behavior to kill every thread instead of shutting it down. This will inevitably cause other problems like the one seen here.

Breakwell on 9 Mar 2020

Was this page helpful?

0 / 5 - 0 ratings