Yugabyte-db: Seeing occasional high latency when running redis pipeline workload

Created on 9 Feb 2018  Â·  9Comments  Â·  Source: yugabyte/yugabyte-db

Running some load against yugabyte primarily consisting of hmset/hmget operations.
Occasionally observing higher latencies. Found some occurrences of the following stack in the yb-tserver.INFO log file.

W0208 21:55:19.366822 46558 kernel_stack_watchdog.cc:139] Thread 46772 stuck at ../../../../../src/yb/rpc/outbound_call.cc:326 for 104ms:
Kernel stack:
[<ffffffff810f5464>] futex_wait_queue_me+0xc4/0x120
[<ffffffff810f5fd9>] futex_wait+0x179/0x280
[<ffffffff810f80de>] do_futex+0xfe/0x5b0
[<ffffffff810f8610>] SyS_futex+0x80/0x180
[<ffffffff81697749>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

User stack:

    @     0x7f57ab5bbe43  sys_futex (./src/base/linux_syscall_support.h:2097)
    @     0x7f57ab5bbe43  base::internal::SpinLockDelay(int volatile*, int, int) (./src/base/spinlock_linux-inl.h:88)
    @     0x7f57ab5bbd13  SpinLock::SlowLock() (src/base/spinlock.cc:133)
    @     0x7f57a5cc1156  SpinLock::Lock() (src/base/spinlock.h:71)
    @     0x7f57a5cc1156  SpinLockHolder::SpinLockHolder(SpinLock*) (src/base/spinlock.h:136)
    @     0x7f57a5cc1156  tcmalloc::ThreadCache::IncreaseCacheLimit() (src/thread_cache.cc:276)
    @     0x7f57a5cd2e11  tcmalloc::ThreadCache::Deallocate(void*, unsigned long) (src/thread_cache.h:392)
    @     0x7f57a5cd2e11  do_free_helper (src/tcmalloc.cc:1195)
    @     0x7f57a5cd2e11  do_free_with_callback (src/tcmalloc.cc:1228)
    @     0x7f57a5cd2e11  do_free (src/tcmalloc.cc:1234)
    @     0x7f57a5cd2e11  tc_deletearray (src/tcmalloc.cc:1665)
    @     0x7f57a6bd9f77  google::protobuf::internal::ArenaStringPtr::DestroyNoArena(std::string const*) (thirdparty/installed/uninstrumented/include/google/protobuf/arenastring.h:264)
    @     0x7f57a6bd9f77  yb::RedisKeyValueSubKeyPB::clear_subkey() (yb/common/redis_protocol.pb.cc:4357)
    @     0x7f57a6bda00e  yb::RedisKeyValueSubKeyPB::~RedisKeyValueSubKeyPB() (yb/common/redis_protocol.pb.cc:4321)
    @     0x7f57a6bda444  yb::RedisKeyValueSubKeyPB::~RedisKeyValueSubKeyPB() (yb/common/redis_protocol.pb.cc:4322)
    @     0x7f57a6bda444  google::protobuf::internal::GenericTypeHandler<yb::RedisKeyValueSubKeyPB>::Delete(yb::RedisKeyValueSubKeyPB*, google::protobuf::Arena*) (thirdparty/installed/uninstrumented/include/google/protobuf/repeated_field.h:623)
    @     0x7f57a6bda444  void google::protobuf::internal::RepeatedPtrFieldBase::Destroy<google::protobuf::RepeatedPtrField<yb::RedisKeyValueSubKeyPB>::TypeHandler>() (thirdparty/installed/uninstrumented/include/google/protobuf/repeated_field.h:1473)
    @     0x7f57a6bda444  google::protobuf::RepeatedPtrField<yb::RedisKeyValueSubKeyPB>::~RepeatedPtrField() (thirdparty/installed/uninstrumented/include/google/protobuf/repeated_field.h:1934)
    @     0x7f57a6bda444  yb::RedisKeyValuePB::~RedisKeyValuePB() (yb/common/redis_protocol.pb.cc:4680)
    @     0x7f57a6bdb35f  yb::RedisKeyValuePB::~RedisKeyValuePB() (yb/common/redis_protocol.pb.cc:4683)
    @     0x7f57a6bdb35f  yb::RedisReadRequestPB::SharedDtor() (yb/common/redis_protocol.pb.cc:2474)
    @     0x7f57a6bdb42e  yb::RedisReadRequestPB::~RedisReadRequestPB() (yb/common/redis_protocol.pb.cc:2470)
    @     0x7f57a6bdb480  yb::RedisReadRequestPB::~RedisReadRequestPB() (yb/common/redis_protocol.pb.cc:2471)
    @     0x7f57abb3ad08  std::default_delete<yb::RedisReadRequestPB>::operator()(yb::RedisReadRequestPB*) const (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/unique_ptr.h:76)
    @     0x7f57abb3ad08  std::unique_ptr<yb::RedisReadRequestPB, std::default_delete<yb::RedisReadRequestPB> >::~unique_ptr() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/unique_ptr.h:236)
    @     0x7f57abb3ad08  yb::client::YBRedisReadOp::~YBRedisReadOp() (yb/client/yb_op.cc:109)
    @     0x7f57abac2f58  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/shared_ptr_base.h:150)
    @     0x7f57abac2f58  std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/shared_ptr_base.h:659)
    @     0x7f57abac2f58  std::__shared_ptr<yb::client::YBOperation, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/shared_ptr_base.h:925)
    @     0x7f57abac2f58  std::shared_ptr<yb::client::YBOperation>::~shared_ptr() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/shared_ptr.h:93)
    @     0x7f57abac2f58  yb::client::internal::InFlightOp::~InFlightOp() (yb/client/in_flight_op.h:91)
    @     0x7f57abac2f58  void __gnu_cxx::new_allocator<yb::client::internal::InFlightOp>::destroy<yb::client::internal::InFlightOp>(yb::client::internal::InFlightOp*) (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/ext/new_allocator.h:124)
    @     0x7f57abac2f58  std::enable_if<std::__and_<std::allocator_traits<std::allocator<yb::client::internal::InFlightOp> >::__destroy_helper<yb::client::internal::InFlightOp>::type>::value, void>::type std::allocator_traits<std::allocator<yb::client::internal::InFlightOp> >::_S_destroy<yb::client::internal::InFlightOp>(std::allocator<yb::client::internal::InFlightOp>&, yb::client::internal::InFlightOp*) (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/alloc_traits.h:285)
    @     0x7f57abac2f58  void std::allocator_traits<std::allocator<yb::client::internal::InFlightOp> >::destroy<yb::client::internal::InFlightOp>(std::allocator<yb::client::internal::InFlightOp>&, yb::client::internal::InFlightOp*) (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/alloc_traits.h:414)
    @     0x7f57abac2f58  std::_Sp_counted_ptr_inplace<yb::client::internal::InFlightOp, std::allocator<yb::client::internal::InFlightOp>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/shared_ptr_base.h:531)
    @     0x7f57abab5a84  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/shared_ptr_base.h:150)
    @     0x7f57abab5a84  std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/shared_ptr_base.h:659)
    @     0x7f57abab5a84  std::__shared_ptr<yb::client::internal::InFlightOp, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/shared_ptr_base.h:925)
    @     0x7f57abab5a84  std::shared_ptr<yb::client::internal::InFlightOp>::~shared_ptr() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/shared_ptr.h:93)
    @     0x7f57abab5a84  void std::_Destroy<std::shared_ptr<yb::client::internal::InFlightOp> >(std::shared_ptr<yb::client::internal::InFlightOp>*) (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/stl_construct.h:93)
    @     0x7f57abab5a84  void std::_Destroy_aux<false>::__destroy<std::shared_ptr<yb::client::internal::InFlightOp>*>(std::shared_ptr<yb::client::internal::InFlightOp>*, std::shared_ptr<yb::client::internal::InFlightOp>*) (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/stl_construct.h:103)
    @     0x7f57abab5a84  void std::_Destroy<std::shared_ptr<yb::client::internal::InFlightOp>*>(std::shared_ptr<yb::client::internal::InFlightOp>*, std::shared_ptr<yb::client::internal::InFlightOp>*) (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/stl_construct.h:126)
    @     0x7f57abab5a84  void std::_Destroy<std::shared_ptr<yb::client::internal::InFlightOp>*, std::shared_ptr<yb::client::internal::InFlightOp> >(std::shared_ptr<yb::client::internal::InFlightOp>*, std::shared_ptr<yb::client::internal::InFlightOp>*, std::allocator<std::shared_ptr<yb::client::internal::InFlightOp> >&) (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/stl_construct.h:151)
    @     0x7f57abab5a84  std::vector<std::shared_ptr<yb::client::internal::InFlightOp>, std::allocator<std::shared_ptr<yb::client::internal::InFlightOp> > >::~vector() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/stl_vector.h:424)
    @     0x7f57abab5a84  yb::client::internal::AsyncRpc::~AsyncRpc() (yb/client/async_rpc.cc:103)
    @     0x7f57abab8c80  yb::client::internal::AsyncRpcBase<yb::tserver::ReadRequestPB, yb::tserver::ReadResponsePB>::~AsyncRpcBase() (yb/client/async_rpc.h:111)
    @     0x7f57abab8c80  yb::client::internal::ReadRpc::~ReadRpc() (yb/client/async_rpc.cc:464)
    @     0x7f57abab5914  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/shared_ptr_base.h:150)
    @     0x7f57abab5914  std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/shared_ptr_base.h:659)
    @     0x7f57abab5914  std::__shared_ptr<yb::rpc::RpcCommand, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/shared_ptr_base.h:925)
    @     0x7f57abab5914  std::__shared_ptr<yb::rpc::RpcCommand, (__gnu_cxx::_Lock_policy)2>::reset() (/home/centos/.linuxbrew-yb-build/Cellar/gcc/5.3.0/include/c++/5.3.0/bits/shared_ptr_base.h:1022)
    @     0x7f57abab5914  yb::client::internal::AsyncRpc::Finished(yb::Status const&) (yb/client/async_rpc.cc:137)

Most helpful comment

@alk I increased TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 268435456 and currently TCMALLOC_CURRENT_TOTAL_THREAD_CACHE_BYTES is about 172705296. Although, during startup of the process there were still a few spinlock issues, but after TCMALLOC_CURRENT_TOTAL_THREAD_CACHE_BYTES stabilized, it seems like there aren't any further issues. These are a few stack traces that I saw during startup:

W0211 09:36:53.940006  6785 kernel_stack_watchdog.cc:139] Thread 6774 stuck at ../../../../../src/yb/rpc/outbound_call.cc:326 for 236ms:
Kernel stack:
[<ffffffff810f5136>] futex_wait_queue_me+0xc6/0x130
[<ffffffff810f5dfb>] futex_wait+0x17b/0x280
[<ffffffff810f7b36>] do_futex+0x106/0x5a0
[<ffffffff810f8050>] SyS_futex+0x80/0x180
[<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

User stack:
    @     0x7ff3e41d30df  (unknown) (/tmp/glibc-20180109-1462-njgus6/glibc-2.23/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0)
    @     0x7ff3ed612ec3  sys_futex (./src/base/linux_syscall_support.h:2097)
    @     0x7ff3ed612ec3  base::internal::SpinLockDelay(int volatile*, int, int) (./src/base/spinlock_linux-inl.h:88)
    @     0x7ff3ed612d93  SpinLock::SlowLock() (src/base/spinlock.cc:133)
    @     0x7ff3e7d229a7  SpinLock::Lock() (src/base/spinlock.h:71)
    @     0x7ff3e7d229a7  SpinLockHolder::SpinLockHolder(SpinLock*) (src/base/spinlock.h:136)
    @     0x7ff3e7d229a7  tcmalloc::CentralFreeList::Populate() (src/central_freelist.cc:328)
    @     0x7ff3e7d22bb7  tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) (src/central_freelist.cc:284)
    @     0x7ff3e7d22c5a  tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) (src/central_freelist.cc:264)
    @     0x7ff3e7d25dd0  tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) (src/thread_cache.cc:166)
    @     0x7ff3e7d37ba7  tcmalloc::ThreadCache::Allocate(unsigned long, unsigned long) (src/thread_cache.h:365)
    @     0x7ff3e7d37ba7  do_malloc_small (src/tcmalloc.cc:1103)
    @     0x7ff3e7d37ba7  do_malloc_no_errno (src/tcmalloc.cc:1112)
    @     0x7ff3e7d37ba7  cpp_alloc (src/tcmalloc.cc:1438)
    @     0x7ff3e7d37ba7  tc_newarray (src/tcmalloc.cc:1646)
    @     0x7ff3e724b236  google::LogMessage::Init(char const*, int, int, void (google::LogMessage::*)()) (src/logging.cc:1191)
    @     0x7ff3e724b9fb  google::LogMessage::LogMessage(char const*, int) (src/logging.cc:1155)
    @     0x7ff3ec8f1336  yb::consensus::LeaderElection::HandleVoteDeniedUnlocked(std::string const&, yb::consensus::LeaderElection::VoterState const&) (yb/consensus/leader_election.cc:410)
    @     0x7ff3ec8f1a1c  yb::consensus::LeaderElection::VoteResponseRpcCallback(std::string const&, scoped_refptr<yb::consensus::LeaderElection> const&) (yb/consensus/leader_election.cc:339)
    @     0x7ff3e9c278ae  std::function<void ()>::operator()() const (/n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:2267)
    @     0x7ff3e9c278ae  yb::rpc::OutboundCall::CallCallback() (yb/rpc/outbound_call.cc:327)
    @     0x7ff3e9c288a4  yb::rpc::OutboundCall::SetResponse(yb::rpc::CallResponse&&) (yb/rpc/outbound_call.cc:366)
    @     0x7ff3e9c17a7b  yb::rpc::Connection::HandleCallResponse(yb::Slice) (yb/rpc/connection.cc:410)
    @     0x7ff3e9c5b8cd  yb::rpc::YBConnectionContext::HandleCall(std::shared_ptr<yb::rpc::Connection> const&, yb::Slice) (yb/rpc/yb_rpc.cc:166)
W0211 09:37:11.758664  6785 kernel_stack_watchdog.cc:139] Thread 7063 stuck at ../../../../../src/yb/consensus/log.cc:245 for 177ms:
Kernel stack:
[<ffffffffffffffff>] 0xffffffffffffffff

User stack:
    @     0x7ff3e41d30df  (unknown) (/tmp/glibc-20180109-1462-njgus6/glibc-2.23/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0)
    @     0x7ff3e4282346  (unknown) (../sysdeps/unix/syscall-template.S:84)
    @     0x7ff3e7d21969  TCMalloc_SystemRelease(void*, unsigned long) (src/system-alloc.cc:539)
    @     0x7ff3e7d23060  tcmalloc::PageHeap::DecommitSpan(tcmalloc::Span*) (src/page_heap.cc:248)
    @     0x7ff3e7d234fb  tcmalloc::PageHeap::ReleaseLastNormalSpan(tcmalloc::PageHeap::SpanList*) (src/page_heap.cc:460)
    @     0x7ff3e7d235be  tcmalloc::PageHeap::ReleaseAtLeastNPages(unsigned long) (src/page_heap.cc:484)
    @     0x7ff3e7d1793e  TCMallocImplementation::ReleaseToSystem(unsigned long) (src/tcmalloc.cc:757)
    @     0x7ff3e82933ad  yb::MemTracker::GcTcmalloc() (yb/util/mem_tracker.cc:528)
    @     0x7ff3e8294a84  yb::MemTracker::Release(long) (yb/util/mem_tracker.cc:384)
    @     0x7ff3ec8f4d54  yb::consensus::LogCache::AccountForMessageRemovalUnlocked(std::shared_ptr<yb::consensus::ReplicateMsg> const&) (yb/consensus/log_cache.cc:409)
    @     0x7ff3ec8f6a04  yb::consensus::LogCache::EvictSomeUnlocked(long, long) (yb/consensus/log_cache.cc:397)
    @     0x7ff3ec8f70ef  yb::consensus::LogCache::LogCallback(long, bool, yb::Callback<void (yb::Status const&)> const&, yb::Status const&) (yb/consensus/log_cache.cc:241)
    @     0x7ff3ec67c9ae  yb::Callback<void (yb::Status const&)>::Run(yb::Status const&) const (yb/gutil/callback.h:451)
    @     0x7ff3ec67c9ae  yb::log::Log::AppendThread::RunThread() (yb/consensus/log.cc:248)
    @     0x7ff3e82ebc25  std::function<void ()>::operator()() const (/n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:2267)
    @     0x7ff3e82ebc25  yb::Thread::SuperviseThread(void*) (yb/util/thread.cc:602)
    @     0x7ff3e4544693  start_thread (/tmp/glibc-20180109-1462-njgus6/glibc-2.23/nptl/pthread_create.c:333)
    @     0x7ff3e42863cc  (unknown) (../sysdeps/unix/sysv/linux/x86_64/clone.S:109)

W0211 09:37:13.770215  6785 kernel_stack_watchdog.cc:139] Thread 7063 stuck at ../../../../../src/yb/consensus/log.cc:245 for 258ms:
Kernel stack:
[<ffffffff810b4cd8>] hrtimer_try_to_cancel.part.25+0x58/0x100
[<ffffffff810b4dc8>] hrtimer_cancel+0x28/0x40
[<ffffffff810f5e25>] futex_wait+0x1a5/0x280
[<ffffffff810f7b5a>] do_futex+0x12a/0x5a0
[<ffffffff810f8050>] SyS_futex+0x80/0x180
[<ffffffff816b51ab>] sysret_audit+0x17/0x21
[<ffffffffffffffff>] 0xffffffffffffffff

User stack:
    @     0x7ff3e41d30df  (unknown) (/tmp/glibc-20180109-1462-njgus6/glibc-2.23/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0)
    @     0x7ff3ed612ec3  sys_futex (./src/base/linux_syscall_support.h:2097)
    @     0x7ff3ed612ec3  base::internal::SpinLockDelay(int volatile*, int, int) (./src/base/spinlock_linux-inl.h:88)
    @     0x7ff3ed612d93  SpinLock::SlowLock() (src/base/spinlock.cc:133)
    @     0x7ff3e7d227cf  SpinLock::Lock() (src/base/spinlock.h:71)
    @     0x7ff3e7d227cf  SpinLockHolder::SpinLockHolder(SpinLock*) (src/base/spinlock.h:136)
    @     0x7ff3e7d227cf  tcmalloc::CentralFreeList::InsertRange(void*, void*, int) (src/central_freelist.cc:232)
    @     0x7ff3e7d25f86  tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) (src/thread_cache.cc:236)
    @     0x7ff3e7d2602b  tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long) (src/thread_cache.cc:197)
    @     0x7ff3e7d380a7  tcmalloc::ThreadCache::Deallocate(void*, unsigned long) (src/thread_cache.h:390)
    @     0x7ff3e7d380a7  do_free_helper (src/tcmalloc.cc:1195)
    @     0x7ff3e7d380a7  do_free_with_callback (src/tcmalloc.cc:1225)
    @     0x7ff3e7d380a7  do_free (src/tcmalloc.cc:1234)
    @     0x7ff3e7d380a7  tc_deletearray (src/tcmalloc.cc:1665)
    @     0x7ff3e9359a6a  std::string::_Rep::_M_dispose(std::allocator<char> const&) (/n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/bits/basic_string.h:2740)
    @     0x7ff3e9359a6a  std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (/n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/bits/basic_string.h:3037)
    @     0x7ff3e9359a6a  google::protobuf::internal::ArenaStringPtr::DestroyNoArena(std::string const*) (thirdparty/yugabyte-thirdparty-2018-02-06T17_39_35/thirdparty/installed/uninstrumented/include/google/protobuf/arenastring.h:264)
    @     0x7ff3e9359a6a  yb::docdb::KeyValuePairPB::SharedDtor() (yb/docdb/docdb.pb.cc:253)
    @     0x7ff3e9359aae  yb::docdb::KeyValuePairPB::~KeyValuePairPB() (yb/docdb/docdb.pb.cc:248)
    @     0x7ff3e9359b00  yb::docdb::KeyValuePairPB::~KeyValuePairPB() (yb/docdb/docdb.pb.cc:249)
    @     0x7ff3e935b92d  google::protobuf::internal::GenericTypeHandler<yb::docdb::KeyValuePairPB>::Delete(yb::docdb::KeyValuePairPB*, google::protobuf::Arena*) (thirdparty/yugabyte-thirdparty-2018-02-06T17_39_35/thirdparty/installed/uninstrumented/include/google/protobuf/repeated_field.h:623)
    @     0x7ff3e935b92d  void google::protobuf::internal::RepeatedPtrFieldBase::Destroy<google::protobuf::RepeatedPtrField<yb::docdb::KeyValuePairPB>::TypeHandler>() (thirdparty/yugabyte-thirdparty-2018-02-06T17_39_35/thirdparty/installed/uninstrumented/include/google/protobuf/repeated_field.h:1473)
    @     0x7ff3e935af77  google::protobuf::RepeatedPtrField<yb::docdb::KeyValuePairPB>::~RepeatedPtrField() (thirdparty/yugabyte-thirdparty-2018-02-06T17_39_35/thirdparty/installed/uninstrumented/include/google/protobuf/repeated_field.h:1934)
    @     0x7ff3e935af77  yb::docdb::KeyValueWriteBatchPB::~KeyValueWriteBatchPB() (yb/docdb/docdb.pb.cc:557)
    @     0x7ff3e935afd0  yb::docdb::KeyValueWriteBatchPB::~KeyValueWriteBatchPB() (yb/docdb/docdb.pb.cc:560)
    @     0x7ff3e957e415  yb::tserver::WriteRequestPB::SharedDtor() (yb/tserver/tserver.pb.cc:1332)
    @     0x7ff3e958630e  yb::tserver::WriteRequestPB::~WriteRequestPB() (yb/tserver/tserver.pb.cc:1327)
    @     0x7ff3e9586380  yb::tserver::WriteRequestPB::~WriteRequestPB() (yb/tserver/tserver.pb.cc:1328)

Also, I was reading the tcmalloc documentation and came across this:

If the sum of the thread cache sizes is less than --tcmalloc_max_total_thread_cache_bytes, 
max_size grows easily. If not, thread cache 1 will try to steal from thread cache 2 (picked round-
robin) by decreasing thread cache 2's max_size. In this way, threads that are more active will steal
 memory from other threads more often than they are have memory stolen from themselves. 
Mostly idle threads end up with small caches and active threads end up with big caches. Note 
that this stealing can cause the sum of the thread cache sizes to be greater than --
tcmalloc_max_total_thread_cache_bytes until thread cache 2 deallocates some memory to trigger
 a garbage collection.

This sounds similar to what might be happening in our case and hence increasing tcmalloc_max_total_thread_cache_bytes might help. Just wanted to confirm this is your reasoning as well to bump up tcmalloc_max_total_thread_cache_bytes.

All 9 comments

Thanks @mchang31415 . Could you share the RAM on the yugabyte nodes?

@pritamdamania87 : assigning to you for an initial look.

Thanks @mchang31415. Could you provide some additional details regarding the workload so that we can try to reproduce this on our end? In particular:

  1. How many hmget/hmset operations per second?
  2. Number of keys read/written in each hmget/hmset?
  3. Average number of elements in each hash?

Also opened an issue on gperftools: gperftools/gperftools#959, to see if they have some recommendations for tcmalloc.

How many threads are there ? One tweak you can try is bumping thread cache size via TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable.

If the issue is easily reproducible, or if you want to help improve tcmalloc for your workloads, please consider contributing malloc trace as described at https://github.com/alk/malloc-trace-replay

@alk Thanks a lot for the recommendation! We'll try setting a higher value for TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES and see if that helps. We have about 300 threads and looking at our metrics I see that tcmalloc_max_total_thread_cache_bytes is 33554432 and tcmalloc_current_total_thread_cache_bytes is 67018544. If I understand correctly, is this an indication that the thread cache is too small since the total size is almost double the max allowed size?

@alk I increased TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 268435456 and currently TCMALLOC_CURRENT_TOTAL_THREAD_CACHE_BYTES is about 172705296. Although, during startup of the process there were still a few spinlock issues, but after TCMALLOC_CURRENT_TOTAL_THREAD_CACHE_BYTES stabilized, it seems like there aren't any further issues. These are a few stack traces that I saw during startup:

W0211 09:36:53.940006  6785 kernel_stack_watchdog.cc:139] Thread 6774 stuck at ../../../../../src/yb/rpc/outbound_call.cc:326 for 236ms:
Kernel stack:
[<ffffffff810f5136>] futex_wait_queue_me+0xc6/0x130
[<ffffffff810f5dfb>] futex_wait+0x17b/0x280
[<ffffffff810f7b36>] do_futex+0x106/0x5a0
[<ffffffff810f8050>] SyS_futex+0x80/0x180
[<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

User stack:
    @     0x7ff3e41d30df  (unknown) (/tmp/glibc-20180109-1462-njgus6/glibc-2.23/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0)
    @     0x7ff3ed612ec3  sys_futex (./src/base/linux_syscall_support.h:2097)
    @     0x7ff3ed612ec3  base::internal::SpinLockDelay(int volatile*, int, int) (./src/base/spinlock_linux-inl.h:88)
    @     0x7ff3ed612d93  SpinLock::SlowLock() (src/base/spinlock.cc:133)
    @     0x7ff3e7d229a7  SpinLock::Lock() (src/base/spinlock.h:71)
    @     0x7ff3e7d229a7  SpinLockHolder::SpinLockHolder(SpinLock*) (src/base/spinlock.h:136)
    @     0x7ff3e7d229a7  tcmalloc::CentralFreeList::Populate() (src/central_freelist.cc:328)
    @     0x7ff3e7d22bb7  tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) (src/central_freelist.cc:284)
    @     0x7ff3e7d22c5a  tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) (src/central_freelist.cc:264)
    @     0x7ff3e7d25dd0  tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long) (src/thread_cache.cc:166)
    @     0x7ff3e7d37ba7  tcmalloc::ThreadCache::Allocate(unsigned long, unsigned long) (src/thread_cache.h:365)
    @     0x7ff3e7d37ba7  do_malloc_small (src/tcmalloc.cc:1103)
    @     0x7ff3e7d37ba7  do_malloc_no_errno (src/tcmalloc.cc:1112)
    @     0x7ff3e7d37ba7  cpp_alloc (src/tcmalloc.cc:1438)
    @     0x7ff3e7d37ba7  tc_newarray (src/tcmalloc.cc:1646)
    @     0x7ff3e724b236  google::LogMessage::Init(char const*, int, int, void (google::LogMessage::*)()) (src/logging.cc:1191)
    @     0x7ff3e724b9fb  google::LogMessage::LogMessage(char const*, int) (src/logging.cc:1155)
    @     0x7ff3ec8f1336  yb::consensus::LeaderElection::HandleVoteDeniedUnlocked(std::string const&, yb::consensus::LeaderElection::VoterState const&) (yb/consensus/leader_election.cc:410)
    @     0x7ff3ec8f1a1c  yb::consensus::LeaderElection::VoteResponseRpcCallback(std::string const&, scoped_refptr<yb::consensus::LeaderElection> const&) (yb/consensus/leader_election.cc:339)
    @     0x7ff3e9c278ae  std::function<void ()>::operator()() const (/n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:2267)
    @     0x7ff3e9c278ae  yb::rpc::OutboundCall::CallCallback() (yb/rpc/outbound_call.cc:327)
    @     0x7ff3e9c288a4  yb::rpc::OutboundCall::SetResponse(yb::rpc::CallResponse&&) (yb/rpc/outbound_call.cc:366)
    @     0x7ff3e9c17a7b  yb::rpc::Connection::HandleCallResponse(yb::Slice) (yb/rpc/connection.cc:410)
    @     0x7ff3e9c5b8cd  yb::rpc::YBConnectionContext::HandleCall(std::shared_ptr<yb::rpc::Connection> const&, yb::Slice) (yb/rpc/yb_rpc.cc:166)
W0211 09:37:11.758664  6785 kernel_stack_watchdog.cc:139] Thread 7063 stuck at ../../../../../src/yb/consensus/log.cc:245 for 177ms:
Kernel stack:
[<ffffffffffffffff>] 0xffffffffffffffff

User stack:
    @     0x7ff3e41d30df  (unknown) (/tmp/glibc-20180109-1462-njgus6/glibc-2.23/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0)
    @     0x7ff3e4282346  (unknown) (../sysdeps/unix/syscall-template.S:84)
    @     0x7ff3e7d21969  TCMalloc_SystemRelease(void*, unsigned long) (src/system-alloc.cc:539)
    @     0x7ff3e7d23060  tcmalloc::PageHeap::DecommitSpan(tcmalloc::Span*) (src/page_heap.cc:248)
    @     0x7ff3e7d234fb  tcmalloc::PageHeap::ReleaseLastNormalSpan(tcmalloc::PageHeap::SpanList*) (src/page_heap.cc:460)
    @     0x7ff3e7d235be  tcmalloc::PageHeap::ReleaseAtLeastNPages(unsigned long) (src/page_heap.cc:484)
    @     0x7ff3e7d1793e  TCMallocImplementation::ReleaseToSystem(unsigned long) (src/tcmalloc.cc:757)
    @     0x7ff3e82933ad  yb::MemTracker::GcTcmalloc() (yb/util/mem_tracker.cc:528)
    @     0x7ff3e8294a84  yb::MemTracker::Release(long) (yb/util/mem_tracker.cc:384)
    @     0x7ff3ec8f4d54  yb::consensus::LogCache::AccountForMessageRemovalUnlocked(std::shared_ptr<yb::consensus::ReplicateMsg> const&) (yb/consensus/log_cache.cc:409)
    @     0x7ff3ec8f6a04  yb::consensus::LogCache::EvictSomeUnlocked(long, long) (yb/consensus/log_cache.cc:397)
    @     0x7ff3ec8f70ef  yb::consensus::LogCache::LogCallback(long, bool, yb::Callback<void (yb::Status const&)> const&, yb::Status const&) (yb/consensus/log_cache.cc:241)
    @     0x7ff3ec67c9ae  yb::Callback<void (yb::Status const&)>::Run(yb::Status const&) const (yb/gutil/callback.h:451)
    @     0x7ff3ec67c9ae  yb::log::Log::AppendThread::RunThread() (yb/consensus/log.cc:248)
    @     0x7ff3e82ebc25  std::function<void ()>::operator()() const (/n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:2267)
    @     0x7ff3e82ebc25  yb::Thread::SuperviseThread(void*) (yb/util/thread.cc:602)
    @     0x7ff3e4544693  start_thread (/tmp/glibc-20180109-1462-njgus6/glibc-2.23/nptl/pthread_create.c:333)
    @     0x7ff3e42863cc  (unknown) (../sysdeps/unix/sysv/linux/x86_64/clone.S:109)

W0211 09:37:13.770215  6785 kernel_stack_watchdog.cc:139] Thread 7063 stuck at ../../../../../src/yb/consensus/log.cc:245 for 258ms:
Kernel stack:
[<ffffffff810b4cd8>] hrtimer_try_to_cancel.part.25+0x58/0x100
[<ffffffff810b4dc8>] hrtimer_cancel+0x28/0x40
[<ffffffff810f5e25>] futex_wait+0x1a5/0x280
[<ffffffff810f7b5a>] do_futex+0x12a/0x5a0
[<ffffffff810f8050>] SyS_futex+0x80/0x180
[<ffffffff816b51ab>] sysret_audit+0x17/0x21
[<ffffffffffffffff>] 0xffffffffffffffff

User stack:
    @     0x7ff3e41d30df  (unknown) (/tmp/glibc-20180109-1462-njgus6/glibc-2.23/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0)
    @     0x7ff3ed612ec3  sys_futex (./src/base/linux_syscall_support.h:2097)
    @     0x7ff3ed612ec3  base::internal::SpinLockDelay(int volatile*, int, int) (./src/base/spinlock_linux-inl.h:88)
    @     0x7ff3ed612d93  SpinLock::SlowLock() (src/base/spinlock.cc:133)
    @     0x7ff3e7d227cf  SpinLock::Lock() (src/base/spinlock.h:71)
    @     0x7ff3e7d227cf  SpinLockHolder::SpinLockHolder(SpinLock*) (src/base/spinlock.h:136)
    @     0x7ff3e7d227cf  tcmalloc::CentralFreeList::InsertRange(void*, void*, int) (src/central_freelist.cc:232)
    @     0x7ff3e7d25f86  tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) (src/thread_cache.cc:236)
    @     0x7ff3e7d2602b  tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long) (src/thread_cache.cc:197)
    @     0x7ff3e7d380a7  tcmalloc::ThreadCache::Deallocate(void*, unsigned long) (src/thread_cache.h:390)
    @     0x7ff3e7d380a7  do_free_helper (src/tcmalloc.cc:1195)
    @     0x7ff3e7d380a7  do_free_with_callback (src/tcmalloc.cc:1225)
    @     0x7ff3e7d380a7  do_free (src/tcmalloc.cc:1234)
    @     0x7ff3e7d380a7  tc_deletearray (src/tcmalloc.cc:1665)
    @     0x7ff3e9359a6a  std::string::_Rep::_M_dispose(std::allocator<char> const&) (/n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/bits/basic_string.h:2740)
    @     0x7ff3e9359a6a  std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (/n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/bits/basic_string.h:3037)
    @     0x7ff3e9359a6a  google::protobuf::internal::ArenaStringPtr::DestroyNoArena(std::string const*) (thirdparty/yugabyte-thirdparty-2018-02-06T17_39_35/thirdparty/installed/uninstrumented/include/google/protobuf/arenastring.h:264)
    @     0x7ff3e9359a6a  yb::docdb::KeyValuePairPB::SharedDtor() (yb/docdb/docdb.pb.cc:253)
    @     0x7ff3e9359aae  yb::docdb::KeyValuePairPB::~KeyValuePairPB() (yb/docdb/docdb.pb.cc:248)
    @     0x7ff3e9359b00  yb::docdb::KeyValuePairPB::~KeyValuePairPB() (yb/docdb/docdb.pb.cc:249)
    @     0x7ff3e935b92d  google::protobuf::internal::GenericTypeHandler<yb::docdb::KeyValuePairPB>::Delete(yb::docdb::KeyValuePairPB*, google::protobuf::Arena*) (thirdparty/yugabyte-thirdparty-2018-02-06T17_39_35/thirdparty/installed/uninstrumented/include/google/protobuf/repeated_field.h:623)
    @     0x7ff3e935b92d  void google::protobuf::internal::RepeatedPtrFieldBase::Destroy<google::protobuf::RepeatedPtrField<yb::docdb::KeyValuePairPB>::TypeHandler>() (thirdparty/yugabyte-thirdparty-2018-02-06T17_39_35/thirdparty/installed/uninstrumented/include/google/protobuf/repeated_field.h:1473)
    @     0x7ff3e935af77  google::protobuf::RepeatedPtrField<yb::docdb::KeyValuePairPB>::~RepeatedPtrField() (thirdparty/yugabyte-thirdparty-2018-02-06T17_39_35/thirdparty/installed/uninstrumented/include/google/protobuf/repeated_field.h:1934)
    @     0x7ff3e935af77  yb::docdb::KeyValueWriteBatchPB::~KeyValueWriteBatchPB() (yb/docdb/docdb.pb.cc:557)
    @     0x7ff3e935afd0  yb::docdb::KeyValueWriteBatchPB::~KeyValueWriteBatchPB() (yb/docdb/docdb.pb.cc:560)
    @     0x7ff3e957e415  yb::tserver::WriteRequestPB::SharedDtor() (yb/tserver/tserver.pb.cc:1332)
    @     0x7ff3e958630e  yb::tserver::WriteRequestPB::~WriteRequestPB() (yb/tserver/tserver.pb.cc:1327)
    @     0x7ff3e9586380  yb::tserver::WriteRequestPB::~WriteRequestPB() (yb/tserver/tserver.pb.cc:1328)

Also, I was reading the tcmalloc documentation and came across this:

If the sum of the thread cache sizes is less than --tcmalloc_max_total_thread_cache_bytes, 
max_size grows easily. If not, thread cache 1 will try to steal from thread cache 2 (picked round-
robin) by decreasing thread cache 2's max_size. In this way, threads that are more active will steal
 memory from other threads more often than they are have memory stolen from themselves. 
Mostly idle threads end up with small caches and active threads end up with big caches. Note 
that this stealing can cause the sum of the thread cache sizes to be greater than --
tcmalloc_max_total_thread_cache_bytes until thread cache 2 deallocates some memory to trigger
 a garbage collection.

This sounds similar to what might be happening in our case and hence increasing tcmalloc_max_total_thread_cache_bytes might help. Just wanted to confirm this is your reasoning as well to bump up tcmalloc_max_total_thread_cache_bytes.

@pritamdamania87 - can we close out the issue with a link to the two commits you did in this regard?

Hey, sorry Pritam just saw this now.

  1. How many hmget/hmset operations per second? We were experiencing
    issues at 6k-11k ops per second.
  2. Number of keys read/written in each hmget/hmset? ~100 subkeys read
    per hmget
  3. Average number of elements in each hash? ~200 elements in the hash

Thanks,
Michael Chang

On Mon, Feb 12, 2018 at 8:28 PM, Kannan Muthukkaruppan <
[email protected]> wrote:

@pritamdamania87 https://github.com/pritamdamania87 - can we close out
the issue with a link to the two commits you did in this regard?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/YugaByte/yugabyte-db/issues/47#issuecomment-365147522,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABn6RrfA6-SN82c9uIRWtubVdc1aX-ZOks5tUQ9fgaJpZM4R_QdQ
.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

nocaway picture nocaway  Â·  3Comments

kevbaker picture kevbaker  Â·  3Comments

robertpang picture robertpang  Â·  3Comments

IS-Josh picture IS-Josh  Â·  3Comments

bmatican picture bmatican  Â·  3Comments