Yugabyte-db: MVCC issue on tablet server restart: New operation's hybrid time too low

Created on 21 Mar 2020  路  14Comments  路  Source: yugabyte/yugabyte-db

Encountered on a cluster where I was testing some encryption fixes.

cat yb-tserver.FATAL.details.2020-03-20T07_19_06.pid2214.txt
F20200320 07:19:06 ../../src/yb/tablet/mvcc.cc:190] T fc59d3574f67475c87ab99a2df541347 P d734db7152d34ce9a9a03d5956e1e313: New operation's hybrid time too low: { physical: 1584669130187203 }
  max_safe_time_returned_with_lease_={ safe_time: <min> source: kUnknown }
  *ht < max_safe_time_returned_with_lease_.safe_time=0
  static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_with_lease_.safe_time.ToUint64())=6490804757246783488
  ht->PhysicalDiff(max_safe_time_returned_with_lease_.safe_time)=1584669130187203

  max_safe_time_returned_without_lease_={ safe_time: <min> source: kUnknown }
  *ht < max_safe_time_returned_without_lease_.safe_time=0
  static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_without_lease_.safe_time.ToUint64())=6490804757246783488
  ht->PhysicalDiff(max_safe_time_returned_without_lease_.safe_time)=1584669130187203

  max_safe_time_returned_for_follower_={ safe_time: <min> source: kUnknown }
  *ht < max_safe_time_returned_for_follower_.safe_time=0
  static_cast<int64_t>(ht->ToUint64() - max_safe_time_returned_for_follower_.safe_time.ToUint64())=6490804757246783488
  ht->PhysicalDiff(max_safe_time_returned_for_follower_.safe_time)=1584669130187203

  (SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown})={ safe_time: { physical: 1584669361373923 } source: kUnknown }
  *ht < (SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time=1
  static_cast<int64_t>(ht->ToUint64() - (SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time.ToUint64())=-946940805120
  ht->PhysicalDiff((SafeTimeWithSource{last_replicated_, SafeTimeSource::kUnknown}).safe_time)=-231186720

  (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown})={ safe_time: <min> source: kUnknown }
  *ht < (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time=0
  static_cast<int64_t>(ht->ToUint64() - (SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time.ToUint64())=6490804757246783488
  ht->PhysicalDiff((SafeTimeWithSource{last_ht_in_queue, SafeTimeSource::kUnknown}).safe_time)=1584669130187203

  is_follower_side=1
  queue_.size()=0
  queue_=[]
  aborted=[]
    @     0x7f30b9462ecc  yb::LogFatalHandlerSink::send()
    @     0x7f30b864c336  google::LogMessage::SendToLog()
    @     0x7f30b864979a  google::LogMessage::Flush()
    @     0x7f30b864c869  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f30c1402d30  yb::tablet::MvccManager::AddPending()
    @     0x7f30c13a39cf  yb::tablet::TabletBootstrap::PlayWriteRequest()
    @     0x7f30c13a46c8  yb::tablet::TabletBootstrap::HandleOperation()
    @     0x7f30c139f5b0  yb::tablet::TabletBootstrap::HandleEntryPair()
    @     0x7f30c13a4b7f  yb::tablet::ReplayState::ApplyCommittedPendingReplicates<>()
    @     0x7f30c139fd26  yb::tablet::TabletBootstrap::HandleReplicateMessage()
    @     0x7f30c13a0334  yb::tablet::TabletBootstrap::HandleEntry()
    @     0x7f30c13a1378  yb::tablet::TabletBootstrap::PlaySegments()
    @     0x7f30c13a2f9d  yb::tablet::TabletBootstrap::Bootstrap()
    @     0x7f30c13a8adf  yb::tablet::BootstrapTablet()
    @     0x7f30c1c9223c  yb::tserver::TSTabletManager::OpenTablet()
    @     0x7f30b94f4714  yb::ThreadPool::DispatchThread()
    @     0x7f30b94f0e1f  yb::Thread::SuperviseThread()
    @     0x7f30b3d28694  start_thread
    @     0x7f30b346541d  __clone
    @              (nil)  (unknown)

aredocdb priorithigh

All 14 comments

This issue is not present with commit 2b8b24990bcdcdd044ec35468a6091bad0204b36 (tried to restart the tablet server 4 times and waited for tablet bootstrap).

I think we've seen this crop up in a couple of scenarios before.

I noticed this in a test, about 2wks ago as well, but during the write path: https://github.com/yugabyte/yugabyte-db/issues/2895#issuecomment-593657961
cc @spolitov

Trying different versions on my cluster, where the issue is 100% reproducible in some cases. 1932381cdc05905e4209449602123ebf15f6ca07 does not have this issue

Revision bade523c8a6a9a0a4791e9130d5608ac61d47a0c has the issue.

One of these commits introduced the issue, bisecting:

$ git log 1932381cdc05905e4209449602123ebf15f6ca07^..bade523c8a6a9a0a4791e9130d5608ac61d47a0c --pretty=$'%H\t\t%aN\t\t\t%s' src
bade523c8a6a9a0a4791e9130d5608ac61d47a0c                Mikhail Bautin                  [#4023] A flag for veryfing meta block checksums
e52eed54c4635d0f23fefa16f011de35ec97eb87                Amitanand Aiyer                 [#4019] Support the ability to fetch the oldest timestamp record that is more recent than a specified time
9ce54c10608907d93a48372ea5a5d63710185953                Sergei Politov                  [#3989] Allow write failures in SnapshotTxnTest.RemoteBootstrapOnStart
7dcb1fe6dfccd6b7e7749791f90f409c316fc4ba                Sergei Politov                  [#3989] Use union to fetch balance from multiple tables in PgLibPqTest.MultiBankAccountSnapshot
323fa8d28326f4d5cd4d3d558d806ede9358ff01                Sergei Politov                  [#3989] Disable transaction timeout in SealTxnTest.NumBatchesDisable
f99f062eeffcb2d872a7f26edeffa74ef8ea39a0                Dmitry Uspenskiy                        #3991 [YSQL] Handle transaction conflict status in case of parallel read operations
4911ad9c2251a154412d8a261dfc68edab9c3d1b                Bogdan Matican                  [#3012] Move remote bootstrap call to OpenTablet to the threadpool
3258d9ee51b9a6f1d8d1636ed388b6024ad299b9                Rahul Desirazu                  [#3976] Fix encryption format for newly created files
35936e325e79da2bf26c54f08f012f67c4d9f7a5                Sergei Politov                  [#3989] Relax requirement on number of log segments in QLTransactionTest.PreserveLogs
77eef2e034dbc55ac7b8b8b2c3665dd7e2a656ba                Amitanand Aiyer                 [#3611] Persist backfilling timestamp across master failures
34cf5136b060654f7b28d6964819a9283e517b9a                Dmitry Uspenskiy                        #3521 [docdb][ysql] Reuse iterators during txn conflict resolution
8de51f46f8a718ed7378d05751a718e936f00bf8                Jason Kim                       yb-admin: flush_table_by_id, compact_table_by_id (#3814)
a6d2160508d75edade1dd28cbd9ed34c1766b97e                Alex Abdugafarov                        #2711: [YSQL] Partial support for operator families, ALTER OPERATOR and ALTER EXTENSION
18d3a16c6bba5f477eb396adbac8e6c8ecb00427                Sanketh I                       [#3743] Batch copy table operations that happen during create database
da6fa33a8c0af44e36f110b2c1353f0d8822cb8e                Mikhail Bautin                  #3974 Enable checksum verification for meta blocks of encrypted files
0bc2bcaed220f539089d2bac63d64317768e3886                Oleg Loginov                    #3779: [YSQL] Fix for PgGate shutdown race on shutdown_done flag
bb3daa50ff542c97a3ff4428b5a6ac9b0b4ee25d                Sergei Politov                  [#1032] First phase of transaction aware snapshot
d6fa3f9c9b5f9a40ecd264ac1eb150a5e9f04805                rajukumaryb                     #3750: [docdb] Deregister callback for common mem trackers from root mem tracker in destructor of vector
9f4448d40d74757b17d0a4917f9a3e84441066da                neil                    [YSQL] #3896 IndexScan returns all rows when there is no match
99ea0162e986066d4305dba5fabe7caac0492919                Sergei Politov                  [#3918] Add nodiscard attribute to Status class
a03059fee585bd705f4b7f78286f891fdd82b354                Sergei Politov                  [#3916] Fix transaction status resolution with old peer version
d3f3d0a12de5f6cd13fa00dec917c6ab1b9c096d                neil                    [YSQL] #3937 Fixed bug when processing PgsqlConditionPB::operands::column_id
a193dd6bb1325c90fd87cecb900020ad0341bee2                Bogdan Matican                  [#3922] Fix the TS UI transactions link
b265c6f1db4e6c9b04cc4a93679ec9c05991139e                Sergei Politov                  [#3917] Fix determining the oldest SST file for intest cleanup
adf5b66df1dca71861bdefd21bf02417a04e9e40                Bogdan Matican                  [#3876] Cleaner error reporting from process main
27d5b30fd737c8a4d9f699413687535e280bb1e3                Sergei Politov                  [#3921] Fix StackWatchdogTest.TestWatchdog
0a1097ffbc5fb6baea739fa3bf419f47dcc9497c                Amitanand Aiyer                 [#3773] fix tsan race between IsBackfilling and SetIsBackfilling
78570cad663295d00e890f2cfebf34e304b885f4                Amitanand Aiyer                 [#3901] Rename indices to indexs in the context of index-backfill
be41a2fd90b7f3a0bbd8963530b631a36d8d39e3                Amitanand Aiyer                 [#3879] Implement paging for index backfill.
76160264d4bd1c4051746b748a19ee26c311870f                rajukumaryb                     #3624: [YSQL] Parallel queries across tablets for SELECT COUNT|MIN|MAX|SUM (*|col_id) FROM ...
a9b79f1b237c5bd61f6d4a4699a6fedea160d67f                Dmitry Uspenskiy                        #3583 [YSQL] Update on table with multiple FK tables crashes postgres
4ebd611713ffe4a08d72f647b14b6af49dfced5c                Sergei Politov                  Utility macro to create ToString functions
1e12fe19feda58735bcc543a759bf0bffad61fb0                Bogdan Matican                  [#3842] Comment out Connection::Shutdown info log
61ae8ceaa23f1bc8a326aea2ab4a0efa0e0b71f6                Sergei Politov                  [#3828] Fix apply of transaction with deleted tablets
277d24446fd2ef7ae8e07d20dd850225e4765e62                Amitanand Aiyer                 [#3668] Show backfill progress at master/tasks
bea9a45f480b74e180a97418848ef643216fc63b                Jason Kim                       test: increase timeout of TableColocationRemoteBootstrapTest (#3686)
1932381cdc05905e4209449602123ebf15f6ca07                Sergei Politov                  [#3823] Transactions could hang at transaction participant after write stopped

d6fa3f9c9b5f9a40ecd264ac1eb150a5e9f04805 does not have the issue (restarted the tablet server 4 times, waiting for tablet bootstrap each time). So the issue was introduced in this range of commits:

$ git log 1932381cdc05905e4209449602123ebf15f6ca07^..bade523c8a6a9a0a4791e9130d5608ac61d47a0c --pretty=$'%H\t\t%aN\t\t\t%s' src
bade523c8a6a9a0a4791e9130d5608ac61d47a0c                Mikhail Bautin                  [#4023] A flag for veryfing meta block checksums
e52eed54c4635d0f23fefa16f011de35ec97eb87                Amitanand Aiyer                 [#4019] Support the ability to fetch the oldest timestamp record that is more recent than a specified time
9ce54c10608907d93a48372ea5a5d63710185953                Sergei Politov                  [#3989] Allow write failures in SnapshotTxnTest.RemoteBootstrapOnStart
7dcb1fe6dfccd6b7e7749791f90f409c316fc4ba                Sergei Politov                  [#3989] Use union to fetch balance from multiple tables in PgLibPqTest.MultiBankAccountSnapshot
323fa8d28326f4d5cd4d3d558d806ede9358ff01                Sergei Politov                  [#3989] Disable transaction timeout in SealTxnTest.NumBatchesDisable
f99f062eeffcb2d872a7f26edeffa74ef8ea39a0                Dmitry Uspenskiy                        #3991 [YSQL] Handle transaction conflict status in case of parallel read operations
4911ad9c2251a154412d8a261dfc68edab9c3d1b                Bogdan Matican                  [#3012] Move remote bootstrap call to OpenTablet to the threadpool
3258d9ee51b9a6f1d8d1636ed388b6024ad299b9                Rahul Desirazu                  [#3976] Fix encryption format for newly created files
35936e325e79da2bf26c54f08f012f67c4d9f7a5                Sergei Politov                  [#3989] Relax requirement on number of log segments in QLTransactionTest.PreserveLogs
77eef2e034dbc55ac7b8b8b2c3665dd7e2a656ba                Amitanand Aiyer                 [#3611] Persist backfilling timestamp across master failures
34cf5136b060654f7b28d6964819a9283e517b9a                Dmitry Uspenskiy                        #3521 [docdb][ysql] Reuse iterators during txn conflict resolution
8de51f46f8a718ed7378d05751a718e936f00bf8                Jason Kim                       yb-admin: flush_table_by_id, compact_table_by_id (#3814)
a6d2160508d75edade1dd28cbd9ed34c1766b97e                Alex Abdugafarov                        #2711: [YSQL] Partial support for operator families, ALTER OPERATOR and ALTER EXTENSION
18d3a16c6bba5f477eb396adbac8e6c8ecb00427                Sanketh I                       [#3743] Batch copy table operations that happen during create database
da6fa33a8c0af44e36f110b2c1353f0d8822cb8e                Mikhail Bautin                  #3974 Enable checksum verification for meta blocks of encrypted files
0bc2bcaed220f539089d2bac63d64317768e3886                Oleg Loginov                    #3779: [YSQL] Fix for PgGate shutdown race on shutdown_done flag
bb3daa50ff542c97a3ff4428b5a6ac9b0b4ee25d                Sergei Politov                  [#1032] First phase of transaction aware snapshot
d6fa3f9c9b5f9a40ecd264ac1eb150a5e9f04805                rajukumaryb                     #3750: 

35936e325e79da2bf26c54f08f012f67c4d9f7a5 does not have the issue (restarted the tablet server 4 times and waited for tablet bootstrap each time). The list of remaining commits:

$ git log 1932381cdc05905e4209449602123ebf15f6ca07^..bade523c8a6a9a0a4791e9130d5608ac61d47a0c --pretty=$'%H\t\t%aN\t\t\t%s' src
bade523c8a6a9a0a4791e9130d5608ac61d47a0c                Mikhail Bautin                  [#4023] A flag for veryfing meta block checksums
e52eed54c4635d0f23fefa16f011de35ec97eb87                Amitanand Aiyer                 [#4019] Support the ability to fetch the oldest timestamp record that is more recent than a specified time
9ce54c10608907d93a48372ea5a5d63710185953                Sergei Politov                  [#3989] Allow write failures in SnapshotTxnTest.RemoteBootstrapOnStart
7dcb1fe6dfccd6b7e7749791f90f409c316fc4ba                Sergei Politov                  [#3989] Use union to fetch balance from multiple tables in PgLibPqTest.MultiBankAccountSnapshot
323fa8d28326f4d5cd4d3d558d806ede9358ff01                Sergei Politov                  [#3989] Disable transaction timeout in SealTxnTest.NumBatchesDisable
f99f062eeffcb2d872a7f26edeffa74ef8ea39a0                Dmitry Uspenskiy                        #3991 [YSQL] Handle transaction conflict status in case of parallel read operations
4911ad9c2251a154412d8a261dfc68edab9c3d1b                Bogdan Matican                  [#3012] Move remote bootstrap call to OpenTablet to the threadpool
3258d9ee51b9a6f1d8d1636ed388b6024ad299b9                Rahul Desirazu                  [#3976] Fix encryption format for newly created files
35936e325e79da2bf26c54f08f012f67c4d9f7a5                Sergei Politov                  [#3989] Relax requirement on number of log segments in QLTransactionTest.PreserveLogs

Most of those are test fixes, perhaps e52eed54c4635d0f23fefa16f011de35ec97eb87 ?

323fa8d28326f4d5cd4d3d558d806ede9358ff01 has the issue

323fa8d28326f4d5cd4d3d558d806ede9358ff01                Sergei Politov                  [#3989] Disable transaction timeout in SealTxnTest.NumBatchesDisable
f99f062eeffcb2d872a7f26edeffa74ef8ea39a0                Dmitry Uspenskiy                        #3991 [YSQL] Handle transaction conflict status in case of parallel read operations
4911ad9c2251a154412d8a261dfc68edab9c3d1b                Bogdan Matican                  [#3012] Move remote bootstrap call to OpenTablet to the threadpool
3258d9ee51b9a6f1d8d1636ed388b6024ad299b9                Rahul Desirazu                  [#3976] Fix encryption format for newly created files
35936e325e79da2bf26c54f08f012f67c4d9f7a5                Sergei Politov                  [#3989] Relax requirement on number of log segments in QLTransactionTest.PreserveLogs

4911ad9c2251a154412d8a261dfc68edab9c3d1b does not have the issue. See a later comment.

323fa8d28326f4d5cd4d3d558d806ede9358ff01                Sergei Politov                  [#3989] Disable transaction timeout in SealTxnTest.NumBatchesDisable
f99f062eeffcb2d872a7f26edeffa74ef8ea39a0                Dmitry Uspenskiy                        #3991 [YSQL] Handle transaction conflict status in case of parallel read operations
4911ad9c2251a154412d8a261dfc68edab9c3d1b                Bogdan Matican                  [#3012] Move remote bootstrap call to OpenTablet to the threadpool

323fa8d28326f4d5cd4d3d558d806ede9358ff01 is a test-only change 323fa8d28326f4d5cd4d3d558d806ede9358ff01 is mostly a test change, but also has a change in transaction.cc. But f99f062eeffcb2d872a7f26edeffa74ef8ea39a0 also does not look like it could be causing the issue.

It looks like this bug is nondeterministic, and 4911ad9c2251a154412d8a261dfc68edab9c3d1b does in fact have it.

3258d9ee51b9a6f1d8d1636ed388b6024ad299b9 has the issue.

Closing this issue as it was found on a cluster with a lot of corrupted files (artificially injected corruptions).

Was this page helpful?
0 / 5 - 0 ratings