Every time, we write both the term and the index to RocksDB. I think we only need to write the index. When the term is needed, we can read it from the log using LogReader::LookupOpId (because LogGCOp will not clean up the log entries we still need).
@BiterrorChen Thank you for your question! This particular line sets something we call "frontiers", one of our custom additions to RocksDB. The frontiers get added to a RocksDB write batch, then saved to the RocksDB SSTables and manifest when the memtable is flushed. On tablet server restart we read them back to determine the OpId of the last Raft operation that has been persisted in RocksDB. Then, at log replay time (see tablet_bootstrap.cc), we only replay log operations that are not yet persisted in RocksDB. This lets us avoid maintaining a separate log for RocksDB in addition to the Raft log. If a tablet server crashes and the data in the RocksDB memtable gets lost, it gets replayed from the Raft log.

So, when you say we're writing the term and index to RocksDB, we're not really writing them with every key/value pair. We're just updating some in-memory metadata that gets flushed to the SSTable / manifest at the next flush operation. At that point it does not matter whether we write one or two 64-bit integers; either way it is tiny compared to the memtable being flushed.

As for LookupOpId, it tells us nothing about the latest OpId of an operation that made it to a RocksDB SSTable. In fact, in our design, the log always contains a longer history of changes than RocksDB, because the RocksDB memtable only gets flushed periodically and could get lost on server restart. But because all committed data is always persisted in the Raft log, no acknowledged writes get lost, and by not maintaining a separate log for RocksDB we achieve high performance. The "frontiers" mechanism is what lets us figure out which Raft log entries need to be replayed on restart.

Please let us know if this addresses your question.
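To make the flush-and-replay flow described above more concrete, here is a minimal Python sketch of the idea. It is a toy model, not the actual YugabyteDB or RocksDB code: `OpId`, `ToyStore`, `apply`, `flush`, `crash`, and `replay` are hypothetical names made up for illustration, and the "persisted" state is just a Python dict standing in for the SSTables and manifest.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class OpId:
    """Raft operation id: two integers, as discussed in the thread."""
    term: int
    index: int


class ToyStore:
    """Hypothetical stand-in for RocksDB: one memtable plus flushed data."""

    def __init__(self) -> None:
        self.memtable: Dict[str, str] = {}
        self.memtable_frontier: Optional[OpId] = None  # in-memory metadata only
        self.flushed: Dict[str, str] = {}              # models the SSTables
        self.flushed_frontier: Optional[OpId] = None   # models persisted metadata

    def apply(self, op_id: OpId, key: str, value: str) -> None:
        # The write goes to the memtable; the frontier is only updated in memory,
        # it is not stored alongside each key/value pair.
        self.memtable[key] = value
        self.memtable_frontier = op_id

    def flush(self) -> None:
        # The frontier is persisted once per flush, together with the data.
        if not self.memtable:
            return
        self.flushed.update(self.memtable)
        self.flushed_frontier = self.memtable_frontier
        self.memtable = {}
        self.memtable_frontier = None

    def crash(self) -> None:
        # Unflushed memtable contents and the in-memory frontier are lost.
        self.memtable = {}
        self.memtable_frontier = None


def replay(store: ToyStore, raft_log: List[Tuple[OpId, str, str]]) -> List[OpId]:
    """Re-apply only the log entries that are newer than the flushed frontier."""
    replayed = []
    for op_id, key, value in raft_log:
        frontier = store.flushed_frontier
        # Assumption of this sketch: entries are compared by index only.
        if frontier is not None and op_id.index <= frontier.index:
            continue  # already persisted in the flushed data
        store.apply(op_id, key, value)
        replayed.append(op_id)
    return replayed


# Usage: two operations are flushed, a third is lost in a crash and replayed.
raft_log = [(OpId(1, 1), "a", "1"), (OpId(1, 2), "b", "2"), (OpId(2, 3), "a", "3")]
store = ToyStore()
store.apply(*raft_log[0])
store.apply(*raft_log[1])
store.flush()
store.apply(*raft_log[2])
store.crash()
replayed_ops = replay(store, raft_log)  # only the entry with index 3 is replayed
```

In this model the frontier costs one metadata update per write and one small record per flush, which is the point made above: storing the term next to the index adds a single integer per flush, not per key/value pair.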