Scylla version (or git commit hash): 54069162f545f57c7031973a2479eb356dca1a2a
Cluster size: 3
AWS AMI: c3.8xlarge
_Description_
1) Enable Slow Query Logging.
2) Run cassandra-stress: cassandra-stress read n=10000000 -node <address> -rate threads=500
3) Check the collectd tracing statistics with scyllatop "*trac*". Note that the "cached_records" counter has a huge value.
This happens because the cached component of the tracing budget becomes "negative": we return more records than we have consumed. Since the counter is unsigned, the result wraps around to a huge value.
This doesn't happen when regular tracing is enabled, so there must be a logic error in the budget handling related to Slow Query Logging.
I'll keep digging.
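To make the wraparound concrete, here is a minimal C++ sketch. The `cached_budget` struct and the `counter_after` helper are hypothetical names made up for illustration, not Scylla's actual tracing code; the only assumption is that the counter is an unsigned 64-bit integer that is decremented without a bounds check.

```cpp
#include <cstdint>

// Hypothetical model of the cached part of a tracing budget.
// Names and layout are illustrative only, not Scylla's real code.
struct cached_budget {
    std::uint64_t cached_records = 0;

    // Account for records taken from the budget.
    void consume(std::uint64_t n) { cached_records += n; }

    // Give records back. Nothing checks that n <= cached_records,
    // so returning too much wraps around modulo 2^64.
    void give_back(std::uint64_t n) { cached_records -= n; }
};

// Consume `taken` records, return `returned` records, report the counter.
inline std::uint64_t counter_after(std::uint64_t taken, std::uint64_t returned) {
    cached_budget b;
    b.consume(taken);
    b.give_back(returned);
    return b.cached_records;
}
```

With balanced accounting, `counter_after(5, 5)` is 0. Returning a single record too many, as in `counter_after(5, 6)`, yields the largest 64-bit value, which is the kind of number a metrics tool would then display.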
The issue is caused by the fact that the trace_state migrates to the other shard without using global_trace_state_ptr.
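A rough way to picture why a cross-shard migration breaks the accounting, assuming each shard keeps its own counter (an assumption of this sketch; `shard_budgets` and `run_session` are hypothetical and do not use Seastar): the records are consumed on the shard where the session started but given back on the shard where it ended up.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for per-shard tracing state: one counter per shard.
constexpr std::size_t shard_count = 2;

struct shard_budgets {
    std::array<std::uint64_t, shard_count> cached_records{};

    void consume(std::size_t shard, std::uint64_t n) { cached_records[shard] += n; }
    void give_back(std::size_t shard, std::uint64_t n) { cached_records[shard] -= n; }
};

// A trace session consumes n records on `origin` and returns them on `finisher`.
inline shard_budgets run_session(std::size_t origin, std::size_t finisher, std::uint64_t n) {
    shard_budgets s;
    s.consume(origin, n);
    s.give_back(finisher, n);  // wrong shard if finisher != origin
    return s;
}
```

When `origin == finisher`, both counters end at 0. When the session finishes on a different shard, for example `run_session(0, 1, 3)`, shard 0 keeps a stale 3 and shard 1 wraps around to 2^64 - 3, even though the total consumed and returned amounts match.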
I'm now looking for the specific place in the code where this happens...
The offending trace point is:
tracing::trace(_trace_state, "Reading key {} from sstable {}", *_rp.key(), seastar::value_of([&sstable] { return sstable->get_filename(); }));
@duarten FYI ;)
Yikes! I thought that had been fixed with #1678 :/
Nope, I only fixed the places in storage_proxy that I knew about. If you know of any other place, please don't hesitate to share... ;)
The patch fixing THIS problem is on the list. I hope this is the last place like this... ;)
That seems to be the only missing one!
_Conclusion_
The issue affected not only Slow Query Logging but also regular Tracing, and it was pure luck that it didn't cause a crash like the one in #1678.
So I'd classify this issue as critical and suggest merging the fix into the scylla-1.4 branch.
Looking at https://github.com/scylladb/scylla/commit/46b86ff80126c72b22a13d4245f3e11ab869c6ba, the following places in storage_proxy need the global_trace_state_ptr: