Discussed separately with @jordanlewis: we should allow the per-txn and per-session monitor to relinquish their mis-accounted monitored bytes peacefully, with a mere error to the reporting server but without panic. The working assumption is that this memory has been effectively released by the time these monitors are closed.
Yes, I'd like to propose changing the semantics of accounting failures from causing a panic to merely logging an error and reporting to Sentry.
I think we've been playing a bit of whack-a-mole with these issues, and I think it'd be better to build some tolerance into the system to prevent server crashes.
To make this viable, we need to make sure that we still panic for misaccounted memory in testing, and that we throttle the rate of errors to Sentry to prevent a common query from flooding the server.
Is this the wrong place to bring up the larger question of: should we recover from panics that don't indicate data correctness issues? I know @jordanlewis has brought this up before. It seems unfortunate that a nil pointer during something like planning causes the node to crash, rather than just giving an error.
@justinj this has been discussed before. As a matter of fact I had a PR ready for this: #15063
See the discussion on the PR to understand why it wasn't adopted.
Also fundamentally this issue: #9884
As of #22685 we no longer panic on unexpected %d leftover bytes in release builds and instead log and report the error. But my fix only targeted that specific panic, so we might still want to expand this behavior to other memory accounting errors. The current implementation also doesn't include the relevant SQL statement in the report, which would be great to fix but seems nontrivial.
@jordanlewis I think we've covered much ground already so that the remaining issues are either related to admission control (not exactly the same topic any more) or specific monitoring bugs. Do we still have a need for an overarching meta issue?
You're right. Thanks!
Most helpful comment
Yes, I'd like to propose changing the semantics of accounting failures from causing a
panicto merely logging an error and reporting to Sentry.I think we've been playing a bit of whack-a-mole with these issues, and I think it'd be better to build some tolerance into the system to prevent server crashes.
To make this viable, we need to make sure that we still panic for misaccounted memory in testing, and that we throttle the rate of errors to Sentry to prevent a common query from flooding the server.