Cockroach: sql: meta-issue to point to a pattern of memory monitoring errors

Created on 22 Jan 2018 · 8 comments · Source: cockroachdb/cockroach

  • stability: panic while rebalancing away from node. #18777
  • sql: unreleased memory detected by monitor during finishSQLTxn #15766
  • sql: leaked memory panic during a prepared ROLLBACK #19276
  • sql: Joining subqueries causes OOM #21165
  • sql: adminServer.RangeLog can trigger BytesMonitor panic #19110
  • distsql: memory account panic: unexpected 7085 leftover bytes #20720
  • sql: panic in SELECT DISTINCT _, _ FROM (SELECT _, _, _ FROM _._) AS _ JOIN (SELECT _, _, _ FROM _._) AS _ USING (_, _) WHERE ((_ > (-_)) OR ((_ = (-_)) AND (_ > _))) ORDER BY _ LIMIT _ #21744
  • sql: SCRUB complaining about memory budget exceeded #21544
A-sql-memmon C-investigation meta-issue

All 8 comments

Discussed separately with @jordanlewis: we should allow the per-txn and per-session monitors to relinquish their mis-accounted monitored bytes peacefully, with a mere error sent to the reporting server but without a panic. The working assumption is that this memory has effectively been released by the time these monitors are closed.
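
A minimal sketch of what that relinquish-without-panic behavior could look like (hypothetical types and names, not the actual `BytesMonitor` API):

```go
package memmon

import "log"

// Monitor is a hypothetical stand-in for a per-txn or per-session
// bytes monitor; the real BytesMonitor is considerably richer.
type Monitor struct {
	name      string
	allocated int64
}

func (m *Monitor) Grow(n int64)   { m.allocated += n }
func (m *Monitor) Shrink(n int64) { m.allocated -= n }

// Stop closes the monitor. Rather than panicking on leftover bytes,
// it releases them and logs the discrepancy, on the working
// assumption that the memory is effectively freed once the
// txn/session ends.
func (m *Monitor) Stop() {
	if m.allocated != 0 {
		log.Printf("%s: unexpected %d leftover bytes at close; releasing",
			m.name, m.allocated)
		m.allocated = 0
	}
}
```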

Yes, I'd like to propose changing the semantics of accounting failures from causing a panic to merely logging an error and reporting to Sentry.

I think we've been playing a bit of whack-a-mole with these issues, and I think it'd be better to build some tolerance into the system to prevent server crashes.

To make this viable, we need to make sure that we still panic for misaccounted memory in testing, and that we throttle the rate of errors to Sentry to prevent a common query from flooding the server.
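
A sketch of that pattern, assuming a `testingBuild` flag (set via a build tag or test hook), a placeholder `reportToSentry`, and `golang.org/x/time/rate` for throttling; none of these names come from the actual codebase:

```go
package memmon

import (
	"fmt"
	"log"
	"time"

	"golang.org/x/time/rate"
)

// testingBuild would be enabled under the test suite so that
// misaccounted memory still fails loudly there.
var testingBuild = false

// sentryLimiter caps reporting so a hot query path cannot flood
// the server: at most one report per 10s, with a burst of 5.
var sentryLimiter = rate.NewLimiter(rate.Every(10*time.Second), 5)

// reportToSentry is a placeholder for the real reporting client.
func reportToSentry(err error) { log.Printf("sentry: %v", err) }

// onAccountingError panics in tests but only logs and (rate-limited)
// reports in release builds.
func onAccountingError(format string, args ...interface{}) {
	err := fmt.Errorf(format, args...)
	if testingBuild {
		panic(err)
	}
	log.Printf("memory accounting error: %v", err)
	if sentryLimiter.Allow() {
		reportToSentry(err)
	}
}
```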

Is this the wrong place to bring up the larger question: should we recover from panics that don't indicate data correctness issues? I know @jordanlewis has brought this up before. It seems unfortunate that a nil pointer dereference during something like planning crashes the node rather than just returning an error.
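
For reference, the standard Go pattern for that kind of recovery (a generic sketch, not the planner's actual code):

```go
package main

import "fmt"

// planQuery converts any panic raised during planning into an error
// returned to the client instead of crashing the node.
func planQuery(sql string) (plan string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("internal error while planning %q: %v", sql, r)
		}
	}()
	return buildPlan(sql), nil
}

// buildPlan stands in for the real planner; it simulates a
// nil-pointer dereference to exercise the recovery path.
func buildPlan(sql string) string {
	if sql == "" {
		var p *string
		return *p // panics: nil pointer dereference
	}
	return "plan(" + sql + ")"
}

func main() {
	if _, err := planQuery(""); err != nil {
		fmt.Println("recovered:", err)
	}
}
```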

@justinj this has been discussed before. In fact, I had a PR ready for exactly this: #15063.
See the discussion on that PR to understand why it wasn't adopted.

Also fundamentally related: #9884

As of #22685 we no longer panic on `unexpected %d leftover bytes` in release builds and instead log and report the error. But my fix only targeted that specific panic, so we might still want to expand this behavior to other memory accounting errors. The current implementation also doesn't include the relevant SQL statement in the report, which would be great to fix but seems nontrivial.
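
Hypothetically, including the statement could look like the small extension below of the earlier `onAccountingError` sketch; the nontrivial part is plumbing the statement down to wherever the monitor is closed:

```go
// reportLeftoverBytes is a hypothetical release-build reporting path
// that tags the error with the offending statement (which would need
// to be scrubbed of literals before leaving the node).
func reportLeftoverBytes(monitorName string, leftover int64, stmt string) {
	onAccountingError("%s: unexpected %d leftover bytes (statement: %s)",
		monitorName, leftover, stmt)
}
```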

@jordanlewis I think we've covered enough ground already that the remaining issues are either related to admission control (no longer quite the same topic) or are specific monitoring bugs. Do we still need an overarching meta-issue?

You're right. Thanks!
