Cockroach: sql: meta-issue to point to a pattern of memory monitoring errors

Created on 22 Jan 2018 · 8Comments · Source: cockroachdb/cockroach

stability: panic while rebalancing away from node. #18777
sql: unreleased memory detected by monitor during finishSQLTxn #15766
sql: leaked memory panic during a prepared ROLLBACK #19276
sql: Joining subqueries causes OOM #21165
sql: adminServer.RangeLog can trigger BytesMonitor panic #19110
distsql: memory account panic: unexpected 7085 leftover bytes #20720
sql: panic in SELECT DISTINCT _, _ FROM (SELECT _, _, _ FROM _._) AS _ JOIN (SELECT _, _, _ FROM _._) AS _ USING (_, _) WHERE ((_ > (-_)) OR ((_ = (-_)) AND (_ > _))) ORDER BY _ LIMIT _ #21744
sql: SCRUB complaining about memory budget exceeded #21544

A-sql-memmon C-investigation meta-issue

Source

knz

Most helpful comment

Yes, I'd like to propose changing the semantics of accounting failures from causing a panic to merely logging an error and reporting to Sentry.

I think we've been playing a bit of whack-a-mole with these issues, and I think it'd be better to build some tolerance into the system to prevent server crashes.

To make this viable, we need to make sure that we still panic for misaccounted memory in testing, and that we throttle the rate of errors to Sentry to prevent a common query from flooding the server.

jordanlewis on 24 Jan 2018

👍3

All 8 comments

Discussed separately with @jordanlewis: we should allow the per-txn and per-session monitor to relinquish their mis-accounted monitored bytes peacefully, with a mere error to the reporting server but without panic. The working assumption is that this memory has been effectively released by the time these monitors are closed.

knz on 24 Jan 2018

Yes, I'd like to propose changing the semantics of accounting failures from causing a panic to merely logging an error and reporting to Sentry.

I think we've been playing a bit of whack-a-mole with these issues, and I think it'd be better to build some tolerance into the system to prevent server crashes.

To make this viable, we need to make sure that we still panic for misaccounted memory in testing, and that we throttle the rate of errors to Sentry to prevent a common query from flooding the server.

jordanlewis on 24 Jan 2018

👍3

Is this the wrong place to bring up the larger question of: should we recover from panics that don't indicate data correctness issues? I know @jordanlewis has brought this up before. It seems unfortunate that a nil pointer during something like planning causes the node to crash, rather than just giving an error.

justinj on 24 Jan 2018

@justinj this has been discussed before. As a matter of fact I had a PR ready for this: #15063
See the discussion on the PR to understand why it wasn't adopted.

knz on 24 Jan 2018

Also fundamentally this issue: #9884

knz on 24 Jan 2018

As of #22685 we no longer panic on unexpected %d leftover bytes in release builds and instead log and report the error. But my fix only targeted that specific panic, so we might still want to expand this behavior to other memory accounting errors. The current implementation also doesn't include the relevant SQL statement in the report, which would be great to fix but seems nontrivial.

solongordon on 15 Feb 2018

@jordanlewis I think we've covered much ground already so that the remaining issues are either related to admission control (not exactly the same topic any more) or specific monitoring bugs. Do we still have a need for an overarching meta issue?

knz on 6 May 2019

You're right. Thanks!

jordanlewis on 7 May 2019

Was this page helpful?

0 / 5 - 0 ratings