Our k8s templates set --cache and --max-sql-memory to 25%, on the assumption that kubernetes sets the cgroup memory limit appropriately (and that we detect this, which is also an issue: #31750). That is apparently not true, so kubernetes deployments commonly exceed their memory limits and crash. We need to update the k8s templates to communicate the memory limit to cockroach in a way it will understand.
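As a rough sketch of the fix (field names from the standard StatefulSet container spec; the sizes are placeholders, not recommendations), the template could both set a pod memory limit and pass explicit flag values, so the two can't drift apart:

```yaml
# Sketch only: caps the container's memory via the k8s resource limit and
# tells cockroach the same budget explicitly instead of relying on
# cgroup-limit detection.
containers:
- name: cockroachdb
  image: cockroachdb/cockroach
  resources:
    requests:
      memory: "8Gi"
    limits:
      memory: "8Gi"            # placeholder; size for your workload
  command:
    - "/cockroach/cockroach"
    - "start"
    - "--cache=2GiB"           # 25% of the 8Gi limit, stated explicitly
    - "--max-sql-memory=2GiB"  # likewise, rather than detected at runtime
```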
We have a 12-node cluster whose nodes are OOMKilled frequently. However, this has had no impact on production uptime or availability. Is this kind of "expected", then, and OK for now? Or could I reduce the number of restarts by providing a higher memory limit?
CockroachDB can tolerate nodes being OOM killed, but it's not good for performance and it's not something that should happen under normal usage. Until we implement an automatic fix for this, you can work around it in one of two ways: give the container a higher memory limit, or replace the default 25% values (in the --cache and --max-sql-memory flags) with fixed amounts appropriate for the container's memory allocation. For example, if you're allocating 8GB of memory to the container, use --cache 2GB --max-sql-memory 2GB.

@timveil, since we were just talking about k8s in production, this is probably the type of thing we need to get into our docs ASAP. I'll work on that.
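The arithmetic above is simple, but for completeness here is a small sketch (the helper name and formatting are my own, not part of CockroachDB) that turns a container allocation into fixed flag values at the recommended 25% each:

```python
def flag_values(container_mem_gb: float, fraction: float = 0.25) -> dict:
    """Compute fixed --cache and --max-sql-memory values for a container.

    Replaces the percentage defaults with explicit sizes, e.g. an 8GB
    container at the default 25% fraction yields 2GB for each flag.
    """
    amount = container_mem_gb * fraction
    size = f"{amount:g}GB"
    return {"--cache": size, "--max-sql-memory": size}

# For an 8GB container: {'--cache': '2GB', '--max-sql-memory': '2GB'}
```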
Not quite - those issues make CockroachDB understand the memory limit of the container it's running in. However, I believe there's still work to do here, because our default k8s configuration templates don't set a memory limit at all (I think; I haven't verified this recently). This issue is about updating the k8s templates, not the database itself.