Charts: RabbitMQ high CPU usage on idle VM

Created on 23 Feb 2018 · 41 comments · Source: helm/charts

Is this a request for help?: Yes


Is this a BUG REPORT or FEATURE REQUEST? (choose one): FEATURE REQUEST

Version of Helm and Kubernetes:
Helm: 2.8.0
Kubectl server: 1.7.12-gke.1 (current) (It's GCP)

Which chart:
stable/rabbitmq

What happened:
High CPU usage on an idle VM with only RabbitMQ running, generated by the readiness/liveness probes. Based on Stackdriver charts I see 100% CPU usage on an n1-standard-2 VM. After forking the chart and replacing the probes with a simple tcpSocket check on port 15672, usage dropped to ~0%.

(Screenshot: Stackdriver CPU usage chart.)

What you expected to happen:
The chart should allow customizing the health checks, e.g. use tcpSocket or httpGet instead of exec probes.
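
For illustration, a minimal sketch of what such a probe override could look like; port 15672 is the management port mentioned above (5672 would test the AMQP listener instead), and the surrounding values keys are assumptions, not necessarily what the chart exposes today:

  livenessProbe:
    tcpSocket:
      port: 15672        # management port; use 5672 to check AMQP itself
    initialDelaySeconds: 120
    periodSeconds: 30
    timeoutSeconds: 5
  readinessProbe:
    tcpSocket:
      port: 15672
    initialDelaySeconds: 10
    periodSeconds: 30
    timeoutSeconds: 5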

How to reproduce it (as minimally and precisely as possible):
Run it on GCP with the n1-standard-2 VM type and watch the CPU usage.

All 41 comments

Would like to add that I'm also seeing this on GCP K8s 1.8.7-gke.1: approx. 60% usage at idle.
n1-standard-1 (3.7G) free tier.
erl_child_setup and beam.smp are the main consumers, at around 25% each.

Edit: chart rabbitmq-0.6.17

Edit: upgraded to 0.6.25, no change.

We're seeing this as well (k8s 1.8.10, rabbitmq-0.6.25). This is caused by a longstanding Erlang issue related to the nofile ulimit, which has been known since at least 2014.

If you disable the liveness and readiness probes you will find that the idle usage comes down a lot.

See https://github.com/bitnami/bitnami-docker-rabbitmq/pull/63
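
For anyone who wants to try disabling the probes, a rough sketch of a values override follows; whether the chart actually exposes enabled flags depends on the chart version, so treat the key names as assumptions and check the chart's values.yaml:

  # Assumed key names; verify against the chart's values.yaml.
  livenessProbe:
    enabled: false
  readinessProbe:
    enabled: false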

Cool! Disabling the readiness and liveness probes has worked so far. But is there any option to change the ulimit in the Docker image, the chart, or the deployment itself?

@robermorales the docker image including the fix (https://github.com/bitnami/bitnami-docker-rabbitmq/pull/69) is being prepared by bitnami; then the default value should be OK, but we should still expose it in the chart values.

thanks!

The fix has been released in docker images 3.7.4-r4 and 3.6.15-r4 (and their aliases: 3.7.4, 3.7, 3.6.15 and 3.6).

Since #4591, values.yaml became non-prod, and the default image tag is the floating tag 3.7.4 instead of 3.7.4-r1 (values-production.yaml still refers to 3.7.4-r1).
Alas, we also default to pullPolicy: IfNotPresent, so in practice the floating tag is not a great idea...

I'm not sure values-production.yaml is a good pattern; maybe it should just override some values instead of redefining everything (only redis and rabbitmq use it).

Maybe we should drop the floating tag and use immutable tags: 3.7.4-r4.

In any case, we should bump to -r4 to get the high CPU usage fix.
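
For reference, pinning to the immutable tag would look roughly like this in values.yaml (the image block structure below is assumed from the Bitnami-style layout; check the chart's actual values.yaml):

  image:
    registry: docker.io
    repository: bitnami/rabbitmq
    tag: 3.7.4-r4          # immutable tag containing the ulimit fix
    pullPolicy: IfNotPresent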

I still have the exact same issue in 3.7.6-r8.

@rips-hb I also have some high CPU usage, but it's periodic, not constant. I found out it's the probes (rabbitmqctl status) that use that much CPU periodically; it's a different issue (which should be filed separately here), and I'm not sure what to do to fix it.

@thomas-riccardi for me it is constant, unfortunately, and I could resolve it by disabling liveness and readiness as suggested in this thread. Since it is only a test system that is no problem, but I would rather keep these checks on a production system. I will investigate a bit more, and if I find something else I will create a new ticket.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

still an issue

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

The fix https://github.com/bitnami/bitnami-docker-rabbitmq/pull/63 is incomplete and does not set the ulimit for the liveness / readiness probes. So this is still an issue.

@macropin you are right.
However, I did not find a difference in execution time for rabbitmqctl status (or rabbitmqctl node_health_check) with and without ulimit -n 1024 (or ulimit -n 65536), so adding ulimit -n to the health checks will probably not help.

@macropin see also rabbitmq-ha advancements: #7378 (and #7752).

@macropin @thomas-riccardi As I see it, the fix in bitnami/bitnami-docker-rabbitmq#63 is not only incomplete, it is completely inapplicable because it modifies the Docker entrypoint that we _do not_ use.

Also, I concur with @thomas-riccardi's findings: I did some testing too, and it turns out that even setting a ridiculously low ulimit -n 128 helps neither to reduce the probes' CPU time nor to reduce the Pod's overall CPU usage.

Could this be fixed at helm install time? I have the same problem here.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@TomaszUrugOlszewski It should be fixed by #8140 (and subsequent fixes), modulo the issue about disk or memory alarms (see #8635).

3.5.7, same issue: high CPU load.

RabbitMQ 3.7.8, Erlang 21.1, same issue: high CPU usage. QPS is about 200/s.

But was this fixed?

I'm seeing the exact same behavior on 3.7.8 (Erlang 21.1), with very high CPU usage when idle; as a workaround, disabling both the readiness and liveness checks seems to fix the issue.

Hi,

Thanks for the feedback. If this issue is constant, then maybe it makes sense to change the readiness/liveness probes to simple TCP port checks. Thoughts on that?

@javsalgar I tried again with the latest version of the chart (rabbitmq-4.0.1) with RabbitMQ 3.7.9 (Erlang 21.1) and this problem does not seem to happen anymore.

Hi everyone,

Do you still suffer from high CPU usage because of the readiness/liveness probes in the latest versions of the chart?

I agree with @javsalgar: we can make the probes simpler (such as TCP port checks) or decrease their frequency if you're running into issues because of that.

In my case, I greatly reduced the frequency of the probes.

What values did you use @desaintmartin ?

  livenessProbe:
    timeoutSeconds: 30
    periodSeconds: 30
  readinessProbe:
    timeoutSeconds: 30
    periodSeconds: 30

So the "fix" isn't a fix, rather a workaround. This should be reopened. Were the ulimits ever updated for the probes?

Hi @macropin

Currently the liveness/readiness probes use curl instead of rabbitmqctl. Why do you consider it necessary to update the ulimits on the probes?

Oh, that's great. I missed that change. The high CPU usage of rabbitmqctl was due to it not inheriting the entrypoint ulimit, which caused a CPU usage issue with the Erlang VM.

Yes, it was one of the reasons why the CPU usage was so high. That's why the probes were moved to curl in
https://github.com/helm/charts/pull/8140
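
For context, a curl-based probe of the kind referenced here is roughly an exec probe that hits the management API instead of calling rabbitmqctl; this is only a sketch, and the actual command, endpoint, and credential env var names in #8140 may differ:

  livenessProbe:
    exec:
      command:
        - sh
        - -c
        # Sketch only: endpoint and credential variable names are assumptions.
        - curl --fail --user "$RABBITMQ_USERNAME:$RABBITMQ_PASSWORD" http://127.0.0.1:15672/api/healthchecks/node
    initialDelaySeconds: 120
    periodSeconds: 30
    timeoutSeconds: 20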

Hi, I've found another reason why RabbitMQ can show noticeable CPU usage when idle or under a light load. RabbitMQ runs on Erlang and uses its scheduling capabilities: to schedule processes, Erlang uses scheduler threads, and by default their number depends on the number of logical cores. This is a problem when running in Docker/Kubernetes, because RabbitMQ will think it has more resources than it actually has. In our case, a RabbitMQ node runs on a server with 40 cores, but we limit it to 1 core in Kubernetes. Erlang will run 40 scheduler threads that are constantly context switching, which generates the CPU usage. When I set the number of scheduler threads to 1, the CPU usage dropped from 23% to 3%.
You can check how many scheduler threads you are using with rabbitmqctl status:

$ rabbitmqctl status
...
{erlang_version,
     "Erlang/OTP 20 [erts-9.3.3.3] [source] [64-bit] [smp:40:40] [ds:40:40:10] [async-threads:640] [hipe] [kernel-poll:true]\n"},
...

The numbers after smp are the number of scheduler threads. For more info see the Erlang scheduler documentation. You can set the value using an environment variable:

RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+S 1:1"

@Artimi how did you set "RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS" with the rabbitmq Helm chart?

@infa-ddeore We are actually not using Helm, just a custom-made Kubernetes deployment. I just wrote it here because I had similar problems to those in this issue.

@infa-ddeore the feature does indeed seem to be missing.
It could easily be added, like I did in https://github.com/helm/charts/pull/12908 for the metrics container.

We could also use the Downward API to get the CPU requests or limits and automatically generate the correct value for RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS in the command. Or just use Helm templating.
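
A sketch of that idea on the container spec, assuming a hand-written StatefulSet rather than current chart behaviour: the Downward API exposes the CPU limit as an env var, and the scheduler count is pinned explicitly (here hard-coded to 1, matching the +S 1:1 example above):

  containers:
    - name: rabbitmq
      image: docker.io/bitnami/rabbitmq:3.7.9
      resources:
        limits:
          cpu: "1"
      env:
        - name: RABBITMQ_CPU_LIMIT              # CPU limit rounded up to whole cores (informational)
          valueFrom:
            resourceFieldRef:
              containerName: rabbitmq
              resource: limits.cpu
              divisor: "1"
        - name: RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS
          value: "+S 1:1"                       # pin Erlang schedulers; deriving this from the limit would need templating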

@thomas-riccardi thanks for the pointers; for now I will update the stable/rabbitmq chart locally with this variable and deploy that.

@Artimi setting "RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS" to "+S 1:1" doesn't seem to help me. I will try disabling the liveness and readiness checks; my RabbitMQ is 3.7.8 and Erlang 21.

Thanks for reporting it @Artimi

I just created a PR so the user has a couple of parameters to limit the number of scheduler threads.
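
Roughly, the new parameters are expected to look something like this in values.yaml (the names below are approximate; see the PR for the final form), mapping onto the +S flag described above:

  rabbitmq:
    maxAvailableSchedulers: 2   # first number of "+S 2:2"
    onlineSchedulers: 2         # second number of "+S 2:2"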
