Charts: RabbitMQ high CPU usage on idle VM

Created on 23 Feb 2018 · 41 comments · Source: helm/charts

Is this a request for help?: Yes


Is this a BUG REPORT or FEATURE REQUEST? (choose one): FEATURE REQUEST

Version of Helm and Kubernetes:
Helm: 2.8.0
Kubectl server: 1.7.12-gke.1 (current) (It's GCP)

Which chart:
stable/rabbitmq

What happened:
High CPU usage on an idle VM with only RabbitMQ running, generated by the readiness/liveness probes. Based on Stackdriver charts I see 100% CPU usage on an n1-standard-2 VM. After forking the chart and replacing the probes with a simple tcpSocket check on port 15672, usage dropped to ~0%.

(Screenshot: Stackdriver CPU usage chart.)

What you expected to happen:
The chart should allow customizing the health checks, e.g. use tcpSocket or httpGet instead of exec probes.
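
For illustration, a minimal sketch of what such a probe override could look like; port 15672 is the management port mentioned above (5672 would test the AMQP listener instead), and the surrounding values keys are assumptions, not necessarily what the chart exposes today:

  livenessProbe:
    tcpSocket:
      port: 15672        # management port; use 5672 to check AMQP itself
    initialDelaySeconds: 120
    periodSeconds: 30
    timeoutSeconds: 5
  readinessProbe:
    tcpSocket:
      port: 15672
    initialDelaySeconds: 10
    periodSeconds: 30
    timeoutSeconds: 5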

How to reproduce it (as minimally and precisely as possible):
Run it on GCP with the n1-standard-2 VM type and watch the CPU usage.

All 41 comments

Would like to add that I'm also seeing this on GCP K8s 1.8.7-gke.1: approx. 60% usage at idle.
n1-standard-1 (3.7G) free tier.
erl_child_setup and beam.smp are the main consumers, at around 25% each.

Edit: chart rabbitmq-0.6.17

Edit: upgraded to 0.6.25, no change.

We're seeing this as well (k8s 1.8.10, rabbitmq-0.6.25). This is caused by a longstanding Erlang issue related to the nofile ulimit, which has been known since at least 2014.

If you disable the liveness and readiness probes you will find that the idle usage comes down a lot.

See https://github.com/bitnami/bitnami-docker-rabbitmq/pull/63
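
For anyone who wants to try disabling the probes, a rough sketch of a values override follows; whether the chart actually exposes enabled flags depends on the chart version, so treat the key names as assumptions and check the chart's values.yaml:

  # Assumed key names; verify against the chart's values.yaml.
  livenessProbe:
    enabled: false
  readinessProbe:
    enabled: false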

Cool! Disabling the readiness and liveness probes has worked so far. But is there any option to change the ulimit in the Docker image, the chart, or the deployment itself?

@robermorales the docker image including the fix (https://github.com/bitnami/bitnami-docker-rabbitmq/pull/69) is being prepared by bitnami; then the default value should be OK, but we should still expose it in the chart values.

thanks!

The fix has been released in docker images 3.7.4-r4 and 3.6.15-r4 (and their aliases: 3.7.4, 3.7, 3.6.15 and 3.6).

Since #4591, values.yaml became non-prod, and the default image tag is the floating tag 3.7.4 instead of 3.7.4-r1 (values-production.yaml still refers to 3.7.4-r1).
Alas, we also default to pullPolicy: IfNotPresent, so in practice the floating tag is not a great idea...

I'm not sure values-production.yaml is a good pattern; maybe it should just override some values instead of redefining everything (only redis and rabbitmq use it).

Maybe we should drop the floating tag and use immutable tags: 3.7.4-r4.

In any case, we should bump to -r4 to get the high CPU usage fix.
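
For reference, pinning to the immutable tag would look roughly like this in values.yaml (the image block structure below is assumed from the Bitnami-style layout; check the chart's actual values.yaml):

  image:
    registry: docker.io
    repository: bitnami/rabbitmq
    tag: 3.7.4-r4          # immutable tag containing the ulimit fix
    pullPolicy: IfNotPresent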

I still have the exact same issue in 3.7.6-r8.

@rips-hb I also have some high CPU usage, but it's periodic, not constant. I found out it's the probes (rabbitmqctl status) that use that much CPU periodically; it's a different issue (which should be filed separately here), and I'm not sure what to do to fix it.

@thomas-riccardi for me it is constant, unfortunately, and I could resolve it by disabling liveness and readiness as suggested in this thread. Since it is only a test system that is no problem, but I would rather keep these checks on a production system. I will investigate a bit more, and if I find something else I will create a new ticket.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

still an issue

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

The fix https://github.com/bitnami/bitnami-docker-rabbitmq/pull/63 is incomplete and does not set the ulimit for the liveness / readiness probes. So this is still an issue.

@macropin you are right.
However, I did not find a difference in execution time for rabbitmqctl status (or rabbitmqctl node_health_check) with and without ulimit -n 1024 (or ulimit -n 65536), so adding ulimit -n to the health checks will probably not help.

@macropin see also rabbitmq-ha advancements: #7378 (and #7752).

@macropin @thomas-riccardi As I see it, the fix in bitnami/bitnami-docker-rabbitmq#63 is not only incomplete, it is completely inapplicable because it modifies the Docker entrypoint that we _do not_ use.

Also, I concur with @thomas-riccardi's findings: I did some testing too, and it turns out that even setting a ridiculously low ulimit -n 128 helps neither to reduce the probes' CPU time nor to reduce the Pod's overall CPU usage.

Could this be fixed at helm install time? I have the same problem here.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@TomaszUrugOlszewski It should be fixed by #8140 (and subsequent fixes), modulo the issue about disk or memory alarms (see #8635).

3.5.7, same issue: high CPU load.

RabbitMQ 3.7.8, Erlang 21.1, same issue: high CPU usage. QPS is about 200/s.

But was this fixed?

I'm seeing the exact same behavior on 3.7.8 (Erlang 21.1), with very high CPU usage when idle; as a workaround, disabling both the readiness and liveness checks seems to fix the issue.

Hi,

Thanks for the feedback. If this issue is constant, then maybe it makes sense to change the readiness/liveness probes to simple TCP port checks. Thoughts on that?

@javsalgar I tried again with the latest version of the chart (rabbitmq-4.0.1) with RabbitMQ 3.7.9 (Erlang 21.1) and this problem does not seem to happen anymore.

Hi everyone,

Do you still suffer from high CPU usage because of the readiness/liveness probes in the latest versions of the chart?

I agree with @javsalgar: we can make the probes simpler (such as TCP port checks) or decrease their frequency if you're running into issues because of that.

In my case, I greatly reduced the frequency of the probes.

What values did you use @desaintmartin ?

  livenessProbe:
    timeoutSeconds: 30
    periodSeconds: 30
  readinessProbe:
    timeoutSeconds: 30
    periodSeconds: 30

So the "fix" isn't a fix, rather a workaround. This should be reopened. Were the ulimits ever updated for the probes?

Hi @macropin

Currently the liveness/readiness probes use curl instead of rabbitmqctl. Why do you consider it necessary to update the ulimits on the probes?

Oh, that's great. I missed that change. The high CPU usage of rabbitmqctl was due to it not inheriting the entrypoint ulimit, which caused a CPU usage issue with the Erlang VM.

Yes, it was one of the reasons why the CPU usage was so high. That's why the probes were moved to curl in
https://github.com/helm/charts/pull/8140
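
For context, a curl-based probe of the kind referenced here is roughly an exec probe that hits the management API instead of calling rabbitmqctl; this is only a sketch, and the actual command, endpoint, and credential env var names in #8140 may differ:

  livenessProbe:
    exec:
      command:
        - sh
        - -c
        # Sketch only: endpoint and credential variable names are assumptions.
        - curl --fail --user "$RABBITMQ_USERNAME:$RABBITMQ_PASSWORD" http://127.0.0.1:15672/api/healthchecks/node
    initialDelaySeconds: 120
    periodSeconds: 30
    timeoutSeconds: 20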

Hi, I've found another reason why RabbitMQ can show noticeable CPU usage when idle or under a light load. RabbitMQ runs on Erlang and uses its scheduling capabilities: to schedule processes, Erlang uses scheduler threads, and by default their number depends on the number of logical cores. This is a problem when running in Docker/Kubernetes, because RabbitMQ will think it has more resources than it actually has. In our case, a RabbitMQ node runs on a server with 40 cores, but we limit it to 1 core in Kubernetes. Erlang will run 40 scheduler threads that are constantly context switching, which generates the CPU usage. When I set the number of scheduler threads to 1, the CPU usage dropped from 23% to 3%.
You can check how many scheduler threads you are using with rabbitmqctl status:

$ rabbitmqctl status
...
{erlang_version,
     "Erlang/OTP 20 [erts-9.3.3.3] [source] [64-bit] [smp:40:40] [ds:40:40:10] [async-threads:640] [hipe] [kernel-poll:true]\n"},
...

The numbers after smp are the number of scheduler threads. For more info see the Erlang scheduler documentation. You can set the value using an environment variable:

RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+S 1:1"

@Artimi how did you set "RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS" with the rabbitmq Helm chart?

@infa-ddeore We are actually not using Helm, just a custom-made Kubernetes deployment. I just wrote it here because I had similar problems to those in this issue.

@infa-ddeore the feature does indeed seem to be missing.
It could easily be added, like I did in https://github.com/helm/charts/pull/12908 for the metrics container.

We could also use the Downward API to get the CPU requests or limits and automatically generate the correct value for RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS in the command. Or just use Helm templating.
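
A sketch of that idea on the container spec, assuming a hand-written StatefulSet rather than current chart behaviour: the Downward API exposes the CPU limit as an env var, and the scheduler count is pinned explicitly (here hard-coded to 1, matching the +S 1:1 example above):

  containers:
    - name: rabbitmq
      image: docker.io/bitnami/rabbitmq:3.7.9
      resources:
        limits:
          cpu: "1"
      env:
        - name: RABBITMQ_CPU_LIMIT              # CPU limit rounded up to whole cores (informational)
          valueFrom:
            resourceFieldRef:
              containerName: rabbitmq
              resource: limits.cpu
              divisor: "1"
        - name: RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS
          value: "+S 1:1"                       # pin Erlang schedulers; deriving this from the limit would need templating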

@thomas-riccardi thanks for the pointers; for now I will update the stable/rabbitmq chart locally with this variable and deploy that.

@Artimi setting "RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS" to "+S 1:1" doesn't seem to help me. I will try disabling the liveness and readiness checks; my RabbitMQ is 3.7.8 and Erlang 21.

Thanks for reporting it @Artimi

I just created a PR so the user has a couple of parameters to limit the number of scheduler threads.
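
Roughly, the new parameters are expected to look something like this in values.yaml (the names below are approximate; see the PR for the final form), mapping onto the +S flag described above:

  rabbitmq:
    maxAvailableSchedulers: 2   # first number of "+S 2:2"
    onlineSchedulers: 2         # second number of "+S 2:2"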
