The exact command to reproduce the issue:
{
  "apiVersion": "apps/v1",
  "kind": "Deployment",
  "metadata": {
    "labels": {
      "name": "rabbit"
    },
    "name": "rabbit"
  },
  "spec": {
    "replicas": 1,
    "selector": {
      "matchLabels": {
        "name": "rabbit"
      }
    },
    "template": {
      "metadata": {
        "labels": {
          "name": "rabbit"
        }
      },
      "spec": {
        "containers": [
          {
            "image": "rabbitmq:3.8-management",
            "imagePullPolicy": "IfNotPresent",
            "name": "rabbit",
            "ports": [
              {
                "containerPort": 5672
              },
              {
                "containerPort": 15672
              }
            ]
          }
        ]
      }
    }
  }
}
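For reference, a minimal way to exercise this manifest and see the symptom described below is something like the following; the file name `rabbit.json` and the port checks are illustrative, not taken from the original report:

```sh
# Apply the manifest above (saved here as rabbit.json - the name is an assumption)
kubectl apply -f rabbit.json

# The pod comes up, but the container produces no log output
kubectl get pods -l name=rabbit
kubectl logs deployment/rabbit

# ...and nothing answers on the AMQP (5672) or management (15672) ports
kubectl port-forward deployment/rabbit 5672:5672 15672:15672 &
nc -vz localhost 5672
curl -sf http://localhost:15672/ || echo "management UI not responding"
```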
The full output of the command that failed:
There were no log records from the rabbitmq container, and nothing was listening on ports 5672 or 15672, but this deployment worked fine on version 1.4.
The output of the minikube logs command:
[minikube.log](https://github.com/kubernetes/minikube/files/3783720/minikube.log)
The operating system version:
Linux hobo 5.0.0-32-generic #34~18.04.2-Ubuntu SMP Thu Oct 10 10:36:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
I'm also having a problem with containers not starting on 1.5 where they worked fine on 1.4. Unfortunately my container is private so I can't share it with you, but here are some details I can provide.
Same here. I have RabbitMQ, Cassandra, ElasticSearch and MySQL deployments that worked just fine in 1.3, but won't start in 1.5.
There are also no logs.
Are these errors also related to the number of files allowed, i.e. the same as the MySQL craziness?
I am still testing, but there seems to be some specific misbehavior on 1.5-launched VMs for ALL Erlang processes (of which RabbitMQ is one example).
Running e.g. `kubectl run --rm -it --image elixir:1.9.0-alpine iex` hangs after the process starts but before giving an iex prompt* for ~9 minutes on my machine. Similar slowdowns can be seen with simple hello-world Elixir programs and with mix tasks in an on-cluster docker build.
The problem persists across kubernetes versions of at least 1.16.2 and 1.14.7 (so I imagine more).
For me the hang also includes a high CPU load.
There are no such problems when running python workloads on the same cluster.
*it is notable that the BEAM has started insofar as sending an interrupt gives the conventional erlang BREAK prompt
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
(v)ersion (k)ill (D)b-tables (d)istribution
but the process as a whole seems unable to start in a timely fashion
Can confirm that downgrading to minikube 1.4 (iso and binary) seems to restore reasonable Erlang performance
We believe that the major difference with the new ISO is the new version of systemd (240).
It raises the kernel's maximum number of open files, /proc/sys/fs/nr_open, by a factor of roughly 1024 (1K):
from 1048576 to 1073741816. Software that does not account for that can hit _major_ slowdowns.
There is also a new kernel, but most things seem to be OK with the new version of linux...
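A quick way to see what changed is to compare the kernel ceiling with the limit a container actually inherits inside the VM. This is a sketch; the numbers in the comments are the ones quoted above, and it assumes (as the rest of this thread suggests) that the BEAM sizes its I/O tables from `ulimit -n` at startup:

```sh
# From the host:
minikube ssh

# Inside the VM: kernel-wide ceiling raised by the new systemd
cat /proc/sys/fs/nr_open        # 1073741816 on the v1.5.0 ISO vs 1048576 on v1.4.0

# Limit a container (and therefore the Erlang VM) actually sees
docker run --rm elixir:1.9.0-alpine sh -c 'ulimit -n'
```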
Yeah, to confirm the relatively obvious (just 'cause I wasted the time doing it): using the v1.5.0 binary with --iso-url set back to the 1.4.0 image works fine
(I'm on macOS 10.15 (19A602) by the way; but since this is pretty clearly an inside-the-hypervisor issue, ¯\_(ツ)_/¯ )
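For anyone who wants to keep the v1.5.0 binary while this gets fixed, the workaround above looks roughly like this; the ISO URL follows minikube's usual storage.googleapis.com naming and is worth double-checking:

```sh
# v1.5.0 binary, but boot the v1.4.0 guest image
minikube start \
  --vm-driver=virtualbox \
  --iso-url=https://storage.googleapis.com/minikube/iso/minikube-v1.4.0.iso
```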
Does the same kind of workaround as in #5751 (MySQL) change anything?
docker run -it elixir:1.9.0-alpine sh -c "ulimit -n 4096 && exec iex"
yes
I was starting down the road of trying to strace things, but there is no strace in the iso buildroot (and no package manager); but indeed lowering the limit makes it start very quickly.
Should be fixed by the same change then, thanks for verifying.
You have toolbox, if you need to run any commands in the VM
oh wait! I will reconfirm, but I think I actually ran that on the 1.4 iso VM (cause I was trying to get a good strace to diff against when I saw your question) D:
okay, I'm not sure as yet. This time, to avoid taking on bad configuration from existing scripts like last time, I started with all the 1.5 defaults (well, except still virtualbox instead of hyperkit on this mac, but including 2G of RAM instead of the 8G I was running with earlier)
and now iex is just getting promptly OOM killed on start, with and without the ulimit set :(
okay, yes. With 8GB of RAM again, I can confirm that the ulimit is in fact fixing things like we thought
❯ sh bin/launch_kubernetes_local.sh
Deleting "minikube" in virtualbox ...
The "minikube" cluster has been deleted.
minikube v1.5.0 on Darwin 10.15
Creating virtualbox VM (CPUs=4, Memory=8000MB, Disk=20000MB) ...
Preparing Kubernetes v1.16.2 on Docker 18.09.9 ...
Pulling images ...
Launching Kubernetes ...
Waiting for: apiserver proxy etcd scheduler controller dns
Done! kubectl is now configured to use "airflow"
❯ minikube ssh
(minikube ssh login banner)
$ docker run -it --entrypoint /bin/sh elixir:1.9.0-alpine
Unable to find image 'elixir:1.9.0-alpine' locally
1.9.0-alpine: Pulling from library/elixir
e7c96db7181b: Pull complete
c306444e2774: Pull complete
4b80324f258b: Pull complete
Digest: sha256:bd6e03f00f57a121f6f31374d693e9027ce6053e25700be4d7c64f83dd249a56
Status: Downloaded newer image for elixir:1.9.0-alpine
/ # iex
^C
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
(v)ersion (k)ill (D)b-tables (d)istribution
^C/ # ulimit -n 4096
/ # iex
Erlang/OTP 22 [erts-10.4.4] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
Interactive Elixir (1.9.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)>
bleh, I also forgot the -n on the ulimit when I was testing on the 2GB cluster;
I can confirm on a new 2GB cluster that ulimit _with_ -n (so setting the open-files limit specifically, not just any limit) also stops the OOM kill
/ # iex
Killed
/ # ulimit -n 4096
/ # iex
Erlang/OTP 22 [erts-10.4.4] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [hipe]
the fact that memory usage is tied to the available handles this way may explain why OP is seeing RabbitMQ _never_ go ready (despite no livenessProbe), as opposed to my experience of just seeing a long hang on start of various (smaller) erlang/elixir processes
(and ulimit -n 1048576 is also sufficient to avoid an OOM)
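For the RabbitMQ image from the original manifest, the same style of wrapper should work as a stop-gap; this is only a sketch, and `docker-entrypoint.sh rabbitmq-server` is assumed to be the image's normal startup command rather than something verified in this thread:

```sh
# Lower the fd limit back to the old ceiling before the BEAM starts,
# then exec the image's usual entrypoint
docker run -it rabbitmq:3.8-management \
  sh -c "ulimit -n 1048576 && exec docker-entrypoint.sh rabbitmq-server"
```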
Can someone confirm whether or not this is fixed by the v1.5.1 release?
I've been searching for the cause of this issue for two whole days, unable to understand what was wrong :)
My CouchDB (Erlang-based) pods did start, but CouchDB wouldn't respond. The Erlang parent process used 100% CPU for 5-10 minutes before CouchDB could actually start.
I'll test with 1.5.1 to see if it works again.
I can happily confirm that with minikube v1.5.1 the following command works as intended:
helm install --set rabbitmq.username=guest --set rabbitmq.password=guest --set rbac.enable=false --name rabbit stable/rabbitmq
(local development environment)
Please note that on minikube 1.5.0 I had to disable the readiness probe to make the service start successfully:
helm install --set rabbitmq.username=guest --set rabbitmq.password=guest --set rbac.enable=false --set readinessProbe.enabled=false --name rabbit stable/rabbitmq
1.5.1 fixes it for me too (for all services that were broken with 1.5.0)
1.5.1 does indeed seem to fix issues with Erlang! CouchDB now starts correctly in < 5 seconds
Confirming 1.5.1 fixes the issue for me as well. My containers are erlang based.
me too
Good to hear!