I have a set of Java Vert.x microservices. One microservice connects to Kafka to read messages and puts them on the Vert.x event bus; another reads from the event bus and processes them further.
When I load tested this outside of Nomad, i.e. just by starting the Java jars, I did not face any issues. But when I run the microservices via Nomad, some of them work for a couple of minutes, then drop out of the ZooKeeper cluster and stop working. In addition, the Vert.x event bus is unable to distribute load evenly across the cluster.
I would also like to know whether there is a way to start all these processes as the normal root user and not in a chroot. I used the user option in the task but it doesn't work.
Nomad version: Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)
Operating system: RHEL 6.8
Vert.x microservices behave differently under Nomad.
Start zookeeper server
Start all micro services with nomad
Some of the microservices get dropped out of the ZooKeeper cluster; this issue doesn't happen without Nomad.
There are no logs related to the Java processes on the Nomad server or client; Nomad only logs that it connected to the Consul server.
Feb 07, 2018 1:03:40 PM io.vertx.core.impl.HAManager
WARNING: Timed out waiting for group information to appear
Feb 07, 2018 1:03:40 PM io.vertx.core.impl.HAManager
WARNING: Timed out waiting for group information to appear
Feb 07, 2018 1:03:40 PM io.vertx.core.impl.HAManager
WARNING: Timed out waiting for group information to appear
Feb 07, 2018 1:04:55 PM io.vertx.spi.cluster.zookeeper.impl.ZKAsyncMultiMap
WARNING: connection to the zookeeper server have suspended.
Feb 07, 2018 1:05:08 PM io.vertx.core.impl.HAManager
WARNING: Timed out waiting for group information to appear
SEVERE: Failed to handle memberRemoved
io.vertx.core.VertxException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /io.vertx/syncMap/__vertx.haInfo/d213ca95-30de-4a15-a064-6d4abe744cfe
at io.vertx.spi.cluster.zookeeper.impl.ZKSyncMap.get(ZKSyncMap.java:95)
at io.vertx.spi.cluster.zookeeper.impl.ZKSyncMap.lambda$entrySet$4(ZKSyncMap.java:182)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1553)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at io.vertx.spi.cluster.zookeeper.impl.ZKSyncMap.entrySet(ZKSyncMap.java:184)
at io.vertx.core.impl.HAManager.nodeLeft(HAManager.java:321)
at io.vertx.core.impl.HAManager.access$100(HAManager.java:107)
at io.vertx.core.impl.HAManager$1.nodeLeft(HAManager.java:157)
job "VertxLoadTest" {
datacenters = ["dc1"]
type = "service"
update {
stagger = "10s"
max_parallel = 1
}
group "SimpleConsumer" {
count = 1
restart {
attempts = 1
interval = "5m"
delay = "25s"
mode = "delay"
}
task "SimpleKafka" {
driver = "java"
config {
jar_path = "tmp/vertxJars/SimpleKafka.jar"
jvm_options = ["-Xmx512m"]
args = ["syslog"]
}
artifact {
source = "http://somerepo/vertxJars/SimpleKafka.jar"
destination = "tmp/vertxJars/"
}
}
}
group "ParamsLoader" {
count = 1
restart {
attempts = 1
interval = "5m"
delay = "25s"
mode = "delay"
}
task "Params" {
driver = "java"
config {
jar_path = "tmp/vertxJars/Params.jar"
jvm_options = ["-Xmx512m"]
args = ["true"]
}
artifact {
source = "http://somerepo/vertxJars/Params.jar"
destination = "tmp/vertxJars/"
}
}
}
}
> But when I run the microservices via Nomad, some of them work for a couple of minutes, then drop out of the ZooKeeper cluster and stop working. In addition, the Vert.x event bus is unable to distribute load evenly across the cluster.
You need to assign an amount of resources to your tasks: https://www.nomadproject.io/docs/job-specification/resources.html
They're getting a default set of resources which is too low.
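For example (the numbers here are only illustrative, not tuned recommendations), each task can declare a resources stanza next to its config:

      resources {
        # CPU is expressed in MHz and memory in MB. Illustrative values
        # only -- size them to what each JVM actually needs, at least
        # the -Xmx heap plus JVM overhead.
        cpu    = 1000
        memory = 1024
      }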
> I would also like to know whether there is a way to start all these processes as the normal root user and not in a chroot. I used the user option in the task but it doesn't work.
In order to run processes as root you will need to set an empty user.blacklist in your Nomad client's configuration.
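Roughly like this, assuming a typical client configuration file (the surrounding client block here is illustrative):

    client {
      enabled = true

      options {
        # An empty value overrides the default blacklist (which
        # contains root), allowing tasks to set user = "root".
        "user.blacklist" = ""
      }
    }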
Hope that helps!
@schmichael Thanks for the answers. That helped.
Now I am running into resource issues. When I run outside Nomad there are plenty of resources for the processes, but when I run them inside Nomad placement halts with not enough memory and not enough CPU.
ID = 68d12feb
Name = servername
Class =
DC = dc1
Drain = false
Status = ready
Drivers =
Uptime = 1h33m45s
Allocated Resources
CPU Memory Disk IOPS
0/19976 MHz 0 B/31 GiB 0 B/41 GiB 0/0
Allocation Resource Utilization
CPU Memory
0/19976 MHz 0 B/31 GiB
Host Resource Utilization
CPU Memory Disk
0/19976 MHz 445 MiB/31 GiB 3.4 GiB/49 GiB
Allocations
No allocations placed
Job file:
job "VertxLoadTest" {
datacenters = ["dc1"]
type = "service"
update {
stagger = "10s"
max_parallel = 1
}
group "SimpleConsumer" {
count = 1
restart {
attempts = 1
interval = "5m"
delay = "25s"
mode = "delay"
}
task "SimpleKafka" {
driver = "java"
user = "root"
config {
jar_path = "tmp/vertxJars/SimpleKafka.jar"
jvm_options = ["-Xmx512m","-Xms256m"]
#args = ["syslogd"]
}
artifact {
source = "http://somerepo/vertxJars/SimpleKafka.jar"
destination = "tmp/vertxJars/"
}
}
}
group "ParamsLoader" {
count = 10
restart {
attempts = 1
interval = "5m"
delay = "25s"
}
task "Params" {
driver = "java"
user = "root"
config {
jar_path = "tmp/vertxJars/Params1.jar"
jvm_options = ["-Xmx512m"]
args = ["true"]
}
artifact {
source = "http://somerepo/vertxJars/Params1.jar"
destination = "tmp/vertxJars/"
}
resources {
cpu = 10000
memory = 16384
network {
}
}
}
}
}
Our system has 4 cores and 32 GB of memory, but when I run the job I get the error below.
Evaluation triggered by job "VertxLoadTest"
Evaluation within deployment: "acecd9c1"
Allocation "4d24753d" created: node "68d12feb", group "ParamsLoader"
Allocation "8dfceee5" created: node "68d12feb", group "SimpleConsumer"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "eb11ac6b" finished with status "complete" but failed to place all allocations:
Task Group "ParamsLoader" (failed to place 9 allocations):
* Resources exhausted on 1 nodes
Dimension "cpu" exhausted on 1 nodes
Evaluation "a768363e" waiting for additional capacity to place remainder
Task Group "ParamsLoader" (failed to place 7 allocations):
* Resources exhausted on 1 nodes
Dimension "memory" exhausted on 1 nodes
Why am I not able to leverage the node's complete resources via Nomad?
@schmichael I understood this. I was giving a lot of resources per microservice, so it was running out of resources, because the count for the Params task is 10. I reduced the resources for each task and then the calculation worked out perfectly. Thanks for the help 💯
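To spell out the arithmetic: with count = 10 the ParamsLoader group was requesting 10 x 10000 = 100,000 MHz of CPU and 10 x 16384 MB = 160 GB of memory against a node that offers 19,976 MHz and 31 GiB, so only one allocation could be placed. Something like the following fits (hypothetical sizing, not necessarily what I deployed):

      resources {
        # 10 tasks x 1500 MHz = 15,000 MHz and 10 x 2048 MB = 20 GB,
        # which fits within this node's 19,976 MHz and 31 GiB alongside
        # the SimpleConsumer group.
        cpu    = 1500
        memory = 2048
      }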
Glad you got it working @cernerpradeep!