Describe the bug
Adding a node under workspace > nodes deploys a worker and a broker on that node. However, the broker deployment fails. The problem can be observed with the K8S get pods command (kubectl get pods).
/cc @jackyoh @chia7712 @konekoya @wu87988622
The main cause of this error is that the two nodes created later end up with the same Broker ID. I will run a test program to confirm the problem.
Will be glad to help if there's anything that we can do to fix this in the frontend :)
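The ohara-agent logic for creating a broker looks up the environment variables of the previously created containers. For BROKER_ID, it takes the maximum BROKER_ID among all previously created containers, adds 1, and assigns the result to the new container's BROKER_ID. Kafka requires every BROKER_ID to be unique; if a BROKER_ID is duplicated, the broker throws an error.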
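To make the allocation rule above concrete, here is a minimal Python sketch. It is not the ohara-agent code: the function name next_broker_id and the dict-based environment are hypothetical, and starting from 0 when no broker exists is an assumption based on the first broker's BROKER_ID in this report.

```python
def next_broker_id(existing_envs):
    """Return max(BROKER_ID) + 1 over the existing containers.

    existing_envs is a list of dicts holding each container's environment
    variables. Returning 0 for an empty cluster is an assumption.
    """
    ids = [int(env["BROKER_ID"]) for env in existing_envs if "BROKER_ID" in env]
    return max(ids) + 1 if ids else 0


# One broker with BROKER_ID 0 already exists, so the next ID is 1.
existing = [{"BROKER_ID": "0"}]
print(next_broker_id(existing))
```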
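This bug occurs when a workspace is first created with one broker container, and then two nodes are selected at once when adding nodes, so two broker containers are created together. My current guess is that both creations read the first container's BROKER_ID (0) at the same time. As a result, both new containers get BROKER_ID 1, one of the broker containers fails with an error, and a topic with 3 replicas cannot be created.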
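The suspected race can be reproduced deterministically in a few lines. This is only a sketch of the guess above, not the actual configurator code: create_concurrently is a hypothetical helper that models both requests reading the same snapshot of existing IDs before either new container is registered.

```python
def next_broker_id(existing_ids):
    """max + 1 rule; 0 for an empty cluster is an assumption."""
    return max(existing_ids) + 1 if existing_ids else 0


def create_concurrently(existing_ids, new_nodes):
    """Model two requests that both read the same stale snapshot."""
    snapshot = list(existing_ids)  # neither new container is visible yet
    return {node: next_broker_id(snapshot) for node in new_nodes}


# The workspace already has one broker with BROKER_ID 0.
assigned = create_concurrently([0], ["ohara-demo-02", "ohara-demo-03"])
print(assigned)  # both new nodes receive BROKER_ID 1
```

Both entries end up with the same ID, which matches the duplicate BROKER_ID seen on the two new containers.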
I tried creating the broker containers one at a time, and the BROKER_ID was not duplicated.
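One way to get the same effect as creating containers one at a time is to hand out all IDs for a request in a single pass, recording each ID before computing the next. The sketch below assumes that approach; allocate_broker_ids is a hypothetical name and this is not necessarily how the eventual fix works.

```python
def allocate_broker_ids(existing_ids, new_nodes):
    """Assign a unique BROKER_ID to every new node in one pass."""
    used = set(existing_ids)
    assigned = {}
    for node in new_nodes:
        candidate = max(used) + 1 if used else 0
        used.add(candidate)  # make the ID visible before the next node is handled
        assigned[node] = candidate
    return assigned


result = allocate_broker_ids([0], ["ohara-demo-02", "ohara-demo-03"])
print(result)  # {'ohara-demo-02': 1, 'ohara-demo-03': 2}
```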
Is there any log information?
The configurator log or the broker log?
broker
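Put another way, whatever information you observed should be posted here.
OK, I'll organize it and post it shortly.
Below is the container information from k8s. k8soccl-qca2as03x3-bk-809dac7 is one of the containers created in the second round.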
[ohara@ohara-demo-k8s ~]$ kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
k8soccl-7u9fk15ghh-zk-107a74f   1/1     Running   0          42s
k8soccl-qca2as03x3-bk-3f92f78   1/1     Running   0          37s
k8soccl-qca2as03x3-bk-809dac7   0/1     Error     1          22s
k8soccl-qca2as03x3-bk-f1b3b9c   1/1     Running   0          22s
k8soccl-wk-wk-5d60176           1/1     Running   0          22s
k8soccl-wk-wk-75cbcbe           1/1     Running   0          35s
k8soccl-wk-wk-a349636           1/1     Running   0          22s
The broker container log is as follows:
[2019-07-09 07:41:12,683] INFO Starting log cleanup with a period of 300000 ms. (kafka.log.LogManager)
[2019-07-09 07:41:12,687] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
[2019-07-09 07:41:13,133] INFO Awaiting socket connections on s0.0.0.0:18672. (kafka.network.Acceptor)
[2019-07-09 07:41:13,204] INFO [SocketServer brokerId=1] Created data-plane acceptor and processors for endpoint : EndPoint(null,18672,ListenerName(PLAINTEXT),PLAINTEXT) (kafka.network.SocketServer)
[2019-07-09 07:41:13,206] INFO [SocketServer brokerId=1] Started 1 acceptor threads for data-plane (kafka.network.SocketServer)
[2019-07-09 07:41:13,238] INFO [ExpirationReaper-1-Produce]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2019-07-09 07:41:13,239] INFO [ExpirationReaper-1-Fetch]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2019-07-09 07:41:13,240] INFO [ExpirationReaper-1-DeleteRecords]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2019-07-09 07:41:13,240] INFO [ExpirationReaper-1-ElectPreferredLeader]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2019-07-09 07:41:13,256] INFO [LogDirFailureHandler]: Starting (kafka.server.ReplicaManager$LogDirFailureHandler)
[2019-07-09 07:41:13,333] INFO Creating /brokers/ids/1 (is it secure? false) (kafka.zk.KafkaZkClient)
[2019-07-09 07:41:13,412] ERROR Error while creating ephemeral at /brokers/ids/1, node already exists and owner '72137912235786241' does not match current session '72137912235786248' (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2019-07-09 07:41:13,421] ERROR [KafkaServer id=1] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
at org.apache.zookeeper.KeeperException.create(KeeperException.java:122)
at kafka.zk.KafkaZkClient$CheckedEphemeral.getAfterNodeExists(KafkaZkClient.scala:1784)
at kafka.zk.KafkaZkClient$CheckedEphemeral.create(KafkaZkClient.scala:1722)
at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1689)
at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:97)
at kafka.server.KafkaServer.startup(KafkaServer.scala:260)
at kafka.server.KafkaServerStartable.startup(KafkaServerStartable.scala:38)
at kafka.Kafka$.main(Kafka.scala:75)
at kafka.Kafka.main(Kafka.scala)
[2019-07-09 07:41:13,423] INFO [KafkaServer id=1] shutting down (kafka.server.KafkaServer)
The environment variables of each broker container are as follows:
k8soccl-qca2as03x3-bk-3f92f78
Environment:
PROMETHEUS_EXPORTER_PORT: 47430
BROKER_ADVERTISED_HOSTNAME: ohara-demo-01
BROKER_ADVERTISED_CLIENT_PORT: 18672
JMX_HOSTNAME: ohara-demo-01
CCI_ZOOKEEPER_CLUSTER_NAME: 7u9fk15ghh
BROKER_CLIENT_PORT: 18672
BROKER_ZOOKEEPERS: ohara-demo-01:61283
BROKER_ID: 0
JMX_PORT: 50142
k8soccl-qca2as03x3-bk-809dac7
Environment:
PROMETHEUS_EXPORTER_PORT: 47430
BROKER_ADVERTISED_HOSTNAME: ohara-demo-02
BROKER_ADVERTISED_CLIENT_PORT: 18672
JMX_HOSTNAME: ohara-demo-02
CCI_ZOOKEEPER_CLUSTER_NAME: 7u9fk15ghh
BROKER_CLIENT_PORT: 18672
BROKER_ZOOKEEPERS: ohara-demo-01:61283
BROKER_ID: 1
JMX_PORT: 50142
k8soccl-qca2as03x3-bk-f1b3b9c
Environment:
PROMETHEUS_EXPORTER_PORT: 47430
BROKER_ADVERTISED_HOSTNAME: ohara-demo-03
BROKER_ADVERTISED_CLIENT_PORT: 18672
JMX_HOSTNAME: ohara-demo-03
CCI_ZOOKEEPER_CLUSTER_NAME: 7u9fk15ghh
BROKER_CLIENT_PORT: 18672
BROKER_ZOOKEEPERS: ohara-demo-01:61283
BROKER_ID: 1
JMX_PORT: 50142
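A quick check over the environment dumps above makes the collision explicit. This is a minimal sketch: find_duplicate_broker_ids is a hypothetical helper, and the input dict is transcribed by hand from the three containers listed above, keeping only BROKER_ID.

```python
from collections import defaultdict


def find_duplicate_broker_ids(pod_envs):
    """Return {broker_id: [pod names]} for every BROKER_ID used more than once."""
    by_id = defaultdict(list)
    for pod, env in pod_envs.items():
        by_id[env["BROKER_ID"]].append(pod)
    return {bid: sorted(pods) for bid, pods in by_id.items() if len(pods) > 1}


pods = {
    "k8soccl-qca2as03x3-bk-3f92f78": {"BROKER_ID": "0"},
    "k8soccl-qca2as03x3-bk-809dac7": {"BROKER_ID": "1"},
    "k8soccl-qca2as03x3-bk-f1b3b9c": {"BROKER_ID": "1"},
}
duplicates = find_duplicate_broker_ids(pods)
print(duplicates)
```

The two containers created in the second round share BROKER_ID 1, which is consistent with the NodeExists error on /brokers/ids/1 in the broker log.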
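@konekoya @wu87988622 Could you share which requests are sent while creating a workspace?
@jackyoh I tried creating a workspace on two nodes at once in ssh mode, and there was no problem.
Since the logic for obtaining the BROKER_ID always goes through the BrokerCollie class, I wonder whether some asynchronous mechanism in K8S means the newly created BROKER_ID is not visible yet, so the earliest BROKER_ID is read instead and the BROKER_ID ends up duplicated.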
No problem. Just a min...
ping @chia7712 Did you test ssh mode by calling the RESTful API directly, or by creating the workspace through the web UI?
@chia7712
When creating a workspace, the following requests are sent to the configurator
API: POST v0/zookeepers
request:
{
  "name":"ox68a6oloj",
  "clientPort":39359,
  "peerPort":63372,
  "electionPort":48549,
  "nodeNames":[
    "ohara-dev-103"
  ]
}
After the above request is sent, a follow-up request is sent (may send multiple times if the container is not ready):
API: GET v0/containers/ox68a6oloj
API: POST v0/brokers
request:
{
  "name":"atzvfwacv8",
  "zookeeperClusterName":"ox68a6oloj",
  "nodeNames":[
    "ohara-dev-103"
  ],
  "clientPort":27249,
  "exporterPort":30644,
  "jmxPort":44729
}
After the above request is sent, a follow-up request is sent (may send multiple times if the container is not ready):
API: GET v0/containers/atzvfwacv8
API: POST v0/workers
request:
{
  "name":"sdfsdfdsfdsfsdf",
  "jmxPort":56338,
  "brokerClusterName":"atzvfwacv8",
  "clientPort":42499,
  "nodeNames":[
    "ohara-dev-103"
  ],
  "jars":[
  ],
  "groupId":"16g7nouhc5",
  "configTopicName":"cykpbybqff",
  "offsetTopicName":"czmw3806xv",
  "statusTopicName":"rix4rm2rj6"
}
After the above request is sent, a follow-up request is sent (may send multiple times if the container is not ready):
API: GET v0/containers/sdfsdfdsfdsfsdf
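The "may send multiple times if the container is not ready" behaviour in the sequence above is a polling loop. The sketch below only illustrates that pattern and is not the frontend code: poll_until_ready is a hypothetical helper, the HTTP call is replaced by an injected fetch function, and the "state" field with its PENDING/RUNNING values is an assumed response shape, not the documented one.

```python
import time


def poll_until_ready(fetch, is_ready, attempts=10, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch() until is_ready(response) is true.

    Returns (response, attempts_used); raises TimeoutError if the container
    never becomes ready within the allowed number of attempts.
    """
    for attempt in range(1, attempts + 1):
        response = fetch()
        if is_ready(response):
            return response, attempt
        if attempt < attempts:
            sleep(delay_seconds)
    raise TimeoutError(f"container not ready after {attempts} attempts")


# Stand-in for GET v0/containers/<name>; the "state" field is hypothetical.
fake_responses = iter([{"state": "PENDING"}, {"state": "PENDING"}, {"state": "RUNNING"}])
response, attempts_used = poll_until_ready(
    fetch=lambda: next(fake_responses),
    is_ready=lambda r: r["state"] == "RUNNING",
    sleep=lambda _seconds: None,  # skip real waiting in this demo
)
print(response, attempts_used)
```

Waiting for each cluster to be ready before the next POST keeps the zookeeper, broker, and worker steps ordered, but it does not by itself prevent two brokers in the same request from being assigned the same ID.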
Tested in the demo environment; the problem is still there.
On it right away!!
Thank you
It works in the demo environment now. Thank you @wu87988622
Thanks for this PR. @wu87988622 @jackyoh @konekoya @chia7712