Ohara: Cannot remove exited containers of a cluster in docker environment

Created on 31 May 2019  ·  9 Comments  ·  Source: oharastream/ohara

In a docker environment, we can remove a container "gracefully" by running the following commands in sequence:

docker stop XXXX
docker rm XXXX

By default, docker stop sends SIGTERM and, if the container has not stopped within 10 seconds, follows up with SIGKILL. We need to choose the docker stop timeout and the ssh connection timeout together, so that the ssh connection is not closed while the docker stop command is still running.
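A minimal sketch of how the two timeouts could be kept consistent, written in Python purely for illustration. `docker stop -t` is the existing flag for the grace period; the helper name `build_stop_command`, the 5-second margin, and the default values are assumptions and not part of ohara.

```python
def build_stop_command(container: str, stop_timeout: int = 10,
                       ssh_timeout: int = 20, margin: int = 5) -> str:
    """Build a `docker stop` command whose grace period fits in the ssh timeout.

    Hypothetical helper: names, defaults and the margin are illustrative only.
    """
    if stop_timeout < 0:
        raise ValueError("stop_timeout must be non-negative")
    # docker may use the whole grace period before it sends SIGKILL, so the
    # ssh session has to outlive the grace period by some margin.
    if ssh_timeout < stop_timeout + margin:
        raise ValueError(
            f"ssh timeout {ssh_timeout}s is too short for a "
            f"{stop_timeout}s docker stop timeout"
        )
    # `-t` sets the number of seconds docker waits before killing the container.
    return f"docker stop -t {stop_timeout} {container}"
```

Rejecting an inconsistent pair up front is one option; another would be to derive the ssh timeout from the stop timeout instead of validating both.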

Another problem is that ClusterCache only gets active containers, so the docker remove command should also handle exited containers.

The following tasks will be accomplished:

  • change the default so that ClusterCache gets only running containers in the docker and k8s envs
  • make sure all collies get only running containers in the ssh env
  • make sure streamApp assumes a cluster is dead only if all of its containers are dead
  • make sure exited containers can be removed when a cluster is removed
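
The "dead only if all containers are dead" rule from the list above could look roughly like the following Python sketch. The state names `exited` and `dead` are docker container states; the function name `cluster_is_dead` and the choice to treat an empty cluster as dead are assumptions, not ohara behavior.

```python
from typing import Iterable

# Docker container states that count as "not alive" in this sketch.
DEAD_STATES = {"exited", "dead"}


def cluster_is_dead(container_states: Iterable[str]) -> bool:
    """Return True only when every container of the cluster is dead.

    Hypothetical helper. A cluster without any containers is treated as
    dead here, which is an assumption of this sketch.
    """
    states = [state.lower() for state in container_states]
    # One running container is enough to keep the cluster alive.
    return all(state in DEAD_STATES for state in states)
```

For example, `cluster_is_dead(["running", "exited"])` is false, because one container is still running.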
bug v0.6.0

All 9 comments

/cc @chia7712
I found this problem while testing my streamApp cluster. If there are no other concerns, I will fix this bug today.

Nice finding. +1

BTW, does it cause an error if we try to stop an exited container?

It seems not. An exited container has already stopped working, so calling the stop command does nothing.

Pardon me, is there a bug in this issue?

In my IT tests, if I run a streamApp cluster and it fails later, the cluster.remove() command removes the cluster from the clusterCache successfully, but the actual containers are not removed.

The root cause is that the ssh collie cares only about active containers. Hence, the exited containers disappear from the cluster list.

True. ClusterCache gets only active containers. Since I use the cluster remove function, it is the right behavior for clusterCache to remove the cluster (whether running or exited), but docker should take the responsibility of removing all containers that belong to this cluster.
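
A rough Python sketch of that responsibility: given every container of the cluster, including exited ones (for example as listed by `docker ps -a`), stop only the running ones and remove all of them. The function name `removal_commands` and the `(name, state)` input shape are hypothetical and do not reflect ohara's actual API.

```python
from typing import List, Tuple


def removal_commands(containers: List[Tuple[str, str]],
                     stop_timeout: int = 10) -> List[str]:
    """Return the docker commands that remove every container of a cluster.

    `containers` holds hypothetical (name, state) pairs. Exited containers
    skip the stop step but are still removed.
    """
    commands = []
    for name, state in containers:
        if state == "running":
            # Only running containers need a graceful stop first.
            commands.append(f"docker stop -t {stop_timeout} {name}")
        # Every container, running or exited, is removed.
        commands.append(f"docker rm {name}")
    return commands
```

The key point is that the input list must come from a query that includes exited containers; a list built from active containers only would reproduce the bug.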

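This issue should consider two points:
First, as mentioned above, containers that have already failed are dropped from the cluster, so calling remove(cluster) does not delete the exited containers.

Second, whether to pass a timeout when stopping a container. By default, the stop command sends a kill after 10 seconds, but this may conflict with the ssh connection timeout, so the two sides need consistent settings (for example, ssh timeout = 20, stop timeout = 10).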

@saivirtue please revise the description if we are on the same page.
