What happened:
Kicked off multiple builds in our CI environment; some tests use KIND to spin up clusters. We saw a bunch of failures:
✗ Preparing nodes 📦
ERROR: failed to create cluster: docker run error: command "docker run --hostname ci-a510791-control-plane --name ci-a510791-control-plane --label io.x-k8s.kind.role=control-plane --privileged --security-opt seccomp=unconfined --security-opt apparmor=unconfined --tmpfs /tmp --tmpfs /run --volume /var --volume /lib/modules:/lib/modules:ro --detach --tty --label io.x-k8s.kind.cluster=ci-a510791 --net kind --restart=on-failure:1 --volume=/workspace/pr-113/e2e/etc/rootca1.crt:/usr/local/share/ca-certificates/rootca1.crt:ro --volume=/workspace/pr-113/e2e/etc/rootca2.crt:/usr/local/share/ca-certificates/rootca2.crt:ro --publish=127.0.0.1:40759:6443/TCP kindest/node:v1.17.5" failed with error: exit status 125
Command Output: 4614c0b36ac6a3e641b0a300d07b6b0bc7317132fab3d494d21a3e4777aa5d5a
docker: Error response from daemon: network kind is ambiguous (2 matches found on name).
What you expected to happen:
No interference between tests, as was the case with KIND 0.7.x.
How to reproduce it (as minimally and precisely as possible):
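A hypothetical minimal repro (names made up), assuming a Docker daemon on which no "kind" network exists yet: launch several cluster creations concurrently so they race to create that network.

# remove any pre-existing "kind" network, then create clusters concurrently
docker network rm kind || true
for i in 1 2 3; do kind create cluster --name "ci-$i" & done
wait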
Anything else we need to know?:
# docker network ls
NETWORK ID     NAME     DRIVER   SCOPE
98fdc5e5af26   bridge   bridge   local
6f8de42fadad   host     host     local
de87cb7dd35e   kind     bridge   local
158648a47e91   kind     bridge   local
75afe274f8f8   none     null     local
Environment:
kind version: kind v0.8.1 go1.13.9 linux/amd64
kubectl version:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.5", GitCommit:"e0fccafd69541e3750d460ba0f9743b90336f24f", GitTreeState:"clean", BuildDate:"2020-04-16T11:44:03Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
docker info:
$ docker info
Client:
Debug Mode: false
Server:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 122
Server Version: 19.03.6
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version:
runc version:
init version:
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.3.0-46-generic
Operating System: Ubuntu 19.10
OSType: linux
Architecture: x86_64
CPUs: 104
Total Memory: 754.6GiB
Name: 7959d5c46f-m9c7p
ID: UMPE:ZM2Z:POMD:VQDB:7JAM:7OV5:LNJE:XP5W:EX4Z:CA5N:GO35:IBFY
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
/etc/os-release:
NAME="Ubuntu"
VERSION="19.10 (Eoan Ermine)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 19.10"
VERSION_ID="19.10"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=eoan
UBUNTU_CODENAME=eoan
For now you're going to need to serialize creating the first cluster. I'm not sure if there's a non-racy way to do this in docker.
xref: https://github.com/moby/moby/issues/20648 docker-compose has this same issue :/
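If the jobs do share a host and filesystem, one hypothetical way to serialize is to wrap cluster creation in flock(1); the lock path and the $BUILD_ID variable here are made up, and as discussed below this does not help jobs that cannot coordinate at all:

# take an exclusive file lock around cluster creation (hypothetical lock path)
flock /tmp/kind-create.lock kind create cluster --name "ci-$BUILD_ID"

Once the kind network exists, later creations should reuse it rather than race to create it, so only creation needs the lock.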
I don't think docker gives us sufficient tools to avoid a race condition when coordinating a docker network between multiple processes, unless we do our own out-of-band multi-process locking.
... that's not something I'm super excited to add right now, and it's full of potential problems. Would it be acceptable instead if we developed sufficient tooling to allow kind to natively create multiple clusters in one command? I've been sketching out a design for that functionality anyhow.
Thanks, but that would not solve my problem. Our CI jobs are independent and don't coordinate with each other.
er, but they're using the same docker instance?
are they on the same filesystem even ...?
EDIT: I ask because if not (and it sounds like perhaps not), even something like agreeing on our own file lock path would not work.
Currently the workarounds are:
1. Pre-create the kind docker network yourself (and try to get the options right / reasonable).
2. Use the KIND_EXPERIMENTAL_DOCKER_NETWORK env var to make the network unique per cluster, knowing that you'll have to deal with cleanup or have potentially infinite networks, and that we may not choose to support this long term (see the sketch after the test bed results below).

I built a test bed with https://godoc.org/github.com/docker/docker/api/types#NetworkCreate CheckDuplicate, and it is reliably insufficient:
package main

import (
	"context"
	"fmt"
	"sync"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}

	networkName := "test"

	// Attempt to create the network, asking the daemon to reject duplicates.
	createNetwork := func() {
		r, e := cli.NetworkCreate(context.Background(), networkName, types.NetworkCreate{
			CheckDuplicate: true,
			Driver:         "bridge",
		})
		fmt.Println(r, e)
	}

	deleteNetwork := func() {
		fmt.Println(cli.NetworkRemove(context.Background(), networkName))
	}

	// Race two concurrent creations of the same network name.
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		createNetwork()
		wg.Done()
	}()
	go func() {
		createNetwork()
		wg.Done()
	}()
	wg.Wait()

	// Removal by name now fails: two networks match.
	deleteNetwork()
}
Results (always the same, except for the random IDs):
$ go run .
{8d6b80658e72d596f19c35bd90226171056dc9f93610aec3c2b55b20ad55ff4e } <nil>
{ad09baf925e2a213132c1b9072ec54bc70aaaa0e558a771cc3de2b509d72e948 } <nil>
Error response from daemon: network test is ambiguous (2 matches found based on name)
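For workaround 2 above, a hypothetical CI sketch (variable names made up), assuming kind creates the named network when it is missing; each job gets a network of its own, so nothing races, but the job must clean up after itself:

export KIND_EXPERIMENTAL_DOCKER_NETWORK="kind-${BUILD_ID}"
kind create cluster --name "ci-${BUILD_ID}"
# ... run tests ...
kind delete cluster --name "ci-${BUILD_ID}"
docker network rm "kind-${BUILD_ID}"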
I've got a pretty good idea how we can hack a working solution but it's going to be ... a hack.
Wrote up a detailed outline of the hack I'm considering
https://docs.google.com/document/d/1Q7Njyco2mAz66lS44pVV7ixT22RAkqBrmVMetG1zuT4
(Shared with [email protected], our standard SIG Testing group. I can't open documents to the entire internet by automated policy, but I can share with groups. This group is open to join; this is common practice for Kubernetes documents.)
This should be mitigated in v0.9.0 (just released; this was the last blocking issue). Please let us know if you still encounter issues.
FYI @howardjohn @JeremyOT it _should_ be safe to do concurrent multi-cluster bringup in CI in v0.9.0 without any workarounds.
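A quick concurrent smoke test along those lines (cluster names made up):

kind create cluster --name smoke-a &
kind create cluster --name smoke-b &
wait
kind get clusters   # should list both clusters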