Origin: `oc cluster up` hangs at "checking for available ports"

Created on 21 Jun 2016 · 49 comments · Source: openshift/origin

Following the instructions at https://www.openshift.com/devpreview/register.html I'm running

oc cluster up --version=v1.3.0-alpha.1

And this is what I get:

-- Checking Docker client ... OK
-- Checking for existing OpenShift container ... OK
-- Checking for openshift/origin:v1.3.0-alpha.1 image ... OK
-- Checking Docker daemon configuration ... OK
-- Checking for available ports ... ^C

until I break it off manually (I waited hours at one point; it made no difference).

Version

oc v1.3.0-alpha.1
kubernetes v1.3.0-alpha.1-331-g0522e63

Steps To Reproduce
  1. Run oc cluster up --version=v1.3.0-alpha.1

Current Result

-- Checking Docker client ... OK
-- Checking for existing OpenShift container ... OK
-- Checking for openshift/origin:v1.3.0-alpha.1 image ... OK
-- Checking Docker daemon configuration ... OK
-- Checking for available ports ... ^C
Expected Result

More than the current result ;)

Additional Information

Labels: area/usability, component/cli, kind/bug, priority/P3

All 49 comments

@quintesse can you please include a run with the --loglevel=5 argument?

Sure:

-- Checking Docker client ...
I0622 12:57:33.897344    9576 up.go:316] No Docker environment variables found. Will attempt default socket.
I0622 12:57:33.897375    9576 up.go:321] No Docker host (DOCKER_HOST) configured. Will attempt default socket.
I0622 12:57:33.897989    9576 up.go:338] Docker ping succeeded
-- Checking for existing OpenShift container ...
I0622 12:57:33.898008    9576 helper.go:163] Inspecting docker container "origin"
I0622 12:57:33.898299    9576 helper.go:167] Container "origin" was not found
-- Checking for openshift/origin:v1.3.0-alpha.1 image ...
I0622 12:57:33.898313    9576 helper.go:110] Inspecting Docker image "openshift/origin:v1.3.0-alpha.1"
I0622 12:57:33.899051    9576 helper.go:113] Image "openshift/origin:v1.3.0-alpha.1" found: &docker.Image{ID:"sha256:9e0fdce45db01ae8423565b3521b79a3889026874eed908ffff59a85e258f457", Parent:"", Comment:"", Created:time.Time{sec:63599823142, nsec:966249810, loc:(*time.Location)(0x3198cc0)}, Container:"62310c2081bdfe1c7f0af3471b3b863ff1a10308964c3a6f4aa45ae861191e6b", ContainerConfig:docker.Config{Hostname:"62310c2081bd", Domainname:"", User:"", Memory:0, MemorySwap:0, MemoryReservation:0, KernelMemory:0, CPUShares:0, CPUSet:"", AttachStdin:false, AttachStdout:false, AttachStderr:false, PortSpecs:[]string(nil), ExposedPorts:map[docker.Port]struct {}(nil), StopSignal:"", Tty:false, OpenStdin:false, StdinOnce:false, Env:[]string{"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"}, Cmd:[]string{"sleep 86400"}, DNS:[]string(nil), Image:"openshift/origin-base", Volumes:map[string]struct {}(nil), VolumeDriver:"", VolumesFrom:"", WorkingDir:"", MacAddress:"", Entrypoint:[]string{"/bin/sh", "-c"}, NetworkDisabled:false, SecurityOpts:[]string(nil), OnBuild:[]string(nil), Mounts:[]docker.Mount(nil), Labels:map[string]string{"build-date":"2016-05-16", "io.k8s.description":"OpenShift Origin is a platform for developing, building, and deploying containerized applications. See https://docs.openshift.org/latest for more on running OpenShift Origin.", "io.k8s.display-name":"OpenShift Origin Application Platform", "license":"GPLv2", "name":"CentOS Base Image", "vendor":"CentOS"}}, DockerVersion:"1.9.1", Author:"", Config:(*docker.Config)(0xc820110000), Architecture:"amd64", Size:415088397, VirtualSize:415088397, RepoDigests:[]string{}}
-- Checking Docker daemon configuration ...
I0622 12:57:33.899121    9576 helper.go:72] Retrieving Docker daemon info
I0622 12:57:34.005148    9576 helper.go:78] Docker daemon info: &docker.Env{"OSType=linux", "HttpsProxy=", "Name=rhlaptop", "Containers=15", "CpuCfsPeriod=true", "CPUSet=true", "PkgVersion=docker-1.10.3-19.gitee81b72.fc24.x86_64", "ExecutionDriver=native-0.2", "Labels=null", "ExperimentalBuild=false", "ClusterAdvertise=", "ContainersPaused=0", "CPUShares=true", "Debug=false", "NGoroutines=49", "ClusterStore=", "DriverStatus=[[\"Pool Name\",\"docker-8:3-685902-pool\"],[\"Pool Blocksize\",\"65.54 kB\"],[\"Base Device Size\",\"107.4 GB\"],[\"Backing Filesystem\",\"xfs\"],[\"Data file\",\"/dev/loop0\"],[\"Metadata file\",\"/dev/loop1\"],[\"Data Space Used\",\"6.438 GB\"],[\"Data Space Total\",\"107.4 GB\"],[\"Data Space Available\",\"24.38 GB\"],[\"Metadata Space Used\",\"12.39 MB\"],[\"Metadata Space Total\",\"2.147 GB\"],[\"Metadata Space Available\",\"2.135 GB\"],[\"Udev Sync Supported\",\"true\"],[\"Deferred Removal Enabled\",\"false\"],[\"Deferred Deletion Enabled\",\"false\"],[\"Deferred Deleted Device Count\",\"0\"],[\"Data loop file\",\"/var/lib/docker/devicemapper/devicemapper/data\"],[\"Metadata loop file\",\"/var/lib/docker/devicemapper/devicemapper/metadata\"],[\"Library Version\",\"1.02.122 (2016-04-09)\"]]", "NFd=19", "Architecture=x86_64", "RegistryConfig={\"IndexConfigs\":{\"docker.io\":{\"Mirrors\":null,\"Name\":\"docker.io\",\"Official\":true,\"Secure\":true}},\"InsecureRegistryCIDRs\":[\"172.30.0.0/16\",\"127.0.0.0/8\"],\"Mirrors\":null}", "ID=OAIZ:3FRZ:M4AK:BRCO:V25P:DV6J:MY7L:OZU5:WJTK:ZOQI:PSQB:UFHH", "BridgeNfIptables=true", "InitPath=", "NoProxy=", "Registries=[{\"Name\":\"docker.io\",\"Secure\":true}]", "ContainersStopped=15", "Plugins={\"Authorization\":null,\"Network\":[\"null\",\"host\",\"bridge\"],\"Volume\":[\"local\"]}", "LoggingDriver=journald", "OomKillDisable=true", "KernelVersion=4.5.7-300.fc24.x86_64", "OperatingSystem=Fedora 24 (Workstation Edition)", "IndexServerAddress=https://index.docker.io/v1/", "ContainersRunning=0", "Images=180", "Driver=devicemapper", "MemoryLimit=true", "MemTotal=16501252096", "ServerVersion=1.10.3", "BridgeNfIp6tables=true", "SystemTime=2016-06-22T12:57:34.004727286+02:00", "NEventsListener=0", "IndexServerName=docker.io", "SystemStatus=null", "SwapLimit=true", "CpuCfsQuota=true", "IPv4Forwarding=true", "InitSha1=b6a3efd42bd0b3818bae0a57dc24315471eadbe2", "NCPU=8", "DockerRootDir=/var/lib/docker", "HttpProxy="}
I0622 12:57:34.005258    9576 helper.go:43] Found RegistryConfig entry: RegistryConfig={"IndexConfigs":{"docker.io":{"Mirrors":null,"Name":"docker.io","Official":true,"Secure":true}},"InsecureRegistryCIDRs":["172.30.0.0/16","127.0.0.0/8"],"Mirrors":null}
I0622 12:57:34.005287    9576 helper.go:51] Unmarshalled registry config to &dockerhelper.RegistryConfig{InsecureRegistryCIDRs:[]string{"172.30.0.0/16", "127.0.0.0/8"}}
I0622 12:57:34.005298    9576 helper.go:58] Looking for "172.30.0.0/16" in []string{"172.30.0.0/16", "127.0.0.0/8"}
I0622 12:57:34.005304    9576 helper.go:61] Found "172.30.0.0/16"
-- Checking for available ports ...
I0622 12:57:34.005329    9576 run.go:143] Creating container named ""
config:
  image: openshift/origin:v1.3.0-alpha.1
  entry point:
    /bin/bash
  command:
    -c
    cat /proc/net/tcp /proc/net/tcp6

host config:
  pid mode: host
  network mode: host

I0622 12:57:34.341733    9576 run.go:152] Container created with id "1f68ff285b18ca0dbc0945359e7a3221489a48ca9b635316f2ddfecd8d2a2017"
I0622 12:57:34.341760    9576 run.go:153] Container: &docker.Container{ID:"1f68ff285b18ca0dbc0945359e7a3221489a48ca9b635316f2ddfecd8d2a2017", Created:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}, Path:"", Args:[]string(nil), Config:(*docker.Config)(nil), State:docker.State{Running:false, Paused:false, Restarting:false, OOMKilled:false, Pid:0, ExitCode:0, Error:"", StartedAt:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}, FinishedAt:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, Image:"", Node:(*docker.SwarmNode)(nil), NetworkSettings:(*docker.NetworkSettings)(nil), SysInitPath:"", ResolvConfPath:"", HostnamePath:"", HostsPath:"", LogPath:"", Name:"", Driver:"", Mounts:[]docker.Mount(nil), Volumes:map[string]string(nil), VolumesRW:map[string]bool(nil), HostConfig:(*docker.HostConfig)(nil), ExecIDs:[]string(nil), RestartCount:0, AppArmorProfile:""}
I0622 12:57:34.341836    9576 run.go:186] Attaching to container "1f68ff285b18ca0dbc0945359e7a3221489a48ca9b635316f2ddfecd8d2a2017"
I0622 12:57:34.342408    9576 run.go:209] Attach is successful.
I0622 12:57:34.342422    9576 run.go:213] Starting container "1f68ff285b18ca0dbc0945359e7a3221489a48ca9b635316f2ddfecd8d2a2017"
I0622 12:57:34.570700    9576 run.go:219] signaling attached channel
I0622 12:57:34.570731    9576 run.go:221] Waiting for container "1f68ff285b18ca0dbc0945359e7a3221489a48ca9b635316f2ddfecd8d2a2017"
I0622 12:57:35.327882    9576 run.go:226] Done waiting for container "1f68ff285b18ca0dbc0945359e7a3221489a48ca9b635316f2ddfecd8d2a2017", rc=0. Waiting for attach routine

@quintesse so I tried to reproduce on Fedora 24 with the same Docker version: started a VM, installed Docker, got the oc client. I ran oc cluster up and it got stuck in the same place. I then restarted the VM, ran it again, and everything worked; I couldn't reproduce it again after that. I will try a new VM and see if I can get it into the same state. In the meantime, can you try just restarting your machine?

I've rebooted but still the same.

Which was more or less what I expected, because I actually upgraded to F24 yesterday and it didn't work for me under F23 either, so I had already rebooted a couple of times.

I'm no longer able to reproduce even in a new vm. There must be something else in the environment that's causing this. Can you include the output of docker info from your machine?

Also, try running the same command manually (this is the point where it seems to be getting stuck):
docker run --rm --privileged --net=host --pid=host --entrypoint=/bin/bash openshift/origin:v1.3.0-alpha.1 -c "cat /proc/net/tcp /proc/net/tcp6" and see if that succeeds.

@bparees

docker info:

$ docker info
Containers: 17
 Running: 0
 Paused: 0
 Stopped: 17
Images: 180
Server Version: 1.10.3
Storage Driver: devicemapper
 Pool Name: docker-8:3-685902-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 107.4 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 6.449 GB
 Data Space Total: 107.4 GB
 Data Space Available: 24.39 GB
 Metadata Space Used: 12.48 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.135 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.122 (2016-04-09)
Execution Driver: native-0.2
Logging Driver: journald
Plugins:
 Volume: local
 Network: host bridge null
Kernel Version: 4.5.7-300.fc24.x86_64
Operating System: Fedora 24 (Workstation Edition)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 0
CPUs: 8
Total Memory: 15.37 GiB
Name: rhlaptop
ID: OAIZ:3FRZ:M4AK:BRCO:V25P:DV6J:MY7L:OZU5:WJTK:ZOQI:PSQB:UFHH
Username: quintesse
Registry: https://index.docker.io/v1/
Registries: docker.io (secure)

docker run ...:

$ docker run --rm --privileged --net=host --pid=host --entrypoint=/bin/bash openshift/origin:v1.3.0-alpha.1 -c "cat /proc/net/tcp /proc/net/tcp6"
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:170D 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 24059 1 ffff8803f3f88780 100 0 0 10 0
   1: 00000000:006F 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 21853 1 ffff880400600000 100 0 0 10 0
   2: 017AA8C0:0035 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 26011 1 ffff8803f77c0000 100 0 0 10 0
   3: 0100007F:1735 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 13232 1 ffff8803f3c08000 100 0 0 10 0
   4: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 22511 1 ffff880400600780 100 0 0 10 0
   5: 0100007F:0277 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 35641 1 ffff8803f1491680 100 0 0 10 0
   6: 0100007F:0019 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 35565 1 ffff8803f1490f00 100 0 0 10 0
   7: 2E01A8C0:0016 0201A8C0:EE5C 01 00000000:00000000 02:000AE5AB 00000000     0        0 44937 2 ffff8803f3f88f00 23 4 33 10 -1
  sl  local_address                         remote_address                        st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000000000000000000000000000:170D 00000000000000000000000000000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 24060 1 ffff8800c5268000 100 0 0 10 0
   1: 00000000000000000000000000000000:006F 00000000000000000000000000000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 21856 1 ffff88040a200000 100 0 0 10 0
   2: 00000000000000000000000000000000:0016 00000000000000000000000000000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 22513 1 ffff88040a200840 100 0 0 10 0
   3: 00000000000000000000000001000000:0277 00000000000000000000000000000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 35640 1 ffff8800c5020000 100 0 0 10 0

@quintesse you can also attempt to harvest stacktraces from the process by killing it while it's hung like so, where $OC_PID is the process ID of the oc command:

$ kill -SIGABRT "${OC_PID}"

This should dump a stacktrace to stderr.

Sure, here you are: oc.out.txt

I just tried with the latest alpha.2 and the problem is still there. Any progress on figuring out what could be wrong?

@quintesse So based on your thread dump, I can see where it's stuck. The go docker client is waiting to read from the docker socket and never coming back: https://github.com/openshift/origin/blob/6e83535e99fad9420ea8979e97701700905ba9ed/Godeps/_workspace/src/github.com/fsouza/go-dockerclient/client.go#L655
However, I don't yet have a good theory why this is happening on your environment and nowhere else I've tried it.

I just tried with alpha.3 and unfortunately the problem is still there.

Btw, the Docker version I should be using is the one included in Fedora, right? (1.10.3)
Because I see that on the Docker site they're at 1.12.

Hi @quintesse, just something to try... before you run 'oc cluster up', can you try flushing your iptables rules? (sudo iptables -F)

Yay! Success!

.
.
-- Checking for available ports ...
   WARNING: Binding DNS on port 8053 instead of 53, which may not be resolvable from all clients.
-- Checking type of volume mount ...
   Using nsenter mounter for OpenShift volumes
.
.
.
etc

\o/ :)

Thanks for the help @csrwng !
Not sure if this issue should stay open, or whether this is considered a valid workaround and we should close it?

NB: unrelated but hopefully you can help: what's the best place to go to for technical questions related to OpenShift? I tried the IRC channel once but people were not responsive at all unfortunately.

I'm looking for answers to questions like: what could be the reason for the following error in my application:

Caused by: java.io.IOException: Cannot create dirs: /app/?/.ceylon/cache/org/jboss/logging/3.3.0.Final

Especially the ? is strange.

I only have this with the OpenShift instance created by the oc cluster up command, btw; the one running in a VM created by Vagrant doesn't show this problem.

@quintesse there are a few issues that are resolved by running iptables -F (especially on Fedora 24) before running oc cluster up. We still need to investigate why that is, and hopefully check the current state of the rules to prevent them from causing problems. We already have an issue to investigate this: https://github.com/openshift/origin/issues/10139, so if you like, go ahead and close this one.

As far as your question goes ... do you know what the actual directory should be? Without knowing anything else, it looks like the ? is a placeholder for something else that somehow did not get resolved properly in this environment.

Ok, closing!

Btw, I don't want to muddy the waters of this issue with unrelated stuff, I just wanted to know what the best place would be for what surely are problems in my application and/or my understanding of how OpenShift works :)
I sent it now to the openshift google groups mailing list to see if I get any response there.
Thanks!

I'm seeing exactly this bug trying to run "oc cluster up" on Fedora 23, but "iptables -F" doesn't change anything.

@jim-minter reopening this one rather than filing a new one because all symptoms/logs are exactly as in this bug (except that "iptables -F" has no effect).

@danwinship can you please include a pastebin of the output of oc cluster up --loglevel=8?

@danwinship I set up an env similar to yours (Fedora 23 with the devicemapper loopback storage driver and Docker 1.10.3, on a VM) but still can't reproduce.

If you can, please capture 2 goroutine stack traces for me:

  1. Run 'oc cluster up' to the point that it hangs, then kill -6 [pid of oc]
  2. Run it again to the point where it hangs, and kill -10 [pid of docker daemon]
    The second one should make the daemon write out a log entry with a goroutine stack trace but should not kill it. You can find it as the last entry in journalctl -u docker

All a bit of a mystery so far - spent this morning trying to replicate but haven't managed it. Finding it hard to believe this isn't a docker bug (e.g. their issue 27289 looks pretty similar, but I don't think it affects 1.10 servers), but who knows...

thanks - docker goroutine dump separated out at https://paste.fedoraproject.org/501081/12566214/

So on the oc client side, we're stuck waiting to copy from the network (~line 105 from http://paste.fedoraproject.org/501076/48112497/) after the hijack call. On the Docker side, it seems like we're trying to acquire a semaphore when reading from the network? (lines 298-318 from https://paste.fedoraproject.org/501081/12566214/)

The oc client is waiting to read on the hijacked connection. The docker backend has multiple pairs of goroutines (one for each stdout, one for each stderr) sitting waiting in io.Copy for incoming data, as well as associated goroutines waiting for the former goroutines to terminate.

Effectively I suppose that the io.Copy goroutines have somehow missed the fact that the stream has ended and they should pack up and go home.

This is surely a docker server issue. I suspect that it's a timing issue like https://github.com/docker/docker/issues/27289 - e.g. perhaps the exec'ed command runs so quickly that it completes before stdout and stderr can be attached properly?

It would be nice if csrwng or I could reproduce this - @danwinship is this happening every time for you?

perhaps the exec'ed command runs so quickly that it completes before stdout and stderr can be attached properly?

We can simulate this by changing the place where we signal the attach goroutine to proceed:
Move https://github.com/openshift/origin/blob/master/pkg/bootstrap/docker/run/run.go#L237
to some place after
https://github.com/openshift/origin/blob/master/pkg/bootstrap/docker/run/run.go#L239
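For reference, the sequence in question looks roughly like the sketch below (simplified and with hypothetical structure; it is not the actual run.go code, just the same create/attach/start/wait ordering written against the vendored fsouza/go-dockerclient API). The hang corresponds to the final wait on the attach goroutine never returning when the container exits before that goroutine starts reading:

// Simplified sketch of the create/attach/start/wait ordering used by the
// port check (hypothetical code, not the actual run.go implementation).
package main

import (
	"bytes"
	"fmt"

	docker "github.com/fsouza/go-dockerclient"
)

func main() {
	client, err := docker.NewClientFromEnv()
	if err != nil {
		panic(err)
	}

	container, err := client.CreateContainer(docker.CreateContainerOptions{
		Config: &docker.Config{
			Image:      "openshift/origin:v1.3.0-alpha.1",
			Entrypoint: []string{"/bin/bash"},
			Cmd:        []string{"-c", "cat /proc/net/tcp /proc/net/tcp6"},
		},
		HostConfig: &docker.HostConfig{NetworkMode: "host", PidMode: "host"},
	})
	if err != nil {
		panic(err)
	}

	var out bytes.Buffer
	attachDone := make(chan error, 1)
	go func() {
		// Reads the hijacked connection; if the container has already exited
		// by the time copying starts, this may never see EOF - the hang.
		attachDone <- client.AttachToContainer(docker.AttachToContainerOptions{
			Container:    container.ID,
			OutputStream: &out,
			Stdout:       true,
			Stderr:       true,
			Stream:       true,
		})
	}()

	if err := client.StartContainer(container.ID, nil); err != nil {
		panic(err)
	}
	rc, err := client.WaitContainer(container.ID) // returns almost immediately: the command is fast
	if err != nil {
		panic(err)
	}
	fmt.Println("container exited, rc =", rc)

	<-attachDone // this is where oc sits forever on the affected machines
	fmt.Print(out.String())
}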

@jim-minter maybe we can interrupt the attach if the WaitContainer call has returned and the attach is not done.
OR
at least for this case, we don't need an attach. Simply run the container. After it runs, get its output, then remove it.
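The no-attach variant could look roughly like this (a sketch against the vendored fsouza/go-dockerclient API; naming and error handling are simplified and this is not the actual origin change):

// Sketch: run the port-check container without an attach - create, start,
// wait, fetch the output via the logs endpoint, then remove the container.
// Simplified illustration, not the actual origin implementation.
package main

import (
	"bytes"
	"fmt"

	docker "github.com/fsouza/go-dockerclient"
)

func main() {
	client, err := docker.NewClientFromEnv()
	if err != nil {
		panic(err)
	}

	container, err := client.CreateContainer(docker.CreateContainerOptions{
		Config: &docker.Config{
			Image:      "openshift/origin:v1.3.0-alpha.1",
			Entrypoint: []string{"/bin/bash"},
			Cmd:        []string{"-c", "cat /proc/net/tcp /proc/net/tcp6"},
		},
		HostConfig: &docker.HostConfig{NetworkMode: "host", PidMode: "host"},
	})
	if err != nil {
		panic(err)
	}
	// Always remove the throwaway container when done.
	defer client.RemoveContainer(docker.RemoveContainerOptions{ID: container.ID, Force: true})

	if err := client.StartContainer(container.ID, nil); err != nil {
		panic(err)
	}
	if _, err := client.WaitContainer(container.ID); err != nil {
		panic(err)
	}

	// No hijacked attach stream involved: read the finished container's
	// output through the logs endpoint instead.
	var out bytes.Buffer
	if err := client.Logs(docker.LogsOptions{
		Container:    container.ID,
		OutputStream: &out,
		Stdout:       true,
		Stderr:       true,
	}); err != nil {
		panic(err)
	}
	fmt.Print(out.String())
}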

It would be nice if csrwng or I could reproduce this - @danwinship is this happening every time for you?

It has so far. And I'm on bare metal, not a VM, so it's plausible that it could be happening because the code runs faster for me than for you

@danwinship yup I was able to reproduce by simulating the container finishing before we proceed with the attach. All you have to do is get a slower computer :D ... that's your problem right thar

@jim-minter an alternate solution is to change the command we run on the container to read from stdin and redirect to /dev/null before proceeding with the 'cat [blah]'. That way it has to wait for the attach to succeed.
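Concretely, the idea would be a container config along these lines (a hypothetical sketch of the approach, not the exact command that ends up in the PR):

// Hypothetical sketch: make the container drain stdin before dumping the
// port tables, so it cannot exit before the client's attach (which includes
// stdin) has been established. Not the exact command used in the real fix.
package main

import docker "github.com/fsouza/go-dockerclient"

func portCheckConfig() *docker.Config {
	return &docker.Config{
		Image:      "openshift/origin:v1.3.0-alpha.1",
		Entrypoint: []string{"/bin/bash"},
		Cmd: []string{"-c",
			"cat /dev/stdin > /dev/null; cat /proc/net/tcp /proc/net/tcp6"},
		OpenStdin: true, // keep stdin open so the first cat blocks until the attach provides it
		StdinOnce: true, // close stdin after the attached client disconnects
	}
}

func main() { _ = portCheckConfig() }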

Note: also need to refactor to run TestPorts() once to return the list of ports - at the moment we run it multiple times, which can't help

@danwinship can you please try with this binary? https://drive.google.com/open?id=0Bxi-kAIS_6Y7Qk9sZEtlUUludWM
I am making it read from input so that the attach will succeed.

/cc @ncdc re: exec returning too quickly leading to races and hung connections

@danwinship can you please try with this binary?

uh... can you give me a patch to test instead?

@danwinship what... you don't trust random binaries from the internet? :)

Here it is
https://gist.github.com/csrwng/9057649d90e089a11e62b57208a30261

yup, that works

thx, @danwinship. I'll put together a PR.

@jim-minter so looking at this fix: https://github.com/docker/docker/pull/21048, it seems that container.CancelAttachContext() is only called when the start fails and the container needs to be cleaned up. However, in this case it looks like the container has finished before the attach starts copying. A possible fix on the Docker side would be to call CancelAttachContext when the container has finished as well. @runcom fyi

The fix of waiting for stdin works ok, and seems to be the least disruptive on our side.

@csrwng ack. Can you guys test whether Docker 1.12.x has the same issue, though? Docker 1.10.3 is pretty old at this point and we're moving towards 1.12.x. If you could fix this in origin it would be better than carrying another backport on the Docker 1.10 branch and rebuilding everything for F24 (RHEL is moving to 1.12 soon).

FWIW we are using 1.12.2 on the CI jobs

FWIW we are using 1.12.2 on the CI jobs

is the CI job failing as well with this issue? BTW, can you update the CI to use at least 1.12.3? It fixes a lot of IO issues. (We'll soon have a 1.12.4 which will fix further IO-related flakes.)

is the CI job failing as well with this issue?

I am not sure. The reports here are for local use. If we saw it in CI, it would answer your question re: 1.12, though.

BTW, can you update the CI to use at least 1.12.3?

yes

@runcom yup, submitted a fix for origin (#12189). For this particular issue, only @danwinship can reliably reproduce it locally. I'm not sure whether he can try with 1.12 as well.

There's no 1.12 package for f23 and the f25 package doesn't seem to build successfully on f23, so I can't easily test it
