Seeing this issue in the networking extended tests:
Jun 10 18:11:26.242: INFO: At 2016-06-10 18:11:24 +0000 UTC - event for service-webserver: {kubelet nettest-node-1} Failed: Failed to start container with docker id f5ecb2f20b59 with error: API error (500): Cannot start container f5ecb2f20b591e2ca313d7bd98641873d2a7999721d64154855f0a86d6dee05a: Path /var/lib/openshift.local.volumes/pods/bde5b5fe-2f36-11e6-a3a6-024287af2ef2/containers/service-webserver-container/780e3547 is mounted on / but it is not a shared or slave mount.
As per https://github.com/docker/docker/issues/19625#issuecomment-203891275, it seems that the unit file setting MountFlags=slave causes this to occur.
In our AMI, we have:
$ sudo cat /usr/lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target
Wants=docker-storage-setup.service
Requires=rhel-push-plugin.socket
[Service]
Type=notify
NotifyAccess=all
EnvironmentFile=-/etc/sysconfig/docker
EnvironmentFile=-/etc/sysconfig/docker-storage
EnvironmentFile=-/etc/sysconfig/docker-network
Environment=GOTRACEBACK=crash
ExecStart=/bin/sh -c '/usr/bin/docker-current daemon \
--authorization-plugin=rhel-push-plugin \
$OPTIONS \
$DOCKER_STORAGE_OPTIONS \
$DOCKER_NETWORK_OPTIONS \
$ADD_REGISTRY \
$BLOCK_REGISTRY \
$INSECURE_REGISTRY \
2>&1 | /usr/bin/forward-journald -tag docker'
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
MountFlags=slave
TimeoutStartSec=10min
Restart=on-abnormal
StandardOutput=null
StandardError=null
[Install]
WantedBy=multi-user.target
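To check quickly whether a given unit file sets the flag in question, something like the sketch below could be used. The helper name `unit_mount_flags` is made up for illustration; it only reads text and does not query systemd itself.

```shell
# Print the value of the last MountFlags= line found in unit file text on stdin.
# unit_mount_flags is a hypothetical helper name, not an existing tool.
unit_mount_flags() {
    sed -n 's/^MountFlags=//p' | tail -n 1
}

# Usage on a live host:
# unit_mount_flags < /usr/lib/systemd/system/docker.service
```

Empty output means the unit file has no MountFlags= line.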
We don't seem to have updated the version of Docker we're running yet ...
$ docker version
Client:
Version: 1.10.3
API version: 1.22
Package version: docker-common-1.10.3-25.el7.x86_64
Go version: go1.4.2
Git commit: 86bbf84/1.10.3
Built:
OS/Arch: linux/amd64
Server:
Version: 1.10.3
API version: 1.22
Package version: docker-common-1.10.3-25.el7.x86_64
Go version: go1.4.2
Git commit: 86bbf84/1.10.3
Built:
OS/Arch: linux/amd64
Question becomes... why are we seeing this now? Do we need to change that option?
/cc @marun @danwinship @runcom @smarterclayton
We recently changed the default mount propagation in Docker 1.10.3 from rprivate to rslave. I remember that error well from Docker's integration tests; it is usually related to bind mounts, and you can fix it by appending ":rprivate" to the volume definition passed to "-v". That said, I'm not sure how the tests are run in this specific scenario.
Upstream Docker has also recently removed MountFlags=slave. I think you should do the same here, though I'll defer to @rhvgoyal on this.
I also don't think we're shipping with that flag in the unit file in Fedora, so I'm not sure how it got there.
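For illustration, a volume definition with an explicit propagation mode might look like the sketch below. The host path, container path, image name, and the helper `volume_arg` are placeholders, not values taken from the failing tests.

```shell
# Build a -v argument with an explicit propagation suffix.
# volume_arg is a hypothetical helper; the paths used with it are placeholders.
volume_arg() {
    # $1 = host path, $2 = container path, $3 = propagation mode (e.g. rprivate)
    printf '%s:%s:%s\n' "$1" "$2" "$3"
}

# Example (needs a running Docker daemon, so it is left commented out):
# docker run --rm -v "$(volume_arg /srv/data /data rprivate)" busybox ls /data
```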
Something must have changed elsewhere as we have had the MountFlags=slave entry in that unit file at least since 06/02 as can be seen here.
I think changing the default mount propagation in docker itself is the issue triggering this. I think you added it to overcome that specific BZ but then we fixed docker?
@runcom No, that post was a simple cat of the unit file. I never changed what was there
It does seem that build 25 did not have the mount propagation fix in Docker (we're at 40 now).
Alright, I'll let Vivek speak since I'm out of ideas. I still think we should remove it now, since we changed the default mount propagation in Docker to rslave.
While this may get fixed next week when @tdawson is back to update the version of Docker we're using ... we need to figure out why we were OK before and are broken now ... might be a DIND thing?
By _before_ you mean with 1.9?
And yes it's really likely a DIND thingie
@danwinship confirmed the error we are seeing is from the daemon inside of DIND.
Only one change has gone in to Origin between the last successful test from test_origin_pull_requests_origin_networking at 10:25am and now ... it's https://github.com/openshift/origin/pull/8371
@miminar any thoughts on how your changes could contribute to this from an Origin behavior standpoint?
I'm sending a custom repo you can add to get the newest builds.
@lsm5 I'm not sure that's necessary or going to be ultimately useful -- I don't think the DIND builds are hitting the bleeding edge repository as it is
@runcom re: Fedora... on an F23 machine here and this is my unit file:
$ sudo cat /usr/lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target
Wants=docker-storage-setup.service
[Service]
Type=notify
NotifyAccess=all
EnvironmentFile=-/etc/sysconfig/docker
EnvironmentFile=-/etc/sysconfig/docker-storage
EnvironmentFile=-/etc/sysconfig/docker-network
Environment=GOTRACEBACK=crash
ExecStart=/bin/sh -c '/usr/bin/docker daemon \
$OPTIONS \
$DOCKER_STORAGE_OPTIONS \
$DOCKER_NETWORK_OPTIONS \
$INSECURE_REGISTRY \
2>&1 | /usr/bin/forward-journald -tag docker'
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
MountFlags=slave
StandardOutput=null
StandardError=null
TimeoutStartSec=0
Restart=on-abnormal
[Install]
WantedBy=multi-user.target
Flag is set to slave unlike what you say should be the case
Right, we changed it in F24 and rawhide, and there is no more MountFlags there. On F23 I think it comes from the old Docker 1.9, and (blame me) we probably forgot to remove it from the unit file when we switched to 1.10.3 (where this flag isn't needed at all and may cause issues, as is probably the case here).
As per the original thread's suggestions:
$ sudo nsenter --mount=/proc/$(cat /var/run/docker.pid)/ns/mnt findmnt -o TARGET,PROPAGATION
TARGET PROPAGATION
/ private,slave
├─/tmp private,slave
├─/var/lib/docker/containers/b3e3763922d3286d59745edee89918565b7a7ff969afd21b76807639e332240d/shm private
├─/dev private,slave
│ ├─/dev/shm private,slave
│ ├─/dev/pts private,slave
│ ├─/dev/mqueue private,slave
│ └─/dev/hugepages private,slave
├─/proc private,slave
│ └─/proc/sys/fs/binfmt_misc private,slave
├─/sys private,slave
│ ├─/sys/kernel/security private,slave
│ ├─/sys/fs/cgroup private,slave
│ │ ├─/sys/fs/cgroup/systemd private,slave
│ │ ├─/sys/fs/cgroup/devices private,slave
│ │ ├─/sys/fs/cgroup/memory private,slave
│ │ ├─/sys/fs/cgroup/cpuset private,slave
│ │ ├─/sys/fs/cgroup/cpu,cpuacct private,slave
│ │ ├─/sys/fs/cgroup/blkio private,slave
│ │ ├─/sys/fs/cgroup/hugetlb private,slave
│ │ ├─/sys/fs/cgroup/freezer private,slave
│ │ ├─/sys/fs/cgroup/net_cls private,slave
│ │ └─/sys/fs/cgroup/perf_event private,slave
│ ├─/sys/fs/pstore private,slave
│ ├─/sys/kernel/config private,slave
│ ├─/sys/fs/selinux private,slave
│ └─/sys/kernel/debug private,slave
├─/run private,slave
│ ├─/run/user/1000 private,slave
│ ├─/run/user/0 private,slave
│ └─/run/docker/netns/2a1a92a7a55a private,slave
├─/mnt/openshift-xfs-vol-dir private,slave
└─/var/lib/docker/devicemapper private
  └─/var/lib/docker/devicemapper/mnt/b1e1592b93c30e1f80f983cd7bec275ddda2f46fd9bbbcbd603aaee8e3adbe4b private
$ sudo ls -l /proc/1/ns/ /proc/$(cat /var/run/docker.pid)/ns/
/proc/1/ns/:
total 0
lrwxrwxrwx. 1 root root 0 Jun 10 16:10 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 root root 0 Jun 10 16:10 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 root root 0 Jun 10 16:10 net -> net:[4026532028]
lrwxrwxrwx. 1 root root 0 Jun 10 16:10 pid -> pid:[4026531836]
lrwxrwxrwx. 1 root root 0 Jun 10 16:10 uts -> uts:[4026531838]
/proc/2104/ns/:
total 0
lrwxrwxrwx. 1 root root 0 Jun 10 16:10 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 root root 0 Jun 10 16:10 mnt -> mnt:[4026532249]
lrwxrwxrwx. 1 root root 0 Jun 10 16:10 net -> net:[4026532028]
lrwxrwxrwx. 1 root root 0 Jun 10 16:10 pid -> pid:[4026531836]
lrwxrwxrwx. 1 root root 0 Jun 10 16:10 uts -> uts:[4026531838]
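The two listings differ only in the mnt entry (4026531840 for PID 1 versus 4026532249 for the daemon), which indicates the daemon runs in its own mount namespace. The same comparison can be sketched as a small check; the helper name `same_mnt_ns` is made up for illustration.

```shell
# Compare the mount-namespace link targets of two processes.
# same_mnt_ns is a hypothetical helper name, not an existing tool.
same_mnt_ns() {
    # $1, $2 = namespace link targets such as "mnt:[4026531840]"
    [ "$1" = "$2" ]
}

# Usage on a live host (root is needed to read another process's namespace links):
# init_ns=$(sudo readlink /proc/1/ns/mnt)
# docker_ns=$(sudo readlink "/proc/$(cat /var/run/docker.pid)/ns/mnt")
# if same_mnt_ns "$init_ns" "$docker_ns"; then
#     echo "daemon shares init's mount namespace"
# else
#     echo "daemon has a separate mount namespace"
# fi
```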
(addition to my previous comment, we should remove it also from Docker in rhel since we're on 1.10.3 there as well)
Something in last night's F23 docker-1.10.3-24 update is the culprit. Forcibly downgrading to docker-1.10.3-22 makes the tests pass again. (I didn't test -23, which looks like it was never released anyway.)
What changed between 22 and 23 (and 24, for that matter) is that we changed the default volume propagation mode in Docker from rprivate to rslave, as requested in a Bugzilla (I don't remember which one, but we discussed it extensively there with @rhatdan, @rhvgoyal and @ncdc).
Also, I just heard from Vivek as well, and that MountFlags setting will be removed from F23 (doing it right now), so if it's causing issues you can remove it or wait for the next update.
Found the BZ https://bugzilla.redhat.com/show_bug.cgi?id=1339146
Yes, going forward we should not set the mount propagation in the docker unit file. As @runcom says, this is intentional, to allow people to specify the mount propagation at the volume mount point.
@stevekuznetsov dind wasn't compatible with docker 1.10. #7668 fixes that.
This should fix it:
sudo mount --make-shared /
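To confirm the change took effect, the propagation of / can be read back with findmnt. The sketch below assumes the PROPAGATION column is formatted as in the findmnt output earlier in this thread; `has_shared` is a made-up helper name.

```shell
# Report whether a findmnt PROPAGATION value includes "shared".
# has_shared is a hypothetical helper name, not an existing tool.
has_shared() {
    case "$1" in
        *shared*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a live host:
# prop=$(findmnt -n -o PROPAGATION /)
# if has_shared "$prop"; then echo "/ is shared"; else echo "/ is $prop"; fi
```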