Reason:ContainerCannotRun,Message:invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:258: applying cgroup configuration for process caused \\"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod86daabc9_9486_11e7_a2b8_42010a800005.slice/docker-ca877fc648dcf466185be3e66b95e774f4589f866620920615c40c12bfe86f2a.scope/cgroup.procs: no such file or directory\\"\"n"
Since this could be a runtime regression I'd like this triaged and investigated. I assume it could be just a poorly handled error but that's no excuse.
@kargakis @jwhonce @bparees @mfojtik @ewolinetz @simo5 @pweil- @tnozicka @sjenning I have a suspicion that #16994, #16881, #16985, #16023 and this issue are all duplicates of each other. If you think so, I suggest closing the others and renaming this issue to process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/.../cgroup.procs: no such file or directory.
@vikaschoudhary16 can you take a look at this one please. I've dupped many issues to this one so it seems to happen pretty frequently.
@sjenning sure!!
we're experiencing the very same issue in CRI-O, it's either a runc or systemd bug :confused:
/cc @mrunalp
Here is the experimental PR, https://github.com/openshift/origin/pull/17600
I found a missing lock in runc. With this PR, I am verifying whether this missing lock is the reason for this flake. I will keep the PR open for a long time and keep rerunning CI tests to see if the same flake reappears on it.
Meanwhile, if the flake is observed often on other PRs but never on this experimental PR over a longer period of time, chances will be high that this lock has really helped.
Corresponding PR on the runc: https://github.com/opencontainers/runc/pull/1668
Was this flake noticed recently? I have not been able to reproduce it in the past couple of days.
This issue is definitely not an Origin issue. Here it is reported on OSP:
https://bugzilla.redhat.com/show_bug.cgi?id=1523043
How I debugged a race in systemd

Context/Background

There are several open bugs, on several repos, which are basically about a rarely occurring container creation failure: containers fail to join the pids cgroup controller.
process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/.../cgroup.procs: no such file or directory
Following are some of these issues:
Where is the race in code:

systemd, at time T0, is still in the process of unit creation: while creating the cgroup paths, it deletes the .scope directory which was created at time T1 by runc (from here in the code). Here is the related call trace from systemd:

```
(gdb) bt
#0  cg_trim (controller=<optimized out>, path=<optimized out>, delete_root=<optimized out>) at src/shared/cgroup-util.c:575
#1  0x0000564b1c6b85b0 in cg_create_everywhere (path=0x564b1caefc90 "/libcontainer_54684_systemd_test_default.slice", mask=_CGROUP_CONTROLLER_MASK_ALL, supported=63)
    at src/shared/cgroup-util.c:1637
#2  unit_create_cgroups (mask=_CGROUP_CONTROLLER_MASK_ALL, u=0x564b1c9e4e70) at src/core/cgroup.c:679
#3  unit_realize_cgroup_now.26298 (u=u@entry=0x564b1c9e4e70, state=MANAGER_RUNNING) at src/core/cgroup.c:751
#4  0x0000564b1c6b8a89 in unit_realize_cgroup (u=<optimized out>) at src/core/cgroup.c:853
#5  0x0000564b1c688e9e in slice_start.65343 (u=<optimized out>) at src/core/slice.c:187
#6  0x0000564b1c69dd9c in unit_start (u=<optimized out>) at src/core/unit.c:1486
#7  job_run_and_invalidate (j=<optimized out>) at src/core/job.c:532
#8  manager_dispatch_run_queue.4007 (source=<optimized out>, userdata=0x564b1c9d3e10) at src/core/manager.c:1539
#9  0x0000564b1c6cb032 in source_dispatch (s=s@entry=0x564b1c9d27e0) at src/libsystemd/sd-event/sd-event.c:2155
#10 0x0000564b1c6cb2aa in sd_event_dispatch (e=0x564b1c9d47e0) at src/libsystemd/sd-event/sd-event.c:2472
#11 0x0000564b1c6704d7 in manager_loop (m=0x564b1c9d3e10) at src/core/manager.c:2212
#12 main (argc=4, argv=0x7fff036998b8) at src/core/main.c:1801
```
runc, unaware that systemd has deleted the .scope directory created by it, tried to write the container PID to cgroup.procs, which results in the container creation failure with the error message shown in the beginning.

Debugging Steps:

To fix any issue, or to verify the effect of a probable fix, one should be able to reproduce the issue.
This issue was not easily reproducible. I built runc from source and used runc --systemd-cgroup create <container-id> in a Go script, test-runc.go, to launch ~200 containers in parallel using goroutines.
```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

func main() {
	ch := make(chan int, 200)
	// Launch 200 containers (ids 100..299) in parallel.
	for i := 100; i < 300; i++ {
		go run(i, ch)
	}
	// Wait for all 200 goroutines to report completion.
	for i := 100; i < 300; i++ {
		fmt.Printf("waiting on channel, i=%v\n", i)
		j := <-ch
		fmt.Printf("received on channel %v\n", j)
	}
	close(ch)
	fmt.Printf("All created\n")
}

func run(i int, ch chan int) {
	var stderr bytes.Buffer
	id := strconv.Itoa(i)
	cmd := exec.Command("runc", "--systemd-cgroup", "create", id)
	cmd.Dir = "/mycontainer"
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		fmt.Println(err.Error() + ": " + stderr.String())
		fmt.Printf("error for container %v\n", i)
		os.Exit(-1)
	}
	fmt.Printf("writing to channel, i=%v\n", i)
	ch <- i
}
```
It didn't work out. Every time, the script created all 200 containers successfully. I ensured that my systemd, runc and kernel versions were the same as in the reported issues. Beyond 200, container creation started failing with a too many open files error. After increasing the system's default limits, the script created 500 containers successfully, but the original issue still did not reproduce. What next?
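For reference, the open-file limit can be raised for the current shell before rerunning the script; the value below is purely illustrative, and persistent limits are configured in /etc/security/limits.conf:

```shell
# Current soft limit on open file descriptors for this shell
ulimit -n
# Raise it for this shell session (must not exceed the hard limit,
# shown by `ulimit -Hn`)
ulimit -n 2048
ulimit -n
```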
I did some analysis of the situation.
```go
...
func join(c *configs.Cgroup, subsystem string, pid int) (string, error) {
...
350     if err := os.MkdirAll(path, 0755); err != nil {
351         return "", err
352     }
353     if err := cgroups.WriteCgroupProc(path, pid); err != nil {
354         return "", err
355     }
356     return path, nil
357 }
...
```
There could be only one of two possible reasons for runc not finding the cgroup.procs file at L#353:
1. the kernel did not create the cgroup.procs file while creating the cgroup directory at L#350, or
2. somebody else deleted the file (or its parent directory) after L#350 ran.

I strongly suspected the second one to be the case, which proved to be true after some browsing of the kernel code in cgroup.c.
This means systemd must be deleting the file after the execution of L#350 and before the execution of L#353. Since these are almost consecutive lines, systemd gets only a very small time window in which to delete the file, which makes the issue hard to reproduce.

Let's widen this time window for systemd by adding a sleep of 1 second between L#350 and L#353. I made the following change in runc and reran test-runc.go:
```go
...
350     if err := os.MkdirAll(path, 0755); err != nil {
351         return "", err
352     }
353     if strings.Contains(path, "pids") {
354         time.Sleep(1 * time.Second)
355     }
356     if err := cgroups.WriteCgroupProc(path, pid); err != nil {
...
```
And BANG!!! The issue was now reproducible on each run of test-runc.go, i.e. the race would hit every single time the script tried to create 200 containers.
strace on systemd:

Then I ran strace on PID 1 while executing test-runc.go. Comparing the syscalls made by systemd for a successful container create against those in the failure case further verified that systemd was indeed doing an rmdir on the scope unit. To be specific, there was a failed rmdir() on the pids cgroup scope in each successful case, while there was a successful rmdir() on the pids cgroup in the race-hit case.
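To compare the two cases, the rmdir() calls have to be picked out of the strace output. A small, hypothetical helper along these lines does the filtering (the sample log lines below are made up, but follow strace's rmdir output format):

```go
package main

import (
	"fmt"
	"strings"
)

// succeededPidsRmdirs returns the strace lines for rmdir() calls that
// targeted a pids cgroup AND succeeded (returned 0). Failed attempts
// show a -1 return with ENOENT/EBUSY instead.
func succeededPidsRmdirs(traceLog string) []string {
	var hits []string
	for _, line := range strings.Split(traceLog, "\n") {
		if strings.Contains(line, "rmdir(") &&
			strings.Contains(line, "/sys/fs/cgroup/pids/") &&
			strings.Contains(line, "= 0") {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	// Made-up sample: a failed rmdir (the normal case) and a
	// successful one (the race-hit case).
	log := `rmdir("/sys/fs/cgroup/pids/kubepods.slice/docker-abc.scope") = -1 ENOENT (No such file or directory)
rmdir("/sys/fs/cgroup/pids/kubepods.slice/docker-def.scope") = 0`
	for _, l := range succeededPidsRmdirs(log) {
		fmt.Println(l)
	}
}
```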
But how to find out where, and why, this rmdir() syscall on the pids cgroup is invoked in the code?
Now, to debug further and understand the cause, a systemd dev setup was needed, with an instrumented systemd built from source.
Building and installing instrumented systemd:

- systemd-nspawn: did not serve the purpose. Inside the launched container, /sys/fs/cgroup was read-only and it was not possible to create any cgroups. Maybe I missed some configuration.
- make; make install on the systemd source inside a Fedora VM: the VM kept hanging at reboot. I could not find the reason, so I dropped this approach as well.
- A kvmtool VM with a custom-built root filesystem:

```
$ yum -y --releasever=25 --nogpg --installroot=$HOME/distro-trees/fedora-24/ --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal
$ cd systemd; DESTDIR=$HOME/distro-trees/fedora-24 make install
$ cd ../kvmtool
$ ./lkvm run --disk /root/distro-trees/fedora-24 --kernel ../bzImage --network bridge --mem 8024
```
This approach seemed to be working. I set up runc, systemd and test-runc.go in this VM. But the good times were still a bit far off: test-runc.go was able to create as many as 500 containers each time, and the issue was not reproducing. When I verified through strace and the systemd logs, I found that systemd, somehow, was not invoking the rmdir() syscall on the pids cgroup at all. Since on bare metal I had seen the connection between the race and the rmdir() on the pids cgroup, it was clear why the issue was not reproducing in the kvmtool VM.
While I was still trying to understand the cause of the missing rmdir and the reason for the difference in behavior from bare metal, one of my teammates suggested using an RPM-based systemd installation on bare metal.
I used rhpkg and rpmbuild to create and install a patched systemd. It worked like a charm: I finally had the setup that I wanted for tapping into systemd.

Why systemd invokes rmdir() on pids cgroup?

There is a patch in systemd-219, generated from the sources fetched using rhpkg, 0475-core-add-support-for-the-pids-cgroup-controller.patch. This patch adds CGROUP_PIDS to CGroupControllerMask (along with some other changes) but does not update _CGROUP_CONTROLLER_MASK_ALL from 31 to 63. This mistake makes the check at L#1628 fail for pids, and therefore cg_trim() is always invoked for pids.
Here is the relevant code:
src/shared/cgroup-util.c

```c
1604 static const char mask_names[] =
1605         "cpu\0"
1606         "cpuacct\0"
1607         "blkio\0"
1608         "memory\0"
1609         "devices\0"
1610         "pids\0";
1611
1612 int cg_create_everywhere(CGroupControllerMask supported, CGroupControllerMask mask, const char *path) {
1613         CGroupControllerMask bit = 1;
1614         const char *n;
1615         int r;
1616
1617         /* This one will create a cgroup in our private tree, but also
1618          * duplicate it in the trees specified in mask, and remove it
1619          * in all others */
1620
1621         /* First create the cgroup in our own hierarchy. */
1622         r = cg_create(SYSTEMD_CGROUP_CONTROLLER, path);
1623         if (r < 0)
1624                 return r;
1625
1626         /* Then, do the same in the other hierarchies */
1627         NULSTR_FOREACH(n, mask_names) {
1628                 if (mask & bit)
1629                         cg_create(n, path);
1630                 else if (supported & bit)
1631                         cg_trim(n, path, true);
1632
1633                 bit <<= 1;
1634         }
1635
1636         return 0;
1637 }
```

src/shared/cgroup-util.h

```c
32 typedef enum CGroupControllerMask {
33         CGROUP_CPU = 1,
34         CGROUP_CPUACCT = 2,
35         CGROUP_BLKIO = 4,
36         CGROUP_MEMORY = 8,
37         CGROUP_DEVICE = 16,
38         CGROUP_PIDS = 32,
39         _CGROUP_CONTROLLER_MASK_ALL = 31
40 } CGroupControllerMask;
```
At L#39, the value should be 63, so that the pids bit (32) is included in _CGROUP_CONTROLLER_MASK_ALL.
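To make the effect of the missing bit concrete, here is a small standalone sketch in Go (the write-up's own language). The constants mirror the C enum above; controllersTrimmed is a hypothetical reimplementation of the loop's branch logic, not systemd code:

```go
package main

import "fmt"

const (
	CGROUP_CPU     = 1
	CGROUP_CPUACCT = 2
	CGROUP_BLKIO   = 4
	CGROUP_MEMORY  = 8
	CGROUP_DEVICE  = 16
	CGROUP_PIDS    = 32

	maskAllBuggy = 31 // as shipped in the patched systemd-219
	maskAllFixed = 63 // what it should be once pids is included
)

// controllersTrimmed mirrors the NULSTR_FOREACH loop in
// cg_create_everywhere: a controller whose bit is absent from mask
// but present in supported takes the cg_trim() (delete) branch.
func controllersTrimmed(supported, mask int) []string {
	names := []string{"cpu", "cpuacct", "blkio", "memory", "devices", "pids"}
	var trimmed []string
	bit := 1
	for _, n := range names {
		if mask&bit == 0 && supported&bit != 0 {
			trimmed = append(trimmed, n) // cg_trim() branch
		}
		bit <<= 1
	}
	return trimmed
}

func main() {
	// With the buggy mask, a unit that asks for "all" controllers
	// still gets its pids cgroup trimmed (deleted):
	fmt.Println(controllersTrimmed(63, maskAllBuggy)) // [pids]
	fmt.Println(controllersTrimmed(63, maskAllFixed)) // []
}
```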
Can this be handled on runc side?

Yes, here is the fix: wait on the channel till StartTransientUnit is Done.
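For context, the shape of that fix is roughly the following. This is a hedged, self-contained sketch of the wait-on-channel pattern only; startTransientUnit here is a hypothetical stand-in for the asynchronous D-Bus call (the real runc fix uses the go-systemd dbus API and its job-status channel):

```go
package main

import (
	"fmt"
	"time"
)

// startTransientUnit is a stand-in for the dbus call: it kicks off
// unit creation asynchronously and later reports the job result
// ("done", or a failure string) on ch.
func startTransientUnit(name string, ch chan<- string) {
	go func() {
		time.Sleep(10 * time.Millisecond) // simulate systemd doing the work
		ch <- "done"
	}()
}

// startAndWait applies the fix: do not proceed (i.e. do not create or
// join cgroup paths) until systemd signals that the job is Done, so
// runc's mkdir can no longer race with systemd's own cgroup setup.
func startAndWait(name string) error {
	ch := make(chan string, 1)
	startTransientUnit(name, ch)
	select {
	case s := <-ch:
		if s != "done" {
			return fmt.Errorf("unit %s: unexpected job result %q", name, s)
		}
		return nil
	case <-time.After(time.Second):
		return fmt.Errorf("unit %s: timed out waiting for systemd", name)
	}
}

func main() {
	fmt.Println(startAndWait("docker-demo.scope")) // <nil>
}
```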
@vikaschoudhary16 nice write-up, and I'm glad I was on the right lines :) Have you notified the systemd folks about the needed _CGROUP_CONTROLLER_MASK_ALL = 63 change?
@vikaschoudhary16 -- great job isolating this.
@jim-minter not yet; I will talk to @lnykrn (Lukáš Nykrýn) today.
Reported : https://github.com/lnykryn/systemd-rhel/issues/180
The fix got merged in both systemd and runc:
https://github.com/lnykryn/systemd-rhel/pull/181
https://github.com/opencontainers/runc/pull/1683