Origin: process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/.../cgroup.procs: no such file or directory

Created on 8 Sep 2017 · 14 comments · Source: openshift/origin

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/16232/test_pull_request_origin_extended_conformance_gce/7498/

Reason:ContainerCannotRun,Message:invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:258: applying cgroup configuration for process caused \\"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod86daabc9_9486_11e7_a2b8_42010a800005.slice/docker-ca877fc648dcf466185be3e66b95e774f4589f866620920615c40c12bfe86f2a.scope/cgroup.procs: no such file or directory\\"\"n"

component/kubernetes kind/test-flake priority/P1

Most helpful comment

How I debugged a race in systemd

Context/Background

There are several open bugs, across several repos, which all describe a rarely occurring container creation failure: containers fail to join the pids cgroup controller.

process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/.../cgroup.procs: no such file or directory


Where is the race in code:

  • T0: runc triggers a systemd unit creation asynchronously, from here.
  • T1: runc then moves ahead and starts creating the cgroup paths (the .scope directories), here. The kernel creates the .scope directory and, atomically, the cgroup.procs file (along with the other default files) inside it.
  • T3: the systemd thread invoked at T0 is still in the middle of unit creation. systemd also tries to create the cgroup paths and, from here in the code, deletes the .scope directory that runc created at T1. Here is the related call trace from systemd:
(gdb) bt
#0  cg_trim (controller=<optimized out>, path=<optimized out>, delete_root=<optimized out>) at src/shared/cgroup-util.c:575
#1  0x0000564b1c6b85b0 in cg_create_everywhere (path=0x564b1caefc90 "/libcontainer_54684_systemd_test_default.slice", mask=_CGROUP_CONTROLLER_MASK_ALL, supported=63)
    at src/shared/cgroup-util.c:1637
#2  unit_create_cgroups (mask=_CGROUP_CONTROLLER_MASK_ALL, u=0x564b1c9e4e70) at src/core/cgroup.c:679
#3  unit_realize_cgroup_now.26298 (u=u@entry=0x564b1c9e4e70, state=MANAGER_RUNNING) at src/core/cgroup.c:751
#4  0x0000564b1c6b8a89 in unit_realize_cgroup (u=<optimized out>) at src/core/cgroup.c:853
#5  0x0000564b1c688e9e in slice_start.65343 (u=<optimized out>) at src/core/slice.c:187
#6  0x0000564b1c69dd9c in unit_start (u=<optimized out>) at src/core/unit.c:1486
#7  job_run_and_invalidate (j=<optimized out>) at src/core/job.c:532
#8  manager_dispatch_run_queue.4007 (source=<optimized out>, userdata=0x564b1c9d3e10) at src/core/manager.c:1539
#9  0x0000564b1c6cb032 in source_dispatch (s=s@entry=0x564b1c9d27e0) at src/libsystemd/sd-event/sd-event.c:2155
#10 0x0000564b1c6cb2aa in sd_event_dispatch (e=0x564b1c9d47e0) at src/libsystemd/sd-event/sd-event.c:2472
#11 0x0000564b1c6704d7 in manager_loop (m=0x564b1c9d3e10) at src/core/manager.c:2212
#12 main (argc=4, argv=0x7fff036998b8) at src/core/main.c:1801
  • T4: runc, unaware that systemd has already deleted the .scope directory it created, tries to write the container PID to cgroup.procs, which makes container creation fail with the error message shown at the beginning.

Debugging Steps:

To fix an issue, or to verify the effect of a probable fix, one must first be able to reproduce it.
This issue was not easily reproducible. I built runc from source and used runc --systemd-cgroup create <container-id> in a Go script, test-runc.go, to launch ~200 containers in parallel using goroutines.

package main

import (
    "bytes"
    "fmt"
    "os"
    "os/exec"
    "strconv"
)

func main() {
    ch := make(chan int, 200)
    for i := 100; i < 300; i++ {
        go run(i, ch)
    }
    for i := 100; i < 300; i++ {
        fmt.Printf("waiting on channel, i = %v\n", i)
        j := <-ch
        fmt.Printf("received on channel %v\n", j)
    }
    close(ch)
    fmt.Printf("All created\n")
}

func run(i int, ch chan int) {
    var stderr bytes.Buffer
    id := strconv.Itoa(i)
    cmd := exec.Command("runc", "--systemd-cgroup", "create", id)
    cmd.Dir = "/mycontainer"
    cmd.Stderr = &stderr
    if err := cmd.Run(); err != nil {
        fmt.Println(fmt.Sprint(err) + ": " + stderr.String())
        fmt.Printf("error on channel, i = %v\n", i)
        os.Exit(-1)
    }
    fmt.Printf("writing to channel, i = %v\n", i)
    ch <- i // signal completion only after the container is created
}

This didn't work out. Every time, the script created all 200 containers successfully. I ensured that my systemd, runc, and kernel versions matched those in the reported issues. Beyond 200 containers, creation started failing with a too many open files error. After increasing the system's default limits, the script created 500 containers successfully, but the original issue still did not reproduce. What next?
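Raising the open-file limit for the test shell can be sketched as follows (the exact values needed on the test machine were not recorded; this just lifts the soft limit to the hard limit for the current session):

```shell
# show the current soft limit on open file descriptors
ulimit -n
# raise the soft limit up to the hard limit for this shell session
ulimit -n "$(ulimit -Hn)"
ulimit -n
```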
I did some analysis of the situation.

...
func join(c *configs.Cgroup, subsystem string, pid int) (string, error) {
...
350         if err := os.MkdirAll(path, 0755); err != nil {
351                 return "", err
352         }
353         if err := cgroups.WriteCgroupProc(path, pid); err != nil {
354                 return "", err
355         }
356         return path, nil
357 }
...

There could be only two possible reasons for runc not finding the cgroup.procs file at L#353:

  • Either L#350 returned after creating the .scope directory, but the kernel had not yet created the child files by the time runc execution reached L#353,
  • Or cgroup.procs was created atomically by the kernel, but some other process (most probably systemd) deleted the file (and the .scope directory) between the execution of L#350 and L#353.

I strongly suspected the second case, and some browsing of the kernel code in cgroup.c proved it true.
This means systemd must be deleting the file after the execution of L#350 and before the execution of L#353. Since these are almost consecutive lines, systemd gets a very small time window in which to delete the file, which makes the issue hard to reproduce.
Let's widen that window for systemd by adding a one-second sleep between L#350 and L#353. I made the following change in runc and reran test-runc.go:

...
350         if err := os.MkdirAll(path, 0755); err != nil {
351                 return "", err
352         }
353         if strings.Contains(path, "pids") {
354                 time.Sleep(1 * time.Second)
355         }
356         if err := cgroups.WriteCgroupProc(path, pid); err != nil {
...

And BANG!!! The issue was now reproducible on every run of test-runc.go, i.e., the race would hit for sure each time the script tried to create 200 containers.

strace on systemd:

Then I ran strace on PID 1 while executing test-runc.go. Comparing the syscalls made by systemd for a successful container create against the failure case further verified that systemd is in fact doing an rmdir on the scope unit. To be specific, there was a failed rmdir() on the pids cgroup scope in each successful case, while in the race-hit container case the rmdir() on the pids cgroup succeeded.
But how to find out from where, and why, this rmdir() syscall on the pids cgroup is invoked in the code?
To debug further and understand the cause, a systemd dev setup was needed, with an instrumented systemd, built from source, running.

Building and installing instrumented systemd:

  • First I tried systemd-nspawn. It did not serve the purpose: inside the launched container /sys/fs/cgroup was read-only and it was not possible to create any cgroups. Maybe I missed some configuration.
  • I tried make; make install on the systemd source inside a Fedora VM. The VM hung at reboot, I could not find the reason, and I dropped this approach as well.
  • Then I tried kvmtool to launch a VM with a filesystem prepared separately, with systemd installed:
$ yum -y --releasever=25 --nogpg --installroot=$HOME/distro-trees/fedora-24/ --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal
$ cd systemd; DESTDIR=$HOME/distro-trees/fedora-24 make install
$ cd ../kvmtool
$ ./lkvm run --disk /root/distro-trees/fedora-24 --kernel ../bzImage --network bridge --mem 8024

This approach seemed to be working. I set up runc, systemd, and test-runc.go in this VM. But the good times were still a bit far off: test-runc.go created as many as 500 containers each time, and the issue would not reproduce. When I verified through strace and systemd logs, I found that systemd, somehow, was not invoking the rmdir() syscall on the pids cgroup at all. Since on bare metal (BM) I had seen the connection between the race and the rmdir() on the pids cgroup, it was clear why the issue was not reproducing in the kvmtool VM.
While I was still trying to understand the cause of the missing rmdir and the difference in behavior from BM, a teammate suggested using an rpm-based systemd installation on BM.

  • Finally I used rhpkg and rpmbuild to build and install a patched systemd. It worked like a charm. I finally had the setup I wanted for tapping into systemd.

Why does systemd invoke rmdir() on the pids cgroup?

There is a patch in systemd-219 (as generated from the sources fetched using rhpkg), 0475-core-add-support-for-the-pids-cgroup-controller.patch. This patch adds CGROUP_PIDS to CGroupControllerMask (along with some other changes) but does not update _CGROUP_CONTROLLER_MASK_ALL from 31 to 63. This mistake makes the check at L#1628 fail for pids, and therefore cg_trim() is always invoked for pids.

Here is the relevant code:
```c
/* src/shared/cgroup-util.c */
1604 static const char mask_names[] =
1605         "cpu\0"
1606         "cpuacct\0"
1607         "blkio\0"
1608         "memory\0"
1609         "devices\0"
1610         "pids\0";
1611
1612 int cg_create_everywhere(CGroupControllerMask supported, CGroupControllerMask mask, const char *path) {
1613         CGroupControllerMask bit = 1;
1614         const char *n;
1615         int r;
1616
1617         /* This one will create a cgroup in our private tree, but also
1618          * duplicate it in the trees specified in mask, and remove it
1619          * in all others */
1620
1621         /* First create the cgroup in our own hierarchy. */
1622         r = cg_create(SYSTEMD_CGROUP_CONTROLLER, path);
1623         if (r < 0)
1624                 return r;
1625
1626         /* Then, do the same in the other hierarchies */
1627         NULSTR_FOREACH(n, mask_names) {
1628                 if (mask & bit)
1629                         cg_create(n, path);
1630                 else if (supported & bit)
1631                         cg_trim(n, path, true);
1632
1633                 bit <<= 1;
1634         }
1635
1636         return 0;
1637 }

/* src/shared/cgroup-util.h */
32 typedef enum CGroupControllerMask {
33         CGROUP_CPU = 1,
34         CGROUP_CPUACCT = 2,
35         CGROUP_BLKIO = 4,
36         CGROUP_MEMORY = 8,
37         CGROUP_DEVICE = 16,
38         CGROUP_PIDS = 32,
39         _CGROUP_CONTROLLER_MASK_ALL = 31
40 } CGroupControllerMask;
```
At L39, the value should be 63, so that the mask includes CGROUP_PIDS (31 | 32 = 63).

Can this be handled on the runc side?

Yes, here is the fix: wait on the completion channel until the StartTransientUnit job is done.

All 14 comments

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/test_branch_origin_extended_conformance_gce/1757/

Since this could be a runtime regression I'd like this triaged and investigated. I assume it could be just a poorly handled error but that's no excuse.

@kargakis @jwhonce @bparees @mfojtik @ewolinetz @simo5 @pweil- @tnozicka @sjenning I have a suspicion that #16994, #16881, #16985, #16023 and this issue are all duplicates of each-other. If you think so, I suggest closing the others and renaming this issue process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/.../cgroup.procs: no such file or directory.

@vikaschoudhary16 can you take a look at this one please. I've dupped many issues to this one so it seems to happen pretty frequently.

@sjenning sure!!

we're experiencing the very same issue in CRI-O, it's either a runc or systemd bug :confused:

/cc @mrunalp

Here is the experimental PR, https://github.com/openshift/origin/pull/17600
I found a missing lock in runc. With this PR, I am verifying whether this missing lock is the reason for the flake. I will keep this PR open for a long time and keep rerunning CI tests to see if the same flake reappears on it.
Meanwhile, if the flake is observed often on other PRs but never on this experimental PR over a longer period, chances are high that this lock has really helped.

Corresponding PR on the runc: https://github.com/opencontainers/runc/pull/1668

Was this flake noticed recently? I have not been able to spot it in the past couple of days.

This issue is definitely not an Origin issue. Here it is reported on OSP:
https://bugzilla.redhat.com/show_bug.cgi?id=1523043


@vikaschoudhary16 nice write-up, and I'm glad I was on the right lines :) Have you notified the systemd folks about the needed _CGROUP_CONTROLLER_MASK_ALL = 63 change?

@vikaschoudhary16 -- great job isolating this.

@jim-minter not yet, today I will talk to @lnykrn (Lukáš Nykrýn)
