This is an RFE after talking with @mheon for a bit in IRC (thanks for that, sorry I kept you so late). In the shortest form I can think of, the enhancement would be: facilitate podman/conmon interacting with systemd in a way that provides console output for systemctl and journalctl. In bullet form:

- Create a "system" user with a /sbin/nologin shell
- Create a unit file (/etc/systemd/system/<unit>.service) that specifies that "system" user in User=
- systemctl start <unit>.service and be able to see the console output of the container
- journalctl -u <unit>.service and be able to see the historical console output of the container

My use case is that I want to use podman to run images that are essentially "system" services, but as a "user" because I want the rootless isolation. I've been consuming podman for a bit now (starting with 1.8.2) and am likely stuck on that version because in newer versions my approach gets broken: I lose all logging from the container. I have tried --log-driver=journald but have no idea how to find a hand-hold for the console output (what -u should I be looking for? It's not the unit name).
Here is an example with mattermost; under 1.8.2 this works how I'd like it to work (i.e. I'm getting console output). I'm doing some things that are different from what podman generate systemd offers, but that's because my explicit goal is to:

- get console output through systemd/journald rather than having to dig it out by hand (sudo -u <user> -h <home> podman logs <container-name>)

[root@vault ~]# systemctl cat podman-mattermost.service
# /etc/systemd/system/podman-mattermost.service
[Unit]
Description=Podman running mattermost
Wants=network.target
After=network-online.target
Requires=podman-mattermost-postgres.service
[Service]
WorkingDirectory=/app/gitlab
User=gitlab
Group=gitlab
Restart=no
ExecStartPre=/usr/bin/rm -f %T/%N.pid %T/%N.cid
ExecStartPre=/usr/bin/podman rm --ignore -f mattermost
ExecStart=/usr/bin/podman run --conmon-pidfile %T/%N.pid --cidfile %T/%N.cid --cgroups=no-conmon \
--name=mattermost \
--env-file /app/gitlab/mattermost/mattermost.env \
--publish 127.0.0.1:8065:8065 \
--security-opt label=disable \
--health-cmd=none \
--volume /app/gitlab/mattermost/data:/mattermost/data \
--volume /app/gitlab/mattermost/logs:/mattermost/logs \
--volume /app/gitlab/mattermost/config:/mattermost/config \
--volume /app/gitlab/mattermost/plugins:/mattermost/client/plugins \
docker.io/mattermost/mattermost-team-edition:release-5.24
ExecStop=/usr/bin/podman stop --ignore mattermost -t 30
ExecStopPost=/usr/bin/podman rm --ignore -f mattermost
ExecStopPost=/usr/bin/rm -f %T/%N.pid %T/%N.cid
KillMode=none
Type=simple
[Install]
WantedBy=multi-user.target default.target
[root@vault ~]# systemctl cat podman-mattermost-postgres.service
# /etc/systemd/system/podman-mattermost-postgres.service
[Unit]
Description=Podman running postgres for mattermost
Wants=network.target
After=network-online.target podman-mattermost.service
PartOf=podman-mattermost.service
[Service]
WorkingDirectory=/app/gitlab
User=gitlab
Group=gitlab
Restart=no
ExecStartPre=/usr/bin/rm -f %T/%N.pid %T/%N.cid
ExecStartPre=/usr/bin/podman rm --ignore -f postgres
ExecStart=/usr/bin/podman run --conmon-pidfile %T/%N.pid --cidfile %T/%N.cid --cgroups=no-conmon \
--name=postgres \
--env-file /app/gitlab/mattermost/postgres.env \
--net=container:mattermost \
--volume /app/gitlab/mattermost/postgres:/var/lib/postgresql/data:Z \
docker.io/postgres:12
ExecStop=/usr/bin/podman stop --ignore postgres -t 30
ExecStopPost=/usr/bin/podman rm --ignore -f postgres
ExecStopPost=/usr/bin/rm -f %T/%N.pid %T/%N.cid
KillMode=none
Type=simple
[Install]
WantedBy=multi-user.target default.target
With these units above I am able to:
- Run the containers attached so their output reaches systemd (Type=simple and lack of -d)
- See the console output with systemctl <unit> and journalctl -u <unit>
- Have podman clean up before and after each run (ExecPre and ExecStop)
- Make podman-mattermost.service require the podman-mattermost-postgres.service (Requires=)
- Ensure podman-mattermost-postgres.service will get a stop signal if I stop podman-mattermost.service (PartOf=); however, podman-mattermost.service closes out the networking namespace before podman-mattermost-postgres.service can finish up (I think), so it's not ideal... I'd be interested in suggestions.

Tagging @lsm5 as well, since I think for my use case I'm relegated to using 1.8.2 in F32 for the time being... so I am wondering if that is going away any time soon?
If I didn't hit it clearly: I did try to adopt 1.9.2. It requires a couple of things (but ultimately does not work well); #6084 has some more information as well:

- loginctl enable-linger on the "system" user
- the kind of settings podman generate systemd produces, like -d and Type=forking

Starting this up, you can only see console output from the container by doing sudo -u <user> -h <home> podman logs <containername>; systemctl/journalctl give you nothing.
The --log-driver=journald doesn't allow for anything better... because I can't figure out what the unit is to actually query logs from (I think it might be some composite of the container id?)... and when you do a sudo -u <user> -h <home> podman logs <containername> you get nothing.
you can get 1.8.2-2 from https://koji.fedoraproject.org/koji/buildinfo?buildID=1479547
I'll save it to my fedorapeople page as well and send you the URL later.
If you enable linger mode and there is already a user session running, is there any disadvantage in installing the .service file into ~/.config/systemd/user/?
@giuseppe for you to be able to do that you'd need a shell for that "system" account. Above I'm creating the user as root with a /sbin/nologin shell. To access the systemctl --user session you'd actually need to log in, or you'd need to set the XDG_RUNTIME_DIR variable... I think... (it could also be DBUS_SESSION_BUS_ADDRESS) like XDG_RUNTIME_DIR=/run/user/$UID systemctl --user status. It generally gets messy.
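For reference, a sketch of poking at such a session without a login shell; it assumes linger is enabled so /run/user/<uid> exists, and the account name is only a placeholder:

# Talk to the user manager of a nologin "system" account by pointing at its runtime dir
sudo -u mysvcuser env XDG_RUNTIME_DIR=/run/user/$(id -u mysvcuser) systemctl --user status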
Also, this suggestion doesn't address what I'm primarily asking for above: wanting console output from the running container to be seen by systemd/journald. Without being able to see that combination through systemctl and journalctl, you have an extremely hard time figuring out what is going on with the system (you have to look in multiple places to piece together the state of errors).
@vrothberg The core ask here (viewing logs for systemd-managed Podman) seems to be a pretty valid one - our current Type=forking approach does break this, and podman logs becomes very inconvenient when the services are running rootless and you have to sudo into each of them to get logs.
I was thinking that it ought to be possible for the journald log driver to write straight to the logs for the unit file if we know it, and we did add something similar for auto-update?
There was a very similar request by @lucab: https://github.com/coreos/fedora-coreos-docs/pull/75#issuecomment-633512357
I was also thinking about the log driver :+1:
@ashley-cui Could you look into the --log-driver changes?
@storrgie I've been pursuing similar things recently.
Do -d, and keep the forking.
Enable --log-driver journald
That alone should take care of all container logs showing up in journald, you just need to do
journalctl CONTAINER_NAME=mattermost
As conmon will be providing those keys - CONTAINER_ID and CONTAINER_NAME. I've been doing lots of testing; basically what I've been doing is: start a container, generate the output to journald, then use journalctl -n 10 to grab the last 10 lines and find a line it logged, tweaking for 20 or 30 lines or whatever it takes. Then journalctl -n 10 -o json-pretty or -o json to get the raw line and figure out what other metadata you have to work with.
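A small sketch of that inspection loop (the container name is just a placeholder; the field names are the ones described above):

# Dump the last few journal records with all structured fields visible
journalctl -n 10 -o json-pretty
# Then match on the conmon-provided fields
journalctl CONTAINER_NAME=mattermost
journalctl CONTAINER_ID=<short id taken from the JSON output>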
You could use CONTAINER_TAG too... i.e. add --log-opt tag=WhateverYouWant and find it with
journalctl CONTAINER_TAG=WhateverYouWant
If you want it to show under the unit, like I do, I do this:
--cgroup-parent=/system.slice/%n --cgroup-manager cgroupfs
Note, my container is root, not rootless, and the host is running Flatcar. My guess is you can get similar results by possibly tweaking the cgroup-parent. By putting the processes under the cgroup, systemd finds that they're associated with a unit - but I'd expect conmon being in the correct cgroup SHOULD be all you need.
The added benefit of running all the processes in the systemd service's cgroup is that bind mounted /dev/log ALSO associates to the unit file, automagically. You don't get the automagic CONTAINER_NAME from conmon journald records, but you DO get anything you put in the service file as a LogExtraField - so you could use that to find your logs as well.
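As a rough illustration of that, assuming the container's processes really are attributed to the unit (the MYAPP field name is purely made up):

[Service]
# journald attaches this extra field to every record it attributes to this unit,
# so it becomes another handle for finding the container's logs.
LogExtraFields=MYAPP=mattermost

and then query it with journalctl MYAPP=mattermost.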
I'm running rootless containers on Fedora Server. I'm able to see logs using --log-opt tag=<tag> and journalctl CONTAINER_TAG=<tag>. However, when I add --cgroup-parent=/system.slice/%n --cgroup-manager cgroupfs, my units fail with result 'exit-code'. @rhatdan, are they failing because they're rootless?
I really do not recommend running --cgroup-manager=cgroupfs with systemd-managed Podman - you end up with both systemd and Podman potentially altering the same cgroup, and I think there's the potential for them to trample each other. If you want to stay in the systemd cgroup, I'd recommend using the crun OCI runtime and passing --cgroups=disabled to prevent Podman from creating a container cgroup. We lose the ability to set resource limits, but you can just set them from within the systemd unit, so it's not a big loss.
(There is also --cgroups=no-conmon to only place Conmon in the systemd cgroup - we use that by default in unit files from podman generate systemd)
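A minimal sketch of that recommendation, assuming crun is installed at /usr/bin/crun and the resource limits move into the unit (the limit and names are illustrative):

[Service]
# systemd owns the cgroup and its limits; Podman skips creating a container cgroup.
MemoryMax=2G
ExecStart=/usr/bin/podman --runtime /usr/bin/crun run --cgroups=disabled \
    --name=mattermost docker.io/mattermost/mattermost-team-edition:release-5.24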
I see traffic on the mailing list from @rhatdan about an FAQ... I'm feeling more and more as I learn about this project that the idea this can "replace" docker is basically gimmicky at this stage. There is no clear golden pathway for running containers as daemons on systems with podman+systemd. It seems fraught with edge cases. I'd really love to see this ticket be taken seriously as I think there are a LOT of people trying to depart docker land and systemd+podman is a way to rid yourself of the docker monolithic daemon.
I think we definitely need a single page containing everything we recommend about running containers inside units (best practices, and the reasons for them). I've probably explained why we made the choice for forking vs simple five times at this point; having a single page with a definitive answer on that would be greatly helpful to everyone. We'll need to hash some things out as part of this, especially the use of rootless Podman + root systemd as this issue asks, but even getting the basics written down would be a start.
@mheon that would indeed help, but I'm not sure that's going to solve much. For example, from the thread at https://github.com/coreos/fedora-coreos-docs/pull/75, that content currently exists in the form of a blog post (https://www.redhat.com/sysadmin/podman-shareable-systemd-services) which unfortunately is:

- already stale at this point (podman-generate doesn't generate that unit anymore)
- not really integrating well with systemd service handling (e.g. journald, sd-notify, user setting, etc.)
- somehow concerning/fragile (e.g. KillMode=none)

I think it would be better to first devise a podman mode which works well when integrated in the systemd ecosystem, and only then document it.
As a sidenote, many containerized services (eg. etcd, haproxy, etc.) do use sd-notify in order to signal when they are actually initialized and ready to start serving requests. For that kind of autoscale-friendly logic to work, a Type=notify service unit would be required.
I believe the reason we can't auto-generate Type=notify is because things are not good if the app in the container does not support it (Podman can hang), but it should work if you set it (though I'm actually not sure if it respects our PID files - if it acts like Type=simple in that respect it will never be really safe to use).

On the rest, I think the most important thing is getting logging via Journald working properly. Some things like KillMode I do not expect to be resolved, and I honestly don't view it as a problem - our design here is different than typical services by necessity (running without a daemon forced this), so we don't quite fit into the usual pattern Systemd expects. Podman will still guarantee that things are cleaned up on stop, as we would if we are not managed by Systemd.
On the user setting specifically - I still believe that is an issue with Systemd. We're in contact with the Systemd team to try and find a solution.
We got the User setting working, it was mainly a problem with -d, no? Unless there's something else outstanding, I think that's solved. Similarly, the journald log-driver works well for me... unless you try to log a tty, which would be a bad idea anyway, now that exec is fixed.
Systemd integration isn't great with docker either - docker's log-driver is exactly analogous to what conmon does, docker's containers are launched by the daemon which puts them in another cgroup, unless you use cgroup-parent tricks, and sometimes getting the container to work right w.r.t. logging and groups requires hacks like systemd-docker which throws a hacky shim around sd-notify. So are we really saying podman+systemd is somehow worse? Or just not better? Because it seems better to me. Doesn't seem like Docker has a golden pathway either.
I've run docker w/ cgroup-parent sharing the unit's cgroup and systemd-docker (even though it's unsupported) for over a year, and haven't had any problems with systemd and docker fighting. I'm not sure why podman would... but I defer to the experts.
The only thing I have with docker now that I don't have with podman is bind mounting /dev/log works - because I put the docker container in the same cgroup as the unit. Without that, I'd need some sort of syslog proxy, which would probably have to live in conmon, and is a whole other discussion and probably only relevant to me.
@mheon that would indeed help, but I'm not sure that's going to solve much. For example, from the thread at coreos/fedora-coreos-docs#75, that content currently exists in the form of a blog post which unfortunately is:
* already stale at this point (podman-generate doesn't generate that unit anymore)
That's not accurate. We just updated the blog post last week and do that regularly. The units are still generated the same way. Once Podman v2 is out, we need to create some upstream docs as a living document and point the blog post there.
* not really integrating well with systemd service handling (e.g. journald, sd-notify, user setting, etc)
We only support Type=forking with podman generate systemd.
* somehow concerning/fragile (e.g. `KillMode=none`)
We've been discussing that already in depth. We want Podman to handle shutdown (and killing) and prevent signal races with systemd which does not know the order in which all processes should be killed.
I think it would be better to first devise a podman mode which works well when integrated in the systemd ecosystem, and only then document it.
As a sidenote, many containerized services (e.g. etcd, haproxy, etc.) do use sd-notify in order to signal when they are actually initialized and ready to start serving requests. For that kind of autoscale-friendly logic to work, a Type=notify service unit would be required.
Type=notify is supported but we don't generate them with podman generate systemd. I guess this could be part of an upstream doc?
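For what it's worth, a hand-written Type=notify unit might look roughly like the sketch below. It assumes the application inside the container actually sends READY=1 over the notify socket and that Podman passes NOTIFY_SOCKET through; as noted earlier in the thread, Podman can hang if the app never notifies. The name and image are placeholders.

[Service]
Type=notify
# Accept readiness notifications from any process in the unit's cgroup, not just the main PID.
NotifyAccess=all
ExecStart=/usr/bin/podman run --name=<name> <image-that-calls-sd_notify>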
I think we definitely need a single page containing everything we recommend about running containers inside units (best practices, and the reasons for them). I've probably explained why we made the choice for forking vs simple five times at this point; having a single page with a definitive answer on that would be greatly helpful to everyone. We'll need to hash some things out as part of this, especially the use of rootless Podman + root systemd as this issue asks, but even getting the basics written down would be a start.
I agree and made a similar conclusion last week when working with support on some issues. Once v2 is out (and all fixes are in), I'd love us to create a living upstream document that the blog post can link to.
I opened https://github.com/containers/libpod/issues/6604 to break out the logging discussion.
@vrothberg thanks! I shouldn't have piled up more topics in here, sorry for that.
If you prefer, I can split the other ones (e.g. sd-notify) to their own tickets, so they can be incrementally closed as soon as we are done.
No worries at all, @lucab! All input and feedback is much appreciated.
If you prefer, I can split the other ones (e.g. sd-notify) to their own tickets, so they can be incrementally closed as soon as we are done.
That would be great, sure. While we support sd-notify, we don't generate these types. Having a dedicated issue will help us agree on what such a unit should look like and eventually get that into upstream docs (and man pages). Thanks a lot!
Since we're having this discussion, and there's plenty of talk about KillMode, and cgroups, and where things should reside - it makes sense to me that podman's integration with systemd already has a blueprint: systemd-nspawn. The systemd-nspawn@.service unit includes things like:
KillMode=mixed
Delegate=yes
Slice=machine.slice
This means (among other things) you end up with
/machine.slice/unit.service/supervisor - which contains the systemd-nspawn ("conmon"-esque) process, and
/machine.slice/unit.service/payload - which contains the contained processes
And systemd has no problem monitoring the supervisor Pid, I'm guessing because Delegate is set, and it's a sub-cgroup.
nspawn has options like --slice, --property, --register, and --keep-unit - probably all of which should be implemented similarly in podman... and the caveats are already spelled out in the documentation.
https://www.freedesktop.org/software/systemd/man/systemd-nspawn.html
nspawn also has options for the journal - how it's bind mounted and supported, plus setting the machine ID properly for those logs... etc.
I'd imagine we'd want nspawn to be the template?
And doing Delegate and sub-cgroups like that also means systemctl status knows the Main PID is the supervisor, but shows the full process tree including the payload clearly in the status output, and the service type is sd-notify, so I imagine it's talking back to systemd to let it know these things.
For that matter I've wondered if it's possible to use/wrap/hack/mangle something into place to allow systemd-nspawn itself to be the OCI container runtime, instead of crun or runc. Moreso a thought experiment than anything else, but the key hangup seems to be nspawn wants a specific mount to use, which podman can provide since it already did all the work to create the appropriate overlay bindmount.
Probably involves reading config.json and turning it into command line arguments? I'm unclear separation-wise which parts of the above fit into which parts of the execution lifecycle.
There was talk about making nspawn accept OCI specs, even that may not be necessary. I don't know how well it would interface with Conmon though.
On the Delegate change - I'd have to think more about what this means for containers which forward host cgroups into the container (we'll need a way to guarantee that the entire unit cgroup isn't forwarded). I also think we'll need to ensure that the container remembers it was started with cgroupfs, so that other Podman commands launched from outside the unit file that require cgroups (e.g. podman stats) still work.
to simulate what nspawn does we'd need to tell the OCI runtime to use the cgroup already created by conmon instead of creating a new one.
Next crun version will automatically create a /container subcgroup in the same way nspawn does.
I think we can go a step further and get closer to what nspawn does by having a single cgroup for conmon+container payload
PoC implementation: https://github.com/containers/libpod/pull/6666
@giuseppe, do we need to additionally set delegation in the units?
yes, we need to add Delegate=true under [Service] so podman is able to manage the cgroup. For rootless it should happen only with cgroup v2.
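Based on that, a minimal sketch of a root unit using the split mode from the PoC (image and name are illustrative):

[Service]
# Delegate the unit's cgroup subtree so podman/conmon can create the
# supervisor and container sub-cgroups inside it.
Delegate=true
ExecStart=/usr/bin/podman run --cgroups=split --name=mattermost \
    docker.io/mattermost/mattermost-team-edition:release-5.24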
I am trying to use this rootless on Fedora CoreOS which runs cgroups v1 and I am getting
Error: mkdir /sys/fs/cgroup/pids/system.slice/elasticsearch.service/supervisor: permission denied
Here is my unit:
[Unit]
Description=Elasticsearch Service
Wants=network.target
After=network-online.target
After=mycool-pod.service
[Service]
Delegate=true
User=mycool
Group=mycool
Environment=PODMAN_SYSTEMD_UNIT=%n
Restart=on-failure
ExecStartPre=-/usr/local/bin/podman pull elasticsearch:7.5.2
ExecStartPre=-/usr/local/bin/podman volume create %N
ExecStart=/usr/local/bin/podman run --replace --rm -d --log-driver=journald --log-opt tag="{{.ImageName}}" --pod mycool-pod --name %N -e ES_JAVA_OPTS="-Xms512m -Xmx512m" --conmon /usr/local/bin/conmon --cgroups=split --conmon-pidfile=%T/%N.pid --env-file /opt/mycool/envs/elasticsearch.env --volume %N:/usr/share/elasticsearch/data:Z elasticsearch:7.5.2
ExecStop=/usr/local/bin/podman stop -t 10 %N
ExecStopPost=/usr/local/bin/podman stop -t 10 %N
PIDFile=%T/%N.pid
KillMode=none
Type=forking
SyslogIdentifier=%N
[Install]
WantedBy=multi-user.target default.target
Will this only work with rootless on cgroups v2?
I don't think this helps much for cgroups v1 + rootless.
1) Delegate=true only gives ownership to the user for "unified" and "systemd" controllers.
2) Podman could be modified to skip pids, cpu, etc and only do the systemd ones for the supervisor group... but.
3) Rootless podman will not tell runc or crun to move the cgroup... because runc and crun can only receive a single cgroup path, and systemd will spawn this process with various cgroups (devices will be in /, others will be in /user.slice, etc). Even if passed it seems like runc throws it away when it can't set the cpuset group, and crun just ignores it.
So you end up with your entire container and conmon running in %N.service/supervisor, which is no different than just using cgroups enabled - everything will just be in %N.service
split does work with root containers - mainly because all containers can be delegated at that point.
I just rebooted Fedora CoreOS into cgroups v2 and I am seeing this now when setting --cgroups=split
Jun 29 21:32:21 mycool mycool-elasticsearch[2765]: Error: cannot set limits without cgroups: OCI runtime error
Jun 29 21:31:39 mycool systemd[1]: Starting Forem 12345 Elasticsearch Service...
Jun 29 21:31:40 mycool mycool-elasticsearch[1754]: Trying to pull registry.fedoraproject.org/elasticsearch:7.5.2...
Jun 29 21:31:40 mycool mycool-elasticsearch[1754]: manifest unknown: manifest unknown
Jun 29 21:31:40 mycool mycool-elasticsearch[1754]: Trying to pull registry.access.redhat.com/elasticsearch:7.5.2...
Jun 29 21:31:40 mycool mycool-elasticsearch[1754]: name unknown: Repo not found
Jun 29 21:31:40 mycool mycool-elasticsearch[1754]: Trying to pull registry.centos.org/elasticsearch:7.5.2...
Jun 29 21:31:40 mycool mycool-elasticsearch[1754]: manifest unknown: manifest unknown
Jun 29 21:31:40 mycool mycool-elasticsearch[1754]: Trying to pull docker.io/library/elasticsearch:7.5.2...
Jun 29 21:31:42 mycool mycool-elasticsearch[1754]: Getting image source signatures
Jun 29 21:31:42 mycool mycool-elasticsearch[1754]: Copying blob sha256:c82eff1e95f223957666595df82e112d158b37b577d3e3525bdd58890d3ffb0a
Jun 29 21:31:42 mycool mycool-elasticsearch[1754]: Copying blob sha256:63248d573ce9f12efb8d5de9d49e8b7beb5ce9c2b4ed1f2bd8c43fa123ec4781
Jun 29 21:31:42 mycool mycool-elasticsearch[1754]: Copying blob sha256:ab5ef0e5819490abe86106fd9f4381123e37a03e80e650be39f7938d30ecb530
Jun 29 21:31:42 mycool mycool-elasticsearch[1754]: Copying blob sha256:ac819c75e084c8a2b60fec278e7b0b4109aad3f68b4c549566dc99bd51e4ccca
Jun 29 21:31:42 mycool mycool-elasticsearch[1754]: Copying blob sha256:cca059a702d34723ea312d8b4fd3ab4943eb36cbec50252b443a25fc7c1683a7
Jun 29 21:31:42 mycool mycool-elasticsearch[1754]: Copying blob sha256:4a32d65abda10235eac68cfba8dc027d034247ebd09b40beefef9e7574750ec2
Jun 29 21:31:43 mycool mycool-elasticsearch[1754]: Copying blob sha256:6ce84b7d8f2193b31b410756c95cac55a575e133757dce774a9398007b78727d
Jun 29 21:32:01 mycool mycool-elasticsearch[1754]: Copying config sha256:929d271f17988709f8e34bc2e907265f6dc9fc5742326349e0ad808bb213f97a
Jun 29 21:32:01 mycool mycool-elasticsearch[1754]: Writing manifest to image destination
Jun 29 21:32:01 mycool mycool-elasticsearch[1754]: Storing signatures
Jun 29 21:32:20 mycool mycool-elasticsearch[1754]: 929d271f17988709f8e34bc2e907265f6dc9fc5742326349e0ad808bb213f97a
Jun 29 21:32:20 mycool mycool-elasticsearch[2725]: mycool-elasticsearch
Jun 29 21:32:21 mycool mycool-elasticsearch[2765]: Error: cannot set limits without cgroups: OCI runtime error
Jun 29 21:32:21 mycool systemd[1]: mycool-elasticsearch.service: Control process exited, code=exited, status=126/n/a
Jun 29 21:32:21 mycool mycool-elasticsearch[2914]: Error: container d07027fe16dd460d2409c1153ea0b3d4fa614aa9b7f6d8f05cc918585ebafefe does not exist in database: no such container
Jun 29 21:32:21 mycool systemd[1]: mycool-elasticsearch.service: Control process exited, code=exited, status=125/n/a
Jun 29 21:32:21 mycool systemd[1]: mycool-elasticsearch.service: Failed with result 'exit-code'.
Jun 29 21:32:21 mycool systemd[1]: Failed to start My Cool Elasticsearch Service.
Jun 29 21:32:21 mycool systemd[1]: mycool-elasticsearch.service: Consumed 27.470s CPU time.
Jun 29 21:32:21 mycool systemd[1]: mycool-elasticsearch.service: Scheduled restart job, restart counter is at 1.
$ rpm -qa runc crun
runc-1.0.0-144.dev.gite6555cc.fc32.x86_64
crun-0.13-2.fc32.x86_64
$ /usr/local/bin/podman --version
podman version 2.1.0-dev
$ /usr/local/bin/conmon --version
conmon version 2.0.19-dev
commit: ab8f5e5a9b808f7ab3c2098eeada04795914a161
$ cat /etc/os-release
NAME=Fedora
VERSION="32.20200625.1.0 (CoreOS)"
ID=fedora
VERSION_ID=32
VERSION_CODENAME=""
PLATFORM_ID="platform:f32"
PRETTY_NAME="Fedora CoreOS 32.20200625.1.0"
ANSI_COLOR="0;34"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:32"
HOME_URL="https://getfedora.org/coreos/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=32
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=32
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="CoreOS"
VARIANT_ID=coreos
OSTREE_VERSION='32.20200625.1.0'
Here is my fcct config for Ignition:
variant: fcos
version: 1.0.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - ssh-ed25519 snip
    - name: mycool
      system: false
storage:
  directories:
    - path: /opt/mycool/pids
      mode: 0755
      user:
        name: mycool
      group:
        name: mycool
    - path: /opt/mycool/envs
      mode: 0750
      user:
        name: mycool
      group:
        name: mycool
    - path: /opt/mycool/configs
      mode: 0750
      user:
        name: mycool
      group:
        name: mycool
    - path: /opt/mycool/tmp
      mode: 0755
      user:
        name: mycool
      group:
        name: mycool
  files:
    - path: /etc/systemd/system.conf.d/accounting.conf
      mode: 0644
      contents:
        inline: |
          [Manager]
          DefaultCPUAccounting=yes
          DefaultMemoryAccounting=yes
          DefaultBlockIOAccounting=yes
    - path: /etc/sysctl.d/max-user-watches.conf
      mode: 0644
      contents:
        inline: |
          fs.inotify.max_user_watches=16184
    - path: /etc/zincati/config.d/55-updates-strategy.toml
      mode: 0644
      contents:
        inline: |
          [updates]
          strategy = "fleet_lock"
          [updates.fleet_lock]
          base_url = "https://updates.forem.com/"
    - path: /etc/zincati/config.d/90-disable-auto-updates.toml
      mode: 0644
      contents:
        inline: |
          [updates]
          enabled = false
    - path: /etc/hostname
      mode: 0644
      contents:
        inline: |
          mycool
    - path: /usr/local/bin/podman
      contents:
        source: https://joedoss.com/downloads/podman.gz
        compression: gzip
        verification:
          hash: sha512-edcca442e664c64eef694b19beac69ad88efd19916494a1845021d489715df7890a677de302357b4422ec2db623beddcf446b304c76a180b74cafad2c4c55fa0
      mode: 0555
    - path: /usr/local/bin/conmon
      contents:
        source: https://joedoss.com/downloads/conmon.gz
        compression: gzip
        verification:
          hash: sha512-b85087042347de5fe417266ce4300d23475cc7d9a089c87d4337830c52cbbe434899659b471356f27f285bb37f17f411244fe2aa9d3b7bb27b05d123fc35bdc3
      mode: 0555
    - path: /opt/mycool/envs/elasticsearch.env
      mode: 0640
      user:
        name: mycool
      group:
        name: mycool
      contents:
        inline: |
          discovery.type=single-node
          cluster.name=forem
          bootstrap.memory_lock=true
          discovery.type=single-node
          xpack.security.enabled=false
          xpack.monitoring.enabled=false
          xpack.graph.enabled=false
          xpack.watcher.enabled=false
systemd:
  units:
    - name: enable-cgroups-v2.service
      enabled: true
      contents: |
        [Unit]
        Description=Enable cgroups v2 (systemd.unified_cgroup_hierarchy=0)
        ConditionFirstBoot=true
        Wants=basic.target
        Before=multi-user.target mycool-pod.service
        [Service]
        Type=oneshot
        ExecStart=/usr/bin/rpm-ostree kargs --delete systemd.unified_cgroup_hierarchy=0 --reboot
        ExecStartPost=/usr/bin/sleep infinity
        [Install]
        WantedBy=basic.target
    - name: mycool-pod.service
      enabled: true
      contents: |
        [Unit]
        Description=My Cool pod service
        Wants=network.target
        After=network-online.target
        Before=mycool-elasticsearch.service
        [Service]
        User=mycool
        Group=mycool
        Environment=PODMAN_SYSTEMD_UNIT=%n
        Restart=on-failure
        ExecStartPre=-/usr/local/bin/podman pod create --conmon /usr/local/bin/conmon --infra-conmon-pidfile %T/%N.pid --name %N -p 443:443 -p 80:80 -p 9090:9090
        ExecStart=/usr/local/bin/podman pod start %N
        ExecStop=/usr/local/bin/podman pod stop -t 10 %N
        ExecStopPost=/usr/local/bin/podman pod stop -t 10 %N
        PIDFile=%T/%N.pid
        KillMode=none
        Type=forking
        SyslogIdentifier=%N
        [Install]
        WantedBy=multi-user.target default.target
    - name: mycool-elasticsearch.service
      enabled: true
      contents: |
        [Unit]
        Description=My Cool Elasticsearch Service
        Wants=network.target
        After=network-online.target
        After=mycool-pod.service
        [Service]
        Delegate=true
        User=mycool
        Group=mycool
        Environment=PODMAN_SYSTEMD_UNIT=%n
        Restart=on-failure
        ExecStartPre=-/usr/local/bin/podman pull elasticsearch:7.5.2
        ExecStartPre=-/usr/local/bin/podman volume create %N
        ExecStart=/usr/local/bin/podman run --replace --rm -d --log-driver=journald --log-opt tag="{{.ImageName}}" --pod mycool-pod --name %N -e ES_JAVA_OPTS="-Xms512m -Xmx512m" --conmon /usr/local/bin/conmon --cgroups=split --conmon-pidfile=%T/%N.pid --env-file /opt/mycool/envs/elasticsearch.env --volume %N:/usr/share/elasticsearch/data:Z elasticsearch:7.5.2
        ExecStop=/usr/local/bin/podman stop -t 10 %N
        ExecStopPost=/usr/local/bin/podman stop -t 10 %N
        PIDFile=%T/%N.pid
        KillMode=none
        Type=forking
        SyslogIdentifier=%N
        [Install]
        WantedBy=multi-user.target default.target
Some super sweet podman run --log-level debug logs below:
Jun 30 00:13:22 mycool mycool-elasticsearch[33637]: time="2020-06-30T00:13:22Z" level=debug msg="running conmon: /usr/local/bin/conmon" args="[--api-version 1 -c 1221b4fc5b0884db2b86641b33c59723e10e035222ba9739f47c9a25b5433657 -u 1221b4fc5b0884db2b86641b33c59723e10e035222ba9739f47c9a25b5433657 -r /usr/bin/crun -b /var/home/forem-12345/.local/share/containers/storage/overlay-containers/1221b4fc5b0884db2b86641b33c59723e10e035222ba9739f47c9a25b5433657/userdata -p /tmp/run-1001/containers/overlay-containers/1221b4fc5b0884db2b86641b33c59723e10e035222ba9739f47c9a25b5433657/userdata/pidfile -n mycool-elasticsearch --exit-dir /tmp/run-1001/libpod/tmp/exits --socket-dir-path /tmp/run-1001/libpod/tmp/socket -l journald --log-level debug --syslog --log-tag docker.io/library/elasticsearch:7.5.2 --conmon-pidfile /tmp/mycool-elasticsearch.pid --exit-command /var/usrlocal/bin/podman --exit-command-arg --root --exit-command-arg /var/home/forem-12345/.local/share/containers/storage --exit-command-arg --runroot --exit-command-arg /tmp/run-1001/containers --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /tmp/run-1001/libpod/tmp --exit-command-arg --runtime --exit-command-arg /usr/bin/crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mount_program=/usr/bin/fuse-overlayfs --exit-command-arg --events-backend --exit-command-arg file --exit-command-arg --syslog --exit-command-arg true --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --rm --exit-command-arg 1221b4fc5b0884db2b86641b33c59723e10e035222ba9739f47c9a25b5433657]"
Jun 30 00:13:22 mycool mycool-elasticsearch[33649]: [conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied
Jun 30 00:13:22 mycool conmon[33649]: conmon 1221b4fc5b0884db2b86 <ndebug>: failed to write to /proc/self/oom_score_adj: Permission denied
Jun 30 00:13:22 mycool conmon[33650]: conmon 1221b4fc5b0884db2b86 <ninfo>: attach sock path: /tmp/run-1001/libpod/tmp/socket/1221b4fc5b0884db2b86641b33c59723e10e035222ba9739f47c9a25b5433657/attach
Jun 30 00:13:22 mycool conmon[33650]: conmon 1221b4fc5b0884db2b86 <ninfo>: addr{sun_family=AF_UNIX, sun_path=/tmp/run-1001/libpod/tmp/socket/1221b4fc5b0884db2b86641b33c59723e10e035222ba9739f47c9a25b5433657/attach}
Jun 30 00:13:22 mycool conmon[33650]: conmon 1221b4fc5b0884db2b86 <ninfo>: terminal_ctrl_fd: 13
Jun 30 00:13:22 mycool conmon[33650]: conmon 1221b4fc5b0884db2b86 <ninfo>: winsz read side: 15, winsz write side: 15
Jun 30 00:13:22 mycool conmon[33651]: conmon 1221b4fc5b0884db2b86 <nwarn>: Failed to chown stdin
Jun 30 00:13:22 mycool conmon[33650]: conmon 1221b4fc5b0884db2b86 <error>: Failed to create container: exit status 1
Jun 30 00:13:22 mycool mycool-elasticsearch[33637]: time="2020-06-30T00:13:22Z" level=debug msg="Received: -1"
Jun 30 00:13:22 mycool mycool-elasticsearch[33637]: time="2020-06-30T00:13:22Z" level=debug msg="Cleaning up container 1221b4fc5b0884db2b86641b33c59723e10e035222ba9739f47c9a25b5433657"
Jun 30 00:13:22 mycool mycool-elasticsearch[33637]: time="2020-06-30T00:13:22Z" level=debug msg="unmounted container \"1221b4fc5b0884db2b86641b33c59723e10e035222ba9739f47c9a25b5433657\""
Jun 30 00:13:22 mycool mycool-elasticsearch[33637]: time="2020-06-30T00:13:22Z" level=debug msg="ExitCode msg: \"cannot set limits without cgroups: oci runtime error\""
Jun 30 00:13:22 mycool mycool-elasticsearch[33637]: Error: cannot set limits without cgroups: OCI runtime error
Jun 30 00:13:22 mycool systemd[1]: mycool-elasticsearch.service: Control process exited, code=exited, status=126/n/a
Jun 29 21:32:21 mycool mycool-elasticsearch[2765]: Error: cannot set limits without cgroups: OCI runtime error
That is a problem in crun that is fixed upstream. I am going to cut a new release in the next few days.
@giuseppe awesome! I just RTFMed a bit but I couldn't find any info on setting podman to use a different crun binary. Is there a flag I can set to call a different binary until this fix gets merged into upstream crun and eventually into the FCOS next stream?
@jdoss you could specify it on the command line like podman --runtime /path/to/the/other/executable/crun .. or you can override its path from the containers.conf file
@jdoss If you're using selinux, I suggest you compile and place the crun binary in /usr/local/bin, as that folder is recognized in the policy. If you're going to have a local podman or runc or crun it should be there, and chcon'd to match, i.e. chcon --reference=/usr/bin/crun /usr/local/bin/crun
In /etc/containers/containers.conf:
runtime = "crun"
[engine.runtimes]
crun = [ "/usr/local/bin/crun" ]
Or specify it on the command line as @giuseppe indicated.
@goochjj and @giuseppe I just compiled crun from master and put it in /usr/local/bin/crun and it's still getting the same error:
# /usr/local/bin/crun --version
crun version 0.13.227-d38b
commit: d38b8c28fc50a14978a27fa6afc69a55bfdd2c11
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
Jun 30 15:48:12 mycool mycool-elasticsearch[65963]: [conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied
Jun 30 15:48:12 mycool conmon[65963]: conmon c0cf8da55a1936150298 <ndebug>: failed to write to /proc/self/oom_score_adj: Permission denied
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <ninfo>: attach sock path: /tmp/run-1001/libpod/tmp/socket/c0cf8da55a1936150298fdf9608ad04154a9de89c774c0fc93e2809b489a97b7/attach
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <ninfo>: addr{sun_family=AF_UNIX, sun_path=/tmp/run-1001/libpod/tmp/socket/c0cf8da55a1936150298fdf9608ad04154a9de89c774c0fc93e2809b489a97b7/attach}
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <ninfo>: terminal_ctrl_fd: 13
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <ninfo>: winsz read side: 15, winsz write side: 15
Jun 30 15:48:12 mycool conmon[65965]: conmon c0cf8da55a1936150298 <nwarn>: Failed to chown stdin
Jun 30 15:48:12 mycool conmon[65964]: conmon c0cf8da55a1936150298 <error>: Failed to create container: exit status 1
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: time="2020-06-30T15:48:12Z" level=debug msg="Received: -1"
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: time="2020-06-30T15:48:12Z" level=debug msg="Cleaning up container c0cf8da55a1936150298fdf9608ad04154a9de89c774c0fc93e2809b489a97b7"
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: time="2020-06-30T15:48:12Z" level=debug msg="unmounted container \"c0cf8da55a1936150298fdf9608ad04154a9de89c774c0fc93e2809b489a97b7\""
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: time="2020-06-30T15:48:12Z" level=debug msg="ExitCode msg: \"cannot set limits without cgroups: oci runtime error\""
Jun 30 15:48:12 mycool mycool-elasticsearch[65952]: Error: cannot set limits without cgroups: OCI runtime error
Jun 30 15:48:12 mycool systemd[1]: mycool-elasticsearch.service: Control process exited, code=exited, status=126/n/a
Add --pids-limit 0 to your run args
Wait you're cgroups v2 now? I don't have that problem under cgroups v2 rootless. What does cat /proc/self/cgroup show?
--pids-limit 0 does let the containers start, but yea, I booted FCOS into cgroups v2 with rootless here. I have a non-root user mycool that is being used via systemd to launch these containers.
[core@mycool ~]$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope
I can't get the infra container to start, because you're binding to ports 80 and 443 as non-root...
setting /proc/sys/net/ipv4/ip_unprivileged_port_start
Hmm and there it is
If I remove your --pod it works
- path: /etc/sysctl.d/90-ip-unprivileged-port-start.conf
  mode: 0644
  contents:
    inline: |
      net.ipv4.ip_unprivileged_port_start = 0
To allow the pod to bind to those ports.
I think it's because you're using a pod.
When I run this as the user, rootless, I get this:
Pod creates:
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-(infracid).scope/container
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-conmon-(infracid).scope
Container (without split) creates:
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-(escid).scope/container
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-conmon-(escid).scope
Through Systemd as the user, I get this:
Pod creates:
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-(infracid).scope/container
/system.slice/mycool-pod.service
Container (without split) creates:
/user.slice/user-(uid).slice/user@(uid).service/user.slice/user-libpod_pod_(podid).slice/libpod-(escid).scope/container
/system.slice/mycool-elasticsearch.service
TLDR, @giuseppe would have to modify/extend another PR to handle pods.
It looks like when a container is spawned in a pod, it assumes its parent slice will be the parent cgroup path (which is reasonable). Since pod create doesn't have a --cgroups split option, the pod's conmon is attached to the service cgroup, and the pod's slice is in the user slice, divorced from the service's cgroup.
You can't simultaneously have a service (i.e. elasticsearch) be part of the unit's service, and also the pod's slice. Nor can you have a second systemd unit muck around with the pod's cgroup - that's probably a bad idea.
What's your desired outcome here, @jdoss?
/system.slice/mycool-pod.service/supervisor -> pod conmon
/system.slice/mycool-pod.service/container -> infra container
/system.slice/mycool-elasticsearch.service/supervisor -> conmon
/system.slice/mycool-elasticsearch.service/container -> ES processes
Then ALL the pod services aren't contained in a slice.
Right now it's
/system.slice/mycool-pod.service -> pod conmon
/(user's systemd service)/user.slice/user-libpod_pod_(podid).slice/libpod-(cid).scope/container -> infra procs
/system.slice/mycool-elasticsearch.service -> conmon
/(user's systemd service)/user.slice/user-libpod_pod_(podid).slice/libpod-(cid).scope/container -> elasticsearch procs
Is this insufficient in some way?
Or maybe we should do this in a more systemd-like way?
i.e. Slice=machines-mycool_pod.slice
Pod
/machines.slice/machines-mycool_pod.slice/mycool-pod.service/supervisor -> pod conmon
/machines.slice/machines-mycool_pod.slice/mycool-pod.service/container -> infra container
/machines.slice/machines-mycool_pod.slice/mycool-elasticsearch.service/supervisor -> conmon
/machines.slice/machines-mycool_pod.slice/mycool-elasticsearch.service/container -> ES processes
Then everything is properly in a parent slice - is this what we'd want split to do with pods?
If so, the --cgroups split would have to be set at the pod create level, and child services would have to know if split is passed, to not inherit the cgroup-parent of the pod.
--pids-limit 0 does let the containers start, but yea, I booted FCOS into cgroups v2 with rootless here. I have a non-root user mycool that is being used via systemd to launch these containers.
[core@mycool ~]$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope
@giuseppe I don't know what's causing this - but there are times when I need to set --pids-limit 0. It seems like there's a default of pids-limit 2048 coming from somewhere, not the config file and not the command line, and then when crun sees it can't do cgroups with pids-limit, it throws the runtime error.
If you happen to get the cgroup right - i.e. it's something crun can modify and it has a pids controller, then the error isn't present.
@goochjj I am trying to set things up so I can have many pods running under a rootless user (or users) via systemd units with the User= directive, one per stack of applications running as rootless containers inside the pod. Having everything in its own pod namespace as a rootless user is pretty great; I don't need to juggle ports for each application stack, just the pod's ports. I also like the isolation pods give each application stack deployment.
Since FCOS doesn't support user systemd units via Ignition, I have to set them up as system units. Which is fine, since I like using system units over user units anyway to prevent them from being modified by non-root users.
Right, but all this works for you without --cgroups split, correct? Is there something you're hoping to gain with --cgroups split?
The pids-limit is probably Podman automatically trying to set the maximum available for that rlimit - we should code that to only happen if cgroups are present.
@goochjj I was running FCOS with cgroups v1 up until I saw this thread that introduced --cgroups split so I started down this road of giving it a try with cgroups v2. Trying my old setup that works on FCOS cgroups v1 on FCOS with cgroups v2 doesn't work at all without setting --pids-limit 0.
I am not trying to gain anything specific by using --cgroups split. I thought it would help provide me with a better setup for my use case.
@mheon I'm unclear on why cgroups aren't present... let alone that default.
It's really annoying, and seems to be cgroupsv1 specific. Should I create this as a separate issue?
I believe that's a requirement forced on us by cgroups v1 not being safe for rootless use, unless I'm greatly misunderstanding?
@mheon I'm fine with that, as long as it doesn't explicitly require me to --pids-limit 0 everything, which it's currently doing.
This code
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 302) // then ignore the settings. If the caller asked for a
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 303) // non-default, then try to use it.
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 304) setPidLimit := true
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 305) if rootless.IsRootless() {
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 306) cgroup2, err := cgroups.IsCgroup2UnifiedMode()
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 307) if err != nil {
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 308) return nil, err
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 309) }
4352d58549 (Daniel J Walsh 2020-03-27 10:13:51 -0400 310) if (!cgroup2 || (runtimeConfig != nil && runtimeConfig.Engine.CgroupManager != cconfig.SystemdCgroupsManager)) && config.Resources.PidsLimit == sysinfo.GetDefaultPidsLimit() {
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 311) setPidLimit = false
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 312) }
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 313) }
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 314) if setPidLimit {
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 315) g.SetLinuxResourcesPidsLimit(config.Resources.PidsLimit)
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 316) addedResources = true
118cf1fc63 (Daniel J Walsh 2019-09-14 06:21:10 -0400 317) }
in pkg/spec/spec.go seems to indicate it should already be ignoring the default on cgroups v1. I'm digging.
Cuz this isn't great.
(focal)mrwizard@FocalCG1Dev:~/src/podman
$ podman run --rm -it alpine sh
Error: cannot set limits without cgroups: OCI runtime error
This is definitely a bug. Is this 2.0? pkg/spec is deprecated, we've moved to pkg/specgen/generate - so the offending code likely lives there.
2.1.0-dev. Actually, master, plus my sdnotify
So, sounds like I should create a new issue.
:-D
Fixed in master.