/kind bug
Hello,
Env:
rootless container in a user namespace
6/6 containers were running fine
managed by systemd
Crash:
home partition 100% full
Current:
5 of 6 containers are working again,
1 has problems.
Infos:
/bin/podman run --rm --name test_service --image-volume=ignore --authfile /home/cadmin/.podman_creds.json registry.example/test/alpine:3.12.0
Error: error creating container storage: the container name "test_service" is already in use by "6e5d7bcf14a33187db1667493281a2a939859954b4a90c54de168243411fada9". You have to remove that container to be able to reuse that name.: that name is already in use
/bin/podman ps -a
It shows no container other than the 5 running ones.
I would expect a stopped/exited/created one.
Also tried --sync.
/bin/podman rm -f --storage 6e5d7bcf14a33187db1667493281a2a939859954b4a90c54de168243411fada9
Error: error unmounting container "6e5d7bcf14a33187db1667493281a2a939859954b4a90c54de168243411fada9": layer not known
Debug level also didn't show any other errors.
Where does podman search for these names?
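For reference, the names that c/storage knows about live in overlay-containers/containers.json under the graph root. A minimal sketch for listing them (paths assume the rootless defaults from the info output below; the jq invocation and the id/names field names are my assumption about the file layout, so verify against your own containers.json first):

# List container IDs and names known to the storage library (not to the libpod DB)
jq -r '.[] | .id + "  " + (.names | join(","))' \
  ~/.local/share/containers/storage/overlay-containers/containers.json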
Output of podman version:
Version: 2.0.4
API Version: 1
Go Version: go1.13.4
Built: Thu Jan 1 01:00:00 1970
OS/Arch: linux/amd64
Output of podman info --debug:
host:
  arch: amd64
  buildahVersion: 1.15.0
  cgroupVersion: v1
  conmon:
    package: conmon-2.0.20-1.el8.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.20, commit: 838d2c05b5b53eff3f1cd1a06dbd81d8153feea3'
  cpus: 4
  distribution:
    distribution: '"centos"'
    version: "8"
  eventLogger: file
  hostname: herewasahostname
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1002
      size: 1
    - container_id: 1
      host_id: 231072
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1002
      size: 1
    - container_id: 1
      host_id: 231072
      size: 65536
  kernel: 4.18.0-193.6.3.el8_2.x86_64
  linkmode: dynamic
  memFree: 8551780352
  memTotal: 16644939776
  ociRuntime:
    name: runc
    package: runc-1.0.0-65.rc10.module_el8.2.0+305+5e198a41.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.1-dev'
  os: linux
  remoteSocket:
    path: /run/user/1002/podman/podman.sock
  rootless: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-0.4.2-3.git21fdece.module_el8.2.0+305+5e198a41.x86_64
    version: |-
      slirp4netns version 0.4.2+dev
      commit: 21fdece2737dc24ffa3f01a341b8a6854f8b13b4
  swapFree: 5000392704
  swapTotal: 5003800576
  uptime: 7h 12m 54.6s (Approximately 0.29 days)
registries:
  search:
  - registry.example.de
store:
  configFile: /home/user/.config/containers/storage.conf
  containerStore:
    number: 8
    paused: 0
    running: 8
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /bin/fuse-overlayfs
      Package: fuse-overlayfs-0.7.2-5.module_el8.2.0+305+5e198a41.x86_64
      Version: |-
        fuse-overlayfs: version 0.7.2
        FUSE library version 3.2.1
        using FUSE kernel interface version 7.26
  graphRoot: /home/user/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 17
  runRoot: /tmp/run-1002
  volumePath: /home/user/.local/share/containers/storage/volumes
version:
  APIVersion: 1
  Built: 0
  BuiltTime: Thu Jan 1 01:00:00 1970
  GitCommit: ""
  GoVersion: go1.13.4
  OsArch: linux/amd64
  Version: 2.0.4
Package info (e.g. output of rpm -q podman or apt list podman):
podman-2.0.4-1.el8.x86_64
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?
Yes, I checked the Troubleshooting Guide.
(I changed the names in the output above.)
podman system reset
This should clean up all containers and images and reset you to the initial state.
Currently this isn't a solution for me because it would result in downtime for all containers.
Well, you could remove your libpod database, which will make Podman lose the containers but still keep all of the images in storage.
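For context, a rough sketch of what that could look like for this rootless setup (paths assume the defaults from the info output above; the BoltDB file name bolt_state.db is my assumption and should be verified, and the unit name is hypothetical):

# Stop the affected unit first, then move the libpod database out of the way
systemctl --user stop test_service.service
mv ~/.local/share/containers/storage/libpod/bolt_state.db \
   ~/.local/share/containers/storage/libpod/bolt_state.db.bak
# Podman recreates an empty database; images stay in storage,
# but podman ps no longer lists the previously known containers
podman ps -a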
This one seems kind of bizarre - c/storage is complaining that the name is in use, but simultaneously that the associated layer does not exist. @nalind Are we looking at potentially inconsistent c/storage state here?
@rhatdan can you give me some instructions on what to do exactly? Shall I remove the .local/share/containers/storage/libpod folder with the database file in it? Does "lose" mean I won't see any container with "podman ps" anymore?
PS: I will go for the reset next week if we release a new version.
Fixed it today, with downtime:
mv ~/.local ~/.local_old
I think this is mostly the same as what "podman system reset" does.
I manually removed the entry whose ID matches the one in the error message from storage/overlay-containers/containers.json. After that it seems to work fine.
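A sketch of that manual edit with jq, keyed on the ID from the error message above (back up the file first; the path assumes the rootless defaults, and this only removes the JSON entry, not any leftover per-container directory like the ones mentioned further down):

cd ~/.local/share/containers/storage/overlay-containers
cp containers.json containers.json.bak
# Drop the orphaned entry by ID and write the filtered list back
jq 'map(select(.id != "6e5d7bcf14a33187db1667493281a2a939859954b4a90c54de168243411fada9"))' \
  containers.json.bak > containers.json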
I saw this entry too, but I was too scared to remove it and cause an even more inconsistent state.
Thanks for the hint.
Hey there! I'm experiencing the same error, and manually altering containers.json fixes it. The difference in my case is that this happened after some sort of forceful system reset (power loss).
I think podman ps -a should indeed show containers if they can still be found in:
/run/containers/storage/overlay-containers/bbc080aab414c5812eea011d3af6afaf548cdba9fa1b6b092f03b279f17bc185
/var/lib/containers/storage/overlay-containers/bbc080aab414c5812eea011d3af6afaf548cdba9fa1b6b092f03b279f17bc185
/var/lib/containers/storage/overlay-containers/containers.json
As @mheon said, I too think this is an inconsistency. I can't trust the podman ps command, I cannot use it to resolve the problem, and this situation occurs not only on a full disk but also in industrial applications where the system state might not always be handled gracefully. I think you understand the problem when someone in a power plant disconnects a device running podman and on the next boot containers randomly fail to start because of this inconsistent state.
Usually my systemd units have an ExecStartPre=-/usr/bin/podman rm "whatever", which ensures that any leftovers are removed before attempting to create a new container. In this case, this command returns that there is no such container. Creating a container with the name "whatever" then fails: although podman said before that a container with that name does not exist, it now complains that the name is already taken by a container with an ID which podman ps also does not show.
I know it would be better to file a PR with a fix instead of just telling you all this from my personal experience, but honestly I don't know what the actual problem is, and from what I read, the issue does not receive the attention it maybe should, given its implications.
As of Podman v2.1.1, you can use podman ps --storage to see containers that are not in Podman's database but are present in the storage library. They can then be removed via podman rm --storage on the container ID.
(The rm --storage bit has worked for quite a long time - since 1.6.x, I believe?)
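Put together, a cleanup along the lines described above might look like this (ID shortened; the extra --all flag and the trimmed-down run flags are illustrative assumptions):

# Show containers known only to the storage library (Podman >= 2.1.1)
podman ps --all --storage
# Remove the orphaned storage entry by ID, then re-create the container
podman rm --storage 6e5d7bcf14a3
podman run --rm --name test_service registry.example/test/alpine:3.12.0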
@mheon thanks a lot for your response! Could you elaborate on this? Naturally I'd assume that using podman rm deletes any remains in order to return to a coherent state. podman rm --storage seems to only remove containers from storage, and only does so if the container is not in the libpod database. Although that makes sense to some degree, I'm still missing a single command with the semantics "remove this container such that it's as if it never existed".
In the systemd-based example, I'd like the ExecStartPre commands to ensure a state in which the actual podman container run|create commands will always work.
I assume the currently expected way to do this is to have two ExecStartPre= statements, one with ExecStartPre=-/usr/bin/podman rm <whatever> and the next with ExecStartPre=-/usr/bin/podman rm --storage <whatever>?
In my opinion, the way this works with these two commands is quite unintuitive and too technical for a user, given the distinction between the libpod DB and podman storage.
Can you point me to some rationale in a commit or something? There is surely a good explanation of why you decided to do it like this. :)
As this now tends to become a little off-topic, I'll reach out via Matrix/IRC to elaborate a bit more about the whole systemd/podman stuff.
At this point, if you're seeing this as a consistent problem, something is seriously wrong - we've done a lot to make sure that c/storage is reliably removed at the same time as the container, even in cases of error. We have, however, had some known issues where improperly-written systemd unitfiles can do this; any chance you can post the unit file in question?
There you go:
[Unit]
Description=NodeRed
[Service]
Type=simple
TimeoutStartSec=5m
Environment="NODERED_CONTAINER_VERSION=latest"
Environment="TZ=Europe/Berlin"
EnvironmentFile=-/etc/default/nodered
ExecStartPre=-/usr/bin/podman stop "nodered-runtime"
ExecStartPre=-/usr/bin/podman rm "nodered-runtime"
ExecStartPre=-/usr/bin/podman rm --storage "nodered-runtime"
ExecStartPre=/usr/bin/mkdir -p /var/srv/nodered/
ExecStartPre=-/usr/bin/cp --no-clobber /etc/node-red/initial-flows.json /var/srv/nodered/flows.json
ExecStartPre=/usr/bin/chown 1000:1000 -R /var/srv/nodered/
ExecStartPre=/usr/bin/chcon -Rt container_file_t /var/srv/nodered/
# Group dialout = 18
# Group tty = 5
# Use privileged and network=host for debugging purposes
ExecStart=/usr/bin/podman run \
--name "nodered-runtime" \
--authfile /etc/nodered.auth \
--read-only \
--memory=750M \
--systemd=true \
--privileged \
--group-add=18 \
--group-add=5 \
--user=root \
--network=host \
-v /var/srv/nodered/:/data \
-v /dev:/dev \
-e NODERED_ADMIN_PW=${NODERED_ADMIN_PASSWORD} \
-e TZ=${TZ} \
docker.io/nodered/node-red:${NODERED_CONTAINER_VERSION}
ExecReload=/usr/bin/podman stop "nodered-runtime"
ExecStop=/usr/bin/podman stop "nodered-runtime"
Restart=always
RestartSec=30
[Install]
WantedBy=multi-user.target
RequiredBy=boot-complete.target
I'm noting a few issues immediately:
- KillMode=none on unit files launching Podman. We launch several processes after the container exits to clean up after it, and systemd has an annoying habit of shutting down these cleanup processes mid-execution when it wants to stop or restart a unit, which can lead to issues depending on when it was stopped.
- Type=forking and using PID files to manage Podman under systemd. The container is not actually a direct child of Podman (it's a child of a monitor process we launch called Conmon, which double-forks to daemonize before launching the container) and, as part of creating the container, we also leave the cgroup of the systemd unit - so it can't actually track the state of the container itself unless given a PID file.
You can use podman generate systemd --new to generate a sample unit file to show our recommended format for these.
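For example, regenerating a unit for the container above could look like this (the container name is taken from the unit file earlier in the thread; the generated file name container-nodered-runtime.service is what podman generate systemd produces when --name is given, and the install path is an assumption for a root unit):

# Generate a recommended unit from an existing container definition;
# --new makes the unit create and remove the container on start/stop
podman generate systemd --new --files --name nodered-runtime
# Install and enable the generated unit
mv container-nodered-runtime.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now container-nodered-runtime.service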
A friendly reminder that this issue had no activity for 30 days.
I am going to close this, since @mheon gave you some solutions.