Prefect: Running prefect local agent in a docker container leads to zombie apocalypse ;-)

Created on 26 Apr 2020 · 10 Comments · Source: PrefectHQ/prefect

Description

I'm running the Prefect agent inside a Docker container with local execution. Each run of the process leaves a zombie process behind, and if left unchecked the zombies accumulate until they cause problems on the host. I noticed this because at one point I was unable to ssh into the node on which the container was running.

Expected Behavior

Somehow the completed processes should be harvested to remove the zombies.

Reproduction

My shell script does

exec prefect agent start -t $prefect_runner_token

(note: removing the exec doesn't help). Here's a simple script to create a flow that runs on a schedule:

import prefect
from prefect import Flow, task
from prefect.schedules import IntervalSchedule
from datetime import timedelta, datetime

import time

schedule = IntervalSchedule(
    start_date=datetime.utcnow() + timedelta(seconds=1),
    interval=timedelta(minutes=2),
)

@task
def run():
    logger = prefect.context.get("logger")
    results = []
    for x in range(3):
        results.append(str(x + 1))
        logger.info("Hello! run {}".format(x + 1))
        time.sleep(3)
    return results

with Flow("Hello", schedule=schedule) as flow:
    results = run()

flow.register(project_name="Hello")

Environment

The container is built on CentOS 7.3. It does not have an init process.

{
  "config_overrides": {},
  "env_vars": [],
  "system_information": {
    "platform": "Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-glibc2.10",
    "prefect_version": "0.10.4",
    "python_version": "3.8.2"
  }
}

All 10 comments

What I am finding is that each run produces three subprocesses. The process with the smallest PID takes the longest to run and seems to be reaped eventually. The other two processes seem to exit more quickly but are never reaped. Thus each flow run adds a net two zombies.
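
For anyone who wants to check the count on their own container, here is a minimal sketch that lists zombies by reading `/proc/<pid>/stat` on Linux. The function name `list_zombies` and the `proc_root` parameter are made up for illustration; none of this is part of Prefect.

```python
import os


def list_zombies(proc_root="/proc"):
    """Return (pid, ppid, name) for every process whose state is 'Z' (zombie)."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # the process disappeared between listdir() and open()
        # Layout is "pid (comm) state ppid ...". The comm field may itself
        # contain spaces or ')', so split on the last ')' in the line.
        name = stat[stat.index("(") + 1 : stat.rindex(")")]
        fields = stat[stat.rindex(")") + 2 :].split()
        state, ppid = fields[0], int(fields[1])
        if state == "Z":
            zombies.append((int(entry), ppid, name))
    return sorted(zombies)
```

Calling `list_zombies()` before and after a flow run and comparing the two results should show the net growth, and the `ppid` value shows which parent has not reaped them.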

Congratulations @mcg1969 I think this means that you are patient zero! I will look into this behavior. What are you using as the base image for your container?

I'm afraid I can't share the exact container, though I don't mind that you know it's the one that we use inside of Anaconda Enterprise, and @jcrist might have some familiarity with that. That said, it's based on a CentOS 7.3 base image, with Miniconda installed within. I'm happy to share the precise conda environment I was using too if that helps.

No worries! Was only wondering if it had some possible weird dependencies but this is enough information to go off of 😄

Here's the conda environment, re-creatable with

conda create -n testprefect -c defaults -c conda-forge --file ...

The file:

# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=main
appdirs=1.4.3=pyh91ea838_0
asn1crypto=1.3.0=py38_0
ca-certificates=2020.1.1=0
certifi=2020.4.5.1=py38_0
cffi=1.14.0=py38h2e261b9_0
chardet=3.0.4=py38_1003
click=7.1.1=py_0
cloudpickle=1.2.2=py_0
croniter=0.3.30=py_0
cryptography=2.8=py38h1ba5d50_0
cytoolz=0.10.1=py38h7b6447c_0
dask-core=2.14.0=py_0
distributed=2.14.0=py38_0
docker-py=4.2.0=py38_0
docker-pycreds=0.4.0=py_0
heapdict=1.0.1=py_0
idna=2.9=py_1
ld_impl_linux-64=2.33.1=h53a641e_7
libedit=3.1.20181209=hc058e9b_0
libffi=3.2.1=hd88cf55_4
libgcc-ng=9.1.0=hdf63c60_0
libstdcxx-ng=9.1.0=hdf63c60_0
marshmallow=3.5.1=py_0
marshmallow-oneofschema=2.0.1=py_0
msgpack-python=1.0.0=py38hfd86e86_1
mypy_extensions=0.4.3=py38_0
ncurses=6.2=he6710b0_0
openssl=1.1.1g=h7b6447c_0
packaging=20.3=py_0
pendulum=2.1.0=py38_1
pip=20.0.2=py38_1
prefect=0.10.4=py_0
psutil=5.7.0=py38h7b6447c_0
pycparser=2.20=py_0
pyopenssl=19.1.0=py38_0
pyparsing=2.4.6=py_0
pysocks=1.7.1=py38_0
python=3.8.2=hcf32534_0
python-box=4.2.2=py_0
python-dateutil=2.8.1=py_0
python-slugify=3.0.4=py_0
pytz=2019.3=py_0
pytzdata=2019.3=py_0
pyyaml=5.3.1=py38h7b6447c_0
readline=8.0=h7b6447c_0
requests=2.23.0=py38_0
ruamel.yaml=0.16.10=py38h7b6447c_1
ruamel.yaml.clib=0.2.0=py38h7b6447c_0
setuptools=46.1.3=py38_0
six=1.14.0=py38_0
sortedcontainers=2.1.0=py38_0
sqlite=3.31.1=h62c20be_1
tabulate=0.8.3=py38_0
tblib=1.6.0=py_0
text-unidecode=1.3=py_0
tk=8.6.8=hbc83047_0
toml=0.10.0=pyh91ea838_0
toolz=0.10.0=py_0
tornado=6.0.4=py38h7b6447c_1
unidecode=1.1.1=py_0
urllib3=1.25.8=py38_0
websocket-client=0.57.0=py38_1
wheel=0.34.2=py38_0
xz=5.2.5=h7b6447c_0
yaml=0.1.7=had09818_2
zict=2.0.0=py_0
zlib=1.2.11=h7b6447c_3

I wouldn't say there's anything about the container I would expect to cause problems. Anything is possible, of course. But the container doesn't have an init process.

I have been able to verify that adding an init process like tini (https://github.com/krallin/tini) to the container, and running everything under that, reaps the zombies properly.
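
To illustrate why an init process helps, here is a rough sketch of the reaping part of what a tool like tini does: start the real workload as a child, then keep calling `wait()` so that every exited child, including orphans the kernel re-parents to PID 1, is collected. This is not tini's actual implementation and not Prefect code; the name `run_as_init` is hypothetical, and orphans are only adopted when this process really is PID 1 in the container.

```python
import os


def run_as_init(argv):
    """Start argv as a child, reap every child that exits, return argv's exit code."""
    main_pid = os.fork()
    if main_pid == 0:
        try:
            os.execvp(argv[0], argv)  # child: replace itself with the workload
        except OSError:
            os._exit(127)  # command not found or not executable

    exit_code = None
    while exit_code is None:
        pid, status = os.wait()  # blocks until any child, adopted or not, exits
        if pid == main_pid:
            if os.WIFEXITED(status):
                exit_code = os.WEXITSTATUS(status)
            else:
                exit_code = 128 + os.WTERMSIG(status)  # killed by a signal
        # Any other pid was an adopted orphan; wait() has already reaped it.

    # Collect children that have already exited, without blocking on live ones.
    while True:
        try:
            pid, _ = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # remaining children are still running
    return exit_code
```

A real init also forwards signals such as SIGTERM to the workload, which this sketch leaves out, so in practice running the agent under tini as described above is the safer choice.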

Glad to hear it! Currently it looks like we're implicitly relying on the init process to prune orphaned processes (which IMO is fine, if not ideal). We could possibly fix this in the future, but for now I think I'm fine saying that we require an init process when using the local agent. Leaving it open though. Thanks for the report @mcg1969!

I think that's reasonable; a doc fix would be great to consider!

Just adding here from IRL convo: we think the docs note should be on the page describing the local agent.
