Airflow: Can we add a configurable debug setting to delay pod deletion when a pod is in the `Error` state?

Created on 26 Apr 2020  ·  8 Comments  ·  Source: apache/airflow

Description

Add a configurable debug setting to delay pod deletion when a pod is in the Error state.

Use case / motivation

This is with the apache/airflow:1.10.10 image.

I'm deploying Airflow in k8s and want to use the Kubernetes Executor to execute tasks.
If a pod enters the Error state, the Airflow scheduler deletes it immediately.
So we cannot see what happened, because the pod is deleted within a few seconds.

When I add time.sleep() at kubernetes_executor.py:896, like this:

    def _change_state(self, key, state, pod_id, namespace):
        if state != State.RUNNING:
            if self.kube_config.delete_worker_pods:
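                # Debug hack: wait 120 s before deleting so the failed pod can be
                # inspected (needs `import time` at the top of the module).
                for x in range(120):
                    self.log.info("%s: sleep 1s for...", x)
                    time.sleep(1)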
                self.kube_scheduler.delete_pod(pod_id, namespace)
                self.log.info('Deleted pod: %s in namespace %s', str(key), str(namespace))
            try:
                self.running.pop(key)
            except KeyError:
                self.log.debug('Could not find key: %s', str(key))
        self.event_buffer[key] = state

When I trigger a run manually, I can see that the pod soon enters the Error state.

➜  ~ kubectl get po
NAME                                                         READY   STATUS    RESTARTS   AGE
airflow-564c84ff46-tn5mg                                     2/2     Running   0          67s
examplebashoperatorrunme0-76fd68aa96d64e8c93c7c87904f3312a   0/1     Error     0          24s

Watch pod's log:

➜  ~ kubectl logs -f examplebashoperatorrunme0-76fd68aa96d64e8c93c7c87904f3312a
Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 23, in <module>
    import argcomplete
ModuleNotFoundError: No module named 'argcomplete'

It's an error in the container. It's easy to debug now.

feature

All 8 comments

Thanks for opening your first issue here! Be sure to follow the issue template!

Hi gwind,
how did you solve the container error?
ModuleNotFoundError: No module named 'argcomplete'
I have the same issue in pods with the Kubernetes executor and the example DAGs

There is an option to keep / not delete worker pods:
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "false"
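
The variable above follows Airflow's AIRFLOW__{SECTION}__{KEY} naming convention for overriding airflow.cfg options from the environment. A minimal sketch of that mapping follows; the helper name airflow_env_var is hypothetical and is not part of Airflow.

```python
def airflow_env_var(section, key):
    """Build the environment variable name that overrides [section] key.

    Hypothetical helper for illustration only; Airflow itself does not
    expose a function with this name.
    """
    return "AIRFLOW__{}__{}".format(section.upper(), key.upper())


if __name__ == "__main__":
    # The [kubernetes] delete_worker_pods option from this thread.
    print(airflow_env_var("kubernetes", "delete_worker_pods"))
```

Setting the resulting variable to "false" in the scheduler's environment should correspond to delete_worker_pods = False under [kubernetes] in airflow.cfg.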

Hi gwind,
how did you solve the container error?
ModuleNotFoundError: No module named 'argcomplete'

I solved it by hacking the Airflow code.

I have the same issue in pods with the Kubernetes executor and the example DAGs

There is an option to keep / not delete worker pods:
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "false"

👍 Indeed, there is an option in https://github.com/apache/airflow/blob/master/airflow/config_templates/default_airflow.cfg#L828

# If True, all worker pods will be deleted upon termination
delete_worker_pods = True

# If False (and delete_worker_pods is True),
# failed worker pods will not be deleted so users can investigate them.
delete_worker_pods_on_failure = False

But those who use the apache/airflow:1.10.10 image should check whether that image supports it.

Thanks !
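
To make the interaction of the two options quoted above concrete, here is a small sketch of the decision their comments describe. It is not copied from the Airflow source: the function name should_delete_pod and the plain string states are assumptions made for illustration.

```python
def should_delete_pod(state, delete_worker_pods=True,
                      delete_worker_pods_on_failure=False):
    """Return True if a worker pod in the given state would be deleted.

    Illustrative only: states are plain strings here ("running",
    "success", "failed"), and the defaults mirror the config excerpt.
    """
    if state == "running":
        # The pod has not terminated yet, so there is nothing to clean up.
        return False
    if not delete_worker_pods:
        # Deletion is switched off entirely; every pod is kept.
        return False
    if state == "failed":
        # Failed pods are kept for investigation unless explicitly enabled.
        return delete_worker_pods_on_failure
    return True


if __name__ == "__main__":
    for state in ("running", "success", "failed"):
        print(state, should_delete_pod(state))
```

With the defaults from the excerpt, successful pods are cleaned up while failed pods stay around for kubectl logs or kubectl describe.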

AIRFLOW__KUBERNETES__RUN_AS_USER: "50000"

Hi guys!

How did you solve the problem ?

ModuleNotFoundError: No module named 'argcomplete'

Is there any setting, etc., to fix it?

This bug is mostly caused by a wrong user environment in the Airflow pod.

You can use `kubectl exec -it ${THE_POD} bash` to get a shell inside the Airflow pod, then run the airflow command to test. You will then see which user it is running as.
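
Once inside the pod, a short Python check can show which user the process runs as and whether the missing module is visible to it. This is only a sketch: the function name describe_runtime is hypothetical, and it assumes the packages live in a per-user site directory, as the /home/airflow/.local path in the traceback suggests.

```python
import importlib.util
import os
import site


def describe_runtime(module_name="argcomplete"):
    """Collect facts that help explain a ModuleNotFoundError inside a pod."""
    return {
        # Numeric user ID of the current process (Unix only).
        "uid": os.getuid(),
        # Home directory as seen by this process, if set.
        "home": os.environ.get("HOME"),
        # Per-user site-packages directory Python would use for this user.
        "user_site": site.getusersitepackages(),
        # Whether the module can be located on the current import path.
        "module_found": importlib.util.find_spec(module_name) is not None,
    }


if __name__ == "__main__":
    for name, value in describe_runtime().items():
        print("{}: {}".format(name, value))
```

If module_found is False and user_site does not point under /home/airflow/.local, the pod is probably running as a different user than the one the packages were installed for.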

Hi!
Thanks!

I've set it up as proposed:
AIRFLOW__KUBERNETES__RUN_AS_USER: "50000"

And it worked. :)

This has already been added to master via https://github.com/apache/airflow/pull/7507 (and then renamed in https://github.com/apache/airflow/pull/8312/) -- we're hoping to include this in the 1.10.11 release.

@gwind Are you happy that the above two PRs would give you the behaviour you are after (once they are released, of course)?
