Airflow: Pod logs from KubernetesPodOperator occasionally get replaced with "Task is not able to run"

Created on 1 Jul 2020 · 28Comments · Source: apache/airflow

Apache Airflow version: 1.10.10
Kubernetes version (if you are using kubernetes) (use kubectl version): 1.17.2

Environment:

Cloud provider or hardware configuration: AWS
OS (e.g. from /etc/os-release): CentOS Linux
Kernel (e.g. uname -a): Linux airflow-worker-0 5.6.13-1.el7.elrepo.x86_64 #1 SMP Thu May 14 08:05:24 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Others:

What happened:

I run Airflow as a way to orchestrate jobs on Kubernetes using the KubernetesPodOperator. While most of the time logs appear correctly in the Airflow webserver, I do notice increasingly that the logs do not appear and instead just show a message, "Task is not able to be run", such as in the snippet below:

*** Reading remote log from s3://*****/****/******/2020-06-30T10:00:00+00:00/1.log.
[2020-06-30 23:07:40,362] {taskinstance.py:663} INFO - Dependencies not met for <TaskInstance: ****.***** 2020-06-30T10:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2020-06-30 23:07:40,363] {logging_mixin.py:112} INFO - [2020-06-30 23:07:40,363] {local_task_job.py:91} INFO - Task is not able to be run

Unusually, when I go check what is happening on the Kubernetes cluster, the pod is actually running and emitting logs when I run a kubectl logs command. When the pod is complete, Airflow will reflect that the task has completed as well.

What you expected to happen: I expected pod logs to be printed out.

How to reproduce it: Very unfortunately, I am unsure what circumstances cause this error and am currently trying to gather evidence to replicate.

Anything else we need to know:

I have remote logging set to an S3 bucket.
I've noticed this issue increasingly with the 1.10.10 update, and I find this error daily.

kubernetes bug high

Source

chrismclennon

👍5

Most helpful comment

Few new findings:

was able to reproduce this issue for task running more than a minute by setting visibility_timeout = 60 in [celery_broker_transport_options] in airflow.cfg file.
Therefore, this is happening because Celery expects the task to complete within an hour and if not assigns another worker for the task, during this transition, worker uploads the logs with "Task is not able to run" to S3
Can see another worker getting same task from logs Received task: airflow.executors.celery_executor.execute_command[b40cacbb-9dd3-4681-8454-0e1df2dbc910] with same id seconding that Celery is assigning this task to another worker.

Modifying "visibility_timeout = 86400 # 1day" in airflow.cfg doesn't resolve this issue and logs in UI are corrupted after an hour
Even tried "visibility_timeout = 7200 # 2 hours" in airflow.cfg but can still see this issue after an hour.
Seems the issue is similar to https://github.com/celery/celery/issues/5935, but according to this it should be resolved in Celery version 4.4.5 but, we still see the same issue even though Airflow 1.10.10 uses Celery version 4.4.6

( CC: @chrismclennon @dimberman )

raj-manvar on 10 Sep 2020

👍4 🎉3

All 28 comments

Thanks for opening your first issue here! Be sure to follow the issue template!

boring-cyborg[bot] on 1 Jul 2020

Similar issue in #9626

zikun on 8 Jul 2020

I'm receiving the same message, when I'm running long task (duration more than 1 day) with KubernetesPodOperator.

{taskinstance.py:624} INFO - Dependencies not met for <TaskInstance: **** 2020-07-23T11:23:35+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.

Task finishes successfully, but Airflow status is marked as FAILED and downstream tasks are not run

Pseverin on 24 Jul 2020

👍6

Hi @chrismclennon,

Two questions about this: 1. Are these happening to very long-running tasks? 2. Are these tied to scheduler restarts? It's possible this might be fixed in 1.10.12 as we added some fixes for long running tasks in the k8spodoperator

dimberman on 31 Aug 2020

Hey @dimberman.

We've observed this for all tasks that run longer than 1 hour.
I don't believe this is tied to scheduler restarts. Whether we restart the scheduler or not, we still observe this behaviour at the 1 hour mark.

We'll be making the update to 1.10.12 within the next ~month or so. If this issue is still under investigation by then I'll be sure to report back if the version update made any changes.

Thanks!

chrismclennon on 2 Sep 2020

@ashb @kaxil have you seen anything like this for any other super long running tasks? Is there a timeout on tasks that I should be aware of?

dimberman on 3 Sep 2020

Possibly a long running task makes it more likely to hit a network blip speaking to the Kube API?

ashb on 3 Sep 2020

@chrismclennon @Pseverin just to check: Does the task in the airflow UI still show up as running, but the logs get funky? Like they're not put into "up_for_retry" or something? Also do all older logs disappear in s3?

@ashb what's strange here is that these logs would suggest that airflow is trying to restart the task while it is still in the running state. Is there a situation where the SchedulerJob would die because of an OOM or something causing the task to go back in the queue?

dimberman on 3 Sep 2020

Hey @dimberman. The Airflow task does show as running -- it's not in any retry or error state. Older logs are not disappearing, as I'm able to pull logs from start of year. We don't have any sort of object deletion in our S3 bucket either. When the job does eventually complete, this log is replaced with the actual, correct log uploaded to S3. This only happens for tasks that are in the running state.

The best we've been able to do in terms of replicating this issue is to execute long running tasks and see this behaviour pop up. We're also still investigating what the underlying issue could be.

Thank you!

chrismclennon on 3 Sep 2020

Ok yeah that's pretty consistent with the report we received from a customer (that the logs show up when the task completes, but for some reason stops showing up while the task is running).

@ashb could this have something to do with how we retrieve logs? Like maybe airflow is attempting to do a retry while the original task is running and the task logs are picking that up?

@chrismclennon are you seeing a different try_number when it fails to retrieve logs?

dimberman on 3 Sep 2020

@dimberman Is there a good way to check whether it uses a different try_number on retrieving logs? I just pulled up an instance that is showing this error. The log looks like this:

*** Reading remote log from s3://***/***/***/***/2020-09-01T09:00:00+00:00/1.log.
[2020-09-02 10:06:39,304] {taskinstance.py:663} INFO - Dependencies not met for <TaskInstance: ****.**** 2020-09-01T09:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2020-09-02 10:06:39,501] {logging_mixin.py:112} INFO - [2020-09-02 10:06:39,402] {local_task_job.py:91} INFO - Task is not able to be run

It looks like it's trying to read the remote log from S3 on try_number=1 (judging from 1.log)(?), which is the current attempt number. Interestingly, I expected that since the task is in the running state, it would pull from the logs being served on port 8793.

chrismclennon on 3 Sep 2020

It might be noteworthy to add that I'm running all airflow components on a Kubernetes cluster. To allow communication with port 8793, I've deployed the workers as a StatefulSet. This is what the header of a normal running log looks like:

*** Log file does not exist: /var/local/airflow/logs/****/****/2020-09-03T16:00:00+00:00/1.log
*** Fetching from: http://airflow-worker-1.airflow-worker.airflow-core--prod.svc.cluster.local:8793/log/***/***/2020-09-03T16:00:00+00:00/1.log


[2020-09-03 18:12:28,965] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: ***.*** 2020-09-03T16:00:00+00:00 [queued]>
[2020-09-03 18:12:28,981] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: ***.*** 2020-09-03T16:00:00+00:00 [queued]>
[2020-09-03 18:12:28,981] {taskinstance.py:879} INFO - 
--------------------------------------------------------------------------------
[2020-09-03 18:12:28,983] {taskinstance.py:880} INFO - Starting attempt 1 of 1
[2020-09-03 18:12:28,983] {taskinstance.py:881} INFO - 
--------------------------------------------------------------------------------

... actual runtime logs are populated from here ...

The workers aren't restarting when we see this issue since we make use of CeleryExecutor, so I don't believe the worker deployment itself should be the culprit here.

chrismclennon on 3 Sep 2020

@chrismclennon does this happen when you are only running a single worker? I'd be interested if perhaps a second worker is pulling the task while the first worker is working on it, which would create a race condition. Also you mentioned that all logs appear when the task completes, are any of the dependency failures in the logs?

dimberman on 3 Sep 2020

Hey @dimberman. This still happens when we are running a single worker, so I don't believe it's a race condition between workers. I don't notice any dependency failure in the completed logs loaded in S3.

Some other interesting observations we've made:

The logs in S3 has the "Task is not able to run" message
In reproducing the issue, we saw that after a task ran 1 hour 10 mins, the worker uploads the bad logs to s3
If we manually mark the task as succeeded, the worker then uploads the good log
All the tasks which encounter this issue do not have the log present locally. So after an hour or so, if the task is still running, the worker uploads the error log to S3, and then we see the issue as the read pulls the log down from S3.

So it seems like the current question on our mind is why the worker is uploading this bad log to S3 in the first place.

chrismclennon on 4 Sep 2020

@chrismclennon Do you have any non K8sPodOperator tasks that run for over an hour? Is it possible that this happens with any task that runs for a long time?

dimberman on 8 Sep 2020

Few new findings:

was able to reproduce this issue for task running more than a minute by setting visibility_timeout = 60 in [celery_broker_transport_options] in airflow.cfg file.
Therefore, this is happening because Celery expects the task to complete within an hour and if not assigns another worker for the task, during this transition, worker uploads the logs with "Task is not able to run" to S3
Can see another worker getting same task from logs Received task: airflow.executors.celery_executor.execute_command[b40cacbb-9dd3-4681-8454-0e1df2dbc910] with same id seconding that Celery is assigning this task to another worker.

Modifying "visibility_timeout = 86400 # 1day" in airflow.cfg doesn't resolve this issue and logs in UI are corrupted after an hour
Even tried "visibility_timeout = 7200 # 2 hours" in airflow.cfg but can still see this issue after an hour.
Seems the issue is similar to https://github.com/celery/celery/issues/5935, but according to this it should be resolved in Celery version 4.4.5 but, we still see the same issue even though Airflow 1.10.10 uses Celery version 4.4.6

( CC: @chrismclennon @dimberman )

raj-manvar on 10 Sep 2020

👍4 🎉3

Hi @raj-manvar Thank you for updating us! Yes this is similar to what we've been seeing. We have both seen that this is tied to the celery visibility_timeout and that increasing the timeout significantly seems to make this happen after an hour.

dimberman on 10 Sep 2020

Hmm That's unfortunate that the upgrade doesn't seem to have worked for us. I think we're going to need to bump that thread. @ashb is it possible that downgrading to 4.4.5 might fix this?

dimberman on 10 Sep 2020

Another ticket tied to this same bug https://github.com/celery/celery/issues/6229. This person also attempted 4.4.6.

dimberman on 10 Sep 2020

I tried with Celery version 4.4.5 as well but had the same issue.

raj-manvar on 11 Sep 2020

also tried with 4.4.0 and 4.4.3 but had similar issues.

rmanvar-indeed on 24 Sep 2020

We're experiencing the same issue across Operators/Sensors (Airflow 1.10.11, CeleryExecutor with Redis backend)

For the two jobs of the same task that gets enqueued an hour apart, the first job continues to run and the logs don't appear in the UI. The second job completes immediately because its dependencies (The first job enqueued which is still running is a dependency) are not met and you see "Task is not able to be run" in the UI logs.

If the first job fails or completes (success/fails), the task status gets updated accordingly and the logs will then be added to Airflow UI.

If the first job doesn't complete (noticed cases where the job just hangs and not sure how worker restarts impact this), the task attempt will remain in the running state. In this scenario, task execution timeout isn't honored, so the task can run for a really long time.

Unknowns -

Why is job being enqueued twice?
Why airflow isn't honoring task execution timeout in this scenario?

The latter unknown is causing issues for us as tasks end up running for hours, and we have to manually intervene and restart the task.

dennyac on 9 Oct 2020

Hi @dennyac this is tied to a Celery issue https://github.com/celery/celery/issues/6229

dimberman on 9 Oct 2020

@dennyac @rmanvar-indeed @chrismclennon Could y'all please +1 that ticket? Hopefully will catch the attention of a Celery maintainer. Note that one potential solution in 2.0 for this will be the ability to launch individual KubernetesExecutor tasks using the CeleryKubernetesExecutor.

dimberman on 9 Oct 2020

👍4

Upvoted that ticket. I understand that this is tied to a celery issue about repeated enqueueing. But there could be valid scenarios where the same job is enqueued more than once.

I'm concerned about how airflow handles this scenario as well. Especially about not honoring task execution timeout, determining which job to use to determine task state and logs.

Would be good to have a solution for non Kubernetes deployments as well, as this issue is more generic.

dennyac on 10 Oct 2020

The current work-around we have is a view plugin which fetches logs from the workers instead of S3.

raj-manvar on 10 Oct 2020

👍1

Thanks @raj-manvar. I'm actually more concerned about tasks being stuck in running state for prolonged periods (beyond the specified task execution timeout), which has caused delays and cascading failures. As of now, we just manually restart the task whenever we encounter this.

dennyac on 12 Oct 2020

👍2

We'll be making the update to 1.10.12 within the next ~month or so. If this issue is still under investigation by then I'll be sure to report back if the version update made any ch

Hey @chrismclennon , is there any update regarding this?