Apache Airflow version: 1.10.10
Kubernetes version (if you are using kubernetes) (use kubectl version): 1.17.2
Environment:
uname -a): Linux airflow-worker-0 5.6.13-1.el7.elrepo.x86_64 #1 SMP Thu May 14 08:05:24 EDT 2020 x86_64 x86_64 x86_64 GNU/LinuxWhat happened:
I run Airflow as a way to orchestrate jobs on Kubernetes using the KubernetesPodOperator. While most of the time logs appear correctly in the Airflow webserver, I do notice increasingly that the logs do not appear and instead just show a message, "Task is not able to be run", such as in the snippet below:
*** Reading remote log from s3://*****/****/******/2020-06-30T10:00:00+00:00/1.log.
[2020-06-30 23:07:40,362] {taskinstance.py:663} INFO - Dependencies not met for <TaskInstance: ****.***** 2020-06-30T10:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2020-06-30 23:07:40,363] {logging_mixin.py:112} INFO - [2020-06-30 23:07:40,363] {local_task_job.py:91} INFO - Task is not able to be run
Unusually, when I go check what is happening on the Kubernetes cluster, the pod is actually running and emitting logs when I run a kubectl logs command. When the pod is complete, Airflow will reflect that the task has completed as well.
What you expected to happen: I expected pod logs to be printed out.
How to reproduce it: Very unfortunately, I am unsure what circumstances cause this error and am currently trying to gather evidence to replicate.
Anything else we need to know:
Thanks for opening your first issue here! Be sure to follow the issue template!
Similar issue in #9626
I'm receiving the same message, when I'm running long task (duration more than 1 day) with KubernetesPodOperator.
{taskinstance.py:624} INFO - Dependencies not met for <TaskInstance: **** 2020-07-23T11:23:35+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
Task finishes successfully, but Airflow status is marked as FAILED and downstream tasks are not run
Hi @chrismclennon,
Two questions about this: 1. Are these happening to very long-running tasks? 2. Are these tied to scheduler restarts? It's possible this might be fixed in 1.10.12 as we added some fixes for long running tasks in the k8spodoperator
Hey @dimberman.
We'll be making the update to 1.10.12 within the next ~month or so. If this issue is still under investigation by then I'll be sure to report back if the version update made any changes.
Thanks!
@ashb @kaxil have you seen anything like this for any other super long running tasks? Is there a timeout on tasks that I should be aware of?
Possibly a long running task makes it more likely to hit a network blip speaking to the Kube API?
@chrismclennon @Pseverin just to check: Does the task in the airflow UI still show up as running, but the logs get funky? Like they're not put into "up_for_retry" or something? Also do all older logs disappear in s3?
@ashb what's strange here is that these logs would suggest that airflow is trying to restart the task while it is still in the running state. Is there a situation where the SchedulerJob would die because of an OOM or something causing the task to go back in the queue?
Hey @dimberman. The Airflow task does show as running -- it's not in any retry or error state. Older logs are not disappearing, as I'm able to pull logs from start of year. We don't have any sort of object deletion in our S3 bucket either. When the job does eventually complete, this log is replaced with the actual, correct log uploaded to S3. This only happens for tasks that are in the running state.
The best we've been able to do in terms of replicating this issue is to execute long running tasks and see this behaviour pop up. We're also still investigating what the underlying issue could be.
Thank you!
Ok yeah that's pretty consistent with the report we received from a customer (that the logs show up when the task completes, but for some reason stops showing up while the task is running).
@ashb could this have something to do with how we retrieve logs? Like maybe airflow is attempting to do a retry while the original task is running and the task logs are picking that up?
@chrismclennon are you seeing a different try_number when it fails to retrieve logs?
@dimberman Is there a good way to check whether it uses a different try_number on retrieving logs? I just pulled up an instance that is showing this error. The log looks like this:
*** Reading remote log from s3://***/***/***/***/2020-09-01T09:00:00+00:00/1.log.
[2020-09-02 10:06:39,304] {taskinstance.py:663} INFO - Dependencies not met for <TaskInstance: ****.**** 2020-09-01T09:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2020-09-02 10:06:39,501] {logging_mixin.py:112} INFO - [2020-09-02 10:06:39,402] {local_task_job.py:91} INFO - Task is not able to be run
It looks like it's trying to read the remote log from S3 on try_number=1 (judging from 1.log)(?), which is the current attempt number. Interestingly, I expected that since the task is in the running state, it would pull from the logs being served on port 8793.
It might be noteworthy to add that I'm running all airflow components on a Kubernetes cluster. To allow communication with port 8793, I've deployed the workers as a StatefulSet. This is what the header of a normal running log looks like:
*** Log file does not exist: /var/local/airflow/logs/****/****/2020-09-03T16:00:00+00:00/1.log
*** Fetching from: http://airflow-worker-1.airflow-worker.airflow-core--prod.svc.cluster.local:8793/log/***/***/2020-09-03T16:00:00+00:00/1.log
[2020-09-03 18:12:28,965] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: ***.*** 2020-09-03T16:00:00+00:00 [queued]>
[2020-09-03 18:12:28,981] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: ***.*** 2020-09-03T16:00:00+00:00 [queued]>
[2020-09-03 18:12:28,981] {taskinstance.py:879} INFO -
--------------------------------------------------------------------------------
[2020-09-03 18:12:28,983] {taskinstance.py:880} INFO - Starting attempt 1 of 1
[2020-09-03 18:12:28,983] {taskinstance.py:881} INFO -
--------------------------------------------------------------------------------
... actual runtime logs are populated from here ...
The workers aren't restarting when we see this issue since we make use of CeleryExecutor, so I don't believe the worker deployment itself should be the culprit here.
@chrismclennon does this happen when you are only running a single worker? I'd be interested if perhaps a second worker is pulling the task while the first worker is working on it, which would create a race condition. Also you mentioned that all logs appear when the task completes, are any of the dependency failures in the logs?
Hey @dimberman. This still happens when we are running a single worker, so I don't believe it's a race condition between workers. I don't notice any dependency failure in the completed logs loaded in S3.
Some other interesting observations we've made:
So it seems like the current question on our mind is why the worker is uploading this bad log to S3 in the first place.
@chrismclennon Do you have any non K8sPodOperator tasks that run for over an hour? Is it possible that this happens with any task that runs for a long time?
Few new findings:
visibility_timeout = 60 in [celery_broker_transport_options] in airflow.cfg file. Received task: airflow.executors.celery_executor.execute_command[b40cacbb-9dd3-4681-8454-0e1df2dbc910] with same id seconding that Celery is assigning this task to another worker.( CC: @chrismclennon @dimberman )
Hi @raj-manvar Thank you for updating us! Yes this is similar to what we've been seeing. We have both seen that this is tied to the celery visibility_timeout and that increasing the timeout significantly seems to make this happen after an hour.
Hmm That's unfortunate that the upgrade doesn't seem to have worked for us. I think we're going to need to bump that thread. @ashb is it possible that downgrading to 4.4.5 might fix this?
Another ticket tied to this same bug https://github.com/celery/celery/issues/6229. This person also attempted 4.4.6.
I tried with Celery version 4.4.5 as well but had the same issue.
also tried with 4.4.0 and 4.4.3 but had similar issues.
We're experiencing the same issue across Operators/Sensors (Airflow 1.10.11, CeleryExecutor with Redis backend)
For the two jobs of the same task that gets enqueued an hour apart, the first job continues to run and the logs don't appear in the UI. The second job completes immediately because its dependencies (The first job enqueued which is still running is a dependency) are not met and you see "Task is not able to be run" in the UI logs.
If the first job fails or completes (success/fails), the task status gets updated accordingly and the logs will then be added to Airflow UI.
If the first job doesn't complete (noticed cases where the job just hangs and not sure how worker restarts impact this), the task attempt will remain in the running state. In this scenario, task execution timeout isn't honored, so the task can run for a really long time.
Unknowns -
The latter unknown is causing issues for us as tasks end up running for hours, and we have to manually intervene and restart the task.
Hi @dennyac this is tied to a Celery issue https://github.com/celery/celery/issues/6229
@dennyac @rmanvar-indeed @chrismclennon Could y'all please +1 that ticket? Hopefully will catch the attention of a Celery maintainer. Note that one potential solution in 2.0 for this will be the ability to launch individual KubernetesExecutor tasks using the CeleryKubernetesExecutor.
Upvoted that ticket. I understand that this is tied to a celery issue about repeated enqueueing. But there could be valid scenarios where the same job is enqueued more than once.
I'm concerned about how airflow handles this scenario as well. Especially about not honoring task execution timeout, determining which job to use to determine task state and logs.
Would be good to have a solution for non Kubernetes deployments as well, as this issue is more generic.
The current work-around we have is a view plugin which fetches logs from the workers instead of S3.
Thanks @raj-manvar. I'm actually more concerned about tasks being stuck in running state for prolonged periods (beyond the specified task execution timeout), which has caused delays and cascading failures. As of now, we just manually restart the task whenever we encounter this.
We'll be making the update to 1.10.12 within the next ~month or so. If this issue is still under investigation by then I'll be sure to report back if the version update made any ch
Hey @chrismclennon , is there any update regarding this?
Most helpful comment
Few new findings:
visibility_timeout = 60in [celery_broker_transport_options] in airflow.cfg file.Received task: airflow.executors.celery_executor.execute_command[b40cacbb-9dd3-4681-8454-0e1df2dbc910]with same id seconding that Celery is assigning this task to another worker.( CC: @chrismclennon @dimberman )