Airflow: faq.rst provides incorrect instructions for reducing scheduling latency

Created on 13 Nov 2020 · 7 comments · Source: apache/airflow

Regarding the FAQ entry "How to reduce airflow dag scheduling latency in production", per https://github.com/apache/airflow/blame/f097ae39a7243bd25d4d26664bc259981b2ba217/docs/faq.rst#L209:

User should consider to increase scheduler_heartbeat_sec config to a higher value (e.g 60 secs) which controls how frequent the airflow scheduler gets the heartbeat and updates the job's entry in database.

However, airflow/jobs/scheduler_job.py uses scheduler_heartbeat_sec as a duration (seconds between heartbeats), not as a rate (heartbeats per second). Increasing it from the default of 5 seconds to 60 would therefore make scheduling more sluggish, increasing latency rather than reducing it.
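To make the duration-versus-rate point concrete, here is a minimal arithmetic sketch. It is not Airflow code: the helper name `heartbeats_per_minute` is hypothetical, and it only assumes that the setting is read as "seconds between heartbeats". It says nothing about whether the setting actually affects scheduling latency in a given Airflow version.

```python
def heartbeats_per_minute(interval_sec: float) -> float:
    """Return how many heartbeats fit into 60 seconds when one
    heartbeat is emitted every `interval_sec` seconds.

    Hypothetical helper for illustration only; not part of Airflow.
    """
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    return 60.0 / interval_sec


# Compare the default mentioned in the issue (5) with the FAQ's suggestion (60).
for interval in (5, 60):
    rate = heartbeats_per_minute(interval)
    print(f"scheduler_heartbeat_sec={interval}: {rate:g} heartbeats per minute")
```

Read as an interval, a larger value means fewer heartbeats per minute, which is the opposite of what the FAQ wording implies.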

docs bug

All 7 comments

@vitaly-krugl I think you are right. @ashb ?

Yeah, and also in 2.0 that is basically not true anymore. Don't think it was true for 1.10.4+ either!

I am running Airflow 1.10.11 and just found empirically that scheduler_heartbeat_sec has no effect on latency.

I just examined BaseJob.heartbeat() and see that it only updates the job's heartbeat timestamp in the db and checks for SHUTDOWN.

Hi @ashb - The reason I was looking at scheduler_heartbeat_sec is that I am trying to improve the performance of system-level tests in my airflow-based app.

What I am seeing is that, even under very low utilization, Airflow adds 4-5 seconds of latency to the execution of each task. I looked for airflow.cfg options to tune in order to eliminate this per-task latency on my testing setup, but I haven't found any combination of options that reduces it below 4-5 seconds.

Any suggestions about how to eliminate these Airflow latencies for my testbed?

My test setup: one DAG with two tasks, Task A and Task B, with an A >> B relationship. Both are implemented in Python. Each Python callback does minimal (almost no-op) work, which the logs show taking < 1 sec. However, each dagrun takes upwards of 9 seconds, and I observe from the timestamps (dagrun, Task A start/end and Task B start/end) that there are 3-5 second gaps between dagrun and Task A start, as well as between Task A end and Task B start.
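For anyone wanting to quantify gaps like these in their own runs, here is a minimal sketch of the arithmetic using only the standard library. The function name `gaps_seconds` and the timestamps are hypothetical: the values are made up to resemble the pattern described above and are not measurements from this setup.

```python
from datetime import datetime


def gaps_seconds(events):
    """Given (label, datetime) pairs in chronological order, return
    (from_label, to_label, seconds) for each consecutive pair."""
    return [
        (a_label, b_label, (b_time - a_time).total_seconds())
        for (a_label, a_time), (b_label, b_time) in zip(events, events[1:])
    ]


# Made-up timestamps for illustration only (not real measurements).
events = [
    ("dagrun start", datetime(2020, 11, 13, 12, 0, 0)),
    ("Task A start", datetime(2020, 11, 13, 12, 0, 4)),
    ("Task A end", datetime(2020, 11, 13, 12, 0, 4, 500000)),
    ("Task B start", datetime(2020, 11, 13, 12, 0, 8, 500000)),
    ("Task B end", datetime(2020, 11, 13, 12, 0, 9)),
]

for start, end, seconds in gaps_seconds(events):
    print(f"{start} -> {end}: {seconds:.1f}s")
```

Separating the "start to end" intervals from the "end to next start" intervals shows how much of the dagrun duration is task work and how much is scheduling overhead.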

Yes. The task lag was fixed by AIP-15 and in my tests the delay between tasks is down to 0.18s

Try out 2.0.0beta2?

Thanks @ashb. What is "AIP-15"? Is the latency fix available only in 2.0? Not in 1.x.x?

AIP = Airflow Improvement Proposal.

The most important bit of AIP-15 was this https://github.com/apache/airflow/pull/10956

And given how big a change that was, no, it's not available in 1.10.x, sorry
