Apache Airflow version: 1.10.11
Kubernetes version (if you are using kubernetes) (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-065dce", GitCommit:"065dcecfcd2a91bd68a17ee0b5e895088430bd05", GitTreeState:"clean", BuildDate:"2020-07-16T01:44:47Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
What happened:
We've been seeing occasional issues in our logs where the Kubernetes executor throws an API exception on this stream call:
[2020-10-25 15:59:15,636] {{kubernetes_executor.py:277}} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 271, in run
    self.worker_uuid, self.kube_config)
  File "/usr/local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 299, in _run
    **kwargs):
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 177, in stream
    status=obj['code'], reason=reason)
kubernetes.client.exceptions.ApiException: (410)
Reason: Gone: too old resource version: 46672510 (46702381)
A 410 is a normal response (Airflow already handles it in the process_error method when it arrives as an event), so the exception should be handled gracefully too, probably the same way the event is: by catching it and resetting self.resource_version.
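To illustrate the handling suggested above, here is a minimal, self-contained sketch of a watch loop that catches a 410 and restarts from a reset resource version. It is not Airflow's code: `FakeApiException` is a stand-in for `kubernetes.client.exceptions.ApiException` (only its `status` attribute is assumed), `watch_with_reset` and `flaky_stream` are hypothetical names, and resetting to "0" is an assumption about what a sensible reset value would be.

```python
class FakeApiException(Exception):
    """Stand-in for kubernetes.client.exceptions.ApiException (assumption)."""

    def __init__(self, status, reason=""):
        super().__init__(f"({status}) {reason}")
        self.status = status
        self.reason = reason


def watch_with_reset(stream, resource_version, max_restarts=3):
    """Consume events from stream(resource_version), restarting on a 410.

    `stream` is a hypothetical callable that returns an iterable of
    (event_type, resource_version) tuples and may raise FakeApiException.
    """
    seen = []
    restarts = 0
    while True:
        try:
            for event_type, version in stream(resource_version):
                seen.append(event_type)
                resource_version = version  # remember the last version seen
            return seen, resource_version
        except FakeApiException as exc:
            # Only a 410 ("Gone") is recoverable; re-raise anything else,
            # and give up after too many restarts to avoid looping forever.
            if exc.status != 410 or restarts >= max_restarts:
                raise
            restarts += 1
            resource_version = "0"  # assumed reset value


def flaky_stream(resource_version):
    """Hypothetical stream that rejects the stale version from the log above."""
    if resource_version == "46672510":
        raise FakeApiException(
            410, "Gone: too old resource version: 46672510 (46702381)"
        )
    yield ("ADDED", "46702382")
    yield ("MODIFIED", "46702383")


events, last_version = watch_with_reset(flaky_stream, "46672510")
print(events, last_version)
```

The restart cap is a design choice for the sketch: without it, a server that keeps answering 410 would spin the watcher indefinitely.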
Anything else we need to know:
This seems to be triggered by having very long-running (multiple days old) task pods in our system. These aren't normal operations, but were the result of some deadlocking bugs.
We also encountered this issue. It turns out the root cause was the newest release of the k8s Python client, https://github.com/kubernetes-client/python/releases/tag/v12.0.0. Code was added to handle the 410 status code and raise an Exception in this PR. In Airflow, however, the KubernetesJobWatcher expects an event, which it would then handle gracefully by resetting the resource_version number in process_error. It never actually gets to that point in the code. As you can see, the Exception thrown is from here.
Our workaround was to explicitly use the previous version of the k8s Python client, by installing the appropriate constrained ("known-to-be-working") versions of Airflow and its libraries.
pip install \
  apache-airflow[kubernetes]==1.10.12 \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-1.10.12/constraints-3.7.txt"
So, I am not sure what the apache-airflow is, since we install the kubernetes Python library. Should I just change the pin from kubernetes==12.0.0 to kubernetes==11.0.0 (pip install kubernetes==11.0.0)?
As a community, we heartily recommend using the official constraints of Airflow to install it.
You can see the constraints described in https://airflow.apache.org/docs/stable/installation.html (if you just care about the user story) as well as some details on how and why it works in https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pinned-constraint-files
Those constraint files contain a set of "known to be working" versions for Airflow. They are upgraded automatically by our test harness when they pass the tests and are consistent with other limitations. While we cannot stop you from upgrading, using the versions from the constraints is the safest way to proceed. We are just about to release a 1.10.14 bugfix release, and we are also upgrading the constraints there.
As pointed out by @alaiou, you can use the --constraint from GitHub, or download the constraint file and use it locally. At your own risk, you can also modify it and use other versions. You can also try the latest 1-10 version of the constraints (candidate for 1.10.14): just specify constraints-1-10 instead of the full version:
pip install \
  apache-airflow[kubernetes]==1.10.12 \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-1-10/constraints-3.7.txt"
And while we can provide that "known to be working" set of versions, if you have your own libraries/requirements you can modify them yourself. In the upcoming version we will make sure that the constraints are fully consistent (so 'pip check' does not complain when you install all dependencies), and we recommend you do the same in your installation.
BTW, in both 2.0.0beta (constraints-master branch) and the upcoming 1.10.14 (constraints-1-10 branch), the kubernetes version is set to 11.0.0.
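If you want to confirm which client version actually ended up in your environment, a small check along these lines may help. It is only a sketch: `client_is_affected` is a hypothetical helper, it assumes that every release from 12 onwards behaves like 12.0.0 (this thread only confirms 12.0.0 as failing and 11.0.0 as working), and the lookup relies on `importlib.metadata`, which needs Python 3.8 or newer.

```python
def client_is_affected(version: str) -> bool:
    """Return True if the kubernetes client major version is 12 or newer.

    Assumption: all releases >= 12 raise on a 410 like 12.0.0 does.
    """
    major = int(version.split(".")[0])
    return major >= 12


try:
    # importlib.metadata is only available on Python 3.8+
    from importlib.metadata import PackageNotFoundError
    from importlib.metadata import version as installed_version
except ImportError:
    installed_version = None

if installed_version is not None:
    try:
        found = installed_version("kubernetes")
        status = "affected" if client_is_affected(found) else "not affected"
        print(f"kubernetes client {found}: {status}")
    except PackageNotFoundError:
        print("kubernetes client is not installed")
```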
I am closing the issue, as it is clearly caused by a newer version of the kubernetes Python client that is not supported.