Apache Airflow version: v2.0.0a1 (latest master)
Environment:
What happened:
Trying to get a task log via the task instance list (http://localhost:8080/taskinstance/list/) yields an error saying that the ServiceAccount airflow-webserver does not have permission to list pods/log.
*** Trying to get logs (last 100 lines) from worker pod ***
*** Unable to fetch logs from worker pod ***
(403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Tue, 20 Oct 2020 16:36:31 GMT', 'Content-Length': '296'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods is forbidden: User \\"system:serviceaccount:airflow:airflow-webserver\\" cannot list resource \\"pods/log\\" in API group \\"\\" in the namespace \\"airflow\\"","reason":"Forbidden","details":{"kind":"pods"},"code":403}\n'
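The response body above is a Kubernetes Status object printed as a Python bytes literal, which makes the escaped message hard to read. As a small sketch using only the standard library (the variable names are illustrative), it can be decoded like this:

```python
import json

# Body copied from the 403 response above. Inside a Python bytes literal,
# \\" is a backslash followed by a quote, i.e. an escaped quote in the JSON.
body = b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods is forbidden: User \\"system:serviceaccount:airflow:airflow-webserver\\" cannot list resource \\"pods/log\\" in API group \\"\\" in the namespace \\"airflow\\"","reason":"Forbidden","details":{"kind":"pods"},"code":403}\n'

status = json.loads(body)

print(status["reason"], status["code"])  # Forbidden 403
print(status["message"])
```

The decoded message names the three things that matter for RBAC here: the subject (system:serviceaccount:airflow:airflow-webserver), the verb (list) and the resource (pods/log in the core API group).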
How to reproduce it:
I created a Kubernetes cluster using kubeadm and added Flannel as the pod network. Afterwards I built the Airflow production image via breeze, then deployed it to the cluster via helm (mounting DAGs from an externally populated PVC).
$~ ./breeze build-image --production-image
$~ helm install airflow . \
    --namespace airflow \
    --set dags.persistence.enabled=true \
    --set dags.persistence.existingClaim=my-hostPath-claim \
    --set dags.gitSync.enabled=false \
    --set uid=1000 \
    --set gid=1000 \
    --set executor=KubernetesExecutor \
    --set images.airflow.tag=master-python3.6
$~ kubectl get pods -n airflow
NAME                                 READY   STATUS    RESTARTS   AGE
airflow-postgresql-0                 1/1     Running   0          75m
airflow-scheduler-6df9cf9855-4xzd4   2/2     Running   0          75m
airflow-statsd-5556dc96bc-zdtjp      1/1     Running   0          75m
airflow-webserver-dc8c746b7-9wqlh    1/1     Running   0          75m
I triggered a simple DAG. Also posting it here for completeness.
DAG file
from airflow import DAG
from datetime import timedelta, datetime
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    'simple_dag',
    default_args={
        'owner': 'airflow',
        'depends_on_past': False,
        'retries': 0,
        'start_date': datetime(1970, 1, 1),
        'retry_delay': timedelta(seconds=30),
    },
    description='',
    schedule_interval=None,
    catchup=False,
)

t1 = BashOperator(
    task_id='task1',
    bash_command='echo 1',
    dag=dag
)
Possible solution:
Checking airflow/chart/templates/rbac/pod-launcher-rolebinding.yaml, I can confirm that the ServiceAccount airflow-webserver is not granted the airflow-pod-launcher-role permissions it needs (as the error states). I also think airflow/chart/templates/rbac/pod-launcher-role.yaml needs the "list" verb for the "pods/log" resource. Applying both changes gets rid of the error but yields a different one, shown below. Should I nevertheless add these changes to the chart templates?
*** Trying to get logs (last 100 lines) from worker pod ***
*** Unable to fetch logs from worker pod ***
(400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 20 Oct 2020 16:29:32 GMT', 'Content-Length': '136'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"name must be provided","reason":"BadRequest","code":400}\n'
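To illustrate why the "list" verb on "pods/log" matters, here is a deliberately simplified model of how RBAC rules are matched. It is only a sketch: the function is_allowed and both rule sets are hypothetical and not copied from the chart, and the model ignores resourceNames, nonResourceURLs and the role binding itself (the ServiceAccount still has to be a subject of a binding for any rule to apply).

```python
from typing import Dict, List

Rule = Dict[str, List[str]]


def is_allowed(rules: List[Rule], api_group: str, resource: str, verb: str) -> bool:
    """Return True if any single rule matches the API group, resource and verb.

    "*" in a rule field matches any value. A subresource such as "pods/log"
    is treated as its own resource name: a rule on "pods" does not cover it.
    """
    def matches(allowed: List[str], value: str) -> bool:
        return "*" in allowed or value in allowed

    return any(
        matches(rule.get("apiGroups", []), api_group)
        and matches(rule.get("resources", []), resource)
        and matches(rule.get("verbs", []), verb)
        for rule in rules
    )


# Hypothetical rule sets for illustration only.
rules_without_list = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["create", "get", "list", "watch", "delete"]},
    {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
]
rules_with_list = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["create", "get", "list", "watch", "delete"]},
    {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get", "list"]},
]

print(is_allowed(rules_without_list, "", "pods/log", "list"))  # False
print(is_allowed(rules_with_list, "", "pods/log", "list"))     # True
```

Note that "list" on "pods" alone is not enough in this model, which mirrors the wording of the 403: the denied resource is "pods/log", not "pods".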
@msumit Can you help with it? I see that you added this feature to Airflow, but you probably forgot about the documentation that describes the required permissions.
@FloChehab Can you look at it also?
Sure, I'll have a look at this; give me 12h :)
@mik-laj So, I can confirm / reproduce the issue.
I guess it's the webserver that is trying to fetch the logs and not the scheduler.
From what I can see in the chart:
So everything works "as expected" and the chart would need to be updated a bit.
Thanks for checking this out. As described I edited the helm chart to grant the permissions to airflow-webserver which solved the permissions issue for fetching the logs but led to the other error described.
Sorry, I forgot to read the second part of the description. Judging by the second error, I guess there might be a bug in Airflow itself. I would have expected a 404 if the pod had been deleted, but here it's a 400 stating that the pod name is missing from the request.
Some more info on this:
I think #11729 should fix the issue for access to the pod logs (I'm not a Kubernetes expert though).
The other error I mentioned (the 400 when trying to fetch the logs) was related to setting a different uid/gid in the helm install. The worker pods were launched with that uid, so the default user was not airflow. I got this log from the worker pod:
$~ kubectl logs simpledagtask1-12f123a06db04f9684628ff0dedd96cb -n airflow
Traceback (most recent call last):
File "/home/airflow/.local/bin/airflow", line 5, in <module>
from airflow.__main__ import main
ModuleNotFoundError: No module named 'airflow'
After removing the --set uid=1000 from the helm install I could launch worker pods and read logs.
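A plausible explanation for the traceback, stated here as an assumption rather than something verified in this thread: the entry point lives under /home/airflow/.local, which suggests the package was installed into the airflow user's per-user site directory, and a process running under a different uid resolves a different home and per-user site directory. A small diagnostic sketch that could be run inside a worker container to check this (the function name describe_python_user is made up for illustration, and os.getuid is only available on Unix):

```python
import importlib.util
import os
import site


def describe_python_user() -> dict:
    """Collect the values that decide whether user-installed packages are visible."""
    return {
        "uid": os.getuid(),
        "home": os.path.expanduser("~"),
        "user_site": site.getusersitepackages(),
        "user_site_enabled": site.ENABLE_USER_SITE,
        # find_spec returns None (no exception) when a top-level module is missing.
        "airflow_importable": importlib.util.find_spec("airflow") is not None,
    }


if __name__ == "__main__":
    for key, value in describe_python_user().items():
        print(f"{key}: {value}")
```

If user_site does not point below /home/airflow/.local for the uid the pod runs as, that would be consistent with the ModuleNotFoundError above.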
@grepthat we all learn the hard way "once" that we must set the uid / gid to be consistent with the user in the docker image haha. If you want to play with that, the airflow image can be parametrized in that regard.
The web UI and scheduler pods should run with a ServiceAccount that has RBAC permissions on the k8s cluster to get logs:
spec.template.spec.serviceAccount: airflow
apiVersion: v1
kind: ServiceAccount
metadata:
  name: airflow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow
rules:
  - apiGroups: [""]
    resources: [pods]
    verbs: [create, get, delete, list, watch]
  - apiGroups: [""]
    resources: [pods/log]
    verbs: [get, list]
  - apiGroups: [""]
    resources: [pods/exec]
    verbs: [create, get]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: airflow
subjects:
  - kind: ServiceAccount
    name: airflow