When submitting a pipeline I sporadically get a timeout.
# Run the pipeline on Kubeflow cluster
pipeline_run = (
    kfp
    .Client(host=f'{host}/pipeline', cookies=cookies)
    .create_run_from_pipeline_func(
        pipeline,
        arguments={},
        experiment_name=experiment_name,
        namespace=namespace,
        run_name=pipeline_name
    )
)
/opt/conda/lib/python3.7/site-packages/kfp_server_api/rest.py in request(self, method, url, query_params, headers, body, post_params, _preload_content, _request_timeout)
236
237 if not 200 <= r.status <= 299:
--> 238 raise ApiException(http_resp=r)
239
240 return r
ApiException: (504)
Reason: Gateway Timeout
HTTP response headers: HTTPHeaderDict({'content-length': '24', 'content-type': 'text/plain', 'date': 'Mon, 24 Aug 2020 20:24:57 GMT', 'server': 'envoy', 'x-envoy-upstream-service-time': '300028'})
HTTP response body: upstream request timeout
I0824 20:20:00.099206 6 util.go:396] Authorized user e23e799e-de9b-4388-99e7-8efe8ab6c072 in namespace e23e799e-de9b-4388-99e7-8efe8ab6c072
I0824 20:21:58.067818 6 interceptor.go:29] /api.RunService/CreateRun handler starting
I0824 20:21:58.753968 6 util.go:396] Authorized user e23e799e-de9b-4388-99e7-8efe8ab6c072 in namespace e23e799e-de9b-4388-99e7-8efe8ab6c072
I0824 20:23:58.117694 6 interceptor.go:29] /api.RunService/CreateRun handler starting
I0824 20:23:58.798816 6 util.go:396] Authorized user e23e799e-de9b-4388-99e7-8efe8ab6c072 in namespace e23e799e-de9b-4388-99e7-8efe8ab6c072
We know the Azure platform has issues with the Kubernetes API that require tuning the TCP keep-alive settings in applications, so perhaps that could be a solution here.
/kind bug
For context, we had to modify the Jupyter Web API on Kubeflow to avoid timeouts by adding the code below before instantiating the Kubernetes API client. This also solved issues with other applications relying on that API, such as Airflow and JupyterHub. We are wondering whether this will also be an issue with KFP.
import socket
from urllib3 import connection

# workaround for azure load balancer issue
connection.HTTPConnection.default_socket_options += [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60),
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),
]
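A slightly more defensive variant of the workaround above might look like the sketch below. It is only an illustration, not the code we actually deployed: the helper names `keepalive_socket_options` and `add_keepalive_options` are made up here, and the defaults simply mirror the 60/60/3 values above. It skips the `TCP_KEEP*` constants on platforms where the `socket` module does not define them, and it avoids appending the same option twice if it is called more than once.

```python
import socket


def keepalive_socket_options(idle=60, interval=60, count=3):
    """Build TCP keep-alive socket options, skipping constants this OS lacks."""
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    wanted = (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the connection is dropped
    )
    for name, value in wanted:
        # Not every platform defines all three constants in the socket module.
        opt = getattr(socket, name, None)
        if opt is not None:
            options.append((socket.IPPROTO_TCP, opt, value))
    return options


def add_keepalive_options(defaults, **kwargs):
    """Append keep-alive options to an existing option list, without duplicates."""
    for option in keepalive_socket_options(**kwargs):
        if option not in defaults:
            defaults.append(option)
    return defaults


# Intended usage, before creating the kubernetes api client (requires urllib3):
#   from urllib3 import connection
#   add_keepalive_options(connection.HTTPConnection.default_socket_options)
```

Passing the list in explicitly keeps the helper independent of urllib3, so it can be checked on its own before being applied to `HTTPConnection.default_socket_options`.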
@maganaluis What version / configuration of AKS are you using?
Do you experience the same problem on KFP 1.1?
@dtzar We are using AKS with Kubernetes version 1.16.10
I don't think there is a KFP 1.1 (Kubeflow Pipelines); we are using the latest KFP image tag. We are, however, using Kubeflow version 1.1.
I just reviewed the latest logs and, as I mentioned, this could be related to timeouts on the Argo API rather than KFP.
I0824 20:35:44.840283 6 error.go:218] Post https://10.0.0.1:443/apis/argoproj.io/v1alpha1/namespaces/e23e799e-de9b-4388-99e7-8efe8ab6c072/workflows: unexpected EOF
I'm attaching the logs for reference.
I'm curious whether you would have the same problem if you used our sample repo install process (for KFP and/or AKS, which uses K8s 1.18.x and some other recent AKS features). I do know that the 1.1 RC manifests use a very old Argo workflow controller, version 2.3.0, which has a bunch of problems.
/cc @Bobgy @rmgogogo
@dtzar I can definitely try that. I will start by running a k8s API test on version 1.18.x; I'm hoping that version allows the job below to pass. Modifying the Argo installation will take some time, but we can try that as well.
https://github.com/maganaluis/k8s-api-job/blob/master/job.py
Update 08/25/2020:
That job still times out on 1.18.x, so we will update Argo on our dev cluster.
@dtzar I upgraded Argo to 2.10.0. Interestingly, the only thing I had to change in the cluster-wide installation was the workflow-controller-configmap, and Kubeflow Pipelines worked as expected. However, the timeouts are still there and they are getting more consistent; maybe I'm going crazy here. Could you share the configuration you're using? VM types, maybe? Istio version?
So it turns out this was an issue with AKS. I developed a simple Kubernetes API job to prove this:
https://github.com/maganaluis/k8s-api-golang/blob/time-out/analysis.md
The job uses just about the same library versions as the Kubeflow Pipelines API, and it uses the Argo client to submit a small workflow, mimicking what KFP does here. The job stays idle for 5 minutes between submissions, which is normal behavior: you don't expect the API to be in use 100 percent of the time.
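The idle-then-submit pattern described above can be sketched roughly as follows. This is not the actual job from the linked repository, just a minimal illustration of the idea: `run_idle_probe` and its `submit` argument are hypothetical names, where `submit` stands for any call that reuses a long-lived Kubernetes API client (for example, creating a small Argo workflow). The `sleep` parameter is injectable so the loop can be exercised without really waiting 5 minutes.

```python
import time


def run_idle_probe(submit, rounds=3, idle_seconds=300, sleep=time.sleep):
    """Call submit() repeatedly with an idle gap in between and record outcomes."""
    results = []
    for i in range(rounds):
        if i > 0:
            # Leave the connection idle between submissions, as the test job does.
            sleep(idle_seconds)
        started = time.monotonic()
        try:
            submit()
            error = None
        except Exception as exc:
            # Record the failure instead of aborting, so later rounds still run.
            error = repr(exc)
        results.append({
            "round": i,
            "seconds": time.monotonic() - started,
            "error": error,
        })
    return results
```

If the idle connection is being dropped somewhere along the path, the expectation is that rounds after the first either fail or take far longer than the first one.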
If you run this job without the Istio sidecar enabled, the job will complete, because the TCP keep-alive settings in Go are fairly robust. The Istio sidecar, however, alters these settings and probably uses the Linux defaults.
This is not to say that the fault lies with the Istio sidecar: we ran the same job on AWS and GCP, and it passed without timeouts or dropped connections. Regardless, the sidecar is required for multi-user mode, so it must be enabled.
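Istio provides quite a powerful tool that lets you set the TCP keep-alive settings for connections going to any service in Kubernetes. You can read more about it here:
https://preliminary.istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings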
So the solution here is to set up a destination rule for the Kubernetes API:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: kubernetes-api
spec:
  host: "kubernetes.default.svc.cluster.local"
  trafficPolicy:
    connectionPool:
      tcp:
        connectTimeout: 10s
        tcpKeepalive:
          time: 75s
          interval: 75s
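Since the keep-alive values may need tuning per platform, it can help to generate the rule rather than hand-edit it. The sketch below builds the same manifest as a Python dict and prints it as JSON, which `kubectl apply -f` also accepts. The function name `kubernetes_api_destination_rule` and its parameters are made up for this example; the defaults are just the values from the rule above.

```python
import json


def kubernetes_api_destination_rule(keepalive_time="75s",
                                    keepalive_interval="75s",
                                    connect_timeout="10s"):
    """Return the DestinationRule for the Kubernetes API as a plain dict."""
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "DestinationRule",
        "metadata": {"name": "kubernetes-api"},
        "spec": {
            "host": "kubernetes.default.svc.cluster.local",
            "trafficPolicy": {
                "connectionPool": {
                    "tcp": {
                        "connectTimeout": connect_timeout,
                        "tcpKeepalive": {
                            "time": keepalive_time,          # idle time before probing
                            "interval": keepalive_interval,  # time between probes
                        },
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    # Print the manifest as JSON so it can be saved to a file and applied.
    print(json.dumps(kubernetes_api_destination_rule(), indent=2))
```

The namespace is left out on purpose, as in the rule above, so it is whatever namespace the manifest is applied to.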
I would like to see this rule at the Kubeflow installation level to ensure Kubeflow works on any cloud platform. This will solve timeouts not only with Kubeflow Pipelines but also with any other service that uses the Kubernetes API. Anyway, I'm leaving this here in case anyone else stumbles upon the same issue.
@Ark-kun @Bobgy @rmgogogo @dtzar
@maganaluis thanks for the investigation!
I'm okay with adding the destination rule. It doesn't seem harmful to other platforms. Would you mind opening a PR for it?
@Bobgy Sounds good, I'll make the PR. We can always tune the settings on the tcp keep alive to ensure it works across all platforms.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
The PR is currently being reviewed so there should be an update here soon.