Pipelines: Kubeflow Pipelines timing out on Azure deployment

Created on 24 Aug 2020 · 13 Comments · Source: kubeflow/pipelines

What steps did you take:

Run an in-house Kubeflow notebook template.
Get a timeout when running the pipeline.

What happened:

When submitting a pipeline I get a timeout. This has been happening sporadically.

# Run the pipeline on Kubeflow cluster
pipeline_run = (
    kfp
    .Client(host=f'{host}/pipeline', cookies=cookies)
    .create_run_from_pipeline_func(
        pipeline,
        arguments={},
        experiment_name=experiment_name,
        namespace=namespace,
        run_name=pipeline_name
    )
)

/opt/conda/lib/python3.7/site-packages/kfp_server_api/rest.py in request(self, method, url, query_params, headers, body, post_params, _preload_content, _request_timeout)
    236 
    237         if not 200 <= r.status <= 299:
--> 238             raise ApiException(http_resp=r)
    239 
    240         return r

ApiException: (504)
Reason: Gateway Timeout
HTTP response headers: HTTPHeaderDict({'content-length': '24', 'content-type': 'text/plain', 'date': 'Mon, 24 Aug 2020 20:24:57 GMT', 'server': 'envoy', 'x-envoy-upstream-service-time': '300028'})
HTTP response body: upstream request timeout



I0824 20:20:00.099206       6 util.go:396] Authorized user e23e799e-de9b-4388-99e7-8efe8ab6c072 in namespace e23e799e-de9b-4388-99e7-8efe8ab6c072
I0824 20:21:58.067818       6 interceptor.go:29] /api.RunService/CreateRun handler starting
I0824 20:21:58.753968       6 util.go:396] Authorized user e23e799e-de9b-4388-99e7-8efe8ab6c072 in namespace e23e799e-de9b-4388-99e7-8efe8ab6c072
I0824 20:23:58.117694       6 interceptor.go:29] /api.RunService/CreateRun handler starting
I0824 20:23:58.798816       6 util.go:396] Authorized user e23e799e-de9b-4388-99e7-8efe8ab6c072 in namespace e23e799e-de9b-4388-99e7-8efe8ab6c072

We know the Azure platform has issues with the Kubernetes API and that applications need their TCP keepalive settings tweaked, so perhaps that could be a solution here.

/kind bug

maganaluis

Most helpful comment

It turns out this was an issue with AKS. I developed a simple Kubernetes API job to demonstrate it:

https://github.com/maganaluis/k8s-api-golang/blob/time-out/analysis.md

The job has just about all the same library versions that the Kubeflow Pipelines API has, and it uses the Argo client to submit a small workflow, mimicking what KFP does here. The job stays idle for 5 minutes between submissions, which is normal behavior; you don't expect the API to be in use 100 percent of the time.

If you run this job without the Istio sidecar enabled, the job completes, because the TCP keepalive settings in Go are fairly robust. However, the Istio sidecar alters these settings and probably uses the Linux defaults.
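
To make the timing argument concrete, here is a minimal sketch in Python (it is not taken from the linked analysis). It only checks whether a first keepalive probe would go out before an idle period ends. The 5-minute idle gap comes from the job description above, 7200 seconds is the stock Linux default for `net.ipv4.tcp_keepalive_time`, and the 30-second value is an illustrative application-level setting, not a measured one. The function name is hypothetical.

```python
# Stock Linux default for net.ipv4.tcp_keepalive_time: seconds of idle
# time before the kernel sends the first keepalive probe.
LINUX_DEFAULT_KEEPALIVE_TIME_S = 7200


def probe_sent_during_idle(keepalive_time_s: int, idle_gap_s: int) -> bool:
    """Return True if a keepalive probe is sent before the connection
    has been idle for idle_gap_s seconds."""
    return keepalive_time_s < idle_gap_s


idle_gap_s = 5 * 60  # the test job idles 5 minutes between submissions

# With the Linux default, no probe is sent during the idle gap.
print(probe_sent_during_idle(LINUX_DEFAULT_KEEPALIVE_TIME_S, idle_gap_s))
# With an illustrative 30 s application-level keepalive, probes are sent.
print(probe_sent_during_idle(30, idle_gap_s))
```

Whether an idle connection is actually dropped depends on the platform's network path, which this sketch does not model.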

This is not to say that the fault lies with the Istio sidecar, because we ran the same job on AWS and GCP and it passed without timeouts or dropped connections. Regardless, the sidecar is required for multi-user, so it must be enabled.

Istio provides a quite powerful tool that lets you set the TCP keepalive settings for connections to any service in Kubernetes. You can read more about it here:

https://preliminary.istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings

So the solution here is to set up a destination rule for the Kubernetes API:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: kubernetes-api
spec:
  host: "kubernetes.default.svc.cluster.local"
  trafficPolicy:
    connectionPool:
      tcp:
        connectTimeout: 10s
        tcpKeepalive:
          time: 75s
          interval: 75s
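
If the keepalive values need tuning per platform, the same rule can be generated rather than hand-edited. The following is a minimal Python sketch under that assumption. The function name and its parameters are hypothetical, and the field values are copied from the rule above. It prints JSON, which `kubectl apply -f -` accepts in place of YAML.

```python
import json


def kubernetes_api_destination_rule(keepalive_time="75s", keepalive_interval="75s"):
    """Build the DestinationRule shown above as a plain dict
    (hypothetical helper; only the two keepalive values are tunable)."""
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "DestinationRule",
        "metadata": {"name": "kubernetes-api"},
        "spec": {
            "host": "kubernetes.default.svc.cluster.local",
            "trafficPolicy": {
                "connectionPool": {
                    "tcp": {
                        "connectTimeout": "10s",
                        "tcpKeepalive": {
                            "time": keepalive_time,
                            "interval": keepalive_interval,
                        },
                    }
                }
            },
        },
    }


manifest = kubernetes_api_destination_rule()
# Emit JSON so the output can be piped to `kubectl apply -f -`.
print(json.dumps(manifest, indent=2))
```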

I would like to see this rule at the Kubeflow installation level to ensure Kubeflow works on any cloud platform. It will solve timeouts not only with Kubeflow Pipelines but also with any other service that uses the Kubernetes API. Anyway, I'm leaving this here in case anyone else stumbles upon the same issue.

@Ark-kun @Bobgy @rmgogogo @dtzar

maganaluis on 3 Sep 2020

👍2

All 13 comments

For contrast, we had to modify the Jupyter Web API on Kubeflow to avoid timeouts by adding the code below before instantiating the Kubernetes API client. This also solved issues with other applications relying on that API, such as Airflow and JupyterHub. We are wondering whether this will also be an issue with KFP.

import socket
from urllib3 import connection
# workaround for azure load balancer issue
connection.HTTPConnection.default_socket_options += [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
                                            (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),
                                            (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60),
                                            (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)]
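
A variant of the snippet above, as a sketch only: `socket.TCP_KEEPIDLE`, `socket.TCP_KEEPINTVL`, and `socket.TCP_KEEPCNT` are not defined on every platform, so this version adds only the options the local `socket` module exposes and then applies them to a throwaway socket to confirm they are accepted. The helper name `keepalive_socket_options` is hypothetical, and the default values mirror the workaround above.

```python
import socket


def keepalive_socket_options(idle_s=60, interval_s=60, count=3):
    """Build a socket-option list like the workaround above, skipping
    TCP keepalive constants that this platform does not define."""
    opts = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", count),
    ):
        const = getattr(socket, name, None)
        if const is not None:
            opts.append((socket.IPPROTO_TCP, const, value))
    return opts


# Apply the options to a local, unconnected socket to check they are accepted.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    for level, optname, value in keepalive_socket_options():
        s.setsockopt(level, optname, value)
    # A non-zero value means keepalive is enabled on this socket.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
```

The resulting list could be appended to `connection.HTTPConnection.default_socket_options` in the same way as in the workaround above.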

maganaluis on 24 Aug 2020

👍1

@maganaluis What version / configuration of AKS are you using?

Do you experience the same problem on KFP 1.1?

dtzar on 25 Aug 2020

@dtzar We are using AKS with Kubernetes version 1.16.10

I don't think there is a KFP 1.1 (Kubeflow Pipelines). We are using their latest image tag, but with Kubeflow version 1.1.

I just reviewed the latest logs and, as I mentioned, it could be related to timeouts on the Argo API rather than KFP.

I0824 20:35:44.840283       6 error.go:218] Post https://10.0.0.1:443/apis/argoproj.io/v1alpha1/namespaces/e23e799e-de9b-4388-99e7-8efe8ab6c072/workflows: unexpected EOF

I'm attaching the logs for reference.

ml-pipeline.log

maganaluis on 25 Aug 2020

I'm curious whether you'd have the same problem if you tried our sample repo install process (for KFP and/or AKS, which uses K8s 1.18.x and some other recent AKS features). I do know that 1.1 RC manifests use a very old argo workflow controller of version 2.3.0 which has a bunch of problems.

dtzar on 25 Aug 2020

I do know that 1.1 RC manifests use a very old argo workflow controller of version 2.3.0 which has a bunch of problems.

/cc @Bobgy @rmgogogo

Ark-kun on 25 Aug 2020

@dtzar I can definitely try that. I will start by running a k8s API test on version 1.18.x; I'm hoping that version allows the job below to pass. Modifying the Argo installation will take some time, but we can try that as well.

https://github.com/maganaluis/k8s-api-job/blob/master/job.py

Update 08/25/2020:

That job still times out on 1.18.X, we will update Argo on our Dev cluster.

maganaluis on 25 Aug 2020

@dtzar I upgraded Argo to 2.10.0. Interestingly, the only thing I had to change on the cluster-wide installation was the workflow-controller-configmap, and Kubeflow Pipelines worked as expected. However, the timeouts are still there and they are getting more consistent; maybe I'm going crazy here. Could you share the configuration you're using? Maybe VM types? Istio version?

maganaluis on 25 Aug 2020

https://github.com/kaizentm/manifests/blob/eedorenko/kfdef-azure/kfdef/kfctl_azure.v1.1.0.yaml

dtzar on 25 Aug 2020


@maganaluis thanks for the investigation!

I'm okay with adding the destination rule. It doesn't seem harmful to other platforms. Would you mind opening a PR for it?

Bobgy on 3 Sep 2020

@Bobgy Sounds good, I'll make the PR. We can always tune the TCP keepalive settings to ensure it works across all platforms.

maganaluis on 3 Sep 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.