Pipelines: [Limitation] SDK Compiler creates bigger YAML, quickly exceeding Kubeflow limits

Created on 8 Jul 2020 · 18 Comments · Source: kubeflow/pipelines

What steps did you take:

This problem appears only in 1.0.0-rc.3; it does not exist in 0.5.1.
Given the same pipeline, the 1.0.0-rc.3 Compiler creates a YAML file far bigger than the one created with the 0.5.1 Compiler.
This may be a problem for huge workflows that previously ran successfully and can no longer run when compiled with the new Compiler.

What happened:

The pipeline compiled with 1.0.0-rc.3 fails to run with the following error: Workflow is longer than maximum allowed size. Size=1048602

What did you expect to happen:

The pipeline should run successfully.

Environment:

Kubeflow 1.0
Build commit: 743746b
python 3.7 + kfp 1.0.0-rc.3

How did you deploy Kubeflow Pipelines (KFP)?

Anything else you would like to add: Reproducing the limit

For this same pipeline:

from kfp.compiler import Compiler  # needed for Compiler().compile(...) below
from kfp.components import func_to_container_op
from kfp.dsl import pipeline

def some_func(i: int) -> str:
    msg = f"""{i}
    This is a huge function, with a lot of code.
    Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
    Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
    when an unknown printer took a galley of type and scrambled it to make a type specimen book. 
    It has survived not only five centuries, but also the leap into electronic typesetting, 
    remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets 
    containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker 
    including versions of Lorem Ipsum.
    """
    return msg

@pipeline(name="huge pipeline")
def test_pipeline():

    component = func_to_container_op(func=some_func, base_image="library/python:3.6")

    previous = None
    for i in range(239):
        op = component(i=i)
        if previous:
            op.after(previous)
        previous = op

Compiler().compile(test_pipeline, package_path="toto.yaml")

If compiled with kfp 0.5.1, it produces a 625 KB file; if compiled with kfp 1.0.0-rc.3, it produces a 1129 KB file and thus fails to run in the cluster.

With kfp 0.5.1 we can grow this example pipeline up to 438 components; at 439 it fails to run (workflow size limit exceeded). With 1.0.0-rc.3 this limit drops to 239 components because of the additional YAML size.
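The reported error (Size=1048602) is just over 1,048,576 bytes, which suggests a 1 MiB cutoff. Below is a minimal sketch for checking a workflow against that figure before submitting it. It assumes the limit is 1 MiB and that the size is measured on a JSON serialization of the workflow; neither is confirmed in this thread. The names ASSUMED_LIMIT_BYTES, workflow_size_bytes and fits_assumed_limit are hypothetical helpers, not part of kfp.

```python
import json

# Assumption: the limit is 1 MiB; the error above reports Size=1048602,
# which is just over this value.
ASSUMED_LIMIT_BYTES = 1024 * 1024


def workflow_size_bytes(workflow: dict) -> int:
    """Return the size of the workflow when serialized as compact JSON."""
    return len(json.dumps(workflow, separators=(",", ":")).encode("utf-8"))


def fits_assumed_limit(workflow: dict, limit: int = ASSUMED_LIMIT_BYTES) -> bool:
    """Return True if the serialized workflow is within the assumed limit."""
    return workflow_size_bytes(workflow) <= limit


# Tiny stand-in for a compiled workflow; a real one would be loaded from toto.yaml.
example = {
    "spec": {
        "templates": [
            {"name": "some-func", "container": {"image": "library/python:3.6"}}
        ]
    }
}
print(workflow_size_bytes(example), fits_assumed_limit(example))
```

The size of the YAML file on disk and the size the backend measures may differ, so this is only a rough pre-flight check.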

I am not sure whether it's a bug, but it's a huge limitation for complex training and data preparation workflows.

/area sdk

area/perf area/sdk lifecycle/frozen priority/p0 status/triaged

radcheb

👍9

Most helpful comment

We are getting this error too. We are using kfp 0.5.1. The yaml created is 1.3 MB in size, and we have 400+ components in the pipeline. We are working on a benchmarking tool that has multiple datasets and sub-pipelines in the graph. Is there a plan to allow larger pipelines than currently allowed? This seems like a common use case.

nikhil-dce on 20 Jul 2020

👍3

All 18 comments

/assign @Ark-kun
/priority p0

Bobgy on 8 Jul 2020

Ideally, the pipeline compiled from that code should be very small, since it only needs a single template. But currently each task gets its own template, which increases the workflow size.
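The effect of one template per task versus a single shared template can be illustrated with a rough sketch. The dictionaries below are a simplified, Argo-like stand-in and not the exact output of the kfp compiler; FAKE_PROGRAM and the function names are hypothetical.

```python
import json

# Stand-in for the generated program text that each container template embeds.
FAKE_PROGRAM = "print('lorem ipsum')\n" * 50


def per_task_workflow(n: int) -> dict:
    """One container template per task, each carrying its own copy of the program."""
    templates = [
        {
            "name": f"some-func-{i}",
            "container": {
                "image": "library/python:3.6",
                "command": ["python3", "-c", FAKE_PROGRAM],
                "args": [str(i)],
            },
        }
        for i in range(n)
    ]
    tasks = [{"name": f"task-{i}", "template": f"some-func-{i}"} for i in range(n)]
    dag = {"name": "main", "dag": {"tasks": tasks}}
    return {"spec": {"entrypoint": "main", "templates": templates + [dag]}}


def shared_template_workflow(n: int) -> dict:
    """A single parameterized template reused by every task."""
    template = {
        "name": "some-func",
        "inputs": {"parameters": [{"name": "i"}]},
        "container": {
            "image": "library/python:3.6",
            "command": ["python3", "-c", FAKE_PROGRAM],
            "args": ["{{inputs.parameters.i}}"],
        },
    }
    tasks = [
        {
            "name": f"task-{i}",
            "template": "some-func",
            "arguments": {"parameters": [{"name": "i", "value": str(i)}]},
        }
        for i in range(n)
    ]
    dag = {"name": "main", "dag": {"tasks": tasks}}
    return {"spec": {"entrypoint": "main", "templates": [template, dag]}}


def size(workflow: dict) -> int:
    """Size in bytes of the JSON-serialized workflow."""
    return len(json.dumps(workflow).encode("utf-8"))


print(size(per_task_workflow(239)), size(shared_template_workflow(239)))
```

In the per-task layout the program text is repeated once per task, so the size grows with the program length times the task count; in the shared layout only the short task entries repeat.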

How limiting is this for your scenarios?
How do you create that many tasks? Do you use a loop like in the description?
It's possible to use the more powerful dsl.ParallelFor loop or a recursive loop using @graph_component. Those do not result in size explosion. See:
https://github.com/kubeflow/pipelines/blob/master/samples/core/loop_parallelism/loop_parallelism.py
https://github.com/kubeflow/pipelines/blob/2268ddd/components/XGBoost/_samples/recursive_training.py

Ark-kun on 8 Jul 2020

Thanks @Ark-kun for your quick reply.

Our real pipeline is much richer, with a mix of data preparation, training and model evaluation. It uses multiple sources of data, so we run the same component (defined once) for each source of data. The whole pipeline can get up to more than 100 tasks.

In the example I gave, I used the for loop only to reproduce the issue. In our real case, the pipeline can no longer run on the cluster, failing with the same error, if compiled with 1.0.0-rc.3.

radcheb on 8 Jul 2020

The whole pipeline can get up to more than 100 tasks.

That's great to hear. The real value of the Pipelines starts to manifest when the pipelines are bigger.

we run the same component (defined once) for each source of data

Roughly, how many instances do you have of same components?

In our real case, the pipeline can no longer run on the cluster, failing with the same error, if compiled with 1.0.0-rc.3.

Sorry to hear about the problem.
There is a workaround to squeeze the size a bit (although you'll be losing some features like artifact types in the Metadata UX)

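import yaml

# Load the compiled workflow.
with open('toto.yaml') as f:
    workflow = yaml.safe_load(f)
# Drop the component spec annotation from every template (no error if it is absent).
for template in workflow['spec']['templates']:
    template.setdefault('metadata', {}).setdefault('annotations', {}).pop('pipelines.kubeflow.org/component_spec', None)
# Write the smaller workflow to a new file (note the 'w' mode).
with open('toto_fixed.yaml', 'w') as f:
    yaml.dump(workflow, f)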

Ark-kun on 8 Jul 2020

Everyone else independently affected by this issue, please speak up. I'd like to know about your situation.

Ark-kun on 8 Jul 2020

Removing from 1.0 project because this is intended behavior.
If there's a need to change, we can fix it in later releases.

Bobgy on 15 Jul 2020

nikhil-dce on 20 Jul 2020

👍3

we have 400+ components in the pipeline

Are they different components or component instances?

Is there a plan to allow larger pipelines than currently allowed? This seems like a common use case.

Unfortunately, this is a limitation of Kubernetes itself (and partially Argo).

There is a limit on the size of any Kubernetes object. It was 512KiB some time ago, then 1MiB, and now 1.5MiB.

It might be possible to increase the limit though: https://github.com/etcd-io/etcd/blob/master/Documentation/dev-guide/limit.md#request-size-limit

Ark-kun on 24 Jul 2020

@radcheb Are you currently blocked by this issue?

Ark-kun on 24 Jul 2020

@Ark-kun We are still using kfp 0.5.1 for production pipelines. However, this issue is blocking us from migrating to 1.0.0.
We haven't tried your solution yet, since most of the time we call create_run_from_pipeline_func directly. I will get back to you soon on this.

radcheb on 24 Jul 2020

@Ark-kun we implemented and tested your workaround for squeezing the pipeline size, and it has been working with no problems. We have therefore upgraded to 1.0.0, and it has been validated in pre-production. Thanks again for the solution 👏

@nikhil-dce you could use this workaround to reduce yaml size after compilation:

import yaml

with open("big_pipeline.yaml") as f:
    workflow = yaml.safe_load(f)
for template in workflow['spec']['templates']:
    annotations = template.setdefault('metadata', {}).setdefault('annotations', {})
    if 'pipelines.kubeflow.org/component_spec' in annotations:
        del annotations['pipelines.kubeflow.org/component_spec']
with open("smaller_pipeline.yaml", "w") as f:
    yaml.dump(workflow, f)

radcheb on 15 Aug 2020

@Ark-kun shall we close this issue?

radcheb on 15 Aug 2020

The hack workaround is appreciated. I think there should be an option to cull the metadata->annotations->component_spec in the kfp compiler.
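To judge how much such an option would save for a given pipeline, one could total the bytes held in that annotation. This is a minimal sketch operating on an already-loaded workflow dictionary; component_spec_bytes is a hypothetical helper, not part of kfp, and the sample workflow is a made-up stand-in.

```python
import json

COMPONENT_SPEC_KEY = "pipelines.kubeflow.org/component_spec"


def component_spec_bytes(workflow: dict) -> int:
    """Sum the UTF-8 size of every component_spec annotation value."""
    total = 0
    for template in workflow.get("spec", {}).get("templates", []):
        annotations = template.get("metadata", {}).get("annotations", {})
        total += len(annotations.get(COMPONENT_SPEC_KEY, "").encode("utf-8"))
    return total


# Made-up stand-in for a compiled workflow with one annotated template.
workflow = {
    "spec": {
        "templates": [
            {
                "name": "a",
                "metadata": {
                    "annotations": {
                        COMPONENT_SPEC_KEY: json.dumps({"name": "Some func"})
                    }
                },
            },
            {"name": "main", "dag": {"tasks": []}},
        ]
    }
}
print(component_spec_bytes(workflow))
```

Comparing this total with the overall workflow size gives a rough idea of whether removing the annotation is enough to get under the limit.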

brt-timseries on 5 Oct 2020

We're hitting the problem in kubeflow 1.0. The amount of data coming into the system is variable and so the DAG grows with it. We're on GKE and according to this answer we're stuck:

... a suggestion to try the --max-request-bytes flag but it would have no effect on a GKE cluster because you don't have such permission on master node.

Is it possible for Kubeflow to break up the resources into smaller chunks? Otherwise this is quite limiting.

whillas on 8 Oct 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] on 10 Jan 2021

/lifecycle frozen

Bobgy on 4 Feb 2021

Encountering this too. Following and trying out the suggestions above.

The situation is similar to several above: we use multiple instances of similar components. A full run requires 1000s of components, built dynamically from ~10 core component templates.

We have been setting limits/requests to deal with OOM issues so far, but are still encountering the "Workflow is longer than maximum allowed size" error.

susan-shu-c on 26 Feb 2021

Might be a silly question, but where are you putting this workaround?
@Ark-kun @radcheb
I was trying it in the pipeline script after Compiler().compile(test_pipeline, package_path="toto.yaml") since it makes sense there... the file is created after this line
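One plausible placement, consistent with the workaround posted earlier in the thread, is right after the compile call: load the generated file, strip the annotation, write a second file, and submit that second file. The sketch below keeps the kfp and PyYAML steps as comments so that it runs on its own; strip_component_specs is a hypothetical helper name, and the file names are the ones used in the thread.

```python
COMPONENT_SPEC_KEY = "pipelines.kubeflow.org/component_spec"


def strip_component_specs(workflow: dict) -> dict:
    """Remove the component_spec annotation from every template, in place."""
    for template in workflow.get("spec", {}).get("templates", []):
        annotations = template.get("metadata", {}).get("annotations", {})
        annotations.pop(COMPONENT_SPEC_KEY, None)
    return workflow


# Intended placement (needs kfp and PyYAML, so left as comments here):
#   Compiler().compile(test_pipeline, package_path="toto.yaml")
#   with open("toto.yaml") as f:
#       workflow = yaml.safe_load(f)
#   strip_component_specs(workflow)
#   with open("toto_fixed.yaml", "w") as f:
#       yaml.dump(workflow, f)
#   # then upload or run toto_fixed.yaml instead of toto.yaml

# Small stand-in workflow to show the effect.
demo = {
    "spec": {
        "templates": [
            {
                "name": "some-func",
                "metadata": {
                    "annotations": {COMPONENT_SPEC_KEY: "{}", "other": "kept"}
                },
            },
            {"name": "main"},
        ]
    }
}
strip_component_specs(demo)
print(demo["spec"]["templates"][0]["metadata"]["annotations"])
```

Other annotations are left untouched, and templates without the annotation are skipped without error.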

susan-shu-c on 26 Feb 2021
