Pipelines: Is there a way to disable cache on a specific pipeline (created through component yml)

Created on 2 Dec 2020 · 13 Comments · Source: kubeflow/pipelines

What did you expect to happen:

Is there a way to disable cache on a specific pipeline (created through component yml) using Kubeflow Pipelines on GCP?

I have a pipeline that must run once a week. Because of the caching behavior, some nodes are not executed again: the inputs/parameters are always the same, although internally the step runs a SELECT in BigQuery (which fetches the updated data to preprocess). If I could disable this behavior, it would work as expected.

PS: I've tried these steps: https://www.kubeflow.org/docs/pipelines/caching/ but they didn't work with GCP Pipelines.

Any ideas?

Environment:

Google Cloud Platform

How did you deploy Kubeflow Pipelines (KFP)?
Through Google Cloud Platform (AI Platform -> Pipelines)

/kind question

cabjr

Most helpful comment

@cabjr Hello.
Let me help you with this issue.
The supported way to disable caching for a certain step is described in the documentation:

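def some_pipeline():
    # task_never_use_cache is the target step in the pipeline
    task_never_use_cache = some_op()
    # "P0D" (zero days) means a cached result is never considered fresh enough to reuse
    task_never_use_cache.execution_options.caching_strategy.max_cache_staleness = "P0D"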

Please try this and tell us whether this helps.

The format of produced workflow files is an implementation detail and is subject to change.
KFP supports pipelines produced by the KFP SDK. As you see by the errors ("Invalid input error: Please provide a valid pipeline spec"), manually editing the compiled workflow files can lead to incorrect Kubernetes object format.
If you're interested, you can observe the changes in the compiled pipeline YAML file when you set .execution_options.caching_strategy.max_cache_staleness = "P0D". It adds the annotation "pipelines.kubeflow.org/max_cache_staleness": P0D to the metadata annotations section of the corresponding workflow template.

  - name: some-name
    metadata:
      annotations:
        "pipelines.kubeflow.org/max_cache_staleness": P0D
    container: ...

Ark-kun on 8 Dec 2020
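
For workflow YAML that is generated by another tool, the annotation described above would need to land on each template's metadata rather than on the workflow-level annotations. The following is a minimal sketch of that placement, operating on an already-parsed workflow held as a plain Python dict. The helper name `add_no_cache_annotation` and the sample dict are hypothetical, and patching compiled workflows by script is still outside what KFP officially supports, as the comment above points out.

```python
import copy

STALENESS_KEY = "pipelines.kubeflow.org/max_cache_staleness"


def add_no_cache_annotation(workflow: dict) -> dict:
    """Return a copy of the workflow with max_cache_staleness=P0D on every container template."""
    patched = copy.deepcopy(workflow)
    for template in patched.get("spec", {}).get("templates", []):
        # Only executable steps have a container; DAG templates are left untouched.
        if "container" not in template:
            continue
        annotations = template.setdefault("metadata", {}).setdefault("annotations", {})
        annotations[STALENESS_KEY] = "P0D"
    return patched


# Hypothetical two-template workflow mirroring the structure shown in this thread.
workflow = {
    "spec": {
        "entrypoint": "kedro-pipeline",
        "templates": [
            {"name": "orders-query", "container": {"command": ["kedro"]}},
            {
                "name": "kedro-pipeline",
                "dag": {"tasks": [{"name": "orders-query", "template": "orders-query"}]},
            },
        ],
    }
}

patched = add_no_cache_annotation(workflow)
print(patched["spec"]["templates"][0]["metadata"])
```

The original dict is left unmodified, so the patched and unpatched versions can be compared before anything is uploaded.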

All 13 comments

Have you tried setting max_cache_staleness to 0 on a certain step? https://www.kubeflow.org/docs/pipelines/caching/#managing-caching-staleness

rui5i on 2 Dec 2020
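
To illustrate what a zero staleness window means, here is a rough sketch of a freshness check. It assumes the setting is a day-based duration string such as "P0D" or "P30D" and that a cached result is reused only while it is younger than that window. The functions `parse_day_duration` and `may_reuse` are hypothetical and are not part of KFP; the real cache server may evaluate staleness differently.

```python
import re
from datetime import datetime, timedelta


def parse_day_duration(value: str) -> timedelta:
    """Parse a day-only duration such as 'P0D' or 'P30D'."""
    match = re.fullmatch(r"P(\d+)D", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return timedelta(days=int(match.group(1)))


def may_reuse(cached_at: datetime, now: datetime, max_cache_staleness: str) -> bool:
    """Reuse a cached result only if it is strictly younger than the allowed staleness."""
    return now - cached_at < parse_day_duration(max_cache_staleness)


last_run = datetime(2020, 12, 1, 17, 0)
this_run = datetime(2020, 12, 8, 17, 0)  # one week later

print(may_reuse(last_run, this_run, "P30D"))  # week-old result is inside a 30-day window
print(may_reuse(last_run, this_run, "P0D"))   # a zero-day window never admits a cached result
```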

Have you tried setting max_cache_staleness to 0 on a certain step? https://www.kubeflow.org/docs/pipelines/caching/#managing-caching-staleness

How can I specify it in a YAML pipeline? I didn't find an example.

My yaml file:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: project-pipeline
  annotations: {pipelines.kubeflow.org/kfp_sdk_version: 1.1.1, pipelines.kubeflow.org/pipeline_compilation_time: '2020-12-01T17:23:37.893312',
    pipelines.kubeflow.org/pipeline_spec: '{"description": "Kubeflow pipeline for
      stock availability project", "name": "Project pipeline"}'}
  labels: {pipelines.kubeflow.org/kfp_sdk_version: 1.1.1}
spec:
  entrypoint: kedro-pipeline
  templates:
  - name: computing-loss-corrected-montly-mape
    container:
      args: [run, --node, computing_loss_corrected_montly_mape]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: computing-mape
    container:
      args: [run, --node, computing_mape]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: computing-montly-mape
    container:
      args: [run, --node, computing_montly_mape]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: creating-master
    container:
      args: [run, --node, creating_master]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: dcm-sku-query
    container:
      args: [run, --node, dcm_sku_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: estimating-loss-meta
    container:
      args: [run, --node, estimating_loss_meta]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: estimating-loss-total
    container:
      args: [run, --node, estimating_loss_total]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: estimating-ts
    container:
      args: [run, --node, estimating_ts]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: generating-daily-measurements
    container:
      args: [run, --node, generating_daily_measurements]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: kedro-pipeline
    dag:
      tasks:
      - name: computing-loss-corrected-montly-mape
        template: computing-loss-corrected-montly-mape
        dependencies: [posprocessing-measurements]
      - name: computing-mape
        template: computing-mape
        dependencies: [posprocessing-measurements]
      - name: computing-montly-mape
        template: computing-montly-mape
        dependencies: [posprocessing-measurements]
      - name: creating-master
        template: creating-master
        dependencies: [resampling-orders, resampling-stock]
      - {name: dcm-sku-query, template: dcm-sku-query}
      - name: estimating-loss-meta
        template: estimating-loss-meta
        dependencies: [dcm-sku-query, estimating-loss-total, preprocessing-orders,
          sku-filter-query]
      - name: estimating-loss-total
        template: estimating-loss-total
        dependencies: [posprocessing-measurements]
      - name: estimating-ts
        template: estimating-ts
        dependencies: [creating-master]
      - name: generating-daily-measurements
        template: generating-daily-measurements
        dependencies: [posprocessing-measurements]
      - {name: orders-query, template: orders-query}
      - name: posprocessing-measurements
        template: posprocessing-measurements
        dependencies: [estimating-ts, sku-target-query]
      - name: preprocessing-orders
        template: preprocessing-orders
        dependencies: [orders-query]
      - name: preprocessing-stock
        template: preprocessing-stock
        dependencies: [stock-query]
      - name: resampling-orders
        template: resampling-orders
        dependencies: [preprocessing-orders]
      - name: resampling-stock
        template: resampling-stock
        dependencies: [preprocessing-stock]
      - {name: sku-filter-query, template: sku-filter-query}
      - {name: sku-target-query, template: sku-target-query}
      - {name: stock-query, template: stock-query}
  - name: orders-query
    container:
      args: [run, --node, orders_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: posprocessing-measurements
    container:
      args: [run, --node, posprocessing_measurements]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: preprocessing-orders
    container:
      args: [run, --node, preprocessing_orders]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: preprocessing-stock
    container:
      args: [run, --node, preprocessing_stock]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: resampling-orders
    container:
      args: [run, --node, resampling_orders]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: resampling-stock
    container:
      args: [run, --node, resampling_stock]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: sku-filter-query
    container:
      args: [run, --node, sku_filter_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: sku-target-query
    container:
      args: [run, --node, sku_target_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: stock-query
    container:
      args: [run, --node, stock_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  arguments:
    parameters: []
  serviceAccountName: pipeline-runner

Thanks in advance,

cabjr on 2 Dec 2020

Can you try to add "pipelines.kubeflow.org/cache_enabled:false" to your pipeline yaml's labels and see if this works?

rui5i on 3 Dec 2020

Can you try to add "pipelines.kubeflow.org/cache_enabled:false" to your pipeline yaml's labels and see if this works?

I did as you suggested, and the pipeline was loaded as expected, but when I try to run it, it throws the following error:

Run creation failed
{"error":"Failed to create a new run.: Failed to fetch workflow spec.: Invalid input error: Please provide a valid pipeline spec","message":"Failed to create a new run.: Failed to fetch workflow spec.: Invalid input error: Please provide a valid pipeline spec","code":3,"details":[{"@type":"type.googleapis.com/api.Error","error_message":"Please provide a valid pipeline spec","error_details":"Failed to create a new run.: Failed to fetch workflow spec.: Invalid input error: Please provide a valid pipeline spec"}]}

cabjr on 3 Dec 2020

Can you also try to set "pipelines.kubeflow.org/max_cache_staleness: 'P0D'" on the yaml annotations and remove the labels?

rui5i on 3 Dec 2020

Can you also try to set "pipelines.kubeflow.org/max_cache_staleness: 'P0D'" on the yaml annotations and remove the labels?

The nodes keep using the cached executions, even after the modifications

cabjr on 3 Dec 2020

Hi Alexey, can you help take a look at this issue?

/assign @Ark-kun

rui5i on 4 Dec 2020

P.S. I've noticed that your pipeline does not use any data passing. I see no components, no inputs or outputs, and no argument passing. System-managed data passing is one of the most important features of KFP and is key to getting value from it. The caching system relies on the data-passing information to decide when to reuse an execution (cached values are reused when all input arguments are the same and the component is the same).
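
The reuse rule described here can be illustrated with a simplified key function. This is only a sketch of the idea, not KFP's actual cache implementation; `cache_key`, the `step` dict, and the `run_date` argument are hypothetical names chosen for the example.

```python
import hashlib
import json


def cache_key(component_spec: dict, arguments: dict) -> str:
    """Hash the component definition together with its input arguments."""
    payload = json.dumps(
        {"component": component_spec, "arguments": arguments}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical step resembling the ones in this thread: fixed command, no inputs.
step = {
    "image": "gcr.io/sandbox-ml-pipeline/stock_availability",
    "command": ["kedro"],
    "args": ["run", "--node", "orders_query"],
}

week_1 = cache_key(step, {})
week_2 = cache_key(step, {})
with_date = cache_key(step, {"run_date": "2020-12-08"})

print(week_1 == week_2)     # same component and arguments: same key
print(week_1 == with_date)  # a changed argument: different key
```

Under this model, a step that queries BigQuery internally looks identical from one run to the next, whereas an explicit input that varies between runs would produce a different key.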

Perhaps you can create KFP components with inputs and outputs for your pipeline steps and create a pipeline where they pass data explicitly. Then the caching will start working better for you without needing tweaks.

Please check the following tutorial: https://github.com/Ark-kun/kfp_samples/blob/ae1a5b6/2019-10%20Kubeflow%20summit/106%20-%20Creating%20components%20from%20command-line%20programs/106%20-%20Creating%20components%20from%20command-line%20programs.ipynb

Ark-kun on 8 Dec 2020

"pipelines.kubeflow.org/max_cache_staleness": P0D

Hi @Ark-kun, I've tried adding the "pipelines.kubeflow.org/max_cache_staleness": P0D specification inside annotations, but it doesn't seem to work.

I cannot set task_never_use_cache.execution_options.caching_strategy.max_cache_staleness = "P0D" because my pipelines are generated (as YAML) from a Kedro pipeline (an ML framework for experimentation). That is also why I cannot use the data inputs/outputs of Kubeflow Pipelines itself: internally, my image already uses a data catalog that points to GCS and BigQuery. That's why I'm trying to disable the caching behavior.

The main idea of my project is to let the DS team prototype with Kedro and then deploy to Kubeflow Pipelines with minimal (ideally no) modification. Since it is supposed to run as recurring runs (jobs), the caching behavior is a problem for us.

cabjr on 8 Dec 2020

@cabjr you might want to follow the instructions in https://www.kubeflow.org/docs/pipelines/caching/#disabling-caching-in-your-kubeflow-pipelines-deployment to disable caching for your whole KFP instance, so that no pipelines are cached.