Is there a way to disable cache on a specific pipeline (created through component yml) using Kubeflow Pipelines on GCP?
I have a pipeline that must run once a week. Because of the cache behavior, some nodes are not executed again: the inputs and parameters are always the same, but internally each step runs a SELECT in BigQuery (which fetches the updated data to preprocess). If I could disable this behavior, the pipeline would work as expected.
PS: I've tried these steps: https://www.kubeflow.org/docs/pipelines/caching/ but they didn't work with GCP Pipelines.
Any ideas?
Google Cloud Platform
How did you deploy Kubeflow Pipelines (KFP)?
Through Google Cloud Platform ( AI Platform -> Pipelines)
/kind question
Have you tried setting max_cache_staleness to 0 on a specific step? https://www.kubeflow.org/docs/pipelines/caching/#managing-caching-staleness
How can I specify it in a YAML pipeline? I didn't find an example.
My yaml file:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: project-pipeline
  annotations: {pipelines.kubeflow.org/kfp_sdk_version: 1.1.1, pipelines.kubeflow.org/pipeline_compilation_time: '2020-12-01T17:23:37.893312',
    pipelines.kubeflow.org/pipeline_spec: '{"description": "Kubeflow pipeline for
      stock availability project", "name": "Project pipeline"}'}
  labels: {pipelines.kubeflow.org/kfp_sdk_version: 1.1.1}
spec:
  entrypoint: kedro-pipeline
  templates:
  - name: computing-loss-corrected-montly-mape
    container:
      args: [run, --node, computing_loss_corrected_montly_mape]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: computing-mape
    container:
      args: [run, --node, computing_mape]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: computing-montly-mape
    container:
      args: [run, --node, computing_montly_mape]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: creating-master
    container:
      args: [run, --node, creating_master]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: dcm-sku-query
    container:
      args: [run, --node, dcm_sku_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: estimating-loss-meta
    container:
      args: [run, --node, estimating_loss_meta]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: estimating-loss-total
    container:
      args: [run, --node, estimating_loss_total]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: estimating-ts
    container:
      args: [run, --node, estimating_ts]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: generating-daily-measurements
    container:
      args: [run, --node, generating_daily_measurements]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: kedro-pipeline
    dag:
      tasks:
      - name: computing-loss-corrected-montly-mape
        template: computing-loss-corrected-montly-mape
        dependencies: [posprocessing-measurements]
      - name: computing-mape
        template: computing-mape
        dependencies: [posprocessing-measurements]
      - name: computing-montly-mape
        template: computing-montly-mape
        dependencies: [posprocessing-measurements]
      - name: creating-master
        template: creating-master
        dependencies: [resampling-orders, resampling-stock]
      - {name: dcm-sku-query, template: dcm-sku-query}
      - name: estimating-loss-meta
        template: estimating-loss-meta
        dependencies: [dcm-sku-query, estimating-loss-total, preprocessing-orders,
          sku-filter-query]
      - name: estimating-loss-total
        template: estimating-loss-total
        dependencies: [posprocessing-measurements]
      - name: estimating-ts
        template: estimating-ts
        dependencies: [creating-master]
      - name: generating-daily-measurements
        template: generating-daily-measurements
        dependencies: [posprocessing-measurements]
      - {name: orders-query, template: orders-query}
      - name: posprocessing-measurements
        template: posprocessing-measurements
        dependencies: [estimating-ts, sku-target-query]
      - name: preprocessing-orders
        template: preprocessing-orders
        dependencies: [orders-query]
      - name: preprocessing-stock
        template: preprocessing-stock
        dependencies: [stock-query]
      - name: resampling-orders
        template: resampling-orders
        dependencies: [preprocessing-orders]
      - name: resampling-stock
        template: resampling-stock
        dependencies: [preprocessing-stock]
      - {name: sku-filter-query, template: sku-filter-query}
      - {name: sku-target-query, template: sku-target-query}
      - {name: stock-query, template: stock-query}
  - name: orders-query
    container:
      args: [run, --node, orders_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: posprocessing-measurements
    container:
      args: [run, --node, posprocessing_measurements]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: preprocessing-orders
    container:
      args: [run, --node, preprocessing_orders]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: preprocessing-stock
    container:
      args: [run, --node, preprocessing_stock]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: resampling-orders
    container:
      args: [run, --node, resampling_orders]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: resampling-stock
    container:
      args: [run, --node, resampling_stock]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: sku-filter-query
    container:
      args: [run, --node, sku_filter_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: sku-target-query
    container:
      args: [run, --node, sku_target_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  - name: stock-query
    container:
      args: [run, --node, stock_query]
      command: [kedro]
      image: gcr.io/sandbox-ml-pipeline/stock_availability
      imagePullPolicy: 'Always'
  arguments:
    parameters: []
  serviceAccountName: pipeline-runner
Thanks in advance,
Can you try to add "pipelines.kubeflow.org/cache_enabled:false" to your pipeline yaml's labels and see if this works?
I did as you suggested, and the pipeline was loaded as expected, but when I try to run it, it throws the following error:
Run creation failed
{"error":"Failed to create a new run.: Failed to fetch workflow spec.: Invalid input error: Please provide a valid pipeline spec","message":"Failed to create a new run.: Failed to fetch workflow spec.: Invalid input error: Please provide a valid pipeline spec","code":3,"details":[{"@type":"type.googleapis.com/api.Error","error_message":"Please provide a valid pipeline spec","error_details":"Failed to create a new run.: Failed to fetch workflow spec.: Invalid input error: Please provide a valid pipeline spec"}]}
Can you also try setting "pipelines.kubeflow.org/max_cache_staleness: 'P0D'" in the YAML annotations and removing the labels?
The nodes keep using the cached executions, even after the modifications
Hi Alexey, can you help take a look at this issue?
/assign @Ark-kun
@cabjr Hello.
Let me help you with this issue.
The supported way to disable caching for a certain step is described in the documentation:
def some_pipeline():
    # task is a target step in a pipeline
    task_never_use_cache = some_op()
    task_never_use_cache.execution_options.caching_strategy.max_cache_staleness = "P0D"
Please try this and tell us whether this helps.
The format of produced workflow files is an implementation detail and is subject to change.
KFP supports pipelines produced by the KFP SDK. As you can see from the error ("Invalid input error: Please provide a valid pipeline spec"), manually editing the compiled workflow files can lead to an incorrect Kubernetes object format.
If you're interested, you can observe the changes in the compiled pipeline YAML file when you set .execution_options.caching_strategy.max_cache_staleness = "P0D". It results in adding "pipelines.kubeflow.org/max_cache_staleness": P0D to the metadata annotations section of the corresponding workflow template:
- name: some-name
  metadata:
    annotations:
      "pipelines.kubeflow.org/max_cache_staleness": P0D
  container: ...
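For pipelines whose YAML is generated by another tool, the per-template annotation shown above could in principle be added by a small script. The following is only a sketch under assumptions: the workflow is already loaded into a Python dict, `annotate_templates` is a hypothetical helper (not part of the KFP SDK), and this thread does not confirm that KFP honors the annotation on a workflow edited outside the SDK.

```python
import copy

# Annotation key quoted in this thread.
STALENESS_KEY = "pipelines.kubeflow.org/max_cache_staleness"


def annotate_templates(workflow: dict, staleness: str = "P0D") -> dict:
    """Return a copy of the workflow with the staleness annotation
    set on every container template (hypothetical helper)."""
    result = copy.deepcopy(workflow)
    for template in result.get("spec", {}).get("templates", []):
        if "container" not in template:
            continue  # DAG templates run no container, so leave them untouched
        metadata = template.setdefault("metadata", {})
        metadata.setdefault("annotations", {})[STALENESS_KEY] = staleness
    return result


# Tiny stand-in for a compiled workflow; the real file has many more templates.
workflow = {
    "spec": {
        "templates": [
            {"name": "stock-query", "container": {"command": ["kedro"]}},
            {"name": "kedro-pipeline", "dag": {"tasks": []}},
        ]
    }
}
patched = annotate_templates(workflow)
```

The helper copies the input rather than mutating it, so the original generated workflow stays available for comparison.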
P.S. I've noticed that your pipeline does not use any data passing. I see no components, no inputs or outputs, and no argument passing. System-managed data passing is one of the most important features of KFP and is key to getting value from it. The caching system relies on the data passing information to decide when to reuse an execution (cached values are reused when all input arguments are the same and the component is the same).
Perhaps you can create KFP components with inputs and outputs for your pipeline steps and create a pipeline where they pass data explicitly. Then the caching will start working better for you without needing tweaks.
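To illustrate the rule described above (an execution is reused when the component and all input arguments are the same), here is a simplified sketch. It is not KFP's actual caching code: `cache_key` and the `run_date` argument are hypothetical, and the hashing scheme is an assumption chosen only to show the idea.

```python
import hashlib
import json


def cache_key(component_spec: dict, arguments: dict) -> str:
    """Hash the component definition together with its input arguments.
    Illustrative only; KFP's real key computation may differ."""
    payload = json.dumps(
        {"component": component_spec, "arguments": arguments},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# A step like the ones in the workflow above: same image, same command, no inputs.
step = {
    "image": "gcr.io/sandbox-ml-pipeline/stock_availability",
    "command": ["kedro"],
    "args": ["run", "--node", "stock_query"],
}

week_1 = cache_key(step, {})
week_2 = cache_key(step, {})
# Passing something that changes per run (hypothetical argument) yields a new key.
with_date = cache_key(step, {"run_date": "2020-12-08"})
```

Here `week_1` and `week_2` are identical, which mirrors why a weekly run with unchanged parameters is served from the cache even though the BigQuery data behind it has changed, while `with_date` differs because an input changed.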
Please check the following tutorial: https://github.com/Ark-kun/kfp_samples/blob/ae1a5b6/2019-10%20Kubeflow%20summit/106%20-%20Creating%20components%20from%20command-line%20programs/106%20-%20Creating%20components%20from%20command-line%20programs.ipynb
Hi @Ark-kun, I've tried adding the "pipelines.kubeflow.org/max_cache_staleness": P0D specification inside annotations, but it doesn't seem to work.
I cannot set task_never_use_cache.execution_options.caching_strategy.max_cache_staleness = "P0D" because my pipelines are generated (as YAML) from a kedro pipeline (an ML framework for experimentation). For the same reason, I cannot use the data inputs/outputs from Kubeflow Pipelines itself: internally, my image already uses a data catalog that points to GCS and BigQuery. That's why I'm trying to disable the cache behavior.
The main idea of my project is to allow the DS team to prototype with kedro and then deploy to Kubeflow Pipelines with minimal (ideally no) modification. As the pipeline is supposed to run as recurring runs (jobs), the cache behavior is a problem for us.
@cabjr you might want to follow the instructions in https://www.kubeflow.org/docs/pipelines/caching/#disabling-caching-in-your-kubeflow-pipelines-deployment to disable caching for your whole KFP instance, so that no pipeline is cached.
But as a reminder, running arbitrary Argo workflows with KFP may not keep working in the future; KFP has its own SDK for building workflows.
I've done as Bobgy suggested and disabled the cache for the entire KFP instance. It might not be the ideal solution, but it works as expected.
Thanks for the help.